MbitAI

Three failure modes I've seen in enterprise lakehouses (and the cheap fixes)

Three patterns that derail lakehouses in production: the swamp lakehouse, the performance mirage, and the metadata ghost town.

I’ve reviewed enough enterprise lakehouse implementations to have opinions about where they tend to go wrong. Three patterns come up more than anything else. Each is fixable if you catch it early.

Failure mode 1: the swamp lakehouse

Symptom: everything goes into the lakehouse with no distinction between raw, cleaned, and curated data. Users can’t trust anything because they don’t know what state it’s in.

Root cause: no zoning strategy. Teams treat the lakehouse as a dumping ground rather than implementing the medallion architecture properly.

The fix is clear zone boundaries with automated validation:

  • Bronze: raw ingestion (immutable, exactly as received)
  • Silver: cleaned and validated (business rules applied, basic quality checks)
  • Gold: business-ready (dimensionally modeled, performance optimized)

Add zone enforcement at the pipeline level:

-- Example: prevent silver-to-gold promotion without quality checks
-- Assumes silver_quality_metrics holds one row per table with a failed_checks count.
CREATE OR REPLACE VIEW silver_to_gold_check AS
SELECT
  table_name,
  CASE
    WHEN failed_checks = 0 THEN 'ALLOW'
    ELSE 'BLOCK'
  END AS promotion_decision
FROM silver_quality_metrics;
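
On the orchestration side, the pipeline still has to act on that ALLOW/BLOCK decision. The following is a minimal Python sketch of such a gate, not a definitive implementation: `QualityMetrics`, `PromotionBlocked`, and `promote_to_gold` are hypothetical names, and the metrics row is assumed to carry a `failed_checks` count as in the SQL example.

```python
from dataclasses import dataclass


class PromotionBlocked(Exception):
    """Raised when a silver table may not be promoted to gold."""


@dataclass(frozen=True)
class QualityMetrics:
    # Hypothetical shape of one row from silver_quality_metrics.
    table_name: str
    failed_checks: int


def promotion_decision(metrics: QualityMetrics) -> str:
    """Return 'ALLOW' only when no quality check failed."""
    return "ALLOW" if metrics.failed_checks == 0 else "BLOCK"


def promote_to_gold(metrics: QualityMetrics) -> str:
    """Gate step: refuse promotion unless the decision is ALLOW."""
    if promotion_decision(metrics) != "ALLOW":
        raise PromotionBlocked(
            f"{metrics.table_name}: {metrics.failed_checks} failed check(s)"
        )
    # A real pipeline would launch the silver-to-gold job here.
    return f"gold.{metrics.table_name}"
```

Raising an exception rather than returning a flag means the orchestrator marks the task as failed, so a blocked promotion cannot be silently skipped.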

Failure mode 2: the performance mirage

Symptom: works fine in development with 10 GB datasets, crawls in production with 1 TB. Costs explode because patterns that look reasonable at small scale don’t hold up.

Root cause: development uses unrealistic data volumes. Teams optimise for developer convenience rather than production economics.

The fix: production-realistic testing from the start.

  1. Clone the production schema with a 1% sample of real data, drawn so that it preserves production skew and key distributions
  2. Automate performance regression testing in CI/CD
  3. Set cost monitoring alerts at 80% of budget
  4. Require performance justification for any new pipeline
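
Three of the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed implementation: hash-based sampling is one way to get a repeatable 1% sample (the original does not specify a sampling method), the 20% regression tolerance is an arbitrary placeholder, and all function names are hypothetical.

```python
import hashlib


def in_sample(key: str, percent: float = 1.0) -> bool:
    """Deterministically keep roughly `percent`% of keys via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the key to one of 10,000 buckets; 1% corresponds to 100 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def is_regression(baseline_seconds: float, current_seconds: float,
                  tolerance: float = 0.20) -> bool:
    """Flag a run that is more than `tolerance` slower than the baseline."""
    return current_seconds > baseline_seconds * (1 + tolerance)


def budget_alert(spend: float, budget: float, threshold: float = 0.80) -> bool:
    """Return True once spend reaches the alert threshold (80% by default)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return spend >= threshold * budget
```

Sampling on a stable key (a customer ID, say) keeps the same entities in every table of the clone, so joins still line up. A CI job could call `is_regression` with timings from the last green build and fail the build when it returns `True`.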

One thing worth remembering: a pipeline that’s 2x slower but 10x cheaper to run is often the better business choice.

Failure mode 3: the metadata ghost town

Symptom: data exists but nobody knows what it means, where it came from, or how to use it correctly. This leads to misinterpretation and wrong business decisions.

Root cause: metadata management treated as an afterthought. No investment in documentation, lineage, or a semantic layer.

Lightweight but effective fixes:

  • Automatically capture technical metadata (schema, size, update frequency)
  • Require business owners to add semantic descriptions during data onboarding
  • Use open-source tools like Amundsen or DataHub for discovery
  • Implement simple data contract validation at pipeline boundaries
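
The last item, data contract validation at a pipeline boundary, can be as small as a dictionary of expected columns. A minimal sketch, assuming a hypothetical `orders` contract; the column names, types, and the `contract_violations` helper are made up for illustration.

```python
from typing import Any

# Hypothetical contract: column name -> (expected type, nullable?)
ORDERS_CONTRACT: dict[str, tuple[type, bool]] = {
    "order_id": (str, False),
    "amount": (float, False),
    "coupon_code": (str, True),
}


def contract_violations(record: dict[str, Any],
                        contract: dict[str, tuple[type, bool]]) -> list[str]:
    """Return human-readable violations; an empty list means the record passes."""
    problems = []
    for column, (expected_type, nullable) in contract.items():
        if column not in record:
            problems.append(f"missing column: {column}")
            continue
        value = record[column]
        if value is None:
            if not nullable:
                problems.append(f"null not allowed: {column}")
        elif not isinstance(value, expected_type):
            problems.append(
                f"wrong type for {column}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

Returning a list of messages instead of a boolean lets the pipeline log exactly which part of the contract was broken before it rejects or quarantines the batch.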

The most effective technique I’ve seen: a “data passport” that travels with each dataset, updated automatically by pipelines and manually enriched by data owners.
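
As a rough sketch of what such a passport might look like in Python: the field names and the `stamp`/`enrich` split are my own assumptions about how to separate pipeline-written from owner-written metadata, not a standard format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DataPassport:
    dataset: str
    # Technical metadata, refreshed by pipelines on every run.
    schema: dict = field(default_factory=dict)
    row_count: int = 0
    last_updated: Optional[datetime] = None
    lineage: list = field(default_factory=list)
    # Business metadata, filled in by the data owner.
    owner: str = ""
    description: str = ""

    def stamp(self, pipeline: str, schema: dict, row_count: int) -> None:
        """Called by a pipeline after it writes the dataset."""
        self.schema = dict(schema)
        self.row_count = row_count
        self.last_updated = datetime.now(timezone.utc)
        self.lineage.append(pipeline)

    def enrich(self, owner: str, description: str) -> None:
        """Called when a data owner adds the business context."""
        self.owner = owner
        self.description = description

    def is_documented(self) -> bool:
        """True once both an owner and a description are present."""
        return bool(self.owner.strip() and self.description.strip())
```

A check like `is_documented()` gives you a cheap way to count undocumented core datasets across the platform.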

Spotting them early

The implementation work is mostly straightforward once you know what you’re looking for. What takes time to develop is pattern recognition: catching these before they cause damage rather than remediating after the fact. When I start a new engagement, I can usually identify the problem within the first week from three tells:

  1. Missing zone enforcement in pipeline orchestration
  2. Performance tests that only run on tiny datasets
  3. Zero business documentation on core datasets

Catching these early typically saves three to six months of remediation time. Recovering the executive trust lost after a failed data platform launch is much harder than fixing the architecture before anyone notices.

What I’d do differently next time: I would create a standardised lakehouse health check that runs in the first two weeks of any engagement, providing a clear, actionable scorecard before significant resources are committed.
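
A first cut of that scorecard could simply turn the three tells into pass/fail checks. The sketch below is hypothetical throughout: the `PlatformFacts` inputs, the 1% minimum test-to-production size ratio, and the 80% pass mark are placeholder assumptions to tune per engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformFacts:
    # Hypothetical inputs, gathered by hand or by scripts during the engagement.
    pipelines_total: int
    pipelines_with_zone_checks: int
    largest_perf_test_gb: float
    production_size_gb: float
    core_datasets_total: int
    core_datasets_documented: int


def _ratio(part: float, whole: float) -> float:
    """Safe division: an empty denominator scores zero."""
    return part / whole if whole > 0 else 0.0


def scorecard(facts: PlatformFacts, min_perf_ratio: float = 0.01,
              pass_mark: float = 0.8) -> dict:
    """Map each of the three tells to 'PASS' or 'FAIL'."""
    perf_ratio = _ratio(facts.largest_perf_test_gb, facts.production_size_gb)
    scores = {
        "zone_enforcement": _ratio(facts.pipelines_with_zone_checks,
                                   facts.pipelines_total),
        # Full marks once tests reach min_perf_ratio of production volume.
        "realistic_perf_tests": min(1.0, perf_ratio / min_perf_ratio),
        "business_documentation": _ratio(facts.core_datasets_documented,
                                         facts.core_datasets_total),
    }
    return {name: ("PASS" if score >= pass_mark else "FAIL")
            for name, score in scores.items()}
```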

Eddie Beloiu

Freelance Data Platform Engineer · Munich