Four failure modes I've seen in enterprise lakehouses (and the cheap fixes)

I’ve reviewed enough enterprise lakehouse implementations to have opinions about where they tend to go wrong. Four patterns come up more than anything else. None of them are exotic, and each one is cheap to fix if you catch it early and expensive to unwind if you don’t.

Failure mode 1: lakehouse

Symptom: everything lands in the lakehouse with no distinction between raw, cleaned, and curated data. Nobody trusts anything, because nobody can tell what state a given table is in.

Root cause: no zoning strategy. The lakehouse gets treated as a dumping ground instead of a layered system. The medallion pattern gets name-checked in the design doc and ignored in the pipelines.

The fix is clear zone boundaries with automated validation:

Bronze: raw ingestion, immutable, exactly as received
Silver: cleaned and validated, business rules applied, quality checks enforced
Gold: business-ready, modeled, performance-optimized

Zones are only real if the pipeline enforces them. The gate that matters is silver to gold: no gold table gets built from silver data that has not passed its checks. That is a query, not a wish.

-- Promote silver -> gold only when there are no failed checks.
-- The alias differs from the failed_checks column so the WHERE clause stays unambiguous.
SELECT count(*) AS failing_metric_rows
FROM silver_quality_metrics
WHERE table_name = 'orders' AND failed_checks > 0;

If that returns anything but zero, the promotion step stops. In dbt it is a test, in Airflow a task that raises, in Databricks a job that fails the run. The mechanism is whatever you already use; the rule is what counts.
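As a minimal sketch of that gate in code, here is a Python function that raises when any check has failed, which is the behaviour an orchestrator needs in order to stop the run. It uses an in-memory SQLite database only so the example is self-contained. The names `PromotionBlocked`, `assert_promotable` and `demo_connection`, the `check_name` column, and the sample rows are all assumptions for illustration; the `silver_quality_metrics` layout is inferred from the query above, not taken from a real system.

```python
import sqlite3


class PromotionBlocked(RuntimeError):
    """Raised when a silver table has failing quality checks."""


def count_failing_rows(conn: sqlite3.Connection, table_name: str) -> int:
    # Same rule as the SQL gate: count metric rows that report failures.
    row = conn.execute(
        "SELECT count(*) FROM silver_quality_metrics "
        "WHERE table_name = ? AND failed_checks > 0",
        (table_name,),
    ).fetchone()
    return row[0]


def assert_promotable(conn: sqlite3.Connection, table_name: str) -> None:
    failing = count_failing_rows(conn, table_name)
    if failing != 0:
        # Raising is what makes the orchestrator stop the promotion step.
        raise PromotionBlocked(
            f"{table_name}: {failing} quality metric row(s) report failures"
        )


def demo_connection() -> sqlite3.Connection:
    # Hypothetical sample data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE silver_quality_metrics ("
        "table_name TEXT, check_name TEXT, failed_checks INTEGER)"
    )
    conn.executemany(
        "INSERT INTO silver_quality_metrics VALUES (?, ?, ?)",
        [
            ("orders", "not_null_order_id", 0),
            ("orders", "valid_currency", 3),
            ("customers", "unique_customer_id", 0),
        ],
    )
    return conn
```

Called as the first step of the gold build, `assert_promotable(conn, "orders")` would raise with this sample data, while `assert_promotable(conn, "customers")` would return quietly and let the build continue.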

One more thing about silver. “Cleaned and validated” usually means rule-based cleaning, and rules only catch what you anticipated. The long tail of messy master data slips straight through: the same supplier spelled three ways, five encodings of a missing value, a country written as “Germany” in one row and “DE” in the next. So silver is never quite as clean as the label promises. Update: that gap is exactly what I built a local SLM data cleaner for. It absorbs the long tail the rules miss, with the rules kept as a floor.
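To make the limits of rule-based cleaning concrete, here is a small sketch of what such a rules floor might look like. The lookup tables and function names are hypothetical and deliberately tiny; the point is only that each entry is a variant somebody anticipated, and anything else passes through unchanged.

```python
from typing import Optional

# Hypothetical rule tables: every entry is a variant someone anticipated.
MISSING_TOKENS = {"", "n/a", "null", "none", "-"}
COUNTRY_CODES = {"germany": "DE", "deutschland": "DE", "de": "DE"}


def clean_missing(value: Optional[str]) -> Optional[str]:
    """Map the known encodings of a missing value to None."""
    if value is None or value.strip().lower() in MISSING_TOKENS:
        return None
    return value.strip()


def clean_country(value: Optional[str]) -> Optional[str]:
    """Map known country spellings to a code; pass unknown ones through."""
    cleaned = clean_missing(value)
    if cleaned is None:
        return None
    # Unanticipated spellings fall back to the trimmed input unchanged.
    return COUNTRY_CODES.get(cleaned.lower(), cleaned)
```

With these tables, “Germany” and “DE” both normalise to the same code, but a spelling nobody thought of, such as “Allemagne”, comes out exactly as it went in. That is the long tail in miniature.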

Failure mode 2: performance

Symptom: it flies in development against 10 GB and crawls in production against 1 TB. Costs explode, because patterns that look reasonable at small scale fall apart at large scale.

Root cause: development runs on unrealistic data volumes. The team optimises for the developer’s convenience instead of production economics, and nobody finds out until the bill arrives or the SLA breaks.

The fix is production-realistic testing from the start:

Clone the production schema and load a statistically valid sample, around 1 percent of real data
Run performance regression tests in CI, not by hand before a release
Alert on cost at 80 percent of budget, not at 100
Require a performance justification for every new pipeline, the same way you require a code review
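Two of the items above are small enough to sketch directly. The first function is one possible way to draw a repeatable sample of roughly 1 percent by hashing a key, so the same rows are selected on every run and, if the same key is used across tables, joins still line up. The second is the 80 percent budget alert. Both function names are hypothetical, and whether a hash-based sample is statistically valid for a given workload (heavily skewed keys, for instance) is an assumption that would need checking against the real data.

```python
import hashlib


def in_sample(key: str, percent: float = 1.0) -> bool:
    """Deterministically keep roughly `percent` percent of keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets, i.e. steps of 0.01 percent.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def budget_alert(spend: float, budget: float, threshold: float = 0.80) -> bool:
    """Return True once spend has reached the alert threshold."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return spend >= threshold * budget
```

Because the sample depends only on the key, a CI performance test that loads `in_sample(order_id)` rows sees the same data on every run, which keeps timing comparisons between runs meaningful.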

Worth remembering: a pipeline that is twice as slow but ten times cheaper to run is usually the better business decision. Speed is a cost, not a virtue.

Failure mode 3: metadata

Symptom: the data exists, but nobody knows what it means, where it came from, or how to use it correctly. People guess, and the guesses turn into wrong decisions that carry the authority of a dashboard.

Root cause: metadata treated as an afterthought. No lineage, no documentation, no semantic layer, because none of it ships a visible feature.

The lightweight fixes are real and cheap:

Capture technical metadata automatically: schema, size, update frequency, owner
Require a business description at onboarding; a dataset with no owner and no definition does not get promoted to gold
Use an open-source catalog like DataHub or OpenMetadata for discovery
Validate simple data contracts at pipeline boundaries so a schema change breaks loudly instead of silently
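A simple data contract can be as small as a mapping from column name to expected type, checked at the boundary before any rows are written. The sketch below assumes the pipeline can produce such a mapping for each incoming batch; `ContractViolation`, `validate_schema`, and the type names used as values are hypothetical and not tied to any particular catalog or contract tool.

```python
from typing import Mapping


class ContractViolation(ValueError):
    """Raised when a batch's schema does not match its contract."""


def validate_schema(
    contract: Mapping[str, str], actual: Mapping[str, str]
) -> None:
    """Compare column names and types; raise with every difference found."""
    missing = sorted(set(contract) - set(actual))
    unexpected = sorted(set(actual) - set(contract))
    changed = sorted(
        col for col in set(contract) & set(actual)
        if contract[col] != actual[col]
    )
    problems = []
    if missing:
        problems.append(f"missing columns: {missing}")
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    if changed:
        problems.append(f"type changes: {changed}")
    if problems:
        # Failing here is the "breaks loudly" part of the contract.
        raise ContractViolation("; ".join(problems))
```

If an upstream team renames `amount` to `total`, the check reports one missing and one unexpected column and the load stops, instead of a downstream report silently filling with nulls.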

But the catalog is the easy half. A catalog stores schemas; it does not reconcile contradictions. The knowledge that actually explains a dataset lives in people’s heads and old Slack threads, not in a metadata field: why revenue is computed this way, which source wins when two systems disagree, which questions should not be answered from this table at all. Update: that is the harder gap, and it is the one I built ECL to close, with cited, conflict-aware synthesis of the tribal knowledge a catalog was never designed to hold.

Failure mode 4: tables

Symptom: a pipeline that ran fine for months gets slower and more expensive for no reason anyone can point to. Queries scan more than they should. Storage creeps up even though the data barely grew.

Root cause: the table formats that make a lakehouse a lakehouse, Delta and Iceberg, give you ACID transactions and time travel, but they do not maintain themselves. Streaming and micro-batch writes leave thousands of tiny files behind. Updates and deletes leave dead files that time travel keeps alive. Somebody set up the ingestion and nobody set up the housekeeping.

The fix is the cheapest one in this whole post, because it is a scheduled job you write once:

Compact small files on a schedule (OPTIMIZE on Delta, rewrite_data_files on Iceberg)
Expire old snapshots and vacuum dead files, so time travel does not pin storage forever
Cluster or Z-order (ZORDER BY on Delta) the columns you actually filter on, so scans skip what they can
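A minimal sketch of that scheduled job, written as functions that only build the Spark SQL statements so the example runs without a cluster. The function names, table names, catalog name, and retention values are assumptions; the 168-hour figure matches Delta's default seven-day retention, but the right value depends on how far back your time travel needs to reach. Table identifiers are expected to come from trusted configuration, not user input.

```python
from typing import List, Sequence


def delta_maintenance(
    table: str, zorder_cols: Sequence[str] = (), retain_hours: int = 168
) -> List[str]:
    """Build compaction and vacuum statements for a Delta table."""
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        # Co-locate data on the columns that queries actually filter on.
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    return [optimize, f"VACUUM {table} RETAIN {retain_hours} HOURS"]


def iceberg_maintenance(catalog: str, table: str, older_than: str) -> List[str]:
    """Build compaction and snapshot-expiry calls for an Iceberg table."""
    return [
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{older_than}')",
    ]
```

In a scheduled Spark job, each returned string would be passed to `spark.sql(...)` in order, for every table on the maintenance list.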

A lakehouse table is not a set-and-forget object. It is closer to a garden. Leave it alone for a year and the problem is not that it stopped working; it is that it slowly got worse while everyone was looking somewhere else.

Spotting them early

The implementation is mostly straightforward once you know what you are looking for. What takes time to build is the pattern recognition: catching these problems before they cause damage instead of remediating afterwards. When I start an engagement I can usually spot the trouble in the first week from four tells:

No zone enforcement in the pipeline orchestration
Performance tests that only ever run on tiny datasets
Zero business documentation on the core datasets
Core tables that have never been compacted or vacuumed since the day they were created

Catching these early typically saves three to six months of remediation. And the executive trust you lose when a data platform launch visibly fails is much harder to win back than the architecture is to fix before anyone notices.

What I’d do differently next time: I would create a standardised lakehouse health check that runs in the first two weeks of any engagement, providing a clear, actionable scorecard before significant resources are committed.
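As a rough sketch of what such a scorecard could look like, the structure below records one yes/no finding per failure mode in this post and summarises them. This is an illustration, not an existing tool: the class, field, and function names are hypothetical, and a real health check would need evidence behind each answer rather than a single boolean.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class HealthCheck:
    """One finding per failure mode; True means the practice is in place."""
    zone_enforcement_in_orchestration: bool
    perf_tests_on_realistic_volumes: bool
    core_datasets_documented: bool
    core_tables_maintained: bool


def scorecard(check: HealthCheck) -> Dict[str, Any]:
    """Summarise the findings as a pass count and a list of gaps."""
    findings = asdict(check)
    return {
        "passed": sum(findings.values()),
        "total": len(findings),
        "gaps": [name for name, ok in findings.items() if not ok],
    }
```

The list of gaps is the actionable part: each entry maps back to one of the four fixes described earlier.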

If any of these failure modes look familiar in your own lakehouse, catching them early is far cheaper than a rescue later. Get in touch.