The Data Contract Problem: Why Your Lakehouse Keeps Breaking

Lakehouses do not break because of bad tooling. They break because nobody owns the schemas at the seam between producers and consumers.

Lakehouse architectures - Delta, Iceberg, Hudi, take your pick - have solved the storage and compute layers of analytical data infrastructure. They have not solved the harder problem: keeping the schemas at the seams between systems usable by people who did not write them. That problem has a name now: data contracts.

The pattern is consistent across the engagements we run. A platform team spends eighteen months migrating from a legacy warehouse to a lakehouse, hits the storage cost reductions it promised the CFO, and then watches its on-call burden increase rather than decrease. The MTTR on data incidents climbs. Analysts lose trust. Within two years the platform team is spending more time firefighting than building. The technology worked. The interface discipline did not.

What a data contract actually is

A data contract is the explicit, versioned agreement between a producer of data and its consumers about schema, semantics, freshness, and quality. It is not a wiki page. It is a machine-readable artifact, usually JSON Schema, Avro, or Protobuf with attached SLAs, that lives in source control alongside the code that produces the data, gets reviewed when it changes, and is enforced at the ingestion boundary.
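To make "machine-readable artifact" concrete, here is a minimal sketch of a contract expressed as Python dataclasses and serialized to JSON for source control. The table name, owner, field names, and SLA values are all hypothetical, and a real program would more likely use JSON Schema, Avro, or Protobuf as described above; the point is only to show that schema, ownership, version, and freshness SLA can live in one reviewable file.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ContractField:
    name: str
    type: str              # logical type, e.g. "string", "int", "timestamp"
    required: bool = True
    description: str = ""


@dataclass(frozen=True)
class DataContract:
    table: str
    version: str             # semantic version of the contract itself
    owner: str               # producing team accountable for the contract
    fields: tuple            # tuple of ContractField
    freshness_minutes: int   # SLA: maximum lag behind source commit
    freshness_target: float  # fraction of loads that must meet the SLA

    def to_json(self) -> str:
        # asdict recurses into the nested ContractField instances
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical contract for an illustrative orders table
orders_contract = DataContract(
    table="silver.orders",
    version="1.0.0",
    owner="checkout-team",
    fields=(
        ContractField("order_id", "string", description="Unique order identifier"),
        ContractField("amount_cents", "int", description="Order total in minor units"),
        ContractField("placed_at", "timestamp"),
        ContractField("coupon_code", "string", required=False),
    ),
    freshness_minutes=60,
    freshness_target=0.995,
)

print(orders_contract.to_json())
```

Because the output is deterministic JSON, a change to the contract shows up as an ordinary diff in code review.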

Most data platforms have implicit contracts that exist only in the heads of the data engineers who built them. When those engineers leave, the contracts evaporate. Every "data quality" outage you have ever experienced is the cost of that evaporation. Industry surveys put the cost in concrete terms: Gartner has estimated poor data quality costs organizations an average of $12.9 million per year, and our experience suggests the figure is conservative for any business above about $500M in revenue with material analytics dependencies.

A useful mental model: a data contract is to a lakehouse table what an OpenAPI spec is to an HTTP service. You would not ship a public REST API without a schema, a versioning policy, and an error contract. The reason data teams routinely ship the equivalent on the analytics side is historical accident, not engineering judgment.

Why this matters more in a lakehouse than in a warehouse

Traditional data warehouses had a forcing function: schemas were enforced at write time, and a producer that broke a schema got an immediate error. Lakehouses are schema-on-read in practice: raw files land without validation, and even the table formats that check schemas on write typically let a producer relax that check with a single option. That flexibility is the architecture's greatest strength, and the reason every lakehouse eventually accumulates a backlog of silently broken pipelines that consumers only notice when their dashboards go blank.

The flexibility also interacts badly with the way modern data lands in the lakehouse. CDC streams from operational systems, semi-structured event payloads from product telemetry, and third-party SaaS extracts all arrive with schemas that drift on someone else's release schedule. Delta Lake's schema evolution features - mergeSchema, overwriteSchema, column mapping - make it trivial to absorb that drift silently. Trivial absorption is exactly the problem. The breaking change does not surface; it propagates. Three weeks later a finance model is off by 4% and nobody can trace why.

The counter-take we will defend: schema-on-read was a mistake for analytical platforms past a certain scale. The original argument - that schema-on-write was too rigid for exploratory analytics - was correct in 2012 when the alternative was a six-month Teradata change request. It is no longer correct in 2024, when the alternative is a pull request reviewed in a day. Lakehouses that treat the bronze layer as schema-on-read and the silver layer as contractually schema-on-write outperform lakehouses that maintain flexibility throughout. We have not seen a counterexample at scale.

The four levels of contract maturity

We assess client data platforms against four levels:

Level 0: Tribal. Schemas live in tribal knowledge. Producers change fields without notice. Consumers find out when something breaks. Most organizations sit here.
Level 1: Documented. Schemas exist in a catalog (Atlan, Alation, DataHub) but are not enforced. Better than nothing for discovery; not enough to prevent breakage.
Level 2: Validated. Schemas are checked at ingestion. Bad data is rejected or quarantined. Producers learn about schema breakage immediately rather than days later.
Level 3: Contracted. Schemas, SLAs, and semantic expectations are versioned, reviewed at change time, and enforced. Consumers can build on the contract with confidence.

The jump from Level 1 to Level 2 is the highest-ROI move most platforms can make in a quarter. Level 3 is a multi-quarter cultural and tooling investment. In our practice, we see roughly 70% of mid-to-large enterprises at Level 0 or Level 1, perhaps 25% at Level 2 for at least some of their critical pipelines, and a small minority - mostly digital natives and a handful of regulated firms with strong data governance mandates - operating at Level 3 across the platform.
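The Level 2 behavior can be sketched in a few dozen lines. The example below is an illustration under stated assumptions, not a reference implementation: the schema dictionary, field names, and the accept/quarantine split are hypothetical, and the type check is hand-rolled so the sketch runs on the standard library alone. A production pipeline would normally delegate validation to a schema library or the ingestion engine.

```python
from datetime import datetime

# Hypothetical contract: column name -> (expected Python type, required?)
ORDERS_SCHEMA = {
    "order_id": (str, True),
    "amount_cents": (int, True),
    "placed_at": (datetime, True),
    "coupon_code": (str, False),
}


def find_violations(record, schema):
    """Return human-readable contract violations for a single record."""
    problems = []
    for name, (expected_type, required) in schema.items():
        value = record.get(name)
        if value is None:
            if required:
                problems.append(f"missing required field '{name}'")
            continue
        # bool is a subclass of int in Python, so reject it explicitly
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            problems.append(
                f"field '{name}' expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are flagged, not absorbed
    for name in sorted(set(record) - set(schema)):
        problems.append(f"unexpected field '{name}'")
    return problems


def ingest(records, schema):
    """Split a batch into accepted rows and quarantined rows with reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = find_violations(record, schema)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined


batch = [
    {"order_id": "o-1", "amount_cents": 1999, "placed_at": datetime(2024, 3, 4, 12, 0)},
    {"order_id": "o-2", "amount_cents": "1999", "placed_at": datetime(2024, 3, 4, 12, 5)},
    {"order_id": "o-3", "amount_cents": 500, "placed_at": datetime(2024, 3, 4, 12, 9), "discount": 0.1},
]
accepted, quarantined = ingest(batch, ORDERS_SCHEMA)
for item in quarantined:
    print(item["record"]["order_id"], item["problems"])
```

The design choice worth noting is that an unexpected field counts as a violation. That is the opposite of silent schema merging: the drift is surfaced to the producer the moment it arrives.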

A note on what Level 3 actually feels like: a producer attempting to drop a column or change a type opens a pull request that triggers a CI job comparing the proposed schema against the registered contract. If the change is breaking, the build fails until either the contract is bumped to a new major version with a deprecation window, or downstream consumers explicitly sign off. Consumers receive automated notice when contracts they depend on enter deprecation. This is not exotic. It is the same change-management discipline that any competent platform engineering team applies to internal APIs.
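The CI comparison described above can be approximated as follows. This is a sketch under simplifying assumptions: schemas are plain dictionaries of the hypothetical form {column: {"type": ..., "required": ...}}, only three kinds of change count as breaking (dropped column, changed type, required column made optional), and a major version bump is treated as the only way to allow them. Consumer sign-off and deprecation windows are left out.

```python
def breaking_changes(registered, proposed):
    """List changes in `proposed` that would break consumers of `registered`."""
    problems = []
    for name, old in registered.items():
        new = proposed.get(name)
        if new is None:
            problems.append(f"column '{name}' was dropped")
            continue
        if new["type"] != old["type"]:
            problems.append(
                f"column '{name}' changed type {old['type']} -> {new['type']}"
            )
        # Consumers that relied on non-null values may now see nulls
        if old.get("required", True) and not new.get("required", True):
            problems.append(f"column '{name}' became optional")
    return problems


def major(version):
    """Major component of a semantic version string such as '1.4.2'."""
    return int(version.split(".")[0])


def check_change(registered_version, proposed_version, registered, proposed):
    """Return (allowed, problems). Breaking changes need a major version bump."""
    problems = breaking_changes(registered, proposed)
    allowed = not problems or major(proposed_version) > major(registered_version)
    return allowed, problems


# Hypothetical registered contract and a proposed change that drops a column
registered = {
    "order_id": {"type": "string", "required": True},
    "amount_cents": {"type": "int", "required": True},
}
proposed = {
    "order_id": {"type": "string", "required": True},
}

allowed, problems = check_change("1.2.0", "1.3.0", registered, proposed)
print("allowed:", allowed)
for problem in problems:
    print(" -", problem)
```

A CI wrapper around check_change would exit non-zero when allowed is false, which is what fails the build. Purely additive changes, such as a new optional column, pass without a major bump.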

The tooling landscape, briefly

The data contract space has matured rapidly: Soda, Great Expectations, Monte Carlo, and Databricks Unity Catalog all have credible offerings, as does the open-source Open Data Contract Specification maintained under the Linux Foundation's Bitol project. dbt's recent model contracts feature handles a meaningful slice of the problem for teams already invested in dbt. Confluent's Schema Registry remains the reference implementation for streaming contracts, as it has been for nearly a decade.

Tooling is rarely the bottleneck. The bottleneck is organizational: who owns the contract when it spans three engineering teams and two business units? In a typical enterprise engagement, we find that the median critical table has between four and seven upstream producers contributing fields, and between fifteen and forty downstream consumers. The contract is not a two-party agreement; it is a multilateral one. No tool resolves that on your behalf.

The ownership question

The most successful data contract programs we have seen treat producers as the contract owners and the central data platform team as the enforcement layer, not the authoring layer. Inverting this - having a central team write contracts on behalf of producers - reproduces the same coupling problem the contracts were meant to solve. Producers who feel contracts are imposed on them will route around them; producers who own them have a reason to keep them current.

This maps directly onto the data mesh thesis Zhamak Dehghani originally articulated in 2019, though we are deliberately careful not to require organizations to buy the full mesh package to get the contract benefits. You do not need domain-oriented decentralized ownership of all data products to get value from contracts. You need producers to own the schemas they emit. That is a smaller and more tractable cultural shift.

The hard part, in practice, is incentives. Producer teams are typically measured on shipping product features, not on the stability of analytical interfaces downstream of them. Until a VP of Engineering writes "data contract stability" into a producer team's quarterly objectives, contracts will lose to feature work every time. We have seen this play out at every scale from 50-engineer startups to 5,000-engineer enterprises. The pattern does not change with size; only the political effort required to fix it does.

Regulatory pressure is changing the calculus

For regulated industries, the contract conversation is no longer purely an engineering optimization. The EU AI Act, in force since August 2024 with obligations for high-risk systems phasing in through 2027, requires documented data governance for training and evaluation datasets used in high-risk AI systems. The NIST AI Risk Management Framework similarly emphasizes data provenance and documented quality criteria. Both regimes assume something that looks much like a data contract: a machine-readable, versioned artifact that describes what the data is, where it came from, and what guarantees apply.

Financial services clients are seeing parallel pressure from BCBS 239 lineage and data-quality requirements, which most large banks are still not fully compliant with more than a decade after the principles were issued. A platform with Level 2 or Level 3 contracts produces the audit evidence largely as a byproduct. A platform without them generates it through quarterly archaeology projects.

What to put in place this quarter

If your data platform sits at Level 0 or 1, three actions deliver outsized returns. Pick your five highest-value tables - the ones that feed executive dashboards or revenue-impacting models. Define explicit JSON Schema or Avro contracts for them. Wire ingestion to fail loudly when a contract is violated, with clear ownership for each producer. Stop there for a quarter. The discipline of operating those five well will teach your organization more than rolling out contracts to fifty tables it cannot maintain.
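"Fail loudly, with clear ownership" can be as simple as an exception that names the owning team. The sketch below assumes a hypothetical in-memory registry mapping each table to its owner and required columns; in practice that registry would be loaded from the contract files in source control, and the exception would feed whatever alerting the platform already uses.

```python
class ContractViolation(Exception):
    """Raised when a batch breaks its table's contract; carries the owner."""

    def __init__(self, table, owner, problems):
        self.table = table
        self.owner = owner
        self.problems = problems
        super().__init__(
            f"{table} (owner: {owner}) violated its contract: " + "; ".join(problems)
        )


# Hypothetical registry: table -> owning team and required columns
CONTRACTS = {
    "silver.orders": {
        "owner": "checkout-team",
        "required": {"order_id", "amount_cents", "placed_at"},
    },
}


def enforce(table, batch):
    """Reject the whole batch if any row is missing a required column."""
    contract = CONTRACTS[table]
    problems = []
    for index, row in enumerate(batch):
        missing = sorted(contract["required"] - set(row))
        if missing:
            problems.append(f"row {index} missing {missing}")
    if problems:
        raise ContractViolation(table, contract["owner"], problems)
    return len(batch)


try:
    enforce("silver.orders", [{"order_id": "o-1", "placed_at": "2024-03-04T12:00:00"}])
except ContractViolation as err:
    # The owner attribute tells the on-call rotation whom to page
    print(err.owner, "->", err)
```

Rejecting the whole batch is deliberately strict. For a first set of five tables, a loud failure routed to a named owner tends to build the habit faster than a quiet partial load.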

A few practical notes from running this play repeatedly:

Choose tables that already hurt. The ROI argument for contracts is much easier when the first five tables are ones that have caused recent visible incidents. Picking obscure tables for a clean technical pilot wastes the political capital you need for the next phase.
Define the SLA in business terms. "Freshness within 60 minutes of source commit, 99.5% of the time, measured monthly" is enforceable. "Near real-time" is not.
Set a deprecation window. Six weeks is the minimum we have seen work for breaking changes; twelve weeks is more typical for tables with broad consumer surface area.
Instrument violations from day one. Track the number of contract violations per producer per week. Within a quarter, the chart will tell you which producers have actually internalized the discipline and which are still treating it as theater.
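Two of the notes above, the business-terms SLA and the violation chart, reduce to small calculations. The sketch below assumes you can collect pairs of source commit time and table arrival time for each load, plus a log of violation events tagged with a producer name. Those inputs, the function names, and the default thresholds are illustrative rather than part of any particular tool.

```python
from collections import Counter
from datetime import datetime, timedelta


def freshness_attainment(observations, max_lag=timedelta(minutes=60)):
    """Fraction of loads that landed within max_lag of the source commit.

    observations: iterable of (source_commit_time, landed_in_table_time).
    """
    observations = list(observations)
    if not observations:
        raise ValueError("no observations in the measurement window")
    on_time = sum(
        1 for committed, landed in observations if landed - committed <= max_lag
    )
    return on_time / len(observations)


def meets_sla(observations, target=0.995, max_lag=timedelta(minutes=60)):
    """True if the window meets 'within max_lag, target of the time'."""
    return freshness_attainment(observations, max_lag) >= target


def weekly_violations(events):
    """Count violations keyed by (producer, ISO year, ISO week).

    events: iterable of (producer_name, violation_datetime).
    """
    counts = Counter()
    for producer, when in events:
        iso = when.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(producer, iso[0], iso[1])] += 1
    return counts


# Illustrative month of loads: 999 on time, 1 late
start = datetime(2024, 3, 4, 12, 0)
loads = [(start, start + timedelta(minutes=30))] * 999
loads.append((start, start + timedelta(minutes=90)))
print("attainment:", freshness_attainment(loads))
print("meets SLA:", meets_sla(loads))

events = [
    ("checkout-team", datetime(2024, 3, 4, 9, 0)),
    ("checkout-team", datetime(2024, 3, 6, 14, 0)),
    ("billing-team", datetime(2024, 3, 11, 8, 0)),
]
for key, count in sorted(weekly_violations(events).items()):
    print(key, count)
```

Feeding the weekly counts into any charting tool gives the per-producer trend line described in the last note.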

One pattern to avoid: do not start by buying a tool. We have watched several platforms spend six months and seven figures on a data observability platform before defining a single contract. The tool then surfaces thousands of "issues" against schemas nobody has agreed on, and the program collapses under the noise. Define five contracts in a Git repo with CI checks first. Buy tooling once you know what you would use it for.

The bottom line

Lakehouses give you a flexible storage substrate. Data contracts give you a reliable interface to it. Without contracts, every data platform reaches a complexity ceiling where new use cases cost more than they return. With contracts, the platform compounds rather than calcifies.

The firms pulling ahead on analytics and AI are not the ones with the most exotic infrastructure. They are the ones whose producers know what they owe their consumers, and whose consumers can build on that obligation without checking. That is an organizational property the storage layer cannot provide. The sooner platform leaders accept that the contract problem is theirs to solve - not their vendor's, not their catalog's, not next year's - the sooner the lakehouse investment starts paying back at the rate the original business case promised.
