Designing AI Governance That Survives Contact With Production

Most enterprise AI policies collapse the moment a model leaves the lab. A practical framework for governance that holds up under real production pressure.

Almost every Fortune 1000 enterprise has an AI governance policy on paper. Few have one that holds up the first time a model produces an unexpected output for a real customer at 2 AM. The gap between policy and operating reality is where most AI programs lose their executive sponsorship. Gartner has projected that at least 30% of generative AI projects will be abandoned after proof of concept by the end of 2025, with poor risk controls cited alongside data quality and unclear business value as primary causes. In our work with regulated enterprises, the projects that survive are not the ones with the longest policies; they are the ones whose policies map cleanly onto runbooks an on-call engineer can actually execute.

The three questions your governance must answer

A governance framework is only worth the cost of running it if it can answer three questions in production: _Who approved this model for this use case? What is its current performance against the thresholds it was approved against? Who is on the hook when it drifts?_ If any of those three lack a clear, automatable answer, you have policy theater rather than governance.

These questions are not abstract. They map directly to the evidence requested in any serious audit, including the documentation expectations set out in the NIST AI Risk Management Framework and the technical documentation requirements in Article 11 and Annex IV of the EU AI Act for high-risk systems. We have watched well-funded programs spend six to nine months drafting principles documents only to fail their first internal audit because no system of record could tie a specific production inference back to a specific approval decision. The principles were not wrong; they were unenforceable.
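To make the three questions concrete, here is a minimal sketch of a system of record that ties a production inference back to an approval decision. Everything in it is an illustrative assumption rather than a reference design: the `Approval` fields, the `(use_case, model_ref)` lookup key, and the single `min_accuracy` threshold are hypothetical stand-ins for whatever your own approval workflow captures.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Approval:
    """One approval decision (hypothetical schema, for illustration only)."""
    approval_id: str
    use_case: str
    model_ref: str
    approved_by: str      # who approved this model for this use case
    min_accuracy: float   # the threshold it was approved against
    owner_on_call: str    # who is on the hook when it drifts
    expires: date


ApprovalIndex = Dict[Tuple[str, str], Approval]


def find_approval(index: ApprovalIndex, use_case: str, model_ref: str,
                  today: date) -> Optional[Approval]:
    """Return the live approval for this use case and model, or None."""
    approval = index.get((use_case, model_ref))
    if approval is None or approval.expires < today:
        return None
    return approval


def answer_three_questions(approval: Approval,
                           current_accuracy: float) -> Dict[str, object]:
    """Answer the three governance questions from recorded evidence."""
    return {
        "approved_by": approval.approved_by,
        "within_threshold": current_accuracy >= approval.min_accuracy,
        "owner_on_call": approval.owner_on_call,
    }


# Example with made-up identifiers and values.
approval = Approval(
    approval_id="APR-0001",
    use_case="support-triage",
    model_ref="triage-v3",
    approved_by="risk-review-board",
    min_accuracy=0.90,
    owner_on_call="ml-platform-oncall",
    expires=date(2026, 1, 31),
)
index: ApprovalIndex = {(approval.use_case, approval.model_ref): approval}
found = find_approval(index, "support-triage", "triage-v3", date(2025, 6, 1))
```

If `find_approval` returns `None` for a model that is serving traffic, that inference has no approval behind it, which is exactly the gap an audit will surface.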

Why the standard model registry is not enough

Most organizations adopt MLflow, SageMaker Model Registry, or a vendor equivalent and consider the governance question solved. In our engagements we routinely find that registries track artifacts but not _decisions_. They record that a model was registered; they rarely record who approved it for which population, under what fairness or accuracy thresholds, and with what fallback if those thresholds are breached. That decision context is what an auditor, a regulator, or your own incident review will demand, and it cannot be reconstructed after the fact.

The gap is wider for foundation models accessed through an API. When your "model" is a vendor endpoint that the provider may swap, retrain, or deprecate on their schedule, the registry concept itself starts to fray. OpenAI, for example, has retired or deprecated multiple model snapshots within 12 months of release, and Anthropic publishes similar lifecycle policies. A registry entry pointing at gpt-4-turbo or claude-3-opus is a label, not an artifact. Governance has to bind to the use case, the prompt template, the evaluation set, and the threshold contract, not just the model name.

Here is the counter-take that we find most clients resist: for generative AI use cases, the model registry is often the wrong primary control. What you actually need to version, approve, and monitor is the _prompt-plus-retrieval-plus-tool-configuration_ as a unit. The underlying model is one input among several, and frequently not the one most likely to cause an incident. Teams that insist on shoehorning LLM applications into a classical ML registry pattern end up with governance metadata that is technically present and operationally useless.
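One way to version the prompt, retrieval settings, and tool configuration as a single unit is to fingerprint them together, so that a change to any part produces a new identifier that needs its own approval. The sketch below is an assumption about how this could be done, not a description of any particular registry product; the function name and the configuration fields are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def config_fingerprint(model: str, prompt_template: str,
                       retrieval: Dict[str, Any], tools: List[str]) -> str:
    """Hash the whole generative configuration as one governed unit."""
    unit = {
        "model": model,
        "prompt_template": prompt_template,
        "retrieval": retrieval,
        # Assumption: tool order does not change behavior, so sort the names.
        "tools": sorted(tools),
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable
    # regardless of the order in which dictionary keys were written.
    canonical = json.dumps(unit, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example with made-up configuration values.
approved = config_fingerprint(
    model="vendor-model-label",
    prompt_template="Answer using only the context below.\n{context}\n{question}",
    retrieval={"index": "kb-2025", "top_k": 5},
    tools=["search_tickets", "create_ticket"],
)
edited_prompt = config_fingerprint(
    model="vendor-model-label",
    prompt_template="Answer the question.\n{context}\n{question}",
    retrieval={"index": "kb-2025", "top_k": 5},
    tools=["search_tickets", "create_ticket"],
)
```

Here `approved` and `edited_prompt` differ even though the model label is unchanged. Binding approvals to the fingerprint rather than the model name is meant to catch exactly that case.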

An operating model in four layers

The governance programs that survive production share a layered structure rather than a single committee or single tool:

Intake. A lightweight risk classification at the use-case level, not the model level. A churn predictor and a credit decision are not the same risk class even if they share the same algorithm. We typically recommend a four-tier scheme aligned to the EU AI Act's risk categories, with internal dollar-loss and customer-impact thresholds attached so the classification is mechanical rather than political.
Approval. Tied to the risk class. Low-risk uses get a documented self-attestation; high-risk uses get a cross-functional review with mandatory legal, security, and business sign-off captured as machine-readable evidence. The most effective programs we see cap approval cycle time at 10 business days for tier 2 and below; anything slower drives shadow AI inside business units.
Monitoring. Performance, drift, fairness, and abuse signals streaming to the same observability stack as the rest of your production telemetry. Governance metrics that live in a separate dashboard get ignored within a quarter. Pipe them into the same Datadog, Splunk, or Grafana instance the SRE team already watches at 3 AM.
Response. Pre-authorized rollback, throttling, or human-in-the-loop fallbacks. The model's own owners must be able to invoke these without a steering committee meeting. A governance regime that requires a committee to authorize a rollback during an incident is a governance regime that will be bypassed during an incident.

The layering matters because each layer has a different time constant. Intake decisions might take days; approval decisions, weeks; monitoring runs in seconds; response must execute in milliseconds for some use cases. Trying to enforce governance with a single tool or single committee always optimizes for the wrong time scale.
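The intake layer's "mechanical rather than political" classification can be sketched as a pure function over a few declared attributes of the use case. The tier names below loosely follow the EU AI Act's four risk categories, but every threshold and field is a hypothetical placeholder; real cut-offs would come from your own risk appetite and legal review.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MINIMAL = 1
    LIMITED = 2
    HIGH = 3
    PROHIBITED = 4


@dataclass(frozen=True)
class UseCase:
    """Attributes declared at intake (hypothetical fields)."""
    name: str
    max_dollar_loss: float            # worst plausible loss per incident
    customers_affected: int           # customers touched by one bad release
    decides_about_people: bool        # e.g. credit, hiring, eligibility
    prohibited_practice: bool = False


def classify(uc: UseCase,
             high_loss: float = 1_000_000,
             limited_loss: float = 50_000,
             limited_customers: int = 1_000) -> Tier:
    """Classify at the use-case level; the thresholds are illustrative."""
    if uc.prohibited_practice:
        return Tier.PROHIBITED
    if uc.decides_about_people or uc.max_dollar_loss >= high_loss:
        return Tier.HIGH
    if (uc.max_dollar_loss >= limited_loss
            or uc.customers_affected >= limited_customers):
        return Tier.LIMITED
    return Tier.MINIMAL


# Same algorithm family, different risk class (made-up numbers).
churn = UseCase("churn-predictor", max_dollar_loss=20_000,
                customers_affected=500, decides_about_people=False)
credit = UseCase("credit-decision", max_dollar_loss=20_000,
                 customers_affected=500, decides_about_people=True)
churn_tier = classify(churn)
credit_tier = classify(credit)
```

With these inputs the churn predictor lands in `Tier.MINIMAL` and the credit decision in `Tier.HIGH`, even though nothing about the underlying algorithm differs.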

Common failure modes we see

Three patterns recur across engagements.

The first is treating generative AI as a separate governance regime from traditional ML. This duplicates work, creates seams attackers exploit, and produces the inevitable conversation in year two about "harmonizing the frameworks." Risk classification, approval workflow, monitoring infrastructure, and incident response should be one system with parameters, not two systems with overlapping scope. The threat model differs (prompt injection and jailbreaks are not on the classical ML risk register), but the control plane should be unified.

The second is governance run by a function that does not own production reliability. Risk or compliance teams writing policies that platform engineers cannot operationalize is the single most common failure mode in our practice. The fix is to require that every governance artifact ship with a corresponding technical control: if the policy says "models must be monitored for drift," the policy is not done until there is a library, a default threshold, and a paged alert wired into the on-call rotation. We have seen organizations cut their internal audit findings by more than half within two cycles simply by enforcing this pairing.
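As an illustration of pairing a policy sentence with a technical control, the sketch below implements one possible drift check: the population stability index (PSI) over equal-width bins, with a default alert threshold. PSI is our choice for the example and is not prescribed by the text above. The 0.2 default follows a common rule of thumb and should be treated as an assumption to tune, and `should_page` is a hypothetical hook for your alerting system.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index of `actual` against a baseline sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width on constant baselines

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_page(psi_value: float, threshold: float = 0.2) -> bool:
    """Default alerting rule; the threshold is a rule-of-thumb assumption."""
    return psi_value > threshold


# Made-up score distributions: a baseline and a clearly shifted sample.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
page_on_shift = should_page(psi(baseline, shifted))
page_on_same = should_page(psi(baseline, baseline))
```

The point of shipping a default like this with the policy is that a team has a working alert on day one and only needs to justify deviations from it.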

The third is using model cards as a destination rather than a checkpoint. Model cards age out within weeks of deployment if they are not regenerated automatically from monitoring data. The original Model Cards for Model Reporting paper by Mitchell et al. framed cards as living documents tied to evaluation results, but most enterprise implementations freeze them at approval and never refresh them. A model card that describes performance on a six-month-old evaluation set against a population that has since shifted is worse than no model card; it provides false assurance to downstream consumers.
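A minimal sketch of treating the card as a checkpoint: flag it as stale once its evaluation date is older than a maximum age, and regenerate the metrics section from current monitoring data. The dictionary layout, the field names, and the 30-day default are all hypothetical choices for illustration.

```python
from datetime import date, timedelta
from typing import Any, Dict


def is_stale(card: Dict[str, Any], today: date,
             max_age: timedelta = timedelta(days=30)) -> bool:
    """True when the card's evaluation is older than the allowed age."""
    return today - card["evaluated_on"] > max_age


def refresh_card(card: Dict[str, Any], latest_metrics: Dict[str, float],
                 today: date) -> Dict[str, Any]:
    """Return a new card with metrics taken from current monitoring data."""
    refreshed = dict(card)  # copy so the approved original stays untouched
    refreshed["metrics"] = dict(latest_metrics)
    refreshed["evaluated_on"] = today
    return refreshed


# Made-up card frozen at approval time.
card = {
    "model": "triage-v3",
    "evaluated_on": date(2025, 1, 15),
    "metrics": {"accuracy": 0.93},
}
today = date(2025, 7, 15)
if is_stale(card, today):
    current_card = refresh_card(card, {"accuracy": 0.88}, today)
else:
    current_card = card
```

Keeping the original card unchanged and emitting a refreshed copy preserves the approval-time record while giving downstream consumers current numbers.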

A fourth pattern is increasingly common with the rise of agentic systems: governance scoped to a single model when the actual production system is a chain of models, retrievers, and tools. The risk lives in the composition, not the components. We expect this to be the dominant governance gap of the next 24 months as more enterprises move from single-call inference to multi-step agent architectures.

What to put in place this quarter

If you are starting from a near-zero baseline, three steps deliver disproportionate value.

First, publish a one-page risk taxonomy and tie it to dollar thresholds for approval. The taxonomy does not need to be perfect; it needs to be applied consistently across at least 80% of in-flight use cases within a quarter. Iteration on a real taxonomy beats waiting for a perfect one. Aligning the categories to the EU AI Act's risk tiers costs nothing extra and pays off if you have any European exposure, since the AI Act's obligations for high-risk systems are scheduled to begin applying in August 2026 and your taxonomy will need to map to them anyway.

Second, instrument every production model with the same five signals (latency, accuracy proxy, drift, abuse, and fallback engagement), regardless of vendor or framework. The specific thresholds matter less than the consistency. A platform team that can answer "what is the drift on every model in production right now?" in under 30 seconds has more effective governance than one with a 200-page policy and no instrumentation.
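A consistency check for the five signals can be very small. The sketch below assumes you can list which metric names each production model currently emits; the signal names and the shape of the input are hypothetical and would map onto whatever your observability stack exposes.

```python
from typing import Dict, Iterable, List

# Assumed metric names for the five signals; rename to match your stack.
REQUIRED_SIGNALS = frozenset({
    "latency_ms",
    "accuracy_proxy",
    "drift",
    "abuse_rate",
    "fallback_rate",
})


def missing_signals(emitted: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each model to the required signals it does not emit."""
    gaps: Dict[str, List[str]] = {}
    for model, signals in emitted.items():
        missing = sorted(REQUIRED_SIGNALS - set(signals))
        if missing:
            gaps[model] = missing
    return gaps


# Made-up inventory of what each production model reports today.
inventory = {
    "fraud-scorer": ["latency_ms", "accuracy_proxy", "drift",
                     "abuse_rate", "fallback_rate"],
    "support-agent": ["latency_ms", "abuse_rate"],
}
gaps = missing_signals(inventory)
```

Run on a schedule, a check like this turns "instrument every production model" into a report with names attached rather than an aspiration.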

Third, run a tabletop exercise of a model failure within sixty days. Pick a plausible scenario: a recommendation model amplifying a controversial output, an LLM-powered support agent leaking PII, a fraud model rejecting a protected class at twice the baseline rate. Walk through detection, escalation, customer communication, rollback, root cause, and disclosure. The gaps it surfaces, usually around who has authority to take the model offline, who notifies customers, and how you reconstruct what the model actually did, are worth more than another month of policy drafting.

A fourth step worth adding once the basics are in place: pre-negotiate your vendor contracts for model lifecycle and incident cooperation. If your foundation model provider deprecates an endpoint with 90 days of notice, your governance regime needs to know whether re-approval is required and who pays for the re-evaluation work. These terms are negotiable at procurement and effectively impossible to change later.

The bottom line

AI governance is not a document. It is a set of decisions made repeatedly under time pressure, supported by evidence, and tied to clear ownership. The organizations getting it right have stopped trying to write the perfect framework and started instrumenting the imperfect one they already operate.

The pattern we see in mature programs is convergence rather than expansion: fewer committees, fewer bespoke tools, fewer separate frameworks for separate model classes. Governance that survives production looks more like SRE than like compliance: runbooks, dashboards, on-call rotations, and post-incident reviews, with the legal and risk functions embedded as customers of that operational stack rather than parallel owners of a competing one. Build that, and the policy document writes itself from the artifacts your platform already produces. Skip it, and no amount of board-approved principles will help when the model misbehaves at 2 AM.