From Pilots to Platform: Operationalizing Generative AI
Pilots prove a model can work. Platforms prove the organization can. The architectural and operational shifts that close the gap.
The first generative AI pilot is easy. The fortieth one, running concurrently with shared evaluation infrastructure, shared guardrails, and a shared cost line, is where most enterprises stall. Closing the gap is a platform problem disguised as an AI problem. The industry data backs this up: a 2024 McKinsey global survey found that while 65% of organizations regularly use generative AI in at least one function, far fewer report material EBIT impact from it. The gap between adoption and outcome is the platform gap.
The pilot trap
Pilots succeed because they are scoped to forgive almost every operational sin. A single team owns the use case, the data, the prompts, and the user feedback loop. Costs are absorbed in an innovation budget. Latency is acceptable because the user is the team that built it. None of those conditions hold once you have a portfolio. Pilot velocity becomes pilot debt.
In our work with large enterprises, we've seen the same anti-patterns repeat. Each pilot team negotiates its own contract terms with a model provider. Each one stores prompts in a private Git repo, or worse, in a Notion page. Each one defines "good enough" against a private rubric that no one else can reproduce. When a regulator, a CISO, or a CFO asks a portfolio-level question - what models are we using, what data flows through them, what does it cost per business outcome - the organization cannot answer in any timeframe shorter than weeks of forensic work.
The counter-take worth stating plainly: most enterprises should run fewer pilots, not more. The instinct to fund a hundred experiments and "let a thousand flowers bloom" produces a portfolio that is impossible to operate, secure, or measure. We have repeatedly seen better outcomes from organizations that capped active pilots at ten or fewer until the platform substrate was in place. Restraint at the front of the funnel is what makes scale at the back of the funnel possible.
Five capabilities a generative AI platform must own
The platforms that scale share a common spine. Each capability is independently boring; together they remove the friction that kills portfolios.
- Model gateway. A single point of egress to model providers, with routing, retries, fallback chains, and cost attribution per business unit. Without this, every team rebuilds the same client wrapper and your procurement team cannot answer a basic spend question. A gateway also enables a quietly important capability: graceful provider failover. The well-publicized OpenAI outages of 2023 and 2024, some lasting several hours, have made multi-provider routing a reliability requirement rather than an architectural luxury.
- Prompt and tool registry. Versioned, environment-promoted prompts and tool definitions. Treat them as configuration, not code, but with the same review and rollback discipline. The registry should expose a diff view so reviewers can see what changed between v17 and v18 of a prompt without spelunking through commit history.
- Evaluation harness. Offline benchmarks plus online A/B with statistical rigor. Most teams stop at vibes-based comparisons; that is not enough once a model change moves real revenue. The harness must support LLM-as-judge evaluations, human-labeled golden sets, and regression tests that run on every prompt or model change. Anthropic's published guidance on building evaluations is a reasonable starting point for teams new to the discipline.
- Safety and policy enforcement. Input and output filtering, PII redaction, jailbreak detection, and audit logging. Centralized so security reviews scale linearly with use cases instead of multiplicatively. The NIST AI Risk Management Framework provides a useful taxonomy for mapping enforcement controls to risk categories, and aligning your platform's policy layer to it shortens the path through internal risk and audit functions.
- Observability. Token-level cost and latency telemetry, drift detection on inputs, and feedback collection on outputs. The same operational signals you would expect from any production system, applied to non-deterministic ones. Latency budgets matter more than teams expect: at typical streaming rates of 40-80 tokens per second on frontier models, a 1,500-token response is already pushing 20-30 seconds of user wait time before any tool calls.
A capability map is not a roadmap. The sequencing matters. We recommend building the gateway first, observability second, and the registry third. Evaluation and safety enforcement layer cleanly on top of those three; trying to build them first usually produces tooling no one is forced to use.
Build, buy, or assemble
Few enterprises should build all five capabilities from scratch in 2026. Vendors like Vertex AI, Bedrock, and Azure AI Foundry cover three to four of them well; tools like Langfuse, Weights & Biases, and Helicone fill the gaps. The right architecture is usually a thin internal platform layered on top of one hyperscaler primitive plus one or two best-of-breed tools, with strict interface contracts so any single component can be swapped without breaking the others.
The interface discipline is the part most teams skip. We've seen platforms built around Bedrock's native APIs that became unmovable within eighteen months because every application in the enterprise had taken a direct dependency on the vendor's request schema. The fix is unglamorous: define your own internal request and response shapes, even if they look almost identical to the underlying provider's, and translate at the gateway boundary. The cost is one extra adapter layer. The benefit is the option to migrate workloads when pricing or model quality shifts - and pricing does shift. GPT-4-class input token prices have dropped by roughly an order of magnitude between 2023 and 2025, and any platform that cannot capture those reductions by re-routing traffic is leaving real money on the table.
A counter-take here as well: the popular advice to "stay model-agnostic" is half-right at best. True agnosticism is expensive and produces lowest-common-denominator capabilities. The pragmatic stance is to be portable at the gateway and registry layer, while permitting individual use cases to depend on provider-specific features (structured outputs, prompt caching, computer use, long-context retrieval) when the business value justifies the lock-in. Make the trade-off explicit and reviewable, not implicit and ambient.
Organizational design that holds up
The most common failure pattern is putting a centralized AI platform team inside the data organization, where it is structurally insulated from the application engineers it must serve. The teams that scale put the platform inside the same engineering organization that owns Kubernetes, observability, and developer tooling. Generative AI infrastructure is platform engineering with a probabilistic execution layer; treating it as a separate discipline reproduces the data-versus-engineering schisms of the last decade.
The headcount math matters too. A functional generative AI platform team for a Fortune 500 engineering org is typically eight to fifteen engineers in its first year, scaling to twenty to thirty as the surface area grows. We've seen organizations attempt this with three or four people and a charter borrowed from the data science group; the result is a platform that exists on paper but that no application team actually uses. Conversely, organizations that staff fifty-plus engineers on day one tend to over-engineer abstractions before any real workload has stress-tested them.
Reporting lines matter as much as headcount. The platform lead should report to the same VP or SVP who owns the rest of the developer platform, not to a Chief AI Officer or Chief Data Officer who sits outside the engineering line. We have nothing against the CAIO role for strategy and external posture, but operational platform ownership belongs in engineering. Where governance functions sit is a separate question, and the EU AI Act's staged obligations - with general-purpose AI provisions taking effect in August 2025 and high-risk system requirements following in 2026 - have made it worth establishing a clear AI governance function distinct from the platform team itself.
What success looks like at twelve months
An enterprise running generative AI as a platform, not a portfolio of pilots, exhibits four traits: any product team can ship a new use case in days, not quarters; the security team can answer a regulator's question about model usage in under an hour; the finance team can attribute spend to specific business outcomes; and the AI platform team is shrinking as a percentage of total engineering headcount because the patterns are absorbed into shared tooling.
We would add three quantitative markers we look for in mature platforms:
- Time to first token in production for a new use case under thirty days, measured from the day a product team is approved to use the platform.
- Cost per use case attributable to within 5% of actual spend, with daily granularity. If finance cannot reconcile within that band, the gateway's attribution model is broken.
- Evaluation coverage above 80% for production traffic - meaning at least four of every five inference calls flow through a use case that has a registered evaluation suite running on a defined cadence. Below this threshold, the organization is shipping prompt and model changes blind.
These are not soft targets. The teams that hit them are the ones whose generative AI investments survive the budget conversation in year two, when the novelty premium has worn off and the CFO is asking what 18-24 months of spend has actually produced.
The bottom line
Generative AI does not reward the most ambitious roadmap. It rewards the organization that builds the dullest, most reliable substrate underneath it. The model providers will keep shipping faster, cheaper, more capable systems; capability is no longer the constraint. The constraint is operational maturity - the ability to take a model that exists in the world today and put it in front of a customer, a regulator, an auditor, and a CFO without any of them flinching.
The pilots prove the technology works. The platform proves the company can.
Continue reading
Securing the LLM Supply Chain: Threat Models for AI-Powered Apps
Most enterprise application security programs were designed for deterministic systems. LLM-powered applications break enough of those assumptions that…
From RPA to Agentic Automation: When to Graduate, When to Stay
Every major RPA vendor, UiPath, Automation Anywhere, Blue Prism, is repositioning around agentic AI. Their message: deterministic RPA is yesterday; LL…
Designing AI Governance That Survives Contact With Production
Most enterprise AI policies collapse the moment a model leaves the lab. A practical framework for governance that holds up under real production press…

