You can build a working AI agent in an afternoon. Pick a framework, wire up some tools, write a system prompt, point it at an LLM. It works. It demos well. Your team is impressed.
Then you try to run it in production. And it falls apart in ways the demo never warned you about.
80.3% of AI projects fail to deliver intended business value, according to RAND Corporation. Not because the models aren't capable. They are. The models have been good enough for most business applications since mid-2025. The failures come from everything around the model — the part that nobody built because nobody thought it was the hard part.
OpenAI's engineering team has a name for this. They call it harness engineering.
What Is Harness Engineering?
In early 2026, OpenAI published details of an internal experiment: over one million lines of production code, zero written by humans. Three engineers orchestrated Codex agents that produced roughly 1,500 pull requests, processing a billion tokens per day. The engineers didn't write code. They designed the system that made code-writing agents reliable.
Ryan Lopopolo, who led the experiment, put it plainly: 'The only fundamentally scarce thing is the synchronous human attention of my team.'
The harness is everything that wraps around the model to make it work: the evaluation framework that catches failures before users do, the monitoring that tracks quality over time, the scoring that quantifies whether outputs are actually good, the guardrails that prevent the system from doing something catastrophic, and the feedback loops that make it better.
Three core pillars define it:
- Context engineering — managing information flow. What goes into the prompt, what gets compressed, what gets cached, what gets thrown away. Priority scoring for retrieval (see the sketch after this list). Knowledge persistence outside the context window.
- Architectural constraints — tool access controls, output validation, scope boundaries, safety guardrails. The boundaries that prevent an agent from going rogue.
- Entropy management — monitoring for drift, regression detection, periodic audits, automated cleanup. Systems degrade over time. The harness catches it.
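To make the first pillar concrete, here is a minimal sketch of priority-scored context assembly in Python. The `Chunk` fields, relevance scores, and token budget are illustrative assumptions, not any particular framework's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    relevance: float   # e.g. retriever similarity score
    tokens: int        # rough token count for this chunk

def assemble_context(chunks: list[Chunk], budget: int) -> str:
    """Greedily keep the highest-relevance chunks that fit the token budget.

    Anything that doesn't fit is dropped (or persisted elsewhere for later
    turns) rather than silently truncated mid-chunk.
    """
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return "\n\n".join(c.text for c in selected)

# Example: a 300-token budget forces the lowest-relevance chunk out.
chunks = [
    Chunk("Refund policy details...", relevance=0.91, tokens=120),
    Chunk("Shipping FAQ...", relevance=0.47, tokens=200),
    Chunk("Order #1234 history...", relevance=0.88, tokens=150),
]
print(assemble_context(chunks, budget=300))
```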
None of this is glamorous. All of it is what separates a working demo from a working product.
Why Are Models No Longer the Bottleneck?
One of the clearest shifts in the past twelve months has been model commoditisation. As of April 2026, frontier models from Anthropic, OpenAI, Google, and a growing list of open-source alternatives are within striking distance of each other on most benchmarks. The performance gap between providers is narrowing quarterly.
The question has changed. It's no longer 'which model is best?' but 'what can you build on top of commoditised intelligence?' Bessemer Venture Partners frames it this way: differentiation has shifted to the memory and context layer. The model is a component. The harness is the product.
This matters practically because most production failures have nothing to do with model capability. 78% of AI failures are invisible — no alerts fire, no users complain. The system confidently produces wrong answers (the Confidence Trap), gradually drifts off-topic (the Drift), or misunderstands the request but produces something plausible enough to pass (the Silent Mismatch). More powerful models don't fix these. 93% of invisible failures persist even after upgrading to a better model, because they stem from interaction dynamics, not raw intelligence.
If your system is failing silently and you don't have monitoring to detect it, a smarter model just fails silently with better grammar.
How Bad Is the Capability-to-Production Gap?
The numbers are sobering.
| Source | Finding |
|---|---|
| MIT Sloan | 95% of GenAI pilots fail to scale to production |
| RAND Corporation | Only 19.7% of AI projects achieve or exceed business objectives |
| McKinsey | 88% of organisations use AI, but only 6% qualify as high performers |
| Gartner | Only 28% of AI use cases fully succeed and meet ROI expectations |
| Cleanlab | Just 5.2% of surveyed practitioners have AI agents live in production |
The cost of getting it wrong isn't trivial. MIT Sloan found that cost overruns average 380% at production scale compared to pilot projections. Deloitte reports the average sunk cost per abandoned AI initiative at $7.2 million. And 64% of scaling failures trace back to infrastructure limitations, not model limitations.
These aren't failures of ambition. They're failures of engineering. Teams build the agent, skip the harness, and wonder why production breaks.
What Does Production AI Engineering Actually Require?
Four disciplines separate production systems from prototypes.
Evals: The Foundation You Can't Skip
Evaluations are the test suite for non-deterministic systems. Unlike traditional software where you test deterministic inputs against expected outputs, AI systems need probabilistic evaluation across varied scenarios.
Anthropic's engineering team recommends starting with 20-50 simple tasks drawn from real failures. Not synthetic benchmarks. Real things your system got wrong. Build three types of graders: code-based (fast, deterministic, cheap), model-based (flexible, scalable, handles nuance), and human review (gold standard for subjective quality).
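A code-based grader can be a few lines of deterministic checks against a known failure case. The sketch below is purely illustrative; the task and assertions are hypothetical, not drawn from Anthropic's guidance:

```python
def grade_refund_answer(output: str) -> bool:
    """Code-based grader: cheap, deterministic checks on a known-answer case.

    Passes only if the response cites the correct refund window and does not
    invent a policy that doesn't exist.
    """
    text = output.lower()
    mentions_window = "30 days" in text or "thirty days" in text
    hallucinated_policy = "lifetime refund" in text
    return mentions_window and not hallucinated_policy

assert grade_refund_answer("Refunds are accepted within 30 days of purchase.")
assert not grade_refund_answer("We offer a lifetime refund guarantee.")
```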
Two metrics matter for non-deterministic behaviour. pass@k measures whether the system succeeds at least once in k attempts. pass^k measures whether it succeeds every time across k trials — the consistency metric. A system that works 70% of the time is useless if you need it to be reliable.
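As a minimal sketch following those definitions (the formal pass@k estimator used in some benchmarks is more involved), the two metrics differ only in whether you require any success or every success:

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: the task counts as passed if at least one of k attempts succeeds."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: the task counts as passed only if all k attempts succeed."""
    return all(trials)

trials = [True, False, True, True, True]  # five attempts at the same task
print(pass_at_k(trials))   # True:  it worked at least once
print(pass_hat_k(trials))  # False: it isn't consistent enough to ship
```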
The payoff is concrete. When new models drop, teams without evals spend weeks testing manually. Teams with evals run the suite, identify strengths and regressions, tune their prompts, and upgrade in days. Evals are a competitive advantage that compounds.
As of early 2026, 52% of organisations run offline evaluations and 89% have some form of observability — but only 37% run online evals against production traffic. The gap between 'we have monitoring' and 'we actually evaluate quality' is where most teams sit.
Monitoring and Observability: Seeing What's Actually Happening
You can't improve what you can't measure. Production AI systems need telemetry across multiple dimensions:
- Latency — P95, not averages. The response time your slowest 5% of users experience is the one that matters.
- Cost per query — token usage, API costs, infrastructure overhead. These compound fast.
- Quality scores — automated scoring on every response, not spot checks.
- Hallucination rates — even frontier models exceed 10% hallucination on difficult tasks, per Vectara's 2026 benchmarks.
- Error categorisation — not just 'it failed' but why: hallucination, refusal, off-topic, tool failure, context overflow.
The tooling landscape for this has matured rapidly. Axiom is building an AI engineering toolkit on OpenTelemetry with trace waterfalls and cost tracking. Langfuse offers open-source LLM observability with self-hosting for data sovereignty. Braintrust ties evals directly into CI/CD pipelines, blocking merges when quality degrades. Arize Phoenix provides vendor-agnostic tracing with over 7,800 GitHub stars.
Despite this, fewer than one in three teams report satisfaction with their observability and guardrail solutions. The tools exist. Adoption and implementation depth lag behind.
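What the instrumentation looks like depends on your stack, but here is a rough sketch using the OpenTelemetry Python API, which several of these tools build on. The span name, attribute keys, and placeholder helpers are illustrative choices, not an official semantic convention:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("ai-harness")

def count_tokens(text: str) -> int:
    # Placeholder: a real system would use the model's tokenizer.
    return len(text.split())

def score_response(prompt: str, response: str) -> float:
    # Placeholder: in production this would be an automated quality grader.
    return 1.0 if response else 0.0

def call_model(prompt: str) -> str:
    # Placeholder for your actual model client.
    return "stubbed model response"

def call_model_instrumented(prompt: str) -> str:
    """Wrap every model call in a span carrying the dimensions worth alerting on."""
    with tracer.start_as_current_span("llm.completion") as span:
        start = time.monotonic()
        response = call_model(prompt)
        span.set_attribute("llm.latency_ms", (time.monotonic() - start) * 1000)
        span.set_attribute("llm.prompt_tokens", count_tokens(prompt))
        span.set_attribute("llm.completion_tokens", count_tokens(response))
        span.set_attribute("llm.quality_score", score_response(prompt, response))
        return response

print(call_model_instrumented("Summarise the refund policy."))
```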
Scoring: Quantifying Quality at Scale
Hallucinations alone cost businesses $67.4 billion globally in 2024, with domain-specific incidents ranging from $18,000 to $2.4 million each. Per-employee costs hit $14,200 annually just on fact-checking AI outputs. You need automated scoring, not manual review.
The emerging standard is running lightweight evaluation models against 100% of production traffic. Galileo's approach uses small language models that evaluate at $0.02 per million tokens with 152ms latency, making full-traffic monitoring economically viable. Pre-production evals automatically become production governance rules — the same criteria that gate your CI/CD pipeline also monitor your live system.
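Mechanically, full-traffic scoring is a thin loop around whatever judge model you run. A hedged sketch, where `judge()` is a stand-in for your lightweight evaluation model and the threshold is an arbitrary example:

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    prompt: str
    response: str
    score: float    # 0.0-1.0 from the lightweight judge
    flagged: bool   # True if the score fell below the quality threshold

def judge(prompt: str, response: str) -> float:
    # Stand-in for a call to a small, cheap evaluation model or rubric grader.
    return 0.9

def score_traffic(pairs: list[tuple[str, str]], threshold: float = 0.7) -> list[ScoredResponse]:
    """Score every (prompt, response) pair and flag anything below threshold."""
    results = []
    for prompt, response in pairs:
        score = judge(prompt, response)
        results.append(ScoredResponse(prompt, response, score, score < threshold))
    return results

scored = score_traffic([("Where is my order?", "It shipped yesterday.")])
print(sum(r.flagged for r in scored), "responses below threshold")
```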
RAG implementations that include proper scoring and retrieval quality metrics can reduce hallucination by up to 71%. But only if you're measuring it. Most teams ship RAG and assume it works because the demo looked good.
Prompt Management as Engineering Discipline
Production teams treat prompts like code: version-controlled, regression-tested, deployed through CI/CD. The practices are specific:
- Shadow mode — run new prompts against production traffic without affecting users. Compare outputs.
- Canary deployments — test with a small user segment, monitor for quality degradation, roll back if needed.
- A/B testing — validate in production with statistical significance before full rollout.
- Regression suites — every prompt change runs against a test set. Builds fail on quality regression.
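A regression suite doesn't need heavy tooling to get started. Here is a minimal sketch using pytest, assuming prompts live as versioned files in the repo; the file path, cases, and `run_agent()` stub are illustrative:

```python
# test_prompt_regression.py: runs in CI; a failing grade blocks the merge.
import pathlib

import pytest

PROMPT = pathlib.Path("prompts/support_agent.txt").read_text()

# Real failure cases collected from production, each with its own grader.
CASES = [
    ("Customer asks about the refund window", lambda out: "30 days" in out),
    ("Customer asks where their order is", lambda out: "order" in out.lower()),
]

def run_agent(system_prompt: str, user_message: str) -> str:
    # Stand-in for calling your model with the versioned prompt.
    return "Refunds are accepted within 30 days. Your order shipped yesterday."

@pytest.mark.parametrize("case,grader", CASES)
def test_no_quality_regression(case, grader):
    output = run_agent(PROMPT, case)
    assert grader(output), f"Quality regression on: {case}"
```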
70% of regulated enterprises rebuild their AI agent stack every three months or faster, according to Cleanlab's 2025 production survey. This isn't instability. It's iteration speed. The teams that ship reliable AI systems are the ones that can change them safely. And you can't change anything safely without tests.
Why Do Most Teams Skip This?
Because the demo worked.
The fundamental trap is that AI demos are convincing. You show an agent doing something impressive, stakeholders sign off, and the project moves to 'production' — which means running the demo at scale without the engineering infrastructure that production demands.
85% of in-depth case studies show production teams using custom in-house implementations, abandoning frameworks they prototyped with. 68% follow bounded workflows rather than open-ended planning. Production AI looks nothing like the demo. The demo is unconstrained, impressive, and fragile. Production is constrained, reliable, and boring. Boring is what you want.
The other reason teams skip it: they don't know this discipline exists. AI engineering as a field is younger than the models it supports. Most engineering teams have deep experience building deterministic software and almost none building probabilistic systems that need continuous evaluation. The skills are different. The mental models are different. The tooling is different.
What Should You Actually Do?
If you're running AI in production — or about to — here's the minimum viable harness:
- Build evals before you build features. Start with 20 real failure cases. Grade them. Automate the grading. Run them on every change.
- Instrument everything. Latency, cost, token usage, quality scores, error types. You need this data to make any informed decision about your system.
- Monitor production quality continuously. Not weekly reviews. Automated scoring on every response, with alerts when quality degrades below your threshold.
- Version your prompts. Store them in Git. Test them in CI. Deploy them through a pipeline. Treat a prompt change like a code change.
- Assume invisible failure. 78% of AI failures don't trigger alerts. Design your monitoring to catch the silent ones — confidence calibration, drift detection, output sampling and human review. (One such check is sketched below.)
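Catching those silent failures usually means comparing live quality against a known baseline rather than waiting for errors. Below is a minimal sketch of one such check, a rolling-mean drift monitor; the baseline, window, and tolerance values are illustrative:

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean of production quality scores drops
    meaningfully below the baseline established by offline evals."""

    def __init__(self, baseline: float, window: int = 200, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record a score; return True if drift should trigger an alert."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.82)
# Simulated stream of automated quality scores drifting slowly downward.
for i in range(400):
    score = 0.85 - 0.0005 * i
    if monitor.record(score):
        print(f"Quality drift detected at request {i}")
        break
```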
Building agents has never been more accessible. The models are extraordinary. The frameworks are mature. The barrier to 'getting something working' is the lowest it's ever been.
The barrier to getting something working reliably, at scale, in production hasn't moved. That's where the engineering is. That's where the value is.
The harness is the product. The model is a dependency.
If you're building AI systems and want to talk about what production-grade infrastructure looks like for your use case, we'd like to hear from you.


