AI Coding Benchmarks Are Measuring the Wrong Things
Benchmarks like HumanEval and MBPP suggest LLMs succeed on 80–90% of tasks. But when tested on real production codebases, success rates fall dramatically. New research explains why.
AI coding tools look incredibly capable.
Benchmarks regularly report 80–90% success rates. Vendor demos reinforce the same message: faster development, higher productivity, and AI systems approaching human-level performance.
But when tested on real production code, the results look very different.
In our analysis, even the strongest models succeeded on fewer than 1 in 4 tasks without introducing errors or regressions.
That gap changes how these tools should be evaluated – and used.
Benchmarks Reward Isolated Intelligence
Benchmarks are essential for AI research. They allow us to measure progress and compare models against standardized tasks.
But most widely cited coding benchmarks evaluate very narrow problems.
Typically they involve:
- generating small functions
- completing short code snippets
- solving algorithmic exercises in isolation
These tests are useful for measuring whether a model can generate syntactically correct code. What they don’t measure is how well AI performs inside real software systems.
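A typical benchmark task looks something like this (an illustrative example in the style of HumanEval's first problem — a single self-contained function checked against a handful of unit tests):

```python
def has_close_elements(numbers, threshold):
    """Return True if any two numbers in the list are closer to each
    other than the given threshold -- a self-contained task with no
    surrounding system to reason about."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Benchmark-style verification: a few hidden unit tests.
assert has_close_elements([1.0, 2.8, 3.0, 4.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```

Everything the model needs is in the docstring; there is no architecture, no dependency, and no history to account for.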
Production software rarely resembles a clean benchmark problem. Most systems contain:
- years of architectural decisions
- dependencies across multiple components
- legacy code and evolving abstractions
- multiple programming languages and frameworks
Changes made in one part of a system can affect behavior somewhere else entirely.
Most Engineering Work Isn’t New Code
There’s another reason the benchmark narrative can be misleading.
The majority of software engineering work isn’t writing new features. It’s maintaining and improving existing systems. Studies consistently show that between 60% and 90% of the lifecycle cost of software systems is associated with maintenance rather than initial development.
Much of that work involves refactoring, improving maintainability, and addressing structural issues that accumulate over time. These tasks require engineers to reason about architecture, understand historical design decisions, and modify code in ways that preserve system behavior while improving its structure.
In other words, the hardest problems in software engineering are actually about safely evolving complex systems. And this is precisely the type of work that benchmarks rarely measure.
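As a small illustration of what behavior-preserving structural improvement means (a hypothetical example, not drawn from the study): a maintainability refactoring might flatten nested conditionals into guard clauses, and it only counts as correct if the observable behavior is untouched.

```python
# Before: nested conditionals that are hard to follow and extend.
def shipping_cost_before(weight_kg, is_express):
    if weight_kg > 0:
        if is_express:
            return 10 + weight_kg * 2
        else:
            return 5 + weight_kg
    else:
        raise ValueError("weight must be positive")

# After: a guard clause plus flat branches -- identical behavior,
# lower nesting depth.
def shipping_cost_after(weight_kg, is_express):
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if is_express:
        return 10 + weight_kg * 2
    return 5 + weight_kg

# The refactoring is only valid if behavior is preserved.
assert shipping_cost_before(3, True) == shipping_cost_after(3, True) == 16
assert shipping_cost_before(3, False) == shipping_cost_after(3, False) == 8
```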
What Happens When You Test AI in Real Systems
To better understand how AI performs in real engineering environments, we recently evaluated large language models (LLMs) on a different type of task.
Instead of asking models to generate new functions, we asked them to improve the maintainability of existing production source files. The study analyzed thousands of real files across multiple programming languages and evaluated dozens of language models on maintainability-oriented refactoring tasks.
Each attempt was validated through a pipeline designed to mimic real engineering review processes. The generated code had to remain syntactically correct, preserve structural integrity, avoid introducing defects, and demonstrably improve maintainability.
When evaluated in this environment, the results looked very different from typical benchmark scores. Even the strongest models succeeded on fewer than one in four tasks without introducing errors or regressions. In practice, this means that most AI-generated refactorings still require substantial developer review and correction before they can be safely deployed.
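The study's actual pipeline is more involved, but the shape of such a gate can be sketched in miniature. In this hypothetical version (the function names and the nesting-depth proxy are our own illustrative choices, not the study's metrics), a candidate refactoring must parse, must pass the existing tests, and must measurably improve a complexity proxy:

```python
import ast

def max_nesting(source):
    """Crude maintainability proxy: deepest nesting of control-flow
    statements. (Illustrative only -- not the study's actual metric.)"""
    def depth(node, d=0):
        d += isinstance(node, (ast.If, ast.For, ast.While, ast.Try, ast.With))
        return max([d] + [depth(child, d) for child in ast.iter_child_nodes(node)])
    return depth(ast.parse(source))

def accept_refactoring(original, candidate, run_tests):
    """Gate a model-proposed refactoring the way a reviewer might."""
    try:
        ast.parse(candidate)                 # 1. must still be valid syntax
    except SyntaxError:
        return False
    if not run_tests(candidate):             # 2. must not break behavior
        return False
    return max_nesting(candidate) < max_nesting(original)  # 3. must improve

# Hypothetical usage: a nested original vs. a flattened candidate.
original = "if ok:\n    if fast:\n        cost = 1\n"
candidate = "if ok and fast:\n    cost = 1\n"
accepted = accept_refactoring(original, candidate, run_tests=lambda src: True)
```

A real gate would run the project's full test suite and a richer set of static checks, but the principle is the same: the bar is not "plausible code," it is "code that survives review."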
The gap between benchmark performance and real engineering performance becomes unmistakable.
Syntax Is Easy. Systems Are Hard.
This difference reflects a fundamental property of LLMs. They’re extremely good at recognizing patterns.
That makes them very effective at tasks where they can infer the solution from local context, like completing code snippets or generating small functions. But many engineering tasks demand reasoning beyond the local context of a few lines of code.
Refactoring a production system often involves understanding how components interact across files, how dependencies propagate through the architecture, and how changes will affect maintainability over time.
These problems require models to reason about systems rather than patterns. That distinction explains why AI can perform highly on syntactic tasks but struggle with deeper engineering work.
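To make the distinction concrete: even a trivial rename reaches beyond its local context. A minimal sketch (the file contents and names below are invented for illustration) of why a refactoring tool must find cross-file call sites before touching a function:

```python
import ast

# A toy two-file "codebase": the function lives in one file,
# but a call site lives in another.
files = {
    "billing.py": "def apply_discount(total):\n    return total * 0.9\n",
    "checkout.py": (
        "from billing import apply_discount\n\n"
        "def finalize(cart_total):\n"
        "    return apply_discount(cart_total)\n"
    ),
}

def call_sites(source, name):
    """Count direct calls to `name` in one file's source."""
    return sum(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
        for node in ast.walk(ast.parse(source))
    )

# Safely renaming apply_discount requires knowing about checkout.py too --
# information that never appears in billing.py's local context.
usages = {path: call_sites(src, "apply_discount") for path, src in files.items()}
```

Local pattern matching sees only `billing.py`; system-level reasoning has to account for `checkout.py` as well.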
Why This Matters For Engineering Leaders
None of this means AI coding tools aren’t useful – they clearly provide value in many development workflows. But the current industry conversation often assumes that benchmark performance is a reliable indicator of real-world engineering capability. For organizations deploying AI coding tools across hundreds or thousands of developers, that assumption matters.
If leadership teams expect AI systems to perform at benchmark-level success rates across production environments, they may overestimate the productivity impact and underestimate the operational risks.
The more useful question is, “Where in our engineering workflow does AI reliably work?”
Some tasks are well suited to AI assistance. Others are far more challenging. Understanding (and addressing) that boundary is becoming one of the most important issues in AI-assisted software engineering.
Going Beyond Benchmarks: A Better Way to Evaluate LLMs
AI-assisted development will almost certainly remain part of the engineering toolkit. But as adoption scales, the conversation around these tools needs to evolve beyond benchmark scores and product demonstrations.
Teams need evidence about how AI behaves inside real systems:
- how it performs on maintenance tasks
- how often it introduces regressions
- where it improves productivity
Those questions are harder to answer than benchmark comparisons, yet they’re the ones that determine whether AI delivers long-term value inside large enterprises.
Benchmarks tell us what AI can do in controlled environments. Real-world engineering tells us what it can do in the systems that actually matter.
See the Latest Research on LLM Coding Benchmarks and Performance
Our latest report analyzes how 57 large language models perform on real maintainability refactoring tasks drawn from production codebases.
Based around our BARE framework, the research provides one of the most comprehensive empirical views yet of how AI coding systems behave in real engineering environments.
Download the full report to get the data.
