AI Coding Performance Depends on Your Tech Stack
AI coding performance varies significantly across programming languages, with success rates differing by more than 8×. See where AI coding tools work best, and where they struggle in real systems.
AI coding tools feel inconsistent.
In some teams, they noticeably speed up development. In others, they introduce more review effort than they save. The difference is often attributed to “prompt quality” or “developer skill.”
But there’s a more fundamental factor that gets overlooked:
The programming language and system you’re working in have a major impact on how well AI performs.
Our recent analysis of LLMs used alone on one-shot real-world refactoring tasks shows that success rates can vary by 8.6× depending on the language.
That gap helps explain why experiences with AI coding tools vary so widely across teams.
AI Coding Performance Isn’t Uniform Across Languages
When models are evaluated on real-world code, where changes must preserve behavior and improve maintainability, the differences are stark:
JavaScript: ~32% success; C: ~3–4% success, a gap of roughly 8.6×.
This reflects systematic differences in how models handle different environments.

At a high level, LLM coding tools perform better when:
- Code is loosely structured
- Context is easier to infer locally
- Dependencies are relatively shallow
They struggle when:
- Changes affect multiple layers of a system
- Type constraints and memory handling are more important
- Small mistakes have wider system-level consequences
The same model can perform well in one language and fail in another.
Why this Happens
Most AI coding tools are strong at pattern recognition. They can generate syntactically correct code and follow familiar structures with high reliability.
That works well in environments where problems are scoped locally, code patterns are widely represented in training data, and the cost of a mistake is low.
It's less effective when working with systems that require precise control over behavior or careful handling of edge cases and side effects.
Lower-level languages and complex backend systems tend to quickly expose these limitations. The model can produce code that looks correct but fails when integrated into the wider system.
What this Means for Your Team
If you’re evaluating or scaling AI coding tools, the key question to ask is “Where in our stack will this reliably work?”
The easiest way to see this is to look at how the same tool plays out in different teams.
Take a frontend team working mostly in JavaScript. They start using an AI coding assistant and quickly find a rhythm. Generating components, wiring up API calls, handling common patterns – most of the output is usable with minimal changes. Code reviews move faster because there’s less to fix. The tool becomes something they rely on for day-to-day work.
Now compare that to a team working on a C++ service or a tightly coupled backend system.
They try the same tool with the same expectations. At first, it looks promising: the code compiles, the structure seems reasonable. But once they begin integrating those changes, problems surface. Edge cases aren’t handled correctly. Small mistakes propagate into larger issues. Review cycles get longer, not shorter, because every change needs careful validation.
Over time, the two teams come to very different conclusions about the same technology. One sees clear productivity gains. The other is more cautious, using it sparingly or not at all.
Neither experience is wrong. They’re just operating in different parts of the performance curve.
It’s also important to interpret these results in context.
Agent-based systems, tool integrations, and multi-step workflows can improve outcomes, but they also introduce additional complexity, infrastructure, and variability.
Although agent-based workflows are evolving quickly, most teams today still rely primarily on direct LLM interaction inside developer workflows.
Three Things to Do
1. Apply Coding Tools Selectively
A useful starting point is to look at where your engineering work sits along two dimensions:
- How local the change is
- How sensitive the system is to errors
Tasks that stay within a single file or component are far more predictable. Tasks that touch multiple services, shared interfaces, or critical logic are not.
In practical terms, this often leads to patterns like:
- AI is used heavily for writing and modifying self-contained code
- It’s used more cautiously for changes that affect system behavior
- It’s avoided or tightly controlled in areas where correctness is critical
→ Match the tool to the type of work.
2. Don’t Assume Gains Will Generalize
One of the easiest mistakes to make is to observe success in one part of the system and assume it will translate everywhere.
For example:
- Strong results in frontend development don’t necessarily carry over to backend services
- Gains in greenfield code don’t always apply to legacy systems
- Improvements in one language don’t predict performance in another
Before scaling usage, it’s worth validating performance across:
- Different languages in your stack
- Different types of tasks (feature work vs refactoring vs maintenance)
- Different parts of the system (isolated vs highly coupled)
→ Get a much clearer picture of where AI is adding value and where it isn’t.
3. Plan for Uneven Adoption
The variation in performance also has implications for how teams adopt AI.
Instead of expecting uniform productivity gains, it’s more realistic to expect:
- Some teams benefiting significantly
- Others seeing marginal improvements
- A few encountering more friction than benefit
This affects:
- How you measure impact
- Where you invest in tooling and training
- How you set expectations with leadership
→ Treat AI performance as uneven but predictable.
The Gap Isn’t Disappearing Quickly
There’s a tendency to assume that these differences will fade as models improve.
But current evidence suggests that different models already perform within a relatively narrow range of each other on complex refactoring tasks, so the language gap is unlikely to close simply through model improvements.
That makes this less of a temporary limitation and more of a structural constraint, at least in the near term.
In practice, that means teams need to design workflows that account for variability, rather than waiting for it to disappear.
A More Useful Way to Think About AI Coding Performance
LLM coding tools are often discussed as if they provide a consistent layer of acceleration across development.
In reality, they behave more like a tool with high variance depending on context.
Understanding that variance (especially across languages and system types) is what allows teams to use them effectively.
Download the full report for the data behind these findings.
