BlueOptima report: LLMs achieve <23% success in real-world code refactoring. JavaScript outperforms C by 8.6x, revealing a major syntax-semantics gap
The practical value of large language models (LLMs) for enterprise software maintenance remains uncertain despite strong performance on popular coding benchmarks. We introduce the BlueOptima AI Refactoring Evaluation (BARE), which benchmarks 57 LLMs on maintainability-oriented refactoring tasks drawn from 4,276 real source code files spanning nine programming languages (C, C++, C#, Go, Java, JavaScript, PHP, Python, TypeScript), yielding 243,732 model-file evaluation pairs.
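The scale follows directly from a full cross-product of models and files; a minimal sketch (identifiers are hypothetical placeholders) confirms the arithmetic:

```python
from itertools import product

# Full evaluation grid: every model attempts every file. The identifiers
# are hypothetical placeholders; the counts are the report's.
models = [f"model_{i}" for i in range(57)]
files = [f"file_{j}" for j in range(4_276)]
pairs = list(product(models, files))
assert len(pairs) == 243_732  # 57 x 4,276 model-file evaluation pairs
```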
The results reveal a substantial gap between benchmark-style coding performance and performance on realistic refactoring tasks. Even frontier models achieve overall success rates below 23%. While models perform well on syntactic and structural checks (typically exceeding 80%), success rates plummet when refactorings must also improve maintainability without impairing other aspects of the code.
LLMs have fundamentally transformed software development practices, with GitHub Copilot serving over 20 million developers. However, the benchmarks commonly cited to justify investment in LLM-assisted development have been shown to be saturated, contaminated, or structurally misaligned with enterprise software engineering.
Software maintenance dominates the total cost of ownership, consuming 60-90% of lifecycle expenditure. Technical debt represents 20-40% of an organization’s technology estate value. Low-quality code harbors 15 times more defects than high-quality code and demands 124% more development time to resolve issues.
The primary aim is to assess the ability of 57 LLMs to successfully refactor source code files exhibiting maintainability issues identified through BlueOptima’s HowToFix (HTF) service.
Models were selected based on practical availability and economic viability (cost ceiling of $90 per million tokens).

Only solutions passing all seven sequential checks qualify as successful.
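A minimal sketch of a fail-fast gate of this shape follows; the stage names are illustrative placeholders, not the report's actual seven checks:

```python
# Fail-fast sequential gate: a solution must clear every stage in order.
# Stage names are illustrative placeholders, not the report's actual checks.
def run_checks(solution: dict) -> tuple[bool, str | None]:
    checks = [
        ("response_parsed",      lambda s: s.get("code") is not None),
        ("code_compiles",        lambda s: s.get("compiles", False)),
        ("tests_pass",           lambda s: s.get("tests_pass", False)),
        ("structure_preserved",  lambda s: s.get("structure_ok", False)),
        ("maintainability_gain", lambda s: s.get("issues_after", 0) < s.get("issues_before", 0)),
        ("no_new_issues",        lambda s: not s.get("new_issues", [])),
        ("style_consistent",     lambda s: s.get("style_ok", False)),
    ]
    for name, check in checks:
        if not check(solution):
            return False, name  # fail fast: later checks never run
    return True, None
```

Sequential ordering matters for attribution: a solution that fails compilation is never scored for maintainability, so each failure maps to exactly one check.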
API-related failures were rare (less than 1.5%), indicating mature cloud infrastructure.
Model-level failures were more frequent, driven primarily by token limits.
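In an OpenAI-style chat completions API (an assumption; the report's harness is not shown here), a token-limit truncation is reported as finish_reason == "length", which lets it be separated from genuine API failures:

```python
# Classify a completion response (OpenAI-style response shape assumed).
def classify_response(response: dict) -> str:
    choice = response["choices"][0]
    if choice.get("finish_reason") == "length":
        return "model failure: output truncated at the token limit"
    if not (choice.get("message", {}).get("content") or "").strip():
        return "model failure: empty completion"
    return "ok"
```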
Refactoring success varies dramatically across programming languages.
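A minimal sketch of the per-language aggregation (the record shape is an assumption, not the report's data format):

```python
from collections import Counter

def success_by_language(results):
    """results: iterable of (language, succeeded) pairs -- a hypothetical
    record shape for illustration."""
    attempts, successes = Counter(), Counter()
    for lang, ok in results:
        attempts[lang] += 1
        successes[lang] += int(ok)
    return {lang: successes[lang] / attempts[lang] for lang in attempts}

# rates = success_by_language(records)
# rates["JavaScript"] / rates["C"] reproduces the 8.6x headline ratio
```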

LLMs are more effective at localized transformations than architectural restructuring.
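For illustration only (this example is not drawn from the BARE corpus), here is the kind of localized, single-function transformation models handle well, in contrast to architectural restructuring, which spans modules and cannot be shown in a few lines:

```python
# Localized refactoring: collapse nested conditionals into guard clauses.
# Same behavior, flatter structure, easier to maintain.

def ship_order(order) -> str | None:  # before: three levels of nesting
    if order is not None:
        if order["paid"]:
            if order["items"]:
                return f"dispatched {len(order['items'])} items"
    return None

def ship_order_refactored(order) -> str | None:  # after: flat guard clause
    if order is None or not order["paid"] or not order["items"]:
        return None
    return f"dispatched {len(order['items'])} items"
```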
The divergence between syntactic performance (>85%) and maintainability improvement (25-50%) suggests that LLMs operate as sophisticated pattern matchers rather than logical reasoners. They lack the graph-like comprehension required for architectural reasoning.
Hidden costs of failure can reverse naive procurement decisions. A failed refactoring attempt still consumes developer time to review, diagnose, and discard, so a cheaper model with a lower success rate can end up costing more per successful refactoring than a pricier, more accurate one.
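A back-of-the-envelope comparison with illustrative numbers (not the report's figures) shows the reversal:

```python
# Effective cost per *successful* refactoring: every attempt pays both the
# API bill and the human review cost, but only success_rate attempts land.
def cost_per_success(api_cost: float, review_cost: float, success_rate: float) -> float:
    return (api_cost + review_cost) / success_rate

cheap  = cost_per_success(api_cost=0.05, review_cost=15.0, success_rate=0.10)
strong = cost_per_success(api_cost=0.50, review_cost=15.0, success_rate=0.20)
print(f"{cheap:.2f} vs {strong:.2f}")  # 150.50 vs 77.50: the 'cheap' model costs ~2x more
```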
LLM refactoring capability is approaching an asymptotic ceiling of approximately 20.8% for cloud-scale models. For practitioners, this means LLMs should be deployed as supervised tools within validation-heavy workflows rather than autonomous agents. Mandatory expert review remains essential for all refactorings intended for production.