BlueOptima report: LLMs achieve <23% success in real-world code refactoring. JavaScript outperforms C by 8.6x, revealing a major syntax-semantics gap

Benchmarking the Real-World Coding Performance of LLMs: Introducing BARE



Abstract

The practical value of large language models (LLMs) for enterprise software maintenance remains uncertain despite strong performance on popular coding benchmarks. We introduce the BlueOptima AI Refactoring Evaluation (BARE), which benchmarks 57 LLMs on maintainability-oriented refactoring tasks drawn from 4,276 real source code files spanning nine programming languages (C, C++, C#, Go, Java, JavaScript, PHP, Python, TypeScript), yielding 243,732 model-file evaluation pairs.

The results reveal a substantial gap between benchmark-style coding performance and performance on realistic refactoring tasks. Even frontier models achieve overall success rates below 23%. While models perform well on syntactic and structural checks (typically exceeding 80%), success rates plummet when refactorings must also improve maintainability without impairing other aspects of the code.

Introduction

LLMs have fundamentally transformed software development practices, with GitHub Copilot serving over 20 million developers. However, the benchmarks commonly cited to justify investment in LLM-assisted development have been shown to be saturated, contaminated, or structurally misaligned with enterprise software engineering.

The Benchmark-Reality Divide

The Maintenance Imperative

Software maintenance dominates the total cost of ownership, consuming 60-90% of lifecycle expenditure. Technical debt represents 20-40% of an organization’s technology estate value. Low-quality code harbors 15 times more defects than high-quality code and demands 124% more development time to resolve issues.

Method

Objective

The primary aim is to assess the ability of 57 LLMs to successfully refactor source code files exhibiting maintainability issues identified through BlueOptima’s HowToFix (HTF) service.

Maintainability Issues Description

LLM Selection

Models were selected based on practical availability and economic viability (cost ceiling of $90 per million tokens).

Validation Pipeline

Only solutions passing all seven sequential checks qualify as successful:

  1. Syntax Validation: Code must parse and compile.
  2. Function Signature Integrity: Preserves names, parameters, and return types.
  3. FLART Assessment: Measurable improvement in maintainability.
  4. Import Statement Verification: No omitted dependencies.
  5. Unfinished Logic Detection: Rejects placeholder comments or incomplete logic.
  6. Comment Preservation: Retention of meaningful documentation.
  7. Variable Access Integrity: No references before assignment.
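The sequential, fail-fast structure of the pipeline can be sketched as follows. This is an illustrative sketch only: the function names and the three simplified gates (syntax, signature integrity, unfinished-logic detection, shown here for Python source via the standard `ast` module) are hypothetical stand-ins, not BlueOptima's actual implementation of the seven checks.

```python
import ast

def check_syntax(original: str, refactored: str) -> bool:
    """Gate 1 (illustrative): the refactored code must still parse."""
    try:
        ast.parse(refactored)
        return True
    except SyntaxError:
        return False

def check_signatures(original: str, refactored: str) -> bool:
    """Gate 2 (illustrative): every original function keeps its
    name and parameter list in the refactored version."""
    def signatures(src: str) -> dict:
        return {
            node.name: [a.arg for a in node.args.args]
            for node in ast.walk(ast.parse(src))
            if isinstance(node, ast.FunctionDef)
        }
    before, after = signatures(original), signatures(refactored)
    return all(after.get(name) == params for name, params in before.items())

def check_no_placeholders(original: str, refactored: str) -> bool:
    """Gate 5 (illustrative): reject obvious unfinished-logic markers."""
    return not any(marker in refactored for marker in ("TODO", "FIXME"))

def validate(original: str, refactored: str) -> tuple:
    """Run the gates in order; a candidate fails at the first gate it misses,
    mirroring the all-seven-must-pass rule described above."""
    gates = [
        ("syntax", check_syntax),
        ("signatures", check_signatures),
        ("placeholders", check_no_placeholders),
    ]
    for name, gate in gates:
        if not gate(original, refactored):
            return (False, name)
    return (True, "passed")

before = "def add(a, b):\n    return a + b\n"
good = "def add(a, b):\n    # sum the operands\n    return a + b\n"
bad = "def add(a):\n    return a\n"
print(validate(before, good))  # (True, 'passed')
print(validate(before, bad))   # (False, 'signatures')
```

Ordering the gates matters: cheap structural checks run first, so later semantic gates can assume a parseable candidate.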

Results

Infrastructure and Generation Reliability

API-related failures were rare (less than 1.5%), indicating mature cloud infrastructure.

Model-Related Errors

Model-level failures were more frequent than infrastructure failures and were driven primarily by token limits.

Performance by Language

Refactoring success varies dramatically across programming languages, with JavaScript success rates exceeding those for C by a factor of 8.6.

Success by Issue Type

LLMs are more effective at localized transformations than at architectural restructuring.

Discussion

The Syntax-Semantics Divide

The divergence between syntactic performance (>85%) and maintainability improvement (25-50%) suggests that LLMs operate as sophisticated pattern matchers rather than logical reasoners. They lack the graph-like comprehension required for architectural reasoning.
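The divide can be illustrated with a small, hypothetical example (the functions and the flagged issue below are constructed for illustration, not drawn from the BARE corpus). Both rewrites parse, preserve the signature, and preserve behavior, so both clear the structural gates; only the second addresses the maintainability issue itself.

```python
# Original: nested conditionals, flagged (hypothetically) for deep nesting.
def discount(order):
    if order is not None:
        if order["total"] > 100:
            if order["member"]:
                return 0.15
            else:
                return 0.10
        else:
            return 0.0
    return 0.0

# Pattern-level "refactoring": a comment is added, but the nesting --
# the actual maintainability issue -- survives untouched. Syntactically
# valid, structurally intact, semantically unimproved.
def discount_cosmetic(order):
    # Compute the discount rate for an order.
    if order is not None:
        if order["total"] > 100:
            if order["member"]:
                return 0.15
            else:
                return 0.10
        else:
            return 0.0
    return 0.0

# Semantic refactoring: guard clauses eliminate the nesting entirely
# while preserving behavior for every input.
def discount_flat(order):
    if order is None or order["total"] <= 100:
        return 0.0
    return 0.15 if order["member"] else 0.10
```

A validator that stops at syntax and signature checks scores both rewrites identically; only a maintainability measure such as the FLART gate separates them, which is exactly where the benchmark shows success rates collapsing.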

Economic Implications

The hidden costs of failure can reverse naive procurement decisions: a failed refactoring attempt still consumes developer time to review, diagnose, and discard.

Conclusion

LLM refactoring capability is approaching an asymptotic ceiling of approximately 20.8% for cloud-scale models. For practitioners, this means LLMs should be deployed as supervised tools within validation-heavy workflows rather than autonomous agents. Mandatory expert review remains essential for all refactorings intended for production.