Shifting Left on DORA Change Failure Rate: Leading with Maintainability, Not Just Measuring Failure

Abstract

This whitepaper examines the critical role of software maintainability as a leading indicator of change failure risk. Analyzing data from over 2,000 source code repositories across 26 enterprises over a three-year period, we explore how maintainability anti-patterns influence Change Failure Rate (CFR), a key DevOps metric that reflects the percentage of deployments resulting in production incidents. Our analysis finds that maintainability-related issues, such as god classes and high file complexity, are significant predictors of elevated CFR. Conversely, proactive rework and frequent developer interaction with complex code emerge as protective factors. Repositories that exhibit healthier structural code characteristics show low CFR (0–10%) and are associated with deployments involving larger, more substantial changes as measured by Coding Effort, while those that show signs of architectural fragility have high CFR (90–100%) and are associated with a sharp increase in deployment frequency and shrinking deployment scope – suggesting that delivery behavior is driven more by urgency and instability than by strategic optimization. Our predictive model accurately identified deployments at risk of poor reliability, defined as a CFR of 40% or higher per DORA standards, with an F1-score of 0.79, demonstrating strong potential for early risk detection and informed decision-making using Maintainability metrics. For software development leaders, the implication is clear: maintainability is not just a technical hygiene factor but a strategic lever for improving reliability, accelerating delivery, and reducing operational risk. By prioritizing maintainable architecture, refactoring problematic code, and investing in design quality, organizations can build greater trust in their release processes and enable teams to deploy with confidence.

Introduction

Imagine discovering your house has been broken into – not because you had real-time surveillance or proactive threat detection, but because you came home and found the back door swinging open. The damage is done, and you’re only now beginning to respond.

Similarly, in software systems, waiting for CFR or MTTR to reveal problems is akin to learning about theft after it happens. Without proactive indicators – like code maintainability analysis, static code checks, or security scans – you’re blind to the warning signs until users feel the impact.

The motivation for this study arises from the need to bridge this gap, to find proactive metrics that can forewarn of stability risks before failures happen.

Background

Medium and large organizations increasingly depend on complex, distributed systems to deliver key services and products, yet the escalating complexity of these systems presents significant challenges to maintaining their availability, performance, and reliability. In the DevOps era, organizations increasingly rely on data-driven metrics to track these attributes and improve performance. The DevOps Research and Assessment (DORA) framework introduced four key metrics for delivery performance – among them Change Failure Rate (CFR) and Failed Deployment Recovery Time / Mean Time To Recovery (MTTR), which specifically measure system stability. CFR captures the percentage of deployments that result in production failures, indicating how often changes introduce incidents. MTTR represents the average time required to restore service after a production incident, reflecting how quickly teams recover from failures. Interest in DORA metrics has grown markedly and become widespread, driven largely by the various State of DevOps Reports published over the years.

However, a key challenge is that such stability metrics are inherently lagging indicators – they only manifest after failures have occurred. Relying solely on CFR and MTTR means teams learn about stability issues reactively, often after users are already impacted. This retrospective nature is a known limitation of DORA metrics as leading indicators. In practice, CFR and MTTR do not account for all upstream factors that contribute to stable software delivery. For instance, DORA metrics do not explicitly capture code quality or complexity, which can be critical precursors to failures. High-performing teams may achieve excellent CFR/MTTR scores yet still harbor latent problems in code maintainability that eventually erode stability.

Maintainability metrics offer a promising solution to this challenge. Software maintainability is the ease with which a system can be modified to fix defects, improve functionality, or adapt to a changed environment. It is an internal quality attribute encompassing factors like code complexity, modularity, readability, and technical debt. Intuitively, a more maintainable codebase should be less prone to faults and faster to repair, suggesting a potential link to stability outcomes. For example, a codebase rife with poor structure and tight coupling (signs of low maintainability) makes even minor changes risky, increasing the likelihood of system instability. Recent industry approaches leverage maintainability measures as part of a broader analytics stack to gain early warning of trouble spots.

Notably, BlueOptima’s maintainability metric is a static-analysis based measure of code quality that quantifies how easily an organization’s code can be maintained. BlueOptima’s Developer Analytics platform provides such maintainability and code complexity metrics, allowing organizations to pinpoint critical issues before they become long-term liabilities.

Research Scope and Objectives

This study compares BlueOptima’s Maintainability metrics with the primary DORA stability metric – Change Failure Rate (CFR) – as a measure of post-release stability of software products. The core objective is:

Maintainability as a Leading Indicator: Assess the relationship between maintainability and post-release failures by analyzing correlations between BlueOptima Maintainability anti-patterns and DORA Change Failure Rate.

By addressing this objective, the research aims to translate maintainability measurements into actionable insights that engineering teams can use to improve downstream DevOps metrics such as DORA Change Failure Rate, allowing them to preempt failures and improve the stability of the resulting software products.

DORA Metrics Overview

This research describes an objectively measured DORA CFR metric computed on our global benchmark dataset, with the metrics measured consistently using objective data extracted directly from version control systems. This capability allows software engineering teams to generate DORA metrics by integrating directly with their VCS, eliminating the need to invest time and resources in manually analyzing data or conducting enterprise-wide surveys. Moreover, this approach enables direct comparison across enterprises and organisations.

In order to systematically measure DORA metrics objectively the following approach has been taken:

  1. When referring to changes being deployed to production, we consider any pull request merged to the main, master or release branch as a production deployment, because the main branch is the primary branch where the stable and production-ready version of the codebase is maintained (a minimal sketch of this rule appears after this list).
  2. We assume that once changes are merged into the production release branch, they are immediately deployed to production through CI/CD pipelines.
  3. If no CI/CD pipelines are configured, DORA metrics will only reflect the time frame up to when code changes reach the production branch.
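
To make these rules concrete, below is a minimal sketch assuming a hypothetical list of merged pull requests retrieved from a version control system; the field names (target_branch, is_hotfix, merged_at) are illustrative placeholders rather than BlueOptima’s actual schema. It treats merges into main, master or release as production deployments and computes CFR as the share of those deployments that were hotfixes.

```python
from dataclasses import dataclass
from datetime import datetime

# Branches whose merges are treated as production deployments (rule 1 above).
PRODUCTION_BRANCHES = {"main", "master", "release"}

@dataclass
class MergedPullRequest:
    target_branch: str     # branch the pull request was merged into
    is_hotfix: bool        # True if the PR remediates a production incident
    merged_at: datetime    # merge timestamp, treated as the deployment time

def change_failure_rate(prs: list[MergedPullRequest]) -> float | None:
    """CFR = hotfix deployments / total production deployments."""
    deployments = [pr for pr in prs if pr.target_branch in PRODUCTION_BRANCHES]
    if not deployments:
        return None  # no production deployments in the observation window
    failures = sum(pr.is_hotfix for pr in deployments)
    return failures / len(deployments)
```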

The lack of consistent, universal deployment data – a consequence of uneven adoption of standardised CI/CD technologies – is an industry-wide limitation. DORA relies on a survey-based approach to collate measures across enterprises and organisations. Google’s DORA team identified a significant challenge when building the Four Keys metrics pipeline, noting that: “One of the challenges of gathering these DORA metrics is that deployment, change, and incident data are usually in different disparate systems” across teams.

The approach taken in this research, and the deployment assumption inherent within it, provides a standardized way of measuring DORA metrics without relying on the far less reliable approach of collating survey responses. In heterogeneous pipeline environments, using merges to the production branch as the key deployment event mitigates inconsistencies introduced by varying CI/CD pipeline implementations. When such merges serve as a close proxy for actual releases, this approach yields comparatively reliable DORA metric measurements across diverse teams.

Objective DORA Metric Implementation

Methodology

The various measures of source code maintainability and DORA metrics were drawn from BlueOptima’s Global Benchmark, a large-scale corpus of software repositories maintained by organisations undertaking software development in a commercial enterprise setting.

Data Collection and Preparation

Data was gathered across these enterprise software development organizations using BlueOptima’s Integrator technology which is deployed on each enterprise’s network and so does not require those enterprises to share their source code.

The analysis covered 2,266 repositories across 26 enterprises over a three-year period, providing a comprehensive view of maintainability and DORA metrics in enterprise software development environments. In our analysis, we precisely identify the pull requests associated with hotfixes addressing production issues – events that constitute change failures under DORA metrics. To investigate the relationship between Maintainability and the DORA measures of ultimate software product stability, we focus specifically on the source files modified in these hotfix commits.

By focusing exclusively on problematic files – those implicated in resolving production incidents – we test the hypothesis that poor maintainability decisions contribute to system instability. In essence, we’re exploring whether a decline in maintainability precedes increased rates of deployment failure.

According to the 2024 DORA report, low-performing teams exhibit a significantly higher Change Failure Rate (CFR) compared to their higher-performing counterparts. The Change Failure Rate (CFR) for low-performing teams is reported to be 40%. This indicates that two out of every five deployments by these teams result in failures requiring remediation, such as hotfixes, rollbacks, or patches.

We will use the 40% Change Failure Rate (CFR) as a threshold to assess whether maintainability metrics can help predict future low performance as defined by DORA. This approach aims to provide leading indicators to proactively prevent team underperformance.

To perform this analysis, we constructed a dataset of monthly observations across a sample of the BlueOptima Global Benchmark source code repository corpus. Each record represents a single repository during a specific month. Our dataset contains maintainability data for changes that resulted in change failures.
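
As an illustration of this repository-month construction, the sketch below aggregates a hypothetical flat table of merged pull requests into one record per repository per month; the column names are illustrative, not the Global Benchmark schema.

```python
import pandas as pd

# Hypothetical flat table of merged pull requests (one row per merge to production).
prs = pd.DataFrame({
    "repository": ["repo_a", "repo_a", "repo_b"],
    "merged_at": pd.to_datetime(["2024-01-03", "2024-01-20", "2024-01-15"]),
    "is_hotfix": [False, True, False],
})

# Aggregate to one record per repository per month: deployment count,
# change-failure count, and the resulting monthly CFR.
monthly = (
    prs.assign(month=prs["merged_at"].dt.to_period("M"))
       .groupby(["repository", "month"])
       .agg(deployments=("is_hotfix", "size"),
            change_failures=("is_hotfix", "sum"))
       .reset_index()
)
monthly["cfr"] = monthly["change_failures"] / monthly["deployments"]
```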

Dataset Fields and Descriptions

The antipattern scores (i.e. variables prefixed with ap_) indicate how prevalent the associated maintainability design issues are in problematic files within the source code repositories. A higher antipattern score signifies a greater presence of the corresponding maintainability issue.

The boolean flags indicate whether fixes were made by the code’s original author (which can influence fix efficiency) and whether multiple developers contributed (which might affect consistency).

The Maintainability variable is BlueOptima’s composite maintainability metric, which is higher when code is easier to maintain and lower when numerous maintainability anti-patterns are present. In each month, we also record how many files improved or worsened in maintainability to capture codebase evolution.
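
Since the individual field names are not enumerated here, the hypothetical record below only illustrates the shape of a repository-month observation – ap_-prefixed antipattern scores, the two boolean flags, the composite Maintainability score, and the monthly counts of files improving or worsening. All names and values are placeholders, not the actual dataset schema.

```python
# Hypothetical repository-month record; every key and value is illustrative.
example_record = {
    "repository": "repo_a",
    "month": "2024-01",
    # Antipattern prevalence scores for the problematic (hotfix-touched) files
    "ap_god_class": 0.42,
    "ap_high_file_complexity": 0.31,
    # Boolean flags describing who performed the fixes
    "fixed_by_original_author": True,
    "multiple_developers_involved": False,
    # Composite maintainability (higher = easier to maintain) and evolution counts
    "maintainability": 0.64,
    "files_improved": 12,
    "files_worsened": 5,
}
```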

Interaction Terms Definitions

Results

Statistical Analysis

We implemented a logistic regression model using a 40% Change Failure Rate as the threshold for low performance, based on reporting in the 2024 DORA report. In our binary classification setup, we define class 1 as deployments with a change failure rate (CFR) of 40% or higher, signaling lower stability. Deployments with CFR below 40% are labeled as class 0.

Our model achieved an F1-score of 0.79 in predicting deployments with a Change Failure Rate (CFR) of 40% or higher, which corresponds to low performance as defined by the DORA metrics.
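
The sketch below shows the general shape of this binary classification setup in scikit-learn, using synthetic data in place of the study’s actual repository-month features and labels; it is illustrative only and does not reproduce the reported result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-ins for the maintainability features (antipattern scores,
# boolean flags, interaction terms).
X = rng.normal(size=(n, 5))
# Synthetic labels: class 1 = repository-month with CFR >= 40% (low performance).
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n) > 0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("F1:", round(f1_score(y_test, model.predict(X_test)), 2))
```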

These findings suggest that if the maintainability patterns in the codebase remain unchanged in a repository experiencing a CFR above 40%, it is highly likely that future deployments from that repository will continue to experience a CFR above 40% – classifying them as low performing.

Features That Increase Change Failure Rate

Each coefficient in logistic regression represents the impact of a 1-unit increase in that feature on the log-odds of the positive class – in our case, the odds of a deployment being high CFR (i.e., unstable, low-performance).
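
Written out, with P(high CFR) denoting the model’s predicted probability of the positive class, the fitted model and the resulting odds-ratio reading of each coefficient are:

```latex
\log\frac{P(\text{high CFR})}{1 - P(\text{high CFR})}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
\qquad\Longrightarrow\qquad
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
```

A positive coefficient therefore multiplies the odds of a high-CFR deployment for each unit increase in the feature, while a negative coefficient shrinks them.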

Features That Reduce Change Failure Rate

Rework refers to the percentage of changes related to bug fixes that are deployed directly to production. Deployments are defined as the number of times code is merged into the release or production branch.

When developers proactively resolve user-facing bugs in production, the need for hotfixes decreases, which in turn significantly lowers the overall change failure rate. Conversely, when developers rarely interact with codebases that exhibit maintainability design issues, the likelihood of change failure increases significantly when changes are eventually required.

Relationship Between CFR, Deployments, and Deployment Size

An analysis was conducted to examine the relationship between change failure rate, average coding effort to resolve production issues, and production release frequency.

The data reveals clear patterns: in high-stability environments (0–10% CFR), fewer deployments were made, and fixes tended to involve larger, more substantial code changes. In contrast, the 90–100% CFR range exhibited the highest deployment frequency (39.6 releases) and the smallest average deployment size (2.8 coding hours). This reflects a shift toward resolving issues through more frequent, smaller updates in highly unstable environments.
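
A minimal sketch of this banding analysis, using synthetic repository-month records in place of the benchmark data (cfr, deployments, and coding_hours are illustrative stand-ins for the measured values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic repository-month records standing in for the benchmark data.
df = pd.DataFrame({
    "cfr": rng.uniform(0, 1, n),
    "deployments": rng.poisson(8, n),
    "coding_hours": rng.gamma(3.0, 2.0, n),
})

# Band CFR into 10-point buckets (0-10%, 10-20%, ..., 90-100%) and compare
# average release frequency and average coding effort per fix across bands.
df["cfr_band"] = pd.cut(df["cfr"], bins=np.linspace(0, 1, 11), include_lowest=True)
summary = df.groupby("cfr_band", observed=True).agg(
    avg_releases=("deployments", "mean"),
    avg_coding_hours=("coding_hours", "mean"),
)
print(summary)
```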

Discussion

The empirical results strongly support maintainability metrics as leading indicators of software stability, in contrast to the inherently lagging DORA Change Failure Rate metric.

Why is Maintainability More Actionable?

Maintainability reflects how easily a codebase can be understood, modified, and extended over time. A maintainable codebase features modular design and low complexity, reducing the likelihood of introducing defects. In contrast, CFR measures outcomes after changes are deployed and does not directly convey the cause of failure. Maintainability acts like a “canary in the coal mine”: a drop in score signals danger before an incident manifests.

Maintainability is more actionable because it points to areas of improvement under the team’s control right now. If scores are declining, engineers can investigate for trouble spots – like modules with high complexity – and refactor or add tests proactively.

Our findings indicate that maintainability-related design anti-patterns are strong predictors of higher CFR. In systems where such anti-patterns persist, the most effective strategies involve either ongoing rework or frequent developer interaction. When neglected, these areas degrade silently, increasing the likelihood of failure when change is eventually required.

Features Increasing CFR Risk - Deep Dive

Features Reducing CFR Risk - Deep Dive

Delivery Behavior and System Stability

As production stability deteriorates, deployment behavior shifts toward smaller, more frequent fixes. In the 90–100% band, the steep rise in frequency (39.6 releases) paired with low coding effort (2.8 hours) indicates reactive delivery behavior driven by urgency rather than optimization.

The inflection point in the 50–60% CFR range is noteworthy: despite elevated failure rates, deployment size rises to its peak (6.6 hours) while frequency drops to its lowest (4.0 releases). This may reflect an intentional effort to regain control through more substantive fixes before further escalation leads to high-frequency, low-effort changes.

Robustness and Reliability of CFR and Maintainability

Despite its value, CFR can be misinterpreted or gamed if teams fixate on the metric itself.

Maintainability is generally less susceptible to gaming because improving it requires actual work on the code – such as breaking up complex functions or refactoring modules – that aligns with good engineering practice. The only way to "game" maintainability is to actually improve the code.

Conclusion

This study reinforces the critical relationship between software maintainability, structural design quality, and change failure rates. Maintainability-related anti-patterns are strong predictors of high CFR, and their presence can significantly compromise delivery stability.

Proactive rework and frequent developer interaction with maintainability-challenged codebases are key behaviors of high-performing teams. Maintainability is a forward-looking indicator that affirms stability and speed are co-dependent outcomes rooted in healthy engineering practices. Organizations seeking to improve CFR should prioritize maintainability alongside process and automation maturity.

Recommendations For Software Development Executives

  1. Treat Maintainability as a Strategic Engineering Metric: Integrate BlueOptima Maintainability into engineering KPIs and require periodic reviews in critical systems.
  2. Systematically Identify and Refactor Design Anti-Patterns: Leverage automated detection of issues like god classes and create engineering capacity (tech debt budgets) to refactor them incrementally.
  3. Incentivize Rework in High-Risk Code Areas: Encourage teams to prioritize rework and cleanup in modules associated with frequent failures.
  4. Pair CFR Metrics with Maintainability Trends for Insightful Risk Monitoring: Combine reactive and leading indicators for complete visibility into delivery health.