AI-Era Software Incident Management Needs to Move Upstream


Most software incidents don't begin in production.

They begin earlier: in the pull request that takes too long to close, the file nobody wants to touch, the repository that gets a little harder to change every quarter, the review where everyone senses risk but nobody can point to it.

Most teams are good at responding once something breaks, but by the time an incident hits production, the cost has already started climbing.

We analyzed more than 666,000 source code revisions across 13 enterprises, linking incident records to the code changes used to resolve them. The pattern was striking: incidents cluster where poor maintainability, repository fragility, and delivery friction overlap.

For engineering leaders, that means the real work of incident management starts upstream: spotting the conditions that make incidents more likely, harder to fix, and slower to close, before they show up in the queue.

The bottleneck isn't writing the fix any more

AI has made code production faster. The pressure now moves downstream into review, validation, and a growing suspicion that the codebase is getting harder to trust.

PR Time to Close was the strongest overall risk driver in our data, carrying a 1.34x risk multiplier. A slow-moving PR isn't automatically dangerous. But at scale, it's a signal.

Maintainability dictates recovery time

When an incident hits the least maintainable quartile of a codebase, the median PR time to resolve is 65.2 hours. In the most maintainable quartile, it's 1.7 hours.

That's a roughly 38x gap. If your incident dashboard only tells you what happened during the outage, it's missing the part that determines how long you're stuck in it.

Risk sits at the repo level

A careful, well-reviewed change can still land in a fragile repository and inherit its risk. Good engineers end up working inside code that's already hard to reason about.

Repository Lifetime Maintainability Deficit was one of the strongest maintainability-related risk drivers we found, at 1.24x. It's a good argument for keeping a watchlist: repositories with declining maintainability, slow PRs, aging files, or narrow ownership warrant more scrutiny before something breaks, not after.
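One way to picture such a watchlist is the minimal Python sketch below. It is an illustration, not the method behind the analysis: the `RepoSnapshot` fields, the threshold values, and the rule of ranking by number of signals are all assumptions you would replace with your own metrics and history.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RepoSnapshot:
    """Hypothetical per-repository metrics; field names are illustrative."""
    name: str
    maintainability_trend: float  # change in score per quarter; negative = declining
    median_pr_hours_open: float   # median time PRs stay open
    median_file_age_days: float   # median age of recently modified files
    active_owners: int            # distinct people who regularly change the repo


# Placeholder thresholds (assumptions); tune them against your own data.
SLOW_PR_HOURS = 72.0
OLD_FILE_DAYS = 730.0
MIN_OWNERS = 2


def watch_reasons(repo: RepoSnapshot) -> List[str]:
    """Return the watchlist signals this repository currently shows."""
    reasons = []
    if repo.maintainability_trend < 0:
        reasons.append("declining maintainability")
    if repo.median_pr_hours_open > SLOW_PR_HOURS:
        reasons.append("slow PRs")
    if repo.median_file_age_days > OLD_FILE_DAYS:
        reasons.append("aging files")
    if repo.active_owners < MIN_OWNERS:
        reasons.append("narrow ownership")
    return reasons


def build_watchlist(repos: List[RepoSnapshot]) -> List[Tuple[str, List[str]]]:
    """List repositories with at least one signal, most signals first."""
    flagged = [(repo.name, watch_reasons(repo)) for repo in repos]
    flagged = [(name, reasons) for name, reasons in flagged if reasons]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)
```

Returning the reasons alongside each repository name keeps the list explainable: reviewers can see why a repository is on it, not just that it is.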

Old files need more care

File age carried a 1.22x risk multiplier. Old code isn't inherently risky – plenty of old files are stable and rarely touched. The risk appears when an aging file needs active modification and the original design intent, dependencies, or owners have moved on.

A useful discipline here is touch-and-clean: when engineers modify an aging file, they clean up what's safe and proportionate on the way through, such as a missing test, a confusing conditional, dead code, or an undocumented behavior. It's not a full refactor, just enough to stop the debt from compounding.

Three changes to make now

Add maintainability checks to CI/CD. Flag changes that reduce maintainability, especially in fragile or critical repositories. The check doesn't need to block every merge; it can simply trigger extra review when risk is higher.
Treat long-running PRs as a risk signal. A PR sitting open past your team's normal threshold (especially in a fragile repository or aging file) deserves a second look before merge.
Bring maintainability into incident reviews. After an incident, ask whether the affected area was already hard to change. If it was, "Reduce what made this bug hard to fix" is your action point.
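The first two changes could be wired into a pipeline along these lines. This is a sketch under stated assumptions rather than a reference implementation: the `PullRequest` record, the three-day window, and the escalation rule are hypothetical, and the maintainability score is assumed to come from whatever analysis tool you already run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class PullRequest:
    """Hypothetical PR record; the fields are illustrative, not a real API."""
    opened_at: datetime
    maintainability_delta: float  # score after minus score before; negative = worse
    repo_is_fragile: bool         # e.g. the repository is on a team watchlist


# Placeholder standing in for "your team's normal threshold" (assumption).
NORMAL_OPEN_WINDOW = timedelta(days=3)


def review_flags(pr: PullRequest, now: datetime) -> List[str]:
    """Collect advisory risk flags for a PR; nothing here blocks a merge."""
    flags = []
    if pr.maintainability_delta < 0:
        flags.append("reduces maintainability")
    if now - pr.opened_at > NORMAL_OPEN_WINDOW:
        flags.append("open past normal threshold")
    return flags


def needs_extra_review(pr: PullRequest, now: datetime) -> bool:
    """Assumed rule: one flag is enough in a fragile repo, two elsewhere."""
    flags = review_flags(pr, now)
    required = 1 if pr.repo_is_fragile else 2
    return len(flags) >= required
```

A CI step could call it like this and request another reviewer instead of failing the build:

```python
now = datetime.now(timezone.utc)
pr = PullRequest(opened_at=now - timedelta(days=5),
                 maintainability_delta=-0.4,
                 repo_is_fragile=True)
if needs_extra_review(pr, now):
    print("Extra review suggested:", ", ".join(review_flags(pr, now)))
```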

AI raises the stakes

As AI-assisted code output grows, review and validation carry more weight, not less. Teams that measure AI success purely on speed or throughput can miss where the real cost hits: in rework, review load, and incident recovery time.

What's important is whether AI-assisted code is maintainable and easy to change later, not just how fast it shipped.

The signals that matter, before the outage

Maintainability decline. Long-running PRs. Aging files. Fragile repositories. Review friction.

These show up before the post-mortem, before the 65-hour recovery window, while there's still time to act. That's what turns incident management from something you do after a fire into something that prevents one.

See the full breakdown of these risk drivers here.


