AI-Era Software Incident Management Needs to Move Upstream
Software incident management shouldn’t start after production failures. Discover how maintainability, PR friction, and repository health help engineering teams spot incident risk earlier.
Most software incidents don't begin in production.
They begin earlier: in the pull request that takes too long to close, the file nobody wants to touch, the repository that gets a little harder to change every quarter, the review where everyone senses risk but nobody can point to it.
Most teams are good at responding once something breaks, but by the time an incident hits production, the cost has already started climbing.
We analyzed more than 666,000 source code revisions across 13 enterprises, linking incident records to the code changes used to resolve them. The pattern was striking: incidents cluster where poor maintainability, repository fragility, and delivery friction overlap.
For engineering leaders, that means the real work of incident management starts upstream: spotting the conditions that make incidents more likely, harder to fix, and slower to close, before they show up in the queue.
The bottleneck isn't writing the fix any more
AI has made code production faster. That speed moves the bottleneck downstream, into review, validation, and the growing suspicion that the codebase is getting harder to trust.
PR Time to Close was the strongest overall risk driver in our data, carrying a 1.34x risk multiplier. A slow-moving PR isn't automatically dangerous. But at scale, it's a signal.
Maintainability dictates recovery time
When an incident hits the least maintainable quartile of a codebase, the median PR time to resolve is 65.2 hours. In the most maintainable quartile, it's 1.7 hours.
That's a 38x gap. If your incident dashboard only tells you what happened during the outage, it's missing the part that determines how long you're stuck in it.
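The gap follows directly from the two medians quoted above. A quick check of the arithmetic, using only those two figures (the variable names are illustrative):

```python
# Median PR time to resolve, in hours, as quoted in the text above.
least_maintainable_hours = 65.2   # least maintainable quartile
most_maintainable_hours = 1.7     # most maintainable quartile

# Ratio between the two quartiles.
gap = least_maintainable_hours / most_maintainable_hours

print(f"{gap:.1f}x")  # 38.4x, which the text rounds to 38x
```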
Risk sits at the repo level
A careful, well-reviewed change can still land in a fragile repository and inherit its risk. Good engineers end up working inside code that's already hard to reason about.
Repository Lifetime Maintainability Deficit was one of the strongest maintainability-related risk drivers we found, at 1.24x. It's a good argument for keeping a watchlist: repositories with declining maintainability, slow PRs, aging files, or narrow ownership warrant more scrutiny before something breaks, not after.
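One way such a watchlist might look in practice is sketched below. This is a minimal illustration, not the method behind the analysis: the `RepoSignals` fields, the threshold defaults, and the repository names are all hypothetical, and real values would come from your own tooling and team norms.

```python
from dataclasses import dataclass


@dataclass
class RepoSignals:
    """Hypothetical per-repository signals; field names are assumptions."""
    name: str
    maintainability_trend: float      # change in some maintainability score; negative = declining
    median_pr_hours_to_close: float
    median_file_age_days: float
    active_owner_count: int


def watchlist_reasons(repo, *, slow_pr_hours=72.0, old_file_days=730.0, min_owners=2):
    """Return the reasons a repository warrants more scrutiny.

    The thresholds are placeholder assumptions, not values from the study.
    """
    reasons = []
    if repo.maintainability_trend < 0:
        reasons.append("declining maintainability")
    if repo.median_pr_hours_to_close > slow_pr_hours:
        reasons.append("slow PRs")
    if repo.median_file_age_days > old_file_days:
        reasons.append("aging files")
    if repo.active_owner_count < min_owners:
        reasons.append("narrow ownership")
    return reasons


def build_watchlist(repos):
    """Map repository name to its reasons, keeping only flagged repositories."""
    flagged = {repo.name: watchlist_reasons(repo) for repo in repos}
    return {name: reasons for name, reasons in flagged.items() if reasons}


# Made-up example data.
repos = [
    RepoSignals("billing-core", -0.4, 96.0, 1200.0, 1),
    RepoSignals("docs-site", 0.1, 6.0, 200.0, 4),
]
watchlist = build_watchlist(repos)
for name, reasons in watchlist.items():
    print(name, "->", ", ".join(reasons))
```

Keeping the reasons alongside each repository name, rather than a single score, makes it easier to see why a repository is on the list and what would take it off.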
Old files need more care
File age carried a 1.22x risk multiplier. Old code isn't inherently risky – plenty of old files are stable and rarely touched. The risk appears when an aging file needs active modification and the original design intent, dependencies, or owners have moved on.
A useful discipline here is touch-and-clean: when engineers modify an aging file, they clean up what's safe and proportionate on the way through. That might mean adding a missing test, untangling a confusing conditional, removing dead code, or documenting an undocumented behavior. It's not a full refactor, just enough to stop the debt from compounding.
Three changes to make now
- Add maintainability checks to CI/CD. Flag changes that reduce maintainability, especially in fragile or critical repositories. It doesn't need to block every merge; it can just trigger extra review when risk is higher.
- Treat long-running PRs as a risk signal. A PR sitting open past your team's normal threshold (especially in a fragile repository or aging file) deserves a second look before merge.
- Bring maintainability into incident reviews. After an incident, ask whether the affected area was already hard to change. If it was, "Reduce what made this bug hard to fix" is your action point.
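The first two changes could be combined into a single non-blocking check, sketched below. Everything here is an assumption made for illustration: the `ChangeInfo` fields, the 48-hour default threshold, and the rule that escalation requires a risky area are hypothetical choices, not part of the analysis or of any specific CI product.

```python
from dataclasses import dataclass


@dataclass
class ChangeInfo:
    """Hypothetical facts about one pull request; field names are assumptions."""
    maintainability_delta: float   # change in some maintainability score; negative = worse
    hours_open: float
    repo_is_fragile: bool
    touches_aging_file: bool


def review_flags(change, normal_hours_open=48.0):
    """Collect risk signals for a change. The 48-hour default is a placeholder."""
    flags = []
    if change.maintainability_delta < 0:
        flags.append("reduces maintainability")
    if change.hours_open > normal_hours_open:
        flags.append("open longer than team threshold")
    return flags


def needs_extra_review(change, normal_hours_open=48.0):
    """Escalate only when a signal coincides with a fragile repo or aging file.

    This never blocks a merge; it only decides whether to ask for a second look.
    """
    in_risky_area = change.repo_is_fragile or change.touches_aging_file
    return bool(review_flags(change, normal_hours_open)) and in_risky_area


# Made-up example: a slow PR that lowers maintainability in a fragile repository.
change = ChangeInfo(maintainability_delta=-1.5, hours_open=90.0,
                    repo_is_fragile=True, touches_aging_file=False)
print(review_flags(change))
print(needs_extra_review(change))
```

Separating the flags from the escalation decision lets a team surface every signal as a comment on the PR while reserving mandatory extra review for the cases where risk is higher.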
AI raises the stakes
As AI-assisted code output grows, review and validation carry more weight, not less. Teams that measure AI success purely on speed or throughput can miss where the real cost hits: in rework, review load, and incident recovery time.
What's important is whether AI-assisted code is maintainable and easy to change later, not just how fast it shipped.
The signals that matter, before the outage
Maintainability decline. Long-running PRs. Aging files. Fragile repositories. Review friction.
These show up before the post-mortem, before the 65-hour recovery window, while there's still time to act. That's what turns incident management from something you do after a fire into something that prevents one.
See the full breakdown of these risk drivers here.






