Recent DORA reports announced a change to MTTR. Instead of naming and measuring MTTR, DORA now focuses on failed deployment recovery time. I would like to lay out these metrics with explanations, calculations, and examples to clarify how such closely related metrics differ.
Let’s start with the terminology.
Change Failure Rate (CFR):
CFR measures the percentage of deployments that cause failures in production requiring immediate remediation.
It focuses on changes that degrade service quality, such as bugs introduced by new code or performance issues from recent deployments.
CFR counts all failures/bugs that need to be fixed, no matter how serious they are.
Imagine your team has the following deployment record over two weeks:
Total Deployments: 10
Failed Deployments: 3
CFR: 3 / 10 = 30%
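As a minimal sketch, the CFR calculation for this record can be written in a few lines of Python. The function name `change_failure_rate` and its parameters are illustrative assumptions, not part of any DORA tooling.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Return CFR as the percentage of deployments that failed in production."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100 * failed_deployments / total_deployments


# Numbers from the two-week example: 10 deployments, 3 of them failed.
cfr = change_failure_rate(total_deployments=10, failed_deployments=3)
print(f"CFR: {cfr:.0f}%")  # CFR: 30%
```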
Mean Time to Recover (MTTR) (legacy metric)
MTTR calculates the average time it takes to restore service after any production incident, not limited to deployment-related failures.
This includes infrastructure issues, external service disruptions, or operational problems. MTTR often measures the service outages that impact users.
Total incidents: 5
Total downtime: 2 + 4 + 3 + 1 + 5 = 15 hours
MTTR: 15 hours / 5 incidents = 3 hours
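The same arithmetic can be sketched in Python. The list below holds the per-incident downtime from the example; the helper name `mean_time_to_recover` is a hypothetical choice for illustration.

```python
# Downtime per incident in hours, taken from the example above.
incident_downtime_hours = [2, 4, 3, 1, 5]


def mean_time_to_recover(downtimes: list[float]) -> float:
    """Average downtime across all incidents, whatever their cause."""
    if not downtimes:
        raise ValueError("at least one incident is required")
    return sum(downtimes) / len(downtimes)


mttr = mean_time_to_recover(incident_downtime_hours)
print(f"MTTR: {mttr:.1f} hours")  # MTTR: 3.0 hours
```

Note that every incident counts here, including ones that no deployment caused.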
Failed Deployment Recovery Time
In recent years, the DevOps Research and Assessment (DORA) group has shifted from MTTR to Failed Deployment Recovery Time.
This metric specifically measures the time to recover from failures introduced by deployments. It aligns closely with CFR, as both focus on deployment-induced failures.
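To make the difference from MTTR concrete, here is a minimal sketch that averages recovery time over failed deployments only. The `Deployment` record, its field names, and the sample timestamps are all assumptions made up for illustration. The sketch also assumes recovery is timed from the moment the failure is detected, which is one of several reasonable starting points.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    name: str
    failure_detected_at: Optional[datetime] = None  # None means the deployment succeeded
    recovered_at: Optional[datetime] = None


def failed_deployment_recovery_time(deployments: list[Deployment]) -> Optional[timedelta]:
    """Mean recovery time across failed deployments only; None if nothing failed."""
    recoveries = [
        d.recovered_at - d.failure_detected_at
        for d in deployments
        if d.failure_detected_at is not None and d.recovered_at is not None
    ]
    if not recoveries:
        return None
    return sum(recoveries, timedelta()) / len(recoveries)


# Hypothetical sample data: five deployments, three of which failed.
deployments = [
    Deployment("v1.1"),
    Deployment("v1.2", datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),
    Deployment("v1.3", datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 16, 0)),
    Deployment("v1.4"),
    Deployment("v1.5", datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 9, 12, 0)),
]

fdrt = failed_deployment_recovery_time(deployments)
print(f"Failed deployment recovery time: {fdrt}")  # 2:00:00
```

Incidents that are unrelated to a deployment, such as an infrastructure outage, never enter this calculation.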
Shift from MTTR to Failed Deployment Recovery Time
As the industry evolves, so does our understanding of effective metrics. Many professionals argue that MTTR can be ambiguous and sometimes misleading due to the wide variance in incident types and severities.
Averaging recovery times of minor glitches with major outages may not provide meaningful insights. Recognizing these limitations, organizations like DORA have shifted focus from MTTR to Failed Deployment Recovery Time, which specifically measures recovery from deployment-related failures.
This metric offers a more precise assessment of how quickly a team can respond to issues caused by their own changes, providing actionable insights for improving the software delivery process.
Key takeaways
- Measure each failure. If you are using conventional commits, for example, every commit or branch starting with the fix: prefix is a failure, and we should measure the time from when the bug is identified to when the fix is deployed to production.
- Failed deployment recovery time focuses teams precisely on higher-quality programming activity.
- Mean time to recover is still a good metric to follow for service providers that focus on the availability of their services.
- Deployment failures can range from problems the CI pipeline reveals before the deployment stage, such as linter errors, to logic errors that have been deployed and affect the user experience.
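The first takeaway could be sketched as follows, assuming change subjects follow the Conventional Commits format. The `changes` records, their field names, and the timestamps are hypothetical. In practice the "identified" time would come from an issue tracker or alerting system, and the "deployed" time from the deployment pipeline.

```python
import re
from datetime import datetime, timedelta

# Matches "fix:", "fix(scope):" and "fix!:" at the start of a subject line.
FIX_PATTERN = re.compile(r"^fix(\([^)]+\))?!?:")

# Hypothetical change records for illustration.
changes = [
    {"subject": "feat: add export button",
     "identified_at": None,
     "deployed_at": datetime(2024, 2, 1, 9, 0)},
    {"subject": "fix: handle empty cart",
     "identified_at": datetime(2024, 2, 1, 13, 0),
     "deployed_at": datetime(2024, 2, 1, 15, 30)},
    {"subject": "fix(api): correct timeout value",
     "identified_at": datetime(2024, 2, 2, 8, 0),
     "deployed_at": datetime(2024, 2, 2, 9, 0)},
    {"subject": "docs: fix typo in README",
     "identified_at": None,
     "deployed_at": datetime(2024, 2, 2, 10, 0)},
]


def fix_recovery_times(records: list[dict]) -> dict[str, timedelta]:
    """Map each fix subject to the time from bug identified to fix deployed."""
    return {
        r["subject"]: r["deployed_at"] - r["identified_at"]
        for r in records
        if FIX_PATTERN.match(r["subject"]) and r["identified_at"] is not None
    }


durations = fix_recovery_times(changes)
for subject, duration in durations.items():
    print(f"{subject}: {duration}")
```

Only the two subjects that start with a fix type are measured. The "docs: fix typo" entry is skipped because its type is docs, not fix.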
If you are interested in these metrics and in how your team or organization performs, you can try iftrue for free.