Error budget
An error budget is the amount of unreliability an SLO permits: 100% minus the SLO target, measured over the SLO window. The SLA (customer contract) is usually set looser than the SLO (operational target) and does not define the budget. If your SLO is 99.9%, the total budget is 0.1%; if you're meeting 99.95%, you have used half of it and have 0.05% left to spend on risky changes such as new features, infrastructure migrations, and schema rewrites. Error budgets convert reliability from a yes/no debate into a tradeable resource.
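The arithmetic behind the 99.9% example can be sketched in a few lines. This is a minimal illustration, assuming a request-based SLO; the `ErrorBudget` class, its field names, and the traffic figures are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based SLO over one window (hypothetical helper)."""

    slo_target: float     # e.g. 0.999 for a 99.9% SLO; must be below 1.0
    total_requests: int   # requests served in the SLO window (assumed figure)
    failed_requests: int  # requests that counted against the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO does not promise: (1 - target) of traffic.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_failures(self) -> float:
        # Negative once the budget has been overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def fraction_consumed(self) -> float:
        # 0.0 means untouched, 1.0 means fully spent.
        return self.failed_requests / self.allowed_failures


# 99.9% SLO, 1,000,000 requests, 500 failures: availability is 99.95%.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=500)
print(f"allowed failures:   {budget.allowed_failures:.0f}")
print(f"remaining failures: {budget.remaining_failures:.0f}")
print(f"budget consumed:    {budget.fraction_consumed:.0%}")
```

With these assumed numbers, the service may fail about 1,000 requests in the window and has used about half of that allowance, which is the "0.05% left" in the example above.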
The team's relationship with the error budget shapes release cadence: while budget remains, ship boldly; when it is depleted, slow down and prioritise reliability work. Google's Site Reliability Engineering book describes halting releases once the budget is spent, and the burn rate (how fast the budget is being consumed relative to the SLO window) is commonly used as an on-call alerting signal. The tradeable-resource framing is what makes the concept stick organisationally; without it, reliability vs. velocity becomes a recurring philosophical debate.
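One way to turn that cadence into a rule is to compare the observed error rate with the rate the SLO allows, then map the result to a release decision. The sketch below is one possible policy, not a standard: the function names, the `fast_burn_threshold` default, and the decision labels are all assumptions chosen for illustration.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    2.0 spends it in half the window. Requires slo_target < 1.0.
    """
    return error_rate / (1.0 - slo_target)


def release_decision(
    budget_consumed: float,
    rate: float,
    fast_burn_threshold: float = 2.0,  # hypothetical threshold; tune per service
) -> str:
    """Map budget state to a release stance (illustrative policy only)."""
    if budget_consumed >= 1.0:
        return "freeze"      # budget spent: only reliability work ships
    if rate > fast_burn_threshold:
        return "slow down"   # burning faster than the window allows
    return "ship"            # budget available and burn is sustainable


# 0.2% of requests failing against a 99.9% SLO burns the budget about twice as fast as planned.
rate = burn_rate(error_rate=0.002, slo_target=0.999)
print(release_decision(budget_consumed=0.4, rate=rate * 1.5))
print(release_decision(budget_consumed=1.0, rate=0.5))
```

Keeping the burn-rate calculation separate from the decision makes it easier to change the policy (thresholds, extra states) without touching the measurement.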
Related terms
- SLO
A Service-Level Objective is a target reliability metric for a service — typically expressed as a percentage over a time window.
- MTTR
Mean Time To Recovery is the average elapsed time between an incident's detection and its resolution.
- Postmortem
A postmortem is a structured retrospective on an incident or failure — capturing what happened, why, what was learned, and what will change.