Site reliability engineering
Site reliability engineering (SRE) is a discipline that originated at Google around 2003 and was codified in the 2016 book Site Reliability Engineering. It applies software-engineering practices to operations. Rather than treating uptime as an absolute, SRE expresses reliability as service level objectives (SLOs) and uses an error budget, the amount of unreliability the SLO permits (100% minus the target), to govern the trade-off between reliability work and feature delivery.
SRE's central insight is that 100% uptime is the wrong goal: it is both unachievable and unnecessarily expensive. Instead, a team defines an SLO (e.g., 99.9% request success), measures it via SLIs (service level indicators), and treats the gap between the SLO and 100% as a budget. While budget remains, the team ships features; once it is exhausted, feature work halts in favour of reliability work. Several SRE practices have spread beyond Google: toil reduction (engineering effort to eliminate repeated manual work), blameless postmortems, error-budget policies, and on-call rotations with strict guardrails (no more than 25% of an engineer's time spent on call in a quarter). The model has been adopted at scale by Meta, Netflix, LinkedIn, and many other infrastructure organisations.
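The budget arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `BudgetStatus` and `error_budget_status` are hypothetical, the SLI is assumed to be a simple request success ratio, and the rule "pause feature work when nothing remains" is one possible error-budget policy among many.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    sli: float               # measured success ratio over the window
    allowed_failures: float  # failed requests the SLO permits at this volume
    remaining: float         # allowed failures not yet consumed (negative if overspent)
    exhausted: bool          # True when the assumed policy would pause feature work


def error_budget_status(total_requests: int, failed_requests: int, slo: float) -> BudgetStatus:
    """Compare a request-based SLI against an SLO given as a fraction (0.999 = 99.9%)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = (total_requests - failed_requests) / total_requests
    # The error budget is the gap between the SLO and 100%, scaled to traffic.
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    return BudgetStatus(sli, allowed_failures, remaining, exhausted=remaining <= 0)


if __name__ == "__main__":
    status = error_budget_status(total_requests=1_000_000, failed_requests=400, slo=0.999)
    action = "pause feature work" if status.exhausted else "keep shipping"
    print(f"SLI={status.sli:.4%}, remaining budget={status.remaining:.0f} requests -> {action}")
```

With 1,000,000 requests and a 99.9% SLO, the budget is about 1,000 failed requests, so 400 failures leave roughly 600 and the budget is not exhausted. Because the arithmetic uses floating point, a result that lands exactly on the boundary should not be relied upon.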
Related terms
- SLO
A service level objective (SLO) is a target value for a service's reliability, typically expressed as a percentage over a time window (for example, 99.9% of requests succeeding over 30 days).
- Error budget
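An error budget is the amount of unreliability an SLO permits: 100% minus the SLO target, measured over the SLO's window. A 99.9% SLO, for example, leaves a 0.1% budget. An SLA (customer contract) is a separate commitment, typically set looser than the SLO (operational target).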
- Toil
Toil, as defined in Google's SRE practice, is operational work that is manual, repetitive, automatable, tactical (not strategic), and scales linearly with service growth — patching servers, manually resolving alerts, hand-editing configs.
- Blameless postmortem
A blameless postmortem is an incident review structured to identify systemic causes — flawed processes, missing alerts, fragile dependencies — rather than individual fault.
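To make the SLO and error budget entries concrete for an availability target, the sketch below converts an SLO over a time window into allowed downtime and reports how much of that budget a given outage consumes. It assumes the error budget is the complement of the SLO (100% minus the target) and that availability is measured by time; the function names `allowed_downtime` and `budget_consumed` are hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability SLO (fraction, e.g. 0.999) over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    # Error budget = (100% - SLO) of the window.
    return window * (1.0 - slo)


def budget_consumed(downtime: timedelta, slo: float, window: timedelta) -> float:
    """Fraction of the error budget used by the observed downtime (1.0 = fully spent)."""
    return downtime / allowed_downtime(slo, window)


if __name__ == "__main__":
    window = timedelta(days=30)
    budget = allowed_downtime(0.999, window)
    used = budget_consumed(timedelta(minutes=10), 0.999, window)
    print(f"Budget for 99.9% over 30 days: {budget}")
    print(f"A 10-minute outage consumes {used:.1%} of it")
```

A 99.9% target over 30 days allows about 43 minutes of downtime, so a single 10-minute outage uses a little under a quarter of the budget.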