Site reliability engineering

Site reliability engineering (SRE) is a discipline that applies software-engineering practices to operations. It originated at Google around 2003 and was codified in the 2016 book Site Reliability Engineering. Rather than treating uptime as an absolute, SRE expresses reliability as service level objectives (SLOs) and uses an error budget, the amount of unreliability an SLO permits (100% minus the target), to govern the trade-off between reliability work and feature delivery.

SRE's central insight is that 100% uptime is the wrong goal: it is both unachievable and unnecessarily expensive. Instead, a team defines an SLO (e.g., 99.9% request success), measures it via service level indicators (SLIs), and treats the gap between the SLO and 100% as a budget; a 99.9% SLO leaves a budget of 0.1% of requests. While budget remains, the team ships features; once it is exhausted, feature work halts in favour of reliability work. Several SRE practices have spread beyond Google: toil reduction (engineering effort to eliminate repetitive manual work), blameless postmortems, error-budget policies, and on-call rotations with strict guardrails (no more than 25% of an engineer's time on-call per quarter). The model has been adopted at scale by Meta, Netflix, LinkedIn, and many modern infrastructure organisations.
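
The error-budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration and not a standard tool: the function name error_budget, its parameters, and the returned keys are all hypothetical, and it assumes a request-based SLI (successful requests divided by total requests) measured over a single SLO window. The SLO is passed as a decimal string so that the boundary case of an exactly exhausted budget is not blurred by floating-point rounding.

```python
from fractions import Fraction


def error_budget(slo_target: str, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget status for one SLO window (hypothetical helper).

    slo_target is a decimal string such as "0.999" so the arithmetic stays exact.
    """
    slo = Fraction(slo_target)
    if not 0 < slo < 1:
        raise ValueError("SLO target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the measured success ratio over the window.
    sli = 1 - Fraction(failed_requests, total_requests)
    # The whole budget, expressed as the number of requests allowed to fail.
    allowed_failures = (1 - slo) * total_requests
    # Share of the budget still unspent; negative once the budget is overspent.
    remaining = 1 - failed_requests / allowed_failures

    return {
        "sli": float(sli),
        "allowed_failures": float(allowed_failures),
        "budget_remaining": float(remaining),
        # Policy from the text: ship while budget remains, halt once it is exhausted.
        "ship_features": remaining > 0,
    }
```

Under these assumptions, a 99.9% SLO over 1,000,000 requests allows 1,000 failures, so error_budget("0.999", 1_000_000, 250) reports 75% of the budget remaining and feature work continues. At 1,000 failures the budget is exactly spent and ship_features becomes False.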

Related terms