On-call rotation
An on-call rotation is the scheduled assignment of engineers as the primary responder for production incidents during a defined shift, usually 1 week per engineer, 24/7. A secondary acts as backup: the page escalates to them if the primary doesn't acknowledge it within the agreed acknowledgement window.
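The primary/secondary arrangement can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular paging tool: the function names `build_rotation` and `responder_for`, the round-robin choice of secondary, and the 15-minute default acknowledgement window are all assumptions made for the example.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Assign one primary and one secondary per weekly shift, round-robin."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        schedule.append({
            "starts": start + timedelta(weeks=week),
            "primary": engineers[week % n],
            # Assumption: the next engineer in the list backs up the primary.
            "secondary": engineers[(week + 1) % n],
        })
    return schedule


def responder_for(shift, minutes_unacknowledged, ack_window_minutes=15):
    """Return who should currently hold the page for this shift.

    The page stays with the primary until the acknowledgement window
    has elapsed, then escalates to the secondary.
    """
    if minutes_unacknowledged >= ack_window_minutes:
        return shift["secondary"]
    return shift["primary"]
```

With three hypothetical engineers, `build_rotation(["ana", "ben", "chen"], date(2024, 1, 1), 4)` puts "ana" on primary in the first and fourth weeks, and an unacknowledged page in the first week moves to "ben" once the window is reached.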
Healthy rotations share several properties: enough engineers that each person is on-call no more often than 1 week in 6 (preferably 1 in 8 or fewer); explicit compensation or time back for off-hours pages; a hard cap on actionable pages per shift (5 to 7 is the common ceiling before burnout sets in); a no-blame culture for pages that escalate; and a feedback loop into alert design (every page that turns out to be non-actionable becomes a candidate for an alert edit). Teams that rotate engineers through new services every 6 months tend to have healthier runbooks because the rotation surfaces stale ones; teams with deeply specialised on-call create single points of failure, and pages go badly whenever the one specialist is unavailable.
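The numeric properties above (rotation frequency, page cap, non-actionable pages feeding alert edits) lend themselves to a simple automated check. The sketch below is illustrative only: the function name `rotation_findings`, the input shape, and the default thresholds (1 week in 6, a cap of 7 actionable pages) are assumptions taken from the figures in this entry, not a standard.

```python
def rotation_findings(num_engineers, pages_per_shift, min_engineers=6, page_cap=7):
    """Return a list of warnings; an empty list means nothing was flagged.

    pages_per_shift maps a shift label to a dict with the counts
    "actionable" and "non_actionable" (a hypothetical input shape).
    """
    findings = []

    # With N engineers in a weekly rotation, each is on-call 1 week in N.
    if num_engineers < min_engineers:
        findings.append(
            f"each engineer is on-call 1 week in {num_engineers}; "
            f"target is 1 in {min_engineers} or better"
        )

    for shift_label, pages in pages_per_shift.items():
        if pages["actionable"] > page_cap:
            findings.append(
                f"{shift_label}: {pages['actionable']} actionable pages "
                f"exceeds the cap of {page_cap}"
            )
        # Every non-actionable page is a candidate for an alert edit.
        if pages["non_actionable"] > 0:
            findings.append(
                f"{shift_label}: {pages['non_actionable']} non-actionable "
                f"pages to review as alert edits"
            )
    return findings
```

A check like this only covers what can be counted; compensation, blame culture, and runbook freshness still need human review.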
Related terms
- Runbook
A runbook is a step-by-step operational document that describes how to diagnose and resolve a specific failure mode — what alert fires, what to check first, which commands to run, when to escalate.
- Incident commander
The incident commander is the single individual with end-to-end authority during a production incident — coordinating responders, deciding on mitigation actions, communicating to stakeholders, and declaring when the incident is over.
- Toil
Toil, as defined in Google's SRE practice, is operational work that is manual, repetitive, automatable, tactical (not strategic), and scales linearly with service growth — patching servers, manually resolving alerts, hand-editing configs.