Runbook
A runbook is a step-by-step operational document that describes how to diagnose and resolve a specific failure mode — what alert fires, what to check first, which commands to run, when to escalate. Runbooks are linked from alert payloads so the responder reaches the procedure within seconds of the page firing.
Runbooks degrade quickly when not exercised. The two failure modes are absence (the alert fires and the responder has nothing to follow) and rot (the runbook references a deprecated tool, a renamed dashboard, or a person who left two years ago). Healthy runbook practice: every alert links to a runbook; every game day surfaces runbook gaps; every postmortem produces a runbook edit; the runbook lives in version control alongside the service it covers, not in a wiki nobody updates. For routine incidents, a current runbook can be the difference between a 5-minute and a 50-minute recovery.
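The "every alert links to a runbook" rule can be checked automatically, for example in CI. The following is a minimal sketch, not a definitive implementation. It assumes alert rules have already been loaded as dictionaries with an `alert` name and an `annotations` mapping that carries a `runbook_url` key. That layout, the function name, and the sample rules are illustrative assumptions and should be adapted to however alerts are actually defined.

```python
from typing import Iterable, List, Mapping


def alerts_missing_runbook(rules: Iterable[Mapping]) -> List[str]:
    """Return the names of alert rules that have no non-empty runbook link.

    Assumes each rule is a mapping with an "alert" name and an optional
    "annotations" mapping containing "runbook_url" (hypothetical layout).
    """
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        url = str(annotations.get("runbook_url", "")).strip()
        if not url:
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


# Hypothetical sample rules, for illustration only.
rules = [
    {"alert": "HighErrorRate",
     "annotations": {"runbook_url": "runbooks/high-error-rate.md"}},
    {"alert": "DiskAlmostFull", "annotations": {}},
]

print(alerts_missing_runbook(rules))
```

Failing the build when the returned list is non-empty catches absence. It does not catch rot, which still needs game days and postmortem edits.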
Related terms
- MTTR
Mean Time To Recovery is the average elapsed time between an incident's detection and its resolution.
- On-call rotation
An on-call rotation is the scheduled assignment of engineers to be the primary responder for production incidents during a defined window, usually one week per engineer with 24/7 coverage. A secondary responder acts as backup: the page escalates to them if the primary doesn't acknowledge within the agreed window.
- Incident commander
The incident commander is the single individual with end-to-end authority during a production incident — coordinating responders, deciding on mitigation actions, communicating to stakeholders, and declaring when the incident is over.
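As a worked example of the MTTR definition above, the average can be computed from pairs of detection and resolution timestamps. This is a small sketch under stated assumptions: the function name and the sample incidents are hypothetical, and real incident data would come from whatever tracker the team uses.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_recovery(
    incidents: Iterable[Tuple[datetime, datetime]]
) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("MTTR is undefined with no incidents")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: one resolved in 5 minutes, one in 50 minutes.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 50)),
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:27:30
```

Because MTTR is a mean, a single long incident can dominate it, so it is worth reading alongside the individual durations.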