Game day
A game day is a scheduled exercise in which an engineering team intentionally triggers a failure scenario in a live or production-like environment (pulling a database, killing a region, exhausting a quota) to validate that monitoring fires, runbooks work, and the on-call rotation can respond within the agreed SLO.
Game days originated at Amazon as a way to keep recovery procedures from atrophying between real incidents. The typical structure is a 2-4 hour session with a pre-written scenario, an injection point, observers who time each detection and response step, and a post-exercise debrief that produces concrete runbook edits. Game days complement chaos engineering. Chaos experiments are typically automated and can run in production at random intervals with nobody watching, whereas game days are scheduled, observed, and focused on validating how people and tooling respond together. Both surface the same kinds of latent failure (stale runbooks, missing alerts, unclear escalation paths), but game days give a higher-resolution picture because a human team is actively diagnosing the problem while observers record what happens.
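The timing that observers do during a game day can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the `GameDayTimeline` class, the `debrief_findings` function, and the 5-minute and 30-minute targets are all hypothetical names and values chosen for the example. It records when the failure was injected, detected, and mitigated, then compares the elapsed times against the targets to produce items for the debrief.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class GameDayTimeline:
    """Timestamps an observer records for one scenario (hypothetical structure)."""
    injected_at: datetime   # when the failure was injected
    detected_at: datetime   # when an alert fired or a responder noticed
    mitigated_at: datetime  # when the service was restored

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.injected_at

    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.injected_at


def debrief_findings(timeline: GameDayTimeline,
                     detect_target: timedelta,
                     mitigate_target: timedelta) -> list[str]:
    """Return one finding per target that was missed; an empty list means both were met."""
    findings = []
    ttd = timeline.time_to_detect()
    ttm = timeline.time_to_mitigate()
    if ttd > detect_target:
        findings.append(f"detection took {ttd} against a target of {detect_target}")
    if ttm > mitigate_target:
        findings.append(f"mitigation took {ttm} against a target of {mitigate_target}")
    return findings


# Example values are made up for illustration.
start = datetime(2024, 1, 1, 10, 0)
timeline = GameDayTimeline(
    injected_at=start,
    detected_at=start + timedelta(minutes=7),
    mitigated_at=start + timedelta(minutes=25),
)
findings = debrief_findings(
    timeline,
    detect_target=timedelta(minutes=5),
    mitigate_target=timedelta(minutes=30),
)
for finding in findings:
    print(finding)
```

In this example the team mitigated within the target but detected late, so the debrief would focus on alerting rather than on the runbook's recovery steps.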
Related terms
- Chaos engineering
Chaos engineering deliberately injects failures into production (or production-like) systems to validate they recover gracefully.
- Runbook
A runbook is a step-by-step operational document that describes how to diagnose and resolve a specific failure mode — what alert fires, what to check first, which commands to run, when to escalate.
- Incident commander
The incident commander is the single individual with end-to-end authority during a production incident — coordinating responders, deciding on mitigation actions, communicating to stakeholders, and declaring when the incident is over.