Disaster recovery
Disaster recovery (DR) is the set of plans, procedures, and infrastructure that restores a service after a major failure: a region outage, data corruption, ransomware, or deletion of production data. DR is distinct from routine high availability (HA). It covers events too rare to design HA against, such as a region-wide cloud outage, and events that bypass HA, such as a corrupt deployment replicated to all instances.
A DR plan has three concrete numbers: RTO (how fast must we restore?), RPO (how much data can we lose?), and recovery cost. The cheapest plan is backups plus a documented restore procedure, which is fine if RTO is days. The most expensive is hot multi-region with continuous replication, which is required if RTO is minutes. Most teams sit in the middle: nightly backups to a different region, weekly restore tests, a documented runbook, and an annual game day that actually executes the runbook. The single biggest DR mistake is having a plan and never testing it: when the disaster hits, the plan is six months stale.
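The numbers in a plan can be checked mechanically against what the team actually does. The following is a minimal sketch, not a standard tool: the `DrPlan` class, its field names, the `plan_gaps` function, and the seven-day default for restore-test age are all hypothetical names and values chosen for illustration. It treats the backup interval as the worst-case data loss, which is a simplification that ignores how long the backup itself takes.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DrPlan:
    """Hypothetical summary of a DR plan; every field name is illustrative."""
    rto: timedelta               # target: maximum acceptable time to restore
    rpo: timedelta               # target: maximum acceptable window of lost data
    backup_interval: timedelta   # actual time between successive backups
    measured_restore: timedelta  # restore time observed in the last restore test
    last_restore_test: date      # when a restore was last actually executed


def plan_gaps(plan: DrPlan, today: date,
              max_test_age: timedelta = timedelta(days=7)) -> list[str]:
    """Return a human-readable list of ways the plan misses its own targets."""
    gaps = []
    # Simplification: a disaster just before the next backup loses one full interval.
    if plan.backup_interval > plan.rpo:
        gaps.append(f"RPO: backups every {plan.backup_interval} exceed RPO {plan.rpo}")
    # The restore time that counts is the one measured in a test, not the one hoped for.
    if plan.measured_restore > plan.rto:
        gaps.append(f"RTO: measured restore {plan.measured_restore} exceeds RTO {plan.rto}")
    # An untested plan goes stale; flag it when the last test is older than allowed.
    age = today - plan.last_restore_test
    if age > max_test_age:
        gaps.append(f"STALE: last restore test was {age.days} days ago")
    return gaps


# Example: nightly backups, a 4-hour RTO, and a restore test that is months old.
nightly = DrPlan(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
    backup_interval=timedelta(hours=24),
    measured_restore=timedelta(hours=6),
    last_restore_test=date(2024, 1, 1),
)
gaps = plan_gaps(nightly, today=date(2024, 7, 1))
for gap in gaps:
    print(gap)
```

In this example the nightly backups satisfy the 24-hour RPO, but the measured restore time misses the RTO and the restore test is long overdue. A check like this could run on a schedule so that a stale or failing plan surfaces before a disaster does.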
Related terms
- RTO and RPO
Recovery Time Objective (RTO) is the maximum acceptable time between a disaster and service restoration.
- High availability
High availability is the design objective of keeping a system continuously operational for a defined uptime target — typically expressed in nines (99.
- Game day
A game day is a scheduled exercise in which an engineering team intentionally exercises a failure scenario in a live or production-like environment — pulling a database, killing a region, exhausting a quota — to validate that monitoring fires, runbooks work, and the on-call rotation can respond inside the agreed SLO.