Disaster recovery

Disaster recovery (DR) is the set of plans, procedures, and infrastructure that restores a service after a major failure, such as a region outage, data corruption, ransomware, or deletion of production data. DR is distinct from routine high availability (HA): it covers events too rare to design HA against (a region-wide cloud outage) and events that bypass HA (a corrupt deployment replicated to all instances).

A DR plan has three concrete numbers: the recovery time objective, or RTO (how fast must we restore?), the recovery point objective, or RPO (how much data can we lose?), and recovery cost. The cheapest plan is backups plus a documented restore procedure, which is fine if RTO is days. The most expensive is hot multi-region with continuous replication, which is required if RTO is minutes. Most teams sit in the middle: nightly backups to a different region, weekly restore tests, a documented runbook, and an annual game day that actually executes the runbook. The single biggest DR mistake is having the plan and never testing it; when the disaster hits, the plan is six months stale.
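The relationship between these numbers and a team's actual practice can be checked mechanically. The following is a minimal sketch, not a standard tool: the `DrPlan` fields, the `check_plan` function, and the seven-day default for restore-test freshness are all illustrative assumptions, chosen to mirror the "nightly backups, weekly restore tests" setup described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrPlan:
    """Hypothetical record of a DR plan's targets and observed practice."""
    rto: timedelta               # maximum tolerated time to restore service
    rpo: timedelta               # maximum tolerated window of lost data
    backup_interval: timedelta   # time between successive backups
    measured_restore: timedelta  # duration of the most recent restore test
    last_restore_test: datetime  # when that restore test was run


def check_plan(plan, now, max_test_age=timedelta(days=7)):
    """Return a list of gaps between targets and practice (empty if none)."""
    problems = []
    # Worst case: the failure hits just before the next backup,
    # so a full backup interval of data is lost.
    if plan.backup_interval > plan.rpo:
        problems.append("backup interval exceeds RPO")
    # The RTO is only credible if a real restore has met it.
    if plan.measured_restore > plan.rto:
        problems.append("measured restore time exceeds RTO")
    # An untested plan goes stale; the 7-day default is an assumption.
    if now - plan.last_restore_test > max_test_age:
        problems.append("restore test is stale")
    return problems


# Example: nightly backups and an 8-hour RTO, but no restore test in six months.
now = datetime(2024, 7, 1)
plan = DrPlan(
    rto=timedelta(hours=8),
    rpo=timedelta(hours=24),
    backup_interval=timedelta(hours=24),
    measured_restore=timedelta(hours=5),
    last_restore_test=now - timedelta(days=180),
)
print(check_plan(plan, now))
```

In this example the backup schedule and restore time both meet their targets, so the only gap reported is the stale restore test, which is the failure mode the paragraph above warns about.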

Related terms