Disaster recovery

Disaster recovery (DR) is the set of plans, procedures, and infrastructure that restores a service after a major failure: a region outage, data corruption, ransomware, or deletion of production data. DR is distinct from routine availability: it covers events too rare to design high availability (HA) against (such as a region-wide cloud outage) and events that bypass HA (such as a corrupt deployment replicated to all instances).

May 23, 2026

A DR plan rests on three concrete numbers: the recovery time objective (RTO: how fast must we restore?), the recovery point objective (RPO: how much data can we lose?), and recovery cost. The cheapest plan is backups plus a documented restore procedure, which is fine if the RTO is days. The most expensive is hot multi-region with continuous replication, which is required if the RTO is minutes. Most teams sit in the middle: nightly backups to a different region, weekly restore tests, a documented runbook, and an annual game day that actually executes the runbook. The single biggest DR mistake is having a plan and never testing it; when the disaster hits, the plan is six months stale.
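As a rough illustration, the RTO and RPO can be turned into an automated check that flags a plan which has drifted. This is a minimal sketch, not a reference implementation: the names `DrPlan`, `check_plan`, and `max_test_age` are hypothetical, and the timestamps are made-up sample values standing in for whatever a backup system and restore-test log would report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrPlan:
    # All three fields are illustrative assumptions, not a standard schema.
    rto: timedelta           # how fast the service must be restored
    rpo: timedelta           # how much data (measured in time) may be lost
    max_test_age: timedelta  # how old the last restore test may be


def check_plan(plan, last_backup, last_restore_test, last_restore_duration, now):
    """Return a list of findings; an empty list means the plan currently holds."""
    findings = []
    # A disaster right now would lose everything written since the last backup.
    if now - last_backup > plan.rpo:
        findings.append("RPO violated: newest backup is older than the RPO")
    # The last tested restore is the best available estimate of real restore time.
    if last_restore_duration > plan.rto:
        findings.append("RTO at risk: last restore test took longer than the RTO")
    # An untested plan is a stale plan.
    if now - last_restore_test > plan.max_test_age:
        findings.append("Plan is stale: restore test is overdue")
    return findings


# Sample values: nightly backups (24 h RPO), weekly restore tests, 4 h RTO.
plan = DrPlan(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
    max_test_age=timedelta(days=7),
)
now = datetime(2030, 1, 23, 12, 0)
findings = check_plan(
    plan,
    last_backup=datetime(2030, 1, 23, 2, 0),       # 10 hours ago
    last_restore_test=datetime(2030, 1, 1, 9, 0),  # about three weeks ago
    last_restore_duration=timedelta(hours=3),
    now=now,
)
for finding in findings:
    print(finding)
```

With these sample values the backup and the restore duration are within bounds, so the only finding is the overdue restore test. Running a check like this on a schedule is one way to notice a stale plan before a disaster does.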