RTO and RPO
Recovery Time Objective (RTO) is the maximum acceptable time between a disaster and service restoration. Recovery Point Objective (RPO) is the maximum acceptable data loss, measured as a window of time backwards from the disaster: an RPO of 1 hour means up to 1 hour of the most recent transactions may be lost. Together they define the disaster recovery contract.
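Both objectives reduce to simple timestamp arithmetic, which a minimal Python sketch can make concrete. The names `Objectives`, `measure_recovery`, and `meets` are hypothetical, and the timestamps are invented for illustration; `last_recoverable_at` is assumed to be the time of the newest data that survived (for example, the last usable snapshot).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum acceptable time from disaster to restoration
    rpo: timedelta  # maximum acceptable window of lost data


def measure_recovery(disaster_at, restored_at, last_recoverable_at):
    """Return (recovery_time, data_loss) for one incident or drill."""
    # Recovery time runs forwards from the disaster to restoration.
    recovery_time = restored_at - disaster_at
    # Data loss runs backwards from the disaster to the newest surviving data.
    data_loss = disaster_at - last_recoverable_at
    return recovery_time, data_loss


def meets(objectives, recovery_time, data_loss):
    """True only if both the RTO and the RPO were respected."""
    return recovery_time <= objectives.rto and data_loss <= objectives.rpo


# Hypothetical drill: disaster at 12:00, last usable data from 11:15,
# service restored at 14:30.
objectives = Objectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
disaster = datetime(2024, 1, 10, 12, 0)
recovery_time, data_loss = measure_recovery(
    disaster_at=disaster,
    restored_at=datetime(2024, 1, 10, 14, 30),
    last_recoverable_at=datetime(2024, 1, 10, 11, 15),
)
print(recovery_time, data_loss, meets(objectives, recovery_time, data_loss))
# prints: 2:30:00 0:45:00 True
```

In this made-up drill the service was down for 2.5 hours and lost 45 minutes of data, so a 4-hour RTO and 1-hour RPO were both met.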
RTO drives the recovery architecture: a 4-hour RTO can be served by backup-and-restore; a 15-minute RTO requires hot standby; a sub-minute RTO requires active-active. RPO drives the replication architecture: a 24-hour RPO is met by nightly snapshots; a 1-hour RPO requires hourly snapshots or continuous WAL shipping; a near-zero RPO requires synchronous multi-AZ replication (which adds latency to every write). The common mistake is setting a tight RTO/RPO without budgeting for the architecture and operational discipline needed to meet it. The honest exercise is to ask 'when did we last actually achieve this in a test?' If the answer is 'never', the numbers are aspirational.
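The mapping from targets to architecture can be sketched as a pair of lookup functions. This is an illustration only: the text gives example points (4 hours, 15 minutes, sub-minute; 24 hours, 1 hour, near-zero), while the cutoffs between them, the `NEAR_ZERO` value, and the function names below are assumptions chosen to be consistent with those points.

```python
from datetime import timedelta

# Assumption: "near-zero" is treated as one second or less.
NEAR_ZERO = timedelta(seconds=1)


def recovery_architecture(rto: timedelta) -> str:
    """Pick a recovery approach for a target RTO (cutoffs are assumed)."""
    if rto < timedelta(minutes=1):
        return "active-active"
    if rto < timedelta(hours=4):
        # Assumption: anything tighter than 4 hours needs a standby.
        return "hot standby"
    return "backup-and-restore"


def replication_architecture(rpo: timedelta) -> str:
    """Pick a replication approach for a target RPO (cutoffs are assumed)."""
    if rpo <= NEAR_ZERO:
        return "synchronous multi-AZ replication"
    if rpo < timedelta(hours=1):
        # Assumption: hourly snapshots cannot meet a sub-hour RPO.
        return "continuous WAL shipping"
    if rpo < timedelta(hours=24):
        return "hourly snapshots or continuous WAL shipping"
    return "nightly snapshots"


for rto in (timedelta(hours=4), timedelta(minutes=15), timedelta(seconds=30)):
    print(rto, "->", recovery_architecture(rto))
for rpo in (timedelta(hours=24), timedelta(hours=1), timedelta(0)):
    print(rpo, "->", replication_architecture(rpo))
```

A table like this is only a starting point for the budgeting conversation; whether a given setup actually meets its targets still has to be shown in a test.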
Related terms
- Disaster recovery
Disaster recovery is the set of plans, procedures, and infrastructure that restores a service after a major failure — region outage, data corruption, ransomware, deletion of production data.
- High availability
High availability is the design objective of keeping a system continuously operational for a defined uptime target — typically expressed in nines (99.
- Game day
A game day is a scheduled exercise in which an engineering team intentionally exercises a failure scenario in a live or production-like environment — pulling a database, killing a region, exhausting a quota — to validate that monitoring fires, runbooks work, and the on-call rotation can respond inside the agreed SLO.