Verify

Fault tolerance

Fault tolerance is the property of a system to continue operating, possibly in a degraded state, when one or more of its components fail. A fault-tolerant system has redundancy at every layer where failure is plausible, multiple replicas, multiple availability zones, fallback paths, and detects and routes around failure automatically.

May 23, 2026

The opposite of fault-tolerant is fragile: a system where any single component failure cascades to user-visible outage. Fault tolerance is built in layers: hardware redundancy (RAID, ECC memory), network redundancy (multi-path, multi-AZ), service redundancy (replicas, leader election), and request-level redundancy (retries with backoff, circuit breakers, graceful degradation). The trap is achieving fault tolerance at one layer while leaving fragility at another, multi-AZ replicas behind a single load balancer in a single AZ is not fault-tolerant. The discipline of chaos engineering exists to surface these hidden fragilities.