All glossary terms
Verify

Fault tolerance

Fault tolerance is the property of a system to continue operating, possibly in a degraded state, when one or more of its components fail. A fault-tolerant system has redundancy at every layer where failure is plausible — multiple replicas, multiple availability zones, fallback paths — and detects and routes around failure automatically.

The opposite of fault-tolerant is fragile: a system where any single component failure cascades to user-visible outage. Fault tolerance is built in layers: hardware redundancy (RAID, ECC memory), network redundancy (multi-path, multi-AZ), service redundancy (replicas, leader election), and request-level redundancy (retries with backoff, circuit breakers, graceful degradation). The trap is achieving fault tolerance at one layer while leaving fragility at another — multi-AZ replicas behind a single load balancer in a single AZ is not fault-tolerant. The discipline of chaos engineering exists to surface these hidden fragilities.

Related terms