Fault tolerance
Fault tolerance is the property of a system to continue operating, possibly in a degraded state, when one or more of its components fail. A fault-tolerant system has redundancy at every layer where failure is plausible — multiple replicas, multiple availability zones, fallback paths — and detects and routes around failure automatically.
The opposite of fault-tolerant is fragile: a system where any single component failure cascades to user-visible outage. Fault tolerance is built in layers: hardware redundancy (RAID, ECC memory), network redundancy (multi-path, multi-AZ), service redundancy (replicas, leader election), and request-level redundancy (retries with backoff, circuit breakers, graceful degradation). The trap is achieving fault tolerance at one layer while leaving fragility at another — multi-AZ replicas behind a single load balancer in a single AZ is not fault-tolerant. The discipline of chaos engineering exists to surface these hidden fragilities.
Related terms
- High availability
High availability is the design objective of keeping a system continuously operational for a defined uptime target — typically expressed in nines (99.
- Graceful degradation
Graceful degradation is the design property of a system that, when a dependency fails or saturates, returns reduced functionality rather than no functionality.
- Chaos engineering
Chaos engineering deliberately injects failures into production (or production-like) systems to validate they recover gracefully.