Chaos engineering
Chaos engineering is the practice of deliberately injecting failures into production (or production-like) systems to validate that they recover gracefully. Pioneered by Netflix with Chaos Monkey in 2010, the practice catches reliability assumptions that only surface under failure (single points of failure, missing retries, runaway memory during partial outages) before real customers hit them.
Mature chaos programs run scheduled experiments (kill a node, throttle a service, drop a network segment) against staging or off-peak production. The discipline lies in keeping experiments scoped and reversible; chaos engineering is not "break things for fun." Each experiment has a hypothesis ("the system will reroute traffic within 30 seconds"), success criteria, and an abort path. Common tools include Gremlin, AWS Fault Injection Service, LitmusChaos, and Toxiproxy.
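A minimal sketch of that experiment shape in Python. The inject_failure, revert_failure, and system_healthy helpers are hypothetical stand-ins for whatever your tooling exposes (a Toxiproxy toxic, a Gremlin attack, a node kill):

```python
import time

def inject_failure() -> None:
    """Hypothetical: start the fault, e.g. add a latency toxic or kill a node."""

def revert_failure() -> None:
    """Hypothetical: undo the fault; this is the experiment's abort path."""

def system_healthy() -> bool:
    """Hypothetical: probe the system, e.g. hit a health-check endpoint."""
    return True

def run_experiment(timeout_s: float = 30.0, poll_s: float = 1.0) -> bool:
    """Hypothesis: the system recovers within timeout_s of the injected fault."""
    inject_failure()
    try:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if system_healthy():
                return True   # success criteria met
            time.sleep(poll_s)
        return False          # hypothesis falsified: file findings, fix, re-run
    finally:
        revert_failure()      # abort path runs whether the hypothesis held or not

if __name__ == "__main__":
    print("hypothesis held" if run_experiment() else "hypothesis falsified")
```

The finally block carries the discipline: the fault is reverted whatever the outcome, which is what keeps the experiment reversible.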
Long-form posts that explore chaos engineering in depth — when to use it, common failure modes, how AI helps.
- The connected delivery graph: one source of truth from PRD to prod. Most teams ship software with five tools that don't talk to each other. The friction isn't any individual tool; it's the missing graph between them. This is the case for one connected graph.
- Are AI-generated test cases worth shipping? Yes, with a sharp caveat: when they're tied to AC and reviewed by a human. Five categories where AI test generation is great, five anti-patterns to catch.
Related terms
- MTTR
Mean Time To Recovery is the average elapsed time between an incident's detection and its resolution (a formula follows this list).
- Postmortem
A postmortem is a structured retrospective on an incident or failure — capturing what happened, why, what was learned, and what will change.
- Idempotency
An operation is idempotent if calling it multiple times has the same effect as calling it once (a sketch follows this list).
- SLO
A Service-Level Objective is a target reliability metric for a service, typically expressed as a percentage over a time window (a worked example follows this list).
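One common formulation of MTTR over N incidents in a reporting window (conventions vary; some teams start the clock at occurrence rather than detection):

$$\mathrm{MTTR} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i^{\text{resolved}} - t_i^{\text{detected}} \right)$$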
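A toy illustration of idempotency in Python, with hypothetical names: setting a value is idempotent, appending is not.

```python
store: dict[str, str] = {}
events: list[str] = []

def set_status(order_id: str, status: str) -> None:
    # Idempotent: a retry leaves the same state as a single call.
    store[order_id] = status

def record_event(event: str) -> None:
    # Not idempotent: each retry appends a duplicate entry.
    events.append(event)

set_status("order-1", "shipped")
set_status("order-1", "shipped")   # safe to retry
assert store == {"order-1": "shipped"}

record_event("shipped")
record_event("shipped")            # retry produces a duplicate
assert events == ["shipped", "shipped"]
```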
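To make the SLO arithmetic concrete: a 99.9% availability SLO over a 30-day window allows an error budget of

$$(1 - 0.999) \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min}$$

of downtime before the objective is breached.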