Chaos engineering
Chaos engineering is the practice of deliberately injecting failures into production (or production-like) systems to validate that they recover gracefully. Pioneered by Netflix with Chaos Monkey in 2010, the practice catches reliability assumptions that only surface under failure (single points of failure, missing retries, runaway memory during partial outages) before real customers hit them.
Mature chaos programs run scheduled experiments (kill a node, throttle a service, drop a network segment) against staging or off-peak production. The discipline lies in keeping experiments scoped and reversible; chaos engineering is not "break things for fun." Each experiment has a hypothesis ("the system will reroute traffic within 30 seconds"), success criteria, and an abort path. Common tools include Gremlin, AWS Fault Injection Service, LitmusChaos, and Toxiproxy.
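A minimal sketch of that experiment shape in Python. The inject_failure, revert_failure, and system_healthy helpers are hypothetical stand-ins for whatever your tooling exposes (a Toxiproxy toxic, a Gremlin attack, a node kill):

```python
import time

def inject_failure() -> None:
    """Hypothetical: start the fault, e.g. add a latency toxic or kill a node."""

def revert_failure() -> None:
    """Hypothetical: undo the fault; this is the experiment's abort path."""

def system_healthy() -> bool:
    """Hypothetical: probe the system, e.g. hit a health-check endpoint."""
    return True

def run_experiment(timeout_s: float = 30.0, poll_s: float = 1.0) -> bool:
    """Hypothesis: the system recovers within timeout_s of the injected fault."""
    inject_failure()
    try:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if system_healthy():
                return True   # success criteria met
            time.sleep(poll_s)
        return False          # hypothesis falsified: file findings, fix, re-run
    finally:
        revert_failure()      # abort path runs whether the hypothesis held or not

if __name__ == "__main__":
    print("hypothesis held" if run_experiment() else "hypothesis falsified")
```

The finally block carries the discipline: the fault is reverted whatever the outcome, which is what keeps the experiment reversible.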
Long-form posts that explore chaos engineering in depth — when to use it, common failure modes, how AI helps.
- The connected delivery graph: one source of truth from PRD to prod. Most teams ship software with five tools that don't talk to each other. The friction isn't any individual tool; it's the missing graph between them. This is the case for one connected graph.
- Are AI-generated test cases worth shipping? Yes, with a sharp caveat: when they're tied to AC and reviewed by a human. Five categories where AI test generation is great, five anti-patterns to catch.
Related terms
- MTTR
Mean Time To Recovery is the average elapsed time between an incident's detection and its resolution (a formula follows this list).
- Postmortem
A postmortem is a structured retrospective on an incident or failure — capturing what happened, why, what was learned, and what will change.
- Idempotency
An operation is idempotent if calling it multiple times has the same effect as calling it once (a sketch follows this list).
- SLO
A Service-Level Objective is a target reliability metric for a service, typically expressed as a percentage over a time window (a worked example follows this list).
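One common formulation of MTTR over N incidents in a reporting window (conventions vary; some teams start the clock at occurrence rather than detection):

$$\mathrm{MTTR} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i^{\text{resolved}} - t_i^{\text{detected}} \right)$$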
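A toy illustration of idempotency in Python, with hypothetical names: setting a value is idempotent, appending is not.

```python
store: dict[str, str] = {}
events: list[str] = []

def set_status(order_id: str, status: str) -> None:
    # Idempotent: a retry leaves the same state as a single call.
    store[order_id] = status

def record_event(event: str) -> None:
    # Not idempotent: each retry appends a duplicate entry.
    events.append(event)

set_status("order-1", "shipped")
set_status("order-1", "shipped")   # safe to retry
assert store == {"order-1": "shipped"}

record_event("shipped")
record_event("shipped")            # retry produces a duplicate
assert events == ["shipped", "shipped"]
```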
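To make the SLO arithmetic concrete: a 99.9% availability SLO over a 30-day window allows an error budget of

$$(1 - 0.999) \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min}$$

of downtime before the objective is breached.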