Toil
Toil, as defined in Google's SRE practice, is operational work that is manual, repetitive, automatable, tactical (not strategic), devoid of enduring value, and that scales linearly with service growth. Typical examples are patching servers, manually resolving alerts, and hand-editing configs. SRE teams set an upper bound on toil (Google: 50% of an SRE's time) so that the remaining time goes to engineering work that reduces future toil.
The definition matters because it draws a sharp line between recurring operational work that could be automated (toil) and engineering work that pays down toil. Two cultural moves put the distinction into practice: tracking toil hours (SREs log time against toil categories versus engineering categories), and capping toil percentage (when toil exceeds the cap, the team pauses feature work to automate). Anti-patterns: counting all on-call time as toil (much of it is incident response that demands human judgment, which is not rote work); treating one-off operational work as toil (it isn't, because toil is repetitive); using "toil reduction" as the SRE team's excuse to avoid customer feature requests (the goal is to reduce toil, not refuse work).
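The tracking and capping steps can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular team's tooling: the TimeEntry record, the category labels, and the sample week are all hypothetical, and the 50% cap is the Google figure quoted above.

```python
from dataclasses import dataclass

# Upper bound on toil as a fraction of logged time (Google's stated figure).
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    """One logged block of work. The fields are hypothetical, for illustration."""
    hours: float
    category: str
    is_toil: bool


def toil_fraction(entries):
    """Return toil hours divided by total logged hours (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def over_cap(entries, cap=TOIL_CAP):
    """True when the logged toil fraction exceeds the cap."""
    return toil_fraction(entries) > cap


# Hypothetical 40-hour week: 22 hours of toil, 18 hours of engineering work.
week = [
    TimeEntry(10, "patching servers", True),
    TimeEntry(8, "manual alert resolution", True),
    TimeEntry(4, "hand-editing configs", True),
    TimeEntry(18, "automation project", False),
]

if __name__ == "__main__":
    print(f"toil fraction: {toil_fraction(week):.2f}")
    print("over cap" if over_cap(week) else "within cap")
```

In this sample week toil is 22 of 40 hours, which is above the cap, so under the practice described above the team would shift time toward automation.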
Related terms
- Site reliability engineering
Site reliability engineering (SRE) is a discipline, originated at Google around 2003 and codified in the 2016 SRE book, that applies software-engineering practices to operations.
- Error budget
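An error budget is the amount of unreliability a service's SLO (operational target) permits: 100% minus the SLO over a measurement window. It is distinct from the SLA (customer contract), which is usually set looser than the SLO.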
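As a worked example of the error budget arithmetic, the sketch below converts an SLO into allowed downtime, using the usual definition of the budget as 1 minus the SLO. The function name, the 30-day window, and the 99.9% target are illustrative assumptions, and the calculation treats the SLO as time-based availability.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window.

    slo is a fraction, e.g. 0.999 for "three nines". The error budget is
    the complement of the SLO, multiplied by the length of the window.
    """
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes")
```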