
Toil

Toil, as defined in Google's SRE practice, is operational work that is manual, repetitive, automatable, tactical (not strategic), without enduring value, and that scales linearly with service growth: patching servers, manually resolving alerts, hand-editing configs. SRE teams set an upper bound on toil (Google: 50% of an SRE's time) so that the remaining time goes to engineering work that reduces future toil.

The definition matters because it separates recurring operational work that could be automated (toil) from engineering work that pays that toil down. Two practices make the distinction operational: tracking toil hours (SREs log time against toil categories and engineering categories) and capping the toil percentage (when toil exceeds the cap, the team pauses feature work to automate). Anti-patterns: counting all on-call time as toil (much of it is incident response, which calls for judgment and is not repetitive); treating one-off operational work as toil (it isn't, because toil is repetitive); and using 'toil reduction' as the SRE team's excuse to avoid customer feature requests (the goal is to reduce toil, not to refuse work).
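
The tracking and capping practices can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `TimeEntry` record, the category labels "toil" and "engineering", and the function names are all hypothetical, and the 0.50 cap is simply Google's stated figure used as a default.

```python
from dataclasses import dataclass

# Assumed default: Google's stated 50% upper bound. Teams may pick another value.
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    """One logged block of time (hypothetical schema for illustration)."""
    engineer: str
    category: str  # assumed labels: "toil" or "engineering"
    hours: float


def toil_fraction(entries):
    """Return the share of logged hours that were categorized as toil."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    toil = sum(e.hours for e in entries if e.category == "toil")
    return toil / total


def over_cap(entries, cap=TOIL_CAP):
    """True when toil exceeds the cap, i.e. the team should prioritize automation."""
    return toil_fraction(entries) > cap


# Example week for one engineer: 22 hours of toil, 18 hours of engineering.
week = [
    TimeEntry("alex", "toil", 14.0),         # manually resolving alerts
    TimeEntry("alex", "toil", 8.0),          # hand-editing configs
    TimeEntry("alex", "engineering", 18.0),  # building alert auto-remediation
]

if over_cap(week):
    print("Toil is above the cap: schedule automation work.")
```

Keeping the cap as a parameter rather than a constant inside the check lets a team adopt a stricter or looser bound without changing the logic.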

Related terms