Rate limiting
Rate limiting caps the number of requests a client can make to a service within a defined window, typically expressed as 'N requests per second' or 'N requests per minute per API key'. Excess requests are either delayed (queued), rejected (usually with an HTTP 429 Too Many Requests response), or shaped (served with degraded quality).
Rate limiting serves two purposes: protecting the service from abusive or runaway clients, and ensuring fair access among well-behaved clients. Common algorithms include token bucket (a refill rate plus a burst capacity), leaky bucket (smooths bursts to a constant output rate), and sliding window log (precise but memory-heavy, because it stores a timestamp per request). Distributed rate limiting (across multiple service instances) typically uses Redis with atomic counters or a sidecar proxy such as Envoy.
The hard problem is choosing limits: too tight and legitimate traffic hits 429s; too loose and the rate limit provides no protection. The pragmatic approach is to instrument the rejection rate, watch the burst pattern of the 99th-percentile client, and set limits at the 99th percentile of legitimate usage multiplied by a 2-3x safety margin.
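As an illustration of the token bucket algorithm named above, here is a minimal single-process sketch in Python. The class name TokenBucket, the allow method, and the injectable clock parameter are assumptions made for this example and do not come from any particular library. The sketch is not thread-safe and does not cover the distributed case.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens per second,
    holding at most `capacity` tokens (the allowed burst size)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = capacity      # start full, so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens and return True if enough are available;
        otherwise leave the bucket unchanged and return False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # 5 requests/second sustained, bursts of up to 10.
    bucket = TokenBucket(rate=5.0, capacity=10.0)
    accepted = sum(bucket.allow() for _ in range(15))
    print(f"accepted {accepted} of 15 back-to-back requests")
```

A caller would typically keep one bucket per API key and respond with 429 whenever allow returns False. The refill rate sets the sustained limit, while the capacity decides how large a burst is tolerated before rejections begin.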
Related terms
- Backpressure
Backpressure is the signal a downstream component sends upstream to indicate that it cannot accept more work, explicitly slowing or rejecting incoming requests so the queue doesn't grow without bound.
- Circuit breaker
A circuit breaker is a pattern that monitors calls to a downstream service and 'trips' (stops calling) when failures exceed a threshold, returning a fallback or error immediately.
- Saturation
Saturation measures how full a system's most-constrained resource is, such as CPU, memory, IOPS, network bandwidth, queue depth, or file descriptors.