Horizontal autoscaling
Horizontal autoscaling adds or removes service instances in response to load, typically measured by CPU, memory, or a custom metric such as queue depth. In Kubernetes this is the HorizontalPodAutoscaler resource; on AWS it's an Auto Scaling group. The trade-off versus vertical autoscaling: horizontal scales further, but because any instance may end up serving any request, the workload has to be stateless or session-affine.
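As a rough illustration of how a metric becomes an instance count, the sketch below follows the proportional rule described in the Kubernetes HorizontalPodAutoscaler documentation: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a small tolerance of 1. The function name and parameters are hypothetical, and the real controller also handles missing metrics, pod readiness, and min/max bounds, which are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,  # assumption: mirrors the documented HPA default
) -> int:
    """Proportional scaling rule: size the fleet so the per-instance
    metric lands back on the target value."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Ignore small deviations so noise around the target does not
    # trigger a scaling action.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    return math.ceil(current_replicas * ratio)
```

For example, four instances averaging 90% CPU against a 60% target would be scaled to six, while the same four at 62% would be left alone.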
Horizontal autoscaling works when the bottleneck is parallelisable: more web servers handle more requests, more workers drain a queue faster. It fails when the bottleneck is centralised: adding web servers doesn't help if the database is the constraint.
A healthy autoscaling configuration does four things: it scales up fast (within 1-2 minutes of saturation) to handle traffic spikes; it scales down slowly (over 10-15 minutes) to avoid thrashing on noisy signals; it pins a minimum floor that absorbs base traffic without cold-start latency; and it pins a maximum ceiling that protects downstream dependencies from sudden 10x load. Pessimistic scaling configurations win in production; aggressive ones look efficient on paper and explode on incident day.
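The four properties of a healthy configuration can be sketched as a small decision loop. This is a minimal illustration under stated assumptions, not any platform's actual controller: `ScalingPolicy`, `Autoscaler`, and `observe` are hypothetical names, the default delays are picked from the ranges quoted in this entry, and the desired instance count is assumed to be computed elsewhere from the chosen metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    min_replicas: int = 3              # floor: absorbs base traffic without cold starts
    max_replicas: int = 20             # ceiling: protects downstream dependencies
    scale_up_delay_s: float = 60.0     # react fast to saturation
    scale_down_delay_s: float = 600.0  # shed capacity slowly to avoid thrashing


class Autoscaler:
    """Applies a change only after the signal has pointed the same way
    for the whole delay configured for that direction."""

    def __init__(self, policy: ScalingPolicy, replicas: int) -> None:
        self.policy = policy
        self.replicas = self._clamp(replicas)
        self._direction = 0                   # +1 scaling up, -1 scaling down, 0 idle
        self._since: Optional[float] = None   # when the current direction started

    def _clamp(self, n: int) -> int:
        return max(self.policy.min_replicas, min(self.policy.max_replicas, n))

    def observe(self, now: float, desired: int) -> int:
        """Feed one observation (timestamp in seconds, desired instance
        count) and return the instance count to run."""
        desired = self._clamp(desired)
        direction = (desired > self.replicas) - (desired < self.replicas)

        if direction == 0:
            # Signal agrees with the current size: drop any pending change.
            self._direction, self._since = 0, None
            return self.replicas

        if direction != self._direction:
            # Signal changed direction: restart the waiting period.
            self._direction, self._since = direction, now

        delay = (
            self.policy.scale_up_delay_s
            if direction > 0
            else self.policy.scale_down_delay_s
        )
        if now - self._since >= delay:
            self.replicas = desired
            self._direction, self._since = 0, None

        return self.replicas
```

With these defaults, a spike has to persist for one minute before instances are added, while a lull has to persist for ten minutes before any are removed, and a brief return to normal resets the scale-down timer. The asymmetry is the point: a wrongly delayed scale-down costs some idle capacity, whereas a wrongly delayed scale-up costs dropped requests.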
Related terms
- Vertical autoscaling
Vertical autoscaling resizes an existing instance — adding CPU, memory, or storage — rather than adding more instances.
- Saturation
Saturation is the measure of how full the most-constrained resource of a system is — CPU, memory, IOPS, network bandwidth, queue depth, file descriptors.
- Rate limiting
Rate limiting caps the number of requests a client can make to a service within a defined window — typically expressed as 'N requests per second' or 'N requests per minute per API key'.