
Horizontal autoscaling

Horizontal autoscaling adds or removes service instances in response to load, typically measured by CPU, memory, or a custom metric such as queue depth. In Kubernetes this is the HorizontalPodAutoscaler resource; on AWS it's an Auto Scaling group. The trade-off against vertical autoscaling: horizontal scaling can grow further, but it requires the workload to be stateless or session-affine.

May 23, 2026

Horizontal autoscaling works when the bottleneck is parallelisable: more web servers handle more requests, and more workers drain a queue faster. It fails when the bottleneck is centralised: adding web servers doesn't help if the database is the constraint. A healthy autoscaling configuration does four things: it scales up fast (within 1-2 minutes of saturation) to handle traffic spikes; it scales down slowly (over 10-15 minutes) to avoid thrashing on noisy signals; it pins a minimum floor that absorbs base traffic without cold-start latency; and it pins a maximum ceiling that protects downstream dependencies from a sudden 10x load. Pessimistic scaling configurations win in production; aggressive ones look efficient on paper and explode on incident day.
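
The four properties above (fast scale-up, slow scale-down, a floor, and a ceiling) can be illustrated with a small Python sketch. It is a toy model, not the implementation used by Kubernetes or AWS: the class name `ConservativeScaler`, its fields, and all default values are hypothetical. The proportional rule is similar in spirit to the one documented for the Kubernetes HorizontalPodAutoscaler, and the scale-down hold is a simplified stabilisation window.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConservativeScaler:
    """Toy policy (hypothetical): scale up at once, scale down only after a quiet window."""

    min_replicas: int = 2             # floor that absorbs base traffic (assumed value)
    max_replicas: int = 10            # ceiling that protects downstream services (assumed value)
    target_cpu_percent: int = 50      # utilisation each replica should sit at (assumed value)
    scale_down_window_s: int = 600    # 10 minutes of calm before shrinking (assumed value)
    _history: deque = field(default_factory=deque, repr=False)  # (timestamp, recommendation)

    def recommend(self, current: int, cpu_percent: int) -> int:
        """Proportional rule: ceil(current * observed / target), clamped to [floor, ceiling]."""
        # Integer ceiling division avoids floating-point rounding surprises.
        raw = (current * cpu_percent + self.target_cpu_percent - 1) // self.target_cpu_percent
        return max(self.min_replicas, min(self.max_replicas, raw))

    def decide(self, now: int, current: int, cpu_percent: int) -> int:
        """Return the replica count to run at time `now` (seconds)."""
        rec = self.recommend(current, cpu_percent)
        self._history.append((now, rec))
        # Forget recommendations older than the scale-down window.
        while self._history[0][0] < now - self.scale_down_window_s:
            self._history.popleft()
        if rec >= current:
            return rec  # scale up (or hold) immediately
        # Scale down only as far as the highest recommendation seen in the window.
        return min(current, max(r for _, r in self._history))


scaler = ConservativeScaler()
after_spike = scaler.decide(now=0, current=4, cpu_percent=75)      # load rises: grow now
after_dip = scaler.decide(now=60, current=6, cpu_percent=20)       # brief dip: hold
after_calm = scaler.decide(now=700, current=6, cpu_percent=20)     # sustained calm: shrink
print(after_spike, after_dip, after_calm)
```

In this sketch the spike at `now=0` is acted on immediately, the dip one minute later is ignored because a higher recommendation is still inside the window, and the shrink happens only once that recommendation has aged out. A real controller would also need metric smoothing, a tolerance band, and rate limits on how many replicas it adds per step.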