17: Scaling & Resilience (HPA, Probes, PDB, Retries, Graceful Shutdown)¶
This section teaches you how to make your microservices resilient, scalable, and stable under load, traffic spikes, network failures, and rolling updates.
We cover:
• Horizontal Pod Autoscaling (HPA)
• Resource Requests & Limits
• Liveness & Readiness Probes
• PodDisruptionBudget (PDB)
• Retry & backoff strategies
• Graceful shutdown & connection draining
• How Envoy balances traffic under scaling events
• Anti-patterns to avoid
This is essential for taking LocalCloudLab from “works” to “production-grade.”
17.1 Why Scaling & Resilience Matter¶
A microservice is resilient when it can:
✓ survive high traffic
✓ recover from failures automatically
✓ scale up when needed
✓ scale down to save costs
✓ restart safely without losing work
✓ avoid cascading failures
✓ stay responsive under partial outages
A microservice is scalable when:
✓ horizontal scaling is predictable
✓ CPU/memory patterns are understood
✓ workloads are measurable and observable
Kubernetes provides these guarantees only if services are configured correctly.
17.2 Resource Requests & Limits¶
Every Deployment must define:
• requests (minimum resources guaranteed)
• limits (maximum allowed resources)
Example for Search API:
```yaml
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "600m"
    memory: "512Mi"
```
Why do we need requests?¶
Kubernetes uses requests to schedule pods onto nodes. If you don’t define them, pods may be:
✗ scheduled poorly
✗ evicted due to low memory
✗ competing for CPU
Why do we need limits?¶
Limits prevent:
✗ runaway CPU spikes
✗ memory leaks killing the node
✗ noisy neighbors
For LocalCloudLab, recommended defaults:
• Search API: CPU 200m–600m, RAM 256–512Mi
• Checkin API: CPU 150m–500m, RAM 256–512Mi
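The Checkin API defaults above, expressed as a manifest fragment (a starting point; tune the numbers against your own load tests):

```yaml
resources:
  requests:
    cpu: "150m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```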
17.3 Horizontal Pod Autoscaling (HPA)¶
HPA adjusts the number of pods based on metrics.
Enable metrics server (k3s includes it by default):
kubectl get deployment metrics-server -n kube-system
Example: HPA for Search API¶
Create:
k8s/search-api/hpa.yaml
Contents:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: search-api-hpa
  namespace: search
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: search-api-deployment
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Meaning:
• Start with 2 pods
• Scale up to 6
• Trigger scaling when average CPU utilization exceeds 70% of the pods' CPU request
Search traffic is often spiky → HPA helps smooth load.
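Spiky traffic can also make the HPA flap (rapid scale up/down). The `autoscaling/v2` API exposes a `behavior` block to damp this; a sketch, with the window and policy values as illustrative assumptions:

```yaml
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
        - type: Pods
          value: 2                     # add at most 2 pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
```

The asymmetry is deliberate: scale up fast to protect latency, scale down slowly to avoid thrashing when traffic oscillates.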
17.4 Readiness & Liveness Probes¶
Probes tell Kubernetes when:
• the service is ready to accept traffic (readiness)
• the service is healthy (liveness)
Readiness Probe¶
Search API should NOT accept traffic until:
• Database connection is initialized
• Redis is reachable
• Startup completed
Example:
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```
Liveness Probe¶
Detects crashes or deadlocks:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 80
  periodSeconds: 10
  failureThreshold: 5
```
Why do both matter?¶
• If readiness fails → the pod is removed from load balancing
• If liveness fails → the pod is restarted
17.5 Startup Probe (Recommended)¶
Protects against slow startup causing false failures.
Example:
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 80
  failureThreshold: 30
  periodSeconds: 5
```
This gives up to 30 × 5 = 150 seconds before Kubernetes kills the pod during slow boot.
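Putting the three probes together, the container spec ends up looking like this (same endpoints as above; note that Kubernetes disables the readiness and liveness probes until the startup probe succeeds):

```yaml
containers:
  - name: search-api
    ports:
      - containerPort: 80
    startupProbe:
      httpGet: { path: /health/startup, port: 80 }
      failureThreshold: 30
      periodSeconds: 5
    readinessProbe:
      httpGet: { path: /health/ready, port: 80 }
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet: { path: /health/live, port: 80 }
      periodSeconds: 10
      failureThreshold: 5
```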
17.6 PodDisruptionBudget (PDB)¶
Prevents Kubernetes or cluster events from killing too many pods at once.
Example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: search-api-pdb
  namespace: search
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: search-api
```
Meaning:
• At least 1 pod must always be running
• Protects against node drains, updates, auto-repairs
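With only 2 replicas, minAvailable: 1 still lets a drain take half your capacity. An alternative is to cap disruptions instead, which scales naturally as replica counts grow:

```yaml
spec:
  maxUnavailable: 1   # never disrupt more than one pod at a time
  selector:
    matchLabels:
      app: search-api
```

A PDB may specify either minAvailable or maxUnavailable, not both.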
17.7 Retry & Backoff Strategy (Application & Envoy)¶
Application-Level Retry (C#)¶
Use Polly via the Microsoft.Extensions.Http.Polly package (or a custom HttpClient retry handler):
Example:
```csharp
services.AddHttpClient("RemoteSearch")
    .AddTransientHttpErrorPolicy(p => p.WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(200 * attempt)));
```
Envoy BackendPolicy Retry¶
```yaml
retry:
  attempts: 3
  retryOn:
    - gateway-errors
    - connection-failure
  perTryTimeout: 2s
```
Use retries carefully:
✔ transient failures
✗ persistent failures (causing overload)
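For reference, the equivalent policy in raw Envoy route configuration looks like this (field names per Envoy's route `RetryPolicy`; the backoff intervals are illustrative assumptions):

```yaml
route:
  cluster: search-api
  retry_policy:
    retry_on: "5xx,connect-failure,reset"
    num_retries: 3
    per_try_timeout: 2s
    retry_back_off:
      base_interval: 0.1s
      max_interval: 1s
```

The `retry_back_off` block spaces retries out exponentially, which limits the extra load retries add during a partial outage.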
17.8 Graceful Shutdown in .NET¶
A service must:
• finish in-flight requests
• reject new requests
• close RabbitMQ connections
• flush logs
Program.cs:
```csharp
builder.WebHost.ConfigureKestrel(opt =>
{
    opt.AddServerHeader = false;
    opt.Limits.KeepAliveTimeout = TimeSpan.FromSeconds(30);
});

app.Lifetime.ApplicationStopping.Register(() =>
{
    // cleanup logic
    rabbitMqChannel?.Close();
    rabbitMqConnection?.Close();
});
```
By default the .NET generic host waits up to 5 seconds for graceful shutdown (configurable via HostOptions.ShutdownTimeout). Kubernetes separately grants the pod terminationGracePeriodSeconds (30 seconds by default) before sending SIGKILL — make sure the host's shutdown timeout fits inside it.
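On the Kubernetes side, a preStop hook plus an adequate grace period gives Envoy time to stop routing to the pod before Kestrel shuts down. A sketch (the sleep length is an assumption; size it to your endpoint-propagation delay):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: search-api
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]  # keep serving while endpoint lists update
```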
17.9 Envoy + HPA Interaction¶
When traffic spikes:
1. Envoy receives more incoming requests
2. CPU rises on pods
3. HPA detects > 70% CPU
4. HPA creates new pods
5. Readiness probe ensures new pods are warmed up
6. Envoy distributes traffic to new pods
This delivers smooth, automated scaling.
17.10 Anti-Patterns to Avoid¶
❌ No resource limits → Risk: Node OOM kills entire cluster
❌ No readiness probes → Risk: Unhealthy pods receive traffic → timeouts
❌ Retry everything indefinitely → Risk: Cascading failures & overload
❌ Log every request → Risk: Loki storage explosion
❌ HPA with maxReplicas=1 → Zero scalability
17.11 Summary of Section 17¶
You now have:
✔ Resource requests & limits for stability
✔ Horizontal autoscaling with metrics
✔ Full probe strategy (startup, readiness, liveness)
✔ Pod disruption protection (PDB)
✔ Retry & backoff strategies
✔ Graceful shutdown and connection draining
✔ A resilient, production-ready microservice cluster
Up next:
Section 18 — Storage, Backups & Disaster Recovery (PostgreSQL, Redis, RabbitMQ)