Load Balancing for System Design Interviews: L4 vs L7, Algorithms, and Failover
The load balancer slide is a staff-level smell test. Here is how to pick L4 vs L7, name the algorithm, handle health checks, and not get caught on sticky sessions.
Ask a mid-level candidate to draw a system and they put a box labeled "LB" in front of the web tier and move on. Ask a staff candidate and they say "an L7 reverse proxy with consistent hashing on session ID, with a health check that actually exercises the downstream DB." The gap is enormous and the interviewer is listening for exactly that gap.
This is the load balancer conversation for FAANG-tier system design loops. It assumes you know what a load balancer does. The signal is in the details: layer, algorithm, failover, health checks, and the surprisingly ugly edge cases like slow-start and hot shards.
L4 vs L7: the decision the interviewer wants you to justify
The first question the interviewer is waiting for you to answer is what layer you're balancing at. Get this wrong and everything downstream is nonsense.
L4 (transport layer) load balancing operates on TCP or UDP. It forwards packets based on the 5-tuple (source IP, source port, dest IP, dest port, protocol). It does not look at the payload. This is what AWS Network Load Balancer, Google Maglev, and HAProxy in TCP mode do. Pros: extremely low latency (microseconds), high throughput (millions of connections per second), protocol-agnostic, and cheap. Cons: no content-based routing, no TLS termination at the balancer (usually), no HTTP-aware features like path-based routing or header manipulation.
L7 (application layer) load balancing operates on HTTP, gRPC, or another application protocol. It reads the request, can route based on path, headers, cookies, or method, can terminate TLS, do HTTP/2 multiplexing, and apply WAF rules. AWS Application Load Balancer, Envoy, NGINX, HAProxy in HTTP mode, and Google Cloud HTTP(S) Load Balancer all operate here. Pros: rich routing, observability, protocol intelligence. Cons: higher latency (hundreds of microseconds to low milliseconds), more CPU per request, more complex failure modes.
The interview-worthy answer is usually "L4 at the edge for raw throughput, L7 inside the mesh for smart routing." This is how Netflix, Stripe, and Cloudflare structure it. The L4 tier absorbs connection floods; the L7 tier does the intelligent work.
Use L4 when you're balancing non-HTTP traffic (Postgres, MySQL, MQTT, custom binary protocols), when latency matters more than flexibility, or when you need to preserve the client IP trivially (NLB with target type ip preserves source IP without PROXY protocol hackery).
Use L7 when you need path-based routing (/api/ to one backend, /static/ to another), canary deploys (route 5% of traffic to the new version via header), A/B testing, or per-request authentication. If the interviewer mentions microservices, you need L7 internally — probably Envoy in a service mesh.
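To make the layer distinction concrete, here is a minimal Go sketch (backend addresses are hypothetical): an L4 balancer only copies bytes between sockets, while an L7 balancer parses the request and routes on its contents.

```go
package lbsketch

import (
	"io"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// l4Forward proxies raw TCP: dial a backend, then copy bytes both ways.
// No knowledge of HTTP, gRPC, or anything above the transport layer.
func l4Forward(client net.Conn, backendAddr string) {
	backend, err := net.Dial("tcp", backendAddr)
	if err != nil {
		client.Close()
		return
	}
	defer client.Close()
	defer backend.Close()
	go io.Copy(backend, client) // client -> backend
	io.Copy(client, backend)    // backend -> client
}

// l7Router parses the request and routes on its path — impossible at L4,
// where the payload is opaque bytes.
func l7Router() http.Handler {
	api, _ := url.Parse("http://api-pool.internal:8080")    // hypothetical pool
	static, _ := url.Parse("http://cdn-pool.internal:8080") // hypothetical pool
	mux := http.NewServeMux()
	mux.Handle("/api/", httputil.NewSingleHostReverseProxy(api))
	mux.Handle("/", httputil.NewSingleHostReverseProxy(static))
	return mux
}
```

Notice that l4Forward never allocates an HTTP parser: that is where the microseconds-vs-milliseconds latency gap comes from.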
Algorithms and when each wins
Know at least these six and when to suggest each:
- Round robin. Distributes requests cyclically. Simple, stateless, terrible when request costs vary. Fine for cache-like workloads with uniform requests, bad for search or ML inference.
- Weighted round robin. Same, but backends get different shares based on capacity. Useful during canary deploys (weight the new version at 5%) or mixed instance types.
- Least connections. Routes to the backend with the fewest open connections. Good for long-lived connections (websockets, gRPC streams). Requires the LB to track connection state, so doesn't scale as cleanly as round robin.
- Least request / least outstanding. Like least connections but counts in-flight requests. Envoy's default for HTTP/2. Wins when request cost is variable.
- Random with two choices (P2C). Pick two backends at random, route to the less loaded of the two. Famous result from the power-of-two-choices literature: nearly matches least-connections with almost no coordination cost. It's the algorithm Envoy and most modern meshes use. Cite the paper if you want to show off (Mitzenmacher, 2001); a minimal sketch follows this list.
- Consistent hashing. Maps request keys to backends on a ring so that most keys stay on the same backend even as the pool changes. Critical for cache affinity (route by user ID so the user's session ends up on the same cache-warm server) and for sharded workloads. The 2016 Maglev paper is the reference; Ketama was the older canon. Know what "virtual nodes" are and why naive consistent hashing has load imbalance issues without them — see the ring sketch at the end of this section.
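A minimal Go sketch of P2C, assuming each backend tracks an in-flight request counter. The Backend type and field names are illustrative, not any particular proxy's API.

```go
package lbsketch

import "math/rand"

type Backend struct {
	Addr     string
	Inflight int64 // current outstanding requests
}

// pickP2C samples two distinct backends uniformly at random and returns the
// one with fewer outstanding requests. This captures most of the benefit of
// least-request without a global scan or any coordination between balancers.
// Assumes a non-empty pool.
func pickP2C(pool []*Backend) *Backend {
	if len(pool) == 1 {
		return pool[0]
	}
	i := rand.Intn(len(pool))
	j := rand.Intn(len(pool) - 1)
	if j >= i { // shift to guarantee j != i
		j++
	}
	if pool[j].Inflight < pool[i].Inflight {
		return pool[j]
	}
	return pool[i]
}
```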
Special mentions: latency-based routing (route to the backend with the lowest observed p95), resource-aware routing (the backend publishes CPU/memory and the LB weights accordingly; see Google's Maglev and Slicer papers), and geographic routing (AWS Route 53 latency-based or geolocation routing).
When an interviewer asks "which algorithm," the bad answer is "round robin." The staff answer is "P2C least-request for stateless HTTP, consistent hashing with virtual nodes for sharded backends, least-connections for long-lived streams."
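And the promised consistent-hash ring with virtual nodes, a minimal sketch using FNV purely for illustration — production rings use stronger hashes and hundreds of vnodes per backend.

```go
package lbsketch

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // point -> backend address
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places vnodes points per backend. Virtual nodes smooth out the
// load imbalance a naive one-point-per-backend ring suffers from.
func NewRing(backends []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, b := range backends {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", b, v))
			r.owner[p] = b
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise to the first point at or after the key's hash,
// wrapping around at the end of the ring.
func (r *Ring) Lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```

Try rebuilding the ring with one backend removed and re-running lookups: with enough virtual nodes, only the keys that hashed to that backend's points move.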
Health checks and failover
This is the section that separates candidates who have operated systems from those who have only drawn diagrams.
Your health check must actually exercise the dependency chain, not just return 200 from a trivial endpoint. A web server that responds 200 while the DB connection pool is exhausted is useless. The canonical pattern:
- Liveness: "am I running?" Cheap. Checks process is alive.
- Readiness: "am I ready to take traffic?" Checks DB connectivity, cache reachability, downstream dependencies. Returns non-200 during warmup, config reload, or dependency failure.
- Deep health: "can I actually do my job?" Exercises a canary request end-to-end. Run less often to avoid load, but essential for catching silent failures.
The LB should gate traffic on readiness, not liveness. Kubernetes splits these into livenessProbe (restart the pod) and readinessProbe (remove the pod from Service endpoints). Know the distinction.
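A minimal Go sketch of the liveness/readiness split, assuming a *sql.DB handle as the dependency worth exercising. The point is that readiness touches real dependencies while liveness stays trivial.

```go
package lbsketch

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

func registerHealth(mux *http.ServeMux, db *sql.DB) {
	// Liveness: process is up. Never checks dependencies, so a DB outage
	// doesn't trigger a pointless restart loop.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: can we actually serve? Ping the DB with a short timeout
	// and return 503 so the LB pulls this instance from rotation.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "db unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```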
Failover specifics:
- Outlier detection. Envoy and Istio evict a backend after N consecutive failures. Set thresholds that avoid flapping.
- Slow start / warm-up. When a new backend joins, route a small fraction of traffic to it for the first 30-60 seconds so it can warm caches and JITs. NGINX has slow_start; Envoy has slow_start_config.
- Panic mode. When too many backends are unhealthy, stop ejecting and spray traffic across everything rather than overload the survivors. Envoy has this built in, with a 50% threshold by default.
- Connection draining. On deploy, the LB stops sending new requests but lets in-flight requests complete. Defaults vary: Kubernetes gives pods a 30-second grace period, while ALB's deregistration delay defaults to 300 seconds.
- Retries and hedging. The LB can retry idempotent requests, or hedge by sending the same request to two backends and taking the first response (famous from Jeff Dean's "The Tail at Scale" paper). A minimal hedging sketch follows this list.
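A minimal Go hedging sketch, assuming idempotent GETs and hypothetical URLs. Real hedging per "The Tail at Scale" delays the second request until the first exceeds, say, p95 latency; this version fires all attempts immediately for brevity.

```go
package lbsketch

import (
	"context"
	"net/http"
)

// hedgedGet races the same request against several backends and returns the
// first success. The buffered channel lets losing goroutines exit even after
// the function has returned.
func hedgedGet(parent context.Context, urls []string) (*http.Response, error) {
	type result struct {
		idx  int
		resp *http.Response
		err  error
	}
	cancels := make([]context.CancelFunc, len(urls))
	ch := make(chan result, len(urls))
	for i, u := range urls {
		ctx, cancel := context.WithCancel(parent)
		cancels[i] = cancel
		go func(i int, u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
			if err != nil {
				ch <- result{i, nil, err}
				return
			}
			resp, err := http.DefaultClient.Do(req)
			ch <- result{i, resp, err}
		}(i, u)
	}
	var lastErr error
	for range urls {
		r := <-ch
		if r.err == nil {
			// First success wins: abandon the slower attempts. The winner's
			// cancel must be called by the caller after resp.Body is closed
			// (omitted here for brevity).
			for i, c := range cancels {
				if i != r.idx {
					c()
				}
			}
			return r.resp, nil
		}
		lastErr = r.err
	}
	for _, c := range cancels {
		c()
	}
	return nil, lastErr
}
```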
For multi-region failover, name the tools: Route 53 health checks with latency-based routing, GCP Global LB with automatic regional failover, or a dedicated control plane like Cloudflare's Anycast network. Know that DNS-based failover has TTL-driven lag (clients cache DNS) and that true fast failover requires Anycast or BGP.
When you should NOT add a load balancer
Not every box diagram needs an LB. The cases where it's wrong:
- Intra-datacenter service mesh with sidecars already. Envoy running as a sidecar is already doing client-side load balancing. Putting an additional LB in front is a wasted hop.
- A single instance system that's nowhere near capacity. The LB adds latency, a failure mode, and operational cost. For a 100 req/s internal tool, skip it.
- UDP streaming with session affinity needs. Sometimes direct server return (DSR) or client-side discovery is better than an inline LB.
- When the "load balancer" is actually a proxy. If the only thing you need is TLS termination, that's a proxy concern, not a balancing concern. Don't conflate.
- gRPC with HTTP/2 multiplexing. A naive L4 LB will pin an entire gRPC client to one backend for the lifetime of the connection. You either need L7 balancing or client-side load balancing (gRPC's built-in round_robin policy; see the sketch below) — using an L4 balancer is the wrong answer and interviewers love catching this.
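For reference, a minimal grpc-go sketch of the client-side fix (the target address is hypothetical). Without the service config, grpc-go defaults to pick_first, which reproduces exactly the pinning problem described above.

```go
package lbsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dialBalanced() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///orders.internal:50051", // dns resolver watches all A/AAAA records
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// Spread RPCs across every resolved backend instead of pinning to one.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig":[{"round_robin":{}}]}`),
	)
}
```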
Real-world example: Google Maglev
Maglev is the L4 software load balancer Google runs at the edge. It routes traffic for Google Search, Gmail, and YouTube. The 2016 paper is worth reading before any interview that might touch load balancing.
Key ideas: packet-level consistent hashing via a lookup table rather than a ring, so connection state doesn't need to be shared across Maglev instances. Each Maglev independently computes the same table and therefore picks the same backend for a given flow. This lets Google scale the LB tier horizontally without ECMP-induced connection breakage when an LB instance fails, and the table-population algorithm ensures that removing a backend only remaps a small fraction of flows instead of rehashing everyone.
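A simplified Go sketch of the paper's table-population idea (Section 3.4 of the Maglev paper), with FNV standing in for the paper's hash functions. Each backend gets a pseudorandom permutation of table slots; backends take turns claiming their next unclaimed slot, which fills the table nearly evenly while keeping most slots stable when the backend set changes.

```go
package lbsketch

import "hash/fnv"

func maglevHash(s string, seed byte) uint64 {
	h := fnv.New64a()
	h.Write([]byte{seed})
	h.Write([]byte(s))
	return h.Sum64()
}

// buildTable returns a slice of length m (m must be prime, larger than the
// backend count) where table[slot] = backend index.
func buildTable(backends []string, m uint64) []int {
	n := len(backends)
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	next := make([]uint64, n)
	for i, b := range backends {
		offset[i] = maglevHash(b, 0) % m
		skip[i] = maglevHash(b, 1)%(m-1) + 1 // coprime step walks every slot
	}
	table := make([]int, m)
	for j := range table {
		table[j] = -1 // unclaimed
	}
	for filled := uint64(0); filled < m; {
		for i := 0; i < n && filled < m; i++ {
			// Walk backend i's permutation until we find a free slot.
			c := (offset[i] + next[i]*skip[i]) % m
			for table[c] >= 0 {
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			table[c] = i
			next[i]++
			filled++
		}
	}
	return table
}
```

Routing is then table[flowHash % m]; because every Maglev builds the identical table from the same backend list, any instance can handle any packet.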
Other canonical examples:
- AWS ALB. L7, path and header routing, integrates with WAF and Cognito.
- AWS NLB. L4, millions of connections per second, preserves source IP, TLS termination option.
- Envoy. The de facto service mesh data plane. Powers Istio, Consul Connect, and AWS App Mesh.
- HAProxy. The old warhorse. Still excellent at L4 and L7. Cloudflare and GitHub use it heavily.
- Cloudflare's Unimog. L4 consistent-hash LB similar to Maglev. Read the blog post if you want a modern treatment of connection persistence under scaling events.
- Facebook's Katran. L4 LB built on XDP/eBPF, pushing packet forwarding into the kernel for massive throughput.
Common candidate mistakes
The patterns that drop your rating:
- Putting an LB in the diagram without naming the layer. "LB" with no qualification is a red flag.
- Choosing sticky sessions without justifying it. Session stickiness is a last resort for legacy apps. Real systems push session state to Redis or a JWT and keep the app stateless. If you say "sticky sessions," the interviewer will push.
- Ignoring TLS termination and re-encryption. Where does TLS terminate? At the edge LB? At each backend? Answer explicitly, and know about mTLS inside the mesh.
- Forgetting health checks or making them trivial. /ping returning 200 is not a health check.
- Missing connection draining on deploy. Users get 502s during rolling deploys if you don't mention this.
- Not knowing that gRPC over L4 is broken. A common gotcha. Speak to it preemptively.
- Assuming one global LB solves everything. Cross-region traffic has 100ms+ RTT. Keep traffic regional when possible.
- No capacity math. "How many requests per second can this LB handle?" The answer for ALB is around 100K RPS per ALB, NLB scales to millions. Envoy on modern hardware does 50K-100K RPS per core. Know orders of magnitude.
Advanced follow-ups interviewers will ask
- "How do you handle a thundering herd after a backend restarts?" Slow start, warm-up traffic weighting, outlier detection thresholds tuned to avoid flapping.
- "What happens when the LB itself fails?" Anycast IPs, active-active LB pairs, ECMP at the router level, or DNS-based fallback.
- "How do you do canary deploys at the LB layer?" Header-based routing, weighted backends, or a dedicated canary subset. Mention feature flags as a complement.
- "How do you handle long-lived connections like WebSockets or SSE?" Least-connections algorithm, longer idle timeouts, and be aware of connection draining complexity.
- "How do you prevent a single bad client from overwhelming the LB?" Per-IP rate limits, connection limits, token bucket at the edge, and WAF rules.
- "How does DNS fit in?" Client first resolves DNS, which may return an Anycast IP or regional IPs. TTL matters for failover speed. Short TTL equals fast failover but higher DNS QPS.
The candidate who wins the load balancer section is the one who treats it as five decisions stacked on top of each other: layer, algorithm, health check, failover, and capacity. Name each one explicitly, cite a real system when useful, and acknowledge the failure modes. Do that and you'll be in the top quartile of every system design loop you walk into.
Related guides
- Caching Strategies for System Design Interviews: Write-Through, Write-Back, and TTL Patterns — The caching section of a FAANG system design loop is where mediocre candidates blur together. Here's how to name tradeoffs, pick a pattern on purpose, and survive the hot-key follow-up.
- Rate Limiting for System Design Interviews: Token Bucket, Leaky Bucket, and Sliding Window — Rate limiting questions separate candidates who memorized a diagram from engineers who've actually run one in production. Here's how to pick an algorithm on purpose and survive the distributed-coordination follow-up.
- Backend System Design Interview Cheatsheet in 2026 — Patterns, Examples, Practice Plan, and Common Traps — A backend System Design interview cheatsheet for 2026 with the core flow, architecture patterns, capacity heuristics, reliability tradeoffs, and traps that separate senior answers from vague box drawing.
- CQRS Interview Guide: When to Split Commands and Queries in a System Design — CQRS is the pattern candidates propose to sound sophisticated and then can't justify. Here is when to actually split reads and writes, what it buys you, and the price you pay for it.
- Design System Interview Guide — Tokens, Components, and Governance Questions — A practical design system interview guide covering design tokens, component APIs, accessibility, governance, adoption, contribution models, and how to answer system-maturity questions with credibility.
