Rate Limiting for System Design Interviews: Token Bucket, Leaky Bucket, and Sliding Window

9 min read · April 25, 2026

Rate limiting questions separate candidates who memorized a diagram from engineers who've actually run one in production. Here's how to pick an algorithm on purpose and survive the distributed-coordination follow-up.

Every senior and staff system design loop at Stripe, Cloudflare, Google, AWS, and the usual FAANG-tier shops asks you to design a rate limiter at some point. It is usually framed as a standalone problem ("design a rate limiter for a public API") or smuggled into a larger question ("we have a spam problem on this endpoint — what do you do?"). Either way, candidates who treat it as "we'll count requests in Redis" fall apart the moment the interviewer asks about clock skew, bursts, or multi-region consistency.

This guide is the version of the rate-limiting conversation I wish every staff-plus candidate walked into. The goal is not to memorize every algorithm — it is to pick one on purpose, name what it breaks on, and have an answer ready when the interviewer pushes on coordination.

What interviewers actually want to hear

The senior signal is distinguishing four decisions as four separate decisions: the algorithm, the storage backend, the coordination model, and the enforcement point.

Most candidates merge all four into "Redis with INCR and a TTL." That works for a toy. It does not work for Stripe's public API, and interviewers know it.

The algorithms you must be able to name and defend

  • Fixed window counter. Increment a counter keyed by (user_id, floor(now / window_size)). Simple, O(1), low memory. Classic failure mode: two bursts at the boundary can produce 2x the configured rate — if your limit is 100/minute, a client can send 100 at 10:00:59.999 and 100 at 10:01:00.001 and sail through.
  • Sliding window log. Store a sorted set of request timestamps per client and count entries newer than now - window. Precise, but memory is O(N) per client per window — Redis sorted sets at scale get painful.
  • Sliding window counter. Weighted blend of the current and previous fixed windows. Approximates a true sliding window with O(1) memory. This is what Cloudflare describes in their 2017 blog post and what most production systems actually use.
  • Token bucket. A bucket holds up to capacity tokens and refills at rate tokens per second. Every request consumes a token; if the bucket is empty, the request is rejected or queued. Allows bursts up to capacity (a minimal sketch follows this list). This is what AWS API Gateway, Envoy's local_ratelimit, and Stripe's public API use.
  • Leaky bucket. Requests enter a queue that drains at a constant rate. Smooths bursts into a steady output rate. Used in traffic shaping, not usually in API rate limiting — if a user blasts 1000 requests, you do not want to hold them in memory for five minutes.
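
To make the token-bucket semantics concrete, here is a minimal single-process sketch in Python. Names and numbers are illustrative, not taken from any of the systems above:

import time

class TokenBucket:
    """Refills continuously; allows bursts up to capacity, then throttles
    to refill_rate tokens per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full so a new client can burst
        self.last = time.monotonic()    # monotonic clock: immune to wall-clock jumps

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Credit tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True

# 100 requests/minute steady state, bursts of up to 20 allowed.
bucket = TokenBucket(capacity=20, refill_rate=100 / 60)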

If you can name these five, explain the tradeoff in one sentence each, and pick one for the stated workload, you are already in the top quartile.

When to pick what

  • Bursty but bounded API traffic (most public APIs) — token bucket. Allows a human-friendly burst (double-clicking a button, retry storms) without tripping.
  • Hard steady-state throughput (video encoding, batch pipelines) — leaky bucket or strict sliding window.
  • Coarse, cheap, good-enough protection (internal endpoints) — fixed window. It's fine.
  • Precise billing or compliance-grade (PCI, regulated quotas) — sliding window log. Pay the memory cost.
  • Large fleet, eventual consistency acceptable — sliding window counter with local approximations synced to a central store.
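
That last option leans on the sliding window counter's weighted blend, which is worth being able to write down. A minimal sketch in Python, assuming you already track per-client counts for the previous and current fixed windows:

import time

def sliding_window_allow(prev_count, curr_count, window_s, limit, now=None):
    """Weight the previous window by how much of it still overlaps the
    sliding window ending now, then add the current window's count."""
    now = time.time() if now is None else now
    elapsed_frac = (now % window_s) / window_s
    estimated = prev_count * (1.0 - elapsed_frac) + curr_count
    return estimated < limit

# 15 s into a 60 s window with a 100/min limit:
# estimate = 40 * 0.75 + 30 = 60, so the request is allowed.
print(sliding_window_allow(prev_count=40, curr_count=30, window_s=60, limit=100, now=15))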

The architecture you need to draw

client -> edge (L7) -> service A -> service B
              |            |           |
              v            v           v
         [local LRU]  [local LRU] [local LRU]
               \           |          /
                +--> Redis cluster <--+
                     (or Memcached,
                      DynamoDB, etc.)

Name three enforcement points and pick one:

  1. Edge / API gateway (Envoy, Kong, Cloudflare, AWS API Gateway, nginx limit_req). Cheapest to run, furthest from the user's real identity, great for crude per-IP DDoS absorption.
  2. Per-service middleware (interceptor, servlet filter, Express middleware). Knows user identity and endpoint cost. Default place for API quotas.
  3. Downstream resource limits (database connection pools, bulkheads in Hystrix/Resilience4j). Last line of defense.

A good answer names multiple layers. "Cloudflare handles volumetric IP-based limiting; our Envoy sidecars enforce per-tenant token buckets; Postgres connection pools are the final backstop."

Distributed coordination — where interviews get hard

If you run one process, a rate limiter is a hashmap. The entire difficulty of the problem is that you run N processes and they must agree, approximately, on how many requests a client has made.

Three real approaches:

Centralized counter in Redis. Every request hits Redis. Use INCR with EXPIRE for a fixed window, or a Lua script for atomic token-bucket updates. Simple and accurate, but every request pays one round trip to Redis (~1 ms if colocated, much worse cross-AZ), and Redis becomes a single point of failure. Stripe's early rate limiter worked this way; their 2018 engineering blog post walks through the Lua script they use.
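
The fixed-window variant of this is a few lines with redis-py. A sketch (key naming and limits are illustrative, and it assumes Redis is reachable):

import time

import redis

r = redis.Redis()

def allow(client_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """One round trip: INCR the counter for the current window and arm a
    TTL so stale windows expire on their own."""
    window = int(time.time()) // window_s
    key = f"rl:{client_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, 2 * window_s)   # outlive the window, then vanish
    count, _ = pipe.execute()
    return count <= limit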

Local counters with periodic sync. Each instance keeps a local count and flushes deltas to a central store every T ms. Cheap per request but accuracy degrades as T grows. Cloudflare, Envoy, and Google's quota system all do variations of this.
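
A deliberately simplified sketch of that pattern (window expiry, error handling, and drift bounds omitted; the class and key names are made up for illustration):

import threading
import time
from collections import Counter

import redis

class SyncedCounter:
    """Answer checks from a local view; flush deltas to Redis every
    sync_period_s. Cheap per request, but the view can lag by up to one
    sync period, so short bursts may slip past the limit."""

    def __init__(self, r: redis.Redis, sync_period_s: float = 0.1):
        self.r = r
        self.deltas = Counter()     # increments not yet pushed to Redis
        self.view = {}              # last global count seen per key
        self.lock = threading.Lock()
        threading.Thread(target=self._sync, args=(sync_period_s,),
                         daemon=True).start()

    def incr_and_estimate(self, key: str) -> int:
        with self.lock:
            self.deltas[key] += 1
            return self.view.get(key, 0) + self.deltas[key]

    def _sync(self, period: float):
        while True:
            time.sleep(period)
            with self.lock:
                pending, self.deltas = self.deltas, Counter()
            for key, delta in pending.items():
                total = self.r.incrby(key, delta)   # push delta, learn global total
                with self.lock:
                    self.view[key] = total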

Gossip / CRDT-based. PN-Counters or G-Counters replicated across nodes. Accurate eventually, complex to operate. Rare in production rate limiters but shows up in some mesh implementations.

The tradeoff you must name: accuracy vs. latency vs. availability. You cannot have all three. If you need exactly-enforced limits, you pay latency to a central store. If you need ultra-low latency, you accept over-limits during sync lag. If you need availability when Redis dies, you pre-commit to fail-open or fail-closed and say so out loud.

A token bucket in Redis Lua

Atomic, single-round-trip, handles bursts correctly:

-- KEYS[1] = bucket key
-- ARGV[1] = capacity  ARGV[2] = refill_rate (per sec)
-- ARGV[3] = now_ms    ARGV[4] = cost
local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
-- Unknown key: start with a full bucket stamped at the caller's clock.
local tokens = tonumber(data[1]) or tonumber(ARGV[1])
local ts = tonumber(data[2]) or tonumber(ARGV[3])
-- Credit tokens accrued since the last update; clamp at zero so a
-- caller with a lagging clock cannot drain the bucket backwards.
local delta = math.max(0, (tonumber(ARGV[3]) - ts) / 1000.0) * tonumber(ARGV[2])
tokens = math.min(tonumber(ARGV[1]), tokens + delta)
local cost = tonumber(ARGV[4])
if tokens < cost then
    redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
    redis.call('PEXPIRE', KEYS[1], 60000)  -- also expire keys that only ever get rejected
    return 0
end
tokens = tokens - cost
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[3])
redis.call('PEXPIRE', KEYS[1], 60000)
return 1

Write this out in the interview. It takes two minutes and immediately signals that you have actually shipped this before.
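
Wiring it into application code is one call with redis-py, which loads the script once and then invokes it via EVALSHA. A sketch, assuming TOKEN_BUCKET_LUA holds the script above as a string:

import time

import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # the Lua script above

def allow(api_key: str, capacity: int = 100, refill_rate: int = 10,
          cost: int = 1) -> bool:
    now_ms = int(time.time() * 1000)
    return token_bucket(keys=[f"rl:{api_key}"],
                        args=[capacity, refill_rate, now_ms, cost]) == 1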

When you should NOT build your own

Senior candidates push back when building is wrong. The bad cases:

  • You're on a cloud provider with a managed edge. AWS API Gateway, Cloudflare, Fastly, and GCP Cloud Armor all ship rate limiting. Use them unless you have a specific reason not to.
  • You need per-user quotas on a tiny service. A fixed-window counter in process memory is fine. Don't stand up Redis for 50 RPS.
  • The real problem is abuse, not load. Rate limiting throttles; it does not identify bad actors. If your goal is stopping credential stuffing, you want a bot detection system, not a bucket algorithm.
  • You need per-tenant fairness under contention. This is actually a scheduler problem (weighted fair queueing, deficit round robin), not a rate limiter.

Real-world examples

  • Stripe's API returns 429 Too Many Requests with a Retry-After header and uses a token bucket per secret key. Their 2018 engineering blog post is worth reading before any Stripe interview.
  • GitHub's GraphQL API uses a point-based cost model — not every request is equal. A cheap query costs 1 point; a deep recursive query costs 100. This scores points when an interviewer asks about non-uniform cost.
  • Cloudflare implements sliding window counters at the edge with per-zone customization. Their sliding-log-approximation blog post from 2017 is the canonical reference.
  • Envoy has two rate-limit filters: ratelimit (global; talks to an external gRPC service, usually Lyft's reference implementation) and local_ratelimit (an in-process token bucket). The choice between them is exactly the centralized vs. local tradeoff.
  • AWS API Gateway exposes usage plans with burst and steady-state limits — classic token bucket semantics, visible directly in the console.

Common candidate mistakes

  • Not specifying the key. Per-IP? Per-user? Per-API-key? Per-(user, endpoint)? Say it. Different keys have different skew and fairness properties.
  • Ignoring the boundary problem on fixed windows. If the interviewer has done this loop more than ten times, they will ask. Be ready with "sliding window counter or token bucket."
  • Forgetting the response contract. Return 429 with Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset (a small sketch follows this list). RFC 6585 covers the status code; the headers are a de facto standard.
  • Failing to state the fail-open vs. fail-closed policy. If Redis is unreachable, do requests get through or not? There is no universally right answer; there is always a required answer.
  • Treating rate limiting as DDoS protection. It isn't. A well-funded attacker will exhaust your limiter's storage before the limiter ever protects your backend. Layer rate limiting with connection-level limits, SYN cookies, and upstream scrubbing.
  • Leaking user identity across tenants. If you share a Redis instance across tenants, you need prefixed keys and eviction policies that do not let one tenant starve another's limits out of memory.
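
The response contract mentioned above fits in a few lines. A sketch of the de facto headers (the helper function is illustrative):

import time

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers to attach to every response; Retry-After only when rejecting."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),  # epoch seconds when the window resets
    }
    if remaining <= 0:  # pair these with a 429 status
        headers["Retry-After"] = str(max(1, reset_epoch - int(time.time())))
    return headers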

Advanced follow-ups interviewers will ask

  • "How do you handle a user distributed across 100 edge POPs?" Answer: local approximations with periodic central reconciliation, or central hard-limit stored in a geo-replicated store (DynamoDB global tables). State that you accept some over-limit during sync lag.
  • "How do you support different costs per request?" Answer: weighted token bucket — consume cost tokens per request. GitHub's GraphQL cost analysis is the canonical reference.
  • "What happens when Redis fails?" Answer: circuit break to fail-open (prefer availability, accept over-limits temporarily) or fail-closed (prefer correctness, reject all traffic) — explicit choice.
  • "How do you bootstrap a new tenant's limits?" Answer: lazy creation on first request with a sane default, or preprovisioned via the control plane on tenant creation. Don't block request flow on control plane writes.
  • "How do you deal with clock skew across hosts?" Answer: use a monotonic source (Redis TIME, a central clock service, or NTP with bounded drift alerts). Do not trust host wall clocks for token bucket refills.
  • "How do you prevent retry storms making rate limiting worse?" Answer: exponential backoff with jitter on the client, and Retry-After with a real number, not a fixed value. The AWS SDKs' default backoff is the reference implementation.
  • "How do you test it?" Answer: load test with a known burst pattern (e.g., vegeta, k6), verify the number of accepted requests in each window, inject Redis latency, inject clock skew, and verify the metric emits the correct rejection count.

The candidates who clear a staff-plus system design loop on rate limiting are not the ones who can draw the most algorithms. They are the ones who can look at a workload, name the enforcement point, pick one algorithm on purpose, and walk through what breaks when Redis hiccups or when traffic spreads across 50 edge POPs. Practice narrating the decision, not reciting the menu.

If you can articulate the algorithm-storage-coordination-enforcement split without hand-waving, and you can write the Lua script on the whiteboard, you have already outperformed the majority of candidates. Rate limiting looks simple and is relentlessly nuanced; interviewers love it for exactly that reason.