Rate Limiter System Design — Distributed Counters and Consistency

10 min read · April 25, 2026

A system design interview guide for rate limiters: choose token buckets or sliding windows, design distributed counters, handle multi-region consistency, and plan failure behavior.

Designing a rate limiter in a system design interview is really about distributed counters, consistency, latency, abuse prevention, and graceful failure. Interviewers are not looking for a memorized Redis snippet. They want to know whether you can protect an API or product surface from overload while keeping legitimate users moving.

A strong answer starts with requirements, chooses the right limiting algorithm, defines keys and quotas, handles distributed enforcement, and explains what happens when the limiter is slow, inconsistent, or unavailable. The details matter because rate limiters sit directly in the request path.

What a rate limiter is solving

A rate limiter controls how many actions a user, client, IP, token, organization, or service can perform in a time window. It is used to prevent:

  • API abuse and scraping
  • Brute-force login attempts
  • Traffic spikes that overload downstream systems
  • Noisy-neighbor behavior in multi-tenant products
  • Accidental client loops or retries
  • Cost explosions for expensive operations like AI inference or email sends
  • Fairness problems where one customer consumes shared capacity

In a system design interview, clarify the use case first. A public API rate limiter has different constraints than an internal microservice limiter or login-abuse limiter. A strict payment API limit may prefer correctness. A social feed read limiter may prefer availability and approximate enforcement.

Requirements to clarify before designing

Ask these questions early:

| Question | Why it matters |
|---|---|
| What action is limited? | Read requests, writes, login attempts, messages, uploads, expensive jobs? |
| Who is limited? | User, API key, IP, device, tenant, route, or a combination? |
| What are the quotas? | 100 requests/minute, burst of 20, daily cap, tier-based limits? |
| Is enforcement global? | Multi-region systems need shared or replicated counters. |
| How strict must it be? | Financial or security actions need tighter consistency than page views. |
| What latency budget exists? | A limiter in the hot path may need sub-millisecond local checks. |
| What happens on failure? | Fail open, fail closed, local fallback, or degraded quota? |
| Do we need auditability? | Enterprise APIs may need logs and customer-visible headers. |

A crisp requirement set might be: "Limit API write requests by API key and route, support 1,000 requests per minute with bursts of 100, enforce globally across three regions, add less than 5 ms p99 latency, and fail open for low-risk reads but fail closed for login attempts." That is the kind of framing interviewers reward.

Rate limiting algorithms

There are five common algorithms. Know the trade-offs.

| Algorithm | How it works | Pros | Cons |
|---|---|---|---|
| Fixed window counter | Count requests in calendar windows like 12:00:00-12:00:59 | Simple and cheap | Allows boundary bursts |
| Sliding window log | Store timestamps for each request and count recent events | Precise | Expensive at high volume |
| Sliding window counter | Blend current and previous window counts | Good approximation | More complex, still approximate |
| Token bucket | Tokens refill at a steady rate; requests spend tokens | Handles bursts gracefully | Needs atomic token updates |
| Leaky bucket | Queue drains at steady rate | Smooths traffic | Adds queueing delay or drops |

For most API interviews, token bucket is the best default because it supports a sustained rate plus a controlled burst. For example, a customer gets 100 requests per minute with a burst capacity of 20. Tokens refill over time. If tokens are available, allow the request and decrement. If not, reject with 429 and a retry-after hint.
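
As a concrete sketch, here is a minimal single-process token bucket in Python. It is illustrative only: the class name and parameters are invented for this example, and real deployments need the atomic, shared-state version discussed later.

```python
import time

class TokenBucket:
    """Minimal single-process token bucket: sustained rate plus bounded burst."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec        # refill rate in tokens per second
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity          # start full so clients can burst immediately
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill according to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost         # spend tokens and allow
            return True
        return False                    # caller should return 429 with a retry hint

# The example above: 100 requests per minute sustained, burst capacity of 20.
bucket = TokenBucket(rate_per_sec=100 / 60, capacity=20)
```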

Fixed windows are easy but can be unfair. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00, receiving double the intended rate within a couple of seconds at the window boundary. Sliding windows reduce that but require more state. Sliding logs are precise but often too memory-heavy for high-volume systems.

The high-level architecture

A basic distributed rate limiter has these components; a middleware sketch follows the list:

  1. Client or edge receives request. API gateway, load balancer, service mesh, or application middleware.
  2. Rate limit key is computed. Example: api_key:route, user_id:action, or tenant_id:model.
  3. Limiter checks state. Local memory, Redis, dedicated rate-limit service, or distributed store.
  4. Decision is returned. Allow, reject, delay, or shadow-only observe.
  5. Headers and logs are emitted. Remaining quota, reset time, retry-after, and audit events.
  6. Metrics monitor health. Allowed, denied, limiter latency, store errors, hot keys, and false positives.
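
Steps 2 through 5 often live in application middleware. A minimal sketch, assuming a limiter object with a hypothetical check method and simplified request and response types (none of these names come from a specific framework):

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    allowed: bool
    limit: int
    remaining: int
    reset_at: int            # epoch seconds when the quota resets
    retry_after: int = 0

@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)

def apply_rate_limit(api_key: str, route: str, limiter) -> Response | None:
    """Return a 429 Response if the request is over limit, else None."""
    key = f"{api_key}:{route}"            # step 2: compute the rate-limit key
    d: Decision = limiter.check(key)      # steps 3-4: consult state, get a decision
    headers = {                           # step 5: quota headers on every response
        "X-RateLimit-Limit": str(d.limit),
        "X-RateLimit-Remaining": str(d.remaining),
        "X-RateLimit-Reset": str(d.reset_at),
    }
    if not d.allowed:
        headers["Retry-After"] = str(d.retry_after)
        return Response(429, headers)
    return None                           # caller proceeds and attaches headers
```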

Placement matters. At the edge, the limiter protects more of the stack and reduces wasted work. In the application, it has richer identity and business context. Many real systems use both: coarse IP or route limits at the edge, precise user or tenant limits in the app.

Distributed counters: Redis, atomicity, and hot keys

The common implementation uses Redis or a Redis-like in-memory store because rate limiting needs fast atomic increments and expirations.

For a fixed window counter (a sketch follows the list):

  • Key: rl:{api_key}:{route}:{window_start}
  • Operation: atomic increment
  • Expire: slightly longer than the window
  • Decision: allow if count <= limit
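
A minimal sketch of that recipe, assuming the redis-py client; the key layout matches the bullets above:

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis()

def fixed_window_allow(api_key: str, route: str, limit: int, window_sec: int = 60) -> bool:
    window_start = int(time.time()) // window_sec * window_sec
    key = f"rl:{api_key}:{route}:{window_start}"
    pipe = r.pipeline()
    pipe.incr(key)                        # atomic increment; creates the key at 1
    pipe.expire(key, window_sec + 10)     # expire slightly longer than the window
    count, _ = pipe.execute()
    return count <= limit                 # allow if count <= limit
```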

For a token bucket (an atomic sketch follows the list):

  • Key stores token count and last refill timestamp.
  • On each request, calculate tokens added since last check.
  • Cap at bucket capacity.
  • If tokens >= cost, decrement and allow.
  • Otherwise reject and compute retry time.
  • Use a Lua script or equivalent server-side transaction so read-modify-write is atomic.
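
A sketch of that token bucket as a Redis Lua script, again assuming redis-py. The script runs atomically on the server, so concurrent callers cannot overspend; the hash field names and the idle-bucket TTL are illustrative choices:

```python
import time
import redis

r = redis.Redis()

TOKEN_BUCKET_LUA = """
-- KEYS[1] = bucket key; ARGV: 1 = refill rate/sec, 2 = capacity, 3 = now, 4 = cost
local state = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local tokens = tonumber(state[1]) or capacity     -- new buckets start full
local ts = tonumber(state[2]) or now
-- Refill for elapsed time, capped at capacity.
tokens = math.min(capacity, tokens + math.max(0, now - ts) * tonumber(ARGV[1]))
local allowed = 0
if tokens >= tonumber(ARGV[4]) then
  tokens = tokens - tonumber(ARGV[4])
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], 120)                -- garbage-collect idle buckets
return allowed
"""

token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(key: str, rate_per_sec: float, capacity: float, cost: float = 1.0) -> bool:
    return token_bucket(keys=[key], args=[rate_per_sec, capacity, time.time(), cost]) == 1
```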

Atomicity matters because multiple app servers may check the same key concurrently. Without atomic updates, two requests can both see tokens available and overspend. In interviews, explicitly mention atomic increments or server-side scripts. That is a key senior signal.

Hot keys are another issue. A very large customer, popular IP block, or global route can create a single counter receiving massive writes. Mitigations include sharded counters, hierarchical limits, local pre-allocated tokens, or tenant-specific capacity planning.
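
As one illustration, a sharded counter spreads a hot key's writes across several Redis keys and sums them on read; the shard count and key layout here are arbitrary:

```python
import random
import redis

r = redis.Redis()
SHARDS = 8  # spread one hot logical counter across N physical keys

def sharded_incr(base_key: str, window_sec: int = 60) -> None:
    shard = random.randrange(SHARDS)      # each writer picks a random shard
    pipe = r.pipeline()
    pipe.incr(f"{base_key}:{shard}")
    pipe.expire(f"{base_key}:{shard}", window_sec + 10)
    pipe.execute()

def sharded_count(base_key: str) -> int:
    # Reads must sum every shard; MGET keeps that to a single round trip.
    values = r.mget([f"{base_key}:{s}" for s in range(SHARDS)])
    return sum(int(v) for v in values if v is not None)
```

The trade is cheaper hot-key writes for a more expensive multi-key read, which is why sharded counters suit approximate enforcement better than strict limits.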

Consistency trade-offs in a multi-region limiter

Consistency is the hard part. If traffic is served from multiple regions, where does the counter live?

Option 1: Centralized counter store. All regions call one primary Redis cluster or rate-limit service.

  • Pros: strongest enforcement, simple mental model.
  • Cons: added latency, regional dependency, possible bottleneck.

Option 2: Regional counters with quota partitioning. Each region gets a slice of the global quota (sketched below).

  • Pros: low latency and resilient local enforcement.
  • Cons: unused quota in one region may not be available elsewhere; global limit is approximate.
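
Partitioning can be as simple as splitting the global quota by each region's recent traffic share and recomputing periodically. A minimal sketch; the regions and numbers are invented:

```python
def partition_quota(global_limit: int, recent_traffic: dict[str, int]) -> dict[str, int]:
    """Split a global quota across regions proportional to recent traffic."""
    total = sum(recent_traffic.values()) or 1
    return {
        region: max(1, global_limit * count // total)   # keep a floor per region
        for region, count in recent_traffic.items()
    }

# Example: 1,000 requests/minute split across three regions by observed load.
print(partition_quota(1000, {"us": 600, "eu": 300, "ap": 100}))
```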

Option 3: Local fast path with asynchronous reconciliation. Each region enforces local limits and syncs counters periodically.

  • Pros: high availability and low latency.
  • Cons: temporary over-limit behavior is possible.

Option 4: Token leasing. A central service leases chunks of tokens to regions or edge nodes. Local nodes spend leased tokens quickly; when empty, they request more (sketched below).

  • Pros: balances latency with bounded overrun.
  • Cons: more complex implementation and lease recovery.
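
A sketch of the leasing idea, where lease_tokens stands in for a hypothetical RPC to the central control plane:

```python
import threading

class LeasedBucket:
    """Spend tokens from a locally held lease; refill in chunks from a
    central control plane. lease_tokens is a hypothetical RPC."""

    def __init__(self, lease_tokens, chunk: int = 100):
        self.lease_tokens = lease_tokens   # callable(chunk) -> granted token count
        self.chunk = chunk
        self.local = 0
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            if self.local <= 0:
                # The control plane may grant fewer (or zero) tokens when the
                # global quota is nearly exhausted, which bounds the overrun.
                self.local = self.lease_tokens(self.chunk)
            if self.local > 0:
                self.local -= 1
                return True
            return False
```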

The right answer depends on strictness. For login attempts, payment actions, or fraud-sensitive operations, prefer stronger coordination or fail-closed behavior. For read APIs, search, feed views, or analytics ingestion, approximate limits are often acceptable if they protect capacity.

A strong interview phrase: "I would choose bounded inconsistency rather than global synchronous coordination unless the action is security- or money-sensitive." That shows product judgment.

Designing the API and response behavior

Rate limiting is user-facing when it rejects traffic. Good systems return clear signals:

  • HTTP 429 Too Many Requests
  • Retry-After header when a retry time is known
  • X-RateLimit-Limit
  • X-RateLimit-Remaining
  • X-RateLimit-Reset
  • Error body with route, limit type, and support guidance for enterprise customers
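
Put together, a rejected public API request might look like this; all values, the endpoint, and the error body shape are illustrative:

```
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1745584800
Content-Type: application/json

{"error": "rate_limited", "route": "/v1/messages", "limit": "1000/minute",
 "message": "Retry after 30 seconds or contact support to raise your quota."}
```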

Do not leak sensitive anti-abuse rules. Login and fraud limits may intentionally return generic errors to avoid helping attackers tune behavior. Public API limits can be transparent because developer experience matters.

Also decide whether requests have different costs. A cheap GET may cost 1 token. An expensive export, AI generation, or bulk write may cost 10 or 100 tokens. Weighted tokens are a clean way to align rate limits with infrastructure cost.
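
A sketch of weighted costs; the routes and numbers are invented, and the resulting cost feeds the cost parameter of the token bucket sketches above:

```python
# Illustrative per-route token costs; real values belong in a config service.
ROUTE_COSTS = {
    "GET /v1/items": 1,        # cheap read
    "POST /v1/items": 5,       # write
    "POST /v1/exports": 100,   # expensive bulk operation
}

def request_cost(method: str, route: str) -> int:
    return ROUTE_COSTS.get(f"{method} {route}", 1)  # default: 1 token
```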

Failure modes: fail open, fail closed, or fallback

A limiter can become a dependency that takes down the product if designed poorly. Decide failure behavior by action type.

| Action | Suggested failure behavior |
|---|---|
| Public read API | Fail open with local emergency limits |
| Expensive AI generation | Fail closed or reduced quota to protect cost |
| Login attempts | Fail closed or strict local fallback |
| Payment creation | Fail closed for suspicious bursts, allow known trusted flows cautiously |
| Internal service calls | Local circuit breaker plus backpressure |

A practical design includes local fallback counters. If Redis is unavailable, each app server can enforce a conservative in-memory limit for a short time. This is not globally precise, but it prevents unlimited abuse while keeping the product alive.

Also add circuit breakers. If the limiter store latency spikes, the application should not pile up threads waiting for decisions. Return a safe fallback quickly.
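
One way to combine the fallback counter and the circuit breaker, as a sketch: remote_check is a placeholder for the shared-store call, and the local limit and cool-off are deliberately conservative invented defaults.

```python
import time

class LimiterWithFallback:
    """Use the shared store when healthy; on errors, trip a breaker and
    enforce a coarse in-memory limit for a short cool-off period."""

    def __init__(self, remote_check, local_limit_per_min: int = 100,
                 cooloff_sec: float = 30.0):
        self.remote_check = remote_check          # callable that may raise or time out
        self.local_limit = local_limit_per_min
        self.cooloff_sec = cooloff_sec
        self.tripped_until = 0.0
        self.local_counts: dict[str, tuple[int, int]] = {}  # key -> (window, count)

    def allow(self, key: str) -> bool:
        now = time.time()
        if now >= self.tripped_until:
            try:
                return self.remote_check(key)
            except Exception:
                self.tripped_until = now + self.cooloff_sec  # trip the breaker
        # Fallback: conservative per-process fixed window, not globally precise.
        window = int(now) // 60
        w, count = self.local_counts.get(key, (window, 0))
        if w != window:
            count = 0
        self.local_counts[key] = (window, count + 1)
        return count + 1 <= self.local_limit
```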

Observability and operations

A rate limiter needs excellent metrics because false positives hurt customers and false negatives hurt systems.

Track:

  • Allowed and denied requests by route, tenant, region, and limit rule
  • Limiter decision latency p50/p95/p99
  • Store errors and timeouts
  • Hot keys and top limited tenants
  • Retry-after distribution
  • Quota utilization by plan tier
  • Over-limit support tickets or customer complaints
  • Downstream system saturation

Add a shadow mode for new rules. In shadow mode, the limiter logs what it would have denied but does not block. This is especially useful for enterprise customers where an incorrect limit can break integrations.
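
Shadow mode can be as simple as evaluating the new rule alongside the enforced one and logging disagreements. A sketch, with both rules passed in as plain callables:

```python
import logging

log = logging.getLogger("ratelimit.shadow")

def check_with_shadow(key: str, enforced_rule, shadow_rule) -> bool:
    """enforced_rule decides; shadow_rule only logs what it would have denied."""
    allowed = enforced_rule(key)
    if allowed and not shadow_rule(key):
        log.info("shadow-deny key=%s", key)   # would be blocked by the new rule
    return allowed
```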

Include admin tooling. Support and customer success teams need to see why a request was limited, which rule fired, and whether a temporary quota increase is safe.

Worked design: public API limiter

Prompt: "Design a rate limiter for a public API used by developers."

Requirements: limit by API key and route, support free and paid tiers, allow bursts, global traffic across two regions, return 429 with retry hints, and add less than 5 ms p99 for the limiter check.

Design:

  • Edge gateway performs coarse IP limits to block obvious abuse.
  • Application gateway authenticates API key and computes key api_key:route.
  • Token bucket enforces sustained rate and burst capacity.
  • Redis cluster in each region handles atomic token bucket updates using server-side scripts.
  • Paid tiers have larger bucket capacity and refill rates loaded from a config service.
  • Quota is partitioned by region based on recent traffic, with periodic adjustment.
  • For enterprise customers, token leasing from a central control plane can reduce overrun while keeping regional latency low.
  • Responses include 429, retry-after, and quota headers.
  • Observability tracks denied requests, hot keys, store latency, and downstream saturation.

Consistency trade-off: because this is a public API, small temporary overages are acceptable. Availability and low latency matter more than perfect global enforcement. For expensive endpoints like exports, use stricter limits or central coordination.

Common traps in rate limiter interviews

The first trap is using a single global fixed window counter without discussing bursts. Boundary bursts are real and interviewers expect you to notice them.

Second, candidates forget atomicity. Incrementing and checking in separate operations is race-prone under concurrency. Use atomic increments, transactions, or server-side scripts.

Third, candidates design a limiter that depends on a remote store for every request with no fallback. If that store has a bad day, your API has a bad day. Mention local fallback and circuit breakers.

Fourth, candidates ignore identity. IP limits are easy but weak: NATs, mobile carriers, shared offices, VPNs, and attackers all distort IP meaning. Use API keys, user IDs, tenant IDs, device IDs, and route-specific keys when available.

Fifth, candidates treat all endpoints equally. A password attempt, AI image generation, search query, and profile view do not have the same risk or cost. Weighted limits and route-specific policies are more realistic.

Prep checklist for rate limiter system design

Before your interview, be ready to explain:

  • Fixed window, sliding window, token bucket, and leaky bucket trade-offs.
  • Why token bucket is a strong default for bursty APIs.
  • How to compute rate limit keys.
  • How atomic counters work in Redis or a rate-limit service.
  • How to handle multi-region consistency.
  • When to fail open versus fail closed.
  • What headers and errors clients should receive.
  • How to monitor false positives, hot keys, latency, and denied traffic.
  • How plan tiers, weighted tokens, and admin overrides work.

A strong closing summary: "I would use token buckets keyed by API key and route, enforce coarse limits at the edge and precise limits in the gateway, store counters with atomic updates in a low-latency distributed store, accept bounded regional inconsistency for low-risk APIs, and use stricter coordination for expensive or security-sensitive actions. I would add fallback behavior so the limiter protects the system without becoming the outage." That is the system design answer interviewers are looking for.