CAP Theorem Interview Deep-Dive: What to Actually Say When Asked CP vs AP
CAP is the most misused three letters in system design interviews. Here's the precise answer, the PACELC correction most candidates miss, and what to say when asked to pick CP or AP.
The CAP theorem question is the most-asked and most-botched distributed systems question in the interview pipeline. Every candidate knows the three letters. Very few can answer the staff-level version of the question, which is approximately: "is your system CP or AP during a network partition, and what does that actually do to your users?"
This guide is the version of the CAP conversation I wish every candidate walked into. It is not a history lesson on Eric Brewer's 2000 keynote or Gilbert and Lynch's 2002 proof. It is what to actually say when your staff-plus interviewer asks the follow-up.
The theorem, stated precisely
Brewer's conjecture, formalized by Gilbert and Lynch (ACM SIGACT News, 2002):
In the presence of a network partition, a distributed system can provide either consistency or availability, but not both.
Three things to notice that most candidates miss:
- "Consistency" here specifically means linearizability. Not serializability, not eventual consistency, not "data is right." Linearizability: every read returns the value of the most recent write, as if there were a single copy.
- "Availability" means every non-failing node returns a non-error response. A timeout or a 503 is not available. Returning stale data is available.
- The tradeoff only applies during a partition. In the normal case (no partition), you can have both. CAP does not say "pick two forever." It says "when the network partitions, pick one."
Candidates who can state this in one breath are already in the top third.
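To make the three points above concrete, here is a minimal sketch of one replica answering a read while it is cut off from the majority. The `Replica` class, the `can_reach_quorum` flag, and the `Unavailable` exception are illustrative names, not any real database's API. The point is only that the same stale replica yields an error under CP and a stale value under AP.

```python
from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer (the client sees an error)."""


@dataclass
class Replica:
    value: str              # last value this replica has applied
    can_reach_quorum: bool  # False while partitioned from the majority


def read(replica: Replica, mode: str) -> str:
    """Serve a read in 'CP' or 'AP' mode (hypothetical modes for illustration)."""
    if mode == "CP" and not replica.can_reach_quorum:
        # The replica cannot prove its value is the latest write, so it refuses.
        raise Unavailable("cannot confirm latest value")
    # AP, or a healthy CP replica: answer from local state, possibly stale.
    return replica.value


# A replica on the minority side that missed the latest write.
stale = Replica(value="v1", can_reach_quorum=False)

ap_result = read(stale, "AP")  # answers, but with the old value
try:
    cp_result = read(stale, "CP")
except Unavailable:
    cp_result = "error"        # refuses rather than risk a stale answer
```

Neither outcome is a bug. Each is the system doing exactly what its CAP choice promises.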
The PACELC extension — what staff candidates actually cite
Daniel Abadi's PACELC (2012) is the correction CAP needed:
If there is a Partition, choose between Availability and Consistency (PAC). Else, in normal operation, choose between Latency and Consistency (LC).
This is the right model. It says the quiet part out loud: even with no partition, linearizability costs latency, and you're making that tradeoff every day, not just during failures.
Systems classified under PACELC:
- Spanner, CockroachDB, etcd, ZooKeeper: PC/EC. Consistent during partitions (refuse writes), consistent in normal operation (pay latency).
- DynamoDB, Cassandra, Riak: PA/EL. Available during partitions (return stale), low-latency in normal operation (eventually consistent).
- MongoDB (default): PA/EC. Stays available during a partition on the side that has a primary; reads go to the primary and are consistent by default.
- Aurora: PC/EL. Consistent during partition (single writer), low-latency reads from replicas.
If you mention PACELC in the interview and can classify two real systems by it, you have already separated yourself from 80% of candidates.
What interviewers actually want to hear
The bad answer: "Pick two of three. We'll pick CP because we need consistency."
The good answer: "Our write path needs linearizability on the critical read-my-write for inventory decrement, so that path is CP — during a partition of the cell that holds that shard, we return errors. The read path for browsing the product catalog is AP — we serve stale reads from any replica. The metadata control plane that routes requests is CP on etcd. We pay the latency cost of consensus for control plane operations and the latency cost of a single-shard write for inventory, and we get cheap eventual reads for the 99% of traffic that's catalog browsing."
That paragraph contains four things interviewers score: per-path CAP choice, naming the operation that drives the choice, naming the real store, and naming what users experience during a partition.
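One way to rehearse the good answer is to write it down as a per-operation table. The sketch below is only that: the operation names, store labels, and `PathPolicy` type are made up for illustration and mirror the sample answer rather than any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathPolicy:
    cap_choice: str        # "CP" or "AP" for this operation
    store: str             # where the data lives (labels are illustrative)
    during_partition: str  # what the user actually experiences


# One entry per operation, mirroring the sample answer above.
POLICIES = {
    "inventory_decrement": PathPolicy("CP", "shard primary", "request fails with an error"),
    "catalog_browse": PathPolicy("AP", "any replica", "page loads with possibly stale data"),
    "request_routing": PathPolicy("CP", "etcd", "control-plane change fails with an error"),
}


def explain(operation: str) -> str:
    """Render the one-sentence answer for a single operation."""
    p = POLICIES[operation]
    return f"{operation}: {p.cap_choice} on {p.store}; during a partition, {p.during_partition}"


summary = explain("catalog_browse")
```

If you cannot fill in all three columns for an operation, you have not yet made the CAP decision for it.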
The tradeoffs to name out loud
- CP means users see errors during partitions. Your Grafana dashboard will show a spike in 5xx responses, typically 503s. Your on-call will page. The system is doing the right thing.
- AP means users see stale data during partitions. They see a product that says "in stock" when it's not, or a post that disappears and reappears. The system is doing the right thing.
- Latency always costs something. A linearizable read in a 3-node Raft cluster costs at least one RTT to the leader, plus a quorum round trip unless the leader holds a valid lease. Cross-region linearizability costs >100ms when the quorum spans distant regions. Factor this into your latency budget before you claim CP.
- There is no "CA" system. A system is only CA if you assume partitions never happen. In the real world they always happen. Candidates who claim MySQL single-master is CA miss the point — single-master is CP, because during a partition from the primary, replicas refuse writes.
- Partitions aren't binary. There are slow partitions, asymmetric partitions (A can reach B but B can't reach A), and flapping partitions. Jepsen has written extensively about these. Your "CP" system may actually split-brain if the partition detector is wrong.
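The latency point above is easy to quantify with back-of-the-envelope arithmetic. A rough sketch, assuming a leader-based consensus protocol where the leader counts its own vote and commits once a majority has acknowledged: the function name and the RTT figures are hypothetical, and the estimate ignores disk fsync, processing time, and the client's own RTT to the leader.

```python
def quorum_commit_latency_ms(follower_rtts_ms: list[float]) -> float:
    """Estimate how long a leader waits for a majority of the cluster to ack.

    The leader counts itself, so it needs acks from (majority - 1) followers
    and is gated by the slowest of the fastest such followers.
    """
    cluster_size = len(follower_rtts_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single-node "cluster": nothing to wait for
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical RTTs from the leader to each follower, in milliseconds.
same_region = quorum_commit_latency_ms([1.0, 2.0])        # 3 nodes, one ack needed
cross_region = quorum_commit_latency_ms([120.0, 180.0])   # 3 nodes, all followers far away
five_nodes = quorum_commit_latency_ms([1.0, 70.0, 140.0, 150.0])  # 5 nodes, two acks needed
```

The takeaway for the interview: the commit waits on the follower that completes the majority, so one nearby follower does not help a 5-node cluster if the second-nearest is across an ocean.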
Real-world partition scenarios
Concrete scenarios that make CAP tangible:
- AWS us-east-1 inter-AZ partition (September 2015). DynamoDB metadata service had cascading failures because the metadata layer could not reach enough quorum members. Classic CP behavior: the service refused writes rather than accept inconsistency.
- GitHub's 2018 MySQL partition. A cross-coast partition caused a failover that violated strict primary invariants. Post-mortem named it explicitly — the tradeoff was chosen for consistency, and they accepted the write unavailability.
- Cloudflare's 2020 backbone issue. A routing loop partitioned data centers; their eventually-consistent edge configuration continued serving stale rules, which was the right call — serving stale WAF rules beats serving no rules at all.
- Kafka's unclean.leader.election.enable. Flips Kafka between CP and AP on a per-topic basis. false (CP) has been the default since 0.11: a partition becomes unavailable if every in-sync replica (ISR) is down. true (AP) will elect a stale replica and lose writes to keep publishing.
Naming a real incident when asked "have you dealt with a partition" earns points. Even if you weren't on-call for it, having read the post-mortem is signal.
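The Kafka setting is the easiest of these to reason about in code. The following is a deliberately simplified model of the decision that unclean.leader.election.enable controls, not Kafka's actual controller logic. The function name and broker IDs are invented for illustration.

```python
from typing import Optional


def elect_leader(
    in_sync_alive: list[str],
    out_of_sync_alive: list[str],
    unclean_election: bool,
) -> Optional[str]:
    """Pick a new partition leader, or None if the partition must go offline."""
    if in_sync_alive:
        # A live in-sync replica has every committed write: safe to promote.
        return in_sync_alive[0]
    if unclean_election and out_of_sync_alive:
        # AP choice: promote a lagging replica and accept losing its missing writes.
        return out_of_sync_alive[0]
    # CP choice: stay offline until an in-sync replica comes back.
    return None


# Every in-sync replica is down; only a lagging broker is reachable.
cp_leader = elect_leader([], ["broker-3"], unclean_election=False)
ap_leader = elect_leader([], ["broker-3"], unclean_election=True)
```

With the flag off, the partition has no leader and producers see errors. With it on, publishing continues on a replica that is missing some acknowledged writes.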
When you should pick CP
The clear cases for consistency-over-availability:
- Money movement. Debit/credit ledgers, payment processing, double-entry bookkeeping. The business cost of an inconsistency (lost money, duplicate charge) vastly exceeds the cost of downtime.
- Inventory for physical goods. Overselling is a real cost; showing "out of stock" is free.
- Distributed locks and leader election. If two nodes both think they're the leader, everything downstream breaks.
- Service discovery and configuration. etcd, Consul, ZooKeeper. You want your routing table consistent or not available — never stale.
- Regulatory and audit logs. Must be correct or absent.
When you should pick AP
The cases where availability beats consistency:
- Social feeds, timelines, notifications. Users prefer stale content over a blank page.
- Product catalogs, search results. A few seconds of staleness is invisible.
- Analytics, metrics, dashboards. Staleness is already baked into the UX.
- Content delivery (DNS, CDNs). Stale cache beats no cache. DNS is a famously AP system.
- Shopping carts. Add-to-cart staleness is acceptable; decrement-inventory is not.
- Real-time collaboration (Figma, Google Docs). Local edits are applied optimistically and merged later, via CRDTs or operational transformation. Designed to be AP, with conflict resolution.
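If an interviewer pushes on what "CRDT merge" means, the smallest example to have in your pocket is a grow-only counter (G-Counter). This is a textbook sketch, not how Figma or Google Docs store documents, and the node names are hypothetical. Each node only increments its own slot, and merging takes the per-node maximum.

```python
def increment(counter: dict[str, int], node_id: str, amount: int = 1) -> dict[str, int]:
    """Return a copy of the counter with this node's own slot increased."""
    updated = dict(counter)
    updated[node_id] = updated.get(node_id, 0) + amount
    return updated


def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Combine two replicas by taking the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(counter: dict[str, int]) -> int:
    """The counter's value is the sum of all per-node slots."""
    return sum(counter.values())


# Two replicas accept writes independently while partitioned.
replica_a = increment(increment({}, "node-a"), "node-a")  # node-a counted 2
replica_b = increment({}, "node-b")                       # node-b counted 1

# After the partition heals, merge order does not matter.
merged_ab = merge(replica_a, replica_b)
merged_ba = merge(replica_b, replica_a)
total = value(merged_ab)
```

Because the merge is commutative, associative, and idempotent, both sides converge to the same value without coordination, which is what lets the write path stay available during the partition.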
Common candidate mistakes
- Saying "CAP: pick two." Technically wrong. You pick between C and A during a partition. Most of the time you have both.
- Calling MySQL "CA." There is no CA. A single-master system under partition either refuses writes (CP) or serves stale reads from lagging replicas (AP). Pick one.
- Claiming linearizability and AP for the same path. They are mutually exclusive during a partition. If you want both, you want a different path.
- Not naming the operation. CAP is per-operation. "Our system is CP" is wrong. "Our write path is CP; our read path is AP" is correct.
- Ignoring PACELC. If you're asked about CAP and don't bring up the latency tradeoff in normal operation, you're missing the more practical half of the question.
- Claiming CAP doesn't apply because you're in one data center. Partitions happen inside data centers too — top-of-rack switches fail, VPC routing glitches, kernel bugs drop packets. Don't hand-wave.
- Reciting the theorem without applying it. Interviewers don't want the textbook. They want the decision for this system.
Interview scripts that land
Here are three scripts you can lift and adapt.
For a "design a payment system" question: "The ledger is CP — during a partition, we refuse writes rather than accept double-spend risk. We use Raft-backed etcd for idempotency keys and a single-shard Postgres or DynamoDB transaction for the balance update. PACELC-wise, we're PC/EC: we pay the latency cost of consensus to get linearizability. The read side for transaction history is AP from read replicas — a few seconds of staleness on a user's transaction list is fine, and we get cheap reads."
For a "design Twitter" question: "Timelines are AP. We serve the precomputed Redis timeline from the nearest cell; during a partition, the user may see a slightly stale feed. Follow counts are eventually consistent from a PA/EL store like Cassandra. The user's own tweet creation path needs read-your-writes — that's a session guarantee, orthogonal to CAP, handled by sticky sessions to the write cell. User authentication is CP on a single-region identity service."
For a "design a distributed lock" question: "Locks must be CP — if two holders both think they own the lock, the invariant they're protecting breaks. I'd use Redlock with caveats (Martin Kleppmann has a famous critique), or preferably a real consensus-backed lock service: etcd's lease API, ZooKeeper ephemeral nodes, or Chubby. During a partition, lock acquisition fails, and clients retry with backoff."
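The last sentence of the lock script, "clients retry with backoff," can be sketched as follows. This assumes a hypothetical `try_acquire` callable that returns True when the lock service grants the lock. It stands in for whatever client you actually use and is not the etcd or ZooKeeper API. The sleep and random functions are injectable so the behavior is easy to test.

```python
import random
import time
from typing import Callable


def acquire_with_backoff(
    try_acquire: Callable[[], bool],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Try to take the lock, backing off exponentially with jitter between attempts."""
    for attempt in range(max_attempts):
        if try_acquire():
            return True
        if attempt < max_attempts - 1:
            # The delay ceiling doubles each attempt; jitter spreads out retrying clients.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * ceiling)
    return False  # give up and surface the failure to the caller


# Simulated lock service: refuses twice (e.g. during a partition), then grants.
responses = iter([False, False, True])
delays: list[float] = []
acquired = acquire_with_backoff(
    lambda: next(responses),
    sleep=delays.append,  # record delays instead of sleeping
    rng=lambda: 1.0,      # remove randomness for the demonstration
)
```

Bounding the attempts matters in a CP design: if the partition outlasts the retry budget, the caller should fail visibly rather than spin forever.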
Advanced follow-ups
- "How does your system detect a partition?" Answer: health checks with a quorum-based vote, not pairwise pings. ZooKeeper-style session timeouts. Note that false positives happen (slow GC pause looks like a partition).
- "What if a partition heals and there are divergent writes?" Answer: for CP systems, there are no divergent writes (the minority partition refused them). For AP systems, you need conflict resolution: LWW, vector clocks, or CRDTs.
- "How do you test CAP behavior?" Answer: fault injection with tc/iptables, Jepsen, Toxiproxy. Chaos engineering is the right answer.
- "What's the difference between FLP impossibility and CAP?" Answer: FLP (1985) says that in an asynchronous network with even one crash failure, no deterministic protocol can guarantee that consensus terminates. CAP is about tradeoffs during partitions. Related but distinct.
- "Why is etcd CP and DynamoDB default AP?" Answer: different purposes. etcd stores small metadata that must never diverge; DynamoDB stores large user data where stale reads are tolerable and cheap. Each made the right choice for its workload.
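For the divergent-writes follow-up, it helps to be able to show why vector clocks catch conflicts that last-write-wins silently drops. A minimal sketch, assuming each version carries a map from node ID to a counter. The node names and return labels are illustrative.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify two vector clocks as ordered, equal, or concurrent."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither saw the other's write: a real conflict
    if a_ahead:
        return "a_after_b"   # a has seen everything b has, and more
    if b_ahead:
        return "b_after_a"
    return "equal"


# One side of the partition built on the other's version: safe to overwrite.
ordered = compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1})

# Both sides wrote independently during the partition: must be reconciled.
conflict = compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2})
```

LWW would pick a winner in the second case by timestamp and discard the other write. A vector clock at least tells you the conflict exists, so the application or a CRDT can resolve it.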
The candidates who land the CAP question are the ones who refuse to answer it as "CP or AP" at the system level and instead answer it per-operation, naming the real stores, naming the user-visible impact, and citing PACELC. Practice saying the word "linearizability" out loud. Practice saying "during a partition, this path will return 503s and the on-call will see it on the dashboard." That is what a staff engineer sounds like.
CAP is the most famous and most over-referenced theorem in our field. Candidates who treat it as a slogan lose. Candidates who treat it as a per-operation decision, with latency costs even in normal operation, consistently out-score the room.
