Leader Election and Consensus Interview Guide: Raft, Paxos, and ZooKeeper
Consensus questions are the hardest part of staff-level system design loops. Here's how to explain Raft on a whiteboard, name when Paxos actually matters, and answer the split-brain follow-up.
Every staff-plus system design loop includes at least one question where the right answer is "we use consensus." Designing a distributed lock, building a replicated state machine, running a service registry, or making a database HA all eventually route through the same three algorithms: Paxos, Raft, and Zab. Most candidates handwave past this with "we'll use etcd" and hope the interviewer moves on. The interviewer will not move on.
This guide is the version of the consensus conversation I wish every candidate walked into. The goal is not to prove the Paxos safety theorems on the whiteboard — it is to know what consensus guarantees, when you actually need it, and how to answer the split-brain question without blinking.
The FLP impossibility you must acknowledge
Fischer, Lynch, and Paterson (1985) proved that in an asynchronous network with even one crash failure, deterministic consensus is impossible in bounded time. Practical systems work around this by:
- Using randomized timeouts (Raft's election timeout, Paxos's backoff).
- Assuming partial synchrony — the network is eventually responsive enough.
- Accepting that liveness can be compromised (the system may not make progress during a partition) while safety is preserved (no two leaders; no lost committed writes).
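The randomized-timeout workaround is worth one concrete beat. A minimal sketch (the constant names are mine, mirroring Raft's typical 150–300ms range): picking a fresh random timeout before each wait makes it unlikely that two followers repeatedly time out together and split the vote.

```python
import random

# Hypothetical constants mirroring Raft's typical range (150-300 ms).
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms() -> int:
    """Draw a fresh random timeout before every wait, so followers that
    timed out concurrently in one round desynchronize in the next."""
    return random.randint(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```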
If you mention FLP without being asked, you've signaled that you've read the papers. Don't belabor it; one sentence is enough.
Consensus, stated clearly
A consensus protocol guarantees:
- Agreement. All non-faulty nodes decide the same value.
- Validity. The decided value was proposed by some node.
- Termination. All non-faulty nodes eventually decide (liveness, not always guaranteed during partitions).
For replicated state machines, we use consensus to agree on the log of operations. Every replica applies the same log in the same order and ends up in the same state. This is the Replicated State Machine pattern (Schneider, 1990).
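The pattern fits in a few lines. A toy sketch (the Counter state machine is hypothetical; any deterministic apply function works): two replicas that replay the same log land in the same state.

```python
class Counter:
    """Toy state machine: the replicated 'state' is a single integer."""
    def __init__(self):
        self.value = 0

    def apply(self, op: str) -> None:
        # apply() must be deterministic for the RSM pattern to hold
        if op == "incr":
            self.value += 1
        elif op == "decr":
            self.value -= 1

def replay(log: list[str]) -> Counter:
    """Every replica applies the same log in the same order,
    so every replica ends in the same state."""
    sm = Counter()
    for op in log:
        sm.apply(op)
    return sm
```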
Leader election is a special case of consensus: agree on which node is leader.
Raft — the one you should draw
Ongaro and Ousterhout (2014, "In Search of an Understandable Consensus Algorithm") designed Raft specifically because Paxos was too hard to teach. It is the algorithm to draw on the whiteboard.
Raft has three roles: Leader, Follower, and Candidate. It divides time into terms (monotonically increasing integers).
[Follower] --timeout--> [Candidate] --majority vote--> [Leader]
     ^                       |                            |
     |                       |                            |
     +---higher term seen----+------higher term seen------+
The core protocol, in six bullets:
- Every node starts as a Follower with a random election timeout (150–300ms is typical).
- If a Follower doesn't hear from a Leader before its timeout, it becomes a Candidate, increments its term, votes for itself, and requests votes from peers.
- A Candidate wins the election if it gets votes from a majority. Only one Leader per term.
- The Leader accepts client commands, appends them to its log, and replicates via AppendEntries RPCs, which double as heartbeats.
- A log entry is committed once a majority of nodes have stored it. The Leader then tells replicas to apply it.
- Log matching: if two logs have the same entry at the same index and term, they're identical up to that point. This invariant is what makes Raft safe.
Write that down in an interview and you've already earned the consensus question.
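The one-vote-per-term rule is the piece candidates most often fumble, and it fits in a dozen lines. A sketch (the `state` dict shape is mine; the log up-to-date check from the safety properties below is deliberately omitted here):

```python
def handle_request_vote(state: dict, candidate_term: int, candidate_id: str) -> bool:
    """Grant at most one vote per term; adopt any higher term we see.
    `state` holds this node's current_term and voted_for."""
    if candidate_term < state["current_term"]:
        return False                       # stale candidate: reject
    if candidate_term > state["current_term"]:
        state["current_term"] = candidate_term
        state["voted_for"] = None          # a new term resets the vote
    if state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id  # one vote per term, enforced here
        return True
    return False
```

Granting a repeat vote to the same candidate (the retransmission case) is intentional; granting a second vote to a different candidate in the same term is what would break election safety.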
Key safety properties
- Election safety: at most one Leader per term. Guaranteed by majority voting and the rule "each node votes for at most one Candidate per term."
- Leader append-only: a Leader never overwrites its own log entries.
- Log matching: if two logs have the same entry at the same index and term, logs are identical up to that index.
- Leader completeness: a committed entry is present in the logs of all future Leaders. Enforced by voting: a Candidate only wins if its log is at least as up-to-date as the voter's.
- State machine safety: replicas apply the same command in the same order.
Name these when pressed. They are table stakes.
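The "at least as up-to-date" comparison behind Leader Completeness is small enough to write out. A sketch: compare last-entry terms first, then indexes.

```python
def candidate_log_ok(cand_last_term: int, cand_last_index: int,
                     my_last_term: int, my_last_index: int) -> bool:
    """Raft's 'at least as up-to-date' rule: the entry with the higher
    last term wins; on a tie, the longer log wins. A voter refuses any
    candidate whose log is behind its own, which keeps every committed
    entry in the next Leader's log."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```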
Paxos — what to say if asked
Lamport's original 1998 paper ("The Part-Time Parliament") and the 2001 "Paxos Made Simple" are famously dense. You do not need to implement Paxos in the interview. You need to know:
- Basic Paxos is a single-decree protocol — reach agreement on one value. Two phases: Prepare (proposer gets promises) and Accept (proposer sends the value, acceptors accept if they haven't promised a higher number).
- Multi-Paxos chains Basic Paxos to replicate a log. A stable leader can skip Phase 1 for subsequent decrees, making steady-state performance similar to Raft.
- Fast Paxos (Lamport, 2006) cuts a message delay by letting clients send proposals directly to acceptors, at the cost of larger quorums.
- EPaxos (Moraru et al., 2013) is leaderless — any replica can commit a command — but must handle command conflicts explicitly. Influential in the research literature; not widely deployed.
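Basic Paxos's two phases can be sketched from the acceptor's side — a minimal in-memory sketch, with no networking, durability, or proposer logic:

```python
class Acceptor:
    """Single-decree acceptor: remembers the highest proposal number it
    promised, and the last (number, value) pair it accepted."""
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (number, value) of the last accepted proposal

    def prepare(self, n: int):
        """Phase 1: promise to ignore anything below n, and report any
        previously accepted value so the proposer must carry it forward."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n: int, value):
        """Phase 2: accept unless a higher-numbered prepare arrived since."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False
```

The detail worth narrating: the promise carries back any previously accepted value, which forces a later proposer to re-propose it rather than overwrite it — that is the safety core of the protocol.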
The practical use of Paxos in production:
- Google Chubby (Burrows, 2006). Consensus for locks and metadata at Google; the ancestor of ZooKeeper in intent.
- Google Spanner. Uses a Paxos variant per shard for replication.
- Microsoft Azure Cosmos DB. Uses a Paxos variant for replication.
Unless the interviewer specifically asks about Paxos, draw Raft. It's easier to explain and the resulting system is indistinguishable in production.
ZooKeeper and Zab
ZooKeeper uses its own protocol, Zab (ZooKeeper Atomic Broadcast, Junqueira et al. 2011). Zab is similar to Multi-Paxos in shape but has a distinct recovery phase that ensures all committed transactions are delivered before a new leader serves requests.
The ZooKeeper data model is a hierarchical filesystem (znodes) with ephemeral nodes tied to client sessions. This is why ZK is the canonical choice for:
- Service discovery. Each service instance creates an ephemeral znode; discovery reads /services/foo.
- Leader election for application-level leaders. Every candidate creates a sequential ephemeral znode; the lowest-numbered is leader. The next in line watches its predecessor.
- Distributed locks. Same pattern with sequential ephemeral znodes.
- Configuration distribution. Small, infrequently-updated config stored in znodes; clients watch for changes.
If you recommend ZooKeeper in an interview, name the znode pattern. "We'll use ZooKeeper" is vague. "We'll use ephemeral sequential znodes under /leaders/<service>, and each candidate watches its predecessor" is a real answer.
When you actually need consensus
Consensus is expensive (majority round-trip per commit) and operationally complex. Use it when and only when you need:
- Distributed locks where split-brain causes corruption. Two leaders both thinking they own a resource is a real bug. Use etcd, ZooKeeper, or Chubby.
- Service discovery with strong consistency. Routing tables must not diverge. etcd, Consul, ZooKeeper.
- Replicated state machines. CockroachDB, TiDB, and etcd all use Raft to replicate their own data. Databases that need strong consistency across replicas.
- Leader election for application-level singleton tasks. Background jobs that must run exactly once across a fleet.
- Metadata and configuration. Small, important data where every node must see the same thing. Kubernetes' control plane uses etcd for this.
When you should NOT use consensus
- For every write to a large database. Consensus at scale is usually sharded — per-shard consensus, not one global consensus. Spanner is the pattern.
- For high-throughput event streams. Kafka uses a Raft variant (KRaft, since 2.8) for metadata but not for every message — the metadata log is orders of magnitude smaller than the data log.
- For large amounts of data. Consensus replicates the log and majority-commits every entry. It's fine for kilobytes; it's horrifying for terabytes. Use leader-based replication with bulk streaming for bulk data.
- As a general database. etcd's default storage quota is 2GB (configurable, with roughly 8GB as the suggested ceiling). ZooKeeper can hold more but isn't designed for it. Use a real database.
- When eventual consistency is acceptable. Don't reach for consensus if a CRDT or eventually-consistent store solves the problem.
Common candidate mistakes
- "We'll use Redis for a distributed lock." Redis single-node locks are not fault-tolerant. Redlock has caveats (Martin Kleppmann's 2016 critique is the canonical reference). If you really need correctness, use etcd or ZooKeeper.
- Not naming the quorum. "Majority" in a 5-node cluster is 3. Candidates who say "we need 3 out of 5 to commit" earn points.
- Claiming Raft always makes progress. It does not. During a partition without a majority on either side, the cluster stops accepting writes. That's the right behavior.
- Forgetting that consensus is per-shard. A single Raft group has finite throughput (thousands of commits/sec on good hardware). CockroachDB, etcd-at-scale, and others use per-range consensus.
- Missing the fsync. Raft requires log entries to be durably persisted before acknowledging. If you skip fsync for performance, you've broken safety. The correct optimization is batching entries, not skipping fsync.
- Confusing leader election with consensus on values. Raft does both; Paxos (basic) does only the latter. ZooKeeper does both via Zab.
- Ignoring the slow follower problem. A follower that can't keep up drops out of the majority, potentially triggering unnecessary elections. Production systems monitor replica lag religiously.
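The quorum arithmetic from the bullets above is worth having cold. Two one-liners, sketched out:

```python
def majority(n: int) -> int:
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

def faults_tolerated(n: int) -> int:
    """Crash failures a cluster of n survives while still committing."""
    return (n - 1) // 2
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster but needs a bigger quorum (3 instead of 2) — which is why production clusters are sized at odd numbers.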
Real-world incidents
- GitHub's 2012 MySQL outage. Failover logic gone wrong; two primaries accepted writes. A textbook split-brain case. The fix was moving to Orchestrator with more careful leader election and, eventually, to Vitess.
- Cloudflare's 2020 etcd issue. A bad config change cascaded through the metadata layer. Illustrates that consensus gives you safety but not idiot-proofing.
- Slack's 2021 failure. Not a consensus failure per se, but the post-mortem discusses how their Consul deployment behaved under partial partition.
- Jepsen's etcd, Zookeeper, and Raft tests. Aphyr's write-ups are the canonical source on how these systems actually behave under fault injection. Having read them is signal.
Pseudocode: leader election with ephemeral znodes
# ZooKeeper-style leader election (pseudocode; zk is an assumed client handle)
my_path = zk.create("/election/candidate_", ephemeral=True, sequential=True)
my_name = my_path.split("/")[-1]   # create() returns the full path; children are bare names
while True:
    children = sorted(zk.get_children("/election"))
    if children[0] == my_name:
        become_leader()
        break
    # Watch the immediate predecessor only (avoids herd effect)
    predecessor = children[children.index(my_name) - 1]
    if zk.exists(f"/election/{predecessor}", watch=True):
        wait_for_watch_event()
    # else: predecessor vanished between get_children and exists; re-check now
The reason to watch only the predecessor and not the entire list is the herd effect: if all candidates watch /election, every change wakes all of them. Sequential watching is the standard fix. Candidates who know this detail stand out.
Advanced follow-ups
- "How does Raft handle a network partition?" Answer: the minority side cannot elect a leader (no majority), so it refuses writes. The majority side elects a new leader if needed and continues. When the partition heals, the minority catches up via log replication. No split-brain.
- "How do you read from Raft without going to the leader?" Answer: leader leases — the leader holds a lease and can serve reads locally for the lease duration. Or use learner replicas for eventual reads. Be explicit about the consistency tradeoff.
- "What's the write latency in Raft?" Answer: one majority round trip + one fsync. In a 3-node cluster, roughly max(RTT to 1 of 2 remaining replicas, local disk fsync). Cross-AZ adds 1–2ms.
- "What's the difference between Raft and Multi-Paxos?" Answer: algorithmic structure is similar; Raft enforces log contiguity and leader restrictions that simplify implementation. Operationally, Raft is what you want to build today.
- "How do you change cluster membership safely?" Answer: Raft's joint consensus (C_old ∪ C_new) — commit configuration changes through a two-phase membership change. Naive replacement can cause split-brain.
- "What's the failure mode when half the cluster dies?" Answer: no majority, no writes. Reads may continue from a stale leader until its lease expires. Operators must restore quorum (via a new node or manual intervention).
- "Why does etcd default to 3 or 5 nodes and not more?" Answer: commit latency scales with the slowest majority member. Bigger clusters tolerate more failures but are slower. 3 and 5 are the sweet spots; 7 is rare.
The candidates who land the consensus question are the ones who can draw the Raft state machine without looking at notes, name quorum sizes for specific cluster sizes, cite the relevant Jepsen reports, and refuse to use consensus when a cheaper primitive solves the problem. Practice narrating the commit path and the election path; those are the two things every interviewer will ask you to walk through.
Consensus is the most technically dense topic in system design and the one where precise vocabulary earns the most credit. If you can say "majority quorum," "log matching," "leader lease," and "joint consensus" in their correct places, you have cleared the bar most candidates miss.
Related guides
- A/B Testing Interview Questions in 2026 — Power Analysis, Peeking, and SRM — A tactical guide to A/B testing interview questions in 2026, with answer frameworks for power analysis, peeking, sample-ratio mismatch, guardrails, metrics, and experiment trade-offs. Built for product analysts, data scientists, PMs, and growth roles.
- API Design Interview Cheatsheet in 2026 — Patterns, Examples, Practice Plan, and Common Traps — A practical API design interview cheatsheet for 2026: how to scope the problem, choose REST/GraphQL/gRPC patterns, model resources, handle auth, versioning, rate limits, and avoid the traps that cost senior candidates offers.
- API Design Interview Guide — REST vs GraphQL vs gRPC, Versioning, and Pagination — A practical API design interview guide covering REST, GraphQL, gRPC, versioning, pagination, idempotency, errors, auth, rate limits, and the tradeoffs interviewers expect.
- AWS Interview Cheatsheet in 2026 — Patterns, Examples, Practice Plan, and Common Traps — A high-signal AWS interview cheatsheet for 2026 covering architecture patterns, IAM, networking, reliability, cost, debugging, and the answers that show real cloud judgment.
- AWS Interview Questions in 2026 — VPC, IAM, and the Services That Always Come Up — A focused AWS interview prep guide for 2026 covering VPC design, IAM reasoning, core services, common architecture prompts, debugging flows, and the mistakes that weaken senior answers.
