Distributed Transactions Interview Guide: 2PC, Sagas, and the Outbox Pattern

9 min read · April 25, 2026

Distributed transactions are where system design candidates either confidently walk through sagas or get buried in 2PC failure modes. Here's how to pick a pattern on purpose and answer the partial-failure follow-up.

Every microservice-era system design question eventually reaches the question: how do you make a change that spans two services consistent? Payment + order + inventory. Account debit + credit. Notification + state update. The naive answer is "a transaction." The senior answer is "we don't have one, so here's how we simulate it."

This guide is the version of the distributed transactions conversation I wish every candidate walked into. The goal is to name the real patterns, understand their failure modes, and pick one on purpose based on the workload — not reach for 2PC because it sounds rigorous.

Why the problem exists

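A database transaction gives you ACID over a single store. Two services with two databases have no such luxury. The Two Generals' Problem (named by Jim Gray in 1978) shows that two parties cannot reach guaranteed agreement over a channel that can lose messages, and the FLP impossibility result (Fischer, Lynch, and Paterson, 1985) shows that no deterministic protocol can guarantee consensus in bounded time on an asynchronous network if even one process can crash. Together they mean you cannot guarantee an atomic commit across two independent systems that always terminates. You can get close, with caveats.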

Candidates who say "we need a distributed transaction" without justifying why lose points. The first question is: do we actually need atomicity, or do we need eventual consistency with correct compensation?

The patterns that matter

Two-Phase Commit (2PC). Coordinator-driven atomic commit. Participants vote yes or no in the prepare phase; the coordinator then broadcasts commit or abort. Strong consistency, blocking on coordinator failure, impractical across Internet-scale services. Used inside databases (XA in MySQL, Postgres prepared transactions, Spanner internally).
Three-Phase Commit (3PC). Adds a pre-commit phase to 2PC to avoid blocking. Assumes synchronous network; fails under real network partitions. Mostly academic.
Paxos Commit / Spanner-style. Commit backed by a consensus protocol, which removes the single-coordinator block. Used inside Google Spanner (Paxos) and CockroachDB (Raft); complex to implement.
Saga. A long-running workflow: a sequence of local transactions, each with a compensating transaction if a later step fails. Garcia-Molina and Salem, 1987. This is the microservices answer.
Transactional outbox. Write to your DB and to an outbox table in the same transaction; a separate process reads the outbox and publishes to Kafka/SQS. Guarantees at-least-once delivery with local atomicity.
TCC (Try-Confirm-Cancel). A middle path between 2PC and Saga. Each service reserves resources (Try), then either confirms or cancels. Tightly coupled to the service contract; common in payments.
Idempotency keys. Not a transaction pattern but critical: ensure retries don't duplicate effects. Stripe's Idempotency-Key header is the reference.

If you can name these and pick one for a stated workload, you're at the staff bar.
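The idempotency-key item in the list above is easy to show in code. This is a minimal in-memory sketch, not Stripe's implementation: the `PaymentService` class, its `charge` method, and the response shape are all hypothetical, and a real service would persist the key in the same transaction as the charge.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Toy in-memory service (hypothetical); real ones persist keys durably."""
    charges: list = field(default_factory=list)
    _responses: dict = field(default_factory=dict)  # idempotency key -> saved response

    def charge(self, idempotency_key: str, customer: str, cents: int) -> dict:
        # A retry carrying a key we have already processed replays the saved
        # response instead of charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append((customer, cents))
        response = {"charge_id": len(self.charges), "cents": cents}
        self._responses[idempotency_key] = response
        return response


svc = PaymentService()
first = svc.charge("key-123", "alice", 500)
retry = svc.charge("key-123", "alice", 500)  # e.g. the client timed out and retried
```

The retry returns the same response as the first call and only one charge is recorded, which is the property every retried step in a saga or outbox consumer relies on.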

Two-phase commit — what to actually say

  Coordinator                Participant A          Participant B
  -----------                -------------          -------------
      |  -- PREPARE -->           |                       |
      |                           | write to redo log     |
      |                           | acquire locks         |
      |  <-- VOTE YES --          |                       |
      |  -- PREPARE --------------->                       |
      |                                                   | write/lock
      |  <-- VOTE YES ------------------------------------
      |                                                   |
      |  -- COMMIT -->            |                       |
      |                           | apply, release locks  |
      |  -- COMMIT --------------->                       |
      |                                                   | apply/release

The fatal property: between VOTE YES and the coordinator's COMMIT, participants are blocked with locks held. If the coordinator crashes here, participants wait indefinitely. A participant that has voted yes cannot safely abort on a timeout by itself, because it doesn't know whether the coordinator already sent COMMIT to someone else; deciding unilaterally risks inconsistency on recovery.
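The blocking window is easier to discuss with a toy model. The sketch below is a single-process simulation under simplifying assumptions (no network, no durable log); `Participant`, `two_phase_commit`, and the `crash_before_decision` flag are illustrative names, not a real library API.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"    # voted yes, locks held, outcome unknown
    COMMITTED = "committed"
    ABORTED = "aborted"


class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = State.INIT

    def prepare(self) -> bool:
        # Phase 1: persist intent and take locks, then vote.
        if self.can_commit:
            self.state = State.PREPARED
            return True
        self.state = State.ABORTED
        return False

    def finish(self, commit: bool) -> None:
        # Phase 2: apply the coordinator's decision and release locks.
        # A participant that voted no in phase 1 stays aborted.
        if self.state is State.PREPARED:
            self.state = State.COMMITTED if commit else State.ABORTED


def two_phase_commit(participants, crash_before_decision: bool = False) -> str:
    votes = [p.prepare() for p in participants]
    if crash_before_decision:
        # Simulated coordinator crash between the votes and the decision.
        return "coordinator crashed"
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return "committed" if decision else "aborted"


a, b = Participant("orders"), Participant("inventory")
outcome = two_phase_commit([a, b], crash_before_decision=True)
in_doubt = [p.name for p in (a, b) if p.state is State.PREPARED]
```

After the simulated crash both participants are left in `PREPARED`: they hold their locks and have no local way to learn the outcome, which is exactly the state an interviewer will ask you to resolve.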

Real-world use of 2PC:

XA transactions in relational databases. Supported in MySQL, Postgres (as prepared transactions), Oracle, SQL Server. Mostly used by integration layers (IBM WebSphere, JBoss/WildFly) and mostly avoided in modern microservices.
Inside Spanner/CockroachDB. 2PC over consensus groups (Paxos in Spanner, Raft in CockroachDB): the coordinator is itself a replicated group, so it doesn't block on a single failure.
In Kafka transactions. The transactional API uses a coordinator broker and 2PC-like semantics to atomically publish across partitions.

If you say "we'll use 2PC between our order service and our inventory service," the interviewer will ask what happens if the coordinator crashes after both services have voted yes, and you'll have to describe a 20-minute incident. Usually not the right answer in a microservices question.

Sagas — the microservices answer

A saga is a sequence of local transactions T1, T2, ..., Tn, each with a compensating transaction C1, C2, ..., C(n-1). If Ti fails, we execute C(i-1), C(i-2), ..., C1 in reverse order to undo the work.
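The definition above maps onto a few lines of code. This is a minimal in-process sketch, assuming each step is a plain callable paired with an optional compensation; `run_saga`, the step names, and the simulated failure are illustrative, and a production orchestrator would also persist progress between steps.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (action, compensation); the last step needs no compensation.
Step = Tuple[Callable[[], None], Optional[Callable[[], None]]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                if undo is not None:
                    undo()
            return False
        completed.append((action, compensation))
    return True


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("shipping unavailable")  # simulated failure of T3


ok = run_saga([
    (lambda: log.append("T1 create order"), lambda: log.append("C1 cancel order")),
    (lambda: log.append("T2 reserve payment"), lambda: log.append("C2 release payment")),
    (fail_shipping, None),
])
```

With T3 failing, `log` ends up holding T1, T2, C2, C1 in that order: the compensations run newest-first, and the failed step itself is not compensated because it never committed.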

Two execution styles:

Choreography. Each service reacts to events. Service A commits T1 and emits OrderCreated; Service B consumes it, commits T2, emits PaymentReserved; etc. No central coordinator. Easy to start, hard to reason about as steps grow.
Orchestration. A saga orchestrator (Temporal, AWS Step Functions, Camunda, Netflix Conductor) drives the steps explicitly. Easier to reason about, debug, and monitor. Adds a runtime dependency.

Orchestrated sagas are the modern default for anything non-trivial. Temporal in particular has become the reference implementation — durable execution, automatic retries, replay-based recovery.

The hard properties of sagas:

Not atomic. There's a window where T1 has committed and T2 hasn't. Users and other services can observe this intermediate state.
Compensations are business-level, not technical. You don't roll back a credit card charge; you issue a refund. The compensation may have side effects (notifying the user, accounting entries).
Compensations must be idempotent. The orchestrator may retry a compensation. If issuing a refund runs twice, you pay the customer back twice.
Some steps can't be compensated. If step T3 sends an email, you cannot un-send it. Plan the saga so irreversible steps happen last, or design a "soft" compensation (a second email apologizing).
Isolation is weak. Read skew during a saga in progress is a real concern. Semantic locks, commutative operations, or status flags mitigate it.
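The semantic-lock mitigation from the last item can be sketched with a status flag. Everything here is hypothetical (the `Order` class, the status names, and the helper functions); the point is only that a pending status tells other code paths a saga is still in flight.

```python
class Order:
    def __init__(self, order_id: int):
        self.order_id = order_id
        self.status = "PENDING"  # semantic lock: a saga is still in flight

    def approve(self) -> None:
        self.status = "APPROVED"  # the saga's final step releases the lock

    def reject(self) -> None:
        self.status = "REJECTED"  # the compensation path releases the lock


def can_ship(order: Order) -> bool:
    # Other services treat anything that is not APPROVED as unsafe to act on.
    return order.status == "APPROVED"


def cancel_by_user(order: Order) -> bool:
    # A concurrent request must not modify an order mid-saga; refuse it so
    # the caller can retry once the saga has finished.
    if order.status == "PENDING":
        return False
    order.status = "CANCELLED"
    return True


order = Order(42)
blocked = cancel_by_user(order)  # refused: the saga is still running
order.approve()
```

Readers of the intermediate state still see the order, but the flag makes it explicit that the data is provisional, which is usually enough to keep read skew from turning into a wrong decision.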

Chris Richardson's microservices.io/patterns/data/saga.html is the canonical writeup. Cite it.

The outbox pattern — the piece nearly everyone needs

Any service that wants to "update my DB and publish a message" atomically needs the outbox pattern. You cannot write to Postgres and then publish to Kafka without a window where one succeeded and the other failed.

BEGIN;
  UPDATE orders SET status = 'placed' WHERE id = 42;
  INSERT INTO outbox (event_type, payload, created_at)
    VALUES ('OrderPlaced', '{...}', now());
COMMIT;

-- Separate relay process, run in a loop:
SELECT * FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100;
-- For each row: publish to Kafka, wait for the broker ack, then mark it published:
UPDATE outbox SET published_at = now() WHERE id = ...;

At-least-once delivery. Consumers must be idempotent. Combine with a message_id so consumers can deduplicate.
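The same flow can be run end to end with an in-memory SQLite database standing in for Postgres and a plain callable standing in for the Kafka producer. Both substitutions, along with the `place_order` and `relay_once` names, are assumptions made for the sake of a self-contained sketch.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT, payload TEXT, published_at TEXT
    );
    INSERT INTO orders (id, status) VALUES (42, 'new');
""")


def place_order(order_id: int) -> None:
    # One local transaction covers both the state change and the event row.
    with db:
        db.execute("UPDATE orders SET status = 'placed' WHERE id = ?", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish unpublished rows in id order; mark each only after publish returns."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY id LIMIT 100"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(row_id, event_type, payload)  # if this raises, the row stays unpublished
        with db:
            db.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (row_id,),
            )
    return len(rows)


sent = []
place_order(42)
relay_once(lambda *event: sent.append(event))
```

If the relay crashes after `publish` returns but before the `UPDATE` commits, the row is published again on the next pass. That is the at-least-once behavior described above, and the outbox row id doubles as a message_id for consumer-side deduplication.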

Modern variations:

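CDC-based outbox. Use Debezium or AWS DMS to tail the Postgres WAL; changes to the outbox table become Kafka messages automatically. No hand-written polling relay, and because the service only ever writes to its own database, the dual write disappears. Delivery is still at-least-once, so consumers must still deduplicate.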
Transactional outbox in NoSQL. DynamoDB Streams, Cosmos DB change feed, MongoDB change streams — same pattern, different plumbing.
Inbox pattern. Mirror for consumers: consume a message, write it to an inbox table in the same transaction as the domain update, so re-delivery is detected and ignored.
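The inbox pattern from the last item is the consumer-side mirror of the outbox. Here is a minimal sketch with SQLite standing in for the consumer's database; the table names, the `handle_order_placed` handler, and the message ids are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE inbox (message_id TEXT PRIMARY KEY);
    CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER);
    INSERT INTO stock VALUES ('widget', 10);
""")


def handle_order_placed(message_id: str, sku: str, qty: int) -> bool:
    """Apply the message once; return False if it was a redelivery."""
    try:
        with db:  # inbox insert and domain update commit or roll back together
            db.execute("INSERT INTO inbox (message_id) VALUES (?)", (message_id,))
            db.execute("UPDATE stock SET qty = qty - ? WHERE sku = ?", (qty, sku))
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key conflict: this message was already processed


handle_order_placed("msg-1", "widget", 2)
handle_order_placed("msg-1", "widget", 2)  # the broker redelivers the same message
remaining = db.execute("SELECT qty FROM stock WHERE sku = 'widget'").fetchone()[0]
```

The primary key on `message_id` does the deduplication: the second delivery hits the constraint, the transaction rolls back, and the stock is decremented only once.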

Candidates who mention the outbox before being asked signal they've shipped systems that publish reliably. It is the most underrated pattern in interview answers.

TCC — Try-Confirm-Cancel

TCC formalizes a two-phase commit at the application level. Each participant exposes three operations:

Try: reserve resources, validate preconditions. Holds a soft lock.
Confirm: commit the reservation. Must succeed eventually.
Cancel: release the reservation.

Payment processors use this heavily. "Authorize" the card (Try), then "capture" (Confirm) or "void" (Cancel). The authorization reserves the funds on the card; it either becomes a real charge or expires.

TCC vs 2PC: in 2PC the participant locks at prepare time and waits on the coordinator. In TCC, the participant "reserves" in application semantics with its own timeout, so a dead coordinator doesn't block indefinitely.
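The difference from 2PC shows up clearly in a toy participant. The sketch below is an assumption-laden illustration: `InventoryTCC`, the `reservation_ttl` parameter, and passing the clock in as `now` are all invented for the example, and a real participant would store reservations durably.

```python
class InventoryTCC:
    """Toy TCC participant (hypothetical); reservations expire without the caller."""

    def __init__(self, stock: int, reservation_ttl: float = 30.0):
        self.stock = stock
        self.reservation_ttl = reservation_ttl
        self.reservations = {}   # txn_id -> (qty, expires_at)
        self.confirmed = set()   # txn_ids already confirmed

    def try_reserve(self, txn_id: str, qty: int, now: float) -> bool:
        self._expire(now)
        if qty > self.stock:
            return False
        self.stock -= qty  # soft lock: set the units aside
        self.reservations[txn_id] = (qty, now + self.reservation_ttl)
        return True

    def confirm(self, txn_id: str) -> bool:
        if txn_id in self.confirmed:
            return True   # idempotent: a repeated Confirm is a no-op
        if txn_id not in self.reservations:
            return False  # expired or cancelled; the caller must handle this
        del self.reservations[txn_id]
        self.confirmed.add(txn_id)
        return True

    def cancel(self, txn_id: str) -> None:
        # Idempotent: units are returned only if the reservation still exists.
        qty, _ = self.reservations.pop(txn_id, (0, 0.0))
        self.stock += qty

    def _expire(self, now: float) -> None:
        # The participant's own timeout: stale reservations are released
        # without waiting for the coordinator.
        for txn_id, (_, expires_at) in list(self.reservations.items()):
            if expires_at <= now:
                self.cancel(txn_id)


inv = InventoryTCC(stock=5, reservation_ttl=30.0)
inv.try_reserve("t1", 3, now=0.0)            # coordinator for t1 then goes silent
early = inv.try_reserve("t2", 3, now=10.0)   # refused: t1 still holds 3 of 5 units
late = inv.try_reserve("t2", 3, now=31.0)    # t1 has expired, so this succeeds
```

The abandoned `t1` reservation is released by the participant's own timeout, so `t2` is blocked for at most the TTL rather than indefinitely. The cost is that a late Confirm for `t1` now fails and has to be handled by the caller.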

When to use what

Intra-database multi-row atomicity: local ACID transaction. Don't reach for distributed patterns if everything is in one store.
Cross-service where intermediate state must be tightly controlled and the flow needs central visibility: saga with orchestrator, with semantic locks to hide in-flight state.
Cross-service with intermediate state tolerable and loose timing OK: saga with choreography via events.
Publish a message on DB write: transactional outbox.
Payments or reservation-style flows: TCC.
Across shards of the same database: the DB's built-in transaction (Spanner, CockroachDB, Vitess). Don't build 2PC yourself.
You genuinely need atomic commit across independent services: 2PC via XA — and be prepared to explain why a saga didn't fit.

Common candidate mistakes

Reaching for 2PC reflexively. It's the textbook answer and usually the wrong one for microservices. Sagas are the modern default.
Forgetting idempotency. Every operation in a distributed transaction must tolerate being retried. Use idempotency keys, version numbers, or natural keys.
Missing the dual-write trap. "Write to DB, then publish to Kafka" is broken. Use the outbox pattern.
Ignoring saga isolation. A half-completed saga leaves intermediate state visible. Design the data model so partial states are safe (status flags, pending states).
Assuming compensations always succeed. They can fail too. The orchestrator needs retries, circuit breakers, and ultimately a dead-letter queue for human intervention.
Not handling duplicate compensations. If the orchestrator loses track of whether compensation ran, it runs it again. Every compensation must be idempotent.
Blending choreography and orchestration carelessly. Choose one primarily; mix only when you have a clear reason.
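The point above about compensations failing can be made concrete with a small retry wrapper. This is a sketch under assumptions: `compensate_with_retry`, the attempt count, the delays, and the list standing in for a dead-letter queue are all illustrative, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time
from typing import Callable, List


def compensate_with_retry(
    compensation: Callable[[], None],
    dead_letter: List[str],
    saga_id: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a compensation with exponential backoff, then park it for a human."""
    for attempt in range(max_attempts):
        try:
            compensation()
            return "COMPENSATED"
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s with the defaults
    dead_letter.append(saga_id)  # never continue silently
    return "COMPENSATION_FAILED"


def always_fails() -> None:
    raise RuntimeError("refund API down")  # simulated permanent failure


delays: List[float] = []
dlq: List[str] = []
status = compensate_with_retry(always_fails, dlq, "saga-7", sleep=delays.append)
```

Because the wrapper may call the compensation several times, it only works if the compensation itself is idempotent, which ties the last three mistakes in the list together.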

Real-world references

Stripe's Idempotency-Key. The de facto reference for idempotent API requests. Every payments interview probes this.
Uber Cadence / Temporal. Workflow engines that power durable sagas at Uber, Coinbase, and elsewhere. The Temporal blog has excellent saga write-ups.
Netflix Conductor. Open-source orchestration engine used for Netflix's content workflows.
AWS Step Functions. Managed saga orchestration with visual state machine — often the right answer in AWS-heavy stacks.
Debezium. CDC-based outbox implementation for Postgres, MySQL, MongoDB, SQL Server. The reference open-source choice.
Eventuate Tram. Chris Richardson's saga framework; the book "Microservices Patterns" is the canonical reference.
Kafka transactions. Exactly-once semantics across topics via a transactional coordinator. Useful for stream-processing, not for cross-service OLTP.

Advanced follow-ups

"What if a compensation fails?" Answer: retry with exponential backoff, then dead-letter to an operator. Some systems mark the saga as COMPENSATION_FAILED and require human intervention. Do not silently continue.
"How do you test sagas?" Answer: replay engines (Temporal has one built-in), fault injection on each step, and explicit tests for every compensation path.
"How do you guarantee the outbox relay doesn't lose messages?" Answer: relay writes published_at after the broker acks. CDC-based relays are even safer because they tail the WAL directly.
"What about exactly-once?" Answer: exactly-once doesn't exist; what you get is at-least-once delivery plus idempotent consumers. State it that way.
"How do you handle isolation issues in long-running sagas?" Answer: semantic locks (a pending status on the affected entity), commutative operations (credits/debits rather than overwrites), or pessimistic locks in the first step.
"When is 2PC actually fine?" Answer: inside a single database cluster with a reliable coordinator (Spanner, CockroachDB). Cross-datacenter XA over flaky networks is where it falls apart.
"How do you observe a saga in flight?" Answer: workflow engines expose per-saga state and events. Emit domain events at each step into your observability stack. Distributed tracing with trace IDs spanning all services.

The candidates who ace distributed transaction questions are the ones who name the pattern, draw the compensations, acknowledge the weak isolation, and plan for the failure modes before being asked. Sagas are the modern default; the outbox pattern is the silent requirement underneath almost every event-driven system; 2PC is the answer you almost never want but must be able to describe.

If you can walk a whiteboard through an orchestrated saga, name the outbox pattern on the DB-to-Kafka boundary, and state which steps are irreversible and why, you will outperform the majority of candidates. Distributed transactions are the topic where vocabulary and pattern recognition translate most directly into staff-plus signal.
