Designing a Payment System Design Interview — Idempotency, Ledgers, and Reconciliation
A senior payment-system design answer lives or dies on idempotency, double-entry ledgers, and reconciliation. This guide gives the architecture, state model, failure-mode answers, and interview script.
A payment system design interview is not really about naming Stripe, Kafka, and a database in the first five minutes. A strong answer is about money correctness under retries, partial failures, delayed provider callbacks, and accounting close. Idempotency, ledgers, and reconciliation are the spine of the design because they prove you understand that a payment is not done when an API returns 200; it is done when every party agrees on the durable financial state.
This is the working guide for a senior system design answer: what to draw, what to say, how to handle failure modes, and where interviewers usually push after the happy path.
What the interviewer is testing in a payment system design interview
A payment design prompt usually starts broad: "Design payments for an e-commerce marketplace," "Design a wallet," or "Design a checkout system." The interviewer is not looking for one canonical architecture. They are checking whether you can separate product flow from financial truth.
The core signals are:
- Correctness over availability at the money boundary. You can be highly available for carts, pricing, and checkout sessions; you should be conservative when moving money.
- Idempotent external APIs. Clients, mobile networks, load balancers, and workers will retry. The same logical payment attempt must not charge twice.
- A ledger as source of truth. Balances should be derived from immutable entries, not overwritten totals.
- Asynchronous settlement. Authorization, capture, refund, dispute, and payout have different timelines.
- Reconciliation. You must compare your internal ledger against payment processor files, bank reports, and provider webhooks.
- Auditability. A senior design answer explains how to answer "why is this user's balance wrong?" two months later.
A simple way to frame the prompt back to the interviewer: "I'll design a card-payment checkout with auth, capture, refunds, and a merchant ledger. The principles also apply to wallets and bank transfers. I'll optimize for no duplicate charges, traceable state transitions, and nightly reconciliation."
The architecture to draw first
Start with a deliberately boring diagram. Boring is good here.
| Component | Job | Data it owns |
|---|---|---|
| Checkout API | Creates payment intent and accepts client confirmation | payment_intent, idempotency key, customer context |
| Payment orchestrator | Talks to processors, schedules retries, handles state transitions | provider attempt, state machine, retry metadata |
| Ledger service | Writes immutable double-entry records | ledger transactions and ledger entries |
| Webhook receiver | Accepts provider events and verifies signatures | raw provider events, dedupe keys |
| Reconciliation worker | Compares internal state to provider/bank reports | reconciliation runs, exceptions |
| Risk/fraud service | Scores payments before capture or payout | risk decision, review status |
| Notification service | Sends receipts and failure messages | delivery status |
The interview-friendly flow:
- Client asks Checkout API to create a payment intent.
- Checkout API stores a payment_intent with a client-supplied or server-generated idempotency key.
- Payment orchestrator sends an authorization request to the processor.
- Processor returns synchronous response or later sends a webhook.
- Orchestrator updates the payment state only through allowed transitions.
- Ledger service records money movement when the business event is financially real.
- Reconciliation worker verifies that processor settlement and internal ledger match.
Do not start with Kafka. Start with the state model. Messaging matters, but money systems fail when the state model is vague.
Payment lifecycle and state machine
A strong payment design has explicit states. The exact names vary, but the transitions should be narrow.
| State | Meaning | Allowed next states |
|---|---|---|
| created | intent exists, no provider call yet | authorizing, cancelled |
| authorizing | provider authorization in progress | authorized, failed, requires_action, unknown |
| authorized | card hold or payment approval exists | capture_pending, voided, expired |
| capture_pending | capture request submitted | captured, capture_failed, unknown |
| captured | merchant can recognize charge | settled, refund_pending, disputed |
| settled | funds confirmed in settlement report | refund_pending, disputed |
| refund_pending | refund request submitted | refunded, refund_failed, unknown |
| disputed | cardholder dispute opened | won, lost |
| unknown | provider result ambiguous | resolved by webhook or reconciliation |
The important move is the unknown state. Interviewers love to ask, "What happens if your service times out after the processor charged the card?" A junior answer retries blindly. A senior answer records the attempt as unknown, queries the provider by idempotency key or provider reference, and lets a reconciliation job resolve it before taking another irreversible action.
Use optimistic locking or compare-and-swap on state transitions. For example, update from authorizing to authorized only if the row is still authorizing. If two webhooks arrive or a worker retries, one transition wins and the other becomes a harmless duplicate.
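The compare-and-swap idea can be sketched with a conditional UPDATE. This is a minimal illustration, not a production implementation: the `payments` table, the `transition` helper, and the `ALLOWED` map (a small subset of the state table) are hypothetical names, and SQLite stands in for whatever database the real system uses.

```python
import sqlite3

# Hypothetical subset of the state table; a real system would encode every row.
ALLOWED = {
    "created": {"authorizing", "cancelled"},
    "authorizing": {"authorized", "failed", "requires_action", "unknown"},
}

def transition(conn, payment_id, expected, target):
    """Move a payment from `expected` to `target`; return True only if this call won."""
    if target not in ALLOWED.get(expected, set()):
        raise ValueError(f"illegal transition {expected} -> {target}")
    # Compare-and-swap: the WHERE clause turns the update into a no-op
    # if another writer already moved the row out of `expected`.
    cur = conn.execute(
        "UPDATE payments SET state = ? WHERE id = ? AND state = ?",
        (target, payment_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
conn.execute("INSERT INTO payments VALUES ('pi_1', 'authorizing')")

first = transition(conn, "pi_1", "authorizing", "authorized")   # this call wins
second = transition(conn, "pi_1", "authorizing", "authorized")  # duplicate, changes nothing
```

The caller that gets `False` back knows it lost the race and can re-read the row instead of repeating side effects.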
Idempotency keys: the exact design
Idempotency is the most important concept in a payment system design interview because every layer retries. The client retries when the app loses connectivity. The API gateway retries on 502. The worker retries after a timeout. The provider may retry webhooks for hours.
For client-facing create/confirm APIs, store an idempotency record:
| Field | Purpose |
|---|---|
| idempotency_key | Unique key supplied by client or generated for a logical operation |
| scope | merchant_id + customer_id + endpoint, so keys do not collide globally |
| request_hash | Detects accidental reuse with different payload |
| status | in_progress, completed, failed_retryable, failed_final |
| response_body | Replay exact response for completed operation |
| resource_id | payment_intent_id or refund_id created by the first request |
| expires_at | Retention window, often days or weeks depending on product |
Decision rule: same key plus same request hash returns the same result; same key plus different payload returns a 409 conflict. That one sentence shows you understand idempotency as a semantic guarantee, not just a unique constraint.
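The decision rule can be sketched in a few lines. This is a single-process illustration under stated assumptions: the in-memory `_records` dict stands in for the idempotency table, the `handle` and `request_hash` helpers and the `IdempotencyConflict` exception are hypothetical names, and the `in_progress` status and expiry handling are deliberately left out.

```python
import hashlib
import json

class IdempotencyConflict(Exception):
    """Would map to an HTTP 409 in the API layer."""

# Stand-in for the idempotency table: (scope, key) -> stored hash and response.
_records = {}

def request_hash(payload):
    # Canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def handle(scope, key, payload, create_fn):
    h = request_hash(payload)
    record = _records.get((scope, key))
    if record is not None:
        if record["request_hash"] != h:
            raise IdempotencyConflict("key reused with a different payload")
        return record["response"]  # replay the stored response
    response = create_fn(payload)
    _records[(scope, key)] = {"request_hash": h, "response": response}
    return response

calls = []

def create_intent(payload):
    calls.append(payload)
    return {"payment_intent_id": f"pi_{len(calls)}", "amount": payload["amount"]}

r1 = handle("m1:c1:/intents", "k1", {"amount": 1000, "currency": "USD"}, create_intent)
r2 = handle("m1:c1:/intents", "k1", {"currency": "USD", "amount": 1000}, create_intent)
```

In a real service the record would be inserted under a unique constraint before the work starts, so that two concurrent requests with the same key cannot both call `create_fn`.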
For provider calls, pass a separate provider idempotency key when supported. Store it with the provider attempt. If the provider does not support idempotency, use your own attempt table and never issue a second irreversible call until the first attempt is resolved.
For webhooks, dedupe by provider event id and also design handlers to be naturally idempotent. A webhook that says "payment captured" should attempt the captured transition once and create ledger entries with a deterministic transaction id. If the same webhook arrives again, the transition and ledger insert should no-op.
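A webhook handler along these lines might look like the sketch below. The table names, the `handle_captured` function, and the returned status strings are assumptions for illustration; SQLite stands in for the production database, and signature verification is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE provider_events (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL);
CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE ledger_transactions (external_ref TEXT PRIMARY KEY);
INSERT INTO payments VALUES ('pi_1', 'capture_pending');
""")

def handle_captured(event_id, payment_id, raw):
    # One DB transaction: dedupe row, state transition, and ledger insert commit together.
    with conn:
        inserted = conn.execute(
            "INSERT OR IGNORE INTO provider_events VALUES (?, ?)", (event_id, raw)
        ).rowcount
        if inserted == 0:
            return "duplicate_event"
        moved = conn.execute(
            "UPDATE payments SET state = 'captured' "
            "WHERE id = ? AND state = 'capture_pending'",
            (payment_id,),
        ).rowcount
        # Deterministic reference: a second insert for the same capture is ignored.
        conn.execute(
            "INSERT OR IGNORE INTO ledger_transactions VALUES (?)",
            (f"capture:{payment_id}",),
        )
        return "applied" if moved == 1 else "already_captured"

results = [
    handle_captured("evt_1", "pi_1", "{}"),  # first delivery
    handle_captured("evt_1", "pi_1", "{}"),  # provider retry of the same event
    handle_captured("evt_2", "pi_1", "{}"),  # different event, same capture
]
```

The event-id check catches exact redeliveries, while the conditional update and the deterministic ledger reference cover the case where a different event describes the same capture.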
Ledgers: the source of financial truth
The ledger is where many candidates go shallow. A payments system should not store a mutable merchant_balance = 100.00 and update it in place as the source of truth. Store immutable double-entry records and derive balances from them.
Basic schema:
- ledger_accounts: customer cash, merchant receivable, platform clearing, processor clearing, fees, disputes, payouts.
- ledger_transactions: one business event, such as capture, refund, fee, or payout.
- ledger_entries: debit/credit lines that sum to zero per transaction.
- available_balance_snapshots: optional cached balance for fast reads, rebuildable from entries.
Example capture for a $100 order with a $3 platform fee:
| Account | Debit | Credit |
|---|---:|---:|
| Processor clearing | $100 | |
| Merchant payable | | $97 |
| Platform fee revenue | | $3 |
Example payout of $97 to merchant:
| Account | Debit | Credit |
|---|---:|---:|
| Merchant payable | $97 | |
| Bank cash | | $97 |
The ledger transaction should have a deterministic external reference: capture:{payment_intent_id} or refund:{refund_id}. That prevents duplicate ledger entries even when workers retry. Every ledger transaction must balance to zero in the smallest currency unit. Never use floating point for money; use integer cents or a decimal type with explicit currency.
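The capture and payout examples can be posted through a small double-entry sketch. The schema, the `post` and `balance` helpers, and the sign convention (positive for debit, negative for credit) are assumptions made for this illustration, with SQLite standing in for the ledger store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ledger_transactions (
    id INTEGER PRIMARY KEY,
    external_ref TEXT UNIQUE NOT NULL
);
CREATE TABLE ledger_entries (
    txn_id INTEGER NOT NULL REFERENCES ledger_transactions(id),
    account TEXT NOT NULL,
    currency TEXT NOT NULL,
    amount_minor INTEGER NOT NULL  -- positive = debit, negative = credit
);
""")

def post(external_ref, currency, lines):
    """Post a balanced transaction once; return False if external_ref already exists."""
    if sum(amount for _, amount in lines) != 0:
        raise ValueError("ledger transaction does not balance")
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO ledger_transactions (external_ref) VALUES (?)",
            (external_ref,),
        )
        if cur.rowcount == 0:
            return False  # a retry of an already-posted event
        conn.executemany(
            "INSERT INTO ledger_entries VALUES (?, ?, ?, ?)",
            [(cur.lastrowid, account, currency, amount) for account, amount in lines],
        )
    return True

def balance(account):
    # Balances are derived from entries, never stored as a mutable total.
    return conn.execute(
        "SELECT COALESCE(SUM(amount_minor), 0) FROM ledger_entries WHERE account = ?",
        (account,),
    ).fetchone()[0]

capture = [
    ("processor_clearing", 10000),
    ("merchant_payable", -9700),
    ("platform_fee_revenue", -300),
]
posted_first = post("capture:pi_1", "USD", capture)
posted_retry = post("capture:pi_1", "USD", capture)  # worker retry, no new entries
post("payout:po_1", "USD", [("merchant_payable", 9700), ("bank_cash", -9700)])
```

After the capture and the payout, `merchant_payable` nets to zero, and the sum over all entries is zero by construction.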
A senior answer also separates authorization state from ledger state. You usually do not credit merchant payable when a card is merely authorized. You write ledger entries when capture is confirmed or when the business decides an internal wallet balance should move.
Reconciliation: what closes the loop
Reconciliation is the part that turns a system design answer from plausible to production-grade. The job is to prove that your records match the outside world.
At minimum, reconcile four streams:
- Provider API state: query individual payment/refund attempts stuck in unknown.
- Provider webhooks: confirm that every event id was processed once.
- Settlement reports: compare captured/refunded/disputed amounts against daily processor files.
- Bank statements: compare actual cash movement for payouts and processor deposits.
A nightly reconciliation run should produce exception types, not just logs:
| Exception | Example | Action |
|---|---|---|
| Missing internal record | Processor says charge settled, no internal capture | create investigation ticket, block payout if needed |
| Missing provider record | Internal capture, provider has no charge | mark payment unsafe, notify support |
| Amount mismatch | Internal $100, provider $99.50 | check fee/currency/tax handling |
| Duplicate event | Two captures for one order | freeze account, reverse duplicate if real |
| Late dispute | Settled charge enters chargeback | move funds to dispute liability account |
Interview line to use: "The ledger is my source of customer and merchant balances, but reconciliation is my proof that the ledger matches processors and banks." That is the distinction interviewers want to hear.
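A reconciliation pass that emits typed exceptions rather than log lines can be sketched as follows. The `ReconException` record, the `reconcile` function, and the exception kind strings are hypothetical; the inputs are assumed to be already-parsed maps from a payment reference to an amount in minor units, which a real job would build from the ledger and the processor's settlement file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReconException:
    kind: str
    reference: str
    detail: str

def reconcile(internal, provider):
    """Compare {reference: amount_minor} maps and return typed exceptions."""
    exceptions = []
    for ref in sorted(internal.keys() | provider.keys()):
        ours, theirs = internal.get(ref), provider.get(ref)
        if ours is None:
            exceptions.append(
                ReconException("missing_internal_record", ref, f"provider={theirs}"))
        elif theirs is None:
            exceptions.append(
                ReconException("missing_provider_record", ref, f"internal={ours}"))
        elif ours != theirs:
            exceptions.append(
                ReconException("amount_mismatch", ref, f"internal={ours} provider={theirs}"))
    return exceptions

internal = {"pi_1": 10000, "pi_2": 5000, "pi_3": 2500}
provider = {"pi_1": 10000, "pi_2": 4950, "pi_4": 700}
found = reconcile(internal, provider)
```

Here `pi_2` is an amount mismatch, `pi_3` is missing at the provider, and `pi_4` is missing internally; each exception would feed a ticket or a payout hold rather than a log line.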
Handling failure modes without hand-waving
Payments are a failure-mode interview. Be ready with concrete answers.
Timeout after provider call. Store the attempt as unknown and do not retry the charge immediately. Query the provider using the provider idempotency key. If the result is still unknown, schedule retries with backoff and rely on webhooks and reconciliation to resolve it.
Webhook arrives before synchronous response. Webhook handler and API response handler both use the same state-transition function. One wins; the other sees the final state.
User double-clicks pay. Same idempotency key returns the same payment intent. Different key for the same cart can be blocked by an order-level invariant: only one active payment intent per order.
Database commit succeeds but message publish fails. Use an outbox table in the same database transaction as the state change. A relay publishes the outbox event later. Do not dual-write directly to DB and queue.
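A minimal outbox sketch, assuming a hypothetical `outbox` table and a `publish` callback that stands in for the real queue client. Delivery is at-least-once, so consumers still need to dedupe.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
INSERT INTO payments VALUES ('pi_1', 'capture_pending');
""")

def mark_captured(payment_id):
    # The state change and the event row commit or roll back together.
    with conn:
        conn.execute(
            "UPDATE payments SET state = 'captured' WHERE id = ?", (payment_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment.captured", json.dumps({"payment_id": payment_id})),
        )

def relay(publish):
    """Publish pending rows in order; mark a row only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # may raise; the row stays pending and is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)

sent = []
mark_captured("pi_1")
first_run = relay(lambda topic, payload: sent.append((topic, json.loads(payload))))
second_run = relay(lambda topic, payload: sent.append((topic, json.loads(payload))))
```

If the process crashes between `publish` and the `UPDATE`, the event is published again on the next run, which is why downstream handlers are expected to be idempotent.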
Processor partial outage. Route by payment method and merchant configuration if you support multiple processors, but avoid automatic failover for ambiguous attempts. Failover is safe before the first provider attempt; after an unknown attempt it can double-charge.
Refund after payout. Record it in the ledger as an amount the merchant owes the platform, or deduct it from a future payout. Product policy decides whether the platform fronts the refund.
Currency and rounding. Store currency on every amount. Calculate fees in integer minor units with deterministic rounding rules. Never add USD and EUR entries in the same ledger transaction unless conversion entries are explicit.
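The rounding and currency rules can be made concrete with a short sketch. The `Money` type, the `fee_minor` helper, the basis-point rate, and the choice of half-even rounding are all assumptions for illustration; the point is only that the rule is explicit and applied in integer minor units.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_EVEN

@dataclass(frozen=True)
class Money:
    amount_minor: int  # integer cents (or the currency's smallest unit)
    currency: str

    def __add__(self, other):
        # Refuse to mix currencies; conversion must be an explicit step.
        if self.currency != other.currency:
            raise ValueError(f"cannot add {self.currency} and {other.currency}")
        return Money(self.amount_minor + other.amount_minor, self.currency)

def fee_minor(amount_minor, rate_bps):
    """Fee in minor units for a rate in basis points, rounded half-even."""
    exact = Decimal(amount_minor) * Decimal(rate_bps) / Decimal(10_000)
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def split(total, rate_bps):
    """Split a payment so that merchant share plus fee always equals the total."""
    fee = fee_minor(total.amount_minor, rate_bps)
    return Money(total.amount_minor - fee, total.currency), Money(fee, total.currency)

merchant, fee = split(Money(10000, "USD"), 300)  # $100.00 at 3%
```

Computing the merchant share as the remainder, rather than rounding both sides independently, keeps the two parts summing exactly to the original amount.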
Scale, storage, and partitioning decisions
Payment throughput is usually lower than feed or chat throughput, but correctness pressure is higher. Partition by merchant or payment id for operational scaling. Keep ledger writes strongly consistent within a transaction boundary. Use read replicas or precomputed balance snapshots for dashboards, but make snapshots rebuildable.
For queues, partition events by payment_intent_id so ordering is preserved per payment. Global ordering is unnecessary. Retain raw provider webhooks for audit and replay, but separate raw payload storage from normalized events so provider schema changes do not leak everywhere.
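Per-payment ordering usually comes from deriving the partition from a stable hash of the payment id. The `partition_for` helper below is a hypothetical illustration; in practice many brokers do this for you when the payment id is supplied as the message key.

```python
import hashlib

def partition_for(payment_intent_id, num_partitions):
    """Map a payment id to a partition with a hash that is stable across processes."""
    # Python's built-in hash() is randomized per process for strings,
    # so a cryptographic digest is used here for a repeatable result.
    digest = hashlib.sha256(payment_intent_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

events = [("pi_1", "authorized"), ("pi_2", "authorized"), ("pi_1", "captured")]
assigned = [(pid, event, partition_for(pid, 12)) for pid, event in events]
```

Both `pi_1` events land on the same partition, so a single consumer sees them in order, while `pi_2` may be processed in parallel elsewhere.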
For observability, track business metrics, not only CPU and latency:
- authorization success rate by processor and card network
- unknown-state count and age
- duplicate idempotency-key conflicts
- ledger imbalance attempts blocked
- reconciliation exception count by type
- webhook delivery lag
- payout hold amount due to unresolved exceptions
These metrics help you explain operational maturity. They also give the interviewer natural places to push on alerting and support tooling.
Prep checklist and interview script
Use this answer structure in a 45-minute interview:
- Clarify scope. Card checkout? Wallet? Marketplace? Refunds? Payouts?
- Define invariants. No duplicate charge per order, ledger entries balance, every external attempt is traceable.
- Draw core services. Checkout API, orchestrator, ledger, webhook receiver, reconciliation.
- Walk happy path. Create intent, authorize, capture, ledger, receipt.
- Deep dive idempotency. Keys, request hash, response replay, provider attempt dedupe.
- Deep dive ledger. Double-entry model, deterministic transaction ids, balance snapshots.
- Deep dive reconciliation. Unknown states, settlement files, exception workflow.
- Discuss scale and operations. Partitioning, outbox, metrics, audit logs.
A concise closing: "The design favors boring consistency at the money boundary. Product APIs can be horizontally scalable, but state transitions, idempotency, and ledger writes are guarded because the cost of a duplicate or missing charge is much higher than a slightly slower checkout."
How to talk about this on a resume
If you have real payments experience, avoid vague bullets like "worked on payment processing." Use correctness language:
- Built idempotent payment orchestration for card authorization, capture, refund, and webhook replay.
- Designed double-entry ledger entries for merchant balances, fees, payouts, and disputes.
- Reduced unresolved payment states by adding provider-query retries and settlement reconciliation.
- Implemented an outbox-based event pipeline so payment state changes and downstream messages stayed consistent.
If you do not have payments experience, build a small project that demonstrates the same ideas: an order table, payment attempts, idempotency records, ledger entries, and a reconciliation script that compares against a fake provider CSV. That project is more credible than another generic "Stripe clone" because it shows you understand the hard part: money systems are state machines with audit trails, not just API wrappers.
