
Event-Driven Architecture Interview Guide: Events, Streams, and Choreography vs Orchestration

9 min read · April 25, 2026

Event-driven architecture is the section where weak candidates say Kafka and stop. Here is how to name the event type, pick choreography vs orchestration, and survive the ordering question.

Every system design interview above senior level eventually lands on an event-driven architecture question. "How would you handle order placement across inventory, payments, and shipping?" is the canonical setup. Weak candidates say "Kafka" and stop. Strong candidates pick an event model, name the ordering guarantees, and make a conscious choice between choreography and orchestration.

This guide walks through the event-driven architecture (EDA) conversation as it plays out at staff and principal interview loops. The topic is broad, but the signal is specific: can you name the event type, pick the coordination pattern, and survive the failure modes?

The three kinds of events and why the distinction matters

The first thing to get right: not all "events" are the same. Interviewers are checking whether you know the taxonomy.

  • Event-carried state transfer. The event contains the full state of the thing that changed: "OrderPlaced with this entire order object." Consumers don't need to call back to the producer. Heavy events, loose coupling, easy to replay.
  • Event notification. The event says "something happened, here is an ID, go look it up." Light events, tight coupling (consumers need the producer's API to hydrate), easy for producers but pushes load onto them.
  • Event sourcing. The event log is the source of truth. State is derived by folding events. Immutable append-only history. Used in financial systems, audit-heavy domains, and anywhere regulators ask "show me how this balance got here."

Most production systems mix these. Stripe, for example, uses event notification for webhooks (here's a charge ID, go fetch it) and event sourcing internally for ledger operations.

In the interview, declare what kind of event you're using and why. "We'll use event-carried state transfer for the order service so downstream consumers don't hammer the order API" is a staff-level sentence. "We'll use Kafka" is not.
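The event-sourcing bullet above ("state is derived by folding events") can be made concrete with a minimal sketch. The event names and amounts here are illustrative, not from any real system:

```python
from dataclasses import dataclass

# Hypothetical account events for illustration.
@dataclass
class Deposited:
    amount: int  # cents

@dataclass
class Withdrew:
    amount: int  # cents

def balance(events) -> int:
    """Derive current state by folding over the immutable event log."""
    total = 0
    for e in events:
        if isinstance(e, Deposited):
            total += e.amount
        elif isinstance(e, Withdrew):
            total -= e.amount
    return total

# The append-only log is the source of truth; balance is always recomputable.
log = [Deposited(10_00), Deposited(5_00), Withdrew(3_00)]
print(balance(log))  # 1200
```

The key property for the audit-heavy domains mentioned above: the fold is deterministic, so "show me how this balance got here" is just a replay of the log.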

Choreography vs orchestration: the question that distinguishes principal candidates

This is the distinction that matters most in EDA interviews and the one most candidates get wrong.

Choreography. Services react to events independently. No central coordinator. The order service publishes OrderPlaced, payment service listens and publishes PaymentAuthorized, inventory service listens and publishes InventoryReserved, shipping service listens and publishes ShipmentCreated. Loose coupling, no single point of control, classic pub/sub.

Orchestration. A central orchestrator (workflow engine) coordinates the sequence. It calls payment, then inventory, then shipping, in order. The orchestrator owns the state machine, retries, compensation, and timeouts.

When choreography wins: when flows are linear and short, when teams own services independently and want autonomy, when you can tolerate eventual consistency and implicit flow, and when you want maximum decoupling.

When orchestration wins: when flows are complex and multi-step with conditional branches, when you need visibility into where a flow stalled (choreography makes this surprisingly hard), when compensation (saga rollback) is non-trivial, and when business stakeholders need to see workflow status in a dashboard.

Tools and names you should drop: Temporal (the current gold standard for durable execution and orchestration at engineering-heavy companies like Snap, Box, and Uber — though Uber's Cadence predates it), AWS Step Functions, Netflix Conductor, Camunda, Apache Airflow (for batch), Restate (newer, worth mentioning if you want to sound current). For pure choreography, just Kafka / RabbitMQ / AWS EventBridge plus consumer-side logic.

The principal-level answer is: "Choreography for short, linear flows with well-understood contracts. Orchestration via Temporal for long-running, multi-step workflows with compensation. Real systems run both — orchestration inside a bounded context, choreography between bounded contexts."
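The orchestration-plus-compensation idea can be sketched in a few lines. This is a toy saga runner, not how Temporal or Step Functions actually implement durable execution (real engines persist state and retry); the step functions are hypothetical stand-ins for calls to payment and inventory services:

```python
# Minimal saga orchestrator sketch: each step pairs a forward action with a
# compensation. On failure, already-completed steps are compensated in
# reverse order.

def run_saga(steps, order):
    done = []
    try:
        for action, compensate in steps:
            action(order)
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(order)  # best-effort rollback; real engines persist and retry
        return "compensated"
    return "completed"

# Illustrative steps: charging succeeds, reserving inventory fails.
def charge(o): o["charged"] = True
def refund(o): o["charged"] = False
def reserve(o): raise RuntimeError("out of stock")
def release(o): pass

order = {}
result = run_saga([(charge, refund), (reserve, release)], order)
print(result, order)  # compensated {'charged': False}
```

The point the interview answer should land: the orchestrator, not the individual services, owns the state machine and the rollback order.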

What interviewers actually test for

Beyond the event-type and coordination-pattern signals, interviewers push on:

  • Ordering guarantees. Kafka guarantees order per partition, not across partitions. RabbitMQ guarantees order per queue, not across queues. SQS standard is unordered; SQS FIFO is ordered per message group. If you need global order, you're in a world of pain (single partition = single writer = scalability cliff). Usually you don't need it — just per-entity order.
  • Exactly-once delivery. Doesn't exist in a distributed system without careful cooperation. What actually exists: at-least-once delivery plus idempotent consumers. Name this. Mention idempotency keys, dedup tables, or Kafka transactional producers with read-committed consumers for the "exactly-once semantics" within Kafka specifically.
  • Schema evolution. How do you add a field to an event without breaking consumers? Confluent Schema Registry with Avro or Protobuf, backward/forward compatibility rules. Know the difference.
  • Replay and backfill. Event logs enable time-travel. New consumer? Start from the beginning. Bug in a consumer? Reset offset and reprocess. This is a Kafka superpower versus queue-based systems.
  • Dead-letter queues and poison messages. A message a consumer can't process will block the partition (Kafka) or infinite-loop (SQS without DLQ). Every EDA design needs a DLQ strategy.
  • Backpressure. If producers outpace consumers, queues grow unbounded. Mention consumer lag monitoring, autoscaling based on lag (KEDA for Kubernetes), or producer throttling.
  • Observability. Distributed tracing across async boundaries is hard. Mention OpenTelemetry context propagation through message headers. Tools like Honeycomb, Datadog APM, Jaeger.
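The "at-least-once plus idempotent consumers" point from the list above is worth being able to sketch. In production the dedup store would be a database table or Redis `SET NX`, not an in-memory set; the event shape here is hypothetical:

```python
# Idempotent consumer sketch: at-least-once delivery means duplicates WILL
# arrive, so the handler checks a dedup key (the event ID) before applying
# the state change.

processed_ids = set()           # stand-in for a dedup table / Redis SET NX
inventory = {"sku-1": 10}

def handle(event):
    if event["id"] in processed_ids:
        return "duplicate-skipped"
    inventory[event["sku"]] -= event["qty"]
    processed_ids.add(event["id"])  # record the key with the state change
    return "processed"

evt = {"id": "evt-123", "sku": "sku-1", "qty": 2}
handle(evt)
handle(evt)  # broker redelivery: second call is a no-op
print(inventory["sku-1"])  # 8, not 6
```

The subtlety to name in the interview: the dedup write and the state change must commit atomically, otherwise a crash between them reintroduces the duplicate.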

The tradeoffs you need to name

  • Coupling. EDA reduces temporal coupling (producer and consumer don't need to be online at the same time) but increases semantic coupling (consumers must understand the event schema). Name both axes.
  • Complexity tax. EDA is more complex than synchronous RPC. Debugging an async flow across five services is genuinely harder. Don't pretend it's free.
  • Eventual consistency. Consumers see updates after a delay. If the user clicks "buy" and the confirmation page expects the order to exist, you need either a synchronous path for the user-facing response or a pattern like outbox-with-polling for read-your-writes.
  • Storage cost. An event log with 90-day retention of billions of events is expensive. Tiered storage (Kafka with S3 tiering, Warpstream, Redpanda BYOC) has become the modern answer.
  • Fault isolation. A slow consumer doesn't slow the producer — this is the big win. But a broken consumer that can't process poison messages stops its queue. Name the DLQ.

When you should NOT use event-driven architecture

Not every system needs EDA. Resist the default.

  • Simple CRUD apps with synchronous UX. Don't introduce Kafka to save a user clicking a button. A synchronous REST call is fine.
  • Strict linearizable requirements across multiple services. EDA is eventually consistent. If you need strong consistency across services, either colocate them on one database with transactions or use a distributed transaction pattern with all its costs.
  • Low-volume systems. Kafka has operational overhead. For a few thousand messages a day, a simple database outbox + poller is enough.
  • When you can't operate it. Kafka in production is a serious operations investment. If your team doesn't have that, use a managed offering (MSK, Confluent Cloud, Redpanda Cloud) or choose simpler tooling.
  • When you're really trying to solve a request-reply problem. If you want an immediate response, you want RPC or gRPC, not events. Forcing events into a request-reply shape (send request event, wait for response event) is almost always worse than calling an API.

Real-world example: Uber's dispatch and order systems

Uber's dispatch system is a well-documented EDA example. Trips and Eats orders generate events that fan out to matching, ETA calculation, surge pricing, driver notification, and analytics. Uber chose orchestration (via Cadence, the internal workflow engine that begat Temporal) for the trip lifecycle specifically because trip state is long-lived, has complex timeouts (the driver has N minutes to accept, the rider can cancel, retries happen), and needs compensating actions on failure.

Other canonical examples to name:

  • LinkedIn Kafka. The origin story of Kafka itself; the 2012 paper is foundational. LinkedIn runs trillions of messages a day.
  • Netflix. Massive Kafka fleet (trillions of events daily), plus custom orchestration layers (Conductor) and event-sourced state in Cassandra.
  • Stripe's outbox pattern. For webhooks, Stripe uses a transactional outbox: write the outbound event to the same database transaction as the domain write, then a poller ships events to the external world. Canonical pattern.
  • Shopify's flash-sale architecture. Queue-backed admission with Kafka, backpressure to control inventory, async order processing.
  • Financial exchanges (LMAX Disruptor). Ultra-low-latency event processing with a ring buffer. Worth a mention for anyone interviewing in fintech.

Common candidate mistakes

  • Ignoring idempotency. "We'll use Kafka for at-least-once." Fine, but then you need idempotent consumers. Name the idempotency key (event ID, or a natural business key) and where it's checked (dedup table, Redis SET NX).
  • Coupling the event schema to the producer's internal model. The event contract is a public API. Treat it with the same care. Don't leak internal field names or implementation details.
  • Building a distributed monolith. If every service must coordinate on every event, you don't have a decoupled system — you have a monolith with network hops. Bounded contexts matter.
  • Forgetting the outbox pattern. Writing to the DB and publishing to Kafka as separate steps means one can fail. Transactional outbox (write event to DB in same transaction, poller publishes) or CDC (Debezium) is the answer.
  • Assuming Kafka solves ordering globally. Per-partition. Set the partition key to the entity ID (userId, orderId) to get per-entity ordering.
  • No schema registry. Events are contracts. Without a registry, consumers break on schema changes. Name Confluent Schema Registry or AWS Glue Schema Registry.
  • No lag monitoring or SLO. Consumer lag is the key metric for EDA health. "What's your consumer lag SLO" is a staff-level question.

Advanced follow-ups interviewers will ask

  • "How do you handle exactly-once?" Idempotent consumers with a dedup store, or Kafka transactional producers within Kafka's ecosystem. For external side effects (sending an email, charging a card), idempotency keys on the receiving end are the only real answer.
  • "What if a consumer is down for hours?" Events accumulate in the log. When the consumer comes back, it resumes from its last committed offset. This is the point of a log-based broker vs a queue.
  • "How do you reprocess events after a bug fix?" Reset consumer offset to a point before the bug, or spin up a new consumer group starting earlier. This is trivial in Kafka, impossible in SQS standard (messages are gone after ack).
  • "What about GDPR and personal data?" Events with PII are hard to delete. Options: use tombstone records (Kafka log compaction), encrypt PII with a per-subject key and delete the key on right-to-erasure, or keep PII out of events and pass only IDs.
  • "How do you version events?" Schema evolution with backward-compatible changes (add optional fields), versioned topics for breaking changes, or upcasters (read old format, transform to new).
  • "When do you use Kafka vs Pulsar vs Kinesis vs Pub/Sub?" Kafka: default, huge ecosystem. Pulsar: multi-tenancy, tiered storage built in, geo-replication. Kinesis: managed AWS, simpler, smaller scale ceiling. Pub/Sub: managed GCP, very simple, no partition model. Know tradeoffs at a glance.

The principal-level signal in an EDA interview is specificity about three things: what kind of event, what coordination pattern, and what failure modes. Candidates who hand-wave "we'll use Kafka" get a senior rating. Candidates who narrate "event-carried state transfer over Kafka partitioned by order ID, choreography for fulfillment, Temporal for the long-running refund workflow, outbox pattern from the order database, DLQ after three failures, and consumer lag SLO at under 10 seconds for the customer-facing path" get a principal rating.

Pick your events on purpose. Pick your coordination on purpose. Name what breaks.