
Message Queues Interview Guide: Kafka vs RabbitMQ vs SQS for System Design

10 min read · April 25, 2026

Picking Kafka for every system design question is a senior-level tell. Here is how to actually choose between Kafka, RabbitMQ, and SQS based on the workload, not the buzzword.

The modal system design candidate picks Kafka for everything. Kafka for a thousand notifications a day. Kafka for job queues. Kafka for the order status webhook. This is a tell. Kafka is the right answer for some workloads, the wrong answer for many, and the senior signal in a system design interview is being able to explain which is which.

This guide covers Kafka, RabbitMQ, and SQS — the three you will be asked to compare — plus the supporting cast (Kinesis, Pub/Sub, NATS, Pulsar, Redpanda) that should round out your vocabulary at the staff level. The goal is not to list features. The goal is to be able to pick one on purpose and defend the pick.

The fundamental distinction: log vs queue

Before you name any product, name the model. Interviewers are listening for this vocabulary.

Log-based systems (Kafka, Pulsar, Kinesis, Redpanda) store messages in an append-only, partitioned log. Consumers track their position (offset) independently. Messages are not deleted on consumption; they're retained by time or size. Multiple consumer groups read the same log independently. Replay is trivial — seek to an earlier offset and reprocess.

Queue-based systems (RabbitMQ, SQS, ActiveMQ, traditional JMS) store messages in a queue where consumption removes them. Once acknowledged, gone. Multiple consumers compete for messages (work distribution). Replay generally requires re-publishing.

These are different tools for different jobs. A log is good at event streaming, fan-out to many consumers, and replay-based recovery. A queue is good at work distribution, task processing, and request-response patterns.

Many "message queue" questions are actually event streaming questions (fan out order events to five downstream systems) and many are actually work queue questions (process user-uploaded videos with workers). Diagnose which before you pick the product.

Kafka: when it actually fits

Kafka fits when you have:

  • High throughput event streams. Millions of events per second is Kafka's sweet spot. LinkedIn, Uber, Netflix run trillions of messages daily on it.
  • Multiple independent consumers of the same data. Fan-out to analytics, fraud detection, notifications, and cache invalidation off a single producer.
  • Replay as a core requirement. Time-travel debugging, backfilling new consumers, recovering from consumer bugs. Kafka is uniquely good at this.
  • Per-key ordering. Partition by entity key and you get ordered processing per entity without global ordering constraints.
  • Integration with a streaming ecosystem. Flink, Spark Streaming, ksqlDB, Kafka Streams, Debezium, Schema Registry — the ecosystem is the real reason Kafka wins at scale.
  • Long retention as an integration surface. Want a new service to consume the last seven days of events? Kafka makes that a config flag.
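
The per-key ordering point deserves a concrete sketch. Kafka's default partitioner hashes the message key (murmur2) to pick a partition; the stable hash below is a stand-in for illustration, not the real algorithm. All events for one key land on one partition and stay ordered relative to each other, while different keys process in parallel:

```python
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash so the same key always maps to the same partition.
    # (Kafka's default partitioner uses murmur2; md5 stands in here.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for order-42 hash to one partition, so they are consumed in
# order relative to each other; order-7 may be handled in parallel.
events = [("order-42", "created"), ("order-7", "created"),
          ("order-42", "paid"), ("order-42", "shipped")]
partitions = {}
for key, event in events:
    partitions.setdefault(partition_for(key), []).append((key, event))

assert partition_for("order-42") == partition_for("order-42")
# Gotcha: changing num_partitions remaps keys, which breaks per-key
# ordering across the resize -- the reason "just add partitions" is not free.
```

This is also why "global ordering" is expensive: it forces a single partition, which caps throughput at one consumer.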

Kafka does not fit when you have:

  • Low throughput. A few thousand messages a day doesn't justify a Kafka cluster. You're paying operational cost for no benefit.
  • Complex routing needs. Kafka's routing is "topic + partition key." If you need content-based routing with exchanges and bindings, RabbitMQ wins.
  • Per-message TTL or delay scheduling. Kafka doesn't do "deliver this message in 30 minutes" natively. RabbitMQ (via the delayed message plugin) and SQS (via delay queues, capped at 15 minutes) can.
  • Priority queues. Kafka has no concept of message priority. RabbitMQ has priority support.
  • True exactly-once across arbitrary sinks. Kafka transactions give you exactly-once within Kafka. For external sinks (send an email, charge a card), you still need idempotency on the consumer side.
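
That last bullet is worth making concrete, because "design idempotent consumers" comes up in nearly every follow-up. A minimal sketch, with hypothetical names: the dedup store here is an in-memory set, but in production it would be a unique-key constraint or a Redis `SETNX` with a TTL.

```python
processed = set()  # stand-in for a unique-key column or Redis SETNX in production

def charge_card(payment_id: str, amount_cents: int) -> str:
    """At-least-once delivery means this may run twice for one logical message;
    the idempotency key turns the duplicate into a no-op, not a double charge."""
    if payment_id in processed:
        return "duplicate-ignored"
    processed.add(payment_id)
    # ... call the payment provider here ...
    return "charged"

assert charge_card("pay_123", 999) == "charged"
assert charge_card("pay_123", 999) == "duplicate-ignored"  # redelivery is safe
```

The pattern is the same whether the redelivery comes from a Kafka rebalance, an SQS visibility timeout, or a RabbitMQ requeue: the broker gives you at-least-once, and the consumer supplies the "exactly" part.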

Kafka's operational reality: running your own cluster is non-trivial. JVM tuning, ZooKeeper (or KRaft in newer versions), disk sizing, broker rebalancing. Managed options — Confluent Cloud, Amazon MSK, Aiven, Redpanda Cloud, WarpStream (S3-backed, cheap) — are worth mentioning. At the staff level, acknowledge that DIY Kafka is a platform team investment.

RabbitMQ: still the right answer more often than candidates admit

RabbitMQ is a mature, feature-rich AMQP broker that most senior candidates undersell.

RabbitMQ fits when you have:

  • Work queue distribution. Workers compete for tasks. Image processing, email sending, webhook dispatch. This is the canonical RabbitMQ use case.
  • Complex routing. AMQP exchanges (direct, topic, fanout, headers) let you do routing logic at the broker. "Route this message if type=order AND region=us" is a one-liner in RabbitMQ.
  • Request-reply patterns. RabbitMQ's reply-to semantics and RPC patterns are well-supported.
  • Priority and delay. Priority queues and delayed message plugin work out of the box.
  • Moderate throughput with low latency. Tens of thousands of messages per second per node, sub-millisecond. Not Kafka scale, but fine for most apps.
  • Feature-rich semantics. Dead-letter exchanges, TTL per message, selective consumer acknowledgment, channel-level backpressure.
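
The routing bullet is easy to hand-wave, so here is what AMQP topic-exchange matching actually does, reimplemented as a small function (illustrative only — a real broker does this at the exchange, and the examples assume dot-separated routing keys per the AMQP convention): `*` matches exactly one word, `#` matches zero or more.

```python
def topic_matches(binding: str, routing_key: str) -> bool:
    """Match an AMQP topic binding against a routing key:
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(pat, words):
        if not pat:
            return not words
        head, rest = pat[0], pat[1:]
        if head == "#":
            # '#' can absorb zero or more words
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if words and (head == "*" or head == words[0]):
            return match(rest, words[1:])
        return False
    return match(binding.split("."), routing_key.split("."))

assert topic_matches("order.*.created", "order.us.created")
assert not topic_matches("order.*.created", "order.us.eu.created")
assert topic_matches("order.#", "order.us.eu.created")
assert topic_matches("#", "anything.at.all")
```

Getting this behavior from Kafka means topic-per-route or consumer-side filtering; RabbitMQ gives it to you as a binding declaration.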

RabbitMQ does not fit when you have:

  • High volume event streaming with replay. Wrong tool. Messages are gone after consumption by default.
  • Trillions of events daily. Scale ceiling is lower than Kafka. Possible with clustering but operationally painful.
  • Analytics pipelines. Wrong paradigm. You want a log.

RabbitMQ runs on Erlang and is rock-solid operationally. Quorum queues (added in 3.8) and streams (added in 3.9, a Kafka-like log feature) blur the line somewhat. But for a pure work queue, RabbitMQ remains excellent, and it's undervalued by candidates who assume Kafka is always the answer.

SQS: the boring right answer on AWS

SQS is AWS's managed queue. Two flavors:

  • Standard SQS. Unlimited throughput, at-least-once delivery, best-effort ordering, duplicates possible.
  • FIFO SQS. Strict ordering per message group, exactly-once (within a 5-minute deduplication window), capped at 3,000 TPS with batching (higher with high-throughput FIFO).

SQS fits when you have:

  • AWS workloads that need a queue and don't justify running your own. The default answer for "I have a Lambda and I need to buffer jobs."
  • Bursty workloads. Scales to any throughput on standard. No capacity planning.
  • Simple work distribution. Producers enqueue, consumers poll, done. No complex routing needed.
  • Dead-letter queues as a first-class feature. Built in, easy to configure.
  • Zero operational overhead tolerance. You pay AWS instead of managing brokers.

SQS does not fit when you have:

  • Replay needs. Once consumed and deleted, messages are gone. No rewind.
  • Fan-out to multiple independent consumers. Use SNS or EventBridge for fan-out, then SQS for each consumer. The SNS-to-SQS pattern is idiomatic on AWS.
  • Strict global ordering without the FIFO throughput ceiling. FIFO is capped; standard is unordered.
  • Cross-cloud portability. It's AWS-only.
  • Low latency requirements. Polling-based, typical latency is tens to hundreds of milliseconds.
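
SQS's at-least-once semantics hinge on the visibility timeout: receiving a message hides it rather than deleting it, and only an explicit delete (the ack) removes it. A toy simulation — the class and timings are invented for illustration, not the boto3 API — makes the crash-before-ack behavior tangible:

```python
import time

class VisibilityQueue:
    """Sketch of SQS-style semantics: receive hides a message for a visibility
    timeout; if it isn't deleted (acked) in time, it becomes visible again."""
    def __init__(self, visibility_timeout: float):
        self.timeout = visibility_timeout
        self.messages = {}   # id -> (body, invisible_until)
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self):
        now = time.monotonic()
        for mid, (body, until) in list(self.messages.items()):
            if until <= now:
                self.messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid):
        self.messages.pop(mid, None)  # the ack: only now is the message gone

q = VisibilityQueue(visibility_timeout=0.05)
q.send("transcode video 7")
mid, body = q.receive()
# Consumer crashes here without calling delete() ...
time.sleep(0.06)
mid2, body2 = q.receive()   # redelivered after the timeout expires
assert body2 == body
q.delete(mid2)
assert q.receive() is None
```

This is also why idempotent handlers are non-negotiable with SQS: a slow consumer that blows past its visibility timeout produces a duplicate even without a crash.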

Kinesis is AWS's log-based counterpart to Kafka and belongs in the same conversation as SQS because both are AWS-native. Kinesis Data Streams is managed and Kafka-like, but with a different API and shard-based scaling. At the staff level, you should be able to compare SQS + SNS vs Kinesis: SQS + SNS is stateless fan-out, Kinesis is stateful streaming with replay.

What interviewers actually want to hear

A staff-level answer to "which message queue" looks like this: "We have a work queue pattern — video transcoding jobs — with maybe 10K messages an hour, workers that take minutes per task, and retries with a DLQ after three attempts. That's a classic RabbitMQ or SQS use case. I'd default to SQS because we're on AWS, add a DLQ, and keep it simple. If we needed to replay or fan out to analytics, I'd reconsider and likely introduce Kafka — but I wouldn't start there."

That sentence names the workload, the volume, the ordering needs, the retry semantics, and the vendor context. It chooses on purpose. It signals restraint, which is a principal-level trait.

Contrast: "We'll use Kafka for durable messaging." No context. No tradeoff. Senior rating at best.

The tradeoffs you need to name

  • Delivery guarantees. At-most-once (fire and forget, lose messages on failure), at-least-once (default for most systems, requires idempotent consumers), exactly-once (narrow and caveated, doesn't exist across arbitrary side effects). Always assume at-least-once and design idempotent consumers.
  • Ordering. Global (rare, expensive), per-partition (Kafka, Kinesis), per-queue (RabbitMQ), per-message-group (SQS FIFO), none (SQS standard).
  • Durability. Replication factor, persistence flush policy, multi-AZ. Kafka's acks=all with min.insync.replicas=2 is the durable setting most teams ignore.
  • Throughput and latency. Orders of magnitude matter. Kafka: millions/sec, ms latency. RabbitMQ: tens of thousands/sec, sub-ms latency. SQS: unlimited/sec, tens of ms latency.
  • Retention. Kafka: hours to forever. RabbitMQ: until consumed. SQS: 14 days max.
  • Operational cost. DIY Kafka is a platform team. Managed Kafka is money. RabbitMQ is easier to run. SQS is zero-ops.
  • Schema management. Kafka has Schema Registry ecosystem. RabbitMQ and SQS leave this to you.

When you should push back on adding any queue

  • Synchronous request-reply with a deterministic response. Use RPC/HTTP. Forcing it into a queue adds latency and complexity.
  • Simple single-producer-single-consumer with low volume. A database table with a status column and a worker polling it works fine. Postgres with SELECT FOR UPDATE SKIP LOCKED is an excellent poor-man's queue.
  • When you don't need durability. In-memory channels work inside a process.
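
The "database table as a queue" pattern is worth being able to sketch on demand. The version below uses SQLite purely so it is self-contained; SQLite has no `SELECT ... FOR UPDATE SKIP LOCKED`, so this single-connection sketch only shows the shape of the pattern — on Postgres, the claim step would use `SKIP LOCKED` inside a transaction so multiple workers can claim rows without blocking each other. Table and column names are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
           "status TEXT DEFAULT 'pending')")
db.execute("INSERT INTO jobs (payload) VALUES "
           "('resize image 1'), ('resize image 2')")

def claim_one():
    # Claim the oldest pending job. On Postgres this would be
    # SELECT ... FOR UPDATE SKIP LOCKED so concurrent workers don't block.
    row = db.execute("SELECT id, payload FROM jobs WHERE status = 'pending' "
                     "ORDER BY id LIMIT 1").fetchone()
    if row:
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row

job = claim_one()
# ... process the job, then mark it done:
db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job[0],))
```

Retries, DLQ ("status = 'failed' after N attempts"), and delay ("run_after" timestamp column) all fall out of ordinary SQL — which is the argument for not adding a broker at low volume.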

Real-world examples to cite

  • Netflix Kafka. Trillions of events daily. Originally open-source Kafka, now a mix with their own extensions.
  • Uber's Kafka. Powers their real-time pricing, ETAs, and fraud detection. Publicly documented hot-path system.
  • Slack's Kafka usage. It backs their pub/sub backbone; they wrote publicly about the lessons learned scaling it.
  • Amazon SQS. Launched in 2004, probably the world's oldest commercial managed message queue. Boringly reliable.
  • Instagram's RabbitMQ. Used heavily for task queues and notification fan-out, especially with Celery workers.
  • Stripe's internal queues. Mix of SQS, Kafka, and custom solutions for different workload classes. The point: real companies use all three at different layers.

Common candidate mistakes

  • Defaulting to Kafka without analysis. The number-one mistake. Ask the volume, the ordering needs, the replay needs. Then pick.
  • Conflating queue and log. They are different models. "Let's put a queue in front" when you need a streaming platform (or vice versa) is a red flag.
  • Ignoring DLQ. No message queue design is complete without a dead-letter strategy. How many retries, what's the DLQ TTL, who owns triage.
  • Assuming exactly-once. It doesn't exist across external side effects. Always pair at-least-once with idempotent consumers and explicit idempotency keys.
  • Not sizing the queue. "What's the expected depth, what's the SLO on consumer lag, what happens when depth exceeds X?" Have answers.
  • Ignoring poison messages. A message that always fails will block its partition (Kafka) or loop forever (queues). DLQ after N retries.
  • Forgetting the producer side. What happens if the broker is down? Retry with backoff, buffer locally, accept data loss, or propagate failure upstream. Name your policy.
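
Naming the producer-side policy is easier when you can show one. A sketch of one explicit choice — retry with exponential backoff, then spill to a local buffer rather than silently dropping the message (function names and parameters are invented for the example; real Kafka/SQS clients have their own retry and buffering knobs):

```python
import time

def send_with_backoff(send, msg, max_attempts=5, base_delay=0.05, buffer=None):
    """One explicit producer failure policy: retry with exponential backoff,
    then spill to a local buffer instead of silently dropping the message."""
    for attempt in range(max_attempts):
        try:
            send(msg)
            return True
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    if buffer is not None:
        buffer.append(msg)   # drain to the broker once it recovers
        return False
    raise RuntimeError("broker unavailable and no local buffer configured")

# Simulate a broker that is down for the first two attempts.
calls = {"n": 0}
def flaky_send(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker down")

assert send_with_backoff(flaky_send, "event-1") is True
```

The interview point is not the code; it is that you chose buffer-and-retry over fail-fast and can say why (the local buffer trades possible loss on producer crash for availability during broker outages).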

Advanced follow-ups interviewers will ask

  • "How do you achieve exactly-once webhook delivery?" At-least-once delivery plus idempotency keys on the consumer. The receiver must dedupe.
  • "How do you handle a sudden 10x traffic spike?" Producer throttling, autoscale consumers based on lag (KEDA, SQS consumer autoscaling), add partitions in Kafka (and note the gotcha: existing partitioned keys may shuffle).
  • "What if a consumer processes a message and crashes before acking?" Message becomes visible again after visibility timeout, another consumer picks it up. Idempotent handler required.
  • "How do you handle ordered processing at scale?" Partition by entity key (user ID, order ID). Scale horizontally while preserving per-entity order.
  • "How do you roll out a breaking schema change?" Versioned topics, dual-publish for a transition period, consumers handle both versions, eventually deprecate the old.
  • "Kafka vs Pulsar vs Redpanda?" Kafka: incumbent, huge ecosystem, complex ops. Pulsar: multi-tenancy and tiered storage native, smaller ecosystem. Redpanda: API-compatible with Kafka, single binary, no JVM, strong performance claims.

The staff-level signal on message queues is the ability to diagnose the workload first, then pick the tool. Candidates who name Kafka for everything are signaling pattern-matching without understanding. Candidates who can say "SQS for this, RabbitMQ for that, Kafka for the event streaming backbone" and defend each pick with workload characteristics are signaling judgment. Judgment is the promotion metric at FAANG-tier companies.

Pick on purpose. Name the guarantees. Name the failure mode. Name the DLQ.