software-design|March 21, 2026|13 min read

Deep Dive on Apache Kafka: A System Design Interview Perspective

TL;DR

Kafka is a distributed append-only commit log partitioned across brokers. Producers write to partition leaders, consumers pull in order via offsets, and consumer groups enable parallel consumption with automatic rebalancing. Replication (ISR) provides durability. For interviews, know when to use Kafka (event streaming, decoupling services, CDC, log aggregation) vs when not to (request-reply, small-scale queuing, low-latency sub-ms). Key design levers: partition count (parallelism), replication factor (durability), retention policy (storage), and consumer group design (scaling consumers). Exactly-once requires idempotent producers + transactional writes + read-committed consumers.

Deep Dive on Apache Kafka: A System Design Interview Perspective

“Kafka is not a message queue. It’s a distributed commit log that happens to be useful as a message queue, event store, and stream processor.”

Apache Kafka appears in nearly every system design interview involving event-driven architecture, real-time data pipelines, or service decoupling. It’s the backbone of architectures at LinkedIn, Uber, Netflix, and Airbnb — processing trillions of messages per day.

This article covers what you need for interviews: how Kafka works internally, the tradeoffs it makes, and the design patterns built on top of it.

What is Kafka?

Kafka is a distributed, partitioned, replicated append-only commit log. That single sentence encodes everything important about its architecture:

  • Distributed — runs across a cluster of brokers
  • Partitioned — topics are split into ordered partitions for parallelism
  • Replicated — each partition is copied to multiple brokers for fault tolerance
  • Append-only — messages are immutable once written; consumers track their position via offsets
  • Commit log — not a queue; messages aren’t deleted after consumption

Core Architecture

Consumer Group: order-processor

Kafka Cluster (3 Brokers)

Producers

Broker 3

Broker 2

Broker 1 (Controller)

replicate

replicate

replicate

Producer 1

Producer 2

orders P0 ★ Leader

orders P2 Follower

orders P1 ★ Leader

orders P0 Follower

orders P2 ★ Leader

orders P1 Follower

Consumer 1

← P0

Consumer 2

← P1

Consumer 3

← P2

Key Components

Component Role
Broker A single Kafka server. Stores partitions, serves reads/writes.
Topic A named stream of records (like a database table).
Partition An ordered, immutable sequence of records within a topic. The unit of parallelism.
Offset A sequential ID for each record within a partition. Consumers track their position via offsets.
Producer Publishes records to topics.
Consumer Reads records from topics by pulling.
Consumer Group A set of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group.
Controller One broker elected to manage partition leader election and cluster metadata (replaced by KRaft in newer versions).

Topics and Partitions

A topic is split into partitions — each partition is an ordered, append-only log stored on disk.

Topic: orders (3 partitions)

Partition 0: [msg0] [msg1] [msg2] [msg3] [msg4] [msg5] → newest
Partition 1: [msg0] [msg1] [msg2] [msg3] → newest
Partition 2: [msg0] [msg1] [msg2] [msg3] [msg4] → newest

                                               ↑ offset

Key properties:

  • Ordering is guaranteed only within a partition, not across partitions
  • Messages are immutable — you can’t update or delete a record in place
  • Messages are retained by time or size, not consumed-and-deleted like traditional queues
  • Each consumer tracks its own offset — it can re-read old messages by resetting its offset

Partition Count: The Most Important Design Decision

Topic parallelism = min(partition_count, consumer_count_in_group)
  • Too few partitions → consumers sit idle, can’t scale throughput
  • Too many partitions → more leader elections, more memory, higher end-to-end latency, more open file handles
  • You can increase partitions later but NEVER decrease them — and increasing breaks key-based ordering guarantees for existing data

Rule of thumb: Start with max(expected_throughput_MB/s / 10, expected_consumer_count) and round up. For most systems, 6–30 partitions is a reasonable starting point.

How Producers Choose a Partition

// Default: hash the key, mod by partition count
int partition = Math.abs(key.hashCode()) % numPartitions;

// No key? Round-robin (or sticky partitioning in newer clients)
// Custom partitioner example:
public class GeoPartitioner implements Partitioner {
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        String region = extractRegion((String) key);
        return switch (region) {
            case "US" -> 0;
            case "EU" -> 1;
            case "APAC" -> 2;
            default -> 3;
        };
    }
}

Interview insight: If you need ordering for a specific entity (e.g., all events for user:1001), use the entity ID as the message key. All messages with the same key go to the same partition → guaranteed order.

Producer Internals

Write Path

send record

batch full

or linger.ms

replicate

ack

response

Application

Serializer

key + value

Partitioner

hash key % N

Record Accumulator

batch per partition

Network Thread

send to leader

Broker

(Partition Leader)

Followers ISR

Durability: The acks Setting

This is the most important producer config and a frequent interview topic:

acks Behavior Durability Latency
0 Fire and forget. Don’t wait for any acknowledgment. Lowest — messages can be lost Fastest
1 Wait for the partition leader to write to its local log. Medium — lost if leader dies before replication Medium
all (-1) Wait for all in-sync replicas (ISR) to acknowledge. Highest — survives any single broker failure Slowest
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092',
    'acks': 'all',                    # strongest durability
    'enable.idempotence': True,       # prevent duplicates on retry
    'max.in.flight.requests.per.connection': 5,
    'retries': 2147483647,            # retry forever
    'linger.ms': 5,                   # batch for 5ms to improve throughput
    'batch.size': 65536,              # 64KB batch
    'compression.type': 'lz4',        # compress batches
})

def delivery_callback(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}')

producer.produce(
    topic='orders',
    key='user:1001',
    value='{"orderId":"abc","amount":99.99}',
    callback=delivery_callback
)
producer.flush()  # wait for all messages to be delivered

Batching and Compression

Producers batch messages per partition before sending:

  • batch.size — max bytes per batch (default 16KB, increase to 64-256KB for throughput)
  • linger.ms — wait this long to fill the batch (default 0, set to 5-20ms)
  • compression.type — compress each batch (lz4 for speed, zstd for ratio)

Batching + compression can 10x your throughput at the cost of a few milliseconds of latency.

Consumer Internals

Consumer Groups: The Scaling Model

Every consumer belongs to a consumer group. Kafka ensures:

  • Each partition is consumed by exactly one consumer in a group
  • If consumers > partitions → some consumers sit idle
  • If consumers < partitions → some consumers handle multiple partitions
  • If a consumer dies → its partitions are rebalanced to other consumers in the group
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka1:9092',
    'group.id': 'order-processor',
    'auto.offset.reset': 'earliest',      # start from beginning if no committed offset
    'enable.auto.commit': False,           # manual commit for at-least-once
    'max.poll.interval.ms': 300000,        # 5 min max between polls
    'session.timeout.ms': 45000,
})

consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f'Error: {msg.error()}')
            continue

        # Process the message
        process_order(msg.key(), msg.value())

        # Commit offset AFTER successful processing (at-least-once)
        consumer.commit(msg)
finally:
    consumer.close()

Offset Management

Offsets are how consumers track what they’ve consumed. Three strategies:

Strategy How Guarantee
Auto-commit enable.auto.commit=true, commits every auto.commit.interval.ms At-most-once (can skip messages on crash)
Manual commit after processing consumer.commit() after each message/batch At-least-once (may reprocess on crash)
Exactly-once Transactional producer + read_committed consumer Exactly-once (complex setup)
# At-least-once: commit after processing a batch
messages = consumer.consume(num_messages=100, timeout=5.0)
for msg in messages:
    process(msg)
consumer.commit()  # commit the batch

# Seeking to a specific offset (replay)
from confluent_kafka import TopicPartition
consumer.assign([TopicPartition('orders', partition=0, offset=0)])  # replay from start

Rebalancing

When a consumer joins, leaves, or crashes, Kafka rebalances — reassigning partitions across the remaining consumers. This causes a brief pause in consumption.

Rebalancing strategies:

  • Eager (default before 2.4) — revoke all partitions, reassign everything. Causes a full stop-the-world.
  • Cooperative/Incremental (2.4+) — only reassign affected partitions. Much less disruption.
consumer = Consumer({
    # ...
    'partition.assignment.strategy': 'cooperative-sticky',  # incremental rebalance
})

Interview tip: If asked “how do you avoid rebalancing storms?”, the answer is: use cooperative-sticky assignment, tune session.timeout.ms and max.poll.interval.ms appropriately, and ensure consumers process fast enough to poll within the interval.

Replication and Fault Tolerance

Every partition has one leader and N-1 followers. All reads and writes go through the leader. Followers pull from the leader to stay in sync.

In-Sync Replicas (ISR)

The ISR is the set of replicas that are “caught up” with the leader. A follower drops out of the ISR if it falls behind by more than replica.lag.time.max.ms (default 30s).

Partition 0: Leader=Broker1, ISR=[Broker1, Broker2, Broker3]

Broker3 falls behind →
Partition 0: Leader=Broker1, ISR=[Broker1, Broker2]

Broker1 dies →
Partition 0: Leader=Broker2, ISR=[Broker2]  (Broker2 elected new leader)

Broker1 recovers, catches up →
Partition 0: Leader=Broker2, ISR=[Broker2, Broker1]

Durability Guarantees

The combination of acks, min.insync.replicas, and replication.factor determines your durability:

# Topic-level config for strong durability
replication.factor=3
min.insync.replicas=2

# Producer config
acks=all

This means: a write succeeds only when at least 2 of 3 replicas acknowledge. The cluster can tolerate 1 broker failure without data loss and without blocking writes.

Config Tolerates Behavior
RF=3, min.isr=2, acks=all 1 broker failure Recommended for production
RF=3, min.isr=1, acks=all 2 broker failures (reads), 0 for writes if 2 die Higher availability, lower durability
RF=3, min.isr=2, acks=1 Leader failure only if data replicated in time Fast but risky

Retention and Compaction

Kafka doesn’t delete messages after consumption. Instead, it uses retention policies:

Time/Size-Based Retention

# Keep messages for 7 days (default)
retention.ms=604800000

# Or keep up to 100GB per partition
retention.bytes=107374182400

# Segment size — Kafka breaks each partition into segment files
segment.bytes=1073741824  # 1GB segments

Log Compaction

For compacted topics, Kafka keeps only the latest value for each key. Older values are garbage collected.

cleanup.policy=compact

Key: user:1001 → {"name":"Alice","v":1}  ← deleted during compaction
Key: user:1002 → {"name":"Bob","v":1}    ← kept (latest for this key)
Key: user:1001 → {"name":"Alice","v":2}  ← kept (latest for this key)
Key: user:1002 → null                    ← tombstone: delete this key

Use cases for compacted topics:

  • Changelog/CDC — latest state of each database row
  • Configuration — latest config for each service
  • Cache rebuilding — replay the topic to rebuild a cache from scratch

Exactly-Once Semantics (EOS)

Kafka supports exactly-once through two mechanisms:

1. Idempotent Producer

producer = Producer({
    'enable.idempotence': True,  # assigns a producer ID + sequence numbers
    # Kafka deduplicates retried messages using (producerId, sequenceNumber)
})

This prevents duplicates from producer retries within a single session.

2. Transactions (Cross-Partition EOS)

producer.initTransactions();

try {
    producer.beginTransaction();

    // Produce to multiple partitions atomically
    producer.send(new ProducerRecord<>("orders", "key1", "value1"));
    producer.send(new ProducerRecord<>("payments", "key2", "value2"));

    // Commit consumer offsets as part of the transaction
    producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);

    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Consumers must be configured to only read committed messages:

consumer = Consumer({
    'isolation.level': 'read_committed',  # skip uncommitted/aborted transactions
})

Interview insight: Exactly-once is about the consume-transform-produce loop being atomic. It doesn’t magically make external side effects (HTTP calls, DB writes) idempotent — you still need application-level deduplication for those.

Common System Design Patterns

1. Event-Driven Microservices

OrderCreated

PaymentCompleted

Order Service

Kafka

orders topic

Payment Service

Inventory Service

Notification Service

Analytics Service

Kafka

payments topic

Why Kafka over direct HTTP calls:

  • Decoupling — Order Service doesn’t know (or care) about downstream consumers
  • Resilience — if Payment Service is down, events queue up; no data loss
  • Replay — new services can replay history from the topic
  • Fan-out — multiple consumer groups independently process the same events

2. Change Data Capture (CDC)

# Debezium connector config (Kafka Connect)
{
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.dbname": "orders_db",
    "table.include.list": "public.orders,public.customers",
    "topic.prefix": "cdc",
    # Produces to: cdc.public.orders, cdc.public.customers
    "plugin.name": "pgoutput"
}

CDC captures every INSERT/UPDATE/DELETE from the database WAL and publishes it to Kafka. Downstream consumers can:

  • Keep Elasticsearch in sync (search indexing)
  • Update a cache (Redis)
  • Replicate to a data warehouse
  • Feed real-time dashboards

3. Log Aggregation

# Application logs → Kafka → multiple sinks
# Typically: App → Fluentd/Filebeat → Kafka → Consumers

# Simple Python log producer
import logging
from confluent_kafka import Producer

class KafkaLogHandler(logging.Handler):
    def __init__(self, producer, topic):
        super().__init__()
        self.producer = producer
        self.topic = topic

    def emit(self, record):
        self.producer.produce(
            self.topic,
            key=record.name,
            value=self.format(record)
        )

logger = logging.getLogger('myapp')
logger.addHandler(KafkaLogHandler(producer, 'app-logs'))

4. Event Sourcing

Store the full history of state changes as events:

# Instead of updating a row, append events
events = [
    {"type": "OrderCreated", "orderId": "abc", "items": [...], "ts": 1679000001},
    {"type": "PaymentReceived", "orderId": "abc", "amount": 99.99, "ts": 1679000050},
    {"type": "OrderShipped", "orderId": "abc", "trackingId": "...", "ts": 1679000100},
]

# Use compacted topic keyed by orderId
# Replay events to reconstruct current state
for event in events:
    producer.produce('order-events', key=event['orderId'], value=json.dumps(event))

5. CQRS (Command Query Responsibility Segregation)

Writes → Command Service → Kafka → Event Store
                                  → Read Model Builder → Optimized Read DB
Reads  → Query Service → Read DB (Elasticsearch, Redis, Materialized View)

Kafka Connect and Kafka Streams

Kafka Connect

Pre-built connectors for moving data in/out of Kafka without writing code:

// Source connector: PostgreSQL → Kafka
{
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.dbname": "mydb",
        "topic.prefix": "mydb"
    }
}

// Sink connector: Kafka → Elasticsearch
{
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "mydb.public.products",
        "connection.url": "http://elasticsearch:9200",
        "type.name": "_doc",
        "key.ignore": false
    }
}

Kafka Streams

A lightweight stream processing library (no separate cluster needed):

StreamsBuilder builder = new StreamsBuilder();

// Read from orders topic
KStream<String, Order> orders = builder.stream("orders");

// Filter, transform, aggregate
KTable<String, Long> orderCounts = orders
    .filter((key, order) -> order.getAmount() > 100)
    .groupBy((key, order) -> order.getRegion())
    .count();

// Write results to a new topic
orderCounts.toStream().to("order-counts-by-region");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

Schema Management

Use Schema Registry (Confluent) with Avro, Protobuf, or JSON Schema to enforce contracts:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "userId", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "createdAt", "type": "long", "logicalType": "timestamp-millis"}
    ]
}
"""

schema_registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer.produce(
    'orders',
    key='order:abc',
    value=avro_serializer(order_dict, SerializationContext('orders', MessageField.VALUE))
)

Schema evolution rules (backward/forward/full compatibility):

  • Backward (default) — new schema can read old data. Safe to add fields with defaults.
  • Forward — old schema can read new data. Safe to remove fields with defaults.
  • Full — both. Only add/remove optional fields with defaults.

Performance Tuning

Throughput Optimization

Parameter Default Tuned Impact
batch.size (producer) 16KB 64-256KB Fewer requests, higher throughput
linger.ms (producer) 0 5-20ms Allow batching
compression.type none lz4 / zstd Smaller payloads, less network I/O
fetch.min.bytes (consumer) 1 65536 Fewer fetch requests
fetch.max.wait.ms (consumer) 500 500-1000 Batch fetches
num.io.threads (broker) 8 cores × 2 More parallel I/O
num.network.threads (broker) 3 cores More parallel network handling

Latency Optimization

For low-latency use cases (< 10ms end-to-end):

producer = Producer({
    'linger.ms': 0,           # send immediately
    'batch.size': 0,          # don't batch
    'acks': 1,                # don't wait for replicas
    'compression.type': 'none',
})

The tradeoff: throughput vs latency. You can’t maximize both.

Kafka vs Alternatives

Feature Kafka RabbitMQ Amazon SQS Pulsar Redpanda
Model Distributed log Message broker Managed queue Distributed log Distributed log
Ordering Per-partition Per-queue FIFO queues only Per-partition Per-partition
Retention Configurable (days/size) Until consumed 14 days max Tiered storage Configurable
Replay Yes (seek by offset) No (consumed = deleted) No Yes Yes
Consumer model Pull Push Pull Pull Pull (Kafka API)
Throughput Millions msg/s ~50K msg/s ~3K msg/s per queue Millions msg/s Millions msg/s
Exactly-once Yes (transactions) No (at-least-once) No (at-least-once) Yes Yes
Operational complexity High (ZK/KRaft, brokers, SR) Medium None (managed) High Low (no JVM/ZK)

When to use Kafka: High-throughput event streaming, CDC, event sourcing, log aggregation, connecting many producers and consumers.

When NOT to use Kafka: Simple task queues (use RabbitMQ/SQS), request-reply patterns, very low message volumes (< 1K/s), sub-millisecond latency requirements.

Interview Cheat Sheet

Key Numbers

Metric Typical Value
Throughput (single partition) ~10 MB/s write, ~30 MB/s read
Throughput (cluster) GB/s+ with enough partitions
Latency (p99, acks=1) 2-5 ms
Latency (p99, acks=all) 5-15 ms
Partition count sweet spot 6-30 per topic
Replication factor 3 (production standard)
Retention default 7 days
Max message size default 1 MB
Consumer group rebalance 1-30 seconds

Interview Answer Template

When designing an event-driven system:

  1. Why Kafka? — need decoupling, event replay, high throughput, multiple consumers
  2. Topic design — one topic per event type (e.g., orders, payments, shipments)
  3. Partitioning strategy — partition key = entity ID for ordering (e.g., userId, orderId)
  4. Partition count — based on expected throughput and consumer parallelism
  5. ReplicationRF=3, min.insync.replicas=2, acks=all for production durability
  6. Consumer groups — one group per downstream service; each scales independently
  7. Delivery guarantee — at-least-once by default; exactly-once if consume-transform-produce
  8. Schema management — Avro + Schema Registry with backward compatibility
  9. Failure handling — dead letter topic for poison messages, idempotent consumers
  10. Monitoring — consumer lag, under-replicated partitions, ISR shrink rate

Common Interview Questions

Q: How do you handle message ordering? A: Use the entity ID as the partition key. All events for that entity go to the same partition, which guarantees ordering. Cross-entity ordering requires a single partition (limits throughput).

Q: What happens when a consumer crashes? A: The consumer group coordinator detects the missing heartbeat after session.timeout.ms, triggers a rebalance, and assigns that consumer’s partitions to other consumers in the group. Any uncommitted messages are reprocessed (at-least-once).

Q: How do you handle a poison message (bad format/unprocessable)? A: Catch the error, publish the message to a dead letter topic (DLT) with error metadata, commit the offset, and continue. A separate process inspects and resolves DLT messages.

try:
    process(msg)
    consumer.commit(msg)
except Exception as e:
    producer.produce(
        'orders.dead-letter',
        key=msg.key(),
        value=msg.value(),
        headers=[('error', str(e)), ('original-topic', 'orders')]
    )
    consumer.commit(msg)  # skip the poison message

Q: How do you scale consumers? A: Add more consumers to the consumer group (up to the partition count). If you need more parallelism, increase the partition count — but be aware this may break key-based ordering for in-flight data.

Wrapping Up

Kafka’s power comes from a simple insight: an append-only log is the most general-purpose data structure for distributed systems. It can be a queue, an event store, a database changelog, or a stream processor — depending on how you consume it.

The mental model for interviews:

  • Producers write to partition leaders; partitioning determines parallelism and ordering
  • Consumers pull at their own pace, tracking position via offsets
  • Consumer groups enable scaling and fault tolerance through partition assignment
  • Replication (ISR) provides durability; acks=all + min.insync.replicas=2 is the production standard
  • Retention keeps data available for replay; compacted topics keep latest-per-key
  • The tradeoff triangle: throughput vs latency vs durability — pick two, tune the third

If you understand partitions, consumer groups, and the acks + ISR durability model, you can design any Kafka-based system in an interview.

Related Posts

Deep Dive on Redis: Architecture, Data Structures, and Production Usage

Deep Dive on Redis: Architecture, Data Structures, and Production Usage

“Redis is not just a cache. It’s a data structure server that happens to be…

System Design Patterns for Real-Time Updates at High Traffic

System Design Patterns for Real-Time Updates at High Traffic

The previous articles in this series covered scaling reads and scaling writes…

System Design Patterns for Scaling Writes

System Design Patterns for Scaling Writes

In the companion article on scaling reads, we covered caching, replicas, and…

Deep Dive on Elasticsearch: A System Design Interview Perspective

Deep Dive on Elasticsearch: A System Design Interview Perspective

“If you’re searching, filtering, or aggregating over large volumes of semi…

Deep Dive on Caching: From Browser to Database

Deep Dive on Caching: From Browser to Database

“There are only two hard things in Computer Science: cache invalidation and…

System Design Patterns for Scaling Reads

System Design Patterns for Scaling Reads

Most production systems are read-heavy. A typical web application sees 90-9…

Latest Posts

Deep Dive on Elasticsearch: A System Design Interview Perspective

Deep Dive on Elasticsearch: A System Design Interview Perspective

“If you’re searching, filtering, or aggregating over large volumes of semi…

Deep Dive on Redis: Architecture, Data Structures, and Production Usage

Deep Dive on Redis: Architecture, Data Structures, and Production Usage

“Redis is not just a cache. It’s a data structure server that happens to be…

Deep Dive on API Gateway: A System Design Interview Perspective

Deep Dive on API Gateway: A System Design Interview Perspective

“An API Gateway is the front door to your microservices. Every request walks…

REST API Design: Pagination, Versioning, and Best Practices

REST API Design: Pagination, Versioning, and Best Practices

Every time two systems need to talk, someone has to design the contract between…

Efficient Data Modelling: A Practical Guide for Production Systems

Efficient Data Modelling: A Practical Guide for Production Systems

Most engineers learn data modelling backwards. They draw an ER diagram…

Deep Dive on Caching: From Browser to Database

Deep Dive on Caching: From Browser to Database

“There are only two hard things in Computer Science: cache invalidation and…