Deep Dive on Apache Kafka: A System Design Interview Perspective

“Kafka is not a message queue. It’s a distributed commit log that happens to be useful as a message queue, event store, and stream processor.”

Apache Kafka appears in nearly every system design interview involving event-driven architecture, real-time data pipelines, or service decoupling. It’s the backbone of architectures at LinkedIn, Uber, Netflix, and Airbnb — processing trillions of messages per day.

This article covers what you need for interviews: how Kafka works internally, the tradeoffs it makes, and the design patterns built on top of it.

What is Kafka?

Kafka is a distributed, partitioned, replicated append-only commit log. That single sentence encodes everything important about its architecture:

Distributed — runs across a cluster of brokers
Partitioned — topics are split into ordered partitions for parallelism
Replicated — each partition is copied to multiple brokers for fault tolerance
Append-only — messages are immutable once written; consumers track their position via offsets
Commit log — not a queue; messages aren’t deleted after consumption

Core Architecture

Key Components

Component	Role
Broker	A single Kafka server. Stores partitions, serves reads/writes.
Topic	A named stream of records (like a database table).
Partition	An ordered, immutable sequence of records within a topic. The unit of parallelism.
Offset	A sequential ID for each record within a partition. Consumers track their position via offsets.
Producer	Publishes records to topics.
Consumer	Reads records from topics by pulling.
Consumer Group	A set of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group.
Controller	One broker elected to manage partition leader election and cluster metadata (replaced by KRaft in newer versions).

Topics and Partitions

A topic is split into partitions — each partition is an ordered, append-only log stored on disk.

Topic: orders (3 partitions)

Partition 0: [msg0] [msg1] [msg2] [msg3] [msg4] [msg5] → newest
Partition 1: [msg0] [msg1] [msg2] [msg3] → newest
Partition 2: [msg0] [msg1] [msg2] [msg3] [msg4] → newest

                                               ↑ offset

Key properties:

Ordering is guaranteed only within a partition, not across partitions
Messages are immutable — you can’t update or delete a record in place
Messages are retained by time or size, not consumed-and-deleted like traditional queues
Each consumer tracks its own offset — it can re-read old messages by resetting its offset

Partition Count: The Most Important Design Decision

Topic parallelism = min(partition_count, consumer_count_in_group)

Too few partitions → consumers sit idle, can’t scale throughput
Too many partitions → more leader elections, more memory, higher end-to-end latency, more open file handles
You can increase partitions later but NEVER decrease them — and increasing breaks key-based ordering guarantees for existing data

Rule of thumb: Start with max(expected_throughput_MB/s / 10, expected_consumer_count) and round up. For most systems, 6–30 partitions is a reasonable starting point.

How Producers Choose a Partition

// Default: hash the key, mod by partition count
int partition = Math.abs(key.hashCode()) % numPartitions;

// No key? Round-robin (or sticky partitioning in newer clients)
// Custom partitioner example:
public class GeoPartitioner implements Partitioner {
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        String region = extractRegion((String) key);
        return switch (region) {
            case "US" -> 0;
            case "EU" -> 1;
            case "APAC" -> 2;
            default -> 3;
        };
    }
}

Interview insight: If you need ordering for a specific entity (e.g., all events for user:1001), use the entity ID as the message key. All messages with the same key go to the same partition → guaranteed order.

Producer Internals

Write Path

Durability: The `acks` Setting

This is the most important producer config and a frequent interview topic:

`acks`	Behavior	Durability	Latency
`0`	Fire and forget. Don’t wait for any acknowledgment.	Lowest — messages can be lost	Fastest
`1`	Wait for the partition leader to write to its local log.	Medium — lost if leader dies before replication	Medium
`all` (`-1`)	Wait for all in-sync replicas (ISR) to acknowledge.	Highest — survives any single broker failure	Slowest

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9092,kafka3:9092',
    'acks': 'all',                    # strongest durability
    'enable.idempotence': True,       # prevent duplicates on retry
    'max.in.flight.requests.per.connection': 5,
    'retries': 2147483647,            # retry forever
    'linger.ms': 5,                   # batch for 5ms to improve throughput
    'batch.size': 65536,              # 64KB batch
    'compression.type': 'lz4',        # compress batches
})

def delivery_callback(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}')

producer.produce(
    topic='orders',
    key='user:1001',
    value='{"orderId":"abc","amount":99.99}',
    callback=delivery_callback
)
producer.flush()  # wait for all messages to be delivered

Batching and Compression

Producers batch messages per partition before sending:

batch.size — max bytes per batch (default 16KB, increase to 64-256KB for throughput)
linger.ms — wait this long to fill the batch (default 0, set to 5-20ms)
compression.type — compress each batch (lz4 for speed, zstd for ratio)

Batching + compression can 10x your throughput at the cost of a few milliseconds of latency.

Consumer Internals

Consumer Groups: The Scaling Model

Every consumer belongs to a consumer group. Kafka ensures:

Each partition is consumed by exactly one consumer in a group
If consumers > partitions → some consumers sit idle
If consumers < partitions → some consumers handle multiple partitions
If a consumer dies → its partitions are rebalanced to other consumers in the group

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka1:9092',
    'group.id': 'order-processor',
    'auto.offset.reset': 'earliest',      # start from beginning if no committed offset
    'enable.auto.commit': False,           # manual commit for at-least-once
    'max.poll.interval.ms': 300000,        # 5 min max between polls
    'session.timeout.ms': 45000,
})

consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f'Error: {msg.error()}')
            continue

        # Process the message
        process_order(msg.key(), msg.value())

        # Commit offset AFTER successful processing (at-least-once)
        consumer.commit(msg)
finally:
    consumer.close()

Offset Management

Offsets are how consumers track what they’ve consumed. Three strategies:

Strategy	How	Guarantee
Auto-commit	`enable.auto.commit=true`, commits every `auto.commit.interval.ms`	At-most-once (can skip messages on crash)
Manual commit after processing	`consumer.commit()` after each message/batch	At-least-once (may reprocess on crash)
Exactly-once	Transactional producer + `read_committed` consumer	Exactly-once (complex setup)

# At-least-once: commit after processing a batch
messages = consumer.consume(num_messages=100, timeout=5.0)
for msg in messages:
    process(msg)
consumer.commit()  # commit the batch

# Seeking to a specific offset (replay)
from confluent_kafka import TopicPartition
consumer.assign([TopicPartition('orders', partition=0, offset=0)])  # replay from start

Rebalancing

When a consumer joins, leaves, or crashes, Kafka rebalances — reassigning partitions across the remaining consumers. This causes a brief pause in consumption.

Rebalancing strategies:

Eager (default before 2.4) — revoke all partitions, reassign everything. Causes a full stop-the-world.
Cooperative/Incremental (2.4+) — only reassign affected partitions. Much less disruption.

consumer = Consumer({
    # ...
    'partition.assignment.strategy': 'cooperative-sticky',  # incremental rebalance
})

Interview tip: If asked “how do you avoid rebalancing storms?”, the answer is: use cooperative-sticky assignment, tune session.timeout.ms and max.poll.interval.ms appropriately, and ensure consumers process fast enough to poll within the interval.

Replication and Fault Tolerance

Every partition has one leader and N-1 followers. All reads and writes go through the leader. Followers pull from the leader to stay in sync.

In-Sync Replicas (ISR)

The ISR is the set of replicas that are “caught up” with the leader. A follower drops out of the ISR if it falls behind by more than replica.lag.time.max.ms (default 30s).

Partition 0: Leader=Broker1, ISR=[Broker1, Broker2, Broker3]

Broker3 falls behind →
Partition 0: Leader=Broker1, ISR=[Broker1, Broker2]

Broker1 dies →
Partition 0: Leader=Broker2, ISR=[Broker2]  (Broker2 elected new leader)

Broker1 recovers, catches up →
Partition 0: Leader=Broker2, ISR=[Broker2, Broker1]

Durability Guarantees

The combination of acks, min.insync.replicas, and replication.factor determines your durability:

# Topic-level config for strong durability
replication.factor=3
min.insync.replicas=2

# Producer config
acks=all

This means: a write succeeds only when at least 2 of 3 replicas acknowledge. The cluster can tolerate 1 broker failure without data loss and without blocking writes.

Config	Tolerates	Behavior
`RF=3, min.isr=2, acks=all`	1 broker failure	Recommended for production
`RF=3, min.isr=1, acks=all`	2 broker failures (reads), 0 for writes if 2 die	Higher availability, lower durability
`RF=3, min.isr=2, acks=1`	Leader failure only if data replicated in time	Fast but risky

Retention and Compaction

Kafka doesn’t delete messages after consumption. Instead, it uses retention policies:

Time/Size-Based Retention

# Keep messages for 7 days (default)
retention.ms=604800000

# Or keep up to 100GB per partition
retention.bytes=107374182400

# Segment size — Kafka breaks each partition into segment files
segment.bytes=1073741824  # 1GB segments

Log Compaction

For compacted topics, Kafka keeps only the latest value for each key. Older values are garbage collected.

cleanup.policy=compact

Key: user:1001 → {"name":"Alice","v":1}  ← deleted during compaction
Key: user:1002 → {"name":"Bob","v":1}    ← kept (latest for this key)
Key: user:1001 → {"name":"Alice","v":2}  ← kept (latest for this key)
Key: user:1002 → null                    ← tombstone: delete this key

Use cases for compacted topics:

Changelog/CDC — latest state of each database row
Configuration — latest config for each service
Cache rebuilding — replay the topic to rebuild a cache from scratch

Exactly-Once Semantics (EOS)

Kafka supports exactly-once through two mechanisms:

1. Idempotent Producer

producer = Producer({
    'enable.idempotence': True,  # assigns a producer ID + sequence numbers
    # Kafka deduplicates retried messages using (producerId, sequenceNumber)
})

This prevents duplicates from producer retries within a single session.

2. Transactions (Cross-Partition EOS)

producer.initTransactions();

try {
    producer.beginTransaction();

    // Produce to multiple partitions atomically
    producer.send(new ProducerRecord<>("orders", "key1", "value1"));
    producer.send(new ProducerRecord<>("payments", "key2", "value2"));

    // Commit consumer offsets as part of the transaction
    producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);

    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Consumers must be configured to only read committed messages:

consumer = Consumer({
    'isolation.level': 'read_committed',  # skip uncommitted/aborted transactions
})

Interview insight: Exactly-once is about the consume-transform-produce loop being atomic. It doesn’t magically make external side effects (HTTP calls, DB writes) idempotent — you still need application-level deduplication for those.

Common System Design Patterns

1. Event-Driven Microservices

Why Kafka over direct HTTP calls:

Decoupling — Order Service doesn’t know (or care) about downstream consumers
Resilience — if Payment Service is down, events queue up; no data loss
Replay — new services can replay history from the topic
Fan-out — multiple consumer groups independently process the same events

2. Change Data Capture (CDC)

# Debezium connector config (Kafka Connect)
{
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.dbname": "orders_db",
    "table.include.list": "public.orders,public.customers",
    "topic.prefix": "cdc",
    # Produces to: cdc.public.orders, cdc.public.customers
    "plugin.name": "pgoutput"
}

CDC captures every INSERT/UPDATE/DELETE from the database WAL and publishes it to Kafka. Downstream consumers can:

Keep Elasticsearch in sync (search indexing)
Update a cache (Redis)
Replicate to a data warehouse
Feed real-time dashboards

3. Log Aggregation

# Application logs → Kafka → multiple sinks
# Typically: App → Fluentd/Filebeat → Kafka → Consumers

# Simple Python log producer
import logging
from confluent_kafka import Producer

class KafkaLogHandler(logging.Handler):
    def __init__(self, producer, topic):
        super().__init__()
        self.producer = producer
        self.topic = topic

    def emit(self, record):
        self.producer.produce(
            self.topic,
            key=record.name,
            value=self.format(record)
        )

logger = logging.getLogger('myapp')
logger.addHandler(KafkaLogHandler(producer, 'app-logs'))

4. Event Sourcing

Store the full history of state changes as events:

# Instead of updating a row, append events
events = [
    {"type": "OrderCreated", "orderId": "abc", "items": [...], "ts": 1679000001},
    {"type": "PaymentReceived", "orderId": "abc", "amount": 99.99, "ts": 1679000050},
    {"type": "OrderShipped", "orderId": "abc", "trackingId": "...", "ts": 1679000100},
]

# Use compacted topic keyed by orderId
# Replay events to reconstruct current state
for event in events:
    producer.produce('order-events', key=event['orderId'], value=json.dumps(event))

5. CQRS (Command Query Responsibility Segregation)

Writes → Command Service → Kafka → Event Store
                                  → Read Model Builder → Optimized Read DB
Reads  → Query Service → Read DB (Elasticsearch, Redis, Materialized View)

Kafka Connect and Kafka Streams

Kafka Connect

Pre-built connectors for moving data in/out of Kafka without writing code:

// Source connector: PostgreSQL → Kafka
{
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.dbname": "mydb",
        "topic.prefix": "mydb"
    }
}

// Sink connector: Kafka → Elasticsearch
{
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "mydb.public.products",
        "connection.url": "http://elasticsearch:9200",
        "type.name": "_doc",
        "key.ignore": false
    }
}

Kafka Streams

A lightweight stream processing library (no separate cluster needed):

StreamsBuilder builder = new StreamsBuilder();

// Read from orders topic
KStream<String, Order> orders = builder.stream("orders");

// Filter, transform, aggregate
KTable<String, Long> orderCounts = orders
    .filter((key, order) -> order.getAmount() > 100)
    .groupBy((key, order) -> order.getRegion())
    .count();

// Write results to a new topic
orderCounts.toStream().to("order-counts-by-region");

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

Schema Management

Use Schema Registry (Confluent) with Avro, Protobuf, or JSON Schema to enforce contracts:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

schema_str = """
{
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "userId", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "createdAt", "type": "long", "logicalType": "timestamp-millis"}
    ]
}
"""

schema_registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})
avro_serializer = AvroSerializer(schema_registry, schema_str)

producer.produce(
    'orders',
    key='order:abc',
    value=avro_serializer(order_dict, SerializationContext('orders', MessageField.VALUE))
)

Schema evolution rules (backward/forward/full compatibility):

Backward (default) — new schema can read old data. Safe to add fields with defaults.
Forward — old schema can read new data. Safe to remove fields with defaults.
Full — both. Only add/remove optional fields with defaults.

Performance Tuning

Throughput Optimization

Parameter	Default	Tuned	Impact
`batch.size` (producer)	16KB	64-256KB	Fewer requests, higher throughput
`linger.ms` (producer)	0	5-20ms	Allow batching
`compression.type`	none	lz4 / zstd	Smaller payloads, less network I/O
`fetch.min.bytes` (consumer)	1	65536	Fewer fetch requests
`fetch.max.wait.ms` (consumer)	500	500-1000	Batch fetches
`num.io.threads` (broker)	8	cores × 2	More parallel I/O
`num.network.threads` (broker)	3	cores	More parallel network handling

Latency Optimization

For low-latency use cases (< 10ms end-to-end):

producer = Producer({
    'linger.ms': 0,           # send immediately
    'batch.size': 0,          # don't batch
    'acks': 1,                # don't wait for replicas
    'compression.type': 'none',
})

The tradeoff: throughput vs latency. You can’t maximize both.

Kafka vs Alternatives

Feature	Kafka	RabbitMQ	Amazon SQS	Pulsar	Redpanda
Model	Distributed log	Message broker	Managed queue	Distributed log	Distributed log
Ordering	Per-partition	Per-queue	FIFO queues only	Per-partition	Per-partition
Retention	Configurable (days/size)	Until consumed	14 days max	Tiered storage	Configurable
Replay	Yes (seek by offset)	No (consumed = deleted)	No	Yes	Yes
Consumer model	Pull	Push	Pull	Pull	Pull (Kafka API)
Throughput	Millions msg/s	~50K msg/s	~3K msg/s per queue	Millions msg/s	Millions msg/s
Exactly-once	Yes (transactions)	No (at-least-once)	No (at-least-once)	Yes	Yes
Operational complexity	High (ZK/KRaft, brokers, SR)	Medium	None (managed)	High	Low (no JVM/ZK)

When to use Kafka: High-throughput event streaming, CDC, event sourcing, log aggregation, connecting many producers and consumers.

When NOT to use Kafka: Simple task queues (use RabbitMQ/SQS), request-reply patterns, very low message volumes (< 1K/s), sub-millisecond latency requirements.

Interview Cheat Sheet

Key Numbers

Metric	Typical Value
Throughput (single partition)	~10 MB/s write, ~30 MB/s read
Throughput (cluster)	GB/s+ with enough partitions
Latency (p99, acks=1)	2-5 ms
Latency (p99, acks=all)	5-15 ms
Partition count sweet spot	6-30 per topic
Replication factor	3 (production standard)
Retention default	7 days
Max message size default	1 MB
Consumer group rebalance	1-30 seconds

Interview Answer Template

When designing an event-driven system:

Why Kafka? — need decoupling, event replay, high throughput, multiple consumers
Topic design — one topic per event type (e.g., orders, payments, shipments)
Partitioning strategy — partition key = entity ID for ordering (e.g., userId, orderId)
Partition count — based on expected throughput and consumer parallelism
Replication — RF=3, min.insync.replicas=2, acks=all for production durability
Consumer groups — one group per downstream service; each scales independently
Delivery guarantee — at-least-once by default; exactly-once if consume-transform-produce
Schema management — Avro + Schema Registry with backward compatibility
Failure handling — dead letter topic for poison messages, idempotent consumers
Monitoring — consumer lag, under-replicated partitions, ISR shrink rate

Common Interview Questions

Q: How do you handle message ordering? A: Use the entity ID as the partition key. All events for that entity go to the same partition, which guarantees ordering. Cross-entity ordering requires a single partition (limits throughput).

Q: What happens when a consumer crashes? A: The consumer group coordinator detects the missing heartbeat after session.timeout.ms, triggers a rebalance, and assigns that consumer’s partitions to other consumers in the group. Any uncommitted messages are reprocessed (at-least-once).

Q: How do you handle a poison message (bad format/unprocessable)? A: Catch the error, publish the message to a dead letter topic (DLT) with error metadata, commit the offset, and continue. A separate process inspects and resolves DLT messages.

try:
    process(msg)
    consumer.commit(msg)
except Exception as e:
    producer.produce(
        'orders.dead-letter',
        key=msg.key(),
        value=msg.value(),
        headers=[('error', str(e)), ('original-topic', 'orders')]
    )
    consumer.commit(msg)  # skip the poison message

Q: How do you scale consumers? A: Add more consumers to the consumer group (up to the partition count). If you need more parallelism, increase the partition count — but be aware this may break key-based ordering for in-flight data.

Wrapping Up

Kafka’s power comes from a simple insight: an append-only log is the most general-purpose data structure for distributed systems. It can be a queue, an event store, a database changelog, or a stream processor — depending on how you consume it.

The mental model for interviews:

Producers write to partition leaders; partitioning determines parallelism and ordering
Consumers pull at their own pace, tracking position via offsets
Consumer groups enable scaling and fault tolerance through partition assignment
Replication (ISR) provides durability; acks=all + min.insync.replicas=2 is the production standard
Retention keeps data available for replay; compacted topics keep latest-per-key
The tradeoff triangle: throughput vs latency vs durability — pick two, tune the third

If you understand partitions, consumer groups, and the acks + ISR durability model, you can design any Kafka-based system in an interview.