System design interviews are not trivia tests. The interviewer is not checking whether you memorized the architecture of Twitter or can recite the CAP theorem. They want to see how you think through an ambiguous, open-ended problem. Can you break it down? Can you make decisions under uncertainty? Can you explain why you chose X over Y? Can you identify what matters and ignore what does not? These are the skills that matter in senior engineering roles, and the interview is designed to surface them in 45 minutes.
This lesson gives you the complete framework: how to structure your time, what to cover in each phase, which mistakes to avoid, and a checklist of components you should consider for any design problem.
The 45-Minute Framework
Every system design interview, regardless of the problem, follows the same structure. The time allocation is not a suggestion — it is a survival strategy. Candidates who skip phases or spend 20 minutes on requirements gathering fail.
Phase 1: Requirements & Scope           5 minutes
Phase 2: Back-of-Envelope Estimation    5 minutes
Phase 3: API Design                     5 minutes
Phase 4: High-Level Architecture       15 minutes
Phase 5: Deep Dive                     10 minutes
Phase 6: Wrap-Up & Extensions           5 minutes
------------------------------------------------
Total                                  45 minutes

Let’s walk through each phase.
Phase 1: Requirements and Scope (5 minutes)
The problem is deliberately vague. “Design a chat application” could mean Slack, WhatsApp, Discord, or a customer support widget. Your first job is to narrow it down.
Functional Requirements
Ask clarifying questions. Do not assume. The interviewer wants to see you ask.
You: "Before I start designing, I'd like to clarify a few things.
Is this a 1:1 chat, group chat, or both?"
Interviewer: "Both. Groups up to 500 members."
You: "Do we need to support media messages — images, files, voice?"
Interviewer: "Text and images for now."
You: "What about message history? Persistent or ephemeral?"
Interviewer: "Persistent. Users should see full chat history."
You: "Read receipts? Typing indicators? Online status?"
Interviewer: "Read receipts yes. Others are nice-to-have."

Write down the agreed requirements. This becomes your contract for the rest of the interview.
Functional Requirements:
- 1:1 and group chat (up to 500 members)
- Text and image messages
- Persistent message history
- Read receipts
- Push notifications for offline users

Non-Functional Requirements
These matter more than functional ones. They drive your architecture.
Non-Functional Requirements:
- Scale: 50M DAU, 500M messages/day
- Latency: messages delivered in < 500ms
- Availability: 99.99% uptime
- Durability: zero message loss
- Ordering: messages appear in send order within a chat

If the interviewer does not volunteer scale numbers, propose them yourself: “I’ll assume we’re designing for a scale similar to WhatsApp — around 50 million daily active users. Does that sound reasonable?”
Phase 2: Back-of-Envelope Estimation (5 minutes)
Estimation accomplishes two things: it proves you can reason about scale, and it reveals the technical constraints that drive your design decisions.
Template for Any Problem
Users and Traffic:
DAU: 50M
Avg messages/user/day: 10
Total messages/day: 500M
QPS (avg): 500M / 86400 = ~5,800
QPS (peak, 3x): ~17,400
Storage:
Avg message size: 200 bytes (text) or 200KB (image)
Text messages/day: 450M * 200B = 90GB/day
Image messages/day: 50M * 200KB = 10TB/day
5-year text storage: ~165TB
5-year image storage: ~18PB
Bandwidth:
Incoming: 10TB/day = ~115MB/s sustained
Outgoing (fan-out): With avg 5 recipients, 5x read amplification
~575MB/s outgoing
Connections:
Concurrent connections: 50M DAU, ~30% online at once = 15M WebSocket connections
Per server (10K conn): ~1,500 WebSocket servers

Do not spend more than 5 minutes on this. Round aggressively. The point is order-of-magnitude thinking, not precise math. If you calculated 5,800 QPS, say “roughly 6,000 QPS, peaking at around 18,000.”
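If it helps to sanity-check these numbers when practicing (not something you would do live in the interview), the whole template reduces to a few lines of arithmetic. The inputs are the assumed values from this phase; change them and the conclusions change with them.

```python
# Back-of-envelope numbers for the chat example, as quick arithmetic.
DAU = 50_000_000
MSGS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 3

total_msgs_per_day = DAU * MSGS_PER_USER_PER_DAY            # 500M
avg_qps = total_msgs_per_day / SECONDS_PER_DAY              # ~5,800
peak_qps = avg_qps * PEAK_MULTIPLIER                        # ~17,400

# Storage: 90% text at 200 B, 10% images at 200 KB, kept 5 years
text_bytes_per_day = 0.9 * total_msgs_per_day * 200         # 90 GB
image_bytes_per_day = 0.1 * total_msgs_per_day * 200_000    # 10 TB
five_year_text_tb = text_bytes_per_day * 365 * 5 / 1e12     # ~164 TB
five_year_image_pb = image_bytes_per_day * 365 * 5 / 1e15   # ~18 PB

print(f"avg QPS {avg_qps:,.0f}, peak {peak_qps:,.0f}")
```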
Estimation Shortcuts
These approximations are useful across many problems:
Time:
1 day = 86,400 seconds (~100K for quick math)
1 year = 31.5M seconds (~30M for quick math)
Storage:
1 char = 1 byte (ASCII) or 2-4 bytes (UTF-8)
Average tweet/message: 100-200 bytes
Average image (compressed): 200KB-500KB
Average video (1 min, compressed): 10MB
Scale references:
Twitter: 500M tweets/day, 300M MAU
WhatsApp: 100B messages/day, 2B MAU
YouTube: 500 hours uploaded/min, 1B hours watched/day
Uber: 20M rides/day
Network:
1 Gbps link: ~125 MB/s
Single server: 10K-50K concurrent connections
Single DB server: 5K-10K QPS (depends on query complexity)
Redis single node: 100K+ QPS

Phase 3: API Design (5 minutes)
Define the core API endpoints. This forces you to think about the interface before the implementation.
# Core APIs for a chat system
POST /api/v1/messages
Body: { chat_id, content, type: "text"|"image", media_url? }
Response: { message_id, timestamp, status }
GET /api/v1/chats/{chat_id}/messages?cursor=xxx&limit=50
Response: { messages: [...], next_cursor, has_more }
POST /api/v1/chats
Body: { type: "1:1"|"group", member_ids: [...], name? }
Response: { chat_id }
PUT /api/v1/messages/{message_id}/read
Response: { status: "ok" }
WebSocket /ws/v1/connect?token=xxx
Events: message.new, message.read, typing.start, typing.stop, presence.update

Keep it concise. Three to five endpoints. Use standard REST conventions. Mention pagination (cursor-based, not offset-based). Note which operations use WebSocket vs REST.
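The cursor-based pagination mentioned here can be sketched as follows. This is an illustrative implementation, not part of the design itself: the cursor encodes the sort key of the last row returned, so fetching the next page is an indexed range scan rather than an OFFSET query that degrades as the client paginates deeper.

```python
import base64
import json

def encode_cursor(created_at: int, message_id: int) -> str:
    raw = json.dumps({"t": created_at, "id": message_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str):
    data = json.loads(base64.urlsafe_b64decode(cursor))
    return data["t"], data["id"]

def page(messages, cursor=None, limit=50):
    """messages: list of (created_at, message_id, content), newest first."""
    if cursor:
        t, mid = decode_cursor(cursor)
        # Resume strictly after the last row the client saw
        messages = [m for m in messages if (m[0], m[1]) < (t, mid)]
    batch = messages[:limit]
    next_cursor = encode_cursor(batch[-1][0], batch[-1][1]) if batch else None
    return {"messages": batch, "next_cursor": next_cursor,
            "has_more": len(messages) > limit}
```

Because the cursor is the sort key itself, results stay stable even when new messages are inserted between page fetches, which is exactly where offset pagination produces duplicates or gaps.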
Phase 4: High-Level Architecture (15 minutes)
This is the core of the interview. Draw the major components and explain how data flows through the system.
The Standard Building Blocks
Almost every system design uses some subset of these components. Start here and add what you need.
[Clients] --> [DNS] --> [CDN] --> [Load Balancer]
|
[API Gateway / Reverse Proxy]
/ \
[App Servers] [WebSocket Servers]
| |
[Cache (Redis)] [Message Queue (Kafka)]
| |
[Database] [Workers / Consumers]
(Primary + Replicas)
|
[Object Storage (S3)]

How to Present It
Walk through the architecture by following a request:
"Let me trace what happens when User A sends a message to User B.
1. User A's client sends the message over their WebSocket connection
to a WebSocket server.
2. The WebSocket server publishes the message to Kafka,
topic 'chat-messages', partitioned by chat_id.
3. A message consumer reads from Kafka, persists the message
to the database, and looks up which WebSocket server
User B is connected to.
4. The consumer sends the message to User B's WebSocket server
via an internal pub/sub channel (Redis Pub/Sub).
5. User B's WebSocket server pushes the message to User B's client.
6. If User B is offline, a separate consumer triggers a push
notification via APNs/FCM."

This approach is powerful because it shows data flow, not just boxes. The interviewer sees that you understand how the pieces fit together.
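Steps 3-6 of that trace can be sketched with in-memory stand-ins. Everything below is hypothetical scaffolding: real code would use a Kafka consumer, a database client, and Redis Pub/Sub where this sketch uses dicts and lists.

```python
message_store = []        # stands in for the database
connection_registry = {}  # user_id -> WebSocket server id (normally Redis)
delivered = []            # messages routed to online users
push_notifications = []   # notifications queued for offline users

def handle_message(msg: dict) -> None:
    # 3. Persist first: the database (fed via Kafka) is the source of truth
    message_store.append(msg)
    for recipient in msg["recipients"]:
        server = connection_registry.get(recipient)
        if server is not None:
            # 4-5. Route to the WebSocket server holding the connection
            delivered.append((server, recipient, msg["id"]))
        else:
            # 6. Offline: hand off to the push-notification pipeline
            push_notifications.append((recipient, msg["id"]))

connection_registry["user_b"] = "ws-server-7"
handle_message({"id": 1, "chat_id": 42,
                "recipients": ["user_b", "user_c"]})
```

Note the ordering: persistence happens before delivery, which is the invariant the failure-scenario discussion later relies on.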
Architecture Patterns by Problem Type
Different problems call for different patterns. Here is a quick reference:
Read-heavy (news feed, timeline):
- Fan-out on write (precompute feeds)
- Heavy caching (Redis, CDN)
- Read replicas
Write-heavy (logging, analytics, IoT):
- Append-only logs (Kafka, Kinesis)
- LSM-tree databases (Cassandra, RocksDB)
- Batch processing (Spark, Flink)
Real-time (chat, notifications, live updates):
- WebSocket / SSE for persistent connections
- In-memory pub/sub (Redis Pub/Sub)
- Message queues for async processing
Storage-heavy (file storage, media):
- Object storage (S3, GCS)
- Chunking and deduplication
- CDN for content delivery
Search-heavy (search engine, product catalog):
- Inverted index (Elasticsearch, Solr)
- Ranking and relevance scoring
- Autocomplete with tries or prefix indices

The Component Checklist
Use this checklist to make sure you have not missed a critical piece. You do not need every component for every problem — but you should consciously decide what to include and what to skip.
Networking and Traffic
| Component | When to Include | Key Decisions |
|---|---|---|
| DNS | Always (mention briefly) | Round-robin vs geo-routing |
| CDN | When serving static assets or media | Pull vs push CDN, cache TTL |
| Load Balancer | Always | L4 vs L7, algorithm (round-robin, least-connections, consistent hashing) |
| API Gateway | When you need auth, rate limiting, routing | Combined with LB or separate |
| Rate Limiter | When facing abuse or uneven traffic | Token bucket vs sliding window, per-user vs per-IP |
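As a concrete reference for the rate limiter row, here is a minimal token-bucket sketch. This is a single-process version for illustration; a distributed limiter would keep the bucket state in Redis and do the refill-and-take atomically.

```python
import time

class TokenBucket:
    """Tokens refill continuously; a request passes if a token is available."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity sets the burst size and the refill rate sets the sustained rate, which is why token bucket handles bursty traffic more gracefully than a fixed window.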
Compute
| Component | When to Include | Key Decisions |
|---|---|---|
| App Servers | Always | Stateless, horizontal scaling |
| WebSocket Servers | Real-time features | Connection management, server affinity |
| Workers / Consumers | Async processing | Queue-based, scaling with queue depth |
| Cron / Scheduler | Periodic tasks | Idempotency, distributed locking |
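The idempotency concern in the scheduler row deserves a sketch: a task can be delivered more than once (retries, crashed workers), so execution is guarded by a set of processed task keys. In production that set would live in Redis or the database with an atomic check-and-set; the in-memory version below just shows the shape of the guard.

```python
processed: set = set()
side_effects: list = []

def run_task(task_id: str, action) -> bool:
    """Returns True if the task ran, False if it was a duplicate delivery."""
    if task_id in processed:
        return False          # duplicate: skip, no double side effect
    action()
    processed.add(task_id)
    return True

# The same task delivered twice only executes once
run_task("send-digest-2024-01-01", lambda: side_effects.append("email"))
run_task("send-digest-2024-01-01", lambda: side_effects.append("email"))
```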
Data
| Component | When to Include | Key Decisions |
|---|---|---|
| SQL Database | Structured data, relationships, ACID | MySQL vs Postgres, sharding strategy |
| NoSQL Database | High write throughput, flexible schema | Cassandra, DynamoDB, MongoDB |
| Cache (Redis) | Read-heavy, latency-sensitive | Cache-aside vs write-through, TTL, eviction |
| Search Engine | Full-text search, ranking | Elasticsearch, indexing strategy |
| Object Storage (S3) | Files, images, videos | Bucket organization, lifecycle policies |
| Message Queue | Async, decoupling, buffering | Kafka vs RabbitMQ, partitioning, consumer groups |
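The partitioning decision in the Message Queue row is worth being able to sketch, because hashing the chat id to a partition is what preserves per-chat ordering in the chat design. The hash and partition count below are illustrative (this is not Kafka's actual default partitioner); the point is that the same key always maps to the same partition.

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count for illustration

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic hash of the key -> stable partition assignment
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Same key -> same partition, every time, so one chat's messages
# are consumed in order by a single consumer
p1 = partition_for("chat:42")
p2 = partition_for("chat:42")
```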
Cross-Cutting Concerns
| Concern | What to Mention | Common Solutions |
|---|---|---|
| Monitoring | Metrics, logging, alerting | Prometheus, Grafana, ELK stack |
| Security | Auth, encryption, input validation | OAuth2, JWT, TLS, HTTPS |
| Data Privacy | GDPR, user data deletion | Soft deletes, anonymization |
| Disaster Recovery | Backups, multi-region | Cross-region replication, RTO/RPO targets |
Phase 5: Deep Dive (10 minutes)
The interviewer will ask you to go deeper on one or two areas. This is where you demonstrate real expertise. Common deep-dive topics:
Database Schema
-- Chat system example
CREATE TABLE messages (
message_id BIGINT PRIMARY KEY,
chat_id BIGINT NOT NULL,
sender_id BIGINT NOT NULL,
content TEXT,
message_type VARCHAR(10), -- 'text', 'image'
media_url VARCHAR(500),
created_at TIMESTAMP DEFAULT NOW(),
INDEX (chat_id, created_at)
);
-- Partition by chat_id for locality
-- Shard by chat_id % N for horizontal scaling

Be ready to discuss: why this schema, how it’s indexed, how it scales, what the access patterns are.
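The "shard by chat_id % N" comment in the schema reduces to a few lines. Because the chat id fully determines the shard, fetching a chat's history is always a single-shard query. The shard count and dict-backed shards below are illustrative stand-ins for real database nodes.

```python
NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(chat_id: int) -> int:
    return chat_id % NUM_SHARDS

def fetch_history(chat_id: int, shards: list) -> list:
    # One lookup, one shard -- no scatter-gather
    return shards[shard_for(chat_id)].get(chat_id, [])

shards = [dict() for _ in range(NUM_SHARDS)]
shards[shard_for(42)].setdefault(42, []).append("hello")
shards[shard_for(42)].setdefault(42, []).append("world")
```

One tradeoff worth naming if asked: plain modulo sharding makes adding shards painful, since most keys move when N changes. That is the motivation for the consistent hashing mentioned in the load balancer row of the checklist.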
Sharding Strategy
"I'd shard the messages table by chat_id.
Why chat_id and not user_id?
- All messages in a chat live on the same shard
- Fetching chat history is a single-shard query
- No need for scatter-gather across shards
If user_id were the shard key, loading a single chat
would require querying every shard (since members are
on different shards). That's N cross-shard queries
per page load -- it doesn't scale."

Caching Strategy
"For the chat system, I'd use Redis for two things:
1. Recent messages cache: last 50 messages per active chat
Key: chat:{chat_id}:recent
TTL: 24 hours (refreshed on access)
This avoids hitting the database for every chat open.
2. User presence: which users are online
Key: presence:{user_id}
TTL: 60 seconds (refreshed by heartbeat)
This avoids querying the WebSocket servers directly.
Cache invalidation: new messages append to the list
and trim to 50. Consistent with the database because
every write goes through the same service."

Failure Scenarios
This is where strong candidates separate themselves. Proactively discuss what breaks.
"What happens when a WebSocket server crashes?
1. All 10K connections on that server drop.
2. Clients detect the disconnect and reconnect
(with exponential backoff) to a different server.
3. On reconnect, the client sends its last received
message_id. The server replays any missed messages
from the database.
4. No messages are lost because they're persisted to
the database via Kafka before being pushed to clients.
The key invariant: Kafka and the database are the source
of truth. WebSocket servers are stateless delivery vehicles.
If one dies, clients reconnect and catch up."

How to Discuss Tradeoffs
Tradeoffs are the most important thing in a system design interview. Every decision has a cost. Articulating both sides shows maturity.
The Tradeoff Template
"I'm choosing X over Y because [reason].
The tradeoff is [downside of X].
We could mitigate that by [mitigation]."

Common Tradeoffs to Discuss
Consistency vs Availability
"For the chat system, I prioritize availability over strong consistency.
Users can tolerate seeing a message 500ms late, but they cannot tolerate
the chat being down. So I'll use eventual consistency with a conflict
resolution mechanism.
If this were a banking system, I'd flip this -- strong consistency
is non-negotiable for financial transactions, even at the cost
of higher latency."

SQL vs NoSQL
"I'm choosing Cassandra for the message store because:
- Write throughput: 500M messages/day needs horizontal write scaling
- Access pattern: always query by (chat_id, time_range) -- perfect for Cassandra's partition key model
- No complex joins needed
The tradeoff: no ACID transactions, no ad-hoc queries, harder operational model.
For user profiles and chat membership, I'd still use PostgreSQL
because those have relationships and need consistency."

Push vs Pull
"For the news feed, I'm using fan-out on write (push model):
- When a user posts, we precompute the feed for all their followers
- Reading the feed is a single cache/DB lookup -- very fast
- Tradeoff: celebrity users with 10M followers create write amplification
For celebrities (>10K followers), I'd switch to fan-out on read:
- Don't precompute. When a user opens their feed, merge in celebrity posts at read time
- This hybrid approach avoids the worst case of both strategies"

Cache-Aside vs Write-Through
"I'm using cache-aside (lazy loading):
- On cache miss, read from DB and populate cache
- On write, invalidate cache (not update)
- Tradeoff: first read after invalidation hits the DB (cache miss penalty)
Write-through would keep the cache always fresh, but it adds latency
to every write and caches data that might never be read.
For our read-heavy workload (100:1 read-write ratio), cache-aside
is the better fit."

Monolith vs Microservices
"At our scale (50M DAU), I'd use microservices for core paths:
- Message service, presence service, notification service
- Each can scale independently
- Teams can deploy independently
But I wouldn't split everything. Auth, rate limiting, and logging
stay in the API gateway. Over-decomposition creates more problems
than it solves -- distributed transactions, debugging difficulty,
network overhead."

Common Mistakes
These are the patterns that tank system design interviews.
Jumping to the solution. You draw a diagram in the first minute without understanding what you’re building. The interviewer wanted a file storage system and you designed a CDN. Always start with requirements.
No estimation. You propose Redis for caching but never calculated whether the data fits in memory. You suggest a single PostgreSQL instance but the write volume needs sharding. Estimation grounds your design in reality.
Ignoring non-functional requirements. Your design handles the happy path but you never discussed: What happens when a server crashes? What happens when the database is full? What happens when traffic spikes 10x? Non-functional requirements are what make a design production-ready.
Not discussing tradeoffs. You say “I’ll use Kafka” but not why. You pick DynamoDB without explaining what you give up compared to PostgreSQL. Every technology choice is a tradeoff. If you can’t articulate both sides, the interviewer assumes you don’t understand the choice.
Over-engineering. You propose Kubernetes, service mesh, multi-region active-active replication, and event sourcing for a system that handles 100 QPS. Match the complexity of your design to the scale of the problem.
Silent design. You think for three minutes without saying anything, then present a finished diagram. The interviewer cannot evaluate your thought process if they can’t hear it. Think out loud. Say “I’m considering X because…” even while you’re still deciding.
Single points of failure. You have one database with no replicas, one cache with no fallback, one service with no redundancy. Always ask yourself: “What happens if this component goes down?”
Communication Tips
Drive the conversation. Do not wait for the interviewer to tell you what to do next. Move through the phases yourself: “Now that we’ve agreed on requirements, let me do a quick estimation.”
Use the whiteboard (or shared doc) actively. Draw as you talk. Label components. Draw arrows showing data flow. A visual design is easier to discuss and critique than a verbal one.
Check in with the interviewer. After the high-level design: “Does this look reasonable so far? Is there an area you’d like me to dive deeper into?” This shows collaboration and lets the interviewer steer you toward what they want to evaluate.
Acknowledge uncertainty. “I’m not sure about the exact throughput of Kafka on a single broker, but I know it’s in the hundreds of thousands of messages per second range. For our load of 18K QPS, a small Kafka cluster should be sufficient.” This is better than either guessing a precise number or saying “I don’t know.”
Name your assumptions. “I’m assuming users are distributed globally, so I’ll include a CDN and multi-region deployment.” If the assumption is wrong, the interviewer will correct you. That is a good thing.
Phase 6: Wrap-Up and Extensions (5 minutes)
In the last few minutes, summarize your design and proactively mention extensions you did not have time to cover.
"To summarize: we have a chat system that handles 50M DAU with
WebSocket servers for real-time delivery, Kafka for reliable
message processing, Cassandra for message storage, Redis for
caching and presence, and a push notification service for
offline users.
If I had more time, I'd discuss:
- End-to-end encryption (Signal Protocol)
- Message search (Elasticsearch index on message content)
- Multi-region deployment for global latency
- Abuse detection and content moderation
- Message retention policies and GDPR compliance"

This shows breadth of knowledge without requiring time to design each extension in detail.
Practice Problem List
Here are the most common system design problems, grouped by difficulty. The first four in this course (Lessons 11-14) cover the most frequently asked ones. Practice at least one from each category.
Warm-Up (30-minute problems)
1. URL Shortener (TinyURL)
Focus: Hashing, database design, read-heavy scaling
2. Paste Bin
Focus: Object storage, expiration, content addressing
3. Rate Limiter
Focus: Token bucket, sliding window, distributed counting

Core (45-minute problems)
4. News Feed / Timeline (Facebook, Twitter)
Focus: Fan-out, ranking, caching, real-time updates
5. Chat System (WhatsApp, Slack)
Focus: WebSockets, message ordering, presence, group chat
6. Notification System
Focus: Multi-channel, templates, dedup, priority queues
7. File Storage (Dropbox, Google Drive)
Focus: Chunking, sync, dedup, conflict resolution
8. Web Crawler
Focus: URL frontier, politeness, dedup, distributed crawling

Advanced (60-minute problems)
9. Search Autocomplete (Typeahead)
Focus: Trie, ranking, precomputation, caching
10. Distributed Key-Value Store
Focus: Consistent hashing, replication, vector clocks
11. Video Streaming (YouTube, Netflix)
Focus: Encoding pipeline, adaptive bitrate, CDN, recommendations
12. Ride-Sharing (Uber, Lyft)
Focus: Geospatial indexing, matching, real-time tracking, surge pricing
13. Payment System (Stripe)
Focus: Idempotency, distributed transactions, reconciliation, PCI compliance
14. Distributed Task Scheduler
Focus: Priority queues, sharding, exactly-once execution, failure recovery

For Each Practice Problem, Cover:
1. Requirements (functional + non-functional)
2. Estimation (storage, QPS, bandwidth)
3. API (3-5 core endpoints)
4. High-level architecture (trace a request end-to-end)
5. Deep dive on 2 components (schema, caching, sharding)
6. Tradeoffs (at least 3 explicit tradeoff discussions)
7. Failure scenarios (what breaks, how you recover)

Quick-Reference: Estimation Cheat Sheet
Tape this to your wall. These numbers come up constantly.
Latency Numbers Every Engineer Should Know:
L1 cache reference: 1 ns
L2 cache reference: 4 ns
RAM reference: 100 ns
SSD random read: 16 us
HDD random read: 2 ms
Round trip same datacenter: 500 us
Round trip US coast-to-coast: 40 ms
Round trip US-to-Europe: 80 ms
Throughput References:
Single MySQL server: 5K-10K QPS (simple queries)
Single Redis server: 100K+ QPS
Single Kafka broker: 200K+ messages/s
Single Elasticsearch node: 5K-20K queries/s
Single app server (API): 1K-10K QPS (depends on logic)
Storage Conversions:
1 KB = 1,000 bytes
1 MB = 1,000 KB = 10^6 bytes
1 GB = 1,000 MB = 10^9 bytes
1 TB = 1,000 GB = 10^12 bytes
1 PB = 1,000 TB = 10^15 bytes
Availability:
99%: 3.65 days downtime/year
99.9%: 8.76 hours downtime/year
99.99%: 52.6 minutes downtime/year
99.999%: 5.26 minutes downtime/year

Key Takeaways
- Follow the framework religiously: requirements (5 min), estimation (5 min), API (5 min), high-level design (15 min), deep dive (10 min), wrap-up (5 min). Deviating from this structure is the most common cause of running out of time.
- Ask clarifying questions first. The problem is intentionally vague. Narrowing scope shows maturity. Writing down agreed requirements gives you a contract to design against.
- Estimation is not about precision. It is about proving you can reason about scale and using the results to justify design decisions. “We have 6K QPS so a single database won’t cut it” is the kind of insight estimation provides.
- Trace a request end-to-end through your architecture. This is more convincing than listing components. “User clicks send, the message hits the WebSocket server, goes to Kafka, gets persisted, and is pushed to the recipient” — that tells a story.
- Every technology choice is a tradeoff. State the tradeoff explicitly. “I chose Cassandra over PostgreSQL because we need horizontal write scaling, at the cost of giving up ACID transactions.” One-sided reasoning signals inexperience.
- Proactively discuss failure scenarios. What happens when a server crashes, a database goes down, or traffic spikes 10x? This separates senior candidates from mid-level ones.
- Think out loud. The interviewer is evaluating your thought process, not your final answer. Silent thinking for three minutes followed by a perfect diagram is less impressive than walking through your reasoning step by step.
- Do not over-engineer. A URL shortener does not need Kubernetes, event sourcing, and multi-region active-active replication. Match the complexity of your solution to the scale of the problem.
- Practice by writing, not just reading. For each problem, actually draw the architecture, write the schema, calculate the numbers, and articulate the tradeoffs out loud. Reading about system design and doing system design are different skills.
