System Design Masterclass
March 28, 2026 · 6 min read
Lesson 2 / 15

02. Scaling Reads — Caching, Replicas & CDN

TL;DR

Most systems are read-heavy (90%+ reads). Scale reads with: 1) caching (Redis/Memcached), 2) read replicas for database queries, 3) CDN for static/semi-static content. The hard part is cache invalidation — pick TTL, event-based, or version-based invalidation, and pair it with a population pattern (cache-aside, read-through, write-through, write-behind) based on your consistency needs.

Most real-world systems are read-heavy. Twitter serves 600K reads per second but only 6K writes. Netflix handles millions of concurrent streams but catalog updates are rare. The read-to-write ratio is often 100:1 or higher.

The good news: reads are far easier to scale than writes. You can cache them, replicate them, and push them to the edge.

[Figure: multi-level caching architecture showing browser, CDN, app cache, and database layers]

Caching Strategies in Depth

Cache-Aside (Lazy Loading)

The most common pattern. The application manages the cache directly.

import redis, json
cache = redis.Redis(host='cache.internal', port=6379)

def get_user_profile(user_id: int) -> dict:
    cache_key = f"user:{user_id}:profile"

    # Step 1: Check cache
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache HIT

    # Step 2: Cache MISS — read from database
    profile = db.execute("SELECT * FROM users WHERE id = %s", (user_id,)).fetchone()

    # Step 3: Populate cache for next time
    cache.setex(cache_key, 300, json.dumps(profile))  # TTL = 5 min
    return profile

Best for: Read-heavy workloads where stale data is acceptable during the TTL window. Weakness: First request is always slow (cold cache). Cache and DB can drift.

Write-Through

Every write goes to both cache and database. Reads always hit the cache.

def update_user_profile(user_id: int, updates: dict) -> dict:
    # Write to database
    db.execute("UPDATE users SET name=%s WHERE id=%s", (updates['name'], user_id))
    # Read the full row back and refresh the cache immediately
    updated = db.execute("SELECT * FROM users WHERE id=%s", (user_id,)).fetchone()
    cache.setex(f"user:{user_id}:profile", 300, json.dumps(updated))
    return updated

Best for: Strong consistency between cache and DB. Weakness: Higher write latency. Cache fills with data that may never be read.

Write-Behind (Write-Back)

Write to cache immediately, asynchronously persist to database.

def update_user_async(user_id: int, updates: dict):
    # Write to cache (fast — user gets immediate response)
    # Sketch assumes `updates` is the full profile object, not a partial diff
    cache.setex(f"user:{user_id}:profile", 300, json.dumps(updates))
    # Queue the database write
    write_queue.put({"user_id": user_id, "data": updates})

# Background worker drains the queue and writes to DB in batches

Best for: Low write latency, batch writes. Weakness: Data loss if cache crashes before background flush. Complex error handling.
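The background worker is only hinted at in the comment above. A minimal sketch using Python's standard queue.Queue, where db_writer is a stand-in for whatever batched write your database layer exposes (one multi-row statement instead of N round trips):

```python
import queue

write_queue = queue.Queue()

def drain_batch(db_writer, max_batch=100, timeout=0.05):
    """Collect queued writes and persist them to the DB as one batch."""
    batch = []
    try:
        # Block briefly for the first item, then grab whatever else is ready
        batch.append(write_queue.get(timeout=timeout))
        while len(batch) < max_batch:
            batch.append(write_queue.get_nowait())
    except queue.Empty:
        pass
    if batch:
        db_writer(batch)  # e.g. one bulk UPDATE instead of len(batch) round trips
    return len(batch)
```

A real worker would loop on drain_batch forever and retry failed batches — that retry logic is exactly the "complex error handling" weakness mentioned above.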

Read-Through

The cache itself loads data from the database on a miss. The application only talks to the cache — the loading logic is encapsulated.

class ReadThroughCache:
    def __init__(self, cache_client, ttl=300):
        self.cache = cache_client
        self.ttl = ttl

    def get(self, key: str, loader_fn):
        """Get from cache. On miss, loader_fn fetches from source."""
        value = self.cache.get(key)
        if value:
            return json.loads(value)

        # Cache loads from DB on miss -- transparent to caller
        value = loader_fn()
        if value:
            self.cache.setex(key, self.ttl, json.dumps(value))
        return value

# Usage -- caching logic is hidden from application code
user_cache = ReadThroughCache(cache, ttl=300)
profile = user_cache.get(
    f"user:{user_id}",
    lambda: db.execute("SELECT * FROM users WHERE id=%s", (user_id,)).fetchone()
)

Best for: Keeping caching logic out of application code. Weakness: Less control over what gets cached and when.

The Five Questions of Caching

Before adding a cache to any system, answer these:

1. WHAT to cache?     Hot data, expensive computations, static content
2. WHERE to cache?    Browser, CDN, app server, distributed cache, DB cache
3. HOW to populate?   Cache-aside, read-through, write-through, write-behind
4. WHEN to invalidate? TTL, event-based, version-based
5. WHAT on miss?      Fetch from source, queue and wait, return stale

Cache Invalidation

[Figure: cache invalidation strategies: TTL-based, event-based, and version-based]

TTL-Based

Every cached value expires after a fixed duration. Simple and self-cleaning.

Content Type          Suggested TTL   Reasoning
──────────────────────────────────────────────────
Static assets         24 hours+       Rarely change
Product catalog       1 hour          Changes daily
User profile          5 minutes       Changes occasionally
Account balance       0 (no cache)    Must be real-time
Trending topics       30 seconds      Changes constantly

Event-Based

Invalidate on write events. Near-instant consistency, more complexity.

def update_product(product_id: int, updates: dict):
    db.execute("UPDATE products SET ... WHERE id = %s", (product_id,))
    # Invalidate product cache and all related caches
    cache.delete(f"product:{product_id}")
    cache.delete(f"category:{updates['category_id']}:products")
    # Notify other services via event bus
    event_bus.publish("product.updated", {"product_id": product_id})

Version-Based

Change the cache key when data changes. Old entries expire naturally via TTL.

def get_product_versioned(product_id: int) -> dict:
    version = cache.get(f"product:{product_id}:version") or "0"
    cache_key = f"product:{product_id}:v{version}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    product = db.get_product(product_id)
    cache.setex(cache_key, 3600, json.dumps(product))
    return product

def update_product(product_id: int, updates: dict):
    db.update_product(product_id, updates)
    cache.incr(f"product:{product_id}:version")  # Old key becomes unreachable

The Thundering Herd Problem

When a popular cache key expires, hundreds of requests simultaneously miss the cache and all hit the database.

import time

def get_with_lock(key: str, loader_fn, ttl: int = 300):
    """Only one request fetches from DB on cache miss."""
    value = cache.get(key)
    if value:
        return json.loads(value)

    lock_key = f"lock:{key}"
    if cache.set(lock_key, "1", nx=True, ex=10):  # Acquire lock
        try:
            value = loader_fn()
            cache.setex(key, ttl, json.dumps(value))
            return value
        finally:
            cache.delete(lock_key)
    else:
        time.sleep(0.05)  # Losers wait briefly, then retry the cache
        cached = cache.get(key)
        return json.loads(cached) if cached else loader_fn()

Alternative: Stale-while-revalidate. Serve the expired cached value immediately while refreshing in the background. The user gets a fast response with slightly stale data.
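A minimal sketch of stale-while-revalidate, using an in-process dict in place of Redis to stay self-contained (a Redis version would store the expiry timestamp alongside the value, since a key with a live TTL never reads as stale):

```python
import threading
import time

_local = {}  # key -> (value, expires_at); in-memory stand-in for Redis

def get_swr(key, loader_fn, ttl=300):
    """Serve stale data instantly; refresh in the background after expiry."""
    entry = _local.get(key)
    now = time.time()
    if entry is None:
        # True cold miss: the caller must wait for the source once
        value = loader_fn()
        _local[key] = (value, now + ttl)
        return value
    value, expires_at = entry
    if now >= expires_at:
        # Expired: hand back the stale value, refresh off the request path
        def _refresh():
            fresh = loader_fn()
            _local[key] = (fresh, time.time() + ttl)
        threading.Thread(target=_refresh, daemon=True).start()
    return value
```

Only the first-ever request pays the load cost; every later caller gets a sub-millisecond response, at the price of briefly stale reads.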

Read Replicas

Caching works for hot data, but not every query can be cached. Read replicas distribute database reads across multiple copies.

[Figure: read replica topology: primary handles writes, replicas handle reads]

Read/Write Splitting

import random

class DatabaseRouter:
    def __init__(self, primary_dsn, replica_dsns):
        self.primary = connect(primary_dsn)
        self.replicas = [connect(dsn) for dsn in replica_dsns]

    def get_read_connection(self):
        return random.choice(self.replicas)

    def get_write_connection(self):
        return self.primary

    def get_read_after_write_connection(self):
        """For reads that MUST see the latest write."""
        return self.primary

router = DatabaseRouter(
    primary_dsn="postgresql://primary:5432/app",
    replica_dsns=["postgresql://replica1:5432/app", "postgresql://replica2:5432/app"]
)

Replication Lag

The gap between a write on the primary and when it appears on a replica.

SYNCHRONOUS:  0ms lag, slower writes, replica failure blocks all writes
ASYNCHRONOUS: 10-1000ms lag, fast writes, stale reads possible
SEMI-SYNC:    At least 1 replica confirms. Balance of both.

Handling Read-Your-Writes Consistency

def handle_read_your_writes(user_id: int, last_write_ts: float):
    """Route to primary if the user's last write was recent."""
    if time.time() - last_write_ts < 2.0:
        return router.get_write_connection()  # Read from primary

    replica = router.get_read_connection()
    lag = get_replication_lag(replica)
    if lag > 1.0:
        return router.get_write_connection()  # Fallback to primary
    return replica
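The get_replication_lag helper used above is assumed. For Postgres streaming replication, a minimal sketch might query pg_last_xact_replay_timestamp() on the replica (it is NULL until the standby has replayed a transaction, hence the `or 0.0`):

```python
REPLICA_LAG_SQL = (
    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
)

def get_replication_lag(replica) -> float:
    """Seconds the replica trails the primary; 0.0 if fully caught up."""
    row = replica.execute(REPLICA_LAG_SQL).fetchone()
    return float(row[0] or 0.0)
```

In production you would cache this value for a second or two rather than query it on every read.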

CDN (Content Delivery Network)

A CDN caches content at edge locations close to users. Instead of every request crossing an ocean to your origin, the nearest edge node serves the content.

What to Put on a CDN

STATIC (always CDN): Images, CSS, JS, fonts. Cache forever, bust with filename hash.
SEMI-STATIC (short TTL): Product pages, blog posts. Cache 1-60 min, invalidate on update.
DYNAMIC (usually not CDN): User-specific data, authenticated responses.
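"Bust with filename hash" means embedding a digest of the file's contents in its name, so a changed file gets a new URL and the old one can be cached forever. A minimal sketch (the 8-character digest length is an arbitrary choice):

```python
import hashlib

def hashed_filename(name: str, content: bytes) -> str:
    """Embed a short content hash in the filename: app.css -> app.<hash8>.css."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = name.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{name}.{digest}"
```

Build tools like webpack and Vite do this automatically; the point is that identical content always maps to the same URL, and any change maps to a fresh one.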

Cache Headers

import hashlib

@app.route("/api/products/<int:product_id>")
def get_product(product_id):
    product = db.get_product(product_id)
    body = json.dumps(product)
    response = make_response(body)
    # CDN caches 5 min, browser caches 1 min
    response.headers['Cache-Control'] = 'public, max-age=60, s-maxage=300'
    # Stable content hash (Python's built-in hash() varies across processes)
    response.headers['ETag'] = f'"{hashlib.md5(body.encode()).hexdigest()}"'
    return response

@app.route("/api/users/<int:user_id>/profile")
def get_profile(user_id):
    response = make_response(json.dumps(db.get_user(user_id)))
    response.headers['Cache-Control'] = 'private, max-age=60'  # No CDN
    return response

Pull vs Push CDN

PULL: CDN fetches from origin on first request. Simple, cold start on first hit.
PUSH: You upload to CDN proactively. No cold start, more operational complexity.

Redis vs Memcached: When to Use Which

CHOOSE REDIS WHEN:
- You need data structures (sorted sets, lists, hashes, HyperLogLog)
- You need persistence (RDB snapshots or AOF)
- You need pub/sub for real-time features
- You need Lua scripting for atomic operations
- You need built-in replication and clustering
- Use cases: sessions, rate limiting, leaderboards, queues, pub/sub

CHOOSE MEMCACHED WHEN:
- You need simple key-value caching with the largest possible cache
- You need multi-threaded performance on multi-core machines
- You do not need persistence or data structures
- You want a simpler operational model
- Use cases: HTML fragment caching, DB query result caching

Redis Cluster for Horizontal Scaling

from redis.cluster import RedisCluster, ClusterNode

rc = RedisCluster(
    startup_nodes=[
        ClusterNode("redis-1", 6379),
        ClusterNode("redis-2", 6379),
        ClusterNode("redis-3", 6379),
    ],
    decode_responses=True
)

# Cluster shards data using CRC16(key) % 16384 hash slots
rc.set("user:1001:session", json.dumps(session_data), ex=1800)

# Pipeline for batch operations (reduces round trips)
pipe = rc.pipeline()
for user_id in user_ids:
    pipe.get(f"user:{user_id}:profile")
results = pipe.execute()
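The CRC16(key) % 16384 mapping in the comment above is easy to reproduce. Redis Cluster uses the XModem CRC16 variant and, when a key contains a {hash tag}, hashes only the tag — which is how you pin related keys to the same node. A sketch:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM: poly 0x1021, init 0x0000 -- the variant Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honoring {hash tag} syntax."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # non-empty tag between the braces
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

Because "{user:1001}:session" and "{user:1001}:profile" hash only "user:1001", they land in the same slot and can be used together in multi-key operations.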

Putting It All Together

How to scale reads for 100K requests per second:

LAYER 1: BROWSER CACHE      → 40% of requests never leave the browser
LAYER 2: CDN                → 30% served from the edge (~5ms latency)
LAYER 3: REDIS              → 20% served from cache (~1ms latency)
LAYER 4: READ REPLICAS      → 8% served from replicas (~10ms latency)
LAYER 5: PRIMARY DATABASE   → 2% reach the primary (~20ms latency)

Of 100,000 req/sec:
  40,000 → browser    |  30,000 → CDN    |  20,000 → Redis
  8,000  → replicas   |  2,000  → primary DB

Key Takeaways

  • Most systems are read-heavy (90%+ reads). Scaling reads is your first and most common challenge. The tools are caching, read replicas, and CDNs.
  • Cache-aside is the most common pattern. Write-through gives stronger consistency at higher write latency. Write-behind gives low write latency but risks data loss.
  • Cache invalidation is the hardest part. TTL-based is simple but allows stale data. Event-based gives near-instant consistency but adds complexity. Version-based avoids race conditions but wastes memory.
  • The thundering herd problem is real. Use distributed locks or stale-while-revalidate to prevent cache misses from crushing your database.
  • Read replicas scale database reads but introduce replication lag. Handle read-your-writes by routing recent writes to the primary.
  • CDNs are the most effective read scaling tool for static and semi-static content. Use proper Cache-Control headers.
  • Layer your caches: browser, CDN, application cache, read replicas, primary database. Each layer absorbs traffic so only a fraction reaches the DB.