“An API Gateway is the front door to your microservices. Every request walks through it, and every cross-cutting concern lives there — so you don’t repeat it in 50 services.”
In a monolith, there’s one entry point. In a microservices architecture, there can be dozens or hundreds. Clients shouldn’t need to know about your internal service topology, manage multiple connections, or handle authentication, rate limiting, and retries themselves. That’s the API Gateway’s job.
This article covers everything you need for system design interviews: what an API Gateway does, how each component works, the algorithms behind rate limiting, and the patterns that come up when designing real systems.
What is an API Gateway?
An API Gateway is a reverse proxy that sits between external clients and your backend services. It handles the cross-cutting concerns that every request needs but that no individual service should have to implement.
Without a gateway, every service must independently handle:
- TLS termination
- Authentication / authorization
- Rate limiting
- Request validation
- Logging, metrics, tracing
- CORS headers
With a gateway, services only handle business logic. Everything else is centralized.
Request Lifecycle
Every request through an API Gateway follows a predictable pipeline: TLS termination → authentication → rate limiting → request validation → routing → load balancing → upstream service.
Let’s break down each stage.
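Before diving into each stage, the pipeline itself can be sketched as a chain of middleware functions, where each stage transforms the request or rejects it. This is an illustrative sketch only; the `Request` type and stage bodies are hypothetical placeholders, not a real gateway's API:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)

def authenticate(req: Request) -> Request:
    # Reject early if no credentials are present
    if "Authorization" not in req.headers:
        raise ValueError("401: missing credentials")
    return req

def rate_limit(req: Request) -> Request:
    # Placeholder: a real gateway consults the rate limiter here
    return req

def route(req: Request) -> Request:
    # Map a path prefix to an upstream service
    if req.path.startswith("/api/users"):
        req.headers["X-Upstream"] = "user-service"
    return req

PIPELINE = [authenticate, rate_limit, route]

def handle(req: Request) -> Request:
    for stage in PIPELINE:
        req = stage(req)  # each stage transforms, enriches, or rejects
    return req

req = handle(Request("/api/users/42", {"Authorization": "Bearer t"}))
print(req.headers["X-Upstream"])  # user-service
```

The ordering mirrors the pipeline above: a request that fails authentication never reaches the rate limiter or router.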
Authentication and Authorization
The gateway validates identity before requests reach your services.
Common Auth Patterns
| Pattern | How | Best For |
|---|---|---|
| API Key | Key in header (X-API-Key) or query param | Service-to-service, simple APIs |
| JWT (Bearer Token) | Verify signature locally, extract claims | Stateless auth, microservices |
| OAuth 2.0 | Token introspection or JWT validation | Third-party access, SSO |
| mTLS | Client presents certificate | Service mesh, zero-trust |
JWT Validation at the Gateway
# Gateway middleware — JWT validation (no database call needed)
import jwt
from functools import wraps

PUBLIC_KEY = open('public.pem').read()

def authenticate(func):
    @wraps(func)
    def wrapper(request, *args, **kwargs):
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        if not token:
            # Response here stands in for your gateway framework's response type
            return Response(status=401, body='Missing token')
        try:
            claims = jwt.decode(token, PUBLIC_KEY, algorithms=['RS256'],
                                audience='my-api')
        except jwt.ExpiredSignatureError:
            return Response(status=401, body='Token expired')
        except jwt.InvalidTokenError:
            return Response(status=401, body='Invalid token')
        # Attach user context for downstream services
        request.headers['X-User-ID'] = claims['sub']
        request.headers['X-User-Roles'] = ','.join(claims.get('roles', []))
        return func(request, *args, **kwargs)
    return wrapper

Interview insight: JWT validation is stateless — the gateway verifies the signature using the public key without calling the auth service. This is why JWTs are preferred over opaque tokens in API Gateways — no network call per request.
The tradeoff: you can’t revoke a JWT before it expires. Mitigations: short expiry (15 min) + refresh tokens, or a lightweight token blocklist in Redis.
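The blocklist mitigation is small in practice: on logout or compromise, store the token's `jti` claim until the token would have expired anyway. A minimal sketch, using an in-memory dict as the store for illustration (in production this would be a Redis client, so the blocklist is shared across gateway instances):

```python
import time

class TokenBlocklist:
    """Blocklist for revoked JWTs, keyed by the token's jti claim.
    The dict store is illustrative; swap in Redis for production."""

    def __init__(self, store=None):
        self.store = store if store is not None else {}

    def revoke(self, jti: str, expires_at: float) -> None:
        # Keep the entry only until the token would expire anyway,
        # so the blocklist stays small
        self.store[jti] = expires_at

    def is_revoked(self, jti: str) -> bool:
        expires_at = self.store.get(jti)
        if expires_at is None:
            return False
        if expires_at < time.time():  # token already expired; entry is stale
            del self.store[jti]
            return False
        return True

bl = TokenBlocklist()
bl.revoke("token-abc", time.time() + 900)  # a 15-minute token
print(bl.is_revoked("token-abc"))  # True
print(bl.is_revoked("token-xyz"))  # False
```

The gateway checks `is_revoked(claims["jti"])` after signature validation; the lookup is one in-memory (or Redis) read, far cheaper than calling the auth service per request.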
Rate Limiting
Rate limiting protects your services from abuse and ensures fair usage. The gateway is the natural place to enforce it.
Token Bucket Algorithm
The most common algorithm. Each client gets a bucket with a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token.
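The refill arithmetic is easiest to see in a single-process sketch before adding Redis; the class below is illustrative, with lazy refill computed from elapsed time exactly as the distributed version does:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int = 100, refill_rate: float = 10.0):
        self.capacity = capacity          # max tokens the bucket holds
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)     # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        # Lazily refill based on time elapsed since the last check
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# refill_rate=0 makes the demo deterministic: 3 requests pass, the 4th is blocked
bucket = TokenBucket(capacity=3, refill_rate=0.0)
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

Note the burst behavior: a full bucket lets `capacity` requests through at once, then throttles to `refill_rate` per second. This is exactly why the algorithm needs the atomic Redis version below once you run multiple gateway instances.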
import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, max_tokens=100, refill_rate=10):
        self.redis = redis_client
        self.max_tokens = max_tokens      # bucket capacity
        self.refill_rate = refill_rate    # tokens per second

    def is_allowed(self, client_id: str) -> bool:
        key = f"ratelimit:{client_id}"
        now = time.time()
        # Lua script for atomic check-and-update
        lua_script = """
        local key = KEYS[1]
        local max_tokens = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or max_tokens
        local last_refill = tonumber(data[2]) or now
        -- Refill tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(max_tokens, tokens + elapsed * refill_rate)
        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 1 -- allowed
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 0 -- rate limited
        end
        """
        return bool(self.redis.eval(lua_script, 1, key,
                                    self.max_tokens, self.refill_rate, now))

Sliding Window Log
More accurate than token bucket for strict per-second limits:
def sliding_window_is_allowed(redis_client, client_id, window_sec=60, max_requests=100):
    key = f"ratelimit:sw:{client_id}"
    now = time.time()
    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_sec)  # remove expired entries
    pipe.zadd(key, {str(now): now})                  # add current request
    pipe.zcard(key)                                  # count in window
    pipe.expire(key, window_sec)
    results = pipe.execute()
    return results[2] <= max_requests

Rate Limiting Response Headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1679000060

When the limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0

Rate Limiting Dimensions
| Dimension | Example | Use Case |
|---|---|---|
| Per API key | 1000 req/min per key | SaaS API tiers |
| Per user | 100 req/min per user | Logged-in users |
| Per IP | 50 req/min per IP | Anonymous/public APIs |
| Per endpoint | 10 req/min on /api/export | Expensive operations |
| Global | 50K req/s total | Cluster protection |
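In practice, these dimensions usually share one limiter implementation and differ only in the key the counter is stored under. A hypothetical key-builder (the key layout shown is illustrative, not a standard):

```python
def rate_limit_key(dimension: str, value: str, endpoint: str = "") -> str:
    """Build the Redis counter key for a given limiting dimension.

    dimension: one of 'key', 'user', 'ip', 'endpoint', 'global'
    value:     the client identity (API key, user ID, IP) where relevant
    """
    if dimension == "global":
        return "ratelimit:global"          # one shared counter for the cluster
    if dimension == "endpoint":
        # per-endpoint limits still count per client, so include both
        return f"ratelimit:endpoint:{endpoint}:{value}"
    return f"ratelimit:{dimension}:{value}"

print(rate_limit_key("user", "1001"))                     # ratelimit:user:1001
print(rate_limit_key("endpoint", "1001", "/api/export"))  # ratelimit:endpoint:/api/export:1001
```

Multiple dimensions can be enforced on one request by checking each key in turn and rejecting if any limit is exhausted.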
Routing and Load Balancing
Path-Based Routing
The gateway maps incoming paths to backend services:
# NGINX example
location /api/users/ {
    proxy_pass http://user-service:8080/;
}
location /api/orders/ {
    proxy_pass http://order-service:8080/;
}
location /api/payments/ {
    proxy_pass http://payment-service:8080/;
}

# Kong declarative config
services:
  - name: user-service
    url: http://user-service:8080
    routes:
      - name: user-routes
        paths:
          - /api/users
        strip_path: true
  - name: order-service
    url: http://order-service:8080
    routes:
      - name: order-routes
        paths:
          - /api/orders
        methods:
          - GET
          - POST

Header-Based Routing
Route based on headers for A/B testing, canary deployments, or API versioning:
# Version-based routing
location /api/ {
    if ($http_api_version = "v2") {
        proxy_pass http://service-v2:8080;
    }
    proxy_pass http://service-v1:8080;
}

Load Balancing Algorithms
| Algorithm | Behavior | Best For |
|---|---|---|
| Round Robin | Rotate through instances sequentially | Homogeneous instances |
| Weighted Round Robin | More traffic to higher-weight instances | Mixed instance sizes |
| Least Connections | Route to instance with fewest active connections | Variable request duration |
| IP Hash | Same client always hits same instance | Session affinity |
| Random | Pick an instance randomly | Simple, surprisingly effective |
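Weighted round robin is simple enough to sketch directly. This naive version expands weights into a repeating schedule (NGINX actually uses a "smooth" variant that interleaves heavy and light servers instead of sending bursts, so treat this as a teaching sketch, not NGINX's algorithm):

```python
import itertools

class WeightedRoundRobin:
    def __init__(self, servers):
        # servers: list of (name, weight) pairs.
        # Expand into a flat rotation: weight 3 means 3 slots in the cycle.
        schedule = [name for name, weight in servers for _ in range(weight)]
        self._cycle = itertools.cycle(schedule)

    def next(self) -> str:
        return next(self._cycle)

# Matches the 3:1:1 weighting of the NGINX upstream below
lb = WeightedRoundRobin([("order-1", 3), ("order-2", 1), ("order-3", 1)])
print([lb.next() for _ in range(5)])
# ['order-1', 'order-1', 'order-1', 'order-2', 'order-3']
```

Over any 5 consecutive picks, order-1 receives 3 of them, which is exactly what weight=3 means in the config that follows.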
# NGINX weighted upstream
upstream order-service {
    server order-1:8080 weight=3;  # gets 3x traffic
    server order-2:8080 weight=1;
    server order-3:8080 weight=1;
}

Circuit Breaker
When a backend service starts failing, the gateway should stop sending requests to it instead of overwhelming it further.
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # probe with real traffic
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0  # a success resets the consecutive-failure count

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.success_count = 0

# Usage: breaker.call(fetch_orders, user_id) — raises CircuitOpenError
# immediately while the circuit is open, instead of hammering the backend.

Response Caching
The gateway can cache responses to reduce backend load for idempotent requests.
# NGINX response caching
proxy_cache_path /tmp/cache levels=1:2 keys_zone=api_cache:10m
                 max_size=1g inactive=60m;

location /api/products/ {
    proxy_cache api_cache;
    proxy_cache_methods GET HEAD;
    proxy_cache_valid 200 5m;  # cache 200 responses for 5 minutes
    proxy_cache_valid 404 1m;
    proxy_cache_key "$request_uri|$http_authorization";  # vary by auth
    proxy_cache_bypass $http_cache_control;
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://product-service:8080;
}

What to cache:
- GET requests with stable responses (product listings, user profiles)
- Public endpoints (no auth variation)
- Responses with explicit Cache-Control headers
What NOT to cache:
- POST/PUT/DELETE (non-idempotent)
- Responses with user-specific data (unless cache key includes user ID)
- Real-time data (stock prices, live scores)
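The "vary by auth" rule from the NGINX config above is the part people get wrong, so here is the same idea as a tiny in-process TTL cache. Everything here (class name, key layout) is illustrative only:

```python
import time

class ResponseCache:
    """Tiny TTL response cache keyed by (path, auth context)."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._entries = {}  # key -> (expires_at, response)

    def key(self, path: str, authorization: str = "") -> str:
        # Include the auth context so one user's cached response
        # can never be served to another user
        return f"{path}|{authorization}"

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if expires_at < time.monotonic():  # expired: evict and miss
            del self._entries[key]
            return None
        return response

    def put(self, key, response):
        self._entries[key] = (time.monotonic() + self.ttl, response)

cache = ResponseCache(ttl=300)
k = cache.key("/api/products", "Bearer abc")
cache.put(k, {"items": [1, 2, 3]})
print(cache.get(k))  # {'items': [1, 2, 3]}
```

A different Authorization header produces a different key, so `cache.get(cache.key("/api/products", "Bearer other"))` misses, exactly the behavior the `proxy_cache_key` with `$http_authorization` gives you in NGINX.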
Request Aggregation (BFF Pattern)
For mobile or web clients that need data from multiple services in a single call:
# BFF endpoint — aggregate multiple service calls
import asyncio
import aiohttp

async def get_user_dashboard(user_id: str):
    async with aiohttp.ClientSession() as session:
        # Fan out to multiple services in parallel
        user_task = session.get(f'http://user-service/users/{user_id}')
        orders_task = session.get(f'http://order-service/users/{user_id}/orders?limit=5')
        recs_task = session.get(f'http://recommendation-service/users/{user_id}')
        user_resp, orders_resp, recs_resp = await asyncio.gather(
            user_task, orders_task, recs_task,
            return_exceptions=True
        )
        # Aggregate responses (graceful degradation)
        result = {
            'user': await user_resp.json() if not isinstance(user_resp, Exception) else None,
            'recent_orders': await orders_resp.json() if not isinstance(orders_resp, Exception) else [],
            'recommendations': await recs_resp.json() if not isinstance(recs_resp, Exception) else [],
        }
        return result

BFF (Backend for Frontend): One gateway per client type. The mobile BFF returns less data, fewer images, and different aggregations than the web BFF.
API Versioning Strategies
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URL path | /v1/users, /v2/users | Clear, easy routing | URL pollution |
| Header | Api-Version: 2 | Clean URLs | Hidden, harder to test |
| Query param | /users?version=2 | Easy to test | Caching complications |
| Content negotiation | Accept: application/vnd.api.v2+json | RESTful | Complex |
# URL-path versioning at the gateway
location /v1/users/ {
    proxy_pass http://user-service-v1:8080/users/;
}
location /v2/users/ {
    proxy_pass http://user-service-v2:8080/users/;
}

Security at the Gateway
CORS
location /api/ {
    add_header Access-Control-Allow-Origin "https://myapp.com";
    add_header Access-Control-Allow-Methods "GET, POST, PUT, DELETE, OPTIONS";
    add_header Access-Control-Allow-Headers "Authorization, Content-Type";
    add_header Access-Control-Max-Age 86400;
    if ($request_method = OPTIONS) {
        return 204;
    }
    proxy_pass http://backend;
}

IP Whitelisting and Geo-blocking
# Allow only specific IPs for admin endpoints
location /admin/ {
    allow 10.0.0.0/8;
    allow 192.168.1.0/24;
    deny all;
    proxy_pass http://admin-service:8080;
}

Request Size Limits
client_max_body_size 10m;   # reject requests > 10MB
proxy_read_timeout 30s;     # timeout slow backends
proxy_connect_timeout 5s;

Observability
Every request through the gateway should generate three things: a structured log entry, metrics (latency, status, throughput), and a distributed trace context.
Structured Logging
{
  "timestamp": "2026-03-21T10:00:00Z",
  "method": "POST",
  "path": "/api/orders",
  "status": 201,
  "latency_ms": 45,
  "client_ip": "203.0.113.42",
  "user_id": "user:1001",
  "upstream": "order-service:8080",
  "request_id": "req-abc-123",
  "rate_limit_remaining": 73,
  "cache_status": "MISS"
}

Distributed Tracing
The gateway generates or propagates trace IDs:
import uuid

def add_trace_headers(request):
    # Generate a trace ID if one isn't already present
    trace_id = request.headers.get('X-Trace-ID', str(uuid.uuid4()))
    span_id = str(uuid.uuid4())[:16]
    request.headers['X-Trace-ID'] = trace_id
    request.headers['X-Span-ID'] = span_id
    request.headers['X-Request-ID'] = request.headers.get('X-Request-ID',
                                                          str(uuid.uuid4()))
    return request

API Gateway vs Service Mesh
| Aspect | API Gateway | Service Mesh (Istio/Linkerd) |
|---|---|---|
| Position | Edge (north-south traffic) | Internal (east-west traffic) |
| Clients | External (web, mobile, 3rd party) | Internal services only |
| Auth | JWT, OAuth, API keys | mTLS between services |
| Rate limiting | Per client/API key | Per service |
| Routing | Path/header based | Service-to-service |
| Protocol | HTTP/REST/GraphQL/WebSocket | gRPC, HTTP, TCP |
| Deployment | Dedicated proxy cluster | Sidecar per service |
Interview insight: They’re complementary, not competing. Use an API Gateway for external traffic and a service mesh for internal service-to-service communication.
Popular API Gateways Compared
| Feature | Kong | AWS API Gateway | Envoy | NGINX | Traefik |
|---|---|---|---|---|---|
| Type | Plugin-based | Managed | L7 proxy | Web server + proxy | Cloud-native proxy |
| Config | Declarative / Admin API | Console / CloudFormation | YAML / xDS | Config files | Auto-discovery |
| Rate limiting | Plugin (Redis-backed) | Built-in (per stage) | Filter | Lua / OpenResty | Plugin |
| Auth | Plugins (JWT, OAuth, etc) | Cognito, Lambda authorizer | ext_authz filter | Lua / modules | Middleware |
| gRPC | Yes | Yes | Native | Limited | Yes |
| WebSocket | Yes | Yes (v2) | Yes | Yes | Yes |
| Best for | General purpose, plugin ecosystem | AWS-native, serverless | Service mesh sidecar, high perf | Simple, battle-tested | Kubernetes-native |
Interview Cheat Sheet
When to Use an API Gateway
- Multiple backend services behind a single endpoint
- Need centralized auth, rate limiting, and logging
- Different client types (web, mobile, IoT) need different APIs
- API versioning and canary deployments
- Third-party API access with usage tracking
When NOT to Use
- Single monolith — a simple reverse proxy (NGINX) is enough
- Only internal traffic — use a service mesh instead
- Ultra-low latency — every proxy hop adds 1-5ms
Key Numbers
| Metric | Typical Value |
|---|---|
| Gateway latency overhead | 1-5 ms per request |
| Rate limit check (Redis) | < 1 ms |
| JWT validation | < 0.5 ms (local, no network call) |
| Connection pool to upstream | 100-1000 per service |
| Gateway instances (production) | 2-4 (behind LB) |
Single Point of Failure?
The gateway is on the critical path. Mitigate with:
- Multiple instances behind a load balancer (or DNS round-robin)
- Health checks — remove unhealthy gateway instances
- Graceful degradation — if rate limiter (Redis) is down, fail open
- Stateless design — any instance can handle any request (no sessions)
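"Fail open" on a rate-limiter outage is a one-function wrapper. A sketch, assuming `limiter_check` is any callable (like the Redis-backed `is_allowed` from earlier) that may raise a connection error when the backend is down:

```python
def fail_open_is_allowed(limiter_check, client_id: str) -> bool:
    """Admit the request if the rate-limiter backend is unreachable,
    rather than turning a Redis outage into a full API outage.
    Trade-off: abuse protection is gone for the duration of the outage,
    so this should page someone."""
    try:
        return limiter_check(client_id)
    except ConnectionError:
        # In a real gateway: log, emit a metric, alert
        return True

# Simulate a Redis outage
def broken_limiter(client_id):
    raise ConnectionError("redis unreachable")

print(fail_open_is_allowed(broken_limiter, "user:1001"))  # True
```

Security-critical checks (authentication) should do the opposite and fail closed; fail-open is only appropriate for protections whose absence degrades rather than breaches.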
Interview Answer Template
When designing an API Gateway:
- Why? — centralize cross-cutting concerns, decouple clients from internal topology
- Request pipeline — TLS → Auth → Rate Limit → Validate → Route → LB → Upstream
- Auth strategy — JWT for stateless, API keys for external consumers
- Rate limiting — token bucket per API key, sliding window for strict limits, backed by Redis
- Routing — path-based for service dispatch, header-based for versioning/canary
- Resilience — circuit breaker per upstream, retries with exponential backoff, timeouts
- Caching — response cache for GET endpoints, vary by auth context
- Observability — structured logs, distributed tracing (X-Trace-ID), Prometheus metrics
- Scaling — stateless horizontally-scaled instances behind an NLB
- BFF — one gateway per client type if mobile and web need different aggregations
Wrapping Up
An API Gateway is the control plane for your external API traffic. It lets your services focus on business logic while the gateway handles the boring-but-critical stuff: authentication, rate limiting, routing, resilience, and observability.
The mental model: think of it as a pipeline of middleware. Each stage in the pipeline either transforms the request, rejects it, or enriches it. The order matters: authenticate before rate limiting (so you know who to limit), rate limit before routing (so you reject early), and cache after routing (so you cache per-service responses).
Get the pipeline right, and your entire microservices architecture gets cleaner.