System Design

Interview-ready blueprint: requirements, APIs, data modeling, scaling, reliability — plus traps that fail candidates.

1. How to Answer (the senior structure)

The interviewer evaluates your thinking process more than the final diagram. Be explicit: assumptions, trade-offs, and risks.

✅ The 6-step flow

  1. Clarify scope + requirements (functional + non-functional).
  2. Back-of-the-envelope numbers (QPS, storage, bandwidth).
  3. High-level architecture (clients → edge → services → data).
  4. Data model + key flows (write path / read path).
  5. Scaling & reliability (cache, sharding, async, failure modes).
  6. Wrap with trade-offs + next steps (what you’d improve).

⚠️ What kills answers

  • Starting with tech choices before requirements (“use Kafka!”).
  • No numbers (cannot justify capacity decisions).
  • Ignoring failures (DB down, cache miss storm, retries).
  • No trade-offs (“this is the best”).

🎯 One-liner that wins

“I’ll start by clarifying requirements and sizing the system, then propose a simple baseline architecture, and iterate for scale and reliability.”

2. Requirements (always separate these)

Functional requirements

  • What features exist? (create, search, feed, upload, notify…)
  • Who are the users? permissions? roles?
  • What is the core object? (Post, Order, Message, File…)
  • What are the core flows? (write path vs read path)

Non-functional requirements

  • Latency (p95 / p99), throughput, availability.
  • Consistency requirements (strong vs eventual).
  • Durability (data loss tolerance).
  • Security, privacy, compliance, auditability.

🚨 Trap: “We need strong consistency everywhere”

Strong consistency is expensive at scale. Prefer strong consistency only where required (payments, inventory), and eventual consistency for feeds, analytics, search, notifications.

3. Back-of-the-envelope Estimation

You don’t need perfect math. You need order-of-magnitude reasoning to justify architecture decisions.

Quick template

  • DAU (daily active users) → estimate requests per user per day.
  • QPS = (requests/day) / 86,400 seconds.
  • Storage = objects/day × size × retention.
  • Bandwidth = QPS × avg response size.
# Example (rough): 1M DAU, 20 actions/day
requests/day = 1,000,000 * 20 = 20,000,000
avg QPS = 20,000,000 / 86,400 ≈ 231 QPS
peak QPS ~ 10x average → ~2,300 QPS

# Storage example: 500k uploads/day, 200KB each
daily storage = 500,000 * 200KB ≈ 100GB/day
year storage ≈ 36TB (before compression/retention)

⚠️ Interview tip

Always call out peak traffic and hot keys (some users and objects receive far more traffic than the average).
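
If it helps, the same arithmetic can be scripted so you can re-run it with different assumptions; a tiny Python sketch (DAU, actions/day, and the 10x peak multiplier mirror the example above):
# Python sketch: redo the sizing with different assumptions (numbers mirror the example above)
def estimate(dau, actions_per_day, peak_multiplier=10):
    requests_per_day = dau * actions_per_day
    avg_qps = requests_per_day / 86_400          # seconds per day
    return avg_qps, avg_qps * peak_multiplier

avg_qps, peak_qps = estimate(1_000_000, 20)
print(f"avg ~{avg_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")   # avg ~231 QPS, peak ~2315 QPS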

4. High-Level Design (baseline first)

Start simple (monolith / 3-tier), then evolve. Your first design should work at small scale.

// Typical baseline architecture
Clients → CDN/Edge → API Gateway/Load Balancer → App Service(s) → Database + Cache
// Optional for scale
App Service(s) → Queue/Stream → Worker(s) → Search/Analytics/Notifications

Edge layer

  • CDN for static assets, images, videos.
  • WAF, rate limiting, bot protection.
  • Geo routing when needed.

Service layer

  • Stateless services → easy horizontal scaling.
  • Idempotency keys for writes (safe retries).
  • Async workers for heavy tasks.

🚨 Trap: designing microservices immediately

In interviews, starting with 12 microservices often comes across as cargo-culting. Start with a modular monolith, then split services only when boundaries and scaling needs are clear.

5. Data Model (what you store drives everything)

Define your primary entities and how queries work. Then pick storage: relational vs document vs KV vs search.

Relational (SQL)

  • Strong consistency, transactions, joins.
  • Great for orders/payments/inventory.
  • Scaling: read replicas, sharding, partitioning.

NoSQL (Document / KV)

  • Flexible schema, high throughput.
  • Great for feeds, sessions, caching, user profiles.
  • Trade-off: joins/transactions limited or complex.
// Example: minimal tables for a "Post + Feed" system
users(id, name, created_at)
posts(id, user_id, content, created_at)
follows(follower_id, followee_id)

// Feed options
// 1) Fan-out on read: compute feed at read time (heavy reads)
// 2) Fan-out on write: push post ids into followers' feed (heavy writes)
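
To make option 2 concrete, a minimal Python sketch of fan-out on write, using in-memory dicts as stand-ins for the follows table and per-user feed lists (structures and function names are illustrative, not a specific datastore API):
# Python sketch: fan-out on write (in-memory stand-ins, illustrative only)
from collections import defaultdict

follows = defaultdict(set)      # followee_id -> set of follower_ids
feeds = defaultdict(list)       # user_id -> list of post_ids, newest first

def publish_post(author_id, post_id):
    # Push the new post id into every follower's precomputed feed (heavy write path).
    for follower_id in follows[author_id]:
        feeds[follower_id].insert(0, post_id)

def read_feed(user_id, limit=20):
    # Read path is now a cheap list slice; the heavy work was done at write time.
    return feeds[user_id][:limit]

Fan-out on read instead queries the followees' recent posts at request time; very popular accounts usually need a hybrid of the two.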

⚠️ Practical heuristic

If your main query is “get by key” → KV store. If you need “search by text” → search engine. If you need “transactions + constraints” → SQL.

6. API Design (contracts + idempotency)

✅ Good API rules

  • Explicit versioning when breaking changes happen.
  • Consistent error schema (RFC 7807 problem+json style).
  • Pagination for large lists.
  • Idempotency key for POST writes.

🚨 Common mistakes

  • Chatty APIs (too many round-trips).
  • Leaking internal DB ids where public, opaque ids are needed.
  • No rate limits → DoS risk.
  • No retries/timeout guidance for clients.
# Example endpoints (generic)
POST /v1/posts # create post (Idempotency-Key header)
GET /v1/posts/{id}
GET /v1/feed?cursor=...&limit=20
POST /v1/follows # follow user

# Idempotency
Idempotency-Key: 8b6f... (store key → response mapping for a time window)
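
A minimal sketch of how the server side can honor that header, with an in-memory dict standing in for the key → response store (names and TTL are assumptions):
# Python sketch: idempotent POST write (in-memory stand-in; real systems keep this in Redis/DB)
import time

idem_store = {}                 # idempotency_key -> (response, stored_at)
IDEM_TTL = 24 * 3600            # keep keys for a bounded window (assumption)

def create_post(idempotency_key, payload, insert_fn):
    entry = idem_store.get(idempotency_key)
    if entry and time.time() - entry[1] < IDEM_TTL:
        return entry[0]                          # duplicate request → replay the stored response
    response = insert_fn(payload)                # the actual write happens once
    idem_store[idempotency_key] = (response, time.time())
    return response

A production version also has to handle two concurrent requests with the same key, for example via a unique constraint or a SETNX-style lock.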

7. Caching (speed without lying)

Caching improves latency and reduces DB load, but introduces invalidation complexity (hardest problem).

Client/CDN

Static assets, images, public content.

Service Cache

Hot objects, computed results, rate limits.

DB Cache

Read replicas + indexes often beat “more cache”.

🚨 Cache stampede

When a hot key expires, many requests hit the DB at once. Mitigate with jittered TTL, request coalescing, stale-while-revalidate.

// Cache-aside pattern (concept)
// 1) Try cache
// 2) If miss → read DB
// 3) Set cache with TTL
// 4) Return response
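
The same pattern as a compact Python sketch, with a jittered TTL to soften the stampede problem above (the cache client and loader function are illustrative stand-ins):
# Python sketch: cache-aside with jittered TTL (cache client and load_fn are illustrative)
import random

def get_with_cache(key, cache, load_fn, base_ttl=300):
    value = cache.get(key)
    if value is not None:
        return value                             # 1) cache hit
    value = load_fn(key)                         # 2) miss → read from the DB
    ttl = base_ttl + random.randint(0, 60)       # 3) jitter so hot keys don't all expire together
    cache.set(key, value, ttl)
    return value                                 # 4) return response

Request coalescing (only one caller loads on a miss while the rest wait) is the other half of the stampede defense and is not shown here.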

8. Database Scaling (the real bottleneck)

✅ Common techniques

  • Indexes for critical query patterns.
  • Read replicas for read-heavy systems.
  • Partitioning (time-based, tenant-based).
  • Sharding by key (user_id, tenant_id).

⚠️ Trade-offs

  • Cross-shard queries become expensive.
  • Transactions across shards are painful.
  • Operational complexity increases a lot.
  • Hot partitions/hot shards need special care.

🚨 Trap: “Sharding solves everything”

Sharding is a last resort. First: better indexes, query patterns, caching, read replicas, and data model changes.
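
If you do end up sharding, routing is usually a stable hash of the shard key; a minimal Python sketch (shard count and key choice are assumptions) that also shows why resharding and hot keys hurt:
# Python sketch: hash-based shard routing (shard count and key choice are assumptions)
import hashlib

NUM_SHARDS = 16

def shard_for(user_id: str) -> int:
    # Stable hash so the same user always lands on the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Changing NUM_SHARDS remaps most keys (why resharding is painful), and one very hot
# user_id still concentrates its load on a single shard (hot shard).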

9. Queues & Streams (async decoupling)

Use async messaging to protect your core path: slow tasks go to workers, services are loosely coupled.

Queue (task distribution)

  • Point-to-point consumption (one worker handles a job).
  • Great for emails, image processing, background jobs.
  • Supports retries + dead-letter queues (DLQ).

Stream (event log)

  • Multiple consumers can read the same events.
  • Great for analytics, projections, event-driven architectures.
  • Ordering is guaranteed only within a partition (choose the partition key accordingly).
// Reliable events: Outbox pattern (concept)
// 1) DB transaction: save aggregate + save outbox event
// 2) Worker publishes outbox events to broker
// 3) Mark as sent (idempotent publish)

🚨 Trap: “Exactly-once delivery”

In practice, design for at-least-once delivery and make consumers idempotent.
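
A minimal Python sketch of an idempotent consumer under at-least-once delivery, with an in-memory set standing in for a dedup table (event shape and handler are illustrative):
# Python sketch: idempotent consumer for at-least-once delivery (in-memory dedup stand-in)
processed_ids = set()   # in production: a dedup table or unique constraint keyed by event id

def handle_event(event, apply_fn):
    if event["id"] in processed_ids:
        return                       # duplicate delivery → safe no-op
    apply_fn(event)                  # the actual side effect, executed once per event id
    processed_ids.add(event["id"])   # record only after the effect succeeded

In a real system the side effect and the dedup record should commit in the same transaction; otherwise a crash between the two reintroduces the duplicate.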

10. Consistency (choose it intentionally)

Strong consistency

Required for money, inventory, permissions, critical state transitions.

  • Single-writer or transactional DB.
  • Careful with distributed transactions.

Eventual consistency

Fine for feeds, search, analytics, notifications.

  • Write DB first, update projections async.
  • Users might see stale data briefly.

⚠️ Useful interview language

“I’ll keep the source of truth strongly consistent, and build read models/search as eventually consistent projections.”

11. Resilience (timeouts, retries, bulkheads)

✅ Core patterns

  • Timeouts everywhere (fail fast).
  • Retries with backoff for transient errors only.
  • Circuit breaker to stop cascading failures.
  • Bulkhead isolation (separate pools/limits).
  • Rate limiting to prevent overload.

🚨 Traps

  • Retrying on 4xx client errors, which are not transient (429 with backoff is the usual exception).
  • No idempotency on writes → duplicate orders.
  • Unbounded queues → memory explosion.
  • No backpressure strategy.
// Reliability checklist
// - Define SLOs (latency/error rate)
// - Timeouts + bounded retries
// - Circuit breaker
// - Load shedding / rate limiting
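
A small Python sketch of the first two checklist items: bounded retries with exponential backoff and a per-attempt deadline (the call_fn parameter, error type, and limits are assumptions):
# Python sketch: bounded retries with backoff + jitter (call_fn, error type, limits are assumptions)
import random, time

class TransientError(Exception):
    """Stand-in for retryable failures (timeouts, 503s); 4xx should not be retried."""

def call_with_retries(call_fn, max_attempts=3, base_delay=0.2, timeout=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_fn(timeout=timeout)       # fail fast: every attempt has a deadline
        except TransientError:
            if attempt == max_attempts:
                raise                             # retry budget exhausted → surface the error
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay))   # jitter avoids synchronized retries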

12. Observability (you can’t fix what you can’t see)

Structured Logs

Searchable, consistent fields, correlation IDs.

Metrics

RED (rate, errors, duration) and USE (utilization, saturation, errors) metrics, dashboards, alerting on SLOs.

Traces

Distributed tracing across services.

// Example log fields (concept)
traceId: "4bf92f...", spanId: "00f067..."
service: "orders", route: "POST /v1/orders"
latencyMs: 128, status: 201
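
Emitting those fields as one JSON line per request keeps logs searchable and correlatable; a minimal Python sketch (field names follow the example above, the rest is illustrative):
# Python sketch: one structured JSON log line per request (field names follow the example above)
import json, sys, time

def log_request(trace_id, span_id, service, route, latency_ms, status):
    record = {
        "ts": time.time(), "traceId": trace_id, "spanId": span_id,
        "service": service, "route": route,
        "latencyMs": latency_ms, "status": status,
    }
    sys.stdout.write(json.dumps(record) + "\n")   # one line = one event, easy to index

log_request("4bf92f...", "00f067...", "orders", "POST /v1/orders", 128, 201)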

13. Security (trust boundaries)

✅ Must mention

  • AuthN/AuthZ (JWT/OIDC) + least privilege.
  • Rate limiting + WAF for edge protection.
  • Secrets management (not in git, not in images).
  • Audit logs for critical actions.

🚨 Traps

  • Trusting client-provided IDs/roles without validation (see the sketch below).
  • No encryption at rest for sensitive data.
  • No threat model for public endpoints.
  • Overly permissive service-to-service calls.
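
To make the first trap concrete: roles and owner ids must come from a verified session or token, never from the request body. A minimal Python sketch of a server-side scope check (scope names and session shape are assumptions):
# Python sketch: server-side authorization check (scope names and session shape are assumptions)
ROLE_SCOPES = {
    "viewer": {"posts:read"},
    "editor": {"posts:read", "posts:write"},
}

def authorize(session, required_scope):
    # session must come from a verified token/cookie, never from the request body
    scopes = ROLE_SCOPES.get(session["role"], set())
    if required_scope not in scopes:
        raise PermissionError(f"missing scope: {required_scope}")

authorize({"user_id": "u1", "role": "viewer"}, "posts:read")     # ok
# authorize({"user_id": "u1", "role": "viewer"}, "posts:write")  # would raise PermissionError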

14. Mini Case Study: “Design a URL Shortener”

This is a classic. The goal is to show the method: requirements → API → storage → scaling.

Core requirements

  • Shorten a long URL → return short code.
  • Redirect short code → long URL.
  • Optional: custom alias, expiration, analytics.

High-level design

  • Edge/CDN for redirects (super read-heavy).
  • Cache shortCode → longUrl (hot keys).
  • DB as source of truth; async analytics events.
# API
POST /v1/shorten { url, ttl? } → { code }
GET /{code} → 301/302 redirect (301 is cached by browsers; use 302 if you need per-click analytics)

# Storage
urls(code PK, long_url, created_at, expires_at, owner_id?)

# Scaling
Cache-aside for redirects + CDN caching for popular codes

🚨 Trap: generating codes

Don’t use “random until unused” at scale. Prefer a deterministic scheme: base62-encode a unique ID from a database sequence or a Snowflake-style ID generator (with time/shard/region bits).
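
A minimal Python sketch of the base62 approach: encode a unique numeric ID into a short code (the alphabet ordering and the ID source are assumptions):
# Python sketch: base62 code from a unique numeric id (alphabet ordering and id source are assumptions)
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

print(encode_base62(11157))   # "2TX" — deterministic: same id → same code, no collision retry loop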

15. Winning Interview Answers

“How do you scale reads?”

“I scale reads using caching + read replicas + denormalized read models, and I protect hot keys with TTL jitter and request coalescing.”

“How do you handle failures?”

“Timeouts first, retries with backoff only for transient errors, circuit breakers to avoid cascading failures, and idempotency for safe retries.”

“How do you publish events reliably?”

“I use the Outbox pattern: save state + event in the same DB transaction, then a worker publishes events idempotently.”

16. Common Traps (avoid these)

🚨 No numbers

Without QPS/storage estimates, your scaling decisions look random.

🚨 Overengineering too early

Don’t start with sharding, Kafka, and microservices. Start simple and iterate based on bottlenecks.

🚨 Ignoring operational reality

Backups, migrations, monitoring, incident response, and runbooks matter in real systems.