Interview Scenarios

Full end-to-end system design walkthroughs using the same 8-step framework that top candidates use. Each scenario is a real interview question — practice them until the framework becomes second nature.

The 8-Step Framework

1. Requirements
2. Estimation
3. High-Level Design
4. Component Deep Dive
5. Database Schema
6. API Design
7. Scalability
8. Monitoring

Design a Social Media Feed (Twitter/X)

Design a social media platform like Twitter/X where users can post short messages (tweets), follow other users, and view a personalized home timeline.

Here is the thing about this problem: it looks like a simple social app, but the moment the interviewer says '50 million followers,' it transforms into one of the hardest distributed systems questions out there. The question they are really asking is: when someone with 50 million followers tweets, how do you get that tweet into 50 million timelines without the system falling over?

This is the fanout problem, and it is what the entire interview hinges on:
• You COULD pre-compute every follower's timeline when a tweet is posted (fanout-on-write). Great for reads. But 50M Redis writes per celebrity tweet? That takes hours
• You COULD compute timelines on the fly when someone opens the app (fanout-on-read). No write amplification. But 200M daily users each querying hundreds of accounts? Your database melts
• The answer is neither extreme — it is a hybrid that uses both. And explaining WHY you need the hybrid is what separates senior candidates from junior ones
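
To make the hybrid concrete, here is a minimal in-memory sketch in Python. The `FeedService` class, the `celebrity_threshold` cutoff, and the plain-list timelines are all illustrative assumptions, not Twitter's actual implementation: small authors fan out on write, while celebrities are skipped at write time and merged in at read time.

```python
from collections import defaultdict

class FeedService:
    """Toy hybrid fanout. Uses in-memory dicts; a real system would use
    Redis timelines, a follower graph service, and timestamp-ordered merges."""
    def __init__(self, celebrity_threshold=10_000):  # hypothetical cutoff
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)           # author -> followers
        self.following = defaultdict(set)           # user -> accounts followed
        self.timelines = defaultdict(list)          # user -> precomputed tweet ids
        self.celebrity_tweets = defaultdict(list)   # celebrity -> own tweet ids

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post(self, author, tweet_id):
        if self.is_celebrity(author):
            # Fanout-on-read path: store once, merge at read time.
            self.celebrity_tweets[author].append(tweet_id)
        else:
            # Fanout-on-write path: push into every follower's timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append(tweet_id)

    def home_timeline(self, user):
        merged = list(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                merged.extend(self.celebrity_tweets[author])
        return merged
```

The key property to point out in an interview: a celebrity post costs one write regardless of follower count, and the per-read merge only touches the handful of celebrities a user follows.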

If you walk away with one insight, it is this: the fanout strategy is not just a Twitter thing. It is the fundamental tradeoff in any system delivering personalized content at scale — activity feeds, notification systems, news aggregators. Master this pattern and you can answer a dozen different interview questions.

45-60 minutes

Design a Distributed Lock Service

Design a distributed lock service that provides mutual exclusion across thousands of distributed nodes at 1M lock acquisitions per second with sub-5ms P99 latency. This is one of those foundational infrastructure problems that shows up everywhere once you start looking — payment systems need it to avoid double-charging, data pipelines need it for exactly-once processing, and leader election in every microservice cluster depends on it.

Here is what makes this problem genuinely hard, and why it trips up so many candidates: the core challenge sounds trivially simple — just make sure only one process holds a lock at any time. But then reality kicks in. Networks partition, clocks drift between machines, processes crash mid-operation, and JVM GC pauses can freeze a thread for hundreds of milliseconds while the rest of the world keeps moving. A naive Redis SETNX breaks here: a paused or partitioned client can keep operating on the protected resource after its lock has expired and been granted to someone else. Redlock tried to fix this with multi-node quorums, but Martin Kleppmann's famous critique showed it still has subtle safety holes.

ZooKeeper actually gets the correctness right (it is built on ZAB consensus), but it is painfully slow — 10-15ms per lock acquisition, which is a non-starter for hot paths. This question is one of the best litmus tests in system design because it separates engineers who genuinely understand distributed consensus from those who think locking is just a SETNX command.
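
One concrete remedy from Kleppmann's critique is fencing tokens: the lock service hands out a monotonically increasing token with every grant, and the protected resource rejects writes carrying a token older than one it has already accepted. A toy single-node Python sketch follows; the `LockService` and `Storage` classes are illustrative, and a real service would replicate this state via consensus (Raft, ZAB) rather than keep it in one process.

```python
import time

class LockService:
    """Single-node lock manager issuing monotonically increasing fencing tokens."""
    def __init__(self):
        self.locks = {}    # name -> (token, holder, expiry)
        self.counter = 0   # monotonic fencing-token source

    def acquire(self, name, holder, ttl=10.0, now=None):
        now = time.monotonic() if now is None else now
        current = self.locks.get(name)
        if current is not None and current[2] > now:
            return None    # lock is held and its lease has not expired
        self.counter += 1
        self.locks[name] = (self.counter, holder, now + ttl)
        return self.counter  # fencing token for this grant

class Storage:
    """Downstream resource that rejects writes with stale fencing tokens."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False   # stale holder: its lock expired and was re-granted
        self.highest_token = token
        self.data[key] = value
        return True
```

Even if a GC-paused client wakes up believing it still holds the lock, its old token is now lower than the newest grant, so the storage layer fences it off.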


Design a Service Registry & Discovery System

Design a service registry and discovery system that tracks 50K service instances across multiple datacenters, handles 10M discovery lookups per second with sub-5ms latency, and self-heals when instances crash or become unhealthy. If you have worked in any microservices environment, you know the pain this solves: Service A needs to find healthy instances of Service B to send requests to, but instances are ephemeral — containers start and stop constantly, auto-scaling adds and removes nodes, and rolling deployments cycle through the entire fleet.

Hard-coding IP addresses is obviously impossible at this scale. A service registry is essentially the phone book of microservices: services register themselves on startup, deregister on graceful shutdown, and other services look them up at runtime. Netflix Eureka, HashiCorp Consul, and AWS Cloud Map are all production implementations of this pattern, each with different tradeoff choices.

What makes this interview question particularly interesting is that it forces you to reason about registration protocols, health checking strategies (should the server poll instances or should instances heartbeat?), discovery patterns (does the client do the lookup or does a proxy?), consistency trade-offs (do you want an AP registry that is always available but might serve slightly stale data, or a CP registry that is always correct but might reject requests during partitions?), and the critical failure modes that can cascade a registry outage into a fleet-wide meltdown where nothing can find anything.
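
A minimal sketch of the heartbeat flavor, in Python. This is an AP-style registry; the 30-second TTL and the in-memory dicts are illustrative assumptions, though Eureka uses a broadly similar lease-and-renew model.

```python
import time
from collections import defaultdict

class ServiceRegistry:
    """Toy heartbeat-based registry: instances renew a lease, and lookups
    only return instances whose lease is still fresh (self-healing)."""
    def __init__(self, ttl=30.0):  # hypothetical lease TTL
        self.ttl = ttl
        self.instances = defaultdict(dict)  # service -> {instance_id: last_heartbeat}
        self.addresses = {}                 # instance_id -> (host, port)

    def register(self, service, instance_id, host, port, now=None):
        now = time.monotonic() if now is None else now
        self.instances[service][instance_id] = now
        self.addresses[instance_id] = (host, port)

    def heartbeat(self, service, instance_id, now=None):
        now = time.monotonic() if now is None else now
        if instance_id in self.instances[service]:
            self.instances[service][instance_id] = now

    def deregister(self, service, instance_id):
        self.instances[service].pop(instance_id, None)
        self.addresses.pop(instance_id, None)

    def lookup(self, service, now=None):
        """Return addresses of instances whose heartbeat is still fresh."""
        now = time.monotonic() if now is None else now
        fresh = [i for i, ts in self.instances[service].items()
                 if now - ts <= self.ttl]
        return [self.addresses[i] for i in fresh]
```

Note the tradeoff this encodes: a crashed instance can still be returned for up to one TTL, which is exactly the stale-but-available behavior an AP registry accepts.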


Design a Time-Series Database

Design a time-series database (TSDB) optimized for storing, compressing, and querying timestamped metric data — the kind of data that every engineering team generates but few think deeply about: infrastructure monitoring (CPU usage, request latency), IoT sensors (temperature, pressure readings every second), financial markets (stock prices and trade volumes at millisecond granularity), and application telemetry (events, counters, histograms).

What makes a TSDB fundamentally different from a general-purpose database like PostgreSQL is the workload shape. Writes are massive and append-only — you are inserting millions of data points per second and you almost never update or delete individual rows. Reads are almost always time-range scans — 'give me the average CPU across 500 servers for the last 24 hours.' And data has a natural lifecycle — yesterday's per-second data is valuable, but last year's per-second data is just expensive storage; you want to automatically downsample it to per-hour aggregates.

The core engineering challenge is designing a storage engine that is simultaneously write-optimized (leveraging the sequential, append-only nature of time-series writes) and read-optimized (using columnar layout and time-aware compression to achieve 10-16x compression ratios while keeping time-range queries fast). On top of that, you have to handle the 'cardinality explosion' problem: in a large monitoring system, the number of unique metric series (the cross-product of host x metric x tag combinations) can reach tens of millions, which stresses both memory for indexing and disk for storage.
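
The "time-aware compression" piece can be made concrete with the delta-of-delta trick popularized by Facebook's Gorilla paper: since metrics usually arrive at a fixed scrape interval, the difference between successive deltas is almost always zero. This simplified Python sketch stops at the integer level; a real TSDB would pack these values into variable-width bit codes, storing the common zero in a single bit.

```python
def delta_of_delta_encode(timestamps):
    """Encode a sorted timestamp series as (first, first_delta, delta_of_deltas).
    Regular scrape intervals make most delta-of-deltas zero, which bit-level
    encoders compress to almost nothing."""
    if len(timestamps) < 2:
        return (timestamps[0] if timestamps else None, None, [])
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return (timestamps[0], deltas[0], dods)

def delta_of_delta_decode(first, first_delta, dods):
    """Invert the encoding by re-accumulating deltas."""
    if first is None:
        return []
    if first_delta is None:
        return [first]
    out = [first, first + first_delta]
    delta = first_delta
    for d in dods:
        delta += d
        out.append(out[-1] + delta)
    return out
```

A series scraped every 60 seconds with occasional jitter encodes to a run of zeros plus a few small corrections, which is where the 10-16x ratios come from once values get similar treatment (XOR encoding in Gorilla's case).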


Design a Data Warehouse (Snowflake/BigQuery)

Design a cloud-native data warehouse that can store petabytes of structured data and execute complex analytical queries — joins across billion-row tables, multi-dimensional aggregations, window functions — in seconds to minutes. If you have used Snowflake, BigQuery, or Redshift, you know the user experience we are targeting: an analyst writes SQL, hits run, and gets results across terabytes of data in under a minute.

The fundamental insight that shapes this entire design is how different analytical workloads are from transactional ones. An OLTP database like PostgreSQL is optimized for single-row reads and writes — find one customer, update their balance. A data warehouse does the opposite: it scans millions of rows but typically reads only a handful of columns at a time. That workload shape drives us toward columnar storage, where storing each column contiguously on disk means a query reading 3 of 50 columns only touches 6% of the data.

The revolutionary architectural idea here is compute-storage separation, which Snowflake pioneered. Data lives in cheap, durable cloud object storage (S3/GCS), and computation is provided by elastic virtual warehouse clusters that can be created, resized, and destroyed independently. This separation is what enables you to run 10 concurrent analytical workloads on the same dataset without them stepping on each other, scale compute up for a heavy ad-hoc query and back down to zero when the analyst goes to lunch, and pay only for compute you actually use. You will design a columnar storage layer, an MPP query engine, an ETL/ELT ingestion pipeline, and a multi-tenant resource management system.
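
The columnar idea is easy to demonstrate. In this toy Python sketch (the `ColumnStore` class is illustrative, not how Snowflake actually lays out micro-partitions), each column is a separate contiguous list, so an aggregate over one column never touches the other 49.

```python
class ColumnStore:
    """Toy columnar table: each column is stored contiguously, so a query
    over one measure column plus one filter column reads only those two."""
    def __init__(self, columns):
        self.columns = columns  # name -> list of values, all equal length

    def avg(self, column, where=None):
        """Average `column`, optionally filtered by (filter_column, predicate)."""
        col = self.columns[column]
        if where is None:
            rows = range(len(col))
        else:
            filter_name, predicate = where
            filter_col = self.columns[filter_name]
            rows = [i for i in range(len(filter_col)) if predicate(filter_col[i])]
        values = [col[i] for i in rows]
        return sum(values) / len(values)
```

In a row store, the same query would deserialize every column of every row it scans; here the untouched columns are never read, which is the 6%-of-the-data argument in miniature.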


Design Real-Time Stream Processing System

Design a distributed real-time stream processing system like Apache Flink, Kafka Streams, or Apache Storm — a platform that continuously processes millions of events per second with sub-second latency. Unlike batch processing where you have the luxury of seeing all your data before computing a result, stream processing requires you to produce answers incrementally as data arrives, handle late data gracefully, and maintain correctness through failures — all simultaneously.

Core capabilities you need to nail:
• Stateful aggregations over time windows (tumbling, sliding, session) — because most useful stream computations accumulate state over time
• Exactly-once processing semantics — because 'at-least-once' means duplicate charges in a payment pipeline
• Late-arriving data handling via watermarks — because mobile events routinely arrive minutes after they occurred
• Distributed state with checkpointing for fault tolerance — because you cannot afford to reprocess hours of data after a crash
• Backpressure propagation when downstream systems are slow — because the alternative is buffering until you OOM
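
Two of those ideas, tumbling windows and watermarks, fit in a small sketch. This toy Python operator (the names and lateness policy are illustrative; Flink's actual windowing API is far richer) counts events per event-time window and only finalizes a window once the watermark, here defined as the maximum event time seen minus an allowed-lateness bound, has passed the window's end.

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Event-time tumbling-window counter with a watermark: a window only
    fires once the watermark passes its end, so late events within the
    lateness bound are still counted correctly."""
    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.counts = defaultdict(int)       # window start -> count (operator state)
        self.max_event_time = float("-inf")
        self.emitted = {}                    # finalized windows

    def process(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        start = (event_time // self.window_size) * self.window_size
        if start in self.emitted:
            return  # too late: window finalized (real engines may side-output this)
        self.counts[start] += 1
        self._advance_watermark()

    def _advance_watermark(self):
        watermark = self.max_event_time - self.allowed_lateness
        for start in [s for s in self.counts if s + self.window_size <= watermark]:
            self.emitted[start] = self.counts.pop(start)
```

The sketch makes the event-time vs. processing-time point tangible: an event with timestamp 9 arriving after one with timestamp 12 still lands in the right window, because windows close on the watermark, not on arrival order.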

Why this matters in practice: This is the engine behind Uber's real-time pricing, Netflix's per-title viewing analytics, LinkedIn's activity feeds, and every fraud detection system processing credit card transactions in real time. It is one of the most technically demanding system design questions because it requires simultaneously reasoning about distributed systems, time semantics (event time vs. processing time is not a theoretical distinction — it determines whether your results are correct), and fault tolerance under continuous, never-ending processing.


Design Gaming Platform

Design the backend infrastructure for a competitive online multiplayer game like League of Legends, Fortnite, Valorant, or Overwatch. This might be the most technically demanding system design problem on the list because it combines so many hard sub-problems, each of which could be its own interview question:
• A matchmaking system that must find balanced matches for millions of concurrent players in under 30 seconds
• A tick-based game server architecture running at 60-128Hz, where every single millisecond of latency is felt by players
• An authoritative server model where the server is the single source of truth (because if the client is trusted, people will cheat)
• Lag compensation techniques (client-side prediction, server reconciliation, entity interpolation) that make the game feel responsive even though packets take 20-100ms to cross the network
• A global leaderboard serving real-time rankings across millions of players
• An inventory and economy system managing billions of virtual item transactions with ACID guarantees
• An anti-cheat system that catches cheaters without punishing legitimate players with false positives

The core tension that runs through every design decision: the game must feel instantaneous and responsive to every individual player, while the server enforces a single, consistent, cheat-proof world state across all participants. Those two goals are fundamentally in conflict with each other when you have non-zero network latency, and the clever tricks used to bridge that gap (prediction, reconciliation, interpolation) are what make game networking one of the most fascinating areas of distributed systems.
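
Client-side prediction and server reconciliation are worth being able to sketch on a whiteboard. This toy Python client uses a 1-D movement model with illustrative names (real games reconcile full entity state per tick): it applies inputs locally for instant feedback, then on each authoritative snapshot rewinds to the server's state and replays any inputs the server has not yet processed.

```python
class PredictedClient:
    """Client-side prediction with server reconciliation, assuming a simple
    1-D position and per-input movement deltas."""
    def __init__(self):
        self.position = 0
        self.seq = 0
        self.pending = []  # (seq, dx) inputs not yet acknowledged by the server

    def apply_input(self, dx):
        """Predict immediately instead of waiting a full round trip."""
        self.seq += 1
        self.pending.append((self.seq, dx))
        self.position += dx
        return self.seq

    def on_server_snapshot(self, server_position, last_processed_seq):
        """Rewind to the authoritative state, then replay unacknowledged inputs."""
        self.position = server_position
        self.pending = [(s, dx) for s, dx in self.pending if s > last_processed_seq]
        for _, dx in self.pending:
            self.position += dx
```

When the server agrees with the prediction, the correction is invisible; when it disagrees (say, the server clamped movement against a wall), the client snaps to the authoritative position plus its replayed inputs, which is exactly the small rubber-banding players see under packet loss.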
