System Design Interview Practice Questions for Technical Architects
Below are system design interview questions commonly asked for technical architect roles, each with follow-up prompts and outline sample answers.
1. Design a Scalable URL Shortener (e.g., bit.ly)
How would you handle billions of URLs and high read/write throughput?
- Use a distributed, horizontally scalable database (e.g., Cassandra, DynamoDB, or sharded MySQL/Postgres).
- Employ caching (e.g., Redis, Memcached) for frequently accessed URLs.
- Use load balancers and stateless application servers for scaling.
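The caching bullet above is commonly implemented as a cache-aside read path. The following is a minimal sketch, not a production design: the `cache` and `database` dictionaries are hypothetical stand-ins for Redis/Memcached and the URL store, and `resolve` is an illustrative name.

```python
from typing import Optional

# Hypothetical stand-ins: in production these would be Redis/Memcached
# and the sharded URL database.
cache: dict[str, str] = {}
database: dict[str, str] = {"abc123": "https://example.com/some/long/path"}


def resolve(short_code: str) -> Optional[str]:
    """Return the long URL for a short code, or None if it is unknown."""
    # 1. Try the cache first; most redirects should be served from here.
    long_url = cache.get(short_code)
    if long_url is not None:
        return long_url
    # 2. On a miss, fall back to the database.
    long_url = database.get(short_code)
    # 3. Populate the cache so the next lookup is served from memory.
    if long_url is not None:
        cache[short_code] = long_url
    return long_url
```

Unknown codes are deliberately not cached here; a real system might cache negative results for a short time to protect the database from repeated misses.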
What database(s) would you use and why?
- NoSQL databases like DynamoDB or Cassandra for high write throughput and scalability.
- RDBMS with sharding if strong consistency is needed.
How would you prevent collisions and ensure short URL uniqueness?
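- Generate a unique numeric ID per URL (e.g., an auto-incrementing ID or a distributed ID generator) and base62-encode it; unique IDs yield unique short codes without a collision check.
- If codes are instead derived from hashes, UUIDs, or random values and truncated to stay short, check for collisions before assigning a new short URL and retry on conflict.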
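Base62 encoding of a numeric ID can be sketched as follows. The alphabet ordering (digits, then lowercase, then uppercase) is an assumption; any fixed 62-character alphabet works as long as encoder and decoder agree.

```python
import string

# Assumed ordering: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer ID as a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    """Decode a base62 string back to the integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Seven base62 characters cover 62^7, roughly 3.5 trillion, distinct IDs, which is why short codes of six to eight characters are usually sufficient.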
How would you handle analytics and abuse prevention?
- Store analytics in a separate data store (e.g., time-series DB).
- Rate-limit API usage and monitor for suspicious activity.
- Use CAPTCHAs or authentication for bulk/automated requests.
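One common way to rate-limit API usage is a token bucket per client. This is a single-process sketch under that assumption; the class name and parameters are illustrative, and a distributed deployment would keep the counters in a shared store instead.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A request that gets `False` would typically be rejected with HTTP 429 (Too Many Requests).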
2. Design a Global E-Commerce Platform
How would you architect a system to support millions of users, global inventory, and multi-currency transactions?
- Use microservices for modularity (user, product, order, payment, etc.).
- Deploy services in multiple regions for low latency.
- Use a global CDN for static assets.
- Support multi-currency pricing via a currency conversion microservice.
How would you ensure high availability and disaster recovery?
- Multi-region deployments with failover.
- Regular backups and automated disaster recovery drills.
- Use managed services with built-in HA (e.g., managed DBs, queues).
How would you handle catalog search and recommendations at scale?
- Use search engines like Elasticsearch or Solr for catalog search.
- Use recommendation engines (collaborative filtering, ML models) with batch and real-time processing.
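As a rough illustration of the collaborative filtering idea, the sketch below recommends items that were most often bought together with a given item. It is a deliberately simplified co-occurrence count, not a trained ML model, and the function name and sample baskets are hypothetical.

```python
from collections import Counter


def recommend(orders: list[set[str]], item: str, top_n: int = 3) -> list[str]:
    """Return the items most frequently bought together with `item`."""
    counts: Counter = Counter()
    for basket in orders:
        if item in basket:
            # Count every other item that shared a basket with `item`.
            counts.update(basket - {item})
    return [name for name, _ in counts.most_common(top_n)]


orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse", "bag", "dock"},
    {"phone", "case"},
]
print(recommend(orders, "laptop"))
```

In a batch pipeline these counts would be precomputed offline; a real-time path would update or re-rank them as new activity arrives.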
3. Design a Real-Time Chat Application (e.g., WhatsApp, Slack)
How would you support millions of concurrent users and message delivery guarantees?
- Use WebSockets for real-time communication.
- Employ message queues (e.g., Kafka, RabbitMQ) for reliable delivery.
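- Scale horizontally by adding chat servers behind a load balancer; because each server holds live WebSocket connections, keep session and routing state (which server holds which user's connection) in a shared store so any server can be replaced.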
How would you design for message ordering, delivery, and offline support?
- Use message sequence numbers and persistent storage.
- Store undelivered messages and deliver when users reconnect.
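The sequence-number and offline-delivery points can be sketched together: each conversation gets a monotonically increasing sequence number, and a reconnecting client asks for everything after the last number it saw. `ChatStore` is a hypothetical in-memory stand-in for persistent storage.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    conversation_id: str
    seq: int  # per-conversation sequence number, starting at 1
    sender: str
    body: str


class ChatStore:
    """In-memory stand-in for a persistent message log."""

    def __init__(self) -> None:
        self._next_seq: dict[str, int] = defaultdict(int)
        self._log: dict[str, list[Message]] = defaultdict(list)

    def append(self, conversation_id: str, sender: str, body: str) -> Message:
        """Assign the next sequence number and persist the message."""
        self._next_seq[conversation_id] += 1
        msg = Message(conversation_id, self._next_seq[conversation_id], sender, body)
        self._log[conversation_id].append(msg)
        return msg

    def fetch_after(self, conversation_id: str, last_seen_seq: int) -> list[Message]:
        """Return messages the client has not seen yet, in order."""
        return [m for m in self._log[conversation_id] if m.seq > last_seen_seq]
```

On reconnect, a client that last saw sequence number 41 would call `fetch_after(conversation_id, 41)`; gaps in received sequence numbers also let clients detect missing messages.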
How would you handle group chats and media sharing?
- Use group IDs and broadcast messages to group members.
- Store media in object storage (e.g., S3, GCS) and share links.
4. Design a Video Streaming Platform (e.g., YouTube, Netflix)
How would you handle video upload, encoding, storage, and global delivery?
- Use an upload microservice that triggers encoding jobs (e.g., via a queue).
- Store videos in object storage (S3, GCS, Azure Blob).
- Use a CDN for global delivery.
How would you design for adaptive bitrate streaming and CDN integration?
- Encode videos at multiple bitrates and package them for HLS and DASH.
- Use manifest files to allow clients to switch streams based on bandwidth.
- Integrate with CDN for edge caching.
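On the client side, stream switching amounts to picking the highest-bitrate variant that fits the measured bandwidth. The sketch below assumes a hypothetical four-step bitrate ladder and a safety factor of 0.8; real players also consider buffer level and switch history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    bandwidth_bps: int  # bitrate advertised for this variant in the manifest


# Hypothetical bitrate ladder, as it might be listed in a manifest.
LADDER = [
    Variant("240p", 400_000),
    Variant("480p", 1_200_000),
    Variant("720p", 2_800_000),
    Variant("1080p", 5_000_000),
]


def choose_variant(ladder: list[Variant], measured_bps: float, safety: float = 0.8) -> Variant:
    """Pick the best variant that fits within a fraction of the measured bandwidth."""
    budget = measured_bps * safety
    affordable = [v for v in ladder if v.bandwidth_bps <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest bitrate rather than stalling.
        return min(ladder, key=lambda v: v.bandwidth_bps)
    return max(affordable, key=lambda v: v.bandwidth_bps)
```

For example, with 4 Mbps measured, the budget is 3.2 Mbps, so the 720p variant is chosen.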
How would you support recommendations and personalized feeds?
- Use ML models for recommendations (collaborative filtering, content-based).
- Store user activity and preferences for personalization.
5. Design a Ride-Sharing Service (e.g., Uber, Lyft)
How would you match riders and drivers in real time?
- Use geospatial indexing (e.g., QuadTree, Geohash) to find nearby drivers.
- Use real-time messaging (WebSockets, push notifications) for updates.
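To make the geospatial indexing point concrete, here is a sketch of standard Geohash encoding: latitude and longitude bits are interleaved (longitude first) and every five bits become one base32 character. The function name and default precision are illustrative choices.

```python
# Geohash base32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Drivers can then be bucketed by geohash prefix, so a nearby-driver lookup becomes a key lookup on the rider's cell. Because two close points can fall on opposite sides of a cell boundary, a matching service would also need to search the neighboring cells.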
How would you handle surge pricing, location tracking, and trip histories?
- Calculate surge pricing based on demand/supply metrics.
- Track locations using periodic GPS updates and store them in a time-series DB.
- Store trip histories in a relational or NoSQL DB.
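A surge multiplier driven by the demand/supply ratio might look like the sketch below. The formula, the cap of 3.0, and the function name are assumptions made for illustration; real pricing systems use far richer signals.

```python
def surge_multiplier(open_requests: int, available_drivers: int, cap: float = 3.0) -> float:
    """Return a price multiplier between 1.0 and `cap` for one geographic area."""
    if available_drivers <= 0:
        # No supply at all: charge the maximum allowed multiplier.
        return cap
    ratio = open_requests / available_drivers
    # Never discount below 1.0 and never exceed the cap.
    return round(min(cap, max(1.0, ratio)), 2)
```

With 30 open requests and 20 available drivers, the ratio is 1.5, so fares in that area would be multiplied by 1.5.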
How would you ensure data consistency and low-latency updates?
- Use eventual consistency for non-critical data, strong consistency for payments/trips.
- Use in-memory data stores for fast lookups.
6. Design a Distributed Logging and Monitoring System
How would you collect, store, and analyze logs from thousands of servers?
- Use log shippers (e.g., Fluentd, Logstash) to collect logs.
- Store logs in a scalable store (e.g., Elasticsearch, S3, BigQuery).
- Use a centralized logging service with search and analytics.
How would you design for alerting, dashboards, and scalability?
- Use monitoring tools (e.g., Prometheus, Grafana) for metrics and dashboards.
- Set up alerting rules for anomalies and thresholds.
- Partition logs by time and source for scalability.
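Partitioning by time and source usually comes down to deriving a partition key from each log record. The key layout below (UTC hour, then source) is an assumed convention, not a requirement of any particular store.

```python
from datetime import datetime, timezone


def partition_key(source: str, timestamp: datetime) -> str:
    """Build an hourly partition key; expects a timezone-aware timestamp."""
    # Normalize to UTC so servers in different regions agree on the partition.
    ts = timestamp.astimezone(timezone.utc)
    return f"{ts:%Y/%m/%d/%H}/{source}"


key = partition_key("web-01", datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc))
print(key)
```

Time-first keys make retention cheap, since whole hourly or daily partitions can be dropped at once, while the source suffix narrows searches to a single host or service.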
How would you ensure data privacy and retention policies?
- Encrypt logs at rest and in transit.
- Implement log retention and deletion policies.
- Mask or redact sensitive data before storage.
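Redaction before storage can be sketched with a few regular expressions. The two patterns below (email addresses and 13 to 16 digit card-like numbers) are simplified assumptions; production redaction needs a reviewed, much broader rule set.

```python
import re

# Simplified, assumed patterns; they will not catch every format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d{13,16}\b")


def redact(line: str) -> str:
    """Replace sensitive-looking substrings with fixed placeholders."""
    line = EMAIL.sub("[EMAIL]", line)
    line = CARD.sub("[CARD]", line)
    return line


print(redact("user bob@example.com paid with 4111111111111111"))
```

Running redaction in the log shipper, before logs leave the host, keeps sensitive values out of every downstream store.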
7. Design a Multi-Tenant SaaS Platform
How would you isolate data and resources between tenants?
- Use separate databases or schemas per tenant, or row-level security.
- Isolate compute resources using containers or VMs.
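For the shared-table option, isolation depends on every query being filtered by tenant. The sketch below enforces that in application code with a tenant-bound repository; it approximates row-level isolation and is not database-enforced row-level security (which, for example, PostgreSQL provides through row security policies). The table and class names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoices (tenant_id, amount_cents) VALUES (?, ?)",
    [("acme", 1000), ("acme", 2500), ("globex", 700)],
)


class TenantScopedInvoices:
    """Every query is bound to one tenant, so callers cannot omit the filter."""

    def __init__(self, connection: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = connection
        self._tenant_id = tenant_id

    def list_amounts(self) -> list[int]:
        rows = self._conn.execute(
            "SELECT amount_cents FROM invoices WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [row[0] for row in rows]
```

Separate schemas or databases per tenant trade this query discipline for stronger isolation at a higher operational cost.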
How would you handle onboarding, billing, and tenant-specific customizations?
- Automate onboarding with self-service portals.
- Integrate with payment gateways for billing.
- Use feature flags/configurations for customizations.
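Tenant-specific customization through feature flags can be as simple as platform-wide defaults plus per-tenant overrides. The flag names and tenant IDs below are hypothetical, and a real platform would load them from a configuration service rather than module constants.

```python
# Hypothetical platform-wide defaults.
DEFAULT_FLAGS = {"new_dashboard": False, "export_csv": True}

# Hypothetical per-tenant overrides; only differences from the defaults are stored.
TENANT_OVERRIDES = {"acme": {"new_dashboard": True}}


def is_enabled(flag: str, tenant_id: str) -> bool:
    """Resolve a flag for a tenant: override first, then default, else off."""
    overrides = TENANT_OVERRIDES.get(tenant_id, {})
    if flag in overrides:
        return overrides[flag]
    return DEFAULT_FLAGS.get(flag, False)
```

Treating unknown flags as disabled is a conservative choice: a typo in a flag name cannot accidentally turn a feature on.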
How would you design for extensibility and plugin support?
- Provide APIs and webhooks for integrations.
- Use a plugin architecture (e.g., via microservices or serverless functions).
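Inside a single process, a plugin architecture often starts as an event registry that extensions subscribe to. The sketch below is one minimal way to do that, with hypothetical names; webhooks follow the same shape, with an HTTP call in place of the function call.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class PluginRegistry:
    """Maps event names to the plugin handlers subscribed to them."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def register(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, payload: dict) -> int:
        """Call every handler for the event and return how many succeeded."""
        delivered = 0
        for handler in self._handlers.get(event, []):
            try:
                handler(payload)
                delivered += 1
            except Exception:
                # Isolate plugin failures so one plugin cannot break the core flow.
                # A real system would log the error and possibly retry.
                continue
        return delivered
```

Catching exceptions per handler is the key design choice here: third-party code is treated as untrusted, and the platform keeps running when a plugin misbehaves.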
8. Design a Secure Online Banking System
How would you ensure transaction security, auditability, and compliance?
- Use end-to-end encryption, strong authentication (MFA), and token-based authorization (OAuth2).
- Maintain audit logs for all transactions and access.
- Comply with applicable standards and regulations (e.g., PCI DSS, GDPR).
How would you handle fraud detection and prevention?
- Use ML models to detect anomalies and flag suspicious transactions.
- Implement real-time monitoring and alerts.
- Use device fingerprinting and behavioral analytics.
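As a stand-in for a trained fraud model, the sketch below flags a transaction whose amount is far from the account's historical mean, using a z-score. The threshold of 3.0 and the function name are assumptions; this is a simple statistical rule, not an ML model.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], amount: float, threshold: float = 3.0) -> bool:
    """Flag `amount` if it is more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        # Not enough data to estimate spread; defer to other checks.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts are identical, so any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold
```

A flagged transaction would typically trigger step-up authentication or manual review rather than an outright block, since statistical rules produce false positives.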
How would you design for high availability and regulatory requirements?
- Deploy in multiple regions with failover.
- Use redundant infrastructure and regular DR testing.
- Ensure data residency and compliance with local regulations.
Tip: For each question, discuss trade-offs, scalability, reliability, security, and cost considerations. Draw diagrams and justify your technology choices.
Glossary
- Anomalies: Deviations from the expected behavior, often used in fraud detection
- API: Application Programming Interface
- API Gateway: A server that acts as an API front-end, receiving API requests and routing them to the appropriate backend service
- Audit Logs: Records that provide a chronological sequence of events related to system activity
- Auto-Incrementing ID: A database-generated unique identifier that increases automatically with each new record
- Base62 Encoding: A method for encoding numeric values using 62 alphanumeric characters, often used for URL shorteners
- Behavioral Analytics: The analysis of user behavior patterns to detect anomalies or predict future actions
- CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
- CDN: Content Delivery Network
- CI/CD: Continuous Integration/Continuous Delivery (or Continuous Deployment)
- Clustering: Grouping multiple servers or nodes to work together for scalability and redundancy
- Compliance: Adherence to laws, regulations, and standards governing data security and privacy
- Container Orchestration: Automated management of containerized applications (e.g., Kubernetes)
- Data Residency: The requirement that data be stored within a specific geographic location
- Device Fingerprinting: A technique used to identify devices based on their unique characteristics
- Disaster Recovery (DR): Strategies and processes for restoring systems after a catastrophic failure
- Domain-Driven Design (DDD): An approach to software development that emphasizes collaboration between technical and domain experts
- End-to-End Encryption: A method of data transmission where only the communicating users can read the messages
- Eventual Consistency: A consistency model used in distributed systems where replicas may temporarily disagree but converge to the same value once updates stop
- Feature Flags: A technique to enable or disable features in a software application without deploying new code
- Geohash: A system for encoding latitude/longitude coordinates into a compact string representation
- Geospatial Indexing: Techniques for efficiently querying spatial data
- Group ID: An identifier used to represent a group in messaging or chat systems
- HA: High Availability
- HLS: HTTP Live Streaming, an adaptive bitrate streaming protocol developed by Apple that delivers media over HTTP
- Kafka: A distributed event streaming platform used for building real-time data pipelines
- Kubernetes: An open-source system for automating deployment, scaling, and management of containerized applications
- Load Balancer: A device or software that distributes network or application traffic across multiple servers
- Manifest File: A file that describes the structure and metadata of media streams (e.g., for adaptive streaming)
- MFA: Multi-Factor Authentication
- Microservices: An architectural style that structures an application as a collection of loosely coupled services
- ML: Machine Learning
- NoSQL: A class of databases that provide flexible schemas and scalability (e.g., DynamoDB, Cassandra)
- OAuth2: An open standard for access delegation (authorization), commonly used to issue access tokens to client applications
- Object Storage: A storage architecture that manages data as objects (e.g., S3, GCS, Azure Blob)
- PCI DSS: Payment Card Industry Data Security Standard
- Plugin Architecture: A design pattern that allows for extensibility by enabling third-party developers to add functionality
- Prometheus: An open-source monitoring and alerting toolkit
- QPS: Queries Per Second
- QuadTree: A tree data structure that recursively divides a two-dimensional space into four quadrants, with each internal node having four children
- RabbitMQ: An open-source message broker software
- Rate Limiting: A technique to control the rate of requests sent or received by a system
- Real-Time Messaging: Communication where messages are delivered with minimal delay after they are sent
- Recommendation Engine: A system that suggests products or content to users based on data analysis
- Redis: An in-memory data structure store, used as a database, cache, and message broker
- Redundant Infrastructure: Systems designed to provide backup and failover capabilities in case of failure
- Row-Level Security: A database feature that restricts data access at the row level based on user permissions
- S3: Amazon Simple Storage Service, an object storage service
- SaaS: Software as a Service
- Scalability: The ability of a system to handle increased load by adding resources
- Search Engine: A system that indexes and retrieves data efficiently (e.g., Elasticsearch, Solr)
- Service Level Agreement (SLA): A contract that defines the level of service expected from a service provider
- Sharding: Partitioning data across multiple databases or servers to improve scalability
- Solr: An open-source search platform
- Stateless Application Server: A server that does not store client session data between requests
- Surge Pricing: Dynamic pricing strategy based on supply and demand
- Time-Series DB: A database optimized for time-stamped data, often used for monitoring and analytics
- UUID: Universally Unique Identifier
- WebSocket: A protocol providing full-duplex communication channels over a single TCP connection