Real-Time Chat App System Design

1. Business Requirements

Functional Requirements

  • User registration and authentication (users, admins)
  • One-to-one and group chat (text, emoji, file attachments)
  • Real-time message delivery and read receipts
  • Presence indicators (online/offline/typing)
  • Message history and search
  • Push notifications and urgent alerts (mentions, direct messages)
  • Mobile-ready responsive UI and API
  • Role-based access control
  • Analytics and reporting (active users, message trends)
  • API for programmatic chat integration

Non-Functional Requirements

  • 99.9% availability (max ~8.76 hours downtime/year)
  • Scalability to support thousands of concurrent users (about 10,000 at peak) and roughly 10M messages per day
  • Secure data storage and access control (encryption in transit and at rest)
  • Low latency (message delivery in under 200 ms)
  • Audit logging and monitoring
  • Backup and disaster recovery
  • GDPR/data privacy compliance
  • Mobile responsiveness

Out of Scope

  • Voice/video calling (unless specified)
  • Integration with external chat platforms
  • Built-in moderation/AI content filtering

2. Estimation & Back-of-the-Envelope Calculations

  • Users: 100,000
  • Daily messages: 10M
  • Peak concurrent users: ~10,000
  • Data size:
    • Messages: 10M × 0.5 KB ≈ 5 GB/day
    • 1 year: 5 GB × 365 ≈ 1.8 TB
    • User data: 100,000 × 2 KB ≈ 200 MB
    • Attachments: 1M × 1 MB ≈ 1 TB (object storage)
    • Total DB size: ~2 TB/year (excluding logs, backups, attachments)
  • Availability:
    • 99.9% = 8.76 hours/year downtime max
    • Use managed DB, multi-AZ deployment, health checks, auto-scaling
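
The arithmetic above can be reproduced with a short script. This is a minimal sketch using the listed figures and decimal units (1 GB = 1,000,000 KB); the average messages-per-second value is derived here and is not part of the original estimates, and peak traffic would be higher than that average.

```python
# Back-of-the-envelope sizing from the figures listed above (decimal units).
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_YEAR = 365 * 24

users = 100_000
daily_messages = 10_000_000
avg_message_kb = 0.5      # average message size used in the estimate
user_record_kb = 2        # per-user record size used in the estimate
availability = 0.999

messages_gb_per_day = daily_messages * avg_message_kb / 1_000_000  # KB -> GB
messages_tb_per_year = messages_gb_per_day * 365 / 1_000           # GB -> TB
user_data_mb = users * user_record_kb / 1_000                      # KB -> MB

# Derived value (not in the original list): average write rate over a day.
avg_messages_per_sec = daily_messages / SECONDS_PER_DAY

# Downtime budget implied by the availability target.
max_downtime_hours = (1 - availability) * HOURS_PER_YEAR

print(f"Messages: {messages_gb_per_day} GB/day, {messages_tb_per_year} TB/year")
print(f"User data: {user_data_mb} MB")
print(f"Average write rate: {avg_messages_per_sec:.0f} messages/s")
print(f"Downtime budget: {max_downtime_hours:.2f} hours/year")
```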

3. High-Level Design (Mermaid Diagrams)

Component Diagram

mermaid
flowchart LR
  User["User (Web/Mobile)"]
  LB[Load Balancer]
  WS[WebSocket Gateway]
  App[Application Server]
  DB[(Database)]
  Cache["Cache (Redis)"]
  Storage["Object Storage (Attachments)"]
  Alert[Alert/Notification Service]
  Analytics[Analytics Engine]

  User --> LB --> WS --> App
  App --> DB
  App --> Cache
  App --> Storage
  App --> Alert
  App --> Analytics
  Analytics --> DB

Data Flow Diagram

mermaid
sequenceDiagram
  participant U as User
  participant W as WebSocket Gateway
  participant A as App Server
  participant D as Database
  participant C as Cache
  participant S as Storage
  participant L as Alert Service

  U->>W: Send Message
  W->>A: Forward Message
  A->>C: Look Up Recipient Presence
  C-->>A: Presence Status or Cache Miss
  A->>D: Store Message
  D-->>A: Success/Fail
  A->>S: Store Attachment (if any)
  S-->>A: Success/Fail
  A->>L: Send Urgent Alert (if needed)
  A-->>W: Deliver Message
  W-->>U: Receive Message

Key Design Decisions

  • Database: NoSQL (e.g., MongoDB, Cassandra) for high write throughput and flexible schema; Redis for ephemeral data (presence, sessions)
  • Object Storage: For attachments (e.g., AWS S3, Azure Blob)
  • WebSocket Gateway: For real-time, low-latency message delivery
  • Alerting/Notifications: Email/SMS/push via third-party service (e.g., Twilio, Firebase)
  • Analytics: Batch or streaming (e.g., Kafka + Spark, or managed cloud analytics)
  • Deployment: Cloud-based, multi-AZ, managed services for high availability
  • API: REST/GraphQL for mobile and web clients
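
To illustrate why an expiring key-value store suits presence data, here is a minimal in-memory sketch of the TTL pattern. `PresenceStore`, its method names, and the 30-second default are hypothetical; with Redis, the heartbeat would roughly correspond to a `SET` with the `EX` option, so that a client that stops sending heartbeats goes offline automatically.

```python
import time
from typing import Callable, Dict, Tuple


class PresenceStore:
    """In-memory stand-in for a TTL-based presence cache (hypothetical helper)."""

    def __init__(
        self,
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._entries: Dict[str, Tuple[str, float]] = {}  # user_id -> (status, expires_at)

    def heartbeat(self, user_id: str, status: str = "online") -> None:
        """Record a status and refresh its expiry."""
        self._entries[user_id] = (status, self._clock() + self._ttl)

    def status(self, user_id: str) -> str:
        """Return the current status, or 'offline' if no live entry exists."""
        entry = self._entries.get(user_id)
        if entry is None:
            return "offline"
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # lazily evict the expired entry
            return "offline"
        return value
```

A client would call `heartbeat` periodically (and with `"typing"` while composing); any user whose entry has expired reads as offline without an explicit disconnect event.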

4. Conceptual Design

Entities

  • User: id, name, email, password_hash, status, registration_date, last_seen
  • ChatRoom: id, name, type (group/1-1), members, created_at
  • Message: id, chatroom_id, sender_id, content, type (text/file), timestamp, status (delivered/read)
  • Attachment: id, message_id, url, type, size, uploaded_at
  • Presence: user_id, status, last_active
  • Alert: id, user_id, type (urgent/mention/dm), message, created_at, status
  • AuditLog: id, user_id, action, entity, entity_id, timestamp
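
As one possible rendering of the entities above, the sketch below models `ChatRoom` and `Message` as Python dataclasses. The field names follow the list; the enum classes, the `SENT` initial status, and the UUID-based ids are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class RoomType(Enum):
    GROUP = "group"
    ONE_TO_ONE = "1-1"


class MessageType(Enum):
    TEXT = "text"
    FILE = "file"


class MessageStatus(Enum):
    SENT = "sent"            # assumed initial state before delivery is confirmed
    DELIVERED = "delivered"
    READ = "read"


def _new_id() -> str:
    """Generate a random 32-character hex id (assumed id scheme)."""
    return uuid.uuid4().hex


def _utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


@dataclass
class ChatRoom:
    name: str
    type: RoomType
    members: List[str] = field(default_factory=list)  # user ids
    id: str = field(default_factory=_new_id)
    created_at: datetime = field(default_factory=_utc_now)


@dataclass
class Message:
    chatroom_id: str
    sender_id: str
    content: str
    type: MessageType = MessageType.TEXT
    status: MessageStatus = MessageStatus.SENT
    id: str = field(default_factory=_new_id)
    timestamp: datetime = field(default_factory=_utc_now)
```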

Key Flows

  • Message Sending:
    1. User sends message via WebSocket
    2. App checks presence, stores message, delivers to recipients
    3. Stores attachment if any, triggers urgent alert if needed
  • Presence:
    • Presence updates are written to Redis for fast access
  • Alerts:
    • System triggers urgent alerts for mentions, DMs, or system events
  • Analytics:
    • Periodic jobs aggregate message, user, and trend data
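
The message-sending flow can be sketched as a single-process toy, with plain Python collections standing in for the database, presence cache, and alert queue. `InMemoryChat` and its fields are hypothetical, and the rule that an `@user_id` mention triggers an alert is an assumed interpretation of "urgent alert if needed".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class InMemoryChat:
    """Toy stand-in for the app server's send path (hypothetical)."""

    rooms: Dict[str, Set[str]] = field(default_factory=dict)        # room_id -> member ids
    online: Set[str] = field(default_factory=set)                   # stand-in for presence cache
    stored: List[dict] = field(default_factory=list)                # stand-in for the database
    delivered: Dict[str, List[dict]] = field(default_factory=dict)  # live deliveries per user
    alerts: List[dict] = field(default_factory=list)                # stand-in for the alert queue

    def send_message(self, room_id: str, sender_id: str, content: str) -> dict:
        members = self.rooms[room_id]
        if sender_id not in members:
            raise PermissionError(f"{sender_id} is not a member of {room_id}")

        message = {
            "id": len(self.stored) + 1,
            "chatroom_id": room_id,
            "sender_id": sender_id,
            "content": content,
        }
        self.stored.append(message)  # persist before fan-out so history is complete

        for recipient in sorted(members - {sender_id}):
            if recipient in self.online:
                # Online recipients get the message pushed over their connection.
                self.delivered.setdefault(recipient, []).append(message)
            if f"@{recipient}" in content:
                # Assumed rule: an @mention triggers an urgent alert.
                # Naive substring match; a real parser would tokenize mentions.
                self.alerts.append(
                    {"user_id": recipient, "type": "mention", "message_id": message["id"]}
                )
        return message
```

In a real deployment the alert append would be an enqueue onto an async queue, so slow notification providers do not block message delivery.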

Security

  • Role-based access control (RBAC)
  • Input validation, rate limiting
  • Encrypted connections (TLS for both HTTPS and WebSocket/WSS traffic) and encryption at rest
  • Regular backups and audit logs
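
Rate limiting is often implemented with a token bucket. The sketch below is one such approach, not necessarily the one this design would use; `TokenBucket` and its parameters are illustrative names, and a production system would typically keep one bucket per user or connection in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        capacity: float,
        rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.rate = rate
        self._clock = clock  # injectable for deterministic tests
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits the budget."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

For example, `TokenBucket(capacity=20, rate=5)` would let a user send a burst of 20 messages and then sustain 5 per second; requests for which `allow()` returns `False` would be rejected or delayed.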

5. Bottlenecks and Refinement

Potential Bottlenecks

  • WebSocket gateway scaling:
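    • Keep durable state out of the gateways (e.g., track connection-to-gateway mappings in a shared store such as Redis), use sticky sessions so each client stays on one gateway, and auto-scale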
  • Database write throughput:
    • Use NoSQL DB with sharding and replication
  • Attachment storage/delivery:
    • Use scalable object storage and CDN
  • Alert delivery:
    • Use async queues for urgent notifications
  • Cache contention:
    • Use Redis clustering for presence/session data
  • Zone or region failure:
    • Deploy across multiple availability zones to survive a zone outage; add a second region if a full regional outage must be survived
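
For the sharding item above, one simple way to route writes is to hash a shard key. The sketch below assumes `chatroom_id` is the shard key (so a room's messages stay together) and uses plain modulo hashing; `shard_for` is a hypothetical helper. Note that modulo hashing remaps most keys when the shard count changes, which is why consistent hashing is a common alternative.

```python
import hashlib


def shard_for(chatroom_id: str, num_shards: int) -> int:
    """Map a chat room to a shard index using a stable hash of its id."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Use a cryptographic hash for a stable, well-spread value across processes
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(chatroom_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```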

Refinement

  • Monitor system metrics and auto-scale WebSocket/app servers
  • Regularly test failover and backup restores
  • Optimize queries and indexes for frequent operations
  • Add shards or re-partition data if user/message volume grows significantly

This design provides a scalable, highly available, mobile-ready real-time chat system with urgent alerts, analytics, and the operational practices needed to run it.