Notification Service System Design

1. Business Requirements

Functional Requirements

  • User registration and authentication (apps, admins)
  • Send notifications via multiple channels (email, SMS, push, in-app)
  • Support for urgent and scheduled alerts
  • Notification templates and personalization
  • Delivery status tracking and retries
  • Mobile-ready API and dashboard
  • Role-based access control
  • Analytics and reporting (delivery rates, failures, trends)
  • API for programmatic notification requests

Non-Functional Requirements

  • 99.9% availability (max ~8.76 hours downtime/year)
  • Scalability to handle high notification volume and spikes
  • Secure data storage and access control
  • Fast response times (<300ms for most API/dashboard requests)
  • Audit logging and monitoring
  • Backup and disaster recovery
  • GDPR/data privacy compliance
  • Mobile responsiveness

Out of Scope

  • Built-in chat/messaging between users
  • Third-party marketing automation integration
  • Content moderation for notification payloads

2. Estimation & Back-of-the-Envelope Calculations

  • Clients: 10,000 (apps, services)
  • Notifications/day: 5M (urgent + scheduled)
  • Peak concurrent requests: ~10,000
  • Data size:
    • Notification logs: 5M × 0.5 KB ≈ 2.5 GB/day
    • 1 year: 2.5 GB × 365 ≈ 900 GB
    • User/app data: 10,000 × 2 KB ≈ 20 MB
    • Templates: 1,000 × 1 KB ≈ 1 MB
    • Total DB size: ~1 TB/year (dominated by notification logs; excludes backups)
  • Availability:
    • 99.9% = 8.76 hours/year downtime max
    • Use managed DB, multi-AZ deployment, health checks, auto-scaling
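
The arithmetic behind these estimates can be checked with a short script. This is only a sketch: it assumes decimal units (1 GB = 1,000,000 KB) and a 365-day year, and the variable names are illustrative. It also derives the average request rate, which the list above does not state.

```python
# Back-of-the-envelope check of the estimates above.
# Assumptions: decimal units (1 GB = 1_000_000 KB) and a 365-day year.
NOTIFICATIONS_PER_DAY = 5_000_000
LOG_ENTRY_KB = 0.5
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_YEAR = 365 * 24
AVAILABILITY = 0.999

logs_gb_per_day = NOTIFICATIONS_PER_DAY * LOG_ENTRY_KB / 1_000_000
logs_gb_per_year = logs_gb_per_day * 365
avg_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
max_downtime_hours = HOURS_PER_YEAR * (1 - AVAILABILITY)

print(f"Logs per day:      {logs_gb_per_day:.1f} GB")     # 2.5
print(f"Logs per year:     {logs_gb_per_year:.1f} GB")    # 912.5
print(f"Average rate:      {avg_per_second:.1f} /s")      # 57.9
print(f"Max downtime/year: {max_downtime_hours:.2f} h")   # 8.76
```

The average rate of roughly 58 notifications per second is far below the ~10,000 peak concurrent requests, which is why the design leans on queues and auto-scaling to absorb spikes rather than on a large fixed capacity.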

3. High Level Design (Mermaid Diagrams)

Component Diagram

mermaid
flowchart LR
  Client["Client (Web/Mobile/App)"]
  LB[Load Balancer]
  API[Notification API]
  App[Application Server]
  DB[(Metadata DB)]
  Queue[Message Queue]
  Channel["Channel Workers (Email/SMS/Push)"]
  Alert[Alert/Notification Service]
  Analytics[Analytics Engine]

  Client --> LB --> API --> App
  App --> DB
  App --> Queue
  Queue --> Channel
  Channel --> Alert
  App --> Analytics
  Analytics --> DB

Data Flow Diagram

mermaid
sequenceDiagram
  participant C as Client
  participant A as API Server
  participant Q as Queue
  participant W as Channel Worker
  participant D as DB
  participant L as Alert Service

  C->>A: Send Notification Request
  A->>Q: Enqueue Notification
  A-->>C: Response (request accepted)
  Q->>W: Dequeue and Process
  W->>D: Log Delivery Status
  W->>L: Trigger Urgent Alert (if needed)
  W-->>A: Delivery Status

Key Design Decisions

  • Database: Relational DB (e.g., PostgreSQL) for metadata, logs, and templates
  • Queue: Distributed message queue (e.g., RabbitMQ, Kafka, AWS SQS) for decoupling and scaling
  • Channel Workers: Separate workers for each channel (email, SMS, push)
  • Alerting: Urgent alerts via a dedicated service (e.g., Twilio for SMS, Firebase Cloud Messaging for push)
  • Analytics: Batch or streaming (e.g., Kafka + Spark, or managed cloud analytics)
  • Deployment: Cloud-based, multi-AZ, managed services for high availability
  • API: REST/GraphQL for programmatic access

4. Conceptual Design

Entities

  • User/App: id, name, api_key, contact_info, role, status
  • Notification: id, user_id, channel, payload, status, priority, scheduled_at, sent_at
  • Template: id, name, content, channel, created_at, updated_at
  • DeliveryLog: id, notification_id, channel, status, timestamp, error
  • Alert: id, user_id, type (urgent/scheduled), message, created_at, status
  • AuditLog: id, user_id, action, entity, entity_id, timestamp
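
As a rough illustration, two of these entities could be modelled as Python dataclasses. The field names follow the list above; the enum values, the `record_attempt` helper, and the use of UUID strings for ids are assumptions made for this sketch, not part of the design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
import uuid


class Channel(str, Enum):
    EMAIL = "email"
    SMS = "sms"
    PUSH = "push"
    IN_APP = "in_app"


class Status(str, Enum):
    # Assumed status values; the design does not enumerate them.
    PENDING = "pending"
    SENT = "sent"
    FAILED = "failed"


@dataclass
class Notification:
    user_id: str
    channel: Channel
    payload: dict
    priority: str = "scheduled"  # "urgent" or "scheduled"
    status: Status = Status.PENDING
    scheduled_at: Optional[datetime] = None
    sent_at: Optional[datetime] = None
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class DeliveryLog:
    notification_id: str
    channel: Channel
    status: Status
    error: Optional[str] = None
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def record_attempt(notification: Notification, ok: bool,
                   error: Optional[str] = None) -> DeliveryLog:
    """Hypothetical helper: update the notification and return a log entry."""
    notification.status = Status.SENT if ok else Status.FAILED
    if ok:
        notification.sent_at = datetime.now(timezone.utc)
    return DeliveryLog(
        notification_id=notification.id,
        channel=notification.channel,
        status=notification.status,
        error=error,
    )
```

Keeping one DeliveryLog row per attempt, rather than overwriting a single status field, is what makes retries and failure analytics traceable later.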

Key Flows

  • Notification Sending:
    1. Client sends notification request
    2. API enqueues notification
    3. Channel worker processes and sends via appropriate channel
    4. Worker logs the delivery status and triggers an urgent alert if needed
  • Alerts:
    • System triggers urgent alerts for failures or high-priority events
  • Analytics:
    • Periodic jobs aggregate delivery, failure, and trend data
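
The notification-sending flow can be sketched with an in-memory stand-in for the message queue. Everything here is illustrative: `NotificationQueue`, `run_worker`, and the dictionary shape of a notification are assumptions, and a real deployment would use one of the brokers named in the design decisions. The sketch dequeues urgent notifications before scheduled ones and logs one delivery record per attempt.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Optional

PRIORITY_RANK = {"urgent": 0, "scheduled": 1}  # lower rank is dequeued first


class NotificationQueue:
    """In-memory stand-in for the message queue (illustration only)."""

    def __init__(self) -> None:
        self._heap: list = []
        # The counter breaks ties so that order is FIFO within a priority
        # and the notification dicts themselves are never compared.
        self._counter = itertools.count()

    def enqueue(self, notification: dict) -> None:
        rank = PRIORITY_RANK[notification["priority"]]
        heapq.heappush(self._heap, (rank, next(self._counter), notification))

    def dequeue(self) -> Optional[dict]:
        return heapq.heappop(self._heap)[2] if self._heap else None


def run_worker(queue: NotificationQueue,
               senders: Dict[str, Callable[[object], None]],
               delivery_log: List[dict]) -> None:
    """Drain the queue, send via the matching channel, log each outcome."""
    while (n := queue.dequeue()) is not None:
        try:
            senders[n["channel"]](n["payload"])
            status, error = "sent", None
        except Exception as exc:  # a real worker would catch narrower errors
            status, error = "failed", str(exc)
        delivery_log.append({
            "notification_id": n["id"],
            "channel": n["channel"],
            "status": status,
            "error": error,
        })
```

In this sketch a failed send is only logged; retries and the urgent-alert trigger would hook in at the point where the `failed` status is recorded.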

Security

  • Role-based access control (RBAC)
  • API key validation, input validation, rate limiting
  • Encrypted connections (HTTPS)
  • Regular backups and audit logs
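
Two of these controls, API key validation and rate limiting, can be sketched with the standard library alone. The token-bucket algorithm is one common choice for rate limiting, not something the design prescribes, and `hash_key`, `is_valid_key`, and `TokenBucket` are hypothetical names. The sketch assumes only a SHA-256 hash of each API key is stored.

```python
import hashlib
import hmac
import time
from typing import Callable


def hash_key(api_key: str) -> str:
    """Hash an API key so the plaintext never needs to be stored."""
    return hashlib.sha256(api_key.encode()).hexdigest()


def is_valid_key(presented: str, stored_hash: str) -> bool:
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(hash_key(presented), stored_hash)


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        current = self._now()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (current - self._last) * self.rate)
        self._last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A service would typically keep one bucket per API key, so a single noisy client cannot exhaust capacity for the others. The injectable `now` parameter exists only to make the limiter testable without real sleeps.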

5. Bottlenecks and Refinement

Potential Bottlenecks

  • Queue/message throughput:
    • Use scalable, distributed queues and auto-scaling workers
  • Channel provider rate limits:
    • Implement retry, backoff, and provider failover
  • Database contention:
    • Use read replicas, caching, and DB connection pooling
  • Alert delivery:
    • Route urgent notifications through a dedicated high-priority queue so they are not delayed behind scheduled traffic
  • Single region failure:
    • Deploy across multiple availability zones; add a second region with replicated data if a full regional outage must be survived
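
The retry, backoff, and provider failover mitigation might look like the following minimal sketch. It assumes each provider is a plain callable that raises a hypothetical `ProviderError` on failure; the delay values and the "full jitter" strategy are illustrative choices rather than requirements of the design.

```python
import random
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a channel provider rejects or fails a send."""


def send_with_failover(payload: str,
                       providers: Sequence[Callable[[str], str]],
                       max_attempts: int = 3,
                       base_delay: float = 0.5,
                       max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each provider in order, retrying with exponential backoff and jitter."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(payload)
            except ProviderError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    # Exponential backoff capped at max_delay; a random delay in
                    # [0, cap] spreads retries so workers do not retry in lockstep.
                    cap = min(max_delay, base_delay * 2 ** attempt)
                    sleep(random.uniform(0, cap))
        # Retries exhausted for this provider: fall through to the next one.
    raise ProviderError(f"all providers failed: {last_error}")
```

The order of `providers` encodes the failover preference, for example a primary SMS gateway followed by a backup. Passing `sleep` as a parameter keeps the function testable without waiting on real delays.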

Refinement

  • Monitor system metrics and auto-scale API servers and workers
  • Regularly test failover and backup restores
  • Optimize queries and indexes for frequent operations
  • Consider sharding if notification/log volume grows significantly

This design provides a scalable, highly available, and mobile-ready notification service with robust urgent alerts, analytics, and operational best practices.