Monitoring Stack (Prometheus + Grafana)

Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:

  • Prometheus: Metrics collection + alerting rules
  • Grafana: Visualization + pre-configured dashboards
  • Alertmanager: Alert routing + notifications
  • cAdvisor: Docker container metrics
  • Node Exporter: Host system metrics
  • Redis Exporter: Redis-specific metrics
  • Gotify: Push notifications (optional)

All monitoring services sit behind the Docker Compose "monitoring" profile, so they are opt-in and start only when that profile is enabled.


Architecture

graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end

    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end

    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end

    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM

    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY

Quick Start

Enable Monitoring

# Start with monitoring profile
docker compose --profile monitoring up -d

# Check services
docker compose --profile monitoring ps

# Access dashboards
open http://localhost:3001  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:9093  # Alertmanager

Prometheus Configuration

Scrape Targets

File: configs/prometheus/prometheus.yml

scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s

Intervals:

  • 10s: API (real-time application metrics)
  • 15s: Infrastructure (host + containers + Redis)
  • 30s: Monitoring stack itself
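
Prometheus rejects a scrape job whose scrape_timeout is longer than its scrape_interval, so it can help to check the pairs before reloading. The following is a minimal sketch, not part of the project: the job list is copied by hand from the config above, and the fallback values (1m interval, 10s timeout) are Prometheus's global defaults, assumed here because the global section of prometheus.yml is not shown.

```python
import re

# Duration units handled by this sketch (a subset of the Prometheus syntax).
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a simple duration such as '10s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_jobs(jobs, default_interval="1m", default_timeout="10s"):
    """Return the names of jobs whose scrape_timeout exceeds scrape_interval."""
    bad = []
    for job in jobs:
        interval = parse_duration(job.get("scrape_interval", default_interval))
        timeout = parse_duration(job.get("scrape_timeout", default_timeout))
        if timeout > interval:
            bad.append(job["job_name"])
    return bad


# Hand-copied subset of the scrape_configs shown above.
JOBS = [
    {"job_name": "changemaker-v2-api", "scrape_interval": "10s", "scrape_timeout": "5s"},
    {"job_name": "redis", "scrape_interval": "15s"},
    {"job_name": "alertmanager", "scrape_interval": "30s"},
]

print(check_jobs(JOBS))
```

An empty list means every job's timeout fits inside its interval. A job that lowered its interval to 5s without also lowering the timeout would be reported.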

Custom Metrics (cm_*)

File: api/src/utils/metrics.ts

12 custom metrics for domain-specific monitoring:

| Metric | Type | Labels | Description |
|---|---|---|---|
| cm_emails_sent_total | Counter | campaign_id | Campaign emails sent successfully |
| cm_emails_failed_total | Counter | campaign_id, error_type | Failed email sends |
| cm_email_queue_size | Gauge | - | Current email queue size |
| cm_email_send_duration_seconds | Histogram | - | Email send latency |
| cm_login_attempts_total | Counter | status | Login attempts (success/failure) |
| cm_active_sessions | Gauge | - | Active refresh tokens |
| cm_campaign_emails_total | Counter | campaign_id | Total campaign emails created |
| cm_response_submissions_total | Counter | - | Response wall submissions |
| cm_canvass_visits_total | Counter | outcome | Canvass visits by outcome |
| cm_active_canvass_sessions | Gauge | - | Active canvass sessions |
| cm_shift_signups_total | Counter | - | Shift signups |
| cm_external_service_up | Gauge | service | External service health (1=up, 0=down) |

HTTP metrics (standard prom-client):

  • http_requests_total
  • http_request_duration_seconds

Geocoding metrics:

  • cm_geocode_cache_hits_total
  • cm_geocode_cache_misses_total
  • cm_geocode_requests_total
  • cm_geocode_duration_seconds

Email template metrics:

  • cm_email_templates_updated_total
  • cm_email_test_sent_total
  • cm_email_template_rollback_total
  • cm_email_template_cache_hit_total
  • cm_email_template_cache_miss_total

Location query metrics:

  • cm_map_location_query_duration_seconds
  • cm_map_location_query_count_total
  • cm_map_location_result_count

Alert Rules

File: configs/prometheus/alerts.yml

12 alert rules across 4 groups, summarized below as application and system alerts:

Application Alerts

  1. ApplicationDown: API unreachable for 2 minutes
  2. HighErrorRate: >10% 5xx errors for 5 minutes
  3. EmailQueueBacklog: Queue size >100 for 10 minutes
  4. HighEmailFailureRate: >20% email failures for 10 minutes
  5. SuspiciousLoginActivity: >5 failed logins/sec for 2 minutes
  6. HighAPILatency: P95 latency >2s for 5 minutes
  7. ExternalServiceDown: External service unreachable for 5 minutes
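
The HighErrorRate threshold can be read as a ratio of counter increases over a time window. The sketch below works through that arithmetic on two invented snapshots of http_requests_total. It assumes the counter carries a status-code label, which is not shown in this document, and it is not the alert expression from alerts.yml.

```python
def counter_increase(earlier, later):
    """Increase of a counter between two scrapes; a drop is treated as a reset to zero."""
    return later - earlier if later >= earlier else later


def error_ratio(before, after):
    """Share of 5xx responses among all requests between two snapshots.

    Each snapshot maps an HTTP status code (as a string) to the counter value.
    """
    total = errors = 0.0
    for status, later in after.items():
        increase = counter_increase(before.get(status, 0.0), later)
        total += increase
        if status.startswith("5"):
            errors += increase
    return errors / total if total else 0.0


# Invented counter values taken five minutes apart.
before = {"200": 1000, "404": 20, "500": 5}
after = {"200": 1170, "404": 25, "500": 30}

ratio = error_ratio(before, after)
print(f"{ratio:.1%}", "-> above the 10% threshold" if ratio > 0.10 else "-> ok")
```

Here 25 of the 200 new requests returned a 5xx status, so the ratio is 12.5% and the condition would hold for this window.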

System Alerts

  1. RedisDown: Redis unreachable for 1 minute
  2. DiskSpaceLow: <15% disk space for 5 minutes
  3. DiskSpaceCritical: <10% disk space for 2 minutes
  4. HighCPUUsage: >85% CPU for 10 minutes
  5. HighMemoryUsage: >85% memory for 10 minutes

Example Alert:

- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
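
The "for: 2m" clause means the expression must stay true across consecutive evaluations for two minutes before the alert moves from pending to firing. A single false evaluation resets the timer. The following sketch simulates that state machine. The 30-second evaluation interval is an assumption for the example, not a value taken from prometheus.yml.

```python
def alert_states(evaluations, for_seconds):
    """Simulate the pending/firing logic of an alerting rule.

    evaluations: list of (timestamp_seconds, condition_is_true) in time order.
    Returns one state per evaluation: 'inactive', 'pending' or 'firing'.
    """
    states = []
    active_since = None  # when the condition first became true in the current run
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# The API goes down at t=60s; the rule is evaluated every 30s (assumed interval).
evals = [(t, t >= 60) for t in range(0, 211, 30)]
for (timestamp, _), state in zip(evals, alert_states(evals, for_seconds=120)):
    print(f"t={timestamp:>3}s {state}")
```

In this run the alert is pending from t=60s and starts firing at t=180s, two minutes after the condition first held.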

Data Retention

docker-compose.yml:

prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # 30 days

Disk usage: ~1-5GB for 30 days (depends on scrape frequency + cardinality).
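
A rough way to sanity-check that range is the sizing rule from the Prometheus storage documentation: retention in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The series counts below are invented for illustration. The real figures depend on how many series each target exposes.

```python
def estimate_tsdb_bytes(series_by_interval, retention_days, bytes_per_sample=2.0):
    """Estimate on-disk size of the Prometheus TSDB.

    series_by_interval maps a scrape interval in seconds to the number of
    active series scraped at that interval.
    """
    samples_per_second = sum(
        count / interval for interval, count in series_by_interval.items()
    )
    return retention_days * 86400 * samples_per_second * bytes_per_sample


# Hypothetical workload: 2000 series at 10s (API) and 5000 series at 15s (infrastructure).
workload = {10: 2000, 15: 5000}

for days in (30, 90):
    gib = estimate_tsdb_bytes(workload, days) / 2**30
    print(f"{days}d retention: ~{gib:.1f} GiB")
```

Because the estimate is linear in retention time, tripling retention from 30d to 90d triples the expected disk usage.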

Increase retention:

# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus

Grafana Configuration

Datasource

File: configs/grafana/datasources.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Auto-provisioned on Grafana startup.


Dashboards

File: configs/grafana/dashboards.yml

apiVersion: 1

providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

3 pre-configured dashboards:

1. Application Overview

File: configs/grafana/application-overview.json

Panels:

  • API uptime (last 24h)
  • Request rate (req/sec)
  • Error rate (%)
  • Email queue size
  • Active sessions
  • Campaign emails sent

Refresh: 10s


2. API Performance

File: configs/grafana/api-performance.json

Panels:

  • Request latency (P50, P95, P99)
  • Requests by status code
  • Top 10 slowest endpoints
  • HTTP errors by route
  • Geocoding cache hit rate
  • Email send duration

Refresh: 30s
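
The P50/P95/P99 latency panels are estimates derived from histogram buckets, not exact measurements. The sketch below shows the linear interpolation that a quantile over cumulative buckets amounts to, under the assumption that the lowest bucket starts at 0. The bucket boundaries and counts are invented and are not the ones configured in metrics.ts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented request-duration buckets in seconds (cumulative counts).
BUCKETS = [(0.1, 50), (0.5, 80), (1.0, 90), (2.0, 96), (5.0, 100), (math.inf, 100)]

p95 = histogram_quantile(0.95, BUCKETS)
print(f"P95 ~ {p95:.2f}s", "-> above 2s" if p95 > 2 else "-> below 2s")
```

The estimate can only be as precise as the bucket layout: a P95 that falls between the 1s and 2s bounds is interpolated within that range, so a 2s alert threshold works best when 2s is itself a bucket boundary.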


3. System Health

File: configs/grafana/system-health.json

Panels:

  • CPU usage (%)
  • Memory usage (%)
  • Disk space (GB free)
  • Network I/O (MB/s)
  • Container CPU throttling
  • Redis memory usage

Refresh: 1m


First Login

# Access Grafana
open http://localhost:3001

# Default credentials
Username: admin
Password: admin

# Change password on first login

Navigate: Dashboards → Changemaker Lite folder → Select dashboard


Alertmanager Configuration

Notification Receivers

File: configs/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'

    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'

Grouping: Combines similar alerts (prevents spam).

Repeat: Re-sends unresolved alerts every 4 hours.
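
Grouping can be pictured as bucketing alerts by the values of the group_by labels, with each bucket producing one notification. The sketch below illustrates the idea only and is not Alertmanager's implementation. The severity and service label values are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "severity")):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two external services down at once, plus Redis.
alerts = [
    {"labels": {"alertname": "ExternalServiceDown", "severity": "warning", "service": "smtp"}},
    {"labels": {"alertname": "ExternalServiceDown", "severity": "warning", "service": "geocoder"}},
    {"labels": {"alertname": "RedisDown", "severity": "critical"}},
]

for key, members in group_alerts(alerts).items():
    print(key, f"{len(members)} alert(s) in one notification")
```

The two ExternalServiceDown alerts share both group_by labels, so they arrive as a single notification. Adding "service" to group_by would split them into separate ones.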


Testing Alerts

Manual test:

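# Trigger test alert (API v2; the v1 alerts endpoint was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'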

# Check Alertmanager UI
open http://localhost:9093
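
The same manual test can be scripted, which makes it easier to add an endsAt timestamp so the test alert expires on its own. This is a sketch using only the standard library. It assumes Alertmanager is reachable on localhost:9093 and exposes the v2 API, and the function names are made up for the example.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name="TestAlert", severity="warning", minutes=5):
    """Return a one-element alert list that expires after the given number of minutes."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Test alert from Python"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts, base_url="http://localhost:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Usage, with the monitoring profile running:
#   post_alerts(build_test_alert())
```

The network call is left in the usage comment, so importing or running the file without a running stack does nothing.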

Force alert (stop API):

# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → Alert fires → Notification sent

Exporters

cAdvisor (Container Metrics)

Metrics:

  • CPU usage per container
  • Memory usage per container
  • Network I/O
  • Disk I/O

Access: http://localhost:8080

Configuration (docker-compose.yml):

cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg

Node Exporter (Host Metrics)

Metrics:

  • CPU usage (all cores)
  • Memory usage (total, free, cached)
  • Disk usage (filesystem, mountpoints)
  • Network I/O (bytes, packets)

Access: http://localhost:9100/metrics

Configuration:

node-exporter:
  command:
    - '--path.rootfs=/rootfs'  # must match the /:/rootfs mount below
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro

Redis Exporter

Metrics:

  • Memory usage
  • Commands per second
  • Connected clients
  • Keyspace hits/misses
  • Evicted keys

Access: http://localhost:9121/metrics

Configuration:

redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis

Gotify (Push Notifications)

Setup:

# Access Gotify UI
open http://localhost:8889

# Login (default: admin/admin)

# Create app → Copy token

# Add to Alertmanager config:
webhook_configs:
  - url: 'http://gotify:80/message?token=YOUR_TOKEN'

Mobile apps: Available for iOS/Android (receive push notifications).


Accessing Services

| Service | URL | Default Credentials |
|---|---|---|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |

Troubleshooting

Prometheus Not Scraping

Symptoms: Missing data in Grafana dashboards

Diagnosis:

# Check Prometheus targets
open http://localhost:9090/targets

# Look for errors (red) vs success (green)

# Check API metrics endpoint
curl http://localhost:4000/api/metrics
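
The targets page has a JSON equivalent at /api/v1/targets, which is easier to check from a script. The sketch below filters that response for targets that are not healthy. The sample payload is invented and trimmed to the fields the function reads.

```python
import json


def unhealthy_targets(targets_json):
    """Return (job, scrape URL, last error) for every active target that is not 'up'."""
    payload = json.loads(targets_json)
    return [
        (t["labels"].get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t.get("health") != "up"
    ]


# Invented response, trimmed to the relevant fields of /api/v1/targets.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "changemaker-v2-api"},
                "scrapeUrl": "http://changemaker-v2-api:4000/api/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "redis"},
                "scrapeUrl": "http://redis-exporter:9121/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
})

for job, url, error in unhealthy_targets(SAMPLE):
    print(f"{job}: {url} -> {error}")
```

Against a live stack, SAMPLE would be replaced by the output of curl -s http://localhost:9090/api/v1/targets.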

Common causes:

  • API container not running
  • Wrong port in prometheus.yml
  • Network connectivity issue

Solution:

# Restart API
docker compose restart api

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus

Grafana Dashboards Not Loading

Symptoms: Blank dashboards or "No data" errors

Diagnosis:

# Check Grafana logs
docker compose logs grafana | tail -50

# Check datasource
open http://localhost:3001/datasources

# Test Prometheus query (from the host; the name "prometheus" resolves only inside the Compose network)
curl 'http://localhost:9090/api/v1/query?query=up'

Solution:

# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana

Alerts Not Firing

Symptoms: No notifications despite issues

Diagnosis:

# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules

Solution:

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test notification receiver
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[...]'

Production Best Practices

Secure Grafana

Change admin password:

# Via UI: Admin → Profile → Change Password

# Via env var (docker-compose.yml):
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>

Disable signup:

environment:
  - GF_USERS_ALLOW_SIGN_UP=false  # Already set

Alert Tuning

Avoid false positives: Increase the "for" duration in critical alerts so that short-lived spikes do not trigger notifications.

Example (before):

- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 1m  # Too aggressive

Example (after):

- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 10m  # More reasonable

External Storage (Long-Term)

Prometheus supports remote write to:

  • Thanos: Long-term storage (S3/GCS)
  • Cortex: Multi-tenant Prometheus
  • VictoriaMetrics: High-performance storage

Example (Thanos):

# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"