# Monitoring Stack (Prometheus + Grafana)

## Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:
- Prometheus: Metrics collection + alerting rules
- Grafana: Visualization + pre-configured dashboards
- Alertmanager: Alert routing + notifications
- cAdvisor: Docker container metrics
- Node Exporter: Host system metrics
- Redis Exporter: Redis-specific metrics
- Gotify: Push notifications (optional)
All monitoring services sit behind a Docker Compose profile flag (opt-in).

## Architecture

```mermaid
graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end
    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end
    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end
    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM
    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```
## Quick Start

### Enable Monitoring

```bash
# Start with the monitoring profile
docker compose --profile monitoring up -d

# Check services
docker compose ps | grep monitoring

# Access dashboards
open http://localhost:3001   # Grafana (admin/admin)
open http://localhost:9090   # Prometheus
open http://localhost:9093   # Alertmanager
```
## Prometheus Configuration

### Scrape Targets

File: `configs/prometheus/prometheus.yml`

```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```

Intervals:

- 10s: API (real-time application metrics)
- 15s: infrastructure (host + containers + Redis)
- 30s: the monitoring stack itself
### Custom Metrics (cm_*)

File: `api/src/utils/metrics.ts`

12 custom metrics for domain-specific monitoring:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |
HTTP metrics (standard prom-client):
- http_requests_total
- http_request_duration_seconds
Geocoding metrics:
- cm_geocode_cache_hits_total
- cm_geocode_cache_misses_total
- cm_geocode_requests_total
- cm_geocode_duration_seconds
Email template metrics:
- cm_email_templates_updated_total
- cm_email_test_sent_total
- cm_email_template_rollback_total
- cm_email_template_cache_hit/miss_total
Location query metrics:
- cm_map_location_query_duration_seconds
- cm_map_location_query_count_total
- cm_map_location_result_count
### Alert Rules

File: `configs/prometheus/alerts.yml`

12 alert rules across 4 groups:
#### Application Alerts
- ApplicationDown: API unreachable for 2 minutes
- HighErrorRate: >10% 5xx errors for 5 minutes
- EmailQueueBacklog: Queue size >100 for 10 minutes
- HighEmailFailureRate: >20% email failures for 10 minutes
- SuspiciousLoginActivity: >5 failed logins/sec for 2 minutes
- HighAPILatency: P95 latency >2s for 5 minutes
- ExternalServiceDown: External service unreachable for 5 minutes
#### System Alerts
- RedisDown: Redis unreachable for 1 minute
- DiskSpaceLow: <15% disk space for 5 minutes
- DiskSpaceCritical: <10% disk space for 2 minutes
- HighCPUUsage: >85% CPU for 10 minutes
- HighMemoryUsage: >85% memory for 10 minutes
Example alert:

```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```
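For comparison, a rate-based rule such as HighErrorRate might be expressed along these lines (a sketch assuming a `status_code` label on `http_requests_total`; the actual rule lives in `alerts.yml`):

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate"
    description: "More than 10% of requests returned 5xx for 5 minutes."
```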
### Data Retention

`docker-compose.yml`:

Disk usage: ~1-5GB for 30 days (depends on scrape frequency + cardinality).

Increase retention:

```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate the container
docker compose --profile monitoring up -d --force-recreate prometheus
```
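The retention flag is passed on the Prometheus service's `command` in `docker-compose.yml`, likely in a shape similar to this (a sketch; image tag and flag list are assumptions):

```yaml
prometheus:
  image: prom/prometheus:latest
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'   # raise to 90d for longer history
```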
## Grafana Configuration

### Datasource

File: `configs/grafana/datasources.yml`

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

Auto-provisioned on Grafana startup.
### Dashboards

File: `configs/grafana/dashboards.yml`

```yaml
apiVersion: 1
providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```

3 pre-configured dashboards:
#### 1. Application Overview

File: `configs/grafana/application-overview.json`

Panels:

- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent

Refresh: 10s

#### 2. API Performance

File: `configs/grafana/api-performance.json`

Panels:

- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration

Refresh: 30s

#### 3. System Health

File: `configs/grafana/system-health.json`

Panels:

- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage

Refresh: 1m
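As an illustration, the latency panels in the API Performance dashboard are typically backed by `histogram_quantile` queries like this (a sketch; the authoritative queries are in the dashboard JSON):

```promql
# P95 request latency over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```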
### First Login

```bash
# Access Grafana
open http://localhost:3001

# Default credentials:
#   Username: admin
#   Password: admin
# Change the password on first login
```

Navigate: Dashboards → Changemaker Lite folder → select a dashboard.
## Alertmanager Configuration

### Notification Receivers

File: `configs/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'
    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
```

Grouping: combines similar alerts (prevents notification spam).

Repeat: re-sends unresolved alerts every 4 hours.
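If critical alerts should page a different receiver than warnings, the route tree can be extended with child routes, roughly like this (a sketch; the `oncall` receiver name is hypothetical):

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'oncall'       # hypothetical receiver for urgent paging
      repeat_interval: 1h      # re-notify unresolved criticals more often
```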
### Testing Alerts

Manual test:

```bash
# Trigger a test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check the Alertmanager UI
open http://localhost:9093
```

Force an alert (stop the API):

```bash
# Stop the API (triggers ApplicationDown after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → alert fires → notification sent
```
## Exporters

### cAdvisor (Container Metrics)

Metrics:

- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O

Access: http://localhost:8080

Configuration (`docker-compose.yml`):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```
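A typical query over cAdvisor data, for example per-container CPU usage (a sketch using cAdvisor's standard metric names):

```promql
# Per-container CPU usage (cores) over the last 5 minutes
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
```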
### Node Exporter (Host Metrics)

Metrics:

- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)

Access: http://localhost:9100/metrics

Configuration:

```yaml
node-exporter:
  command:
    - '--path.rootfs=/host'
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```
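The DiskSpaceLow alert described above corresponds to an expression along these lines (a sketch using Node Exporter's standard filesystem metrics; the mountpoint filter is an assumption):

```promql
# Fires when free space on / drops below 15% of filesystem size
node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.15
```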
### Redis Exporter

Metrics:

- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys

Access: http://localhost:9121/metrics

Configuration:

```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis
```
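For example, the keyspace hit rate can be derived from the exporter's standard counters:

```promql
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```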
### Gotify (Push Notifications)

Setup:

```bash
# Access the Gotify UI
open http://localhost:8889

# Login (default: admin/admin)
# Create an app → copy its token
```

Then add the token to the Alertmanager config:

```yaml
webhook_configs:
  - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```

Mobile apps: available for iOS/Android (receive push notifications).
## Accessing Services
| Service | URL | Default Credentials |
|---|---|---|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |
## Troubleshooting

### Prometheus Not Scraping

Symptoms: missing data in Grafana dashboards

Diagnosis:

```bash
# Check Prometheus targets
open http://localhost:9090/targets
# Look for errors (red) vs success (green)

# Check the API metrics endpoint
curl http://localhost:4000/api/metrics
```

Common causes:

- API container not running
- Wrong port in prometheus.yml
- Network connectivity issue

Solution:

```bash
# Restart the API
docker compose restart api

# Reload the Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
```
### Grafana Dashboards Not Loading

Symptoms: blank dashboards or "No data" errors

Diagnosis:

```bash
# Check the Grafana logs
docker compose logs grafana | tail -50

# Check the datasource
open http://localhost:3001/datasources

# Test a Prometheus query (the prometheus hostname only resolves
# inside the Docker network, e.g. via docker compose exec)
curl http://prometheus:9090/api/v1/query?query=up
```

Solution:

```bash
# Verify the datasource URL:
# it should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana
```
### Alerts Not Firing

Symptoms: no notifications despite issues

Diagnosis:

```bash
# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules are loaded
curl http://localhost:9090/api/v1/rules
```

Solution:

```bash
# Reload the Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test the notification receiver
curl -X POST http://localhost:9093/api/v1/alerts -d '[...]'
```
## Production Best Practices

### Secure Grafana

Change the admin password:

```bash
# Via the UI: Admin → Profile → Change Password
```

Or via an env var in `docker-compose.yml`:

```yaml
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>
```

Disable signup:
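Grafana's standard env var turns signup off; placement mirrors the password example above:

```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false
```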
### Alert Tuning

Avoid false positives: increase the `for` duration in critical alerts.
Example (before):
Example (after):
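The elided before/after snippets presumably contrast something like the following (a hedged sketch; the alert name and expression are hypothetical, the real rules are in `alerts.yml`):

```yaml
# Before: short `for` window, fires on brief spikes
- alert: SomeNoisyAlert          # hypothetical
  expr: some_metric > 0.85       # hypothetical
  for: 2m

# After: longer `for` window, only fires on sustained problems
- alert: SomeNoisyAlert
  expr: some_metric > 0.85
  for: 15m
```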
### External Storage (Long-Term)

Prometheus supports remote write to:

- Thanos: long-term storage (S3/GCS)
- Cortex: multi-tenant Prometheus
- VictoriaMetrics: high-performance storage

Example (Thanos):
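The elided example likely configures `remote_write` in `prometheus.yml` against a Thanos Receive endpoint, roughly (a sketch; hostname and port are assumptions based on Thanos Receive's default remote-write port):

```yaml
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'   # hypothetical endpoint
```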
## Related Documentation
- Docker Compose — Monitoring services configuration
- Environment Variables — Monitoring env vars
- API Reference — Custom metrics implementation