# Monitoring Stack (Prometheus + Grafana)

## Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:

- **Prometheus**: Metrics collection + alerting rules
- **Grafana**: Visualization + pre-configured dashboards
- **Alertmanager**: Alert routing + notifications
- **cAdvisor**: Docker container metrics
- **Node Exporter**: Host system metrics
- **Redis Exporter**: Redis-specific metrics
- **Gotify**: Push notifications (optional)

**All monitoring services** sit behind the `monitoring` Docker Compose profile, so they are opt-in and do not start by default.

---

## Architecture

```mermaid
graph LR
    subgraph "Application Metrics"
        API["API<br/>:4000/api/metrics"]
        MEDIA["Media API<br/>:4100/metrics"]
    end

    subgraph "Infrastructure Metrics"
        CADVISOR["cAdvisor<br/>Container Stats"]
        NODE["Node Exporter<br/>Host Stats"]
        REDIS_EXP["Redis Exporter<br/>Redis Stats"]
    end

    subgraph "Monitoring Stack"
        PROM["Prometheus<br/>:9090"]
        GRAFANA["Grafana<br/>:3001"]
        ALERT["Alertmanager<br/>:9093"]
        GOTIFY["Gotify<br/>:8889"]
    end

    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM
    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```

---

## Quick Start

### Enable Monitoring

```bash
# Start with monitoring profile
docker compose --profile monitoring up -d

# Check services (the profile flag is needed so profiled services are listed)
docker compose --profile monitoring ps

# Access dashboards
open http://localhost:3001  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:9093  # Alertmanager
```

---

## Prometheus Configuration

### Scrape Targets

**File**: `configs/prometheus/prometheus.yml`

```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```

**Intervals:**

- **10s**: API (real-time application metrics)
- **15s**: Infrastructure (host + containers + Redis)
- **30s**: Monitoring stack itself

---

### Custom Metrics (cm_*)

**File**: `api/src/utils/metrics.ts`

**12 custom metrics** for domain-specific monitoring:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |

**HTTP metrics** (standard prom-client):

- `http_requests_total`
- `http_request_duration_seconds`

**Geocoding metrics:**

- `cm_geocode_cache_hits_total`
- `cm_geocode_cache_misses_total`
- `cm_geocode_requests_total`
- `cm_geocode_duration_seconds`

**Email template metrics:**

- `cm_email_templates_updated_total`
- `cm_email_test_sent_total`
- `cm_email_template_rollback_total`
- `cm_email_template_cache_hit/miss_total`

**Location query metrics:**

- `cm_map_location_query_duration_seconds`
- `cm_map_location_query_count_total`
- `cm_map_location_result_count`

---

### Alert Rules

**File**: `configs/prometheus/alerts.yml`

**12 alert rules** across 4 groups:

#### Application Alerts

1. **ApplicationDown**: API unreachable for 2 minutes
2. **HighErrorRate**: >10% 5xx errors for 5 minutes
3. **EmailQueueBacklog**: Queue size >100 for 10 minutes
4. **HighEmailFailureRate**: >20% email failures for 10 minutes
5. **SuspiciousLoginActivity**: >5 failed logins/sec for 2 minutes
6. **HighAPILatency**: P95 latency >2s for 5 minutes
7. **ExternalServiceDown**: External service unreachable for 5 minutes

#### System Alerts

8. **RedisDown**: Redis unreachable for 1 minute
9. **DiskSpaceLow**: <15% disk space for 5 minutes
10. **DiskSpaceCritical**: <10% disk space for 2 minutes
11. **HighCPUUsage**: >85% CPU for 10 minutes
12. **HighMemoryUsage**: >85% memory for 10 minutes

**Example Alert**:

```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```

---

### Data Retention

**docker-compose.yml**:

```yaml
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # 30 days
```

**Disk usage**: ~1-5GB for 30 days (depends on scrape frequency + cardinality).

**Increase retention**:

```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus
```

---

## Grafana Configuration

### Datasource

**File**: `configs/grafana/datasources.yml`

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

**Auto-provisioned** on Grafana startup.

---

### Dashboards

**File**: `configs/grafana/dashboards.yml`

```yaml
apiVersion: 1
providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```

**3 pre-configured dashboards**:

#### 1. Application Overview

**File**: `configs/grafana/application-overview.json`

**Panels**:

- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent

**Refresh**: 10s

---

#### 2. API Performance

**File**: `configs/grafana/api-performance.json`

**Panels**:

- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration

**Refresh**: 30s

---

#### 3. System Health

**File**: `configs/grafana/system-health.json`

**Panels**:

- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage

**Refresh**: 1m

---

### First Login

```bash
# Access Grafana
open http://localhost:3001

# Default credentials
#   Username: admin
#   Password: admin

# Change password on first login
```

**Navigate**: Dashboards → Changemaker Lite folder → Select dashboard

---

## Alertmanager Configuration

### Notification Receivers

**File**: `configs/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'

    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so that \n becomes a real newline
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

    # Gotify (push notifications)
    # Note: Alertmanager posts its own webhook JSON, while Gotify's /message
    # endpoint expects a `message` field, so a small bridge service may be
    # needed between the two.
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
```

**Grouping**: Alerts that share the same `alertname` and `severity` are combined into one notification (prevents spam).

**Repeat**: Re-sends unresolved alerts every 4 hours.

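
The effect of `group_by: ['alertname', 'severity']` can be illustrated with a small sketch. This is a simplified model and not Alertmanager's actual implementation: the `group_alerts` helper, the `instance` labels, and the severities assigned to each alert below are hypothetical, and treating a missing label as an empty string is an assumption of the sketch.

```python
from collections import defaultdict

# Mirrors route.group_by from the Alertmanager config.
GROUP_BY = ("alertname", "severity")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by the values of their group_by labels.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # Assumption: a missing label is treated as an empty string.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts; instance names and severities are made up.
alerts = [
    {"labels": {"alertname": "DiskSpaceLow", "severity": "warning", "instance": "host-a"}},
    {"labels": {"alertname": "DiskSpaceLow", "severity": "warning", "instance": "host-b"}},
    {"labels": {"alertname": "RedisDown", "severity": "critical", "instance": "redis"}},
]

grouped = group_alerts(alerts)
for key, members in grouped.items():
    # One notification per group, regardless of how many alerts it holds.
    print(key, len(members))
```

In this model the two `DiskSpaceLow` alerts land in a single group and would produce one notification, while `RedisDown` forms its own group. Adding more labels to `group_by` yields smaller, more specific groups and therefore more notifications.
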

---

### Testing Alerts

**Manual test**:

```bash
# Trigger test alert (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check Alertmanager UI
open http://localhost:9093
```

**Force alert** (stop API):

```bash
# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → Alert fires → Notification sent
```

---

## Exporters

### cAdvisor (Container Metrics)

**Metrics**:

- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O

**Access**: http://localhost:8080

**Configuration** (docker-compose.yml):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```

---

### Node Exporter (Host Metrics)

**Metrics**:

- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)

**Access**: http://localhost:9100/metrics

**Configuration**:

```yaml
node-exporter:
  command:
    # Must match the container path of the root filesystem mount below
    - '--path.rootfs=/rootfs'
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```

---

### Redis Exporter

**Metrics**:

- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys

**Access**: http://localhost:9121/metrics

**Configuration**:

```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis
```

---

## Gotify (Push Notifications)

**Setup**:

```bash
# Access Gotify UI
open http://localhost:8889

# Login (default: admin/admin)
# Create app → Copy token
# Add to Alertmanager config:
#   webhook_configs:
#     - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```

**Mobile apps**: Available for iOS/Android (receive push notifications).

---

## Accessing Services

| Service | URL | Default Credentials |
|---------|-----|---------------------|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |

---

## Troubleshooting

### Prometheus Not Scraping

**Symptoms**: Missing data in Grafana dashboards

**Diagnosis**:

```bash
# Check Prometheus targets
open http://localhost:9090/targets

# Look for errors (red) vs success (green)

# Check API metrics endpoint
curl http://localhost:4000/api/metrics
```

**Common causes**:

- API container not running
- Wrong port in `prometheus.yml`
- Network connectivity issue

**Solution**:

```bash
# Restart API
docker compose restart api

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
```

---

### Grafana Dashboards Not Loading

**Symptoms**: Blank dashboards or "No data" errors

**Diagnosis**:

```bash
# Check Grafana logs
docker compose logs grafana | tail -50

# Check datasource
open http://localhost:3001/datasources

# Test Prometheus query from the host
# (the `prometheus` hostname only resolves inside the Docker network)
curl 'http://localhost:9090/api/v1/query?query=up'
```

**Solution**:

```bash
# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana
```

---

### Alerts Not Firing

**Symptoms**: No notifications despite issues

**Diagnosis**:

```bash
# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules
```

**Solution**:

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test notification receiver (v2 API requires a JSON content type)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[...]'
```

---

## Production Best Practices

### Secure Grafana

**Change admin password**:

```bash
# Via UI: Admin → Profile → Change Password

# Via env var (docker-compose.yml):
environment:
  - GF_SECURITY_ADMIN_PASSWORD=
```

**Disable signup**:

```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false  # Already set
```

---

### Alert Tuning

**Avoid false positives**: Increase the `for` duration so an alert fires only when the condition persists.

**Example** (before):

```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 1m  # Too aggressive
```

**Example** (after):

```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 10m  # More reasonable
```

---

### External Storage (Long-Term)

**Prometheus** supports remote write to:

- **Thanos**: Long-term storage (S3/GCS)
- **Cortex**: Multi-tenant Prometheus
- **VictoriaMetrics**: High-performance storage

**Example** (Thanos):

```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
```

---

## Related Documentation

- **[Docker Compose](docker-compose.md)**: Monitoring services configuration
- **[Environment Variables](environment-variables.md)**: Monitoring env vars
- **[API Reference](../api/metrics.md)**: Custom metrics implementation