# Monitoring Stack (Prometheus + Grafana)

## Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:
- Prometheus: Metrics collection + alerting rules
- Grafana: Visualization + pre-configured dashboards
- Alertmanager: Alert routing + notifications
- cAdvisor: Docker container metrics
- Node Exporter: Host system metrics
- Redis Exporter: Redis-specific metrics
- Gotify: Push notifications (optional)
All monitoring services sit behind a Docker Compose profile flag (opt-in).

## Architecture

```mermaid
graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end
    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end
    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end
    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM
    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```
## Quick Start

### Enable Monitoring

```bash
# Start with the monitoring profile
docker compose --profile monitoring up -d

# Check services
docker compose ps | grep monitoring

# Access dashboards
open http://localhost:3001   # Grafana (admin/admin)
open http://localhost:9090   # Prometheus
open http://localhost:9093   # Alertmanager
```
## Prometheus Configuration

### Scrape Targets

File: `configs/prometheus/prometheus.yml`

```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```

Intervals:

- 10s: API (real-time application metrics)
- 15s: infrastructure (host + containers + Redis)
- 30s: the monitoring stack itself
### Custom Metrics (cm_*)

File: `api/src/utils/metrics.ts`

12 custom metrics for domain-specific monitoring:

| Metric | Type | Labels | Description |
|---|---|---|---|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |
HTTP metrics (standard prom-client):
- http_requests_total
- http_request_duration_seconds
Geocoding metrics:
- cm_geocode_cache_hits_total
- cm_geocode_cache_misses_total
- cm_geocode_requests_total
- cm_geocode_duration_seconds
Email template metrics:
- cm_email_templates_updated_total
- cm_email_test_sent_total
- cm_email_template_rollback_total
- cm_email_template_cache_hit/miss_total
Location query metrics:
- cm_map_location_query_duration_seconds
- cm_map_location_query_count_total
- cm_map_location_result_count
### Alert Rules

File: `configs/prometheus/alerts.yml`

12 alert rules across 4 groups:
#### Application Alerts
- ApplicationDown: API unreachable for 2 minutes
- HighErrorRate: >10% 5xx errors for 5 minutes
- EmailQueueBacklog: Queue size >100 for 10 minutes
- HighEmailFailureRate: >20% email failures for 10 minutes
- SuspiciousLoginActivity: >5 failed logins/sec for 2 minutes
- HighAPILatency: P95 latency >2s for 5 minutes
- ExternalServiceDown: External service unreachable for 5 minutes
#### System Alerts
- RedisDown: Redis unreachable for 1 minute
- DiskSpaceLow: <15% disk space for 5 minutes
- DiskSpaceCritical: <10% disk space for 2 minutes
- HighCPUUsage: >85% CPU for 10 minutes
- HighMemoryUsage: >85% memory for 10 minutes
Example alert:

```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```
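For comparison, a rate-based rule such as HighErrorRate might be expressed along these lines (a sketch assuming a `status_code` label on `http_requests_total`; the actual rule lives in `alerts.yml`):

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate"
    description: "More than 10% of requests returned 5xx for 5 minutes."
```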
### Data Retention

`docker-compose.yml`:

Disk usage: ~1-5GB for 30 days (depends on scrape frequency + cardinality).

Increase retention:

```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate the container
docker compose --profile monitoring up -d --force-recreate prometheus
```
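The retention flag is passed on the Prometheus service's `command` in `docker-compose.yml`, likely in a shape similar to this (a sketch; image tag and flag list are assumptions):

```yaml
prometheus:
  image: prom/prometheus:latest
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=30d'   # raise to 90d for longer history
```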
## Grafana Configuration

### Datasource

File: `configs/grafana/datasources.yml`

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

Auto-provisioned on Grafana startup.
### Dashboards

File: `configs/grafana/dashboards.yml`

```yaml
apiVersion: 1
providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```

3 pre-configured dashboards:
#### 1. Application Overview

File: `configs/grafana/application-overview.json`

Panels:

- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent

Refresh: 10s

#### 2. API Performance

File: `configs/grafana/api-performance.json`

Panels:

- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration

Refresh: 30s

#### 3. System Health

File: `configs/grafana/system-health.json`

Panels:

- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage

Refresh: 1m
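As an illustration, the latency panels in the API Performance dashboard are typically backed by `histogram_quantile` queries like this (a sketch; the authoritative queries are in the dashboard JSON):

```promql
# P95 request latency over the last 5 minutes
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```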
### First Login

```bash
# Access Grafana
open http://localhost:3001

# Default credentials:
#   Username: admin
#   Password: admin
# Change the password on first login
```

Navigate: Dashboards → Changemaker Lite folder → select a dashboard.
## Alertmanager Configuration

### Notification Receivers

File: `configs/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'
    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
```

Grouping: combines similar alerts (prevents notification spam).

Repeat: re-sends unresolved alerts every 4 hours.
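If critical alerts should page a different receiver than warnings, the route tree can be extended with child routes, roughly like this (a sketch; the `oncall` receiver name is hypothetical):

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'oncall'       # hypothetical receiver for urgent paging
      repeat_interval: 1h      # re-notify unresolved criticals more often
```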
### Testing Alerts

Manual test:

```bash
# Trigger a test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check the Alertmanager UI
open http://localhost:9093
```

Force an alert (stop the API):

```bash
# Stop the API (triggers ApplicationDown after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → alert fires → notification sent
```
## Exporters

### cAdvisor (Container Metrics)

Metrics:

- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O

Access: http://localhost:8080

Configuration (`docker-compose.yml`):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```
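A typical query over cAdvisor data, for example per-container CPU usage (a sketch using cAdvisor's standard metric names):

```promql
# Per-container CPU usage (cores) over the last 5 minutes
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
```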
### Node Exporter (Host Metrics)

Metrics:

- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)

Access: http://localhost:9100/metrics

Configuration:

```yaml
node-exporter:
  command:
    - '--path.rootfs=/host'
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```
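The DiskSpaceLow alert described above corresponds to an expression along these lines (a sketch using Node Exporter's standard filesystem metrics; the mountpoint filter is an assumption):

```promql
# Fires when free space on / drops below 15% of filesystem size
node_filesystem_avail_bytes{mountpoint="/"}
  / node_filesystem_size_bytes{mountpoint="/"} < 0.15
```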
### Redis Exporter

Metrics:

- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys

Access: http://localhost:9121/metrics

Configuration:

```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis
```
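For example, the keyspace hit rate can be derived from the exporter's standard counters:

```promql
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```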
### Gotify (Push Notifications)

Setup:

```bash
# Access the Gotify UI
open http://localhost:8889

# Login (default: admin/admin)
# Create an app → copy its token
```

Then add the token to the Alertmanager config:

```yaml
webhook_configs:
  - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```

Mobile apps: available for iOS/Android (receive push notifications).
## Accessing Services
| Service | URL | Default Credentials |
|---|---|---|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |
## Troubleshooting

### Prometheus Not Scraping

Symptoms: missing data in Grafana dashboards

Diagnosis:

```bash
# Check Prometheus targets
open http://localhost:9090/targets
# Look for errors (red) vs success (green)

# Check the API metrics endpoint
curl http://localhost:4000/api/metrics
```

Common causes:

- API container not running
- Wrong port in prometheus.yml
- Network connectivity issue

Solution:

```bash
# Restart the API
docker compose restart api

# Reload the Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
```
### Grafana Dashboards Not Loading

Symptoms: blank dashboards or "No data" errors

Diagnosis:

```bash
# Check the Grafana logs
docker compose logs grafana | tail -50

# Check the datasource
open http://localhost:3001/datasources

# Test a Prometheus query (the prometheus hostname only resolves
# inside the Docker network, e.g. via docker compose exec)
curl http://prometheus:9090/api/v1/query?query=up
```

Solution:

```bash
# Verify the datasource URL:
# it should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana
```
### Alerts Not Firing

Symptoms: no notifications despite issues

Diagnosis:

```bash
# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules are loaded
curl http://localhost:9090/api/v1/rules
```

Solution:

```bash
# Reload the Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test the notification receiver
curl -X POST http://localhost:9093/api/v1/alerts -d '[...]'
```
## Production Best Practices

### Secure Grafana

Change the admin password:

```bash
# Via the UI: Admin → Profile → Change Password
```

Or via an env var in `docker-compose.yml`:

```yaml
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>
```

Disable signup:
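Grafana's standard env var turns signup off; placement mirrors the password example above:

```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false
```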
### Alert Tuning

Avoid false positives: increase the `for` duration in critical alerts.
Example (before):
Example (after):
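The elided before/after snippets presumably contrast something like the following (a hedged sketch; the alert name and expression are hypothetical, the real rules are in `alerts.yml`):

```yaml
# Before: short `for` window, fires on brief spikes
- alert: SomeNoisyAlert          # hypothetical
  expr: some_metric > 0.85       # hypothetical
  for: 2m

# After: longer `for` window, only fires on sustained problems
- alert: SomeNoisyAlert
  expr: some_metric > 0.85
  for: 15m
```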
### External Storage (Long-Term)

Prometheus supports remote write to:

- Thanos: long-term storage (S3/GCS)
- Cortex: multi-tenant Prometheus
- VictoriaMetrics: high-performance storage

Example (Thanos):
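The elided example likely configures `remote_write` in `prometheus.yml` against a Thanos Receive endpoint, roughly (a sketch; hostname and port are assumptions based on Thanos Receive's default remote-write port):

```yaml
remote_write:
  - url: 'http://thanos-receive:19291/api/v1/receive'   # hypothetical endpoint
```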
## Related Documentation
- Docker Compose — Monitoring services configuration
- Environment Variables — Monitoring env vars
- API Reference — Custom metrics implementation