Monitoring Stack (Prometheus + Grafana)
Overview
Changemaker Lite V2 includes a complete observability stack for production monitoring:
- Prometheus: Metrics collection + alerting rules
- Grafana: Visualization + pre-configured dashboards
- Alertmanager: Alert routing + notifications
- cAdvisor: Docker container metrics
- Node Exporter: Host system metrics
- Redis Exporter: Redis-specific metrics
- Gotify: Push notifications (optional)
All monitoring services sit behind a Docker Compose profile flag, so they are opt-in and do not start with the default stack.
Architecture
graph LR
subgraph "Application Metrics"
API[API<br/>:4000/api/metrics]
MEDIA[Media API<br/>:4100/metrics]
end
subgraph "Infrastructure Metrics"
CADVISOR[cAdvisor<br/>Container Stats]
NODE[Node Exporter<br/>Host Stats]
REDIS_EXP[Redis Exporter<br/>Redis Stats]
end
subgraph "Monitoring Stack"
PROM[Prometheus<br/>:9090]
GRAFANA[Grafana<br/>:3001]
ALERT[Alertmanager<br/>:9093]
GOTIFY[Gotify<br/>:8889]
end
API --> PROM
MEDIA --> PROM
CADVISOR --> PROM
NODE --> PROM
REDIS_EXP --> PROM
PROM --> GRAFANA
PROM --> ALERT
ALERT --> GOTIFY
Quick Start
Enable Monitoring
# Start with monitoring profile
docker compose --profile monitoring up -d
# Check services
docker compose ps | grep monitoring
# Access dashboards
open http://localhost:3001 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
open http://localhost:9093 # Alertmanager
Prometheus Configuration
Scrape Targets
File: configs/prometheus/prometheus.yml
scrape_configs:
# V2 Unified API Metrics (10s interval)
- job_name: 'changemaker-v2-api'
static_configs:
- targets: ['changemaker-v2-api:4000']
metrics_path: '/api/metrics'
scrape_interval: 10s
scrape_timeout: 5s
# Redis Metrics (15s interval)
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
scrape_interval: 15s
# cAdvisor - Docker container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
scrape_interval: 15s
# Node Exporter - System metrics
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
scrape_interval: 15s
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Alertmanager monitoring
- job_name: 'alertmanager'
static_configs:
- targets: ['alertmanager:9093']
scrape_interval: 30s
Intervals:
- 10s: API (real-time application metrics)
- 15s: Infrastructure (host + containers + Redis)
- 30s: Monitoring stack itself
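Prometheus rejects a job whose explicit scrape_timeout is longer than its scrape_interval, so the pairs can be sanity-checked before a reload. The sketch below is illustrative only: ScrapeJob, toSeconds, and checkScrapeJobs are hypothetical names, only the s and m duration units used in this guide are handled, and the default interval of 1m is an assumption standing in for whatever the global section of prometheus.yml sets.

```typescript
// Hypothetical shape for one entry of scrape_configs (only the fields used here).
interface ScrapeJob {
  jobName: string;
  scrapeInterval?: string; // e.g. "10s"
  scrapeTimeout?: string; // e.g. "5s"
}

// Convert a simple duration such as "10s" or "1m" to seconds.
// Only the s and m units are handled in this sketch.
function toSeconds(duration: string): number {
  const match = /^(\d+)(s|m)$/.exec(duration);
  if (match === null) {
    throw new Error("unsupported duration: " + duration);
  }
  const amount = parseInt(match[1], 10);
  return match[2] === "m" ? amount * 60 : amount;
}

// Return the names of jobs whose explicit timeout exceeds their interval.
// Jobs that leave scrape_timeout unset are skipped in this sketch.
function checkScrapeJobs(jobs: ScrapeJob[], defaultInterval = "1m"): string[] {
  const invalid: string[] = [];
  for (const job of jobs) {
    if (job.scrapeTimeout === undefined) continue;
    const interval = toSeconds(job.scrapeInterval ?? defaultInterval);
    if (toSeconds(job.scrapeTimeout) > interval) invalid.push(job.jobName);
  }
  return invalid;
}

// Two of the jobs from the configuration above.
const scrapeJobs: ScrapeJob[] = [
  { jobName: "changemaker-v2-api", scrapeInterval: "10s", scrapeTimeout: "5s" },
  { jobName: "redis", scrapeInterval: "15s" },
];
console.log(checkScrapeJobs(scrapeJobs));
```

An empty result means every explicit timeout fits inside its interval.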
Custom Metrics (cm_*)
File: api/src/utils/metrics.ts
12 custom metrics for domain-specific monitoring:
| Metric | Type | Labels | Description |
|---|---|---|---|
| cm_emails_sent_total | Counter | campaign_id | Campaign emails sent successfully |
| cm_emails_failed_total | Counter | campaign_id, error_type | Failed email sends |
| cm_email_queue_size | Gauge | - | Current email queue size |
| cm_email_send_duration_seconds | Histogram | - | Email send latency |
| cm_login_attempts_total | Counter | status | Login attempts (success/failure) |
| cm_active_sessions | Gauge | - | Active refresh tokens |
| cm_campaign_emails_total | Counter | campaign_id | Total campaign emails created |
| cm_response_submissions_total | Counter | - | Response wall submissions |
| cm_canvass_visits_total | Counter | outcome | Canvass visits by outcome |
| cm_active_canvass_sessions | Gauge | - | Active canvass sessions |
| cm_shift_signups_total | Counter | - | Shift signups |
| cm_external_service_up | Gauge | service | External service health (1=up, 0=down) |
HTTP metrics (standard prom-client):
- http_requests_total
- http_request_duration_seconds
Geocoding metrics:
- cm_geocode_cache_hits_total
- cm_geocode_cache_misses_total
- cm_geocode_requests_total
- cm_geocode_duration_seconds
Email template metrics:
- cm_email_templates_updated_total
- cm_email_test_sent_total
- cm_email_template_rollback_total
- cm_email_template_cache_hit/miss_total
Location query metrics:
- cm_map_location_query_duration_seconds
- cm_map_location_query_count_total
- cm_map_location_result_count
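The metrics endpoint serves these values in the Prometheus text exposition format, one sample per line. To make the format concrete, here is a minimal parsing sketch. It is not the project's code: Sample and parseSamples are hypothetical names, the sample text and label values are made up, and escaped quotes in label values and trailing timestamps are deliberately not handled.

```typescript
// One parsed sample from the Prometheus text exposition format.
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parse sample lines of the form: name{label="value",...} 123
// Comment lines (# HELP / # TYPE) and blank lines are skipped.
function parseSamples(text: string): Sample[] {
  const samples: Sample[] = [];
  const linePattern = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const match = linePattern.exec(line);
    if (match === null) continue;
    const labels: Record<string, string> = {};
    const labelText = match[2] || "";
    // Walk every key="value" pair inside the braces.
    let label = labelPattern.exec(labelText);
    while (label !== null) {
      labels[label[1]] = label[2];
      label = labelPattern.exec(labelText);
    }
    samples.push({ name: match[1], labels: labels, value: Number(match[3]) });
  }
  return samples;
}

// Made-up exposition text using metric names from the table above.
const exposition = [
  "# HELP cm_email_queue_size Current email queue size",
  "# TYPE cm_email_queue_size gauge",
  "cm_email_queue_size 7",
  'cm_emails_sent_total{campaign_id="abc"} 42',
  'cm_external_service_up{service="smtp"} 1',
].join("\n");
const parsedSamples = parseSamples(exposition);
console.log(parsedSamples);
```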
Alert Rules
File: configs/prometheus/alerts.yml
12 alert rules across 4 groups:
Application Alerts
- ApplicationDown: API unreachable for 2 minutes
- HighErrorRate: >10% 5xx errors for 5 minutes
- EmailQueueBacklog: Queue size >100 for 10 minutes
- HighEmailFailureRate: >20% email failures for 10 minutes
- SuspiciousLoginActivity: >5 failed logins/sec for 2 minutes
- HighAPILatency: P95 latency >2s for 5 minutes
- ExternalServiceDown: External service unreachable for 5 minutes
System Alerts
- RedisDown: Redis unreachable for 1 minute
- DiskSpaceLow: <15% disk space for 5 minutes
- DiskSpaceCritical: <10% disk space for 2 minutes
- HighCPUUsage: >85% CPU for 10 minutes
- HighMemoryUsage: >85% memory for 10 minutes
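The percentage thresholds above come down to a ratio of counter increases over a time window. The following sketch shows that arithmetic for the email failure case, assuming the rule divides failures by total attempts; the actual expression lives in alerts.yml and may differ. counterIncrease and failureRatio are hypothetical helper names and the readings are made up.

```typescript
// Increase of a monotonically growing counter between two readings.
// If the counter went backwards, the process is assumed to have restarted,
// and the newer reading is treated as the increase since the reset.
function counterIncrease(earlier: number, later: number): number {
  return later >= earlier ? later - earlier : later;
}

// Fraction of email attempts that failed within the window.
// Returns 0 when nothing was attempted, to avoid dividing by zero.
function failureRatio(sentIncrease: number, failedIncrease: number): number {
  const attempts = sentIncrease + failedIncrease;
  return attempts === 0 ? 0 : failedIncrease / attempts;
}

// Illustrative readings of cm_emails_sent_total and cm_emails_failed_total
// taken 10 minutes apart.
const sentIncrease = counterIncrease(1200, 1280);
const failedIncrease = counterIncrease(40, 70);
const emailFailureRatio = failureRatio(sentIncrease, failedIncrease);
console.log(emailFailureRatio > 0.2 ? "above 20% threshold" : "below 20% threshold");
```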
Example Alert:
- alert: ApplicationDown
expr: up{job="changemaker-v2-api"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "V2 API is down"
description: "The Changemaker V2 API has been down for more than 2 minutes."
Data Retention
docker-compose.yml:
prometheus:
command:
- '--storage.tsdb.retention.time=30d' # 30 days
Disk usage: ~1-5GB for 30 days (depends on scrape frequency + cardinality).
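The estimate can be reproduced with the sizing rule from the Prometheus storage documentation: retention seconds times ingested samples per second times bytes per sample, where a sample typically costs 1 to 2 bytes on disk. The sketch below applies it; estimateDiskBytes is a hypothetical helper, and the series count is an assumed workload, not a measurement of this stack.

```typescript
// Rough on-disk size for a Prometheus TSDB:
//   bytes = retention_seconds * ingested_samples_per_second * bytes_per_sample
function estimateDiskBytes(
  activeSeries: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample = 2,
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Assumed workload: 10,000 active series scraped every 15s.
const thirtyDayBytes = estimateDiskBytes(10000, 15, 30);
const ninetyDayBytes = estimateDiskBytes(10000, 15, 90);
console.log((thirtyDayBytes / 1e9).toFixed(2) + " GB for 30d");
console.log((ninetyDayBytes / 1e9).toFixed(2) + " GB for 90d");
```

With these assumptions, 30 days comes to roughly 3.5 GB, and tripling retention triples the footprint, which is worth checking against free disk before raising the retention flag.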
Increase retention:
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'
# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus
Grafana Configuration
Datasource
File: configs/grafana/datasources.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
Auto-provisioned on Grafana startup.
Dashboards
File: configs/grafana/dashboards.yml
apiVersion: 1
providers:
- name: 'Default'
folder: 'Changemaker Lite'
type: file
options:
path: /etc/grafana/provisioning/dashboards
3 pre-configured dashboards:
1. Application Overview
File: configs/grafana/application-overview.json
Panels:
- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent
Refresh: 10s
2. API Performance
File: configs/grafana/api-performance.json
Panels:
- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration
Refresh: 30s
3. System Health
File: configs/grafana/system-health.json
Panels:
- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage
Refresh: 1m
First Login
# Access Grafana
open http://localhost:3001
# Default credentials
Username: admin
Password: admin
# Change password on first login
Navigate: Dashboards → Changemaker Lite folder → Select dashboard
Alertmanager Configuration
Notification Receivers
File: configs/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
receiver: 'default'
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receivers:
- name: 'default'
# Email (example)
email_configs:
- to: 'admin@cmlite.org'
from: 'alerts@cmlite.org'
smarthost: 'smtp.example.com:587'
auth_username: 'alerts@cmlite.org'
auth_password: 'your-password'
# Slack (example)
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
# Gotify (push notifications)
webhook_configs:
- url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
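Grouping: Alerts that share the same alertname and severity are batched into one notification, which prevents notification spam.
Repeat: Alerts that remain unresolved are re-sent every 4 hours (repeat_interval).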
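To illustrate what group_by does, the sketch below buckets alerts by the values of the listed labels, so alerts from different instances land in the same group. It is a simplified model, not Alertmanager's implementation: Alert, groupKey, and groupAlerts are hypothetical names, and the severities and instance names in the sample data are made up.

```typescript
// Minimal alert shape: only labels matter for grouping.
interface Alert {
  labels: Record<string, string>;
}

// Build a group key from the labels listed in groupBy, in order.
function groupKey(alert: Alert, groupBy: string[]): string {
  return groupBy.map((name) => name + "=" + (alert.labels[name] || "")).join(",");
}

// Bucket alerts that share the same values for every groupBy label.
function groupAlerts(alerts: Alert[], groupBy: string[]): Record<string, Alert[]> {
  const groups: Record<string, Alert[]> = {};
  for (const alert of alerts) {
    const key = groupKey(alert, groupBy);
    if (groups[key] === undefined) groups[key] = [];
    groups[key].push(alert);
  }
  return groups;
}

// Three firing alerts; the two CPU alerts differ only by instance.
const firingAlerts: Alert[] = [
  { labels: { alertname: "HighCPUUsage", severity: "warning", instance: "host-a" } },
  { labels: { alertname: "HighCPUUsage", severity: "warning", instance: "host-b" } },
  { labels: { alertname: "RedisDown", severity: "critical", instance: "redis" } },
];
const groupedAlerts = groupAlerts(firingAlerts, ["alertname", "severity"]);
console.log(Object.keys(groupedAlerts));
```

Each key in the result corresponds to one notification, so the two CPU alerts would arrive together.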
Testing Alerts
Manual test:
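# Trigger test alert (Alertmanager 0.27+ serves only the v2 API and expects a JSON content type)
curl -X POST http://localhost:9093/api/v2/alerts \
-H 'Content-Type: application/json' \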
-d '[{
"labels": {"alertname":"TestAlert","severity":"warning"},
"annotations": {"summary":"Test alert from curl"}
}]'
# Check Alertmanager UI
open http://localhost:9093
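The same test alert can be assembled in a script. The sketch below only builds the request description and does not send it, so it can be passed to whichever HTTP client is at hand. It targets /api/v2/alerts, which is an assumption about the Alertmanager version in use; adjust the path if yours differs. PostableAlert, HttpRequest, and buildTestAlertRequest are hypothetical names.

```typescript
// Alert shape for Alertmanager's POST alerts endpoint
// (only the fields used in this sketch).
interface PostableAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Plain description of an HTTP request, independent of any client library.
interface HttpRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Describe the HTTP request for a test alert without sending it.
// baseUrl defaults to the host-mapped Alertmanager port from this guide.
function buildTestAlertRequest(summary: string, baseUrl = "http://localhost:9093"): HttpRequest {
  const alerts: PostableAlert[] = [
    {
      labels: { alertname: "TestAlert", severity: "warning" },
      annotations: { summary: summary },
    },
  ];
  return {
    url: baseUrl + "/api/v2/alerts",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(alerts),
  };
}

const testAlertRequest = buildTestAlertRequest("Test alert from script");
console.log(testAlertRequest.method, testAlertRequest.url);
console.log(testAlertRequest.body);
```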
Force alert (stop API):
# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api
# Check Prometheus alerts
open http://localhost:9090/alerts
# Wait 2 minutes → Alert fires → Notification sent
Exporters
cAdvisor (Container Metrics)
Metrics:
- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O
Access: http://localhost:8080
Configuration (docker-compose.yml):
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor-changemaker
privileged: true # Required for full access
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
devices:
- /dev/kmsg
Node Exporter (Host Metrics)
Metrics:
- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)
Access: http://localhost:9100/metrics
Configuration:
node-exporter:
command:
- '--path.rootfs=/host'
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
Redis Exporter
Metrics:
- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys
Access: http://localhost:9121/metrics
Configuration:
redis-exporter:
environment:
- REDIS_ADDR=redis:6379
- REDIS_PASSWORD=${REDIS_PASSWORD} # Authenticates with Redis
Gotify (Push Notifications)
Setup:
# Access Gotify UI
open http://localhost:8889
# Login (default: admin/admin)
# Create app → Copy token
# Add to Alertmanager config:
webhook_configs:
- url: 'http://gotify:80/message?token=YOUR_TOKEN'
Mobile apps: Available for iOS/Android (receive push notifications).
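Alertmanager's webhook body and Gotify's message body use different field names, so a small relay is sometimes placed between the two. Whether this deployment needs one is not stated here; the sketch below only shows one possible field mapping. toGotifyMessage is a hypothetical function, the interfaces cover just the fields used, and the priority numbers are arbitrary.

```typescript
// Subset of the JSON body Alertmanager sends to webhook receivers.
interface WebhookPayload {
  status: string; // "firing" or "resolved"
  commonLabels: Record<string, string>;
  alerts: { annotations: Record<string, string> }[];
}

// Fields accepted by Gotify's POST /message endpoint.
interface GotifyMessage {
  title: string;
  message: string;
  priority: number;
}

// Map one Alertmanager notification to one Gotify message.
// The priority values are arbitrary choices for this sketch.
function toGotifyMessage(payload: WebhookPayload): GotifyMessage {
  const alertName = payload.commonLabels["alertname"] || "Alert";
  const critical = payload.commonLabels["severity"] === "critical";
  const summaries = payload.alerts.map((a) => a.annotations["summary"] || "(no summary)");
  return {
    title: "[" + payload.status.toUpperCase() + "] " + alertName,
    message: summaries.join("\n"),
    priority: payload.status === "firing" && critical ? 8 : 4,
  };
}

// Hand-written payload modelled on the ApplicationDown rule in this guide.
const examplePayload: WebhookPayload = {
  status: "firing",
  commonLabels: { alertname: "ApplicationDown", severity: "critical" },
  alerts: [{ annotations: { summary: "V2 API is down" } }],
};
const gotifyMessage = toGotifyMessage(examplePayload);
console.log(gotifyMessage);
```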
Accessing Services
| Service | URL | Default Credentials |
|---|---|---|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |
Troubleshooting
Prometheus Not Scraping
Symptoms: Missing data in Grafana dashboards
Diagnosis:
# Check Prometheus targets
open http://localhost:9090/targets
# Look for errors (red) vs success (green)
# Check API metrics endpoint
curl http://localhost:4000/api/metrics
Common causes:
- API container not running
- Wrong port in prometheus.yml
- Network connectivity issue
Solution:
# Restart API
docker compose restart api
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Or restart Prometheus
docker compose restart prometheus
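The target check from the diagnosis step can also be scripted against Prometheus' GET /api/v1/targets endpoint, which reports a health value and the last scrape error per target. The sketch below filters such a response; TargetsResponse covers only the fields used, unhealthyTargets is a hypothetical helper, and the sample object is hand-written rather than real output.

```typescript
// Subset of the response from Prometheus' GET /api/v1/targets endpoint.
interface TargetsResponse {
  status: string;
  data: {
    activeTargets: {
      labels: Record<string, string>;
      scrapeUrl: string;
      health: string; // "up", "down", or "unknown"
      lastError: string;
    }[];
  };
}

// List one line per target that is not currently healthy.
function unhealthyTargets(response: TargetsResponse): string[] {
  return response.data.activeTargets
    .filter((t) => t.health !== "up")
    .map((t) => (t.labels["job"] || "unknown-job") + " " + t.scrapeUrl + ": " + t.lastError);
}

// Hand-written sample in the shape of the API response.
const sampleTargets: TargetsResponse = {
  status: "success",
  data: {
    activeTargets: [
      {
        labels: { job: "changemaker-v2-api" },
        scrapeUrl: "http://changemaker-v2-api:4000/api/metrics",
        health: "down",
        lastError: "connection refused",
      },
      {
        labels: { job: "redis" },
        scrapeUrl: "http://redis-exporter:9121/metrics",
        health: "up",
        lastError: "",
      },
    ],
  },
};
console.log(unhealthyTargets(sampleTargets));
```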
Grafana Dashboards Not Loading
Symptoms: Blank dashboards or "No data" errors
Diagnosis:
# Check Grafana logs
docker compose logs grafana | tail -50
# Check datasource
open http://localhost:3001/datasources
# Test Prometheus query from the host (quote the URL so the shell does not expand the ?)
curl 'http://localhost:9090/api/v1/query?query=up'
Solution:
# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)
# Restart Grafana
docker compose restart grafana
Alerts Not Firing
Symptoms: No notifications despite issues
Diagnosis:
# Check Prometheus alerts
open http://localhost:9090/alerts
# Check Alertmanager
open http://localhost:9093
# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules
Solution:
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
# Test notification receiver
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[...]'
Production Best Practices
Secure Grafana
Change admin password:
# Via UI: Admin → Profile → Change Password
# Via env var (docker-compose.yml):
environment:
- GF_SECURITY_ADMIN_PASSWORD=<strong-password>
Disable signup:
environment:
- GF_USERS_ALLOW_SIGN_UP=false # Already set
Alert Tuning
Avoid false positives: Increase the "for" duration on critical alerts so that short-lived spikes do not trigger them.
Example (before):
- alert: DiskSpaceLow
expr: disk_free_percent < 15
for: 1m # Too aggressive
Example (after):
- alert: DiskSpaceLow
expr: disk_free_percent < 15
for: 10m # More reasonable
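The effect of the longer duration can be seen in a simplified model of how a rule moves between inactive, pending, and firing: it becomes pending when the expression first matches and fires only once the match has held continuously for the configured duration. The sketch below replays a short dip against both settings. alertState and Evaluation are hypothetical names, and the evaluation data is made up.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// One rule evaluation: when it ran (in seconds) and whether expr matched.
interface Evaluation {
  atSeconds: number;
  conditionTrue: boolean;
}

// Replay evaluations in time order and return the state after the last one.
// The alert is pending while the condition holds and fires only once it has
// held continuously for at least forSeconds; any false evaluation resets it.
function alertState(evaluations: Evaluation[], forSeconds: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = "inactive";
  for (const evaluation of evaluations) {
    if (!evaluation.conditionTrue) {
      activeSince = null;
      state = "inactive";
      continue;
    }
    if (activeSince === null) activeSince = evaluation.atSeconds;
    state = evaluation.atSeconds - activeSince >= forSeconds ? "firing" : "pending";
  }
  return state;
}

// A 3-minute dip below the disk threshold, evaluated once a minute.
const diskDip: Evaluation[] = [
  { atSeconds: 0, conditionTrue: true },
  { atSeconds: 60, conditionTrue: true },
  { atSeconds: 120, conditionTrue: true },
  { atSeconds: 180, conditionTrue: true },
];
console.log("for: 1m  ->", alertState(diskDip, 60));
console.log("for: 10m ->", alertState(diskDip, 600));
```

With the 1-minute setting the dip is long enough to fire, while the 10-minute setting keeps it pending, so a brief cleanup job or log rotation would not page anyone.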
External Storage (Long-Term)
Prometheus supports remote write to:
- Thanos: Long-term storage (S3/GCS)
- Cortex: Multi-tenant Prometheus
- VictoriaMetrics: High-performance storage
Example (Thanos):
# prometheus.yml
remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
Related Documentation
- Docker Compose — Monitoring services configuration
- Environment Variables — Monitoring env vars
- API Reference — Custom metrics implementation