# Monitoring Stack (Prometheus + Grafana)
## Overview
Changemaker Lite V2 includes a complete observability stack for production monitoring:
- **Prometheus**: Metrics collection + alerting rules
- **Grafana**: Visualization + pre-configured dashboards
- **Alertmanager**: Alert routing + notifications
- **cAdvisor**: Docker container metrics
- **Node Exporter**: Host system metrics
- **Redis Exporter**: Redis-specific metrics
- **Gotify**: Push notifications (optional)
**All monitoring services** sit behind the `monitoring` Docker Compose profile, so they only start when you opt in.
---
## Architecture
```mermaid
graph LR
    subgraph "Application Metrics"
        API["API<br/>:4000/api/metrics"]
        MEDIA["Media API<br/>:4100/metrics"]
    end
    subgraph "Infrastructure Metrics"
        CADVISOR["cAdvisor<br/>Container Stats"]
        NODE["Node Exporter<br/>Host Stats"]
        REDIS_EXP["Redis Exporter<br/>Redis Stats"]
    end
    subgraph "Monitoring Stack"
        PROM["Prometheus<br/>:9090"]
        GRAFANA["Grafana<br/>:3001"]
        ALERT["Alertmanager<br/>:9093"]
        GOTIFY["Gotify<br/>:8889"]
    end
    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM
    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```
---
## Quick Start
### Enable Monitoring
```bash
# Start with monitoring profile
docker compose --profile monitoring up -d
# Check services
docker compose --profile monitoring ps
# Access dashboards
open http://localhost:3001 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
open http://localhost:9093 # Alertmanager
```
---
## Prometheus Configuration
### Scrape Targets
**File**: `configs/prometheus/prometheus.yml`
```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```
**Intervals:**
- **10s**: API (real-time application metrics)
- **15s**: Infrastructure (host + containers + Redis)
- **30s**: Alertmanager (the Prometheus self-scrape job sets no interval, so it uses the global default)
---
### Custom Metrics (cm_*)
**File**: `api/src/utils/metrics.ts`
**12 custom metrics** for domain-specific monitoring:
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |
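To make the table concrete, the sketch below shows roughly what a labelled counter such as `cm_emails_sent_total` looks like in the Prometheus text exposition format served at `/api/metrics`. The real service presumably relies on prom-client for this; `renderCounter` and `escapeLabelValue` are hypothetical, dependency-free helpers written only for illustration, and the sample values are invented.

```typescript
// Hypothetical helpers (not part of the project): render one counter family
// in the Prometheus text exposition format.
type LabelSet = Record<string, string>;

interface CounterSample {
  labels: LabelSet;
  value: number;
}

// The text format requires backslash, double quote, and newline to be escaped
// inside label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderCounter(metricName: string, help: string, samples: CounterSample[]): string {
  const lines: string[] = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(",");
    const labelPart = pairs.length > 0 ? `{${pairs}}` : "";
    lines.push(`${metricName}${labelPart} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

// Invented sample data: two campaigns with different send counts.
const exposition = renderCounter("cm_emails_sent_total", "Campaign emails sent successfully", [
  { labels: { campaign_id: "42" }, value: 128 },
  { labels: { campaign_id: "43" }, value: 7 },
]);
console.log(exposition);
```

Each distinct `campaign_id` value produces its own time series, which is why high-cardinality labels increase Prometheus storage and memory use.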
**HTTP metrics** (recorded with prom-client):
- `http_requests_total`
- `http_request_duration_seconds`
**Geocoding metrics:**
- `cm_geocode_cache_hits_total`
- `cm_geocode_cache_misses_total`
- `cm_geocode_requests_total`
- `cm_geocode_duration_seconds`
**Email template metrics:**
- `cm_email_templates_updated_total`
- `cm_email_test_sent_total`
- `cm_email_template_rollback_total`
- `cm_email_template_cache_hit_total`
- `cm_email_template_cache_miss_total`
**Location query metrics:**
- `cm_map_location_query_duration_seconds`
- `cm_map_location_query_count_total`
- `cm_map_location_result_count`
---
### Alert Rules
**File**: `configs/prometheus/alerts.yml`
**12 alert rules** across 4 groups, summarized here under two headings:
#### Application Alerts
1. **ApplicationDown**: API unreachable for 2 minutes
2. **HighErrorRate**: >10% 5xx errors for 5 minutes
3. **EmailQueueBacklog**: Queue size >100 for 10 minutes
4. **HighEmailFailureRate**: >20% email failures for 10 minutes
5. **SuspiciousLoginActivity**: >5 failed logins/sec for 2 minutes
6. **HighAPILatency**: P95 latency >2s for 5 minutes
7. **ExternalServiceDown**: External service unreachable for 5 minutes
#### System Alerts
8. **RedisDown**: Redis unreachable for 1 minute
9. **DiskSpaceLow**: <15% disk space for 5 minutes
10. **DiskSpaceCritical**: <10% disk space for 2 minutes
11. **HighCPUUsage**: >85% CPU for 10 minutes
12. **HighMemoryUsage**: >85% memory for 10 minutes
**Example Alert**:
```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```
---
### Data Retention
**docker-compose.yml**:
```yaml
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d' # 30 days
```
**Disk usage**: ~1-5GB for 30 days (depends on scrape frequency + cardinality).
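A rough way to sanity-check that range is the sizing rule from the Prometheus storage documentation: disk space ≈ retention seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample after compression. The sketch below applies it; `estimateTsdbBytes` is a hypothetical helper, and the 5,000 active series figure is an assumption rather than a measurement from this stack.

```typescript
// Hypothetical sizing helper based on the rule of thumb:
//   needed_disk_space = retention_seconds * samples_per_second * bytes_per_sample
function estimateTsdbBytes(
  activeSeries: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample: number = 2, // Prometheus typically averages 1-2 bytes per sample
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return samplesPerSecond * retentionSeconds * bytesPerSample;
}

// Assumed workload: 5,000 active series scraped every 15s.
const thirtyDayBytes = estimateTsdbBytes(5000, 15, 30);
const ninetyDayBytes = estimateTsdbBytes(5000, 15, 90);
console.log(`30d: ${(thirtyDayBytes / 1e9).toFixed(2)} GB`);
console.log(`90d: ${(ninetyDayBytes / 1e9).toFixed(2)} GB`);
```

Under these assumptions, 30 days comes to roughly 1.7 GB, and tripling retention triples the estimate. The live series count is available from Prometheus itself via the `prometheus_tsdb_head_series` metric.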
**Increase retention**:
```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'
# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus
```
---
## Grafana Configuration
### Datasource
**File**: `configs/grafana/datasources.yml`
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```
**Auto-provisioned** on Grafana startup.
---
### Dashboards
**File**: `configs/grafana/dashboards.yml`
```yaml
apiVersion: 1
providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```
**3 pre-configured dashboards**:
#### 1. Application Overview
**File**: `configs/grafana/application-overview.json`
**Panels**:
- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent
**Refresh**: 10s
---
#### 2. API Performance
**File**: `configs/grafana/api-performance.json`
**Panels**:
- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration
**Refresh**: 30s
---
#### 3. System Health
**File**: `configs/grafana/system-health.json`
**Panels**:
- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage
**Refresh**: 1m
---
### First Login
```bash
# Access Grafana
open http://localhost:3001
# Default credentials
#   Username: admin
#   Password: admin
# Change password on first login
```
**Navigate**: Dashboards → Changemaker Lite folder → Select dashboard
---
## Alertmanager Configuration
### Notification Receivers
**File**: `configs/alertmanager/alertmanager.yml`
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'
    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so that \n becomes a real newline between summaries
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
```
**Grouping**: Combines similar alerts (prevents spam).
**Repeat**: Re-sends unresolved alerts every 4 hours.
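The effect of `group_by: ['alertname', 'severity']` can be pictured with a small model: alerts whose values for those labels match land in the same group and go out as one notification. The sketch below is a simplified illustration, not Alertmanager's implementation (it ignores `group_wait`, `group_interval`, and routing trees); `groupAlerts` and the sample alerts are hypothetical.

```typescript
// Simplified model of Alertmanager grouping: bucket alerts by the values of
// the group_by labels. Timing (group_wait, group_interval) is not modelled.
type AlertLabels = Record<string, string>;

function groupAlerts(alerts: AlertLabels[], groupBy: string[]): Map<string, AlertLabels[]> {
  const groups = new Map<string, AlertLabels[]>();
  for (const labels of alerts) {
    // Missing labels are treated as empty strings for the purpose of the key.
    const key = groupBy.map((label) => `${label}=${labels[label] ?? ""}`).join(",");
    const bucket = groups.get(key) ?? [];
    bucket.push(labels);
    groups.set(key, bucket);
  }
  return groups;
}

// Invented sample: two hosts with high CPU plus one Redis outage.
const firingAlerts: AlertLabels[] = [
  { alertname: "HighCPUUsage", severity: "warning", instance: "host-a" },
  { alertname: "HighCPUUsage", severity: "warning", instance: "host-b" },
  { alertname: "RedisDown", severity: "critical", instance: "redis" },
];

const alertGroups = groupAlerts(firingAlerts, ["alertname", "severity"]);
for (const [key, members] of alertGroups) {
  console.log(`${key}: ${members.length} alert(s) in one notification`);
}
```

Here the two `HighCPUUsage` alerts collapse into one notification, while `RedisDown` is sent separately. Adding `instance` to `group_by` would split them again.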
---
### Testing Alerts
**Manual test**:
```bash
# Trigger test alert
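# Uses the v2 API; the v1 alerts endpoint was removed in recent Alertmanager releases
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'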
# Check Alertmanager UI
open http://localhost:9093
```
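For repeated testing it can help to generate the payload with an explicit `endsAt`, so the test alert resolves on its own instead of lingering until Alertmanager's resolve timeout. The sketch below builds such a body for the Alertmanager v2 alerts endpoint; `buildTestAlert` and the five-minute lifetime are hypothetical choices, not part of the project.

```typescript
// Hypothetical helper: builds the JSON body for POST /api/v2/alerts.
interface PostableAlert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string; // RFC 3339 timestamp
  endsAt: string; // after this time Alertmanager treats the alert as resolved
}

function buildTestAlert(now: Date, ttlMinutes: number): PostableAlert[] {
  const endsAt = new Date(now.getTime() + ttlMinutes * 60000);
  return [
    {
      labels: { alertname: "TestAlert", severity: "warning" },
      annotations: { summary: "Test alert from script" },
      startsAt: now.toISOString(),
      endsAt: endsAt.toISOString(),
    },
  ];
}

// Print the body so it can be piped to an HTTP client.
const testAlertBody = JSON.stringify(buildTestAlert(new Date(), 5), null, 2);
console.log(testAlertBody);
```

The printed JSON can be piped to `curl -X POST -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts`.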
**Force alert** (stop API):
```bash
# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api
# Check Prometheus alerts
open http://localhost:9090/alerts
# Wait 2 minutes → Alert fires → Notification sent
```
---
## Exporters
### cAdvisor (Container Metrics)
**Metrics**:
- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O
**Access**: http://localhost:8080
**Configuration** (docker-compose.yml):
```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```
---
### Node Exporter (Host Metrics)
**Metrics**:
- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)
**Access**: http://localhost:9100/metrics
**Configuration**:
```yaml
node-exporter:
  command:
    - '--path.rootfs=/rootfs' # Must match the mount point of the host root below
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```
---
### Redis Exporter
**Metrics**:
- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys
**Access**: http://localhost:9121/metrics
**Configuration**:
```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD} # Authenticates with Redis
```
---
## Gotify (Push Notifications)
**Setup**:
```bash
# Access Gotify UI
open http://localhost:8889
# Login (default: admin/admin)
# Create app → Copy token
# Add to Alertmanager config:
#   webhook_configs:
#     - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```
**Mobile apps**: Available for iOS/Android (receive push notifications).
---
## Accessing Services
| Service | URL | Default Credentials |
|---------|-----|---------------------|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |
---
## Troubleshooting
### Prometheus Not Scraping
**Symptoms**: Missing data in Grafana dashboards
**Diagnosis**:
```bash
# Check Prometheus targets
open http://localhost:9090/targets
# Look for errors (red) vs success (green)
# Check API metrics endpoint
curl http://localhost:4000/api/metrics
```
**Common causes**:
- API container not running
- Wrong port in `prometheus.yml`
- Network connectivity issue
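The same information shown on the targets page is available as JSON from Prometheus's `/api/v1/targets` endpoint, which is handy for scripted checks. The sketch below filters that response down to targets that are not `up`; `unhealthyTargets` is a hypothetical helper, the interfaces cover only the fields used here, and the sample response is invented.

```typescript
// Subset of the /api/v1/targets response shape used by this sketch.
interface ActiveTarget {
  labels: Record<string, string>;
  scrapeUrl: string;
  health: string; // "up", "down", or "unknown"
  lastError: string;
}

interface TargetsResponse {
  status: string;
  data: { activeTargets: ActiveTarget[] };
}

// Hypothetical helper: one summary line per target that is not healthy.
function unhealthyTargets(response: TargetsResponse): string[] {
  return response.data.activeTargets
    .filter((target) => target.health !== "up")
    .map((target) => {
      const job = target.labels["job"] ?? "unknown-job";
      const reason = target.lastError || "no error reported";
      return `${job} ${target.scrapeUrl}: ${reason}`;
    });
}

// Invented sample response: the API target is refusing connections.
const sampleTargets: TargetsResponse = {
  status: "success",
  data: {
    activeTargets: [
      {
        labels: { job: "changemaker-v2-api", instance: "changemaker-v2-api:4000" },
        scrapeUrl: "http://changemaker-v2-api:4000/api/metrics",
        health: "down",
        lastError: "connection refused",
      },
      {
        labels: { job: "redis", instance: "redis-exporter:9121" },
        scrapeUrl: "http://redis-exporter:9121/metrics",
        health: "up",
        lastError: "",
      },
    ],
  },
};

for (const line of unhealthyTargets(sampleTargets)) {
  console.log(line);
}
```

In practice the input would come from `http://localhost:9090/api/v1/targets`. The `lastError` text usually distinguishes a stopped container (connection refused) from a wrong port or path (HTTP 404).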
**Solution**:
```bash
# Restart API
docker compose restart api
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Or restart Prometheus
docker compose restart prometheus
```
---
### Grafana Dashboards Not Loading
**Symptoms**: Blank dashboards or "No data" errors
**Diagnosis**:
```bash
# Check Grafana logs
docker compose logs grafana | tail -50
# Check datasource
open http://localhost:3001/datasources
# Test Prometheus query
# From the host, use localhost; the name "prometheus" only resolves inside the Docker network
curl 'http://localhost:9090/api/v1/query?query=up'
```
**Solution**:
```bash
# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)
# Restart Grafana
docker compose restart grafana
```
---
### Alerts Not Firing
**Symptoms**: No notifications despite issues
**Diagnosis**:
```bash
# Check Prometheus alerts
open http://localhost:9090/alerts
# Check Alertmanager
open http://localhost:9093
# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules
```
**Solution**:
```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
# Test notification receiver
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[...]'
```
---
## Production Best Practices
### Secure Grafana
**Change admin password**:
```bash
# Via UI: Admin → Profile → Change Password
# Via env var (docker-compose.yml):
environment:
  - GF_SECURITY_ADMIN_PASSWORD=
```
**Disable signup**:
```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false # Already set
```
---
### Alert Tuning
**Avoid false positives**: Increase the `for` duration on alerts that fire on short-lived spikes, so a condition must persist before anyone is notified.
**Example** (before):
```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 1m # Too aggressive
```
**Example** (after):
```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 10m # More reasonable
```
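Why a longer `for` suppresses noise can be seen in a small model of the clause: the alert is pending while the expression holds and fires only once it has held continuously for the full duration. The sketch below is a simplification of Prometheus rule evaluation; `firstFiringTime`, the 60-second evaluation spacing, and the sample values are all assumptions made for illustration.

```typescript
// One rule evaluation: when it ran and what the expression returned.
interface EvalPoint {
  atSeconds: number;
  value: number;
}

// Simplified model of `for`: returns the time the alert first fires, or null
// if the condition (value < threshold) never holds for forSeconds in a row.
function firstFiringTime(points: EvalPoint[], threshold: number, forSeconds: number): number | null {
  let pendingSince: number | null = null;
  for (const point of points) {
    if (point.value < threshold) {
      if (pendingSince === null) pendingSince = point.atSeconds; // becomes pending
      if (point.atSeconds - pendingSince >= forSeconds) return point.atSeconds; // fires
    } else {
      pendingSince = null; // condition cleared, pending state resets
    }
  }
  return null;
}

// Invented series: free disk dips below 15% for about three minutes, then recovers.
const diskFreeDip: EvalPoint[] = [
  { atSeconds: 0, value: 20 },
  { atSeconds: 60, value: 14 },
  { atSeconds: 120, value: 13 },
  { atSeconds: 180, value: 14 },
  { atSeconds: 240, value: 18 },
  { atSeconds: 300, value: 19 },
];

console.log("for: 1m  ->", firstFiringTime(diskFreeDip, 15, 60));
console.log("for: 10m ->", firstFiringTime(diskFreeDip, 15, 600));
```

With `for: 1m` the temporary dip pages someone two minutes in; with `for: 10m` it never fires because the condition clears first. The trade-off is that a genuine problem is also reported ten minutes later.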
---
### External Storage (Long-Term)
**Prometheus** supports remote write to:
- **Thanos**: Long-term storage (S3/GCS)
- **Cortex**: Multi-tenant Prometheus
- **VictoriaMetrics**: High-performance storage
**Example** (Thanos):
```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
```
---
## Related Documentation
- **[Docker Compose](docker-compose.md)** — Monitoring services configuration
- **[Environment Variables](environment-variables.md)** — Monitoring env vars
- **[API Reference](../api/metrics.md)** — Custom metrics implementation