# Monitoring Stack (Prometheus + Grafana)

## Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:

- **Prometheus**: Metrics collection + alerting rules
- **Grafana**: Visualization + pre-configured dashboards
- **Alertmanager**: Alert routing + notifications
- **cAdvisor**: Docker container metrics
- **Node Exporter**: Host system metrics
- **Redis Exporter**: Redis-specific metrics
- **Gotify**: Push notifications (optional)

**All monitoring services** sit behind the `monitoring` Docker Compose profile, so they are opt-in.
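
As a rough illustration of how the opt-in works, a service joins a Compose profile through the `profiles` key. The sketch below is an assumption about how the project's `docker-compose.yml` might look, not a copy of it: the image names, tags, and port mappings are illustrative, and only the `monitoring` profile name and the host ports come from this guide.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest    # assumed image/tag
    profiles: ["monitoring"]         # started only when the profile is enabled
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest    # assumed image/tag
    profiles: ["monitoring"]
    ports:
      - "3001:3000"                  # Grafana listens on 3000 inside the container by default
```

Services without a `profiles` key always start. Those with one start only when the profile is enabled, either with `--profile monitoring` or through the `COMPOSE_PROFILES` environment variable.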

---

## Architecture

```mermaid
graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end

    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end

    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end

    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM

    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```

---

## Quick Start

### Enable Monitoring

```bash
# Start with monitoring profile
docker compose --profile monitoring up -d

# Check services
docker compose --profile monitoring ps

# Access dashboards (`open` is macOS; use `xdg-open` on Linux)
open http://localhost:3001  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:9093  # Alertmanager
```

---

## Prometheus Configuration

### Scrape Targets

**File**: `configs/prometheus/prometheus.yml`

```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```

**Intervals:**
- **10s**: API (real-time application metrics)
- **15s**: Infrastructure (host + containers + Redis)
- **30s**: Alertmanager (the Prometheus self-scrape job sets no interval, so it inherits the global default)
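
The scrape jobs above are only one part of `prometheus.yml`. A minimal sketch of the surrounding sections follows, assuming a 15s global default and the rule file path used by the `promtool` command in the troubleshooting section. The actual values in the project's file may differ.

```yaml
global:
  scrape_interval: 15s        # assumed default; jobs without their own interval inherit it
  evaluation_interval: 15s    # assumed; how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml    # container path, assumed from the promtool example

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # where fired alerts are sent
```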

---

### Custom Metrics (cm_*)

**File**: `api/src/utils/metrics.ts`

**12 custom metrics** for domain-specific monitoring:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |

**HTTP metrics** (recorded with prom-client):
- `http_requests_total`
- `http_request_duration_seconds`

**Geocoding metrics:**
- `cm_geocode_cache_hits_total`
- `cm_geocode_cache_misses_total`
- `cm_geocode_requests_total`
- `cm_geocode_duration_seconds`

**Email template metrics:**
- `cm_email_templates_updated_total`
- `cm_email_test_sent_total`
- `cm_email_template_rollback_total`
- `cm_email_template_cache_hit_total`
- `cm_email_template_cache_miss_total`

**Location query metrics:**
- `cm_map_location_query_duration_seconds`
- `cm_map_location_query_count_total`
- `cm_map_location_result_count`
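
These counters and histograms are most useful as rates and ratios. The recording rules below are a sketch of how such derived series could be precomputed. The group name and the `cm:*` rule names are hypothetical; only the source metric names come from the lists above.

```yaml
groups:
  - name: cm_derived_metrics          # hypothetical group name
    rules:
      # Share of email sends that failed over the last 5 minutes (0..1)
      - record: cm:email_failure_ratio:rate5m
        expr: |
          sum(rate(cm_emails_failed_total[5m]))
          /
          (
            sum(rate(cm_emails_sent_total[5m]))
            + sum(rate(cm_emails_failed_total[5m]))
          )

      # Geocoding cache hit ratio over the last 5 minutes (0..1)
      - record: cm:geocode_cache_hit_ratio:rate5m
        expr: |
          sum(rate(cm_geocode_cache_hits_total[5m]))
          /
          (
            sum(rate(cm_geocode_cache_hits_total[5m]))
            + sum(rate(cm_geocode_cache_misses_total[5m]))
          )

      # P95 email send latency in seconds
      - record: cm:email_send_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(cm_email_send_duration_seconds_bucket[5m]))
          )
```

The ratios evaluate to `NaN` while there is no traffic, which is worth keeping in mind when building alerts or dashboard thresholds on top of them.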

---

### Alert Rules

**File**: `configs/prometheus/alerts.yml`

**12 alert rules** across 4 groups, summarized here under two headings:

#### Application Alerts
1. **ApplicationDown**: API unreachable for 2 minutes
2. **HighErrorRate**: >10% 5xx errors for 5 minutes
3. **EmailQueueBacklog**: Queue size >100 for 10 minutes
4. **HighEmailFailureRate**: >20% email failures for 10 minutes
5. **SuspiciousLoginActivity**: >5 failed logins/sec for 2 minutes
6. **HighAPILatency**: P95 latency >2s for 5 minutes
7. **ExternalServiceDown**: External service unreachable for 5 minutes

#### System Alerts
8. **RedisDown**: Redis unreachable for 1 minute
9. **DiskSpaceLow**: <15% disk space for 5 minutes
10. **DiskSpaceCritical**: <10% disk space for 2 minutes
11. **HighCPUUsage**: >85% CPU for 10 minutes
12. **HighMemoryUsage**: >85% memory for 10 minutes

**Example Alert**:
```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```
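
For comparison, here is how three of the other listed alerts might be written. This is a sketch, not the contents of `alerts.yml`: the group name, the severity labels, and the `status_code` label on `http_requests_total` are assumptions, while the thresholds and `for` durations follow the list above.

```yaml
groups:
  - name: application_alerts          # hypothetical group name
    rules:
      # More than 10% of requests return 5xx for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="changemaker-v2-api", status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="changemaker-v2-api"}[5m]))
          > 0.10
        for: 5m
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "High 5xx error rate on the V2 API"

      # Email queue stays above 100 items for 10 minutes
      - alert: EmailQueueBacklog
        expr: cm_email_queue_size > 100
        for: 10m
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "Email queue backlog"

      # P95 request latency above 2 seconds for 5 minutes
      - alert: HighAPILatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="changemaker-v2-api"}[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "High P95 API latency"
```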

---

### Data Retention

**docker-compose.yml**:
```yaml
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # 30 days
```

**Disk usage**: ~1-5GB for 30 days (depends on scrape frequency + cardinality).

**Increase retention**:
```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus
```
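
Prometheus also accepts a size-based limit through `--storage.tsdb.retention.size`. When both limits are set, data is removed once either one is reached. The sketch below assumes the config file lives at `/etc/prometheus/prometheus.yml` and uses an illustrative 5GB cap.

```yaml
prometheus:
  command:
    # Setting `command` replaces the image defaults, so the config file flag is restated
    - '--config.file=/etc/prometheus/prometheus.yml'   # assumed path
    - '--storage.tsdb.retention.time=30d'              # drop samples older than 30 days
    - '--storage.tsdb.retention.size=5GB'              # illustrative cap; whichever limit is hit first applies
```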

---

## Grafana Configuration

### Datasource

**File**: `configs/grafana/datasources.yml`

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

**Auto-provisioned** on Grafana startup.

---

### Dashboards

**File**: `configs/grafana/dashboards.yml`

```yaml
apiVersion: 1

providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```
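
Grafana reads provisioning files from `/etc/grafana/provisioning/datasources` and `/etc/grafana/provisioning/dashboards` inside the container. One way to wire the files in this guide to those paths is with bind mounts, sketched below. The exact mounts in the project's `docker-compose.yml` are an assumption.

```yaml
grafana:
  volumes:
    # Datasource and dashboard provider definitions
    - ./configs/grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
    - ./configs/grafana/dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro
    # Dashboard JSON files, placed where the provider's `path` option points
    - ./configs/grafana/application-overview.json:/etc/grafana/provisioning/dashboards/application-overview.json:ro
    - ./configs/grafana/api-performance.json:/etc/grafana/provisioning/dashboards/api-performance.json:ro
    - ./configs/grafana/system-health.json:/etc/grafana/provisioning/dashboards/system-health.json:ro
```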

**3 pre-configured dashboards**:

#### 1. Application Overview
**File**: `configs/grafana/application-overview.json`

**Panels**:
- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent

**Refresh**: 10s

---

#### 2. API Performance
**File**: `configs/grafana/api-performance.json`

**Panels**:
- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration

**Refresh**: 30s

---

#### 3. System Health
**File**: `configs/grafana/system-health.json`

**Panels**:
- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage

**Refresh**: 1m

---

### First Login

```bash
# Access Grafana
open http://localhost:3001

# Default credentials:
#   Username: admin
#   Password: admin

# Change the password when prompted on first login
```

**Navigate**: Dashboards → Changemaker Lite folder → Select dashboard

---

## Alertmanager Configuration

### Notification Receivers

**File**: `configs/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'

    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        # Double quotes so that \n is parsed as a newline rather than a literal backslash-n
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

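    # Gotify (push notifications)
    # Note: Alertmanager posts its own JSON payload, which differs from the
    # `message` field that Gotify's /message endpoint expects; a small bridge
    # service between the two may be needed.
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'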
```

**Grouping**: Combines similar alerts (prevents spam).

**Repeat**: Re-sends unresolved alerts every 4 hours.
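
Grouping and repeat intervals can also be tuned per severity with child routes, and inhibition rules can silence a lower-severity alert while its critical counterpart fires. The fragment below is a sketch: the `critical-alerts` receiver name and its timings are hypothetical, and it assumes both disk alerts carry an `instance` label.

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts: notify sooner and repeat more often
    - matchers:
        - severity="critical"
      receiver: 'critical-alerts'    # hypothetical receiver
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default'
  - name: 'critical-alerts'          # add email/Slack/webhook configs as needed

inhibit_rules:
  # Suppress DiskSpaceLow while DiskSpaceCritical fires for the same instance
  - source_matchers: [alertname="DiskSpaceCritical"]
    target_matchers: [alertname="DiskSpaceLow"]
    equal: ['instance']
```

Alerts that match no child route fall through to the top-level `default` receiver with the top-level timings.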

---

### Testing Alerts

**Manual test**:
```bash
# Trigger a test alert (API v2; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check Alertmanager UI
open http://localhost:9093
```

**Force alert** (stop API):
```bash
# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → Alert fires → Notification sent
```

---

## Exporters

### cAdvisor (Container Metrics)

**Metrics**:
- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O

**Access**: http://localhost:8080

**Configuration** (docker-compose.yml):
```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```

---

### Node Exporter (Host Metrics)

**Metrics**:
- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)

**Access**: http://localhost:9100/metrics

**Configuration**:
```yaml
node-exporter:
  command:
    - '--path.rootfs=/rootfs'  # Must match the /:/rootfs mount below
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```

---

### Redis Exporter

**Metrics**:
- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys

**Access**: http://localhost:9121/metrics

**Configuration**:
```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis
```

---

## Gotify (Push Notifications)

**Setup**:
```bash
# Access Gotify UI
open http://localhost:8889

# Login (default: admin/admin)

# Create app → Copy token

# Add the token to the Alertmanager config (configs/alertmanager/alertmanager.yml):
#   webhook_configs:
#     - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```

**Mobile apps**: Available for iOS/Android (receive push notifications).

---

## Accessing Services

| Service | URL | Default Credentials |
|---------|-----|---------------------|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |

---

## Troubleshooting

### Prometheus Not Scraping

**Symptoms**: Missing data in Grafana dashboards

**Diagnosis**:
```bash
# Check Prometheus targets
open http://localhost:9090/targets

# Look for errors (red) vs success (green)

# Check API metrics endpoint
curl http://localhost:4000/api/metrics
```

**Common causes**:
- API container not running
- Wrong port in `prometheus.yml`
- Network connectivity issue

**Solution**:
```bash
# Restart API
docker compose restart api

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
```

---

### Grafana Dashboards Not Loading

**Symptoms**: Blank dashboards or "No data" errors

**Diagnosis**:
```bash
# Check Grafana logs
docker compose logs grafana | tail -50

# Check datasource
open http://localhost:3001/datasources

# Test a Prometheus query from the host
# (the `prometheus` hostname only resolves inside the Docker network)
curl 'http://localhost:9090/api/v1/query?query=up'
```

**Solution**:
```bash
# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana
```

---

### Alerts Not Firing

**Symptoms**: No notifications despite issues

**Diagnosis**:
```bash
# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules
```

**Solution**:
```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test notification receiver (Alertmanager API v2)
curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[...]'
```

---

## Production Best Practices

### Secure Grafana

**Change admin password**:
```yaml
# Via UI: Admin → Profile → Change Password

# Via env var (docker-compose.yml); Grafana applies it when it first initializes its database:
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>
```

**Disable signup**:
```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false  # Already set
```

---

### Alert Tuning

**Avoid false positives**: Increase `for` duration in critical alerts.

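**Example** (before):
```yaml
- alert: DiskSpaceLow
  # Percent of free space on the root filesystem, from Node Exporter
  expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
  for: 1m  # Too aggressive
```

**Example** (after):
```yaml
- alert: DiskSpaceLow
  # Percent of free space on the root filesystem, from Node Exporter
  expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
  for: 10m  # More reasonable
```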

---

### External Storage (Long-Term)

**Prometheus** supports remote write to:
- **Thanos**: Long-term storage (S3/GCS)
- **Cortex**: Multi-tenant Prometheus
- **VictoriaMetrics**: High-performance storage

**Example** (Thanos):
```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
```
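
To limit what is shipped to remote storage, `remote_write` accepts `write_relabel_configs`. The sketch below keeps only application-level series. The regex is an illustrative assumption and should be adapted to whichever dashboards need long-term history.

```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    write_relabel_configs:
      # Forward only custom app metrics, HTTP metrics, and target health
      - source_labels: [__name__]
        regex: 'cm_.*|http_.*|up'
        action: keep
```

Series dropped here remain queryable locally for the configured retention period; only the remote copy is filtered.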

---

## Related Documentation

- **[Docker Compose](docker-compose.md)** — Monitoring services configuration
- **[Environment Variables](environment-variables.md)** — Monitoring env vars
- **[API Reference](../api/metrics.md)** — Custom metrics implementation