# Monitoring Stack (Prometheus + Grafana)

## Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:

- **Prometheus**: Metrics collection + alerting rules
- **Grafana**: Visualization + pre-configured dashboards
- **Alertmanager**: Alert routing + notifications
- **cAdvisor**: Docker container metrics
- **Node Exporter**: Host system metrics
- **Redis Exporter**: Redis-specific metrics
- **Gotify**: Push notifications (optional)

**All monitoring services** sit behind the `monitoring` Docker Compose profile, so they are opt-in.

---

## Architecture

```mermaid
graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end

    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end

    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end

    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM

    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY
```

---

## Quick Start

### Enable Monitoring

```bash
# Start with monitoring profile
docker compose --profile monitoring up -d

# Check services (profile services are only listed when the profile is active)
docker compose --profile monitoring ps

# Access dashboards (`open` is macOS; use `xdg-open` on Linux)
open http://localhost:3001  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:9093  # Alertmanager
```

---

## Prometheus Configuration

### Scrape Targets

**File**: `configs/prometheus/prometheus.yml`

```yaml
scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s
```

**Intervals:**
- **10s**: API (real-time application metrics)
- **15s**: Infrastructure (host + containers + Redis)
- **30s**: Monitoring stack itself

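
The scrape jobs above are only part of `prometheus.yml`. For alert rules to be evaluated and delivered, the file also needs `global`, `rule_files`, and `alerting` blocks. The following is a minimal sketch of what those blocks might look like, not a copy of the project file: the 30s global intervals and the `/etc/prometheus/alerts.yml` mount path are assumptions.

```yaml
# Sketch only: values below are assumptions, check the real prometheus.yml
global:
  scrape_interval: 30s      # default for jobs that set no interval of their own
  evaluation_interval: 30s  # how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml  # assumed in-container path of configs/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```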
---

### Custom Metrics (cm_*)

**File**: `api/src/utils/metrics.ts`

**12 custom metrics** for domain-specific monitoring:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cm_emails_sent_total` | Counter | `campaign_id` | Campaign emails sent successfully |
| `cm_emails_failed_total` | Counter | `campaign_id`, `error_type` | Failed email sends |
| `cm_email_queue_size` | Gauge | - | Current email queue size |
| `cm_email_send_duration_seconds` | Histogram | - | Email send latency |
| `cm_login_attempts_total` | Counter | `status` | Login attempts (success/failure) |
| `cm_active_sessions` | Gauge | - | Active refresh tokens |
| `cm_campaign_emails_total` | Counter | `campaign_id` | Total campaign emails created |
| `cm_response_submissions_total` | Counter | - | Response wall submissions |
| `cm_canvass_visits_total` | Counter | `outcome` | Canvass visits by outcome |
| `cm_active_canvass_sessions` | Gauge | - | Active canvass sessions |
| `cm_shift_signups_total` | Counter | - | Shift signups |
| `cm_external_service_up` | Gauge | `service` | External service health (1=up, 0=down) |

**HTTP metrics** (standard prom-client):
- `http_requests_total`
- `http_request_duration_seconds`

**Geocoding metrics:**
- `cm_geocode_cache_hits_total`
- `cm_geocode_cache_misses_total`
- `cm_geocode_requests_total`
- `cm_geocode_duration_seconds`

**Email template metrics:**
- `cm_email_templates_updated_total`
- `cm_email_test_sent_total`
- `cm_email_template_rollback_total`
- `cm_email_template_cache_hit/miss_total`

**Location query metrics:**
- `cm_map_location_query_duration_seconds`
- `cm_map_location_query_count_total`
- `cm_map_location_result_count`

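
Ratios over these counters are often easier to read on a dashboard than the raw totals. As a sketch, two recording rules could precompute the geocoding cache hit ratio and the email failure ratio. The group name and the `cm:...` rule names are hypothetical, and the rules would have to live in a file that Prometheus loads through `rule_files`.

```yaml
groups:
  - name: cm-recording-examples   # hypothetical group name
    rules:
      # Share of geocode lookups served from cache over the last 5 minutes
      - record: cm:geocode_cache_hit_ratio:rate5m
        expr: |
          sum(rate(cm_geocode_cache_hits_total[5m]))
            /
          (
            sum(rate(cm_geocode_cache_hits_total[5m]))
            + sum(rate(cm_geocode_cache_misses_total[5m]))
          )

      # Share of email sends that failed over the last 10 minutes
      - record: cm:email_failure_ratio:rate10m
        expr: |
          sum(rate(cm_emails_failed_total[10m]))
            /
          (
            sum(rate(cm_emails_sent_total[10m]))
            + sum(rate(cm_emails_failed_total[10m]))
          )
```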
---

### Alert Rules

**File**: `configs/prometheus/alerts.yml`

**12 alert rules** across 4 groups:

#### Application Alerts
1. **ApplicationDown**: API unreachable for 2 minutes
2. **HighErrorRate**: >10% 5xx errors for 5 minutes
3. **EmailQueueBacklog**: Queue size >100 for 10 minutes
4. **HighEmailFailureRate**: >20% email failures for 10 minutes
5. **SuspiciousLoginActivity**: >5 failed logins/sec for 2 minutes
6. **HighAPILatency**: P95 latency >2s for 5 minutes
7. **ExternalServiceDown**: External service unreachable for 5 minutes

#### System Alerts
8. **RedisDown**: Redis unreachable for 1 minute
9. **DiskSpaceLow**: <15% disk space for 5 minutes
10. **DiskSpaceCritical**: <10% disk space for 2 minutes
11. **HighCPUUsage**: >85% CPU for 10 minutes
12. **HighMemoryUsage**: >85% memory for 10 minutes

**Example Alert**:
```yaml
- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
```

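
The ratio and percentile rules in the list follow the same shape as `ApplicationDown` but need slightly more PromQL. The sketch below shows how `HighErrorRate` and `HighAPILatency` might be written against the HTTP metrics listed earlier. It is illustrative rather than the contents of `alerts.yml`: the group name and the `status_code` label are assumptions, so check the label names actually exported at `/api/metrics`.

```yaml
groups:
  - name: application-examples   # hypothetical group name
    rules:
      # Fraction of requests answered with a 5xx status over 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="changemaker-v2-api", status_code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="changemaker-v2-api"}[5m]))
            > 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 10% of API requests are failing with 5xx"

      # 95th percentile latency estimated from the histogram buckets
      - alert: HighAPILatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="changemaker-v2-api"}[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API P95 latency is above 2 seconds"
```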
---

### Data Retention

**docker-compose.yml**:
```yaml
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # 30 days
```

**Disk usage**: ~1-5GB for 30 days (depends on scrape frequency + cardinality).

**Increase retention**:
```bash
# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus
```

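
Time-based retention does not bound disk usage if cardinality grows. Prometheus also accepts a size limit, and when both limits are set, whichever is reached first triggers deletion of old blocks. A possible Compose fragment is sketched below. The 10GB cap is an arbitrary example, and the `--config.file` and `--storage.tsdb.path` values are the image defaults, repeated here on the assumption that a `command:` override replaces them.

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'  # assumed default path
    - '--storage.tsdb.path=/prometheus'               # assumed default path
    - '--storage.tsdb.retention.time=30d'             # drop data older than 30 days
    - '--storage.tsdb.retention.size=10GB'            # example cap, tune to the volume size
```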
---

## Grafana Configuration

### Datasource

**File**: `configs/grafana/datasources.yml`

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

**Auto-provisioned** on Grafana startup.

---

### Dashboards

**File**: `configs/grafana/dashboards.yml`

```yaml
apiVersion: 1

providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
```

**3 pre-configured dashboards**:

#### 1. Application Overview
**File**: `configs/grafana/application-overview.json`

**Panels**:
- API uptime (last 24h)
- Request rate (req/sec)
- Error rate (%)
- Email queue size
- Active sessions
- Campaign emails sent

**Refresh**: 10s

---

#### 2. API Performance
**File**: `configs/grafana/api-performance.json`

**Panels**:
- Request latency (P50, P95, P99)
- Requests by status code
- Top 10 slowest endpoints
- HTTP errors by route
- Geocoding cache hit rate
- Email send duration

**Refresh**: 30s

---

#### 3. System Health
**File**: `configs/grafana/system-health.json`

**Panels**:
- CPU usage (%)
- Memory usage (%)
- Disk space (GB free)
- Network I/O (MB/s)
- Container CPU throttling
- Redis memory usage

**Refresh**: 1m

---

### First Login

```bash
# Access Grafana
open http://localhost:3001

# Default credentials
#   Username: admin
#   Password: admin

# Change password on first login
```

**Navigate**: Dashboards → Changemaker Lite folder → Select dashboard

---

## Alertmanager Configuration

### Notification Receivers

**File**: `configs/alertmanager/alertmanager.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'

    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'

    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'
```

**Grouping**: Alerts that share `alertname` and `severity` are combined into one notification (prevents spam).

**Repeat**: Re-sends unresolved alerts every 4 hours.

---

### Testing Alerts

**Manual test**:
```bash
# Trigger test alert (the v1 alerts API was removed in recent Alertmanager releases; use v2)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check Alertmanager UI
open http://localhost:9093
```

**Force alert** (stop API):
```bash
# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → Alert fires → Notification sent
```

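
Both tests above exercise only part of the path: the curl call skips Prometheus, and stopping the API takes the application offline. A temporary always-firing rule is a third option that checks Prometheus → Alertmanager → receiver end to end without touching the API. The rule below is a sketch with a hypothetical name; add it to a loaded rules file, reload Prometheus, and remove it once the notification has arrived.

```yaml
# Temporary rule for testing the notification path; remove after use
- alert: NotificationPipelineTest   # hypothetical name
  expr: vector(1)                   # always returns one sample, so the alert always fires
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Test alert generated by an always-firing rule"
```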
---

## Exporters

### cAdvisor (Container Metrics)

**Metrics**:
- CPU usage per container
- Memory usage per container
- Network I/O
- Disk I/O

**Access**: http://localhost:8080

**Configuration** (docker-compose.yml):
```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg
```

---

### Node Exporter (Host Metrics)

**Metrics**:
- CPU usage (all cores)
- Memory usage (total, free, cached)
- Disk usage (filesystem, mountpoints)
- Network I/O (bytes, packets)

**Access**: http://localhost:9100/metrics

**Configuration**:
```yaml
node-exporter:
  command:
    - '--path.rootfs=/rootfs'  # must match the container path of the / mount below
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
```

---

### Redis Exporter

**Metrics**:
- Memory usage
- Commands per second
- Connected clients
- Keyspace hits/misses
- Evicted keys

**Access**: http://localhost:9121/metrics

**Configuration**:
```yaml
redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis
```

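
The exporter publishes a `redis_up` gauge that is 1 when it can reach Redis and 0 when it cannot, which is a natural basis for the `RedisDown` alert. The rule below is a sketch of how that alert might be expressed, not necessarily what `alerts.yml` contains. The second clause is an addition that also fires when the exporter itself cannot be scraped.

```yaml
- alert: RedisDown
  # redis_up == 0: exporter is running but cannot reach Redis
  # up == 0:       Prometheus cannot scrape the exporter at all
  expr: redis_up{job="redis"} == 0 or up{job="redis"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Redis is unreachable"
```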
---

## Gotify (Push Notifications)

**Setup**:
```bash
# Access Gotify UI
open http://localhost:8889

# Login (default: admin/admin)

# Create app → Copy token

# Add to Alertmanager config:
webhook_configs:
  - url: 'http://gotify:80/message?token=YOUR_TOKEN'
```

**Mobile apps**: Available for iOS/Android (receive push notifications).

---

## Accessing Services

| Service | URL | Default Credentials |
|---------|-----|---------------------|
| Prometheus | http://localhost:9090 | None |
| Grafana | http://localhost:3001 | admin / admin |
| Alertmanager | http://localhost:9093 | None |
| cAdvisor | http://localhost:8080 | None |
| Node Exporter | http://localhost:9100/metrics | None |
| Redis Exporter | http://localhost:9121/metrics | None |
| Gotify | http://localhost:8889 | admin / admin |

---

## Troubleshooting

### Prometheus Not Scraping

**Symptoms**: Missing data in Grafana dashboards

**Diagnosis**:
```bash
# Check Prometheus targets
open http://localhost:9090/targets

# Look for errors (red) vs success (green)

# Check API metrics endpoint
curl http://localhost:4000/api/metrics
```

**Common causes**:
- API container not running
- Wrong port in `prometheus.yml`
- Network connectivity issue

**Solution**:
```bash
# Restart API
docker compose restart api

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
```

---

### Grafana Dashboards Not Loading

**Symptoms**: Blank dashboards or "No data" errors

**Diagnosis**:
```bash
# Check Grafana logs
docker compose logs grafana | tail -50

# Check datasource
open http://localhost:3001/datasources

# Test Prometheus query from the host
# (the `prometheus` hostname only resolves inside the Docker network)
curl 'http://localhost:9090/api/v1/query?query=up'
```

**Solution**:
```bash
# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana
```

---

### Alerts Not Firing

**Symptoms**: No notifications despite issues

**Diagnosis**:
```bash
# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules
```

**Solution**:
```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test notification receiver (same payload as in "Testing Alerts")
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' -d '[...]'
```

---

## Production Best Practices

### Secure Grafana

**Change admin password**:
```bash
# Via UI: Admin → Profile → Change Password

# Via env var (docker-compose.yml):
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>
```

**Disable signup**:
```yaml
environment:
  - GF_USERS_ALLOW_SIGN_UP=false  # Already set
```

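
To keep the password out of `docker-compose.yml`, the value can be taken from the shell environment or an `.env` file instead. The fragment below is one way to do that. `GRAFANA_ADMIN_PASSWORD` is a hypothetical variable name, and the `:?` form makes Compose refuse to start if it is unset. Note that Grafana applies this setting when it first initializes its database, so it does not change the password of an existing installation.

```yaml
grafana:
  environment:
    # Fails fast with the given message when the variable is missing
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?set GRAFANA_ADMIN_PASSWORD in .env}
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_AUTH_ANONYMOUS_ENABLED=false  # no unauthenticated dashboard access
```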
---

### Alert Tuning

**Avoid false positives**: Increase the `for` duration on alerts that fire on short spikes. (`disk_free_percent` below is a placeholder expression.)

**Example** (before):
```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 1m  # Too aggressive
```

**Example** (after):
```yaml
- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 10m  # More reasonable
```

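
`disk_free_percent` is not a metric that Node Exporter exports directly. A version of the same rule built from Node Exporter's filesystem metrics could look like the sketch below. The `fstype` filter is an assumption about which filesystems should be ignored and may need adjusting for the host.

```yaml
- alert: DiskSpaceLow
  # Percentage of space still available to unprivileged users, per filesystem
  expr: |
    (
      node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        /
      node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
    ) * 100 < 15
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Less than 15% disk space left on {{ $labels.mountpoint }}"
```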
---

### External Storage (Long-Term)

**Prometheus** supports remote write to:
- **Thanos**: Long-term storage (S3/GCS)
- **Cortex**: Multi-tenant Prometheus
- **VictoriaMetrics**: High-performance storage

**Example** (Thanos):
```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
```

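
Shipping every series to remote storage can be costly, since cAdvisor and Node Exporter produce far more series than the application does. One option is to forward only selected metrics with `write_relabel_configs`. The regex in this sketch is an assumption about which series are worth keeping long term.

```yaml
# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    write_relabel_configs:
      # Keep application metrics and the `up` series, drop everything else
      - source_labels: [__name__]
        regex: 'cm_.*|http_request.*|up'
        action: keep
```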
---

## Related Documentation

- **[Docker Compose](docker-compose.md)**: Monitoring services configuration
- **[Environment Variables](environment-variables.md)**: Monitoring env vars
- **[API Reference](../api/metrics.md)**: Custom metrics implementation