
# Observability & Monitoring
The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.
## Overview
The Observability stack consists of:
1. **Prometheus** - Metrics collection and storage
2. **Grafana** - Visualization dashboards
3. **Alertmanager** - Alert routing and notifications
4. **Custom Metrics** - 12 domain-specific `cm_*` metrics
5. **HTTP Metrics** - Request tracking and performance
6. **Service Health** - External service monitoring
## Features
### Metrics Collection
**Custom Domain Metrics (12 total), including:**
**Counters:**
- `cm_api_uptime_seconds` - API uptime counter
- `cm_canvass_visits_total` - Total canvass visits
- `cm_campaign_emails_sent_total` - Total campaign emails sent
- `cm_geocode_requests_total` - Total geocode requests
**Gauges:**
- `cm_canvass_sessions_active` - Active canvass sessions
- `cm_email_queue_size` - Email queue depth
- `cm_geocode_queue_size` - Geocode queue depth
- `cm_external_service_health` - Service health (0/1)
**Histograms:**
- `cm_geocode_duration_seconds` - Geocoding latency
- `http_request_duration_ms` - HTTP request duration
**HTTP Metrics:**
- Request count by method/route/status
- Request duration percentiles (p50, p95, p99)
- Active requests gauge
- Error rate tracking
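
The percentiles above are not stored directly; they are estimated from cumulative histogram buckets. The following is a minimal sketch of that estimation using linear interpolation inside a bucket, which is the general approach PromQL's `histogram_quantile` takes. The `Bucket` type, the function name, and the bucket bounds used in any example are illustrative assumptions, not part of the Changemaker Lite codebase.
```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number; // upper bound; use Infinity for the "+Inf" bucket
  count: number; // cumulative count up to and including this bucket
}

// Estimate quantile q (0..1) from buckets sorted by `le` ascending.
// The lower bound of the first bucket is assumed to be 0.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0 || q < 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Nothing to interpolate inside the unbounded bucket, so fall back
      // to the highest finite bound.
      if (!Number.isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + ((rank - prevCount) / inBucket) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}
```
Because the result is interpolated, its accuracy depends on how closely the bucket bounds bracket the latencies you care about, for example the 2s threshold used by the slow-request alert.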
### Grafana Dashboards
Three pre-configured dashboards:
1. **Changemaker Lite Overview** - System-wide metrics
   - API uptime and request rates
   - Queue sizes and health
   - Active sessions
   - Error rates
2. **Canvassing Metrics** - Canvass-specific metrics
   - Active sessions over time
   - Visits by outcome
   - Session duration
   - Volunteer leaderboard
3. **External Services** - Integration health
   - Redis health
   - PostgreSQL health
   - Listmonk status
   - Geocoding providers
### Alert Rules
12 predefined alert rules, including:
**Critical Alerts:**
- API down (>5 min)
- Database unreachable
- Redis connection lost
**Warning Alerts:**
- High error rate (>5%)
- Queue backup (>1000 jobs)
- Slow requests (p95 >2s)
- Service degradation
**Info Alerts:**
- New deployment
- Service restart
- Configuration change
### Admin Interface
The Observability page (`/app/observability`) provides:
- **Metrics Tab** - Live metrics display
- **Dashboards Tab** - Embedded Grafana
- **Alerts Tab** - Active alerts and rules
## Architecture
### Backend Components
**Metrics Module:**
- `api/src/utils/metrics.ts` - Prometheus metrics definitions
- `api/src/modules/observability/observability.routes.ts` - Admin API
**Instrumentation:**
- Express middleware for HTTP metrics
- Service-level metric updates
- Queue size tracking
- External service health checks
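
As a rough illustration of the HTTP instrumentation, the sketch below times each request and reports it once the response finishes. It is not the project's actual middleware: the `Req`, `Res`, and `Observe` types are minimal stand-ins so the example compiles without Express or a metrics library, and the `observe` callback is a hypothetical sink that a real histogram's `observe()` would fill.
```typescript
// Minimal structural types so the sketch compiles without Express installed.
interface Req { method: string; path: string; }
interface Res { statusCode: number; on(event: 'finish', cb: () => void): void; }
type Next = () => void;

// Hypothetical sink for one finished request.
type Observe = (
  labels: { method: string; route: string; status: string },
  durationMs: number,
) => void;

// Returns a middleware that measures wall-clock time per request.
// `now` is injectable so the timing logic can be tested deterministically.
function httpMetricsMiddleware(observe: Observe, now: () => number = Date.now) {
  return (req: Req, res: Res, next: Next): void => {
    const start = now();
    // 'finish' fires after the response has been sent, so the status code is final.
    res.on('finish', () => {
      observe(
        { method: req.method, route: req.path, status: String(res.statusCode) },
        now() - start,
      );
    });
    next();
  };
}
```
Labelling by raw path, as done here for brevity, can create one time series per distinct URL. Labelling by the matched route pattern instead keeps label cardinality bounded.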
**Configuration:**
- `configs/prometheus/prometheus.yml` - Scrape config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/dashboards/` - Dashboard JSON
### Frontend Components
**Admin Page:**
- `admin/src/pages/ObservabilityPage.tsx` - Monitoring dashboard
- Three tabs: Metrics, Dashboards, Alerts
- Embedded Grafana iframes
- Live metric cards
**Observability Components:**
- `admin/src/components/observability/MetricsChart.tsx` - Chart component
- `admin/src/components/observability/ServiceHealthCard.tsx` - Health display
### Docker Services
**Monitoring Profile:**
Services run with `--profile monitoring`:
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    profiles: [monitoring]
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    profiles: [monitoring]
    ports: ["3001:3000"]
  alertmanager:
    image: prom/alertmanager:latest
    profiles: [monitoring]
    ports: ["9093:9093"]
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    profiles: [monitoring]
    ports: ["8080:8080"]
  node-exporter:
    image: prom/node-exporter:latest
    profiles: [monitoring]
    ports: ["9100:9100"]
  redis-exporter:
    image: oliver006/redis_exporter:latest
    profiles: [monitoring]
    ports: ["9121:9121"]
```
## Configuration
### Environment Variables
```bash
# Enable metrics
METRICS_ENABLED=true
# Prometheus
PROMETHEUS_PORT=9090
# Grafana (change the default admin credentials before exposing it publicly)
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
# Alertmanager
ALERTMANAGER_PORT=9093
```
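One way to consume these variables is a small loader that validates them at startup. This is a sketch only: `loadMonitoringConfig`, `parsePort`, and the `MonitoringConfig` shape are hypothetical names, and falling back to the port values shown above when a variable is unset is an assumption rather than documented behavior.
```typescript
interface MonitoringConfig {
  metricsEnabled: boolean;
  prometheusPort: number;
  grafanaPort: number;
  alertmanagerPort: number;
}

type Env = Record<string, string | undefined>;

// Parse a TCP port, using `fallback` when the variable is unset or empty.
function parsePort(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === '') return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be an integer between 1 and 65535, got "${raw}"`);
  }
  return port;
}

// Build the config from an environment-like object (for example process.env).
// Assumption: metrics are only enabled by the exact string "true".
function loadMonitoringConfig(env: Env): MonitoringConfig {
  return {
    metricsEnabled: env.METRICS_ENABLED === 'true',
    prometheusPort: parsePort(env.PROMETHEUS_PORT, 9090, 'PROMETHEUS_PORT'),
    grafanaPort: parsePort(env.GRAFANA_PORT, 3001, 'GRAFANA_PORT'),
    alertmanagerPort: parsePort(env.ALERTMANAGER_PORT, 9093, 'ALERTMANAGER_PORT'),
  };
}
```
Failing fast on a malformed port makes a typo in the environment file show up at boot instead of as a silent scrape failure later.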
### Prometheus Scrape Targets
```yaml
scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']
  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
```
### Alert Rules
Example alert rule:
```yaml
groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"
      - alert: HighErrorRate
        # Share of 5xx responses among all responses over the last 5 minutes
        expr: |
          sum(rate(http_request_duration_ms_count{status=~"5.."}[5m]))
            / sum(rate(http_request_duration_ms_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
```
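The `for:` clause means a rule does not fire the moment its expression becomes true: the alert is first pending, and only fires once the expression has stayed true for the whole duration. The sketch below models that lifecycle for a single alert. The `AlertTracker` type and `step` function are illustrative names, not Prometheus APIs.
```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface AlertTracker {
  state: AlertState;
  activeSince: number | null; // ms timestamp when the expression first became true
}

// Advance the tracker by one rule evaluation.
// conditionTrue: whether the rule expression returned a result at `nowMs`.
// forMs: the rule's `for` duration in milliseconds.
function step(
  tracker: AlertTracker,
  conditionTrue: boolean,
  nowMs: number,
  forMs: number,
): AlertTracker {
  // Any evaluation where the expression is false resets the alert.
  if (!conditionTrue) return { state: 'inactive', activeSince: null };
  const activeSince = tracker.activeSince ?? nowMs;
  const state: AlertState = nowMs - activeSince >= forMs ? 'firing' : 'pending';
  return { state, activeSince };
}
```
With `for: 5m`, an API that recovers after four minutes never leaves the pending state, which is what keeps short blips from paging anyone.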
## Metrics Usage
### Increment Counter
```typescript
import { metrics } from '../utils/metrics';
// Campaign email sent
metrics.campaignEmailsSent.inc();
// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });
```
### Set Gauge
```typescript
// Update queue size
metrics.emailQueueSize.set(queueSize);
// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);
// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);
```
### Observe Histogram
```typescript
// Time geocoding request
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
  throw error; // record the failure, then let the caller handle it
}
```
## Grafana Dashboards
### Dashboard Setup
Dashboards are auto-provisioned from `configs/grafana/dashboards/`:
```json
{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
```
### Accessing Dashboards
- **Direct:** http://localhost:3001 (admin/admin)
- **Embedded:** `/app/observability` → Dashboards tab
- **Subdomain:** http://grafana.cmlite.org (production)
## Alertmanager
### Alert Routing
Configure in `configs/alertmanager/alertmanager.yml`:
```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
```
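To make the routing tree concrete, the sketch below resolves which receiver an alert ends up at. It is a simplified model, not Alertmanager's implementation: it only supports exact-equality `match` entries, assumes `continue: false` on every route, and requires each route to name its receiver. The `Route` type and `resolveReceiver` function are hypothetical names.
```typescript
type Labels = Record<string, string>;

interface Route {
  receiver: string;
  match?: Labels; // every listed label must equal the alert's label
  routes?: Route[]; // child routes, tried in order
}

// A route with no `match` entries accepts every alert.
function routeMatches(route: Route, labels: Labels): boolean {
  const wanted = route.match ?? {};
  return Object.keys(wanted).every((key) => labels[key] === wanted[key]);
}

// Walk the tree: the first matching child wins, otherwise the current
// route's own receiver handles the alert.
function resolveReceiver(root: Route, labels: Labels): string {
  for (const child of root.routes ?? []) {
    if (routeMatches(child, labels)) return resolveReceiver(child, labels);
  }
  return root.receiver;
}
```
Applied to the configuration above, an alert labelled `severity: critical` resolves to `critical-alerts`, and everything else falls through to `default`.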
### Notification Channels
Supported receivers:
- **Webhook** - Gotify, Slack, Discord
- **Email** - SMTP notifications
- **PagerDuty** - Incident management
- **Opsgenie** - Alert management
## Service Health Monitoring
### External Service Checks
Monitor services via health gauges:
```typescript
// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}
// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}
```
### Docker Healthchecks
Services with healthchecks:
- **API** - `wget --spider http://localhost:4000/health`
- **Media API** - `wget --spider http://localhost:4100/health`
- **PostgreSQL** - `pg_isready`
- **Redis** - `redis-cli ping`
- **Listmonk** - `wget --spider http://localhost:9000/health`
## Performance Monitoring
### HTTP Request Tracking
Automatic tracking of:
- Request count by route
- Request duration percentiles
- Status code distribution
- Error rates
### Queue Monitoring
Track queue depths:
- Email queue size
- Geocode queue size
- Failed job count
- Processing rate
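
A processing rate is typically derived from a monotonically increasing counter by comparing samples over time. The sketch below shows one way to compute a per-second rate from raw samples while tolerating counter resets, such as a worker restart. The `Sample` type and `perSecondRate` function are illustrative, and unlike PromQL's `rate()` this version does not extrapolate to the edges of the time window.
```typescript
interface Sample {
  timestampMs: number;
  value: number; // counter value at that time
}

// Average per-second increase across samples ordered by time.
function perSecondRate(samples: Sample[]): number {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the counter restarted from zero, so the new value
    // itself is the increase since the reset.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds =
    (samples[samples.length - 1].timestampMs - samples[0].timestampMs) / 1000;
  return seconds > 0 ? increase / seconds : NaN;
}
```
Comparing this rate against the current queue depth gives a rough estimate of how long a backlog will take to drain.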
### Resource Monitoring
Via cAdvisor and Node Exporter:
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
## Admin Interface
### Metrics Tab
Display cards:
- API uptime
- Request rate (req/sec)
- Error rate (%)
- Queue sizes
- Active sessions
- Service health
### Dashboards Tab
Embedded Grafana:
- Overview dashboard
- Canvassing metrics
- External services
- Custom queries
### Alerts Tab
Active alerts list:
- Alert name
- Severity
- Status (firing/pending/resolved)
- Duration
- Quick actions (silence, resolve)
## Starting Monitoring Stack
```bash
# Start with monitoring profile
docker compose --profile monitoring up -d
# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093
```
## API Endpoints
### Observability Endpoints
```
GET /api/observability/prometheus # Prometheus status
GET /api/observability/grafana # Grafana status
GET /api/observability/alertmanager # Alertmanager status
GET /api/observability/metrics # Current metrics values
```
### Metrics Endpoint
```
GET /metrics # Prometheus scrape endpoint
```
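The scrape endpoint returns plain text in the Prometheus exposition format, one sample per line. For quick checks or tests, a small parser like the sketch below can turn that text into objects. It is deliberately simplified: it skips `# HELP` and `# TYPE` lines, ignores optional timestamps, and does not unescape label values. `parseExposition` and `ParsedSample` are hypothetical names.
```typescript
interface ParsedSample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// The format spells infinities as "+Inf" and "-Inf", which Number() rejects.
function parseValue(raw: string): number {
  if (raw === '+Inf') return Infinity;
  if (raw === '-Inf') return -Infinity;
  return Number(raw); // "NaN" and malformed input both yield NaN
}

function parseExposition(text: string): ParsedSample[] {
  const samples: ParsedSample[] = [];
  // metric_name, optional {labels}, then the sample value
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // comments and metadata
    const m = lineRe.exec(line);
    if (!m) continue; // skip anything that does not look like a sample
    const labels: Record<string, string> = {};
    if (m[2] !== undefined) {
      labelRe.lastIndex = 0;
      let lm: RegExpExecArray | null;
      while ((lm = labelRe.exec(m[2])) !== null) {
        labels[lm[1]] = lm[2];
      }
    }
    samples.push({ name: m[1], labels, value: parseValue(m[3]) });
  }
  return samples;
}
```
A test could fetch `/metrics`, parse the body, and assert that `cm_external_service_health` reports `1` for each expected service.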
## Related Documentation
- [Observability Page](../../frontend/pages/admin/observability-page.md)
- [Metrics Utilities](../../backend/utilities/index.md)
- [Docker Compose](../../deployment/docker-compose.md)
- [Monitoring Stack](../../deployment/monitoring-stack.md)
- [Healthchecks](../../deployment/healthchecks.md)
- [Performance Optimization](../../troubleshooting/performance-optimization.md)