# Observability & Monitoring

The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.

## Overview

The Observability stack consists of:

1. **Prometheus** - Metrics collection and storage
2. **Grafana** - Visualization dashboards
3. **Alertmanager** - Alert routing and notifications
4. **Custom Metrics** - 12 domain-specific `cm_*` metrics
5. **HTTP Metrics** - Request tracking and performance
6. **Service Health** - External service monitoring

## Features

### Metrics Collection

**Custom Domain Metrics (12 total):**

**Counters:**
- `cm_api_uptime_seconds` - API uptime counter
- `cm_canvass_visits_total` - Total canvass visits
- `cm_campaign_emails_sent_total` - Total campaign emails sent
- `cm_geocode_requests_total` - Total geocode requests

**Gauges:**
- `cm_canvass_sessions_active` - Active canvass sessions
- `cm_email_queue_size` - Email queue depth
- `cm_geocode_queue_size` - Geocode queue depth
- `cm_external_service_health` - Service health (0/1)

**Histograms:**
- `cm_geocode_duration_seconds` - Geocoding latency
- `http_request_duration_ms` - HTTP request duration

**HTTP Metrics:**
- Request count by method/route/status
- Request duration percentiles (p50, p95, p99)
- Active requests gauge
- Error rate tracking

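The duration percentiles listed above are derived from histogram buckets rather than from raw samples. As a rough illustration of the idea, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation, similar in spirit to PromQL's `histogram_quantile`. The `Bucket` type, the `estimateQuantile` function, and the bucket bounds are assumptions made for this example, not part of the Changemaker Lite codebase.

```typescript
// Cumulative histogram bucket: number of observations <= upperBound.
interface Bucket {
  upperBound: number; // e.g. milliseconds; Infinity for the +Inf bucket
  cumulativeCount: number;
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Buckets must be sorted by upperBound.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].cumulativeCount;
  if (total === 0) return NaN;
  const rank = q * total;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const b of buckets) {
    if (b.cumulativeCount >= rank) {
      // The +Inf bucket has no finite width, so fall back to the last finite bound.
      if (!Number.isFinite(b.upperBound)) return lowerBound;
      const inBucket = b.cumulativeCount - lowerCount;
      if (inBucket === 0) return b.upperBound;
      return lowerBound + (b.upperBound - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = b.upperBound;
    lowerCount = b.cumulativeCount;
  }
  return lowerBound;
}

// Hypothetical request-duration buckets in milliseconds.
const durationBuckets: Bucket[] = [
  { upperBound: 100, cumulativeCount: 50 },
  { upperBound: 500, cumulativeCount: 90 },
  { upperBound: 1000, cumulativeCount: 100 },
  { upperBound: Infinity, cumulativeCount: 100 },
];
const p95 = estimateQuantile(0.95, durationBuckets); // about 750 ms
```

The estimate is only as precise as the bucket layout: with few, wide buckets the interpolated p95 or p99 can be far from the true value, which is worth keeping in mind when choosing bucket bounds.
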
### Grafana Dashboards

Three pre-configured dashboards:

1. **Changemaker Lite Overview** - System-wide metrics
   - API uptime and request rates
   - Queue sizes and health
   - Active sessions
   - Error rates

2. **Canvassing Metrics** - Canvass-specific metrics
   - Active sessions over time
   - Visits by outcome
   - Session duration
   - Volunteer leaderboard

3. **External Services** - Integration health
   - Redis health
   - PostgreSQL health
   - Listmonk status
   - Geocoding providers

### Alert Rules

12 predefined alert rules:

**Critical Alerts:**
- API down (>5 min)
- Database unreachable
- Redis connection lost

**Warning Alerts:**
- High error rate (>5%)
- Queue backup (>1000 jobs)
- Slow requests (p95 >2s)
- Service degradation

**Info Alerts:**
- New deployment
- Service restart
- Configuration change

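Duration thresholds such as "API down (>5 min)" map onto the `for` clause of a Prometheus alert rule: while the condition holds but has not yet lasted for the full duration, the alert is pending, and once it has, the alert fires. The sketch below models only that transition; `AlertState` and `alertState` are illustrative names, not part of the platform's code.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Derive an alert's state from how long its condition has held.
// activeSinceMs is null when the condition is currently false.
function alertState(activeSinceMs: number | null, nowMs: number, forMs: number): AlertState {
  if (activeSinceMs === null) return 'inactive';
  return nowMs - activeSinceMs >= forMs ? 'firing' : 'pending';
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const afterOneMinute = alertState(0, 60_000, FIVE_MINUTES_MS);    // 'pending'
const afterFiveMinutes = alertState(0, 300_000, FIVE_MINUTES_MS); // 'firing'
```
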
### Admin Interface

Observability page (`/app/observability`) with:

- **Metrics Tab** - Live metrics display
- **Dashboards Tab** - Embedded Grafana
- **Alerts Tab** - Active alerts and rules

## Architecture

### Backend Components

**Metrics Module:**
- `api/src/utils/metrics.ts` - Prometheus metrics definitions
- `api/src/modules/observability/observability.routes.ts` - Admin API

**Instrumentation:**
- Express middleware for HTTP metrics
- Service-level metric updates
- Queue size tracking
- External service health checks

**Configuration:**
- `configs/prometheus/prometheus.yml` - Scrape config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/dashboards/` - Dashboard JSON

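To make the "Express middleware for HTTP metrics" item under Instrumentation more concrete, here is a minimal sketch of how such a middleware could time requests. It is not the project's actual middleware: `httpMetricsMiddleware`, `ObserveFn`, and the structural `ReqLike`/`ResLike` types are assumptions introduced so the example compiles without Express installed.

```typescript
// Minimal structural types; Express's Request/Response objects satisfy these shapes.
interface ReqLike {
  method: string;
  path: string;
  route?: { path: string }; // set by Express once a route has matched
}
interface ResLike {
  statusCode: number;
  on(event: 'finish', listener: () => void): unknown;
}

// Hypothetical sink; in a real module this would forward to a histogram.
type ObserveFn = (
  labels: { method: string; route: string; status: string },
  durationMs: number,
) => void;

function httpMetricsMiddleware(observe: ObserveFn, now: () => number = Date.now) {
  return (req: ReqLike, res: ResLike, next: () => void): void => {
    const start = now();
    // 'finish' fires after the response has been handed off to the OS.
    res.on('finish', () => {
      // Prefer the matched route pattern (e.g. /users/:id) over the raw path
      // so the route label does not grow without bound.
      const route = req.route?.path ?? req.path;
      observe({ method: req.method, route, status: String(res.statusCode) }, now() - start);
    });
    next();
  };
}
```

In an Express app this would typically be registered with `app.use(...)` before the routes, passing a callback that records the duration in `http_request_duration_ms`. The label names `method`, `route`, and `status` follow the "Request count by method/route/status" description; the real label set may differ.
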
### Frontend Components

**Admin Page:**
- `admin/src/pages/ObservabilityPage.tsx` - Monitoring dashboard
- Three tabs: Metrics, Dashboards, Alerts
- Embedded Grafana iframes
- Live metric cards

**Observability Components:**
- `admin/src/components/observability/MetricsChart.tsx` - Chart component
- `admin/src/components/observability/ServiceHealthCard.tsx` - Health display

### Docker Services

**Monitoring Profile:**

Services run with `--profile monitoring`:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    profiles: [monitoring]
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:latest
    profiles: [monitoring]
    ports: ["3001:3000"]

  alertmanager:
    image: prom/alertmanager:latest
    profiles: [monitoring]
    ports: ["9093:9093"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    profiles: [monitoring]
    ports: ["8080:8080"]

  node-exporter:
    image: prom/node-exporter:latest
    profiles: [monitoring]
    ports: ["9100:9100"]

  redis-exporter:
    image: oliver006/redis_exporter:latest
    profiles: [monitoring]
    ports: ["9121:9121"]
```

## Configuration

### Environment Variables

```bash
# Enable metrics
METRICS_ENABLED=true

# Prometheus
PROMETHEUS_PORT=9090

# Grafana
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin

# Alertmanager
ALERTMANAGER_PORT=9093
```

### Prometheus Scrape Targets

```yaml
scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']

  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
```

### Alert Rules

Example alert rules:

```yaml
groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"

      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: |
          sum(rate(http_request_duration_ms_count{status=~"5.."}[5m]))
            / sum(rate(http_request_duration_ms_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
```

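A "high error rate (>5%)" threshold is a ratio: 5xx responses divided by all responses over the same window. The sketch below spells out that arithmetic on plain numbers. `StatusCounts`, `errorRatio`, and the sample counts are made up for illustration and assume requests are labelled by HTTP status.

```typescript
// Per-status request counts observed over one window (e.g. the last 5 minutes).
type StatusCounts = Record<string, number>;

// Fraction of requests in the window that returned a 5xx status.
function errorRatio(counts: StatusCounts): number {
  let total = 0;
  let errors = 0;
  for (const [statusCode, n] of Object.entries(counts)) {
    total += n;
    if (/^5\d\d$/.test(statusCode)) errors += n;
  }
  // No traffic means no errors; avoids dividing by zero.
  return total === 0 ? 0 : errors / total;
}

const windowCounts: StatusCounts = { '200': 940, '404': 20, '500': 30, '503': 10 };
const ratio = errorRatio(windowCounts);  // 40 / 1000 = 0.04
const shouldAlert = ratio > 0.05;        // false: below the 5% threshold
```

Comparing the raw per-second rate of 5xx responses against `0.05` would instead mean "more than one error every 20 seconds", which scales with traffic volume rather than with the share of failing requests.
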
## Metrics Usage

### Increment Counter

```typescript
import { metrics } from '../utils/metrics';

// Campaign email sent
metrics.campaignEmailsSent.inc();

// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });
```

### Set Gauge

```typescript
// Update queue size
metrics.emailQueueSize.set(queueSize);

// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);

// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);
```

### Observe Histogram

```typescript
// Time geocoding request
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
  throw error; // rethrow so callers still see the failure
}
```

## Grafana Dashboards

### Dashboard Setup

Dashboards are auto-provisioned from `configs/grafana/dashboards/`:

```json
{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
```

### Accessing Dashboards

- **Direct:** http://localhost:3001 (admin/admin)
- **Embedded:** `/app/observability` → Dashboards tab
- **Subdomain:** http://grafana.cmlite.org (production)

## Alertmanager

### Alert Routing

Configure in `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
```

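The routing tree above can be read as: check the child routes in order, hand the alert to the first one whose labels match, and otherwise fall back to the parent's receiver. The sketch below is a simplified model of that lookup, not Alertmanager's implementation. It ignores features such as `continue`, regex matchers, and receiver inheritance, and the `Route` type and `resolveReceiver` function are names invented for the example.

```typescript
type Labels = Record<string, string>;

interface Route {
  receiver: string;
  match?: Labels;   // every listed label must equal the alert's label
  routes?: Route[]; // child routes, checked in order
}

// Walk the routing tree: the first matching child wins,
// otherwise the alert stays with the current node's receiver.
function resolveReceiver(route: Route, alertLabels: Labels): string {
  for (const child of route.routes ?? []) {
    const matches = Object.entries(child.match ?? {}).every(
      ([key, value]) => alertLabels[key] === value,
    );
    if (matches) return resolveReceiver(child, alertLabels);
  }
  return route.receiver;
}

// Same shape as the YAML configuration above.
const routingTree: Route = {
  receiver: 'default',
  routes: [{ match: { severity: 'critical' }, receiver: 'critical-alerts' }],
};

const criticalTarget = resolveReceiver(routingTree, { alertname: 'APIDown', severity: 'critical' });
const warningTarget = resolveReceiver(routingTree, { alertname: 'HighErrorRate', severity: 'warning' });
```

Here `criticalTarget` resolves to `critical-alerts` and `warningTarget` to `default`, which mirrors how the two example alert severities would be delivered.
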
### Notification Channels

Supported receivers:

- **Webhook** - Gotify, Slack, Discord
- **Email** - SMTP notifications
- **PagerDuty** - Incident management
- **Opsgenie** - Alert management

## Service Health Monitoring

### External Service Checks

Monitor services via health gauges:

```typescript
// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}

// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}
```

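The two checks above share the same shape, and a probe that hangs would leave the gauge stale rather than marking the service unhealthy. One possible way to factor this out is a small helper that runs any probe with a timeout. This is only a sketch: `checkService`, `Probe`, `SetHealth`, and the 2-second default are assumptions, and `setHealth` stands in for a call such as `metrics.externalServiceHealth.set({ service }, value)`.

```typescript
type Probe = () => Promise<unknown>;
type SetHealth = (service: string, value: 0 | 1) => void;

// Run a probe with a timeout and report 1 (healthy) or 0 (unhealthy).
async function checkService(
  service: string,
  probe: Probe,
  setHealth: SetHealth,
  timeoutMs = 2000,
): Promise<0 | 1> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${service} health check timed out`)),
      timeoutMs,
    );
  });

  let value: 0 | 1 = 0;
  try {
    // Whichever settles first decides the outcome.
    await Promise.race([probe(), timeout]);
    value = 1;
  } catch {
    // Probe failed or timed out: value stays 0.
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }

  setHealth(service, value);
  return value;
}
```

A Redis check would then look like `await checkService('redis', () => redisClient.ping(), setHealth)`, with the same pattern for PostgreSQL and other dependencies.
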
### Docker Healthchecks

Services with healthchecks:

- **API** - `wget --spider http://localhost:4000/health`
- **Media API** - `wget --spider http://localhost:4100/health`
- **PostgreSQL** - `pg_isready`
- **Redis** - `redis-cli ping`
- **Listmonk** - `wget --spider http://localhost:9000/health`

## Performance Monitoring

### HTTP Request Tracking

Automatic tracking of:

- Request count by route
- Request duration percentiles
- Status code distribution
- Error rates

### Queue Monitoring

Track queue depths:

- Email queue size
- Geocode queue size
- Failed job count
- Processing rate

### Resource Monitoring

Via cAdvisor and Node Exporter:

- CPU usage
- Memory usage
- Disk I/O
- Network traffic

## Admin Interface

### Metrics Tab

Display cards:

- API uptime
- Request rate (req/sec)
- Error rate (%)
- Queue sizes
- Active sessions
- Service health

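Cards such as "Request rate (req/sec)" are not stored values; they are derived from counters that only ever increase. As a hedged illustration, the sketch below computes a per-second rate from two samples of a counter and treats a drop in value as a reset (for example after an API restart). `CounterSample` and `perSecondRate` are hypothetical names, and how the admin page actually obtains its numbers is not specified here.

```typescript
interface CounterSample {
  timestampMs: number;
  value: number;
}

// Per-second rate between two samples of a monotonically increasing counter.
// A drop in value is treated as a counter reset, so the newer value
// is taken as the whole increase since the reset.
function perSecondRate(prev: CounterSample, curr: CounterSample): number {
  const seconds = (curr.timestampMs - prev.timestampMs) / 1000;
  if (seconds <= 0) return 0;
  const increase = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return increase / seconds;
}

// 150 more requests over 10 seconds.
const steadyRate = perSecondRate(
  { timestampMs: 0, value: 100 },
  { timestampMs: 10_000, value: 250 },
); // 15 req/sec

// Counter went from 100 back to 20: assume a restart in between.
const rateAfterReset = perSecondRate(
  { timestampMs: 0, value: 100 },
  { timestampMs: 5_000, value: 20 },
); // 4 req/sec
```
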
### Dashboards Tab

Embedded Grafana:

- Overview dashboard
- Canvassing metrics
- External services
- Custom queries

### Alerts Tab

Active alerts list:

- Alert name
- Severity
- Status (firing/pending/resolved)
- Duration
- Quick actions (silence, resolve)

## Starting Monitoring Stack

```bash
# Start with monitoring profile
docker compose --profile monitoring up -d

# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093
```

## API Endpoints

### Observability Endpoints

```
GET /api/observability/prometheus    # Prometheus status
GET /api/observability/grafana       # Grafana status
GET /api/observability/alertmanager  # Alertmanager status
GET /api/observability/metrics       # Current metrics values
```

### Metrics Endpoint

```
GET /metrics  # Prometheus scrape endpoint
```

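A Prometheus scrape endpoint returns plain text in the Prometheus exposition format: a `# HELP` line, a `# TYPE` line, then one line per labelled sample. In practice the metrics library produces this output, so the sketch below exists only to show what Prometheus reads from `/metrics`. `MetricFamily`, `Sample`, and `renderMetric` are illustrative names, and the sketch covers counters and gauges only.

```typescript
interface Sample {
  labels?: Record<string, string>;
  value: number;
}

interface MetricFamily {
  name: string;
  help: string;
  type: 'counter' | 'gauge';
  samples: Sample[];
}

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render one metric family as exposition-format text.
function renderMetric(m: MetricFamily): string {
  const lines = [`# HELP ${m.name} ${m.help}`, `# TYPE ${m.name} ${m.type}`];
  for (const s of m.samples) {
    const pairs = Object.entries(s.labels ?? {}).map(
      ([k, v]) => `${k}="${escapeLabelValue(v)}"`,
    );
    const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
    lines.push(`${m.name}${labelPart} ${s.value}`);
  }
  return lines.join('\n') + '\n';
}

const healthText = renderMetric({
  name: 'cm_external_service_health',
  help: 'Service health (0/1)',
  type: 'gauge',
  samples: [
    { labels: { service: 'redis' }, value: 1 },
    { labels: { service: 'postgres' }, value: 0 },
  ],
});
// healthText contains, among others, the line:
// cm_external_service_health{service="redis"} 1
```
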
## Related Documentation

- [Observability Page](../../frontend/pages/admin/observability-page.md)
- [Metrics Utilities](../../backend/utilities/index.md)
- [Docker Compose](../../deployment/docker-compose.md)
- [Monitoring Stack](../../deployment/monitoring-stack.md)
- [Healthchecks](../../deployment/healthchecks.md)
- [Performance Optimization](../../troubleshooting/performance-optimization.md)