Observability & Monitoring¶
The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.
Overview¶
The Observability stack consists of:
- Prometheus - Metrics collection and storage
- Grafana - Visualization dashboards
- Alertmanager - Alert routing and notifications
- Custom Metrics - 12 domain-specific cm_* metrics
- HTTP Metrics - Request tracking and performance
- Service Health - External service monitoring
Features¶
Metrics Collection¶
Custom Domain Metrics (12 total):
Counters:
- cm_api_uptime_seconds - API uptime counter
- cm_canvass_visits_total - Total canvass visits
- cm_campaign_emails_sent_total - Total campaign emails sent
- cm_geocode_requests_total - Total geocode requests
Gauges:
- cm_canvass_sessions_active - Active canvass sessions
- cm_email_queue_size - Email queue depth
- cm_geocode_queue_size - Geocode queue depth
- cm_external_service_health - Service health (0/1)
Histograms:
- cm_geocode_duration_seconds - Geocoding latency
- http_request_duration_ms - HTTP request duration
HTTP Metrics:
- Request count by method/route/status
- Request duration percentiles (p50, p95, p99)
- Active requests gauge
- Error rate tracking
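Duration percentiles such as p50 and p95 are usually estimated from cumulative histogram buckets rather than stored per request. The sketch below shows that estimation in the spirit of PromQL's `histogram_quantile`: linear interpolation inside the bucket that contains the target rank. The `Bucket` type, the bucket boundaries, and the counts are made-up illustration values, not the platform's real configuration.

```typescript
// Cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number;    // upper bound in ms; Infinity for the last bucket
  count: number; // cumulative count of observations
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Buckets must be sorted by `le` ascending.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0 || q < 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);
  const bucket = buckets[idx];

  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (!Number.isFinite(bucket.le)) {
    return idx > 0 ? buckets[idx - 1].le : NaN;
  }

  const lowerBound = idx > 0 ? buckets[idx - 1].le : 0;
  const lowerCount = idx > 0 ? buckets[idx - 1].count : 0;
  const inBucket = bucket.count - lowerCount;
  if (inBucket === 0) return bucket.le;
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Hypothetical bucket data, for illustration only.
const exampleBuckets: Bucket[] = [
  { le: 100, count: 50 },
  { le: 500, count: 90 },
  { le: 2000, count: 100 },
  { le: Infinity, count: 100 },
];

const p50 = estimateQuantile(0.5, exampleBuckets);  // 100
const p95 = estimateQuantile(0.95, exampleBuckets); // about 1250
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.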
Grafana Dashboards¶
Three pre-configured dashboards:
- Changemaker Lite Overview - System-wide metrics
  - API uptime and request rates
  - Queue sizes and health
  - Active sessions
  - Error rates
- Canvassing Metrics - Canvass-specific metrics
  - Active sessions over time
  - Visits by outcome
  - Session duration
  - Volunteer leaderboard
- External Services - Integration health
  - Redis health
  - PostgreSQL health
  - Listmonk status
  - Geocoding providers
Alert Rules¶
12 predefined alert rules:
Critical Alerts:
- API down (>5 min)
- Database unreachable
- Redis connection lost
Warning Alerts:
- High error rate (>5%)
- Queue backup (>1000 jobs)
- Slow requests (p95 >2s)
- Service degradation
Info Alerts:
- New deployment
- Service restart
- Configuration change
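The numeric warning thresholds listed above reduce to simple comparisons over a metrics snapshot. The following is a minimal sketch of those comparisons, assuming a hypothetical `MetricsSnapshot` shape whose field names are illustrative and not part of the platform's API. In the actual stack these conditions are evaluated by Prometheus alert rules, not by application code.

```typescript
// Hypothetical snapshot shape; field names are assumptions for illustration.
interface MetricsSnapshot {
  totalRequests: number; // requests in the evaluation window
  errorRequests: number; // 5xx responses in the same window
  queueDepth: number;    // pending jobs
  p95Seconds: number;    // 95th percentile request duration
}

type Warning = "HighErrorRate" | "QueueBackup" | "SlowRequests";

// Thresholds from the Warning Alerts list: >5% errors, >1000 jobs, p95 >2s.
function evaluateWarnings(s: MetricsSnapshot): Warning[] {
  const warnings: Warning[] = [];
  // Guard against division by zero when there is no traffic.
  const errorRatio = s.totalRequests > 0 ? s.errorRequests / s.totalRequests : 0;
  if (errorRatio > 0.05) warnings.push("HighErrorRate");
  if (s.queueDepth > 1000) warnings.push("QueueBackup");
  if (s.p95Seconds > 2) warnings.push("SlowRequests");
  return warnings;
}
```

Note that the error rate is a ratio of failed to total requests, so a quiet service with a handful of failures can still cross the 5% line.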
Admin Interface¶
Observability page (/app/observability) with:
- Metrics Tab - Live metrics display
- Dashboards Tab - Embedded Grafana
- Alerts Tab - Active alerts and rules
Architecture¶
Backend Components¶
Metrics Module:
- api/src/utils/metrics.ts - Prometheus metrics definitions
- api/src/modules/observability/observability.routes.ts - Admin API
Instrumentation:
- Express middleware for HTTP metrics
- Service-level metric updates
- Queue size tracking
- External service health checks
Configuration:
- configs/prometheus/prometheus.yml - Scrape config
- configs/prometheus/alerts.yml - Alert rules
- configs/grafana/dashboards/ - Dashboard JSON
Frontend Components¶
Admin Page:
- admin/src/pages/ObservabilityPage.tsx - Monitoring dashboard
- Three tabs: Metrics, Dashboards, Alerts
- Embedded Grafana iframes
- Live metric cards
Observability Components:
- admin/src/components/observability/MetricsChart.tsx - Chart component
- admin/src/components/observability/ServiceHealthCard.tsx - Health display
Docker Services¶
Monitoring Profile:
Services run with --profile monitoring:
profiles: [monitoring]

prometheus:
  image: prom/prometheus:latest
  ports: ["9090:9090"]
grafana:
  image: grafana/grafana:latest
  ports: ["3001:3000"]
alertmanager:
  image: prom/alertmanager:latest
  ports: ["9093:9093"]
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  ports: ["8080:8080"]
node-exporter:
  image: prom/node-exporter:latest
  ports: ["9100:9100"]
redis-exporter:
  image: oliver006/redis_exporter:latest
  ports: ["9121:9121"]
Configuration¶
Environment Variables¶
# Enable metrics
METRICS_ENABLED=true
# Prometheus
PROMETHEUS_PORT=9090
# Grafana (admin/admin is the default login; change it before exposing Grafana publicly)
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
# Alertmanager
ALERTMANAGER_PORT=9093
Prometheus Scrape Targets¶
scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']
  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
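Each scrape target answers with plain text in the Prometheus exposition format. To make that format concrete, here is a small sketch that renders one counter by hand. In practice a client library generates this output, and the `renderCounter` helper and the sample value are hypothetical, for illustration only.

```typescript
// Escape a label value per the text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

interface CounterSampleLine {
  labels: Record<string, string>;
  value: number;
}

// Render a counter with HELP and TYPE headers followed by one line per sample.
function renderCounter(name: string, help: string, samples: CounterSampleLine[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels).map(
      ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
    );
    const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${name}${labelPart} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}

// Illustrative value only; not real data from the platform.
const exposition = renderCounter("cm_geocode_requests_total", "Total geocode requests", [
  { labels: { provider: "nominatim" }, value: 42 },
]);
```

The resulting text has two comment lines (`# HELP`, `# TYPE`) and then `cm_geocode_requests_total{provider="nominatim"} 42`.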
Alert Rules¶
Example alert rule:
groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"
      - alert: HighErrorRate
        # Fraction of requests that returned 5xx over the last 5 minutes
        expr: sum(rate(http_request_duration_ms_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_ms_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
Metrics Usage¶
Increment Counter¶
import { metrics } from '../utils/metrics';
// Campaign email sent
metrics.campaignEmailsSent.inc();
// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });
Set Gauge¶
// Update queue size
metrics.emailQueueSize.set(queueSize);
// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);
// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);
Observe Histogram¶
// Time geocoding request
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
}
Grafana Dashboards¶
Dashboard Setup¶
Dashboards are auto-provisioned from configs/grafana/dashboards/:
{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
Accessing Dashboards¶
- Direct: http://localhost:3001 (admin/admin)
- Embedded: /app/observability → Dashboards tab
- Subdomain: http://grafana.cmlite.org (production)
Alertmanager¶
Alert Routing¶
Configure in configs/alertmanager/alertmanager.yml:
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
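To make the routing tree above concrete: Alertmanager checks the child routes in order and, by default, delivers an alert to the first route whose matchers all equal the alert's labels, falling back to the top-level receiver otherwise. The sketch below models only that lookup. It ignores `continue`, regex matchers, nested routes, and grouping, and the `Route` and `RoutingTree` types are a hypothetical model rather than Alertmanager's API.

```typescript
type Labels = Record<string, string>;

interface Route {
  match: Labels;    // every listed label must equal the alert's label
  receiver: string;
}

interface RoutingTree {
  receiver: string; // fallback receiver
  routes: Route[];  // child routes, checked in order
}

// Return the receiver of the first matching child route, else the fallback.
function pickReceiver(tree: RoutingTree, alertLabels: Labels): string {
  for (const route of tree.routes) {
    const matches = Object.entries(route.match).every(
      ([name, value]) => alertLabels[name] === value,
    );
    if (matches) return route.receiver;
  }
  return tree.receiver;
}

// Mirrors the configuration shown above.
const routingTree: RoutingTree = {
  receiver: "default",
  routes: [{ match: { severity: "critical" }, receiver: "critical-alerts" }],
};

const forApiDown = pickReceiver(routingTree, { alertname: "APIDown", severity: "critical" });
const forErrorRate = pickReceiver(routingTree, { alertname: "HighErrorRate", severity: "warning" });
```

With this tree, `APIDown` goes to `critical-alerts` (email) and `HighErrorRate` falls through to `default` (webhook).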
Notification Channels¶
Supported receivers:
- Webhook - Gotify, Slack, Discord
- Email - SMTP notifications
- PagerDuty - Incident management
- Opsgenie - Alert management
Service Health Monitoring¶
External Service Checks¶
Monitor services via health gauges:
// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}

// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}
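The two checks above share the same try/catch shape, so they could be folded into one helper. This is a sketch under assumptions: `HealthGauge` mirrors only the `set(labels, value)` call used above, and each `probe` stands in for a call such as `redisClient.ping()`. Neither `recordHealth` nor `recordAllHealth` comes from the codebase.

```typescript
// Minimal view of the gauge: only the call used in the checks above.
interface HealthGauge {
  set(labels: { service: string }, value: number): void;
}

type Probe = () => Promise<unknown>;

// Run one probe and record 1 (healthy) or 0 (unhealthy) on the gauge.
async function recordHealth(gauge: HealthGauge, service: string, probe: Probe): Promise<boolean> {
  try {
    await probe();
    gauge.set({ service }, 1);
    return true;
  } catch (error) {
    gauge.set({ service }, 0);
    return false;
  }
}

// Run several probes concurrently; a failing probe does not affect the others.
async function recordAllHealth(
  gauge: HealthGauge,
  probes: Record<string, Probe>,
): Promise<Record<string, boolean>> {
  const result: Record<string, boolean> = {};
  await Promise.all(
    Object.entries(probes).map(async ([service, probe]) => {
      result[service] = await recordHealth(gauge, service, probe);
    }),
  );
  return result;
}
```

A call might look like `recordAllHealth(metrics.externalServiceHealth, { redis: () => redisClient.ping(), postgres: () => prisma.$queryRaw`SELECT 1` })`, assuming the gauge accepts that label set.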
Docker Healthchecks¶
Services with healthchecks:
- API - wget --spider http://localhost:4000/health
- Media API - wget --spider http://localhost:4100/health
- PostgreSQL - pg_isready
- Redis - redis-cli ping
- Listmonk - wget --spider http://localhost:9000/health
Performance Monitoring¶
HTTP Request Tracking¶
Automatic tracking of:
- Request count by route
- Request duration percentiles
- Status code distribution
- Error rates
Queue Monitoring¶
Track queue depths:
- Email queue size
- Geocode queue size
- Failed job count
- Processing rate
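A processing rate is usually not stored directly. It is derived from a monotonically increasing counter sampled at two points in time, which is the idea behind PromQL's `rate()`. The sketch below shows that arithmetic with a hypothetical `CounterSample` type. It treats a decrease as a counter reset, as Prometheus does, but skips the window extrapolation that `rate()` performs.

```typescript
// One reading of a monotonically increasing counter (hypothetical shape).
interface CounterSample {
  timestampMs: number;
  value: number;
}

// Average per-second increase between two samples.
function perSecondRate(earlier: CounterSample, later: CounterSample): number {
  const elapsedSeconds = (later.timestampMs - earlier.timestampMs) / 1000;
  if (elapsedSeconds <= 0) return NaN;
  // A counter that went down was reset (for example by a process restart),
  // so the increase is counted from zero.
  const increase = later.value >= earlier.value ? later.value - earlier.value : later.value;
  return increase / elapsedSeconds;
}

// 300 jobs processed in 60 seconds gives 5 jobs per second.
const steadyRate = perSecondRate({ timestampMs: 0, value: 100 }, { timestampMs: 60_000, value: 400 });
```

Pairing this rate with the queue depth gauges gives a rough drain-time estimate: depth divided by processing rate.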
Resource Monitoring¶
Via cAdvisor and Node Exporter:
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
Admin Interface¶
Metrics Tab¶
Display cards:
- API uptime
- Request rate (req/sec)
- Error rate (%)
- Queue sizes
- Active sessions
- Service health
Dashboards Tab¶
Embedded Grafana:
- Overview dashboard
- Canvassing metrics
- External services
- Custom queries
Alerts Tab¶
Active alerts list:
- Alert name
- Severity
- Status (firing/pending/resolved)
- Duration
- Quick actions (silence, resolve)
Starting Monitoring Stack¶
# Start with monitoring profile
docker compose --profile monitoring up -d
# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093
API Endpoints¶
Observability Endpoints¶
GET /api/observability/prometheus # Prometheus status
GET /api/observability/grafana # Grafana status
GET /api/observability/alertmanager # Alertmanager status
GET /api/observability/metrics # Current metrics values
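A small client for polling these endpoints might look like the following sketch. It relies only on what is listed above (the four paths and the GET method). The response bodies are not documented here, so it reports just the HTTP status. `FetchLike` is a hypothetical minimal type so that the global `fetch` (Node 18+ or a browser) or a test double can be passed in, and any authentication the admin API requires is left out.

```typescript
type Component = "prometheus" | "grafana" | "alertmanager" | "metrics";

// Minimal subset of fetch needed here; the global fetch satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

interface EndpointStatus {
  component: Component;
  reachable: boolean;
  status: number | null; // null when the request itself failed
}

// Query all four observability endpoints concurrently.
async function probeObservability(baseUrl: string, fetchFn: FetchLike): Promise<EndpointStatus[]> {
  const components: Component[] = ["prometheus", "grafana", "alertmanager", "metrics"];
  return Promise.all(
    components.map(async (component): Promise<EndpointStatus> => {
      try {
        const res = await fetchFn(`${baseUrl}/api/observability/${component}`);
        return { component, reachable: res.ok, status: res.status };
      } catch (error) {
        // Network-level failure: no HTTP status is available.
        return { component, reachable: false, status: null };
      }
    }),
  );
}
```

For local use this could be called as `probeObservability("http://localhost:4000", fetch)`, assuming the API listens on port 4000 as in the healthcheck above.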