Observability & Monitoring
The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.
Overview
The Observability stack consists of:
- Prometheus - Metrics collection and storage
- Grafana - Visualization dashboards
- Alertmanager - Alert routing and notifications
- Custom Metrics - 12 domain-specific cm_* metrics
- HTTP Metrics - Request tracking and performance
- Service Health - External service monitoring
Features
Metrics Collection
Custom Domain Metrics (12 total):
Counters:
- cm_api_uptime_seconds - API uptime counter
- cm_canvass_visits_total - Total canvass visits
- cm_campaign_emails_sent_total - Total campaign emails sent
- cm_geocode_requests_total - Total geocode requests
Gauges:
- cm_canvass_sessions_active - Active canvass sessions
- cm_email_queue_size - Email queue depth
- cm_geocode_queue_size - Geocode queue depth
- cm_external_service_health - Service health (0/1)
Histograms:
- cm_geocode_duration_seconds - Geocoding latency
- http_request_duration_ms - HTTP request duration
HTTP Metrics:
- Request count by method/route/status
- Request duration percentiles (p50, p95, p99)
- Active requests gauge
- Error rate tracking
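Percentiles such as p50, p95, and p99 are usually estimated from histogram buckets rather than stored per request. The sketch below shows one way that estimate can be computed, using the same linear-interpolation idea as PromQL's histogram_quantile. The bucket boundaries and counts are hypothetical, and this is an illustration rather than the platform's own code.

```typescript
// Cumulative histogram bucket: `count` observations were <= `le` milliseconds.
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-quantile (0 < q <= 1) by linear interpolation inside the
// bucket that contains the target rank. Buckets must be sorted by `le`.
function estimateQuantile(buckets: Bucket[], q: number): number {
  if (buckets.length === 0) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + ((rank - prevCount) / inBucket) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// Hypothetical bucket boundaries (ms) and counts, for illustration only.
const sampleBuckets: Bucket[] = [
  { le: 100, count: 50 },
  { le: 500, count: 90 },
  { le: 2000, count: 100 },
];
const p50 = estimateQuantile(sampleBuckets, 0.5);
const p95 = estimateQuantile(sampleBuckets, 0.95);
```

With these sample buckets, the p50 estimate lands on the first bucket boundary (100 ms) and the p95 estimate falls halfway through the 500-2000 ms bucket. The accuracy of such estimates depends on how closely bucket boundaries bracket the latencies of interest.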
Grafana Dashboards
Three pre-configured dashboards:
- Changemaker Lite Overview - System-wide metrics
  - API uptime and request rates
  - Queue sizes and health
  - Active sessions
  - Error rates
- Canvassing Metrics - Canvass-specific metrics
  - Active sessions over time
  - Visits by outcome
  - Session duration
  - Volunteer leaderboard
- External Services - Integration health
  - Redis health
  - PostgreSQL health
  - Listmonk status
  - Geocoding providers
Alert Rules
12 predefined alert rules:
Critical Alerts:
- API down (>5 min)
- Database unreachable
- Redis connection lost
Warning Alerts:
- High error rate (>5%)
- Queue backup (>1000 jobs)
- Slow requests (p95 >2s)
- Service degradation
Info Alerts:
- New deployment
- Service restart
- Configuration change
Admin Interface
Observability page (/app/observability) with:
- Metrics Tab - Live metrics display
- Dashboards Tab - Embedded Grafana
- Alerts Tab - Active alerts and rules
Architecture
Backend Components
Metrics Module:
- api/src/utils/metrics.ts - Prometheus metrics definitions
- api/src/modules/observability/observability.routes.ts - Admin API
Instrumentation:
- Express middleware for HTTP metrics
- Service-level metric updates
- Queue size tracking
- External service health checks
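The HTTP instrumentation listed above can be pictured as a middleware that starts a clock when a request arrives and reports the duration once the response finishes. The sketch below uses minimal stand-in types instead of importing Express, and the `observe` callback is a hypothetical hook where a histogram such as http_request_duration_ms would be updated. It is not the middleware shipped in the API.

```typescript
// Minimal structural types standing in for Express's Request/Response,
// so the sketch compiles without the express package.
interface ReqLike {
  method: string;
  path: string;
}
interface ResLike {
  statusCode: number;
  on(event: "finish", listener: () => void): void;
}
type Next = () => void;

// Hypothetical hook: receives the labels and the duration in milliseconds.
type Observe = (
  labels: { method: string; route: string; status: string },
  ms: number,
) => void;

// Build middleware that times each request and reports it once the
// response has finished. `now` is injectable to keep the sketch testable.
function httpMetrics(observe: Observe, now: () => number = Date.now) {
  return (req: ReqLike, res: ResLike, next: Next): void => {
    const start = now();
    res.on("finish", () => {
      observe(
        { method: req.method, route: req.path, status: String(res.statusCode) },
        now() - start,
      );
    });
    next();
  };
}
```

Using the raw request path as the route label is a simplification. With parameterized routes, labelling by the route pattern rather than the concrete path keeps the number of time series bounded.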
Configuration:
- configs/prometheus/prometheus.yml - Scrape config
- configs/prometheus/alerts.yml - Alert rules
- configs/grafana/dashboards/ - Dashboard JSON
Frontend Components
Admin Page:
- admin/src/pages/ObservabilityPage.tsx - Monitoring dashboard
- Three tabs: Metrics, Dashboards, Alerts
- Embedded Grafana iframes
- Live metric cards
Observability Components:
- admin/src/components/observability/MetricsChart.tsx - Chart component
- admin/src/components/observability/ServiceHealthCard.tsx - Health display
Docker Services
Monitoring Profile:
Services run with --profile monitoring:
profiles: [monitoring]
prometheus:
  image: prom/prometheus:latest
  ports: ["9090:9090"]
grafana:
  image: grafana/grafana:latest
  ports: ["3001:3000"]
alertmanager:
  image: prom/alertmanager:latest
  ports: ["9093:9093"]
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  ports: ["8080:8080"]
node-exporter:
  image: prom/node-exporter:latest
  ports: ["9100:9100"]
redis-exporter:
  image: oliver006/redis_exporter:latest
  ports: ["9121:9121"]
Configuration
Environment Variables
# Enable metrics
METRICS_ENABLED=true
# Prometheus
PROMETHEUS_PORT=9090
# Grafana
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin
# Alertmanager
ALERTMANAGER_PORT=9093
Prometheus Scrape Targets
scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']
  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
Alert Rules
Example alert rule:
groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: |
          sum(rate(http_request_duration_ms_count{status=~"5.."}[5m]))
            / sum(rate(http_request_duration_ms_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
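The `for` clause in these rules means an alert does not fire the moment its expression becomes true. It first sits in a pending state until the condition has held continuously for the given duration. The sketch below models that state machine over a list of evaluation samples. It is a simplified illustration of the behavior, not Prometheus code, and the sample shape is an assumption.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// One rule evaluation: when it ran and whether the expression matched.
interface EvalSample {
  timeSec: number;
  conditionMet: boolean;
}

// Walk evaluation samples in time order and return the state after the last
// one. The alert fires only once the condition has held for `forSec`.
function alertState(samples: EvalSample[], forSec: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = "inactive";
  for (const s of samples) {
    if (!s.conditionMet) {
      // Any evaluation where the condition is false resets the clock.
      activeSince = null;
      state = "inactive";
      continue;
    }
    if (activeSince === null) activeSince = s.timeSec;
    state = s.timeSec - activeSince >= forSec ? "firing" : "pending";
  }
  return state;
}
```

For the APIDown rule (`for: 5m`), a target that is down at every evaluation for 300 seconds ends up firing, while a single successful scrape in between returns the alert to inactive and restarts the wait.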
Metrics Usage
Increment Counter
import { metrics } from '../utils/metrics';
// Campaign email sent
metrics.campaignEmailsSent.inc();
// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });
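When a counter is incremented with labels, each distinct label set is tracked as its own time series. The sketch below is a small in-memory stand-in that shows this behavior. It is not prom-client and not the contents of api/src/utils/metrics.ts, and the class name and the second provider name are hypothetical.

```typescript
type Labels = Record<string, string>;

// In-memory stand-in for a labeled counter. Each distinct label set is
// tracked as its own series, and values can only increase.
class SketchCounter {
  private series = new Map<string, number>();

  inc(labels: Labels = {}, by = 1): void {
    if (by < 0) throw new Error("counters cannot decrease");
    const key = SketchCounter.key(labels);
    this.series.set(key, (this.series.get(key) ?? 0) + by);
  }

  get(labels: Labels = {}): number {
    return this.series.get(SketchCounter.key(labels)) ?? 0;
  }

  // Sort label names so { a, b } and { b, a } map to the same series.
  private static key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }
}

const geocodeRequests = new SketchCounter();
geocodeRequests.inc({ provider: "nominatim" });
geocodeRequests.inc({ provider: "nominatim" });
geocodeRequests.inc({ provider: "other" }); // hypothetical provider name
```

Because every label value creates a new series, labels are best kept to small, fixed sets such as provider names rather than user IDs or addresses.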
Set Gauge
// Update queue size
metrics.emailQueueSize.set(queueSize);
// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);
// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);
Observe Histogram
// Time geocoding request
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
}
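The timer pattern above works because startTimer captures the start time and returns a function that records the elapsed time when called. The sketch below shows that mechanism with a simplified, label-free histogram and an injectable clock. The class name and bucket boundaries are hypothetical, and it is not the prom-client implementation.

```typescript
// Minimal histogram stand-in: counts observations per upper bound (seconds).
// Counts here are per bucket, not cumulative as in the Prometheus format.
class SketchHistogram {
  readonly counts: number[];
  sum = 0;

  constructor(
    readonly upperBounds: number[], // ascending, in seconds
    private now: () => number = Date.now, // clock in milliseconds
  ) {
    // One extra slot at the end for observations above the last bound.
    this.counts = new Array(upperBounds.length + 1).fill(0);
  }

  observe(seconds: number): void {
    let i = this.upperBounds.findIndex((le) => seconds <= le);
    if (i === -1) i = this.upperBounds.length;
    this.counts[i] += 1;
    this.sum += seconds;
  }

  // Capture the start time now; the returned function records the elapsed
  // time when called and returns it in seconds.
  startTimer(): () => number {
    const start = this.now();
    return () => {
      const elapsed = (this.now() - start) / 1000;
      this.observe(elapsed);
      return elapsed;
    };
  }
}
```

Calling the returned function in both the success and failure paths, as the snippet above does, ensures failed geocoding attempts are included in the latency distribution.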
Grafana Dashboards
Dashboard Setup
Dashboards are auto-provisioned from configs/grafana/dashboards/:
{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
Accessing Dashboards
- Direct: http://localhost:3001 (admin/admin)
- Embedded: /app/observability → Dashboards tab
- Subdomain: http://grafana.cmlite.org (production)
Alertmanager
Alert Routing
Configure in configs/alertmanager/alertmanager.yml:
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'
  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
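The routing configuration above is a tree: an alert goes to the first child route whose match labels all equal the alert's labels, and falls back to the parent receiver otherwise. The sketch below models just that selection step. It deliberately ignores other Alertmanager features such as grouping, regex matchers, and `continue`, so treat it as an illustration of the idea rather than Alertmanager's actual algorithm.

```typescript
interface Route {
  receiver: string;
  match?: Record<string, string>;
  routes?: Route[];
}

// Return the receiver for an alert's labels: descend into the first child
// route whose `match` labels all equal the alert's labels; fall back to the
// current node's receiver when no child matches.
function pickReceiver(route: Route, labels: Record<string, string>): string {
  for (const child of route.routes ?? []) {
    const wanted = Object.entries(child.match ?? {});
    if (wanted.every(([k, v]) => labels[k] === v)) {
      return pickReceiver(child, labels);
    }
  }
  return route.receiver;
}

// Mirrors the route section of the configuration above.
const routingTree: Route = {
  receiver: "default",
  routes: [{ match: { severity: "critical" }, receiver: "critical-alerts" }],
};

const criticalTarget = pickReceiver(routingTree, { alertname: "APIDown", severity: "critical" });
const warningTarget = pickReceiver(routingTree, { alertname: "HighErrorRate", severity: "warning" });
```

Here the critical alert resolves to the email receiver and the warning falls through to the default webhook receiver.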
Notification Channels
Supported receivers:
- Webhook - Gotify, Slack, Discord
- Email - SMTP notifications
- PagerDuty - Incident management
- Opsgenie - Alert management
Service Health Monitoring
External Service Checks
Monitor services via health gauges:
// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}
// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}
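The two checks above follow the same pattern, so they could be folded into one helper that runs every probe and reports 1 or 0 per service. The sketch below is one possible shape for such a helper. The function name, the timeout, and its default are assumptions added for illustration and are not part of the existing code.

```typescript
type Probe = () => Promise<unknown>;
type SetHealth = (service: string, value: 0 | 1) => void;

// Run each probe, bounded by a timeout, and report 1 (healthy) or 0
// (unhealthy). A hung dependency counts as unhealthy rather than blocking.
async function checkServices(
  probes: Record<string, Probe>,
  setHealth: SetHealth,
  timeoutMs = 2000, // hypothetical default
): Promise<void> {
  await Promise.all(
    Object.entries(probes).map(async ([service, probe]) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
      });
      try {
        await Promise.race([probe(), timeout]);
        setHealth(service, 1);
      } catch {
        setHealth(service, 0);
      } finally {
        if (timer !== undefined) clearTimeout(timer);
      }
    }),
  );
}

// Possible wiring with the clients from the snippet above:
// await checkServices(
//   {
//     redis: () => redisClient.ping(),
//     postgres: () => prisma.$queryRaw`SELECT 1`,
//   },
//   (service, value) => metrics.externalServiceHealth.set({ service }, value),
// );
```

Running the probes concurrently keeps one slow dependency from delaying the health update of the others.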
Docker Healthchecks
Services with healthchecks:
- API - wget --spider http://localhost:4000/health
- Media API - wget --spider http://localhost:4100/health
- PostgreSQL - pg_isready
- Redis - redis-cli ping
- Listmonk - wget --spider http://localhost:9000/health
Performance Monitoring
HTTP Request Tracking
Automatic tracking of:
- Request count by route
- Request duration percentiles
- Status code distribution
- Error rates
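An error rate is the share of requests that ended in a server error, which is the quantity the 5% warning threshold refers to. The sketch below derives it from a status code distribution. Treating only 5xx responses as errors is an assumption made here for illustration.

```typescript
// Percentage of requests that ended in a 5xx status, given request counts
// keyed by status code. Returns 0 when there was no traffic.
function errorRatePercent(countsByStatus: Record<string, number>): number {
  let total = 0;
  let errors = 0;
  for (const [status, count] of Object.entries(countsByStatus)) {
    total += count;
    if (status.startsWith("5")) errors += count;
  }
  return total === 0 ? 0 : (errors * 100) / total;
}

// Hypothetical distribution: 6 of 100 requests failed with a 5xx status.
const exampleRate = errorRatePercent({ "200": 90, "404": 4, "500": 5, "503": 1 });
```

In this example the rate is 6%, which would exceed the 5% warning threshold if it persisted.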
Queue Monitoring
Track queue depths:
- Email queue size
- Geocode queue size
- Failed job count
- Processing rate
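A processing rate can be derived from a counter by dividing its increase by the elapsed time between samples. The sketch below does this while tolerating counter resets, which occur when a process restarts and its counters start again from zero. It is a simplified version of what PromQL's rate() does and omits the extrapolation to window boundaries that Prometheus applies.

```typescript
interface CounterSample {
  timeSec: number;
  value: number;
}

// Average per-second increase across samples. A drop in value is treated as
// a counter reset, so the post-reset value counts as that step's increase.
function perSecondRate(samples: CounterSample[]): number {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const elapsed = samples[samples.length - 1].timeSec - samples[0].timeSec;
  return elapsed > 0 ? increase / elapsed : 0;
}

// Hypothetical samples: 120 jobs processed over 60 seconds.
const jobsPerSecond = perSecondRate([
  { timeSec: 0, value: 0 },
  { timeSec: 60, value: 120 },
]);
```

Comparing this rate against the queue's intake rate indicates whether a backlog is growing or draining.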
Resource Monitoring
Via cAdvisor and Node Exporter:
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
Admin Interface
Metrics Tab
Display cards:
- API uptime
- Request rate (req/sec)
- Error rate (%)
- Queue sizes
- Active sessions
- Service health
Dashboards Tab
Embedded Grafana:
- Overview dashboard
- Canvassing metrics
- External services
- Custom queries
Alerts Tab
Active alerts list:
- Alert name
- Severity
- Status (firing/pending/resolved)
- Duration
- Quick actions (silence, resolve)
Starting Monitoring Stack
# Start with monitoring profile
docker compose --profile monitoring up -d
# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093
API Endpoints
Observability Endpoints
GET /api/observability/prometheus # Prometheus status
GET /api/observability/grafana # Grafana status
GET /api/observability/alertmanager # Alertmanager status
GET /api/observability/metrics # Current metrics values
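A client such as the admin page could poll the three status endpoints above and treat any non-2xx response or network failure as "down". The sketch below shows one way to do that with an injected fetch-like function. The response bodies of these endpoints are not documented here, so the sketch only inspects the HTTP status, and it omits any authentication the admin API may require.

```typescript
// Minimal fetch-like signature so the sketch does not depend on a runtime.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

const COMPONENTS = ["prometheus", "grafana", "alertmanager"] as const;
type Component = (typeof COMPONENTS)[number];

// Query each status endpoint and record whether it answered with a 2xx.
// Network failures are recorded as `false` instead of being thrown.
async function stackStatus(
  fetchFn: FetchLike,
  baseUrl = "", // assumption: same-origin requests by default
): Promise<Record<Component, boolean>> {
  const result = { prometheus: false, grafana: false, alertmanager: false };
  await Promise.all(
    COMPONENTS.map(async (name) => {
      try {
        const res = await fetchFn(`${baseUrl}/api/observability/${name}`);
        result[name] = res.ok;
      } catch {
        result[name] = false;
      }
    }),
  );
  return result;
}
```

In a browser or recent Node.js runtime, the global fetch can be passed as `fetchFn`. Injecting it keeps the function easy to exercise with a fake in tests.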
Metrics Endpoint
GET /metrics # Prometheus scrape endpoint
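The scrape endpoint returns plain text in the Prometheus exposition format: a HELP line and a TYPE line per metric, followed by one line per labeled sample. The sketch below renders counters and gauges in that format to show what Prometheus reads from /metrics. It is an illustration only, since a metrics library normally produces this output, and the help strings used with it are assumptions.

```typescript
interface MetricSample {
  labels?: Record<string, string>;
  value: number;
}
interface MetricFamily {
  name: string;
  help: string;
  type: "counter" | "gauge";
  samples: MetricSample[];
}

// Escape backslashes, double quotes, and newlines inside a label value.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render metric families in the Prometheus text exposition format:
// a HELP line, a TYPE line, then one line per labeled sample.
function renderMetrics(families: MetricFamily[]): string {
  const lines: string[] = [];
  for (const f of families) {
    lines.push(`# HELP ${f.name} ${f.help}`);
    lines.push(`# TYPE ${f.name} ${f.type}`);
    for (const s of f.samples) {
      const pairs = Object.entries(s.labels ?? {}).map(
        ([k, v]) => `${k}="${escapeLabelValue(v)}"`,
      );
      const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
      lines.push(`${f.name}${labelPart} ${s.value}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

For example, a health gauge for Redis would be rendered as a sample line of the form `cm_external_service_health{service="redis"} 1`.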