# Observability & Monitoring

The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.

## Overview

The Observability stack consists of:

1. **Prometheus** - Metrics collection and storage
2. **Grafana** - Visualization dashboards
3. **Alertmanager** - Alert routing and notifications
4. **Custom Metrics** - 12 domain-specific `cm_*` metrics
5. **HTTP Metrics** - Request tracking and performance
6. **Service Health** - External service monitoring

## Features

### Metrics Collection

**Custom Domain Metrics (12 total):**

**Counters:**

- `cm_api_uptime_seconds` - API uptime counter
- `cm_canvass_visits_total` - Total canvass visits
- `cm_campaign_emails_sent_total` - Total campaign emails sent
- `cm_geocode_requests_total` - Total geocode requests

**Gauges:**

- `cm_canvass_sessions_active` - Active canvass sessions
- `cm_email_queue_size` - Email queue depth
- `cm_geocode_queue_size` - Geocode queue depth
- `cm_external_service_health` - Service health (1 = healthy, 0 = unhealthy)

**Histograms:**

- `cm_geocode_duration_seconds` - Geocoding latency
- `http_request_duration_ms` - HTTP request duration

**HTTP Metrics:**

- Request count by method/route/status
- Request duration percentiles (p50, p95, p99)
- Active requests gauge
- Error rate tracking

### Grafana Dashboards

Three pre-configured dashboards:

1. **Changemaker Lite Overview** - System-wide metrics
    - API uptime and request rates
    - Queue sizes and health
    - Active sessions
    - Error rates
2. **Canvassing Metrics** - Canvass-specific metrics
    - Active sessions over time
    - Visits by outcome
    - Session duration
    - Volunteer leaderboard
3. **External Services** - Integration health
    - Redis health
    - PostgreSQL health
    - Listmonk status
    - Geocoding providers

### Alert Rules

12 predefined alert rules:

**Critical Alerts:**

- API down (>5 min)
- Database unreachable
- Redis connection lost

**Warning Alerts:**

- High error rate (>5%)
- Queue backup (>1000 jobs)
- Slow requests (p95 >2s)
- Service degradation

**Info Alerts:**

- New deployment
- Service restart
- Configuration change

### Admin Interface

Observability page (`/app/observability`) with:

- **Metrics Tab** - Live metrics display
- **Dashboards Tab** - Embedded Grafana
- **Alerts Tab** - Active alerts and rules

## Architecture

### Backend Components

**Metrics Module:**

- `api/src/utils/metrics.ts` - Prometheus metrics definitions
- `api/src/modules/observability/observability.routes.ts` - Admin API

**Instrumentation:**

- Express middleware for HTTP metrics
- Service-level metric updates
- Queue size tracking
- External service health checks

**Configuration:**

- `configs/prometheus/prometheus.yml` - Scrape config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/dashboards/` - Dashboard JSON

### Frontend Components

**Admin Page:**

- `admin/src/pages/ObservabilityPage.tsx` - Monitoring dashboard
- Three tabs: Metrics, Dashboards, Alerts
- Embedded Grafana iframes
- Live metric cards

**Observability Components:**

- `admin/src/components/observability/MetricsChart.tsx` - Chart component
- `admin/src/components/observability/ServiceHealthCard.tsx` - Health display

### Docker Services

**Monitoring Profile:**

The following services start only when Compose is run with `--profile monitoring`:

```yaml
# Set on each service listed below
profiles: [monitoring]

prometheus:
  image: prom/prometheus:latest
  ports: ["9090:9090"]

grafana:
  image: grafana/grafana:latest
  ports: ["3001:3000"]

alertmanager:
  image: prom/alertmanager:latest
  ports: ["9093:9093"]

cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  ports: ["8080:8080"]

node-exporter:
  image: prom/node-exporter:latest
  ports: ["9100:9100"]

redis-exporter:
  image: oliver006/redis_exporter:latest
  ports: ["9121:9121"]
```

## Configuration

### Environment Variables

```bash
# Enable metrics
METRICS_ENABLED=true

# Prometheus
PROMETHEUS_PORT=9090

# Grafana
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin

# Alertmanager
ALERTMANAGER_PORT=9093
```

### Prometheus Scrape Targets

```yaml
scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']

  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
```

### Alert Rules

Example alert rules:

```yaml
groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"

      - alert: HighErrorRate
        # Share of 5xx responses among all responses; 0.05 = 5%
        expr: |
          sum(rate(http_request_duration_ms_count{status=~"5.."}[5m]))
            / sum(rate(http_request_duration_ms_count[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
```

## Metrics Usage

### Increment Counter

```typescript
import { metrics } from '../utils/metrics';

// Campaign email sent
metrics.campaignEmailsSent.inc();

// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });
```

### Set Gauge

```typescript
// Update queue size
metrics.emailQueueSize.set(queueSize);

// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);

// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);
```

### Observe Histogram

```typescript
// Time a geocoding request; the timer records the elapsed duration when end() is called
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
  throw error; // re-throw so the caller still sees the failure
}
```

## Grafana Dashboards

### Dashboard Setup

Dashboards are auto-provisioned from `configs/grafana/dashboards/`:

```json
{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
```

### Accessing Dashboards

- **Direct:** http://localhost:3001 (admin/admin)
- **Embedded:** `/app/observability` → Dashboards tab
- **Subdomain:** http://grafana.cmlite.org (production)

## Alertmanager

### Alert Routing

Configure in `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
```

### Notification Channels

Supported receivers:

- **Webhook** - Gotify, Slack, Discord
- **Email** - SMTP notifications
- **PagerDuty** - Incident management
- **Opsgenie** - Alert management

## Service Health Monitoring

### External Service Checks

Monitor services via health gauges:

```typescript
// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}

// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}
```

### Docker Healthchecks

Services with healthchecks:

- **API** - `wget --spider http://localhost:4000/health`
- **Media API** - `wget --spider http://localhost:4100/health`
- **PostgreSQL** - `pg_isready`
- **Redis** - `redis-cli ping`
- **Listmonk** - `wget --spider http://localhost:9000/health`

## Performance Monitoring

### HTTP Request Tracking

Automatic tracking of:

- Request count by route
- Request duration percentiles
- Status code distribution
- Error rates

### Queue Monitoring

Track queue depths:

- Email queue size
- Geocode queue size
- Failed job count
- Processing rate

### Resource Monitoring

Via cAdvisor and Node Exporter:

- CPU usage
- Memory usage
- Disk I/O
- Network traffic

## Admin Interface

### Metrics Tab

Display cards:

- API uptime
- Request rate (req/sec)
- Error rate (%)
- Queue sizes
- Active sessions
- Service health

### Dashboards Tab

Embedded Grafana:

- Overview dashboard
- Canvassing metrics
- External services
- Custom queries

### Alerts Tab

Active alerts list:

- Alert name
- Severity
- Status (firing/pending/resolved)
- Duration
- Quick actions (silence, resolve)

## Starting Monitoring Stack

```bash
# Start with monitoring profile
docker compose --profile monitoring up -d

# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093
```

## API Endpoints

### Observability Endpoints

```
GET /api/observability/prometheus     # Prometheus status
GET /api/observability/grafana        # Grafana status
GET /api/observability/alertmanager   # Alertmanager status
GET /api/observability/metrics        # Current metrics values
```

### Metrics Endpoint

```
GET /metrics   # Prometheus scrape endpoint
```

## Related Documentation

- [Observability Page](../../frontend/pages/admin/observability-page.md)
- [Metrics Utilities](../../backend/utilities/index.md)
- [Docker Compose](../../deployment/docker-compose.md)
- [Monitoring Stack](../../deployment/monitoring-stack.md)
- [Healthchecks](../../deployment/healthchecks.md)
- [Performance Optimization](../../troubleshooting/performance-optimization.md)
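
## Example: Estimating Request Duration Percentiles

The p50/p95/p99 figures and the "Slow requests (p95 >2s)" alert mentioned above are derived from histogram buckets, not from individual request timings. In a running stack Prometheus does this server-side (its `histogram_quantile` function interpolates linearly inside a bucket); the sketch below only illustrates that arithmetic. The `Bucket` type, the `estimateQuantile` name, and the bucket boundaries are hypothetical and are not taken from `api/src/utils/metrics.ts`.

```typescript
// One cumulative histogram bucket: `count` is the number of observations
// with a value <= `le`. The last bucket conventionally uses le = Infinity.
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-quantile (0 <= q <= 1) by linear interpolation inside the
// bucket that contains the target rank. Returns NaN when there is no data.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted.reduce((max, b) => Math.max(max, b.count), 0);
  if (total === 0 || q < 0 || q > 1) return NaN;

  const rank = q * total;
  let lower = 0;      // upper bound of the previous bucket (0 before the first)
  let prevCount = 0;  // cumulative count of the previous bucket
  let first = true;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Rank lands in the +Inf bucket: fall back to the highest finite bound.
      if (b.le === Infinity) return first ? NaN : lower;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return lower;
      return lower + (b.le - lower) * ((rank - prevCount) / inBucket);
    }
    lower = b.le;
    prevCount = b.count;
    first = false;
  }
  return NaN; // not reached for well-formed cumulative buckets
}

// Hypothetical buckets for http_request_duration_ms (bounds in milliseconds).
const exampleBuckets: Bucket[] = [
  { le: 100, count: 50 },
  { le: 500, count: 90 },
  { le: 2000, count: 99 },
  { le: Infinity, count: 100 },
];

const p95 = estimateQuantile(0.95, exampleBuckets);
console.log(`estimated p95: ${p95.toFixed(1)} ms, slow: ${p95 > 2000}`);
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the threshold of interest; a boundary at or near 2000 ms keeps the p95 alert meaningful.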