Observability & Monitoring

The Observability feature provides monitoring, metrics collection, and alerting for the Changemaker Lite platform. It is built on the Prometheus ecosystem, with Grafana dashboards and Alertmanager integration.

Overview

The Observability stack consists of:

  1. Prometheus - Metrics collection and storage
  2. Grafana - Visualization dashboards
  3. Alertmanager - Alert routing and notifications
  4. Custom Metrics - 12 domain-specific cm_* metrics
  5. HTTP Metrics - Request tracking and performance
  6. Service Health - External service monitoring

Features

Metrics Collection

Custom Domain Metrics (12 total):

Counters:

  • cm_api_uptime_seconds - API uptime counter
  • cm_canvass_visits_total - Total canvass visits
  • cm_campaign_emails_sent_total - Total campaign emails sent
  • cm_geocode_requests_total - Total geocode requests

Gauges:

  • cm_canvass_sessions_active - Active canvass sessions
  • cm_email_queue_size - Email queue depth
  • cm_geocode_queue_size - Geocode queue depth
  • cm_external_service_health - Service health (1 = healthy, 0 = unhealthy)

Histograms:

  • cm_geocode_duration_seconds - Geocoding latency
  • http_request_duration_ms - HTTP request duration

HTTP Metrics:

  • Request count by method/route/status
  • Request duration percentiles (p50, p95, p99)
  • Active requests gauge
  • Error rate tracking
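
Percentiles such as p95 are usually not stored directly; with a histogram they are estimated from cumulative bucket counts. The sketch below illustrates that estimate using linear interpolation inside the matching bucket, in the spirit of PromQL's histogram_quantile. The Bucket type, the function name, and the bucket bounds in the usage note are illustrative assumptions, not code from this project.

```typescript
// Hypothetical shape: one cumulative histogram bucket ("le" = upper bound).
interface Bucket {
  le: number;              // upper bound in ms; Infinity for the +Inf bucket
  cumulativeCount: number; // observations with value <= le
}

// Estimate the q-quantile (0..1) from cumulative buckets.
// Simplified sketch: assumes the lowest bucket starts at 0.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].cumulativeCount;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.cumulativeCount >= rank);
  const upper = sorted[idx];

  // The quantile falls in the +Inf bucket: report the highest finite bound.
  if (upper.le === Infinity) {
    return idx > 0 ? sorted[idx - 1].le : NaN;
  }

  const lowerBound = idx > 0 ? sorted[idx - 1].le : 0;
  const lowerCount = idx > 0 ? sorted[idx - 1].cumulativeCount : 0;
  const inBucket = upper.cumulativeCount - lowerCount;
  if (inBucket === 0) return upper.le;

  // Linear interpolation between the bucket's lower and upper bounds.
  return lowerBound + (upper.le - lowerBound) * ((rank - lowerCount) / inBucket);
}
```

For example, with buckets 100 ms (50 observations), 500 ms (90), 1000 ms (100), and +Inf (100), the p95 estimate falls halfway through the 500-1000 ms bucket. The estimate is only as precise as the bucket boundaries, so bounds should be chosen around the latencies that matter for alerting.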

Grafana Dashboards

Three pre-configured dashboards:

  1. Changemaker Lite Overview - System-wide metrics

    • API uptime and request rates
    • Queue sizes and health
    • Active sessions
    • Error rates
  2. Canvassing Metrics - Canvass-specific metrics

    • Active sessions over time
    • Visits by outcome
    • Session duration
    • Volunteer leaderboard
  3. External Services - Integration health

    • Redis health
    • PostgreSQL health
    • Listmonk status
    • Geocoding providers

Alert Rules

12 predefined alert rules:

Critical Alerts:

  • API down (>5 min)
  • Database unreachable
  • Redis connection lost

Warning Alerts:

  • High error rate (>5%)
  • Queue backup (>1000 jobs)
  • Slow requests (p95 >2s)
  • Service degradation

Info Alerts:

  • New deployment
  • Service restart
  • Configuration change

Admin Interface

The Observability page (/app/observability) has three tabs:

  • Metrics Tab - Live metrics display
  • Dashboards Tab - Embedded Grafana
  • Alerts Tab - Active alerts and rules

Architecture

Backend Components

Metrics Module:

  • api/src/utils/metrics.ts - Prometheus metrics definitions
  • api/src/modules/observability/observability.routes.ts - Admin API

Instrumentation:

  • Express middleware for HTTP metrics
  • Service-level metric updates
  • Queue size tracking
  • External service health checks
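
As a rough illustration of the HTTP instrumentation, the sketch below shows the general shape of a duration-recording middleware. It is a minimal sketch, not the project's implementation: Req, Res, and Next are small structural stand-ins for Express's types, DurationObserver is a hypothetical stand-in for a histogram, and the label names method, route, and status are assumptions (only status appears in the alert expressions).

```typescript
// Minimal structural stand-ins for the Express request/response types.
interface Req {
  method: string;
  route?: { path: string }; // set by the router once a route has matched
}
interface Res {
  statusCode: number;
  on(event: 'finish', listener: () => void): void;
}
type Next = () => void;

// Hypothetical sink; a Prometheus histogram would play this role.
interface DurationObserver {
  observe(labels: Record<string, string>, valueMs: number): void;
}

// Returns a middleware that records one duration sample per finished response.
function httpMetricsMiddleware(
  histogram: DurationObserver,
  now: () => number = Date.now,
) {
  return (req: Req, res: Res, next: Next): void => {
    const start = now();
    res.on('finish', () => {
      histogram.observe(
        {
          method: req.method,
          // Use the route template, not the raw URL, to keep label cardinality low.
          route: req.route?.path ?? 'unmatched',
          status: String(res.statusCode),
        },
        now() - start,
      );
    });
    next();
  };
}
```

Labelling by route template (for example /api/x/:id) rather than by the concrete URL matters in practice: every distinct label value creates a separate time series in Prometheus.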

Configuration:

  • configs/prometheus/prometheus.yml - Scrape config
  • configs/prometheus/alerts.yml - Alert rules
  • configs/grafana/dashboards/ - Dashboard JSON

Frontend Components

Admin Page:

  • admin/src/pages/ObservabilityPage.tsx - Monitoring dashboard
  • Three tabs: Metrics, Dashboards, Alerts
  • Embedded Grafana iframes
  • Live metric cards

Observability Components:

  • admin/src/components/observability/MetricsChart.tsx - Chart component
  • admin/src/components/observability/ServiceHealthCard.tsx - Health display

Docker Services

Monitoring Profile:

Services run with --profile monitoring:

services:
  prometheus:
    image: prom/prometheus:latest
    profiles: [monitoring]
    ports: ["9090:9090"]

  grafana:
    image: grafana/grafana:latest
    profiles: [monitoring]
    ports: ["3001:3000"]

  alertmanager:
    image: prom/alertmanager:latest
    profiles: [monitoring]
    ports: ["9093:9093"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    profiles: [monitoring]
    ports: ["8080:8080"]

  node-exporter:
    image: prom/node-exporter:latest
    profiles: [monitoring]
    ports: ["9100:9100"]

  redis-exporter:
    image: oliver006/redis_exporter:latest
    profiles: [monitoring]
    ports: ["9121:9121"]

Configuration

Environment Variables

# Enable metrics
METRICS_ENABLED=true

# Prometheus
PROMETHEUS_PORT=9090

# Grafana (default credentials; change the password before exposing Grafana)
GRAFANA_PORT=3001
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin

# Alertmanager
ALERTMANAGER_PORT=9093

Prometheus Scrape Targets

scrape_configs:
  - job_name: 'changemaker-api'
    static_configs:
      - targets: ['api:4000']

  - job_name: 'media-api'
    static_configs:
      - targets: ['media-api:4100']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Alert Rules

Example alert rules:

groups:
  - name: api_alerts
    rules:
      - alert: APIDown
        expr: up{job="changemaker-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 5 minutes"

      - alert: HighErrorRate
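        # Share of requests that returned 5xx over the last 5 minutes (> 5%)
        expr: |
          sum(rate(http_request_duration_ms_count{status=~"5.."}[5m]))
            / sum(rate(http_request_duration_ms_count[5m])) > 0.05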
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

Metrics Usage

Increment Counter

import { metrics } from '../utils/metrics';

// Campaign email sent
metrics.campaignEmailsSent.inc();

// Geocode request
metrics.geocodeRequests.inc({ provider: 'nominatim' });

Set Gauge

// Update queue size
metrics.emailQueueSize.set(queueSize);

// Update active sessions
metrics.canvassSessionsActive.set(activeSessions);

// Set service health (1 = healthy, 0 = unhealthy)
metrics.externalServiceHealth.set({ service: 'redis' }, 1);

Observe Histogram

// Time geocoding request
const end = metrics.geocodeDuration.startTimer();
try {
  await geocode(address);
  end({ success: 'true' });
} catch (error) {
  end({ success: 'false' });
}

Grafana Dashboards

Dashboard Setup

Dashboards are auto-provisioned from configs/grafana/dashboards/:

{
  "dashboard": {
    "title": "Changemaker Lite Overview",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(http_request_duration_ms_count[5m])"
          }
        ]
      }
    ]
  }
}
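
The panel expression above uses rate(), which turns an ever-increasing counter into a per-second value over a time window. The sketch below shows the core idea, including how a counter reset (for example after an API restart) is handled. It is a simplified illustration: the Sample type and function name are assumptions, and unlike Prometheus it does not extrapolate to the window boundaries.

```typescript
// Hypothetical shape: one scraped counter sample.
interface Sample {
  timestampSec: number;
  value: number;
}

// Average per-second increase across the samples in a window.
function perSecondRate(samples: Sample[]): number {
  if (samples.length < 2) return NaN;

  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop means the counter was reset; count the new value as the increase.
    increase += delta >= 0 ? delta : samples[i].value;
  }

  const elapsed = samples[samples.length - 1].timestampSec - samples[0].timestampSec;
  return elapsed > 0 ? increase / elapsed : NaN;
}
```

Because resets are absorbed this way, counters such as cm_campaign_emails_sent_total can simply start from zero on every restart without producing negative rates on the dashboard.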

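Accessing Dashboards

With the monitoring profile running, Grafana is available at http://localhost:3001 (default login admin/admin). The same dashboards are embedded in the Dashboards tab of the admin Observability page (/app/observability).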

Alertmanager

Alert Routing

Configure in configs/alertmanager/alertmanager.yml:

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://gotify:8889/message'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@example.com'
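
To make the routing tree above concrete, the sketch below resolves which receiver an alert ends up at: the first child route whose match labels all equal the alert's labels wins, and otherwise the parent's receiver is used. This is a simplified model for illustration only. The Route type and resolveReceiver are hypothetical names, and real Alertmanager routing also supports regex matchers, the continue flag, and receiver inheritance.

```typescript
// Simplified model of an Alertmanager route node.
interface Route {
  receiver: string;
  match?: Record<string, string>;
  routes?: Route[];
}

// Walk the tree depth-first; the first matching child takes the alert.
function resolveReceiver(route: Route, labels: Record<string, string>): string {
  for (const child of route.routes ?? []) {
    const matches = Object.entries(child.match ?? {}).every(
      ([name, value]) => labels[name] === value,
    );
    if (matches) {
      return resolveReceiver(child, labels);
    }
  }
  // No child matched: fall back to this node's receiver.
  return route.receiver;
}

// Mirrors the configuration shown above.
const exampleRoutingTree: Route = {
  receiver: 'default',
  routes: [{ match: { severity: 'critical' }, receiver: 'critical-alerts' }],
};
```

Under this model an alert labelled severity: critical goes to critical-alerts (email), while warning and info alerts fall through to the default webhook receiver.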

Notification Channels

Supported receivers:

  • Webhook - Gotify, Slack, Discord
  • Email - SMTP notifications
  • PagerDuty - Incident management
  • Opsgenie - Alert management

Service Health Monitoring

External Service Checks

Monitor services via health gauges:

// Check Redis
try {
  await redisClient.ping();
  metrics.externalServiceHealth.set({ service: 'redis' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'redis' }, 0);
}

// Check PostgreSQL
try {
  await prisma.$queryRaw`SELECT 1`;
  metrics.externalServiceHealth.set({ service: 'postgres' }, 1);
} catch (error) {
  metrics.externalServiceHealth.set({ service: 'postgres' }, 0);
}

Docker Healthchecks

Services with healthchecks:

  • API - wget --spider http://localhost:4000/health
  • Media API - wget --spider http://localhost:4100/health
  • PostgreSQL - pg_isready
  • Redis - redis-cli ping
  • Listmonk - wget --spider http://localhost:9000/health

Performance Monitoring

HTTP Request Tracking

Automatic tracking of:

  • Request count by route
  • Request duration percentiles
  • Status code distribution
  • Error rates

Queue Monitoring

Track queue depths:

  • Email queue size
  • Geocode queue size
  • Failed job count
  • Processing rate

Resource Monitoring

Via cAdvisor and Node Exporter:

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network traffic

Admin Interface

Metrics Tab

Display cards:

  • API uptime
  • Request rate (req/sec)
  • Error rate (%)
  • Queue sizes
  • Active sessions
  • Service health
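
The request rate and error rate cards can be derived from two counts over the same time window. The sketch below shows one way to compute them. The WindowCounts and TrafficSummary types and the summarizeTraffic name are illustrative assumptions, not the admin page's actual code.

```typescript
// Hypothetical input: request counts observed over one time window.
interface WindowCounts {
  total: number;         // all requests in the window
  serverErrors: number;  // requests that ended with a 5xx status
  windowSeconds: number; // length of the window
}

interface TrafficSummary {
  requestsPerSecond: number;
  errorRatePercent: number;
}

// Derive the two card values; empty windows report zero rather than NaN.
function summarizeTraffic(counts: WindowCounts): TrafficSummary {
  const requestsPerSecond =
    counts.windowSeconds > 0 ? counts.total / counts.windowSeconds : 0;
  const errorRatePercent =
    counts.total > 0 ? (counts.serverErrors / counts.total) * 100 : 0;
  return { requestsPerSecond, errorRatePercent };
}
```

Guarding against an empty window keeps the cards readable during quiet periods, when a plain division would otherwise show NaN.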

Dashboards Tab

Embedded Grafana:

  • Overview dashboard
  • Canvassing metrics
  • External services
  • Custom queries

Alerts Tab

Active alerts list:

  • Alert name
  • Severity
  • Status (firing/pending/resolved)
  • Duration
  • Quick actions (silence, resolve)

Starting Monitoring Stack

# Start with monitoring profile
docker compose --profile monitoring up -d

# Access services
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
# Alertmanager: http://localhost:9093

API Endpoints

Observability Endpoints

GET    /api/observability/prometheus   # Prometheus status
GET    /api/observability/grafana      # Grafana status
GET    /api/observability/alertmanager # Alertmanager status
GET    /api/observability/metrics      # Current metrics values

Metrics Endpoint

GET    /metrics                         # Prometheus scrape endpoint
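
The scrape endpoint returns metrics in the Prometheus text exposition format, which is plain text with one sample per line. The sketch below renders a gauge in that format to show what Prometheus reads from /metrics. In practice a client library generates this output; renderGauge, escapeLabelValue, and the help text in the usage note are hypothetical and only illustrate the wire format.

```typescript
// Hypothetical shape: one gauge sample with optional labels.
interface GaugeSample {
  labels?: Record<string, string>;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render a gauge as HELP and TYPE comment lines followed by one line per sample.
// Simplified: the help text is assumed to need no escaping.
function renderGauge(name: string, help: string, samples: GaugeSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`];
  for (const sample of samples) {
    const pairs = Object.entries(sample.labels ?? {}).map(
      ([key, value]) => `${key}="${escapeLabelValue(value)}"`,
    );
    const labelPart = pairs.length > 0 ? `{${pairs.join(',')}}` : '';
    lines.push(`${name}${labelPart} ${sample.value}`);
  }
  return lines.join('\n') + '\n';
}
```

Calling renderGauge('cm_external_service_health', 'Service health', [{ labels: { service: 'redis' }, value: 1 }]) yields the two comment lines followed by cm_external_service_health{service="redis"} 1.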