admin/changemaker.lite

bunker-admin 7895ce683e Tonne of debugging - getting ready for the production builds

2026-02-16 10:44:18 -07:00

Monitoring Stack (Prometheus + Grafana)

Overview

Changemaker Lite V2 includes a complete observability stack for production monitoring:

Prometheus: Metrics collection + alerting rules
Grafana: Visualization + pre-configured dashboards
Alertmanager: Alert routing + notifications
cAdvisor: Docker container metrics
Node Exporter: Host system metrics
Redis Exporter: Redis-specific metrics
Gotify: Push notifications (optional)

All monitoring services sit behind the Docker Compose `monitoring` profile, so they are opt-in and only start when that profile is enabled.

Architecture

graph LR
    subgraph "Application Metrics"
        API[API<br/>:4000/api/metrics]
        MEDIA[Media API<br/>:4100/metrics]
    end

    subgraph "Infrastructure Metrics"
        CADVISOR[cAdvisor<br/>Container Stats]
        NODE[Node Exporter<br/>Host Stats]
        REDIS_EXP[Redis Exporter<br/>Redis Stats]
    end

    subgraph "Monitoring Stack"
        PROM[Prometheus<br/>:9090]
        GRAFANA[Grafana<br/>:3001]
        ALERT[Alertmanager<br/>:9093]
        GOTIFY[Gotify<br/>:8889]
    end

    API --> PROM
    MEDIA --> PROM
    CADVISOR --> PROM
    NODE --> PROM
    REDIS_EXP --> PROM

    PROM --> GRAFANA
    PROM --> ALERT
    ALERT --> GOTIFY

Quick Start

Enable Monitoring

# Start with monitoring profile
docker compose --profile monitoring up -d

# Check services
docker compose --profile monitoring ps

# Access dashboards
open http://localhost:3001  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
open http://localhost:9093  # Alertmanager

Prometheus Configuration

Scrape Targets

File: configs/prometheus/prometheus.yml

scrape_configs:
  # V2 Unified API Metrics (10s interval)
  - job_name: 'changemaker-v2-api'
    static_configs:
      - targets: ['changemaker-v2-api:4000']
    metrics_path: '/api/metrics'
    scrape_interval: 10s
    scrape_timeout: 5s

  # Redis Metrics (15s interval)
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 15s

  # cAdvisor - Docker container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 15s

  # Node Exporter - System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 15s

  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Alertmanager monitoring
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
    scrape_interval: 30s

Intervals:

10s: API (real-time application metrics)
15s: Infrastructure (host + containers + Redis)
30s: Alertmanager (the Prometheus self-scrape job sets no interval, so it inherits the global scrape_interval)

Custom Metrics (cm_*)

File: api/src/utils/metrics.ts

12 custom metrics for domain-specific monitoring:

Metric	Type	Labels	Description
`cm_emails_sent_total`	Counter	`campaign_id`	Campaign emails sent successfully
`cm_emails_failed_total`	Counter	`campaign_id`, `error_type`	Failed email sends
`cm_email_queue_size`	Gauge	-	Current email queue size
`cm_email_send_duration_seconds`	Histogram	-	Email send latency
`cm_login_attempts_total`	Counter	`status`	Login attempts (success/failure)
`cm_active_sessions`	Gauge	-	Active refresh tokens
`cm_campaign_emails_total`	Counter	`campaign_id`	Total campaign emails created
`cm_response_submissions_total`	Counter	-	Response wall submissions
`cm_canvass_visits_total`	Counter	`outcome`	Canvass visits by outcome
`cm_active_canvass_sessions`	Gauge	-	Active canvass sessions
`cm_shift_signups_total`	Counter	-	Shift signups
`cm_external_service_up`	Gauge	`service`	External service health (1=up, 0=down)
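
To make the Counter rows in the table concrete, here is a minimal, dependency-free sketch of what a labelled counter such as `cm_emails_sent_total` exposes on the metrics endpoint. It deliberately avoids prom-client: `LabelledCounter` and `escapeLabel` are hypothetical names used only for illustration, and the actual code in api/src/utils/metrics.ts may look quite different.

```typescript
// Illustrative only: a tiny labelled counter that renders the Prometheus
// text exposition format. Not the project's implementation.
type Labels = Record<string, string>;

// Label values must escape backslash, double quote and newline.
const escapeLabel = (v: string): string =>
  v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

class LabelledCounter {
  private readonly values: Record<string, number> = {};

  constructor(
    private readonly name: string,
    private readonly help: string,
    private readonly labelNames: string[],
  ) {}

  // Counters only ever go up, so negative increments are rejected.
  inc(labels: Labels, amount = 1): void {
    if (amount < 0) throw new Error("counter cannot decrease");
    const key = this.labelNames
      .map((n) => `${n}="${escapeLabel(labels[n] ?? "")}"`)
      .join(",");
    this.values[key] = (this.values[key] ?? 0) + amount;
  }

  // One HELP line, one TYPE line, then one sample line per label set.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const key of Object.keys(this.values)) {
      const labelPart = key === "" ? "" : `{${key}}`;
      lines.push(`${this.name}${labelPart} ${this.values[key]}`);
    }
    return lines.join("\n") + "\n";
  }
}

// Hypothetical campaign IDs, chosen only for the example.
const emailsSent = new LabelledCounter(
  "cm_emails_sent_total",
  "Campaign emails sent successfully",
  ["campaign_id"],
);
emailsSent.inc({ campaign_id: "42" });
emailsSent.inc({ campaign_id: "42" }, 2);
emailsSent.inc({ campaign_id: "7" });
const exposition = emailsSent.render();
```

Each distinct `campaign_id` becomes its own time series, which is why label values should stay low in cardinality.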

HTTP metrics (standard prom-client):

http_requests_total
http_request_duration_seconds

Geocoding metrics:

cm_geocode_cache_hits_total
cm_geocode_cache_misses_total
cm_geocode_requests_total
cm_geocode_duration_seconds

Email template metrics:

cm_email_templates_updated_total
cm_email_test_sent_total
cm_email_template_rollback_total
cm_email_template_cache_hit_total
cm_email_template_cache_miss_total

Location query metrics:

cm_map_location_query_duration_seconds
cm_map_location_query_count_total
cm_map_location_result_count

Alert Rules

File: configs/prometheus/alerts.yml

12 alert rules across 4 groups:

Application Alerts

ApplicationDown: API unreachable for 2 minutes
HighErrorRate: >10% 5xx errors for 5 minutes
EmailQueueBacklog: Queue size >100 for 10 minutes
HighEmailFailureRate: >20% email failures for 10 minutes
SuspiciousLoginActivity: >5 failed logins/sec for 2 minutes
HighAPILatency: P95 latency >2s for 5 minutes
ExternalServiceDown: External service unreachable for 5 minutes

System Alerts

RedisDown: Redis unreachable for 1 minute
DiskSpaceLow: <15% disk space for 5 minutes
DiskSpaceCritical: <10% disk space for 2 minutes
HighCPUUsage: >85% CPU for 10 minutes
HighMemoryUsage: >85% memory for 10 minutes

Example Alert:

- alert: ApplicationDown
  expr: up{job="changemaker-v2-api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "V2 API is down"
    description: "The Changemaker V2 API has been down for more than 2 minutes."
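
The HighAPILatency rule listed above compares a P95 latency against 2s. As a rough illustration of how a percentile is estimated from cumulative histogram buckets (such as those behind `http_request_duration_seconds`), the sketch below interpolates linearly inside the bucket that contains the target rank. This is the general idea behind PromQL's `histogram_quantile()`, not a copy of it; Prometheus handles more edge cases. The bucket bounds and counts are invented for the example.

```typescript
// One cumulative bucket: `count` is the number of observations <= `le`.
interface Bucket {
  le: number; // upper bound in seconds; Infinity stands for the "+Inf" bucket
  count: number;
}

// Estimate quantile q (0..1) by linear interpolation within a bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) throw new Error("q must be between 0 and 1");
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length < 2 || sorted[sorted.length - 1].le !== Infinity) {
    throw new Error("need at least one finite bucket plus the +Inf bucket");
  }
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  let i = 0;
  while (sorted[i].count < rank) i++;

  // Rank falls in the +Inf bucket: the best answer is the highest finite bound.
  if (i === sorted.length - 1) return sorted[i - 1].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

// Invented sample: 100 requests, 96 of them finished within 2 seconds.
const latencyBuckets: Bucket[] = [
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 2, count: 96 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, latencyBuckets);
const wouldBreachThreshold = p95 > 2;
```

With these numbers the rank 95 lands in the 1s to 2s bucket, so the estimate is 1 + 5/6 seconds and the 2s threshold is not breached. The estimate is only as precise as the bucket boundaries allow.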

Data Retention

docker-compose.yml:

prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # 30 days

Disk usage: ~1-5GB for 30 days (depends on scrape frequency + cardinality).
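
The estimate above can be reproduced with the sizing rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies it; `estimateTsdbSize` is a hypothetical helper, and the figure of 10,000 active series is an assumption for illustration, not a measurement of this stack.

```typescript
const SECONDS_PER_DAY = 86400;

interface RetentionEstimate {
  samplesPerSecond: number;
  bytes: number;
  gigabytes: number; // decimal GB (1e9 bytes)
}

// Rough TSDB sizing: retention_seconds * samples_per_second * bytes_per_sample.
function estimateTsdbSize(
  retentionDays: number,
  activeSeries: number,
  scrapeIntervalSeconds: number,
  bytesPerSample = 2, // conservative end of the usual 1-2 byte range
): RetentionEstimate {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const bytes =
    retentionDays * SECONDS_PER_DAY * samplesPerSecond * bytesPerSample;
  return { samplesPerSecond, bytes, gigabytes: bytes / 1e9 };
}

// Assumed workload: 10,000 active series scraped every 15 seconds.
const thirtyDays = estimateTsdbSize(30, 10000, 15);
const ninetyDays = estimateTsdbSize(90, 10000, 15);
```

Under these assumptions 30 days comes to about 3.5 GB and 90 days to about 10.4 GB. Size grows linearly with retention and series count, and shrinks as the scrape interval gets longer.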

Increase retention:

# Edit docker-compose.yml
# Change to '--storage.tsdb.retention.time=90d'

# Recreate container
docker compose --profile monitoring up -d --force-recreate prometheus

Grafana Configuration

Datasource

File: configs/grafana/datasources.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Auto-provisioned on Grafana startup.

Dashboards

File: configs/grafana/dashboards.yml

apiVersion: 1

providers:
  - name: 'Default'
    folder: 'Changemaker Lite'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

3 pre-configured dashboards:

1. Application Overview

File: configs/grafana/application-overview.json

Panels:

API uptime (last 24h)
Request rate (req/sec)
Error rate (%)
Email queue size
Active sessions
Campaign emails sent

Refresh: 10s

2. API Performance

File: configs/grafana/api-performance.json

Panels:

Request latency (P50, P95, P99)
Requests by status code
Top 10 slowest endpoints
HTTP errors by route
Geocoding cache hit rate
Email send duration

Refresh: 30s

3. System Health

File: configs/grafana/system-health.json

Panels:

CPU usage (%)
Memory usage (%)
Disk space (GB free)
Network I/O (MB/s)
Container CPU throttling
Redis memory usage

Refresh: 1m

First Login

# Access Grafana
open http://localhost:3001

# Default credentials
Username: admin
Password: admin

# Change password on first login

Navigate: Dashboards → Changemaker Lite folder → Select dashboard

Alertmanager Configuration

Notification Receivers

File: configs/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
    # Email (example)
    email_configs:
      - to: 'admin@cmlite.org'
        from: 'alerts@cmlite.org'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alerts@cmlite.org'
        auth_password: 'your-password'

    # Slack (example)
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'

    # Gotify (push notifications)
    webhook_configs:
      - url: 'http://gotify:80/message?token=YOUR_GOTIFY_TOKEN'

Grouping: Combines similar alerts (prevents spam).

Repeat: Re-sends unresolved alerts every 4 hours.
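
As a sketch of what `group_by: ['alertname', 'severity']` means in practice, the function below buckets alerts by the values of those labels, so alerts that agree on both end up in one notification. It is a simplified model rather than Alertmanager's code; `groupAlerts` and the `mountpoint` label values are hypothetical.

```typescript
interface Alert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Alerts that share the same values for every group_by label share a group.
function groupAlerts(
  alerts: Alert[],
  groupBy: string[],
): Record<string, Alert[]> {
  const groups: Record<string, Alert[]> = {};
  for (const alert of alerts) {
    const key = groupBy
      .map((name) => `${name}=${alert.labels[name] ?? ""}`)
      .join(",");
    const existing = groups[key] ?? [];
    existing.push(alert);
    groups[key] = existing;
  }
  return groups;
}

// Two disk alerts for different mountpoints plus one Redis alert.
const firing: Alert[] = [
  {
    labels: { alertname: "DiskSpaceLow", severity: "warning", mountpoint: "/" },
    annotations: { summary: "Disk space low on /" },
  },
  {
    labels: { alertname: "DiskSpaceLow", severity: "warning", mountpoint: "/data" },
    annotations: { summary: "Disk space low on /data" },
  },
  {
    labels: { alertname: "RedisDown", severity: "critical" },
    annotations: { summary: "Redis is down" },
  },
];
const grouped = groupAlerts(firing, ["alertname", "severity"]);
const groupKeys = Object.keys(grouped);
```

Here three alerts collapse into two groups, so the receiver gets two notifications instead of three. Adding more labels to `group_by` produces smaller, more specific groups.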

Testing Alerts

Manual test:

# Trigger test alert (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname":"TestAlert","severity":"warning"},
    "annotations": {"summary":"Test alert from curl"}
  }]'

# Check Alertmanager UI
open http://localhost:9093

Force alert (stop API):

# Stop API (triggers ApplicationDown alert after 2m)
docker compose stop api

# Check Prometheus alerts
open http://localhost:9090/alerts

# Wait 2 minutes → Alert fires → Notification sent

Exporters

cAdvisor (Container Metrics)

Metrics:

CPU usage per container
Memory usage per container
Network I/O
Disk I/O

Access: http://localhost:8080

Configuration (docker-compose.yml):

cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  container_name: cadvisor-changemaker
  privileged: true  # Required for full access
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
    - /dev/disk/:/dev/disk:ro
  devices:
    - /dev/kmsg

Node Exporter (Host Metrics)

Metrics:

CPU usage (all cores)
Memory usage (total, free, cached)
Disk usage (filesystem, mountpoints)
Network I/O (bytes, packets)

Access: http://localhost:9100/metrics

Configuration:

node-exporter:
  command:
    - '--path.rootfs=/rootfs'  # Must match the /:/rootfs:ro mount below
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro

Redis Exporter

Metrics:

Memory usage
Commands per second
Connected clients
Keyspace hits/misses
Evicted keys

Access: http://localhost:9121/metrics

Configuration:

redis-exporter:
  environment:
    - REDIS_ADDR=redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}  # Authenticates with Redis

Gotify (Push Notifications)

Setup:

# Access Gotify UI
open http://localhost:8889

# Login (default: admin/admin)

# Create app → Copy token

# Add to Alertmanager config:
webhook_configs:
  - url: 'http://gotify:80/message?token=YOUR_TOKEN'

Mobile apps: Available for iOS/Android (receive push notifications).

Accessing Services

Service	URL	Default Credentials
Prometheus	http://localhost:9090	None
Grafana	http://localhost:3001	admin / admin
Alertmanager	http://localhost:9093	None
cAdvisor	http://localhost:8080	None
Node Exporter	http://localhost:9100/metrics	None
Redis Exporter	http://localhost:9121/metrics	None
Gotify	http://localhost:8889	admin / admin

Troubleshooting

Prometheus Not Scraping

Symptoms: Missing data in Grafana dashboards

Diagnosis:

# Check Prometheus targets
open http://localhost:9090/targets

# Look for errors (red) vs success (green)

# Check API metrics endpoint
curl http://localhost:4000/api/metrics

Common causes:

API container not running
Wrong port in prometheus.yml
Network connectivity issue

Solution:

# Restart API
docker compose restart api

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart Prometheus
docker compose restart prometheus
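
The same information shown on the targets page is available as JSON from the Prometheus HTTP API at `/api/v1/targets`. The sketch below filters such a response down to the targets that are not healthy. Only a subset of the response fields is typed, `unhealthyTargets` is a hypothetical helper, and the sample payload (including its error text) is invented for the example.

```typescript
// Subset of the /api/v1/targets response used by this sketch.
interface ActiveTarget {
  labels: Record<string, string>;
  scrapeUrl: string;
  health: string; // "up", "down" or "unknown"
  lastError: string; // empty when the last scrape succeeded
}

interface TargetsResponse {
  status: string;
  data: { activeTargets: ActiveTarget[] };
}

// Return one human-readable line per target that is not "up".
function unhealthyTargets(response: TargetsResponse): string[] {
  return response.data.activeTargets
    .filter((t) => t.health !== "up")
    .map((t) => {
      const job = t.labels["job"] ?? "unknown-job";
      return `${job} ${t.scrapeUrl}: ${t.lastError || t.health}`;
    });
}

// Invented sample: the API target is down, the Redis exporter is fine.
const sampleResponse: TargetsResponse = {
  status: "success",
  data: {
    activeTargets: [
      {
        labels: { job: "changemaker-v2-api", instance: "changemaker-v2-api:4000" },
        scrapeUrl: "http://changemaker-v2-api:4000/api/metrics",
        health: "down",
        lastError: "connection refused",
      },
      {
        labels: { job: "redis", instance: "redis-exporter:9121" },
        scrapeUrl: "http://redis-exporter:9121/metrics",
        health: "up",
        lastError: "",
      },
    ],
  },
};
const problems = unhealthyTargets(sampleResponse);
```

In a real check the response would come from `http://localhost:9090/api/v1/targets`, and an empty result means every scrape target is healthy.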

Grafana Dashboards Not Loading

Symptoms: Blank dashboards or "No data" errors

Diagnosis:

# Check Grafana logs
docker compose logs grafana | tail -50

# Check datasource
open http://localhost:3001/datasources

# Test Prometheus query (from the host; inside the Docker network the hostname is prometheus)
curl 'http://localhost:9090/api/v1/query?query=up'

Solution:

# Verify datasource URL
# Should be http://prometheus:9090 (container name, not localhost)

# Restart Grafana
docker compose restart grafana

Alerts Not Firing

Symptoms: No notifications despite issues

Diagnosis:

# Check Prometheus alerts
open http://localhost:9090/alerts

# Check Alertmanager
open http://localhost:9093

# Verify alert rules loaded
curl http://localhost:9090/api/v1/rules

Solution:

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check alerts.yml syntax
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

# Test notification receiver
curl -X POST -H 'Content-Type: application/json' http://localhost:9093/api/v2/alerts -d '[...]'

Production Best Practices

Secure Grafana

Change admin password:

# Via UI: Admin → Profile → Change Password

# Via env var (docker-compose.yml):
environment:
  - GF_SECURITY_ADMIN_PASSWORD=<strong-password>

Disable signup:

environment:
  - GF_USERS_ALLOW_SIGN_UP=false  # Already set

Alert Tuning

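Avoid false positives: Increase the `for` duration on alerts that fire on short-lived spikes, so the condition must hold continuously before a notification is sent.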

Example (before):

- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 1m  # Too aggressive

Example (after):

- alert: DiskSpaceLow
  expr: disk_free_percent < 15
  for: 10m  # More reasonable
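
To see why the longer `for` value suppresses noise, the sketch below models how a rule moves between inactive, pending and firing: the alert is pending while the expression is true, fires once it has been true continuously for the `for` duration, and resets on any false evaluation. It is a simplified model of the behaviour, not Prometheus code, and the 60-second evaluation interval is an assumption.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// Apply a `for` clause to a series of rule evaluations.
// conditionTrue[i] is the result of the alert expression at evaluation i.
function alertStates(
  conditionTrue: boolean[],
  evalIntervalSeconds: number,
  forSeconds: number,
): AlertState[] {
  const states: AlertState[] = [];
  let activeSince = -1; // -1 means the condition is not currently holding
  for (let i = 0; i < conditionTrue.length; i++) {
    const now = i * evalIntervalSeconds;
    if (!conditionTrue[i]) {
      activeSince = -1; // any false evaluation resets the timer
      states.push("inactive");
      continue;
    }
    if (activeSince < 0) activeSince = now;
    states.push(now - activeSince >= forSeconds ? "firing" : "pending");
  }
  return states;
}

// A short dip below the threshold: true for three evaluations, then it clears.
const briefDip = [false, true, true, true, false];
const withFor1m = alertStates(briefDip, 60, 60);
const withFor10m = alertStates(briefDip, 60, 600);
```

With `for: 1m` the brief dip reaches firing on its second true evaluation, while with `for: 10m` it stays pending and clears without notifying anyone. The trade-off is that a real outage is also reported up to ten minutes later.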

External Storage (Long-Term)

Prometheus supports remote write to:

Thanos: Long-term storage (S3/GCS)
Cortex: Multi-tenant Prometheus
VictoriaMetrics: High-performance storage

Example (Thanos):

# prometheus.yml
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"

Related Documentation

Docker Compose: Monitoring services configuration
Environment Variables: Monitoring env vars
API Reference: Custom metrics implementation
