Monitoring and Observability Issues

This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

Overview

Monitoring Stack

Changemaker Lite V2 uses profile-based monitoring (optional):

# Start with monitoring
docker compose --profile monitoring up -d

Components:

  • Prometheus - Metrics collection and storage (port 9090)
  • Grafana - Metrics visualization (port 3001)
  • Alertmanager - Alert routing and notification (port 9093)
  • cAdvisor - Container metrics (port 8080)
  • Node Exporter - Host metrics (port 9100)
  • Redis Exporter - Redis metrics (port 9121)

Custom Metrics

12 custom cm_* Prometheus metrics:

  1. cm_api_uptime_seconds - API uptime
  2. cm_database_uptime_seconds - Database uptime
  3. cm_email_queue_size - Email queue depth
  4. cm_geocoding_queue_size - Geocoding queue depth
  5. cm_users_total - Total users
  6. cm_campaigns_total - Total campaigns
  7. cm_locations_total - Total locations
  8. cm_geocoded_locations_total - Geocoded locations
  9. cm_active_canvass_sessions - Active sessions
  10. cm_external_service_up - Service health (0/1)
  11. cm_listmonk_subscribers_total - Listmonk subscribers
  12. cm_media_videos_total - Total videos

Plus standard HTTP metrics:

  • http_request_duration_seconds
  • http_requests_total

Prometheus Not Scraping

Target Down

Severity: 🔴 Critical

Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

Target: api (http://api:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused

No data in Grafana dashboards.

Common Causes

  1. Service not running - API container stopped
  2. Metrics endpoint missing - /metrics endpoint not registered
  3. Network issue - Prometheus can't reach service
  4. Authentication required - Metrics endpoint requires auth

Solutions

Solution 1: Check service is running

# Is API running?
docker compose ps api

# Should show "Up"
# If not:
docker compose up -d api

Solution 2: Test metrics endpoint

# From host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics

Solution 3: Check Prometheus config

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']  # Use service name, not localhost

Solution 4: Verify network

# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode

# Should both show "changemaker-lite"

Solution 5: Check metrics are registered

In API logs:

docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
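
If those lines are missing, compare server.ts against a minimal wiring sketch like this one (prom-client with Express; collectDefaultMetrics and the exact log text are illustrative assumptions):

import express from 'express';
import { collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();  // adds standard process_* and nodejs_* metrics to the default registry

const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(4000, () => console.log('Metrics endpoint registered at /metrics'));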

Prevention

  • Health checks - Monitor Prometheus target health
  • Service dependencies - Ensure services start in order
  • Network config - Use Docker service names
  • Testing - Test /metrics endpoint on deploy (see the sketch below)
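
For the deploy-time test, a small script along these lines works (hypothetical scripts/check-metrics.ts; assumes Node 18+ for the global fetch and an ESM runner such as tsx):

// scripts/check-metrics.ts (hypothetical) - fail the deploy if /metrics is missing or empty
const res = await fetch('http://localhost:4000/metrics');
if (!res.ok) throw new Error(`/metrics returned HTTP ${res.status}`);

const body = await res.text();
if (!body.includes('cm_api_uptime_seconds')) {
  throw new Error('cm_api_uptime_seconds not found in /metrics output');
}
console.log('metrics endpoint OK');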

Scrape Timeout

Severity: 🟡 Medium

Symptoms

Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded

Scrapes taking too long or timing out.

Solutions

Solution 1: Increase scrape timeout

In configs/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  scrape_timeout: 15s  # Raised from the 10s default (cannot exceed scrape_interval)

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s  # Scrape less frequently
    scrape_timeout: 20s
    static_configs:
      - targets: ['api:4000']

Reload config:

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus

Solution 2: Optimize metrics generation

// In api/src/utils/metrics.ts
// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));

Solution 3: Reduce metric cardinality

// Bad - high cardinality (creates metric per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']  // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']  // Only 5 roles
});

Prevention

  • Cache expensive metrics - Don't query DB on every scrape
  • Reasonable timeouts - keep scrape timeouts in the 10-30s range
  • Low cardinality - Avoid high-cardinality labels
  • Optimize queries - Fast metric queries

Authentication Errors

Severity: 🟡 Medium

Symptoms

Error: 401 Unauthorized when scraping /metrics

Solutions

Changemaker Lite V2 metrics endpoint is public (no auth required).

If you see auth errors:

Solution 1: Remove auth middleware from /metrics

In api/src/server.ts:

// Metrics endpoint should be BEFORE authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);

Solution 2: Configure basic auth in Prometheus

If you DO want to protect /metrics:

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
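
Protecting the endpoint also means enforcing the same credentials on the API side, which Changemaker Lite V2 does not do by default. A sketch of what that could look like (Express; METRICS_USER and METRICS_PASSWORD are hypothetical env vars that must match the basic_auth block above):

import express from 'express';
import { register } from 'prom-client';

const app = express();  // in practice, the existing app from api/src/server.ts

const USER = process.env.METRICS_USER ?? 'prometheus';
const PASS = process.env.METRICS_PASSWORD ?? '';
const expected = 'Basic ' + Buffer.from(`${USER}:${PASS}`).toString('base64');

app.get('/metrics', async (req, res) => {
  if (req.headers.authorization !== expected) {
    res.set('WWW-Authenticate', 'Basic realm="metrics"');
    return res.status(401).end();
  }
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});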

Prevention

  • Public metrics - Keep /metrics public for simplicity
  • Network isolation - Use Docker networks for security
  • IP whitelist - Only allow Prometheus IP

Grafana Issues

Dashboards Not Loading

Severity: 🟠 High

Symptoms

Grafana shows blank dashboards or "No data" panels.

Solutions

Solution 1: Check Grafana is running

docker compose --profile monitoring ps grafana

# Should show "Up"
# If not:
docker compose --profile monitoring up -d grafana

Solution 2: Verify Prometheus datasource

  1. Open Grafana: http://localhost:3001
  2. Login (admin/admin)
  3. Settings → Data Sources
  4. Click Prometheus
  5. URL should be: http://prometheus:9090
  6. Click "Save & Test"
  7. Should show "Data source is working"

Solution 3: Check dashboard provisioning

# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json

Solution 4: Import dashboard manually

If auto-provisioning fails:

  1. Grafana → Dashboards → Import
  2. Upload JSON from configs/grafana/dashboards/
  3. Select Prometheus datasource
  4. Click Import

Solution 5: Check for data

# Test query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

Prevention

  • Dashboard versioning - Keep dashboards in git
  • Auto-provisioning - Use provisioning instead of manual import
  • Testing - Test dashboards after changes
  • Documentation - Document dashboard variables

Datasource Errors

Severity: 🟠 High

Symptoms

Error: Failed to query Prometheus
Error: connection refused

Red error bars on Grafana panels.

Solutions

Solution 1: Test Prometheus connection

# From Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}

Solution 2: Check Prometheus is running

docker compose --profile monitoring ps prometheus

# Should show "Up"

Solution 3: Verify datasource URL

In Grafana datasource settings:

  • URL: http://prometheus:9090 (NOT http://localhost:9090)
  • Access: Server (NOT Browser)

Solution 4: Check Docker network

# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode

Prevention

  • Health checks - Monitor datasource health
  • Service dependencies - Start Prometheus before Grafana
  • Error handling - Graceful error messages

Query Errors

Severity: 🟡 Medium

Symptoms

Error executing query: parse error at char X: unexpected identifier

Panel shows "Error loading data".

Solutions

Solution 1: Validate PromQL syntax

Common errors:

# Bad - label value not quoted
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - wrong function
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)

Solution 2: Test query in Explore

  1. Grafana → Explore
  2. Enter query
  3. Run
  4. Fix errors before adding to dashboard

Solution 3: Check metric exists

# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_

Solution 4: Use metric browser

In Grafana query editor:

  1. Click "Metrics" button
  2. Browse available metrics
  3. Select metric (auto-fills query)

Prevention

  • Query validation - Validate before saving
  • Testing - Test queries in Explore
  • Documentation - Document available metrics
  • Examples - Provide query examples

Alertmanager Issues

Alerts Not Firing

Severity: 🟠 High

Symptoms

Conditions met but alert not triggering.

Solutions

Solution 1: Check alert rules

In Prometheus UI (localhost:9090):

  1. Click "Alerts"
  2. Find your alert
  3. Check state:
    • Inactive: Condition not met
    • Pending: Condition met, waiting out the for: duration
    • Firing: Alert active

Solution 2: Verify alert rule syntax

In configs/prometheus/alerts.yml:

groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m  # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"

Solution 3: Check Alertmanager config

# Test Alertmanager
curl http://localhost:9093/api/v1/alerts

# Should return alert list

Solution 4: View Prometheus logs

docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing

Solution 5: Reload alert rules

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check rules loaded
curl http://localhost:9090/api/v1/rules

Prevention

  • Test alert conditions - Trigger manually to test
  • Reasonable thresholds - Not too sensitive or too lenient
  • Documentation - Document alert thresholds
  • Regular review - Review alert effectiveness

Notifications Not Sent

Severity: 🟡 Medium

Symptoms

Alert firing in Prometheus but no notification received.

Solutions

Solution 1: Check Alertmanager config

In configs/alertmanager/alertmanager.yml:

route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'

Solution 2: Test Alertmanager notification

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "Test",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }]'

# Check if notification sent
docker compose logs alertmanager | grep -i "notification\|email"

Solution 3: Check SMTP config

See Email Issues for SMTP troubleshooting.

Solution 4: Use alternative notification channels

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
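
Alertmanager POSTs a JSON body with an alerts array to the webhook URL. A minimal sketch of a receiving endpoint (Express; the port and route are placeholders):

import express from 'express';

const app = express();
app.use(express.json());

// Alertmanager sends { status, alerts: [{ status, labels, annotations, startsAt, ... }], ... }
app.post('/alerts', (req, res) => {
  for (const alert of req.body.alerts ?? []) {
    console.log(`[${alert.status}] ${alert.labels?.alertname}: ${alert.annotations?.summary ?? ''}`);
  }
  res.status(200).end();  // any 2xx response counts as a successful notification
});

app.listen(5001, () => console.log('Alert webhook listening on :5001'));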

Prevention

  • Test notifications - Regular notification tests
  • Multiple channels - Email + Slack + webhook
  • Fallback receivers - Backup notification method
  • Documentation - Document notification setup

Routing Errors

Severity: 🟡 Medium

Symptoms

Alerts going to wrong receiver or being silenced incorrectly.

Solutions

Solution 1: Check routing rules

In configs/alertmanager/alertmanager.yml:

route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'

Solution 2: Test routing

# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used

Solution 3: View active silences

In Alertmanager UI (localhost:9093):

  1. Click "Silences"
  2. Check if alert is silenced
  3. Expire or delete silence if wrong

Solution 4: Check inhibition rules

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
# Critical alerts inhibit warnings for same instance

Prevention

  • Clear routing logic - Simple, understandable rules
  • Test routing - Test before deploying
  • Documentation - Document routing rules
  • Regular review - Review silences and inhibitions

Metrics Issues

Missing Metrics

Severity: 🟡 Medium

Symptoms

Expected metric not appearing in Prometheus or Grafana.

Solutions

Solution 1: Check metric is registered

In API code (api/src/utils/metrics.ts):

import { Counter, register } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter);  // Must register!

Solution 2: Check metric is collected

# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42

Solution 3: Check scrape config

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:  # Relabel rules here can silently drop metrics
      - source_labels: [__name__]
        regex: 'cm_.*|http_.*'  # A keep rule drops every metric that does not match
        action: keep

Solution 4: Verify metric type

// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc();  // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Current queue depth' });
gauge.set(42);  // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration in seconds' });
histogram.observe(0.5);  // Record duration

Prevention

  • Register all metrics - Don't forget register.registerMetric()
  • Test endpoint - Check /metrics shows metric
  • Naming convention - Use cm_* prefix for custom metrics
  • Documentation - Document all custom metrics

Incorrect Values

Severity: 🟡 Medium

Symptoms

Metric showing wrong or unexpected values.

Solutions

Solution 1: Check metric logic

// Wrong - gauge not updated
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total number of users' });
// Never set, always 0

// Right - gauge updated
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});

Solution 2: Check metric type

// Wrong - using Counter for value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.inc(50);  // Add 50
queueSize.inc(-20);  // Try to subtract 20 - ERROR!

// Right - use Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.set(50);  // Set to 50
queueSize.set(30);  // Set to 30 (can decrease)

Solution 3: Check label values

// Labels must match exactly
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 });  // Wrong - number not string
// Creates separate metric: requests_total{method="GET",status=200} 1

Solution 4: Check query aggregation

# Wrong - sums across all labels
sum(cm_requests_total)

# Right - sum by specific label
sum by (status) (cm_requests_total)

Prevention

  • Correct metric type - Counter vs Gauge vs Histogram
  • Type consistency - Label values always same type
  • Testing - Test metric values with sample data
  • Validation - Validate metric values are reasonable

Stale Metrics

Severity: 🟢 Low

Symptoms

Metric values not updating, showing old data.

Solutions

Solution 1: Check collection frequency

// Metrics only updated when scraped
const gauge = new Gauge({
  name: 'cm_queue_size',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  }
});

Solution 2: Force metric update

// Update metric on event, not just scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});

Solution 3: Check scrape interval

In configs/prometheus/prometheus.yml:

global:
  scrape_interval: 15s  # Scrape every 15 seconds

# Decrease the interval for more frequent updates
global:
  scrape_interval: 5s  # Scrape every 5 seconds

Prevention

  • Appropriate intervals - Balance freshness vs overhead
  • Event-driven updates - Update on change, not just scrape
  • Cache expensive metrics - Don't query DB every scrape
  • Staleness markers - Set metrics to NaN when stale (see the sketch below)
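
For the staleness-marker point, one way to do it in a collect() callback (getQueueSize is a hypothetical stand-in for the real queue lookup):

import { Gauge } from 'prom-client';

// Hypothetical stand-in for the real lookup (e.g. a Redis LLEN call)
async function getQueueSize(): Promise<number> {
  return 0;
}

export const queueSizeGauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue depth (NaN while the backing store is unreachable)',
  async collect() {
    try {
      this.set(await getQueueSize());
    } catch {
      this.set(NaN);  // NaN is a valid sample value; panels show a gap instead of a stale number
    }
  },
});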

Performance Issues

High Memory Usage

Severity: 🟠 High

Symptoms

Prometheus container using excessive memory (multiple GB).

Solutions

Solution 1: Reduce retention period

In docker-compose.yml:

prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'  # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB'  # Add size limit

Restart:

docker compose --profile monitoring restart prometheus

Solution 2: Reduce metric cardinality

// Bad - creates metric per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - creates metric per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});

Solution 3: Drop unnecessary metrics

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Drop Go/process metrics
        action: drop

Solution 4: Increase memory limit

prometheus:
  deploy:
    resources:
      limits:
        memory: 4G  # Increase from 2G

Prevention

  • Low cardinality - Avoid high-cardinality labels
  • Appropriate retention - 7-30 days is usually enough
  • Regular cleanup - Drop unused metrics
  • Monitor memory - Alert on high usage

Slow Queries

Severity: 🟡 Medium

Symptoms

Grafana dashboards slow to load. Queries taking 10+ seconds.

Solutions

Solution 1: Optimize query

# Slow - calculates rate for all time
rate(cm_requests_total[1y])

# Fast - only last 5 minutes
rate(cm_requests_total[5m])

# Expensive when there are many time series
sum(rate(cm_requests_total[5m]))

# Roughly the same cost - the real speedup comes from a recording rule (Solution 2)
sum(increase(cm_requests_total[5m])) / 300

Solution 2: Use recording rules

In configs/prometheus/alerts.yml:

groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use in dashboard:
# job:cm_request_rate:sum  # Fast!

Solution 3: Reduce time range

In Grafana:

  • Change dashboard time range from "Last 30 days" to "Last 24 hours"
  • Queries are faster with less data

Solution 4: Increase Prometheus resources

prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0'  # More CPU for queries
        memory: 4G

Prevention

  • Efficient queries - Keep queries simple
  • Recording rules - Pre-calculate expensive queries
  • Appropriate time ranges - Don't query months of data
  • Indexing - Prometheus auto-indexes, but cardinality affects performance

Useful Commands

Prometheus Operations

# Check targets
curl http://localhost:9090/api/v1/targets

# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

Grafana Operations

# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl http://admin:admin@localhost:3001/api/search?type=dash-db

# Export dashboard
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID | jq '{dashboard: (.dashboard | .id = null), overwrite: true}' > dashboard.json

# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json

Alertmanager Operations

# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'

Last Updated: February 2026 Version: V2.0 Status: Complete