Monitoring and Observability Issues

This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

Overview

Monitoring Stack

Changemaker Lite V2 uses profile-based monitoring (optional):

# Start with monitoring
docker compose --profile monitoring up -d

Components:

  • Prometheus - Metrics collection and storage (port 9090)
  • Grafana - Metrics visualization (port 3001)
  • Alertmanager - Alert routing and notification (port 9093)
  • cAdvisor - Container metrics (port 8080)
  • Node Exporter - Host metrics (port 9100)
  • Redis Exporter - Redis metrics (port 9121)

Custom Metrics

12 custom cm_* Prometheus metrics:

  1. cm_api_uptime_seconds - API uptime
  2. cm_database_uptime_seconds - Database uptime
  3. cm_email_queue_size - Email queue depth
  4. cm_geocoding_queue_size - Geocoding queue depth
  5. cm_users_total - Total users
  6. cm_campaigns_total - Total campaigns
  7. cm_locations_total - Total locations
  8. cm_geocoded_locations_total - Geocoded locations
  9. cm_active_canvass_sessions - Active sessions
  10. cm_external_service_up - Service health (0/1)
  11. cm_listmonk_subscribers_total - Listmonk subscribers
  12. cm_media_videos_total - Total videos

Plus standard HTTP metrics:

  • http_request_duration_seconds
  • http_requests_total

Prometheus Not Scraping

Target Down

Severity: 🔴 Critical

Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

Target: api (http://api:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused

No data in Grafana dashboards.

Common Causes

  1. Service not running - API container stopped
  2. Metrics endpoint missing - /metrics endpoint not registered
  3. Network issue - Prometheus can't reach service
  4. Authentication required - Metrics endpoint requires auth

Solutions

Solution 1: Check service is running

# Is API running?
docker compose ps api

# Should show "Up"
# If not:
docker compose up -d api

Solution 2: Test metrics endpoint

# From host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics

Solution 3: Check Prometheus config

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']  # Use service name, not localhost

Solution 4: Verify network

# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode

# Should both show "changemaker-lite"

Solution 5: Check metrics are registered

In API logs:

docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
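
If those lines are missing, compare server.ts against a minimal wiring sketch like this one (prom-client with Express; collectDefaultMetrics and the exact log text are illustrative assumptions):

import express from 'express';
import { collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics();  // adds standard process_* and nodejs_* metrics to the default registry

const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(4000, () => console.log('Metrics endpoint registered at /metrics'));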

Prevention

  • Health checks - Monitor Prometheus target health
  • Service dependencies - Ensure services start in order
  • Network config - Use Docker service names
  • Testing - Test /metrics endpoint on deploy (see the sketch below)
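
For the deploy-time test, a small script along these lines works (hypothetical scripts/check-metrics.ts; assumes Node 18+ for the global fetch and an ESM runner such as tsx):

// scripts/check-metrics.ts (hypothetical) - fail the deploy if /metrics is missing or empty
const res = await fetch('http://localhost:4000/metrics');
if (!res.ok) throw new Error(`/metrics returned HTTP ${res.status}`);

const body = await res.text();
if (!body.includes('cm_api_uptime_seconds')) {
  throw new Error('cm_api_uptime_seconds not found in /metrics output');
}
console.log('metrics endpoint OK');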

Scrape Timeout

Severity: 🟡 Medium

Symptoms

Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded

Scrapes taking too long or timing out.

Solutions

Solution 1: Increase scrape timeout

In configs/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  scrape_timeout: 15s  # Raised from the 10s default (cannot exceed scrape_interval)

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s  # Scrape less frequently
    scrape_timeout: 20s
    static_configs:
      - targets: ['api:4000']

Reload config:

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus

Solution 2: Optimize metrics generation

// In api/src/utils/metrics.ts
// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));

Solution 3: Reduce metric cardinality

// Bad - high cardinality (creates metric per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']  // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']  // Only 5 roles
});

Prevention

  • Cache expensive metrics - Don't query DB on every scrape
  • Reasonable timeouts - keep scrape timeouts in the 10-30s range
  • Low cardinality - Avoid high-cardinality labels
  • Optimize queries - Fast metric queries

Authentication Errors

Severity: 🟡 Medium

Symptoms

Error: 401 Unauthorized when scraping /metrics

Solutions

Changemaker Lite V2 metrics endpoint is public (no auth required).

If you see auth errors:

Solution 1: Remove auth middleware from /metrics

In api/src/server.ts:

// Metrics endpoint should be BEFORE authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);

Solution 2: Configure basic auth in Prometheus

If you DO want to protect /metrics:

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
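
Protecting the endpoint also means enforcing the same credentials on the API side, which Changemaker Lite V2 does not do by default. A sketch of what that could look like (Express; METRICS_USER and METRICS_PASSWORD are hypothetical env vars that must match the basic_auth block above):

import express from 'express';
import { register } from 'prom-client';

const app = express();  // in practice, the existing app from api/src/server.ts

const USER = process.env.METRICS_USER ?? 'prometheus';
const PASS = process.env.METRICS_PASSWORD ?? '';
const expected = 'Basic ' + Buffer.from(`${USER}:${PASS}`).toString('base64');

app.get('/metrics', async (req, res) => {
  if (req.headers.authorization !== expected) {
    res.set('WWW-Authenticate', 'Basic realm="metrics"');
    return res.status(401).end();
  }
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});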

Prevention

  • Public metrics - Keep /metrics public for simplicity
  • Network isolation - Use Docker networks for security
  • IP whitelist - Only allow Prometheus IP

Grafana Issues

Dashboards Not Loading

Severity: 🟠 High

Symptoms

Grafana shows blank dashboards or "No data" panels.

Solutions

Solution 1: Check Grafana is running

docker compose --profile monitoring ps grafana

# Should show "Up"
# If not:
docker compose --profile monitoring up -d grafana

Solution 2: Verify Prometheus datasource

  1. Open Grafana: http://localhost:3001
  2. Login (admin/admin)
  3. Settings → Data Sources
  4. Click Prometheus
  5. URL should be: http://prometheus:9090
  6. Click "Save & Test"
  7. Should show "Data source is working"

Solution 3: Check dashboard provisioning

# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json

Solution 4: Import dashboard manually

If auto-provisioning fails:

  1. Grafana → Dashboards → Import
  2. Upload JSON from configs/grafana/dashboards/
  3. Select Prometheus datasource
  4. Click Import

Solution 5: Check for data

# Test query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

Prevention

  • Dashboard versioning - Keep dashboards in git
  • Auto-provisioning - Use provisioning instead of manual import
  • Testing - Test dashboards after changes
  • Documentation - Document dashboard variables

Datasource Errors

Severity: 🟠 High

Symptoms

Error: Failed to query Prometheus
Error: connection refused

Red error bars on Grafana panels.

Solutions

Solution 1: Test Prometheus connection

# From Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}

Solution 2: Check Prometheus is running

docker compose --profile monitoring ps prometheus

# Should show "Up"

Solution 3: Verify datasource URL

In Grafana datasource settings:

  • URL: http://prometheus:9090 (NOT http://localhost:9090)
  • Access: Server (NOT Browser)

Solution 4: Check Docker network

# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode

Prevention

  • Health checks - Monitor datasource health
  • Service dependencies - Start Prometheus before Grafana
  • Error handling - Graceful error messages

Query Errors

Severity: 🟡 Medium

Symptoms

Error executing query: parse error at char X: unexpected identifier

Panel shows "Error loading data".

Solutions

Solution 1: Validate PromQL syntax

Common errors:

# Bad - label value not quoted
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - wrong function
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)

Solution 2: Test query in Explore

  1. Grafana → Explore
  2. Enter query
  3. Run
  4. Fix errors before adding to dashboard

Solution 3: Check metric exists

# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_

Solution 4: Use metric browser

In Grafana query editor:

  1. Click "Metrics" button
  2. Browse available metrics
  3. Select metric (auto-fills query)

Prevention

  • Query validation - Validate before saving
  • Testing - Test queries in Explore
  • Documentation - Document available metrics
  • Examples - Provide query examples

Alertmanager Issues

Alerts Not Firing

Severity: 🟠 High

Symptoms

Conditions met but alert not triggering.

Solutions

Solution 1: Check alert rules

In Prometheus UI (localhost:9090):

  1. Click "Alerts"
  2. Find your alert
  3. Check state:
    • Inactive: Condition not met
    • Pending: Condition met, waiting out the for: duration
    • Firing: Alert active

Solution 2: Verify alert rule syntax

In configs/prometheus/alerts.yml:

groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m  # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"

Solution 3: Check Alertmanager config

# Test Alertmanager
curl http://localhost:9093/api/v1/alerts

# Should return alert list

Solution 4: View Prometheus logs

docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing

Solution 5: Reload alert rules

# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check rules loaded
curl http://localhost:9090/api/v1/rules

Prevention

  • Test alert conditions - Trigger manually to test
  • Reasonable thresholds - Not too sensitive or too lenient
  • Documentation - Document alert thresholds
  • Regular review - Review alert effectiveness

Notifications Not Sent

Severity: 🟡 Medium

Symptoms

Alert firing in Prometheus but no notification received.

Solutions

Solution 1: Check Alertmanager config

In configs/alertmanager/alertmanager.yml:

route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'

Solution 2: Test Alertmanager notification

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "Test",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }]'

# Check if notification sent
docker compose logs alertmanager | grep -i "notification\|email"

Solution 3: Check SMTP config

See Email Issues for SMTP troubleshooting.

Solution 4: Use alternative notification channels

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
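
Alertmanager POSTs a JSON body with an alerts array to the webhook URL. A minimal sketch of a receiving endpoint (Express; the port and route are placeholders):

import express from 'express';

const app = express();
app.use(express.json());

// Alertmanager sends { status, alerts: [{ status, labels, annotations, startsAt, ... }], ... }
app.post('/alerts', (req, res) => {
  for (const alert of req.body.alerts ?? []) {
    console.log(`[${alert.status}] ${alert.labels?.alertname}: ${alert.annotations?.summary ?? ''}`);
  }
  res.status(200).end();  // any 2xx response counts as a successful notification
});

app.listen(5001, () => console.log('Alert webhook listening on :5001'));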

Prevention

  • Test notifications - Regular notification tests
  • Multiple channels - Email + Slack + webhook
  • Fallback receivers - Backup notification method
  • Documentation - Document notification setup

Routing Errors

Severity: 🟡 Medium

Symptoms

Alerts going to wrong receiver or being silenced incorrectly.

Solutions

Solution 1: Check routing rules

In configs/alertmanager/alertmanager.yml:

route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'

Solution 2: Test routing

# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used

Solution 3: View active silences

In Alertmanager UI (localhost:9093):

  1. Click "Silences"
  2. Check if alert is silenced
  3. Expire or delete silence if wrong

Solution 4: Check inhibition rules

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
# Critical alerts inhibit warnings for same instance

Prevention

  • Clear routing logic - Simple, understandable rules
  • Test routing - Test before deploying
  • Documentation - Document routing rules
  • Regular review - Review silences and inhibitions

Metrics Issues

Missing Metrics

Severity: 🟡 Medium

Symptoms

Expected metric not appearing in Prometheus or Grafana.

Solutions

Solution 1: Check metric is registered

In API code (api/src/utils/metrics.ts):

import { Counter, register } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter);  // Must register!

Solution 2: Check metric is collected

# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42

Solution 3: Check scrape config

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:  # Relabel rules here can silently drop metrics
      - source_labels: [__name__]
        regex: 'cm_.*|http_.*'  # A keep rule drops every metric that does not match
        action: keep

Solution 4: Verify metric type

// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc();  // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Current queue depth' });
gauge.set(42);  // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration in seconds' });
histogram.observe(0.5);  // Record duration

Prevention

  • Register all metrics - Don't forget register.registerMetric()
  • Test endpoint - Check /metrics shows metric
  • Naming convention - Use cm_* prefix for custom metrics
  • Documentation - Document all custom metrics

Incorrect Values

Severity: 🟡 Medium

Symptoms

Metric showing wrong or unexpected values.

Solutions

Solution 1: Check metric logic

// Wrong - gauge not updated
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total number of users' });
// Never set, always 0

// Right - gauge updated
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});

Solution 2: Check metric type

// Wrong - using Counter for value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.inc(50);  // Add 50
queueSize.inc(-20);  // Try to subtract 20 - ERROR!

// Right - use Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.set(50);  // Set to 50
queueSize.set(30);  // Set to 30 (can decrease)

Solution 3: Check label values

// Labels must match exactly
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 });  // Wrong - number not string
// Creates separate metric: requests_total{method="GET",status=200} 1

Solution 4: Check query aggregation

# Wrong - sums across all labels
sum(cm_requests_total)

# Right - sum by specific label
sum by (status) (cm_requests_total)

Prevention

  • Correct metric type - Counter vs Gauge vs Histogram
  • Type consistency - Label values always same type
  • Testing - Test metric values with sample data
  • Validation - Validate metric values are reasonable

Stale Metrics

Severity: 🟢 Low

Symptoms

Metric values not updating, showing old data.

Solutions

Solution 1: Check collection frequency

// Metrics only updated when scraped
const gauge = new Gauge({
  name: 'cm_queue_size',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  }
});

Solution 2: Force metric update

// Update metric on event, not just scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});

Solution 3: Check scrape interval

In configs/prometheus/prometheus.yml:

global:
  scrape_interval: 15s  # Scrape every 15 seconds

# Decrease the interval for more frequent updates
global:
  scrape_interval: 5s  # Scrape every 5 seconds

Prevention

  • Appropriate intervals - Balance freshness vs overhead
  • Event-driven updates - Update on change, not just scrape
  • Cache expensive metrics - Don't query DB every scrape
  • Staleness markers - Set metrics to NaN when stale (see the sketch below)
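
For the staleness-marker point, one way to do it in a collect() callback (getQueueSize is a hypothetical stand-in for the real queue lookup):

import { Gauge } from 'prom-client';

// Hypothetical stand-in for the real lookup (e.g. a Redis LLEN call)
async function getQueueSize(): Promise<number> {
  return 0;
}

export const queueSizeGauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue depth (NaN while the backing store is unreachable)',
  async collect() {
    try {
      this.set(await getQueueSize());
    } catch {
      this.set(NaN);  // NaN is a valid sample value; panels show a gap instead of a stale number
    }
  },
});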

Performance Issues

High Memory Usage

Severity: 🟠 High

Symptoms

Prometheus container using excessive memory (multiple GB).

Solutions

Solution 1: Reduce retention period

In docker-compose.yml:

prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'  # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB'  # Add size limit

Restart:

docker compose --profile monitoring restart prometheus

Solution 2: Reduce metric cardinality

// Bad - creates metric per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - creates metric per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});

Solution 3: Drop unnecessary metrics

In configs/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Drop Go/process metrics
        action: drop

Solution 4: Increase memory limit

prometheus:
  deploy:
    resources:
      limits:
        memory: 4G  # Increase from 2G

Prevention

  • Low cardinality - Avoid high-cardinality labels
  • Appropriate retention - 7-30 days is usually enough
  • Regular cleanup - Drop unused metrics
  • Monitor memory - Alert on high usage

Slow Queries

Severity: 🟡 Medium

Symptoms

Grafana dashboards slow to load. Queries taking 10+ seconds.

Solutions

Solution 1: Optimize query

# Slow - calculates rate for all time
rate(cm_requests_total[1y])

# Fast - only last 5 minutes
rate(cm_requests_total[5m])

# Expensive when there are many time series
sum(rate(cm_requests_total[5m]))

# Roughly the same cost - the real speedup comes from a recording rule (Solution 2)
sum(increase(cm_requests_total[5m])) / 300

Solution 2: Use recording rules

In configs/prometheus/alerts.yml:

groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use in dashboard:
# job:cm_request_rate:sum  # Fast!

Solution 3: Reduce time range

In Grafana:

  • Change dashboard time range from "Last 30 days" to "Last 24 hours"
  • Queries are faster with less data

Solution 4: Increase Prometheus resources

prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0'  # More CPU for queries
        memory: 4G

Prevention

  • Efficient queries - Keep queries simple
  • Recording rules - Pre-calculate expensive queries
  • Appropriate time ranges - Don't query months of data
  • Indexing - Prometheus auto-indexes, but cardinality affects performance

Useful Commands

Prometheus Operations

# Check targets
curl http://localhost:9090/api/v1/targets

# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml

Grafana Operations

# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl http://admin:admin@localhost:3001/api/search?type=dash-db

# Export dashboard
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID | jq '{dashboard: (.dashboard | .id = null), overwrite: true}' > dashboard.json

# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json

Alertmanager Operations

# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'

Last Updated: February 2026 Version: V2.0 Status: Complete