# Monitoring and Observability Issues

This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

## Overview

### Monitoring Stack

Changemaker Lite V2 uses profile-based monitoring (optional):

```bash
# Start with monitoring
docker compose --profile monitoring up -d
```

Components:

- Prometheus - Metrics collection and storage (port 9090)
- Grafana - Metrics visualization (port 3001)
- Alertmanager - Alert routing and notification (port 9093)
- cAdvisor - Container metrics (port 8080)
- Node Exporter - Host metrics (port 9100)
- Redis Exporter - Redis metrics (port 9121)

### Custom Metrics

12 custom `cm_*` Prometheus metrics:

- `cm_api_uptime_seconds` - API uptime
- `cm_database_uptime_seconds` - Database uptime
- `cm_email_queue_size` - Email queue depth
- `cm_geocoding_queue_size` - Geocoding queue depth
- `cm_users_total` - Total users
- `cm_campaigns_total` - Total campaigns
- `cm_locations_total` - Total locations
- `cm_geocoded_locations_total` - Geocoded locations
- `cm_active_canvass_sessions` - Active sessions
- `cm_external_service_up` - Service health (0/1)
- `cm_listmonk_subscribers_total` - Listmonk subscribers
- `cm_media_videos_total` - Total videos

Plus standard HTTP metrics:

- `http_request_duration_seconds`
- `http_requests_total`
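For orientation in the troubleshooting sections below, here is a hedged sketch of how two of these metrics might be defined with prom-client; the actual definitions live in api/src/utils/metrics.ts and may differ (the label set here is illustrative):

```typescript
// Minimal sketch, assuming prom-client's default registry
import { Gauge, Histogram, register } from 'prom-client';

// Gauge: a current value that can rise and fall
const emailQueueSize = new Gauge({
  name: 'cm_email_queue_size',
  help: 'Email queue depth',
});

// Histogram: distribution of request durations
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'], // illustrative label set
});

// Updated from application code
emailQueueSize.set(12);
httpDuration.observe({ method: 'GET', route: '/api/users', status: '200' }, 0.087);

// register.metrics() renders all of this for the /metrics endpoint
export { register };
```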
## Prometheus Not Scraping

### Target Down

**Severity:** 🔴 Critical

#### Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

```text
Target: api (localhost:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused
```

No data in Grafana dashboards.

#### Common Causes

- Service not running - API container stopped
- Metrics endpoint missing - /metrics endpoint not registered
- Network issue - Prometheus can't reach the service
- Authentication required - Metrics endpoint requires auth

#### Solutions

**Solution 1: Check service is running**

```bash
# Is API running?
docker compose ps api
# Should show "Up"

# If not:
docker compose up -d api
```

**Solution 2: Test metrics endpoint**

```bash
# From host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From the Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics
```
**Solution 3: Check Prometheus config**

In configs/prometheus/prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000'] # Use service name, not localhost
```

**Solution 4: Verify network**

```bash
# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode
# Should both show "changemaker-lite"
```

**Solution 5: Check metrics are registered**

In API logs:

```bash
docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
```

#### Prevention

- Health checks - Monitor Prometheus target health
- Service dependencies - Ensure services start in order
- Network config - Use Docker service names
- Testing - Test /metrics endpoint on deploy (see the sketch below)
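For the testing tip above, a minimal sketch of a deploy-time smoke test against /metrics; the URL and the cm_ prefix come from this guide, the rest is illustrative (Node 18+ for the built-in fetch):

```typescript
// smoke-metrics.ts - hypothetical deploy-time check for the /metrics endpoint
const METRICS_URL = process.env.METRICS_URL ?? 'http://localhost:4000/metrics';

async function main(): Promise<void> {
  const res = await fetch(METRICS_URL);
  if (!res.ok) {
    throw new Error(`GET ${METRICS_URL} returned ${res.status}`);
  }
  const body = await res.text();
  if (!body.includes('cm_')) {
    throw new Error('No cm_* metrics found - are custom metrics registered?');
  }
  console.log('Metrics endpoint OK');
}

main().catch((err) => {
  console.error(err);
  process.exit(1); // Non-zero exit fails the deploy step
});
```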
### Scrape Timeout

**Severity:** 🟡 Medium

#### Symptoms

```text
Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded
```

Scrapes taking too long or timing out.

#### Solutions
**Solution 1: Increase scrape timeout**

In configs/prometheus/prometheus.yml:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 15s     # Raised from 10s; must not exceed scrape_interval

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s  # Scrape less frequently
    scrape_timeout: 20s   # Per-job override for the slow endpoint
    static_configs:
      - targets: ['api:4000']
```

Reload config:

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus
```
**Solution 2: Optimize metrics generation**

```typescript
// In api/src/utils/metrics.ts
import { Gauge, register } from 'prom-client';
import { prisma } from '../db'; // assumed Prisma client export

// Cache expensive metrics instead of hitting the DB on every scrape
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query the database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  },
}));
```
**Solution 3: Reduce metric cardinality**

```typescript
// Bad - high cardinality (creates a time series per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId'], // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role'], // Only 5 roles
});
```
#### Prevention

- Cache expensive metrics - Don't query the DB on every scrape
- Reasonable timeouts - 10-30s timeouts
- Low cardinality - Avoid high-cardinality labels
- Optimize queries - Keep metric collection fast
### Authentication Errors

**Severity:** 🟡 Medium

#### Symptoms

```text
Error: 401 Unauthorized when scraping /metrics
```

#### Solutions

The Changemaker Lite V2 metrics endpoint is public (no auth required). If you see auth errors:

**Solution 1: Remove auth middleware from /metrics**

In api/src/server.ts:

```typescript
// The metrics endpoint must be registered BEFORE the authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);
```
**Solution 2: Configure basic auth in Prometheus**

If you DO want to protect /metrics, in configs/prometheus/prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```
#### Prevention

- Public metrics - Keep /metrics public for simplicity
- Network isolation - Use Docker networks for security
- IP whitelist - Only allow the Prometheus IP (see the sketch below)
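If you take the IP-whitelist route, here is a hedged sketch of an Express guard in front of /metrics; the allowlisted addresses are placeholders you would replace with your Prometheus container's address:

```typescript
// metrics-allowlist.ts - hypothetical guard; adjust the allowlist to your network
import express, { Request, Response, NextFunction } from 'express';
import { register } from 'prom-client';

const app = express();

// Addresses allowed to scrape; add your Prometheus container's address
const ALLOWED_IPS = new Set(['127.0.0.1', '::1', '::ffff:127.0.0.1']);

function metricsAllowlist(req: Request, res: Response, next: NextFunction): void {
  // req.ip honours trust-proxy settings; inside Docker this is the container address
  if (req.ip && ALLOWED_IPS.has(req.ip)) {
    next();
  } else {
    res.status(403).send('Forbidden');
  }
}

app.get('/metrics', metricsAllowlist, async (_req: Request, res: Response) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```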
## Grafana Issues

### Dashboards Not Loading

**Severity:** 🟠 High

#### Symptoms

Grafana shows blank dashboards or "No data" panels.

#### Solutions

**Solution 1: Check Grafana is running**

```bash
docker compose --profile monitoring ps grafana
# Should show "Up"

# If not:
docker compose --profile monitoring up -d grafana
```

**Solution 2: Verify Prometheus datasource**

1. Open Grafana: http://localhost:3001
2. Login (admin/admin)
3. Settings → Data Sources
4. Click Prometheus
5. URL should be: http://prometheus:9090
6. Click "Save & Test"
7. Should show "Data source is working"

**Solution 3: Check dashboard provisioning**

```bash
# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json
```

**Solution 4: Import dashboard manually**

If auto-provisioning fails:

1. Grafana → Dashboards → Import
2. Upload JSON from configs/grafana/dashboards/
3. Select Prometheus datasource
4. Click Import

**Solution 5: Check for data**

```bash
# Test query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
```
#### Prevention

- Dashboard versioning - Keep dashboards in git (see the export sketch below)
- Auto-provisioning - Use provisioning instead of manual import
- Testing - Test dashboards after changes
- Documentation - Document dashboard variables
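For the dashboard-versioning tip above, a hedged sketch that exports every dashboard to JSON via the same Grafana API endpoints shown under Useful Commands; the credentials and output directory are assumptions:

```typescript
// export-dashboards.ts - hypothetical backup of Grafana dashboards to JSON files
import { writeFile, mkdir } from 'node:fs/promises';

const GRAFANA = 'http://localhost:3001';
const AUTH = 'Basic ' + Buffer.from('admin:admin').toString('base64');
const OUT_DIR = 'configs/grafana/dashboards'; // assumed target directory

async function getJson(path: string): Promise<any> {
  const res = await fetch(`${GRAFANA}${path}`, { headers: { Authorization: AUTH } });
  if (!res.ok) throw new Error(`${path}: HTTP ${res.status}`);
  return res.json();
}

async function main(): Promise<void> {
  await mkdir(OUT_DIR, { recursive: true });
  // Same search endpoint as the curl example in Useful Commands
  const dashboards = await getJson('/api/search?type=dash-db');
  for (const d of dashboards) {
    const full = await getJson(`/api/dashboards/uid/${d.uid}`);
    const file = `${OUT_DIR}/${d.uid}.json`;
    await writeFile(file, JSON.stringify(full.dashboard, null, 2));
    console.log(`Exported ${d.title} -> ${file}`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```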
### Datasource Errors

**Severity:** 🟠 High

#### Symptoms

```text
Error: Failed to query Prometheus
Error: connection refused
```

Red error bars on Grafana panels.

#### Solutions

**Solution 1: Test Prometheus connection**

```bash
# From the Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}
```

**Solution 2: Check Prometheus is running**

```bash
docker compose --profile monitoring ps prometheus
# Should show "Up"
```

**Solution 3: Verify datasource URL**

In the Grafana datasource settings:

- URL: http://prometheus:9090 (NOT http://localhost:9090)
- Access: Server (NOT Browser)

**Solution 4: Check Docker network**

```bash
# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
```
#### Prevention

- Health checks - Monitor datasource health (see the sketch below)
- Service dependencies - Start Prometheus before Grafana
- Error handling - Graceful error messages
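For the health-check tip above, a hedged sketch of a datasource probe using the same Grafana health endpoint as the curl example under Useful Commands (datasource ID 1 and admin:admin are assumptions):

```typescript
// check-datasource.ts - hypothetical health probe for the Grafana datasource
const GRAFANA = 'http://localhost:3001';
const AUTH = 'Basic ' + Buffer.from('admin:admin').toString('base64');

async function main(): Promise<void> {
  // Same endpoint as: curl http://admin:admin@localhost:3001/api/datasources/1/health
  const res = await fetch(`${GRAFANA}/api/datasources/1/health`, {
    headers: { Authorization: AUTH },
  });
  const body = await res.json();
  // The health endpoint reports a status/message pair
  if (!res.ok || body.status !== 'OK') {
    throw new Error(`Datasource unhealthy: HTTP ${res.status} ${JSON.stringify(body)}`);
  }
  console.log('Data source is working');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```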
### Query Errors

**Severity:** 🟡 Medium

#### Symptoms

```text
Error executing query: parse error at char X: unexpected identifier
```

Panel shows "Error loading data".

#### Solutions

**Solution 1: Validate PromQL syntax**

Common errors:

```promql
# Bad - label value not quoted
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - wrong function name
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)
```
**Solution 2: Test query in Explore**

1. Grafana → Explore
2. Enter query
3. Run
4. Fix errors before adding to the dashboard

**Solution 3: Check metric exists**

```bash
# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for a metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_
```

**Solution 4: Use metric browser**

In the Grafana query editor:

1. Click the "Metrics" button
2. Browse available metrics
3. Select a metric (auto-fills the query)
#### Prevention

- Query validation - Validate before saving (see the sketch below)
- Testing - Test queries in Explore
- Documentation - Document available metrics
- Examples - Provide query examples
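For the query-validation tip above, a hedged sketch that runs candidate queries against the Prometheus HTTP API and surfaces parse errors (Prometheus returns status "error" with a message for bad PromQL); the query list is illustrative:

```typescript
// validate-queries.ts - hypothetical PromQL check against the Prometheus HTTP API
const PROMETHEUS = 'http://localhost:9090';

// Queries you plan to put on dashboards (examples only)
const QUERIES = [
  'cm_api_uptime_seconds{job="api"}',
  'avg(cm_api_uptime_seconds)',
];

async function validate(query: string): Promise<void> {
  const url = `${PROMETHEUS}/api/v1/query?query=${encodeURIComponent(query)}`;
  const body = await (await fetch(url)).json();
  if (body.status !== 'success') {
    // e.g. errorType "bad_data" with a parse error message
    throw new Error(`${query}: ${body.errorType} - ${body.error}`);
  }
  console.log(`OK: ${query}`);
}

async function main(): Promise<void> {
  for (const q of QUERIES) await validate(q);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```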
## Alertmanager Issues

### Alerts Not Firing

**Severity:** 🟠 High

#### Symptoms

Conditions met but alert not triggering.

#### Solutions

**Solution 1: Check alert rules**

In Prometheus UI (localhost:9090):

1. Click "Alerts"
2. Find your alert
3. Check state:
   - Inactive: Condition not met
   - Pending: Condition met, waiting out the `for:` duration
   - Firing: Alert active
**Solution 2: Verify alert rule syntax**

In configs/prometheus/alerts.yml:

```yaml
groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"
```

**Solution 3: Check Alertmanager config**

```bash
# Test Alertmanager
curl http://localhost:9093/api/v1/alerts
# Should return alert list
```

**Solution 4: View Prometheus logs**

```bash
docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing
```

**Solution 5: Reload alert rules**

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check rules loaded
curl http://localhost:9090/api/v1/rules
```
#### Prevention

- Test alert conditions - Trigger manually to test
- Reasonable thresholds - Not too sensitive or too lenient
- Documentation - Document alert thresholds
- Regular review - Review alert effectiveness
### Notifications Not Sent

**Severity:** 🟡 Medium

#### Symptoms

Alert firing in Prometheus but no notification received.

#### Solutions

**Solution 1: Check Alertmanager config**

In configs/alertmanager/alertmanager.yml:

```yaml
route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'
```
**Solution 2: Test Alertmanager notification**

```bash
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "Test",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }]'

# Check if the notification was sent
docker compose logs alertmanager | grep -i "notification\|email"
```

**Solution 3: Check SMTP config**

See Email Issues for SMTP troubleshooting.
**Solution 4: Use alternative notification channels**

```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
```
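If you use the webhook receiver above, the target needs to accept Alertmanager's JSON payload. A minimal sketch, assuming Express; the port and route are placeholders:

```typescript
// webhook-receiver.ts - hypothetical endpoint for the webhook_configs target above
import express from 'express';

const app = express();
app.use(express.json());

// Alertmanager POSTs a JSON body whose "alerts" array holds firing/resolved alerts
app.post('/alerts', (req, res) => {
  for (const alert of req.body.alerts ?? []) {
    console.log(`[${alert.status}] ${alert.labels?.alertname}: ${alert.annotations?.summary}`);
  }
  res.sendStatus(200); // A 2xx response tells Alertmanager the delivery succeeded
});

app.listen(5001, () => console.log('Webhook receiver listening on :5001'));
```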
#### Prevention

- Test notifications - Regular notification tests
- Multiple channels - Email + Slack + webhook
- Fallback receivers - Backup notification method
- Documentation - Document notification setup
### Routing Errors

**Severity:** 🟡 Medium

#### Symptoms

Alerts going to the wrong receiver or being silenced incorrectly.

#### Solutions

**Solution 1: Check routing rules**

In configs/alertmanager/alertmanager.yml:

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'
```

**Solution 2: Test routing**

```bash
# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used
```

**Solution 3: View active silences**

In Alertmanager UI (localhost:9093):

1. Click "Silences"
2. Check if the alert is silenced
3. Expire or delete the silence if it's wrong

**Solution 4: Check inhibition rules**

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
    # Critical alerts inhibit warnings for the same instance
```
#### Prevention

- Clear routing logic - Simple, understandable rules
- Test routing - Test before deploying
- Documentation - Document routing rules
- Regular review - Review silences and inhibitions
## Metrics Issues

### Missing Metrics

**Severity:** 🟡 Medium

#### Symptoms

Expected metric not appearing in Prometheus or Grafana.

#### Solutions

**Solution 1: Check metric is registered**

In API code (api/src/utils/metrics.ts):

```typescript
import { Counter, register } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric',
});

register.registerMetric(requestCounter); // Must register!
```
**Solution 2: Check metric is collected**

```bash
# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42
```
**Solution 3: Check scrape config**

In configs/prometheus/prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs: # Don't accidentally drop the metric
      - source_labels: [__name__]
        regex: 'cm_.*' # Keep cm_* metrics
        action: keep
```

Note that `action: keep` drops every metric that does not match the regex, so make sure the pattern covers everything you still need (e.g. the `http_*` series).
**Solution 4: Verify metric type**

```typescript
// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc(); // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
gauge.set(42); // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration' });
histogram.observe(0.5); // Record duration
```
#### Prevention

- Register all metrics - Don't forget register.registerMetric()
- Test endpoint - Check /metrics shows the metric (see the test sketch below)
- Naming convention - Use the cm_* prefix for custom metrics
- Documentation - Document all custom metrics
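A hedged sketch of the "test endpoint" tip: render the registry and fail when an expected metric is missing. It assumes prom-client's default registry; swap in your own if metrics.ts uses a custom one:

```typescript
// metrics.test.ts - hypothetical registration check (assumes the default registry)
import { Counter, register } from 'prom-client';

async function assertRegistered(name: string): Promise<void> {
  const exposition = await register.metrics(); // Same text served at /metrics
  if (!exposition.includes(`# TYPE ${name}`)) {
    throw new Error(`Metric ${name} is not registered`);
  }
}

async function main(): Promise<void> {
  // Registering via the constructor uses the default registry
  new Counter({ name: 'cm_my_metric_total', help: 'Description of metric' });

  await assertRegistered('cm_my_metric_total');
  console.log('All expected metrics registered');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```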
### Incorrect Values

**Severity:** 🟡 Medium

#### Symptoms

Metric showing wrong or unexpected values.

#### Solutions

**Solution 1: Check metric logic**

```typescript
// Wrong - gauge never updated, always reports 0
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total users' });

// Right - gauge refreshed on every scrape
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  },
});
```
**Solution 2: Check metric type**

```typescript
// Wrong - using a Counter for a value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.inc(50);  // Add 50
queueSize.inc(-20); // Try to subtract 20 - throws, counters can only go up

// Right - use a Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.set(50); // Set to 50
queueSize.set(30); // Set to 30 (can decrease)
```
**Solution 3: Check label values**

```typescript
// Keep label values as strings, and keep them consistent
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status'],
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 }); // Avoid - number instead of string
// Mixing types makes series keys unpredictable across client versions
```
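One hedged way to enforce this: a tiny wrapper that stringifies label values before incrementing, so callers can't accidentally mix types (the names here are illustrative):

```typescript
// Hypothetical helper that normalizes label values to strings
import { Counter } from 'prom-client';

type RawLabels = Record<string, string | number>;

function incSafe(counter: Counter, labels: RawLabels, value = 1): void {
  const normalized: Record<string, string> = {};
  for (const [key, val] of Object.entries(labels)) {
    normalized[key] = String(val); // 200 and '200' become the same label value
  }
  counter.inc(normalized, value);
}

const requests = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status'],
});

incSafe(requests, { method: 'GET', status: 200 }); // Recorded as status="200"
```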
**Solution 4: Check query aggregation**

```promql
# Wrong - sums across all labels
sum(cm_requests_total)

# Right - sum by a specific label
sum by (status) (cm_requests_total)
```

#### Prevention

- Correct metric type - Counter vs Gauge vs Histogram
- Type consistency - Label values always the same type (strings)
- Testing - Test metric values with sample data
- Validation - Validate metric values are reasonable
### Stale Metrics

**Severity:** 🟢 Low

#### Symptoms

Metric values not updating, showing old data.

#### Solutions

**Solution 1: Check collection frequency**

```typescript
// collect()-based metrics are only refreshed when Prometheus scrapes
const gauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue size',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  },
});
```
**Solution 2: Force metric update**

```typescript
// Update the metric on the event itself, not just at scrape time
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});
```
**Solution 3: Check scrape interval**

In configs/prometheus/prometheus.yml:

```yaml
global:
  scrape_interval: 15s # Scrape every 15 seconds

# Decrease the interval for more frequent updates:
global:
  scrape_interval: 5s  # Scrape every 5 seconds
```
#### Prevention

- Appropriate intervals - Balance freshness vs overhead
- Event-driven updates - Update on change, not just on scrape
- Cache expensive metrics - Don't query the DB on every scrape
- Staleness markers - Set metrics to NaN when stale (see the sketch below)
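For the staleness-marker tip above, a hedged sketch: remember when the value was last refreshed and report NaN once it is too old, so panels show a gap instead of a frozen number (the threshold and update hook are illustrative):

```typescript
// Hypothetical staleness marker: report NaN when the data is too old
import { Gauge } from 'prom-client';

const STALE_AFTER_MS = 5 * 60 * 1000; // illustrative threshold
let lastValue = 0;
let lastUpdatedAt = 0;

// Called from application code whenever fresh data arrives
export function recordQueueSize(size: number): void {
  lastValue = size;
  lastUpdatedAt = Date.now();
}

new Gauge({
  name: 'cm_queue_size',
  help: 'Queue size (NaN when stale)',
  collect() {
    const fresh = Date.now() - lastUpdatedAt <= STALE_AFTER_MS;
    // NaN renders as "NaN" in the exposition; panels show a gap instead of stale data
    this.set(fresh ? lastValue : NaN);
  },
});
```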
## Performance Issues

### High Memory Usage

**Severity:** 🟠 High

#### Symptoms

Prometheus container using excessive memory (multiple GB).

#### Solutions

**Solution 1: Reduce retention period**

In docker-compose.yml:

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'   # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB' # Add a size limit
```

Restart:

```bash
docker compose --profile monitoring restart prometheus
```
**Solution 2: Reduce metric cardinality**

```typescript
// Bad - creates a time series per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId'],
});

// Good - creates a time series per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role'],
});
```
**Solution 3: Drop unnecessary metrics**

In configs/prometheus/prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*' # Drop Go/process metrics
        action: drop
```

**Solution 4: Increase memory limit**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G # Increase from 2G
```
#### Prevention

- Low cardinality - Avoid high-cardinality labels
- Appropriate retention - 7-30 days is usually enough
- Regular cleanup - Drop unused metrics
- Monitor memory - Alert on high usage
### Slow Queries

**Severity:** 🟡 Medium

#### Symptoms

Grafana dashboards slow to load. Queries taking 10+ seconds.

#### Solutions

**Solution 1: Optimize query**

```promql
# Slow - a one-year range vector scans enormous amounts of data per step
rate(cm_requests_total[1y])

# Fast - only the last 5 minutes
rate(cm_requests_total[5m])

# Still expensive when cm_requests_total has many time series:
sum(rate(cm_requests_total[5m]))
# Pre-compute queries like this with a recording rule (see Solution 2)
```
**Solution 2: Use recording rules**

In configs/prometheus/alerts.yml:

```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate the expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use it in the dashboard:
# job:cm_request_rate:sum   # Fast!
```
**Solution 3: Reduce time range**

In Grafana:

- Change the dashboard time range from "Last 30 days" to "Last 24 hours"
- Queries are faster with less data

**Solution 4: Increase Prometheus resources**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0' # More CPU for queries
        memory: 4G
```
#### Prevention

- Efficient queries - Keep queries simple
- Recording rules - Pre-calculate expensive queries
- Appropriate time ranges - Don't query months of data
- Cardinality - Prometheus indexes automatically, but high cardinality still slows queries
## Useful Commands

### Prometheus Operations

```bash
# Check targets
curl http://localhost:9090/api/v1/targets

# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```
### Grafana Operations

```bash
# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl 'http://admin:admin@localhost:3001/api/search?type=dash-db'

# Export dashboard
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID | jq .dashboard > dashboard.json

# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```
### Alertmanager Operations

```bash
# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'
```
## Related Documentation

### Monitoring Documentation

- Monitoring Issues - This guide
- Observability Dashboard - Using the dashboard
- Monitoring Guide - Setup and configuration

### Other Troubleshooting

- Common Errors - General errors
- Performance Optimization - Performance tuning
---

Last Updated: February 2026 · Version: V2.0 · Status: Complete