# Monitoring and Observability Issues

This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

## Overview

### Monitoring Stack

Changemaker Lite V2 uses **profile-based monitoring** (optional):

```bash
# Start with monitoring
docker compose --profile monitoring up -d
```

**Components:**

- **Prometheus** - Metrics collection and storage (port 9090)
- **Grafana** - Metrics visualization (port 3001)
- **Alertmanager** - Alert routing and notification (port 9093)
- **cAdvisor** - Container metrics (port 8080)
- **Node Exporter** - Host metrics (port 9100)
- **Redis Exporter** - Redis metrics (port 9121)

### Custom Metrics

12 custom `cm_*` Prometheus metrics:

1. `cm_api_uptime_seconds` - API uptime
2. `cm_database_uptime_seconds` - Database uptime
3. `cm_email_queue_size` - Email queue depth
4. `cm_geocoding_queue_size` - Geocoding queue depth
5. `cm_users_total` - Total users
6. `cm_campaigns_total` - Total campaigns
7. `cm_locations_total` - Total locations
8. `cm_geocoded_locations_total` - Geocoded locations
9. `cm_active_canvass_sessions` - Active sessions
10. `cm_external_service_up` - Service health (0/1)
11. `cm_listmonk_subscribers_total` - Listmonk subscribers
12. `cm_media_videos_total` - Total videos

Plus standard HTTP metrics:

- `http_request_duration_seconds`
- `http_requests_total`

---

## Prometheus Not Scraping

### Target Down

**Severity:** 🔴 Critical

#### Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

```
Target: api (localhost:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused
```

No data in Grafana dashboards.

#### Common Causes

1. **Service not running** - API container stopped
2. **Metrics endpoint missing** - /metrics endpoint not registered
3. **Network issue** - Prometheus can't reach the service
4. **Authentication required** - Metrics endpoint requires auth

#### Solutions

**Solution 1: Check service is running**

```bash
# Is API running?
docker compose ps api

# Should show "Up"

# If not:
docker compose up -d api
```

**Solution 2: Test metrics endpoint**

```bash
# From host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics
```

**Solution 3: Check Prometheus config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']  # Use service name, not localhost
```

**Solution 4: Verify network**

```bash
# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode

# Should both show "changemaker-lite"
```

**Solution 5: Check metrics are registered**

In API logs:

```bash
docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
```

#### Prevention

- **Health checks** - Monitor Prometheus target health
- **Service dependencies** - Ensure services start in order
- **Network config** - Use Docker service names
- **Testing** - Test /metrics endpoint on deploy

---

### Scrape Timeout

**Severity:** 🟡 Medium

#### Symptoms

```
Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded
```

Scrapes taking too long or timing out.
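
Before raising any timeouts, it helps to confirm which scrapes are actually slow. Prometheus records a `scrape_duration_seconds` sample for every target, so a quick API query (a diagnostic sketch assuming the localhost:9090 port mapping used above and that `jq` is installed on the host) shows where the time is going:

```bash
# Per-target scrape durations in seconds, slowest first
curl -s 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' \
  | jq -r '.data.result[] | "\(.metric.job)\t\(.value[1])"' \
  | sort -k2 -rn
```

Durations close to the configured `scrape_timeout` point at a slow /metrics endpoint rather than a Prometheus problem.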

#### Solutions

**Solution 1: Increase scrape timeout**

In `configs/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s      # Default; must not exceed scrape_interval

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s   # Scrape less frequently
    scrape_timeout: 20s    # Allow slower scrapes for this job
    static_configs:
      - targets: ['api:4000']
```

Reload config:

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus
```

**Solution 2: Optimize metrics generation**

```typescript
// In api/src/utils/metrics.ts

// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query the database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));
```

**Solution 3: Reduce metric cardinality**

```typescript
// Bad - high cardinality (creates a time series per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']  // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']    // Only 5 roles
});
```

#### Prevention

- **Cache expensive metrics** - Don't query the DB on every scrape
- **Reasonable timeouts** - 10-30s timeouts
- **Low cardinality** - Avoid high-cardinality labels
- **Optimize queries** - Fast metric queries

---

### Authentication Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error: 401 Unauthorized when scraping /metrics
```

#### Solutions

The Changemaker Lite V2 metrics endpoint is **public** (no auth required). If you see auth errors:

**Solution 1: Remove auth middleware from /metrics**

In `api/src/server.ts`:

```typescript
// Metrics endpoint should be registered BEFORE the authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);
```

**Solution 2: Configure basic auth in Prometheus**

If you DO want to protect /metrics, in `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```

#### Prevention

- **Public metrics** - Keep /metrics public for simplicity
- **Network isolation** - Use Docker networks for security
- **IP whitelist** - Only allow the Prometheus IP

---

## Grafana Issues

### Dashboards Not Loading

**Severity:** 🟠 High

#### Symptoms

Grafana shows blank dashboards or "No data" panels.

#### Solutions

**Solution 1: Check Grafana is running**

```bash
docker compose --profile monitoring ps grafana

# Should show "Up"

# If not:
docker compose --profile monitoring up -d grafana
```

**Solution 2: Verify Prometheus datasource**

1. Open Grafana: http://localhost:3001
2. Login (admin/admin)
3. Settings → Data Sources
4. Click Prometheus
5. URL should be: `http://prometheus:9090`
6. Click "Save & Test"
7. Should show "Data source is working"

**Solution 3: Check dashboard provisioning**

```bash
# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json
```

**Solution 4: Import dashboard manually**

If auto-provisioning fails:

1. Grafana → Dashboards → Import
2. Upload JSON from `configs/grafana/dashboards/`
3. Select Prometheus datasource
4. Click Import
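
If the UI import also fails, or you want to script it, the same JSON can be pushed through the Grafana HTTP API. A minimal sketch, assuming the default admin/admin credentials, the port 3001 mapping, one of the provisioned dashboard files listed above, and `jq` on the host:

```bash
# Wrap the dashboard JSON in the payload the import endpoint expects, then POST it
jq -n --slurpfile d configs/grafana/dashboards/changemaker-api.json \
  '{dashboard: $d[0], overwrite: true, folderId: 0}' \
  | curl -s -X POST http://admin:admin@localhost:3001/api/dashboards/db \
      -H 'Content-Type: application/json' \
      -d @-
```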

**Solution 5: Check for data**

```bash
# Test query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
```

#### Prevention

- **Dashboard versioning** - Keep dashboards in git
- **Auto-provisioning** - Use provisioning instead of manual import
- **Testing** - Test dashboards after changes
- **Documentation** - Document dashboard variables

---

### Datasource Errors

**Severity:** 🟠 High

#### Symptoms

```
Error: Failed to query Prometheus
Error: connection refused
```

Red error bars on Grafana panels.

#### Solutions

**Solution 1: Test Prometheus connection**

```bash
# From Grafana container
docker compose exec grafana wget -O- http://prometheus:9090/api/v1/query?query=up

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}
```

**Solution 2: Check Prometheus is running**

```bash
docker compose --profile monitoring ps prometheus

# Should show "Up"
```

**Solution 3: Verify datasource URL**

In Grafana datasource settings:

- URL: `http://prometheus:9090` (NOT `http://localhost:9090`)
- Access: Server (NOT Browser)

**Solution 4: Check Docker network**

```bash
# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
```

#### Prevention

- **Health checks** - Monitor datasource health
- **Service dependencies** - Start Prometheus before Grafana
- **Error handling** - Graceful error messages

---

### Query Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error executing query: parse error at char X: unexpected identifier
```

Panel shows "Error loading data".

#### Solutions

**Solution 1: Validate PromQL syntax**

Common errors:

```promql
# Bad - unquoted label value
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - wrong function name
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)
```

**Solution 2: Test query in Explore**

1. Grafana → Explore
2. Enter query
3. Run
4. Fix errors before adding to dashboard

**Solution 3: Check metric exists**

```bash
# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_
```

**Solution 4: Use metric browser**

In Grafana query editor:

1. Click "Metrics" button
2. Browse available metrics
3. Select metric (auto-fills query)

#### Prevention

- **Query validation** - Validate before saving
- **Testing** - Test queries in Explore
- **Documentation** - Document available metrics
- **Examples** - Provide query examples

---

## Alertmanager Issues

### Alerts Not Firing

**Severity:** 🟠 High

#### Symptoms

Conditions met but alert not triggering.

#### Solutions

**Solution 1: Check alert rules**

In Prometheus UI (localhost:9090):

1. Click "Alerts"
2. Find your alert
3. Check its state:
   - Inactive: Condition not met
   - Pending: Condition met, waiting for the `for:` duration
   - Firing: Alert active
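
The same state information is available from the Prometheus HTTP API, which is handy when you only have a terminal. A quick check, assuming the localhost:9090 mapping and `jq`:

```bash
# Show each alerting rule and its current state (inactive / pending / firing)
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.type == "alerting") | {name, state}'

# List alerts that are currently pending or firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'
```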

**Solution 2: Verify alert rule syntax**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m  # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"
```

**Solution 3: Check Alertmanager config**

```bash
# Test Alertmanager
curl http://localhost:9093/api/v1/alerts

# Should return alert list
```

**Solution 4: View Prometheus logs**

```bash
docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing
```

**Solution 5: Reload alert rules**

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check rules loaded
curl http://localhost:9090/api/v1/rules
```

#### Prevention

- **Test alert conditions** - Trigger manually to test
- **Reasonable thresholds** - Not too sensitive or too lenient
- **Documentation** - Document alert thresholds
- **Regular review** - Review alert effectiveness

---

### Notifications Not Sent

**Severity:** 🟡 Medium

#### Symptoms

Alert firing in Prometheus but no notification received.

#### Solutions

**Solution 1: Check Alertmanager config**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'
```

**Solution 2: Test Alertmanager notification**

```bash
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": { "alertname": "Test", "severity": "critical" },
    "annotations": { "summary": "Test alert" }
  }]'

# Check if notification sent
docker compose logs alertmanager | grep -i "notification\|email"
```

**Solution 3: Check SMTP config**

See [Email Issues](email-issues.md#smtp-configuration) for SMTP troubleshooting.

**Solution 4: Use alternative notification channels**

```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
```

#### Prevention

- **Test notifications** - Regular notification tests
- **Multiple channels** - Email + Slack + webhook
- **Fallback receivers** - Backup notification method
- **Documentation** - Document notification setup

---

### Routing Errors

**Severity:** 🟡 Medium

#### Symptoms

Alerts going to the wrong receiver or being silenced incorrectly.

#### Solutions

**Solution 1: Check routing rules**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'
```

**Solution 2: Test routing**

```bash
# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used
```

**Solution 3: View active silences**

In Alertmanager UI (localhost:9093):

1. Click "Silences"
2. Check if the alert is silenced
3. Expire or delete the silence if it is wrong
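
Silences can also be listed and removed from the command line with `amtool`, which is bundled in the Alertmanager image. A sketch, assuming Alertmanager listens on its default port inside the container:

```bash
# List active silences (note the silence IDs in the output)
docker compose exec alertmanager amtool silence query \
  --alertmanager.url=http://localhost:9093

# Expire a silence that should not be active (replace <silence-id>)
docker compose exec alertmanager amtool silence expire <silence-id> \
  --alertmanager.url=http://localhost:9093
```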

**Solution 4: Check inhibition rules**

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
    # Critical alerts inhibit warnings for the same instance
```

#### Prevention

- **Clear routing logic** - Simple, understandable rules
- **Test routing** - Test before deploying
- **Documentation** - Document routing rules
- **Regular review** - Review silences and inhibitions

---

## Metrics Issues

### Missing Metrics

**Severity:** 🟡 Medium

#### Symptoms

Expected metric not appearing in Prometheus or Grafana.

#### Solutions

**Solution 1: Check metric is registered**

In API code (`api/src/utils/metrics.ts`):

```typescript
import { register, Counter } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter);  // Must register!
```

**Solution 2: Check metric is collected**

```bash
# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42
```

**Solution 3: Check scrape config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Don't accidentally drop the metric
      - source_labels: [__name__]
        regex: 'cm_.*'
        action: keep  # Keeps cm_* metrics; everything else from this job is dropped
```

**Solution 4: Verify metric type**

```typescript
// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc();  // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
gauge.set(42);  // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration' });
histogram.observe(0.5);  // Record duration
```

#### Prevention

- **Register all metrics** - Don't forget register.registerMetric()
- **Test endpoint** - Check /metrics shows the metric
- **Naming convention** - Use the cm_* prefix for custom metrics
- **Documentation** - Document all custom metrics

---

### Incorrect Values

**Severity:** 🟡 Medium

#### Symptoms

Metric showing wrong or unexpected values.

#### Solutions

**Solution 1: Check metric logic**

```typescript
// Wrong - gauge never updated
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total users' });
// Never set, always 0

// Right - gauge updated on each scrape
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});
```

**Solution 2: Check metric type**

```typescript
// Wrong - using a Counter for a value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.inc(50);   // Add 50
queueSize.inc(-20);  // Try to subtract 20 - ERROR!

// Right - use a Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.set(50);  // Set to 50
queueSize.set(30);  // Set to 30 (can decrease)
```

**Solution 3: Check label values**

```typescript
// Labels must match exactly
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 });
// Avoid - pass label values as strings and keep the type consistent,
// otherwise it is easy to end up with mismatched or duplicated series
```

**Solution 4: Check query aggregation**

```promql
# Wrong - sums across all labels
sum(cm_requests_total)

# Right - sum by a specific label
sum by (status) (cm_requests_total)
```

#### Prevention

- **Correct metric type** - Counter vs Gauge vs Histogram
- **Type consistency** - Label values always the same type
- **Testing** - Test metric values with sample data
- **Validation** - Validate that metric values are reasonable

---

### Stale Metrics

**Severity:** 🟢 Low

#### Symptoms

Metric values not updating, showing old data.

#### Solutions

**Solution 1: Check collection frequency**

```typescript
// Metrics with a collect() callback only update when scraped
const gauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue size',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  }
});
```

**Solution 2: Force metric update**

```typescript
// Update the metric on the event, not just on scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});
```

**Solution 3: Check scrape interval**

In `configs/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s  # Scrape every 15 seconds

# Decrease the interval for more frequent updates
global:
  scrape_interval: 5s   # Scrape every 5 seconds
```

#### Prevention

- **Appropriate intervals** - Balance freshness vs overhead
- **Event-driven updates** - Update on change, not just on scrape
- **Cache expensive metrics** - Don't query the DB on every scrape
- **Staleness markers** - Set metrics to NaN when stale

---

## Performance Issues

### High Memory Usage

**Severity:** 🟠 High

#### Symptoms

Prometheus container using excessive memory (multiple GB).
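
Before tuning anything, confirm where the memory is going. `docker stats` shows the container's live usage, and Prometheus' TSDB status endpoint reports how many active series it is holding - a high series count usually points at a cardinality problem. A quick check, assuming the container name used earlier in this guide and a reasonably recent Prometheus release:

```bash
# Live memory usage of the Prometheus container
docker stats --no-stream changemaker-lite-prometheus-1

# Active series count and the metrics contributing the most series
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '{numSeries: .data.headStats.numSeries, topMetrics: .data.seriesCountByMetricName[:5]}'
```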

#### Solutions

**Solution 1: Reduce retention period**

In `docker-compose.yml`:

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'    # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB'  # Add size limit
```

Restart:

```bash
docker compose --profile monitoring restart prometheus
```

**Solution 2: Reduce metric cardinality**

```typescript
// Bad - creates a time series per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - creates a time series per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});
```

**Solution 3: Drop unnecessary metrics**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Drop Go/process metrics
        action: drop
```

**Solution 4: Increase memory limit**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G  # Increase from 2G
```

#### Prevention

- **Low cardinality** - Avoid high-cardinality labels
- **Appropriate retention** - 7-30 days is usually enough
- **Regular cleanup** - Drop unused metrics
- **Monitor memory** - Alert on high usage

---

### Slow Queries

**Severity:** 🟡 Medium

#### Symptoms

Grafana dashboards slow to load. Queries taking 10+ seconds.

#### Solutions

**Solution 1: Optimize query**

```promql
# Slow - calculates rate over a full year of samples
rate(cm_requests_total[1y])

# Fast - only the last 5 minutes
rate(cm_requests_total[5m])

# Expensive when there are many time series - pre-calculate it
# with a recording rule instead (see Solution 2)
sum(rate(cm_requests_total[5m]))
```

**Solution 2: Use recording rules**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate the expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use in the dashboard:
#   job:cm_request_rate:sum   # Fast!
```
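
After adding a recording rule, confirm that Prometheus has loaded it and is producing the new series. A quick check using the rule name above and the `promtool` binary bundled in the Prometheus image:

```bash
# Validate the rules file, then reload Prometheus
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
docker compose exec prometheus kill -HUP 1

# The recorded series should return data within one evaluation interval (30s here)
curl -s 'http://localhost:9090/api/v1/query?query=job:cm_request_rate:sum' | jq '.data.result'
```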

**Solution 3: Reduce time range**

In Grafana:

- Change dashboard time range from "Last 30 days" to "Last 24 hours"
- Queries are faster with less data

**Solution 4: Increase Prometheus resources**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0'  # More CPU for queries
        memory: 4G
```

#### Prevention

- **Efficient queries** - Keep queries simple
- **Recording rules** - Pre-calculate expensive queries
- **Appropriate time ranges** - Don't query months of data
- **Indexing** - Prometheus auto-indexes, but cardinality affects performance

---

## Useful Commands

### Prometheus Operations

```bash
# Check targets
curl http://localhost:9090/api/v1/targets

# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```

### Grafana Operations

```bash
# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl http://admin:admin@localhost:3001/api/search?type=dash-db

# Export dashboard (wrapped so it can be re-imported directly)
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID \
  | jq '{dashboard: .dashboard, overwrite: true}' > dashboard.json

# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```

### Alertmanager Operations

Note: these examples use the v1 HTTP API; newer Alertmanager releases (0.27+) remove it, so use the `/api/v2/...` equivalents there.

```bash
# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'
```

---

## Related Documentation

### Monitoring Documentation

- [Monitoring Issues](monitoring-issues.md) - This guide
- [Observability Dashboard](../user-guides/observability-dashboard.md) - Using the dashboard
- [Monitoring Guide](../deployment/monitoring.md) - Setup and configuration

### Other Troubleshooting

- [Common Errors](common-errors.md) - General errors
- [Performance Optimization](performance-optimization.md) - Performance tuning

### External Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [PromQL Tutorial](https://prometheus.io/docs/prometheus/latest/querying/basics/)

---

**Last Updated:** February 2026
**Version:** V2.0
**Status:** Complete