
# Monitoring and Observability Issues
This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.
## Overview
### Monitoring Stack
Changemaker Lite V2 uses **profile-based monitoring** (optional):
```bash
# Start with monitoring
docker compose --profile monitoring up -d
```
**Components:**
- **Prometheus** - Metrics collection and storage (port 9090)
- **Grafana** - Metrics visualization (port 3001)
- **Alertmanager** - Alert routing and notification (port 9093)
- **cAdvisor** - Container metrics (port 8080)
- **Node Exporter** - Host metrics (port 9100)
- **Redis Exporter** - Redis metrics (port 9121)
### Custom Metrics
12 custom `cm_*` Prometheus metrics:
1. `cm_api_uptime_seconds` - API uptime
2. `cm_database_uptime_seconds` - Database uptime
3. `cm_email_queue_size` - Email queue depth
4. `cm_geocoding_queue_size` - Geocoding queue depth
5. `cm_users_total` - Total users
6. `cm_campaigns_total` - Total campaigns
7. `cm_locations_total` - Total locations
8. `cm_geocoded_locations_total` - Geocoded locations
9. `cm_active_canvass_sessions` - Active sessions
10. `cm_external_service_up` - Service health (0/1)
11. `cm_listmonk_subscribers_total` - Listmonk subscribers
12. `cm_media_videos_total` - Total videos
Plus standard HTTP metrics:
- `http_request_duration_seconds`
- `http_requests_total`
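The `collect()` hook in prom-client is the usual way gauges like these stay current: the value is computed on each scrape rather than pushed. A minimal sketch, assuming prom-client and Prisma as used elsewhere in this guide (the exact query is illustrative, not necessarily the project's implementation):
```typescript
import { Gauge } from 'prom-client';
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// cm_users_total is recomputed every time Prometheus scrapes /metrics
const usersTotal = new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    this.set(await prisma.user.count());
  },
});
```
Metrics constructed this way register themselves with prom-client's default registry, so `/metrics` picks them up without further wiring.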
---
## Prometheus Not Scraping
### Target Down
**Severity:** 🔴 Critical
#### Symptoms
Prometheus UI (localhost:9090) shows targets as "DOWN":
```
Target: api (api:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused
```
No data in Grafana dashboards.
#### Common Causes
1. **Service not running** - API container stopped
2. **Metrics endpoint missing** - /metrics endpoint not registered
3. **Network issue** - Prometheus can't reach service
4. **Authentication required** - Metrics endpoint requires auth
#### Solutions
**Solution 1: Check service is running**
```bash
# Is API running?
docker compose ps api
# Should show "Up"
# If not:
docker compose up -d api
```
**Solution 2: Test metrics endpoint**
```bash
# From host
curl http://localhost:4000/metrics
# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45
# From Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics
```
**Solution 3: Check Prometheus config**
In `configs/prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000'] # Use service name, not localhost
```
**Solution 4: Verify network**
```bash
# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode
# Should both show "changemaker-lite"
```
**Solution 5: Check metrics are registered**
In API logs:
```bash
docker compose logs api | grep -i "metrics\|prometheus"
# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
```
#### Prevention
- **Health checks** - Monitor Prometheus target health
- **Service dependencies** - Ensure services start in order
- **Network config** - Use Docker service names
- **Testing** - Test /metrics endpoint on deploy
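For the last prevention point, a hypothetical post-deploy smoke test (file name, host, and port are assumptions based on the examples above; Node 18+ provides the global `fetch`):
```typescript
// smoke-metrics.ts - fail the deploy if /metrics is missing or lacks cm_* metrics
async function checkMetrics(): Promise<void> {
  const res = await fetch('http://localhost:4000/metrics');
  if (!res.ok) throw new Error(`/metrics returned HTTP ${res.status}`);
  const body = await res.text();
  if (!body.includes('cm_api_uptime_seconds')) {
    throw new Error('cm_* metrics missing from /metrics output');
  }
  console.log('metrics endpoint OK');
}

checkMetrics().catch((err) => {
  console.error(err);
  process.exit(1); // non-zero exit so CI marks the deploy check as failed
});
```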
---
### Scrape Timeout
**Severity:** 🟡 Medium
#### Symptoms
```
Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded
```
Scrapes taking too long or timing out.
#### Solutions
**Solution 1: Increase scrape timeout**
In `configs/prometheus/prometheus.yml`:
```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 15s # Raised from the 10s default (must not exceed scrape_interval)

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s # Scrape less frequently
    scrape_timeout: 20s  # Per-job timeout, still below the 30s interval
    static_configs:
      - targets: ['api:4000']
```
Reload config:
```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Or restart
docker compose restart prometheus
```
**Solution 2: Optimize metrics generation**
```typescript
// In api/src/utils/metrics.ts
// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query the database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));
```
**Solution 3: Reduce metric cardinality**
```typescript
// Bad - high cardinality (creates a time series per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId'] // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role'] // Only 5 roles
});
```
#### Prevention
- **Cache expensive metrics** - Don't query DB on every scrape
- **Reasonable timeouts** - 10-30s timeouts
- **Low cardinality** - Avoid high-cardinality labels
- **Optimize queries** - Fast metric queries
---
### Authentication Errors
**Severity:** 🟡 Medium
#### Symptoms
```
Error: 401 Unauthorized when scraping /metrics
```
#### Solutions
Changemaker Lite V2 metrics endpoint is **public** (no auth required).
If you see auth errors:
**Solution 1: Remove auth middleware from /metrics**
In `api/src/server.ts`:
```typescript
// The metrics endpoint must be registered BEFORE the authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);
```
**Solution 2: Configure basic auth in Prometheus**
If you DO want to protect /metrics:
In `configs/prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```
#### Prevention
- **Public metrics** - Keep /metrics public for simplicity
- **Network isolation** - Use Docker networks for security
- **IP whitelist** - Only allow Prometheus IP
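If you take the IP-whitelist route, a rough sketch of what that could look like as Express middleware; the allowed ranges are placeholders, so match them to your actual Docker network:
```typescript
import { Request, Response, NextFunction } from 'express';

// Hypothetical allowlist for /metrics - adjust the ranges to your network
export function metricsAllowlist(req: Request, res: Response, next: NextFunction) {
  // Normalize IPv6-mapped IPv4 addresses such as ::ffff:172.18.0.5
  const ip = (req.ip ?? '').replace(/^::ffff:/, '');
  if (ip === '127.0.0.1' || ip === '::1' || ip.startsWith('172.')) {
    return next(); // loopback or (naively matched) Docker bridge range
  }
  res.status(403).send('Forbidden');
}

// Usage: app.get('/metrics', metricsAllowlist, metricsHandler);
```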
---
## Grafana Issues
### Dashboards Not Loading
**Severity:** 🟠 High
#### Symptoms
Grafana shows blank dashboards or "No data" panels.
#### Solutions
**Solution 1: Check Grafana is running**
```bash
docker compose --profile monitoring ps grafana
# Should show "Up"
# If not:
docker compose --profile monitoring up -d grafana
```
**Solution 2: Verify Prometheus datasource**
1. Open Grafana: http://localhost:3001
2. Login (admin/admin)
3. Settings → Data Sources
4. Click Prometheus
5. URL should be: `http://prometheus:9090`
6. Click "Save & Test"
7. Should show "Data source is working"
**Solution 3: Check dashboard provisioning**
```bash
# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/
# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json
```
**Solution 4: Import dashboard manually**
If auto-provisioning fails:
1. Grafana → Dashboards → Import
2. Upload JSON from `configs/grafana/dashboards/`
3. Select Prometheus datasource
4. Click Import
**Solution 5: Check for data**
```bash
# Test query in Grafana Explore
# Query: cm_api_uptime_seconds
# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
```
#### Prevention
- **Dashboard versioning** - Keep dashboards in git
- **Auto-provisioning** - Use provisioning instead of manual import
- **Testing** - Test dashboards after changes
- **Documentation** - Document dashboard variables
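To make the versioning point concrete, dashboards can be exported straight from the Grafana HTTP API (the same endpoints listed under "Useful Commands" below) and committed to git. A sketch, with credentials and output paths as placeholders:
```typescript
import { writeFile } from 'node:fs/promises';

const GRAFANA = 'http://localhost:3001';
const AUTH = 'Basic ' + Buffer.from('admin:admin').toString('base64');

// Fetch every dashboard and write its JSON model to disk for version control
async function exportDashboards(): Promise<void> {
  const list = await fetch(`${GRAFANA}/api/search?type=dash-db`, {
    headers: { Authorization: AUTH },
  }).then((r) => r.json());

  for (const d of list) {
    const detail = await fetch(`${GRAFANA}/api/dashboards/uid/${d.uid}`, {
      headers: { Authorization: AUTH },
    }).then((r) => r.json());
    await writeFile(`${d.uid}.json`, JSON.stringify(detail.dashboard, null, 2));
  }
}

exportDashboards().catch(console.error);
```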
---
### Datasource Errors
**Severity:** 🟠 High
#### Symptoms
```
Error: Failed to query Prometheus
Error: connection refused
```
Red error bars on Grafana panels.
#### Solutions
**Solution 1: Test Prometheus connection**
```bash
# From Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'
# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}
```
**Solution 2: Check Prometheus is running**
```bash
docker compose --profile monitoring ps prometheus
# Should show "Up"
```
**Solution 3: Verify datasource URL**
In Grafana datasource settings:
- URL: `http://prometheus:9090` (NOT `http://localhost:9090`)
- Access: Server (NOT Browser)
**Solution 4: Check Docker network**
```bash
# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
```
#### Prevention
- **Health checks** - Monitor datasource health
- **Service dependencies** - Start Prometheus before Grafana
- **Error handling** - Graceful error messages
---
### Query Errors
**Severity:** 🟡 Medium
#### Symptoms
```
Error executing query: parse error at char X: unexpected identifier
```
Panel shows "Error loading data".
#### Solutions
**Solution 1: Validate PromQL syntax**
Common errors:
```promql
# Bad - unquoted label value
cm_api_uptime_seconds{job=api}
# Good
cm_api_uptime_seconds{job="api"}
# Bad - wrong function
average(cm_api_uptime_seconds)
# Good
avg(cm_api_uptime_seconds)
```
**Solution 2: Test query in Explore**
1. Grafana → Explore
2. Enter query
3. Run
4. Fix errors before adding to dashboard
**Solution 3: Check metric exists**
```bash
# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq
# Search for metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_
```
**Solution 4: Use metric browser**
In Grafana query editor:
1. Click "Metrics" button
2. Browse available metrics
3. Select metric (auto-fills query)
#### Prevention
- **Query validation** - Validate before saving
- **Testing** - Test queries in Explore
- **Documentation** - Document available metrics
- **Examples** - Provide query examples
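The "document available metrics" point can be automated: generate the list straight from Prometheus with the label-values endpoint from Solution 3. A small sketch:
```typescript
// List every cm_* metric name currently known to Prometheus
async function listCustomMetrics(): Promise<string[]> {
  const res = await fetch('http://localhost:9090/api/v1/label/__name__/values');
  const json = await res.json();
  return (json.data as string[]).filter((name) => name.startsWith('cm_'));
}

listCustomMetrics().then((names) => console.log(names.join('\n')));
```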
---
## Alertmanager Issues
### Alerts Not Firing
**Severity:** 🟠 High
#### Symptoms
Conditions met but alert not triggering.
#### Solutions
**Solution 1: Check alert rules**
In Prometheus UI (localhost:9090):
1. Click "Alerts"
2. Find your alert
3. Check state:
- Inactive: Condition not met
- Pending: Met but waiting for `for:` duration
- Firing: Alert active
**Solution 2: Verify alert rule syntax**
In `configs/prometheus/alerts.yml`:
```yaml
groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"
```
**Solution 3: Check Alertmanager config**
```bash
# Test Alertmanager
curl http://localhost:9093/api/v1/alerts
# Should return alert list
```
**Solution 4: View Prometheus logs**
```bash
docker compose logs prometheus | grep -i alert
# Shows:
# Loaded alert rules
# Alert X is firing
```
**Solution 5: Reload alert rules**
```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1
# Check rules loaded
curl http://localhost:9090/api/v1/rules
```
#### Prevention
- **Test alert conditions** - Trigger manually to test
- **Reasonable thresholds** - Not too sensitive or too lenient
- **Documentation** - Document alert thresholds
- **Regular review** - Review alert effectiveness
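Rule states can also be reviewed without the UI via the `/api/v1/rules` endpoint from Solution 5. A sketch that prints each alerting rule and its current state (inactive / pending / firing):
```typescript
// Walk the rule groups returned by the Prometheus rules API
async function alertStates(): Promise<void> {
  const res = await fetch('http://localhost:9090/api/v1/rules');
  const json = await res.json();
  for (const group of json.data.groups) {
    for (const rule of group.rules) {
      if (rule.type === 'alerting') {
        console.log(`${rule.name}: ${rule.state}`);
      }
    }
  }
}

alertStates().catch(console.error);
```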
---
### Notifications Not Sent
**Severity:** 🟡 Medium
#### Symptoms
Alert firing in Prometheus but no notification received.
#### Solutions
**Solution 1: Check Alertmanager config**
In `configs/alertmanager/alertmanager.yml`:
```yaml
route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'
```
**Solution 2: Test Alertmanager notification**
```bash
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "Test",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }]'

# Check whether a notification was sent
docker compose logs alertmanager | grep -i "notification\|email"
```
**Solution 3: Check SMTP config**
See [Email Issues](email-issues.md#smtp-configuration) for SMTP troubleshooting.
**Solution 4: Use alternative notification channels**
```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
```
#### Prevention
- **Test notifications** - Regular notification tests
- **Multiple channels** - Email + Slack + webhook
- **Fallback receivers** - Backup notification method
- **Documentation** - Document notification setup
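If you wire up a webhook receiver as in Solution 4, the other end can be very small. A sketch of an Express handler (port and path are assumptions); Alertmanager POSTs a JSON body whose `alerts` array carries one entry per alert:
```typescript
import express from 'express';

const app = express();
app.use(express.json());

// Log every alert in the batch Alertmanager delivers
app.post('/alerts', (req, res) => {
  for (const alert of req.body.alerts ?? []) {
    console.log(`[${alert.status}] ${alert.labels.alertname}: ${alert.annotations?.summary ?? ''}`);
  }
  res.sendStatus(200);
});

app.listen(5001, () => console.log('alert webhook listening on :5001'));
```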
---
### Routing Errors
**Severity:** 🟡 Medium
#### Symptoms
Alerts going to wrong receiver or being silenced incorrectly.
#### Solutions
**Solution 1: Check routing rules**
In `configs/alertmanager/alertmanager.yml`:
```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'
```
**Solution 2: Test routing**
```bash
# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical
# Prints which receiver the alert would be routed to
```
**Solution 3: View active silences**
In Alertmanager UI (localhost:9093):
1. Click "Silences"
2. Check if alert is silenced
3. Expire or delete silence if wrong
**Solution 4: Check inhibition rules**
```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
# Critical alerts inhibit warnings for the same instance
```
#### Prevention
- **Clear routing logic** - Simple, understandable rules
- **Test routing** - Test before deploying
- **Documentation** - Document routing rules
- **Regular review** - Review silences and inhibitions
---
## Metrics Issues
### Missing Metrics
**Severity:** 🟡 Medium
#### Symptoms
Expected metric not appearing in Prometheus or Grafana.
#### Solutions
**Solution 1: Check metric is registered**
In API code (`api/src/utils/metrics.ts`):
```typescript
import { Counter, register } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter); // Must be registered!
```
**Solution 2: Check metric is collected**
```bash
# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric
# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42
```
**Solution 3: Check scrape config**
In `configs/prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    # A keep rule drops every metric that does NOT match the regex,
    # so make sure the pattern covers everything you need
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'cm_.*|http_.*' # Keep cm_* and the standard HTTP metrics
        action: keep
```
**Solution 4: Verify metric type**
```typescript
// Counter - only increases (counts events)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc(); // Increment by 1

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Queue depth' });
gauge.set(42); // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration' });
histogram.observe(0.5); // Record a 0.5s duration
```
#### Prevention
- **Register all metrics** - Don't forget register.registerMetric()
- **Test endpoint** - Check /metrics shows metric
- **Naming convention** - Use cm_* prefix for custom metrics
- **Documentation** - Document all custom metrics
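A regression test can enforce the first two points automatically. A Jest-style sketch (the import path is an assumption following the file layout mentioned above):
```typescript
import { register } from 'prom-client';
import '../src/utils/metrics'; // module that defines and registers the cm_* metrics

test('cm_my_metric_total is registered and exposed', async () => {
  const output = await register.metrics();
  expect(output).toContain('cm_my_metric_total');
});
```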
---
### Incorrect Values
**Severity:** 🟡 Medium
#### Symptoms
Metric showing wrong or unexpected values.
#### Solutions
**Solution 1: Check metric logic**
```typescript
// Wrong - gauge declared but never updated, always reports 0
const staleGauge = new Gauge({ name: 'cm_users_total', help: 'Total users' });

// Right - gauge refreshed on every scrape
const usersGauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});
```
**Solution 2: Check metric type**
```typescript
// Wrong - using a Counter for a value that can decrease
const queueCounter = new Counter({ name: 'cm_queue_size', help: 'Queue depth' });
queueCounter.inc(50);  // Add 50
queueCounter.inc(-20); // Throws - counters can only increase

// Right - use a Gauge
const queueGauge = new Gauge({ name: 'cm_queue_size', help: 'Queue depth' });
queueGauge.set(50); // Set to 50
queueGauge.set(30); // Set to 30 (gauges can decrease)
```
**Solution 3: Check label values**
```typescript
// Label values must match exactly - always pass them as strings
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 }); // Wrong - number, not string
// Inconsistent types risk a separate series for the same logical label
```
**Solution 4: Check query aggregation**
```promql
# Wrong - sums across all labels
sum(cm_requests_total)
# Right - sum by specific label
sum by (status) (cm_requests_total)
```
#### Prevention
- **Correct metric type** - Counter vs Gauge vs Histogram
- **Type consistency** - Label values always same type
- **Testing** - Test metric values with sample data
- **Validation** - Validate metric values are reasonable
---
### Stale Metrics
**Severity:** 🟢 Low
#### Symptoms
Metric values not updating, showing old data.
#### Solutions
**Solution 1: Check collection frequency**
```typescript
// Metrics with a collect() hook are only updated when Prometheus scrapes
const gauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Email queue depth',
  async collect() {
    // Runs on every Prometheus scrape (every 15s by default)
    const size = await getQueueSize();
    this.set(size);
  }
});
```
**Solution 2: Force metric update**
```typescript
// Update the metric when the event fires, not just on scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});
```
**Solution 3: Check scrape interval**
In `configs/prometheus/prometheus.yml`:
```yaml
global:
  scrape_interval: 15s # Scrape every 15 seconds

# Or decrease the interval for more frequent updates:
# global:
#   scrape_interval: 5s # Scrape every 5 seconds
```
#### Prevention
- **Appropriate intervals** - Balance freshness vs overhead
- **Event-driven updates** - Update on change, not just scrape
- **Cache expensive metrics** - Don't query DB every scrape
- **Staleness markers** - Set metrics to NaN when stale
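A sketch of the staleness-marker idea, combining the event-driven update from Solution 2 with a cutoff in `collect()` (the 5-minute threshold is arbitrary):
```typescript
import { Gauge } from 'prom-client';

let lastUpdate = 0;
let lastSize = 0;

const queueSize = new Gauge({
  name: 'cm_queue_size',
  help: 'Email queue depth',
  collect() {
    // Report NaN when the last event is older than 5 minutes,
    // so dashboards show a gap instead of a stale number
    this.set(Date.now() - lastUpdate > 300_000 ? NaN : lastSize);
  },
});

// Called from the queueSizeChanged event handler shown above
export function recordQueueSize(size: number): void {
  lastSize = size;
  lastUpdate = Date.now();
}
```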
---
## Performance Issues
### High Memory Usage
**Severity:** 🟠 High
#### Symptoms
Prometheus container using excessive memory (multiple GB).
#### Solutions
**Solution 1: Reduce retention period**
In `docker-compose.yml`:
```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d' # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB' # Add a size limit
```
Restart:
```bash
docker compose --profile monitoring restart prometheus
```
**Solution 2: Reduce metric cardinality**
```typescript
// Bad - one time series per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - one time series per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});
```
**Solution 3: Drop unnecessary metrics**
In `configs/prometheus/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*' # Drop Go runtime/process metrics
        action: drop
```
**Solution 4: Increase memory limit**
```yaml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G # Increase from 2G
```
#### Prevention
- **Low cardinality** - Avoid high-cardinality labels
- **Appropriate retention** - 7-30 days is usually enough
- **Regular cleanup** - Drop unused metrics
- **Monitor memory** - Alert on high usage
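For the cardinality point, Prometheus can report its own worst offenders through the TSDB status endpoint. A sketch that prints series counts per metric name:
```typescript
// Query head-block statistics; seriesCountByMetricName lists the top metrics
async function topCardinality(): Promise<void> {
  const res = await fetch('http://localhost:9090/api/v1/status/tsdb');
  const json = await res.json();
  for (const entry of json.data.seriesCountByMetricName) {
    console.log(`${entry.value}\t${entry.name}`);
  }
}

topCardinality().catch(console.error);
```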
---
### Slow Queries
**Severity:** 🟡 Medium
#### Symptoms
Grafana dashboards slow to load. Queries taking 10+ seconds.
#### Solutions
**Solution 1: Optimize query**
```promql
# Slow - calculates the rate over a year of samples
rate(cm_requests_total[1y])

# Fast - only the last 5 minutes
rate(cm_requests_total[5m])

# Slow - rates every series, then sums them all
sum(rate(cm_requests_total[5m]))

# Faster - narrow the series with a label matcher first
sum(rate(cm_requests_total{job="api"}[5m]))
```
**Solution 2: Use recording rules**
In `configs/prometheus/alerts.yml`:
```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate the expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then query the recorded series in the dashboard:
#   job:cm_request_rate:sum   # Fast!
```
**Solution 3: Reduce time range**
In Grafana:
- Change dashboard time range from "Last 30 days" to "Last 24 hours"
- Queries are faster with less data
**Solution 4: Increase Prometheus resources**
```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0' # More CPU for queries
        memory: 4G
```
#### Prevention
- **Efficient queries** - Keep queries simple
- **Recording rules** - Pre-calculate expensive queries
- **Appropriate time ranges** - Don't query months of data
- **Watch cardinality** - Query speed degrades with series count, so keep label cardinality low
---
## Useful Commands
### Prometheus Operations
```bash
# Check targets
curl http://localhost:9090/api/v1/targets
# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'
# Reload config
docker compose exec prometheus kill -HUP 1
# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```
### Grafana Operations
```bash
# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health
# List dashboards
curl 'http://admin:admin@localhost:3001/api/search?type=dash-db'
# Export dashboard
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID | jq .dashboard > dashboard.json
# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
-H "Content-Type: application/json" \
-d @dashboard.json
```
### Alertmanager Operations
```bash
# Check alerts
curl http://localhost:9093/api/v1/alerts
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
-H 'Content-Type: application/json' \
-d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'
# List silences
curl http://localhost:9093/api/v1/silences
# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
-H 'Content-Type: application/json' \
-d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'
```
---
## Related Documentation
### Monitoring Documentation
- [Monitoring Issues](monitoring-issues.md) - This guide
- [Observability Dashboard](../user-guides/observability-dashboard.md) - Using dashboard
- [Monitoring Guide](../deployment/monitoring.md) - Setup and configuration
### Other Troubleshooting
- [Common Errors](common-errors.md) - General errors
- [Performance Optimization](performance-optimization.md) - Performance tuning
### External Resources
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [PromQL Tutorial](https://prometheus.io/docs/prometheus/latest/querying/basics/)
---
**Last Updated:** February 2026
**Version:** V2.0
**Status:** Complete