# Monitoring and Observability Issues
This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

## Overview

### Monitoring Stack

Changemaker Lite V2 uses **profile-based monitoring** (optional):

```bash
# Start with monitoring
docker compose --profile monitoring up -d
```
**Components:**

- **Prometheus** - Metrics collection and storage (port 9090)
- **Grafana** - Metrics visualization (port 3001)
- **Alertmanager** - Alert routing and notification (port 9093)
- **cAdvisor** - Container metrics (port 8080)
- **Node Exporter** - Host metrics (port 9100)
- **Redis Exporter** - Redis metrics (port 9121)

### Custom Metrics

The API exposes 12 custom `cm_*` Prometheus metrics:

1. `cm_api_uptime_seconds` - API uptime
2. `cm_database_uptime_seconds` - Database uptime
3. `cm_email_queue_size` - Email queue depth
4. `cm_geocoding_queue_size` - Geocoding queue depth
5. `cm_users_total` - Total users
6. `cm_campaigns_total` - Total campaigns
7. `cm_locations_total` - Total locations
8. `cm_geocoded_locations_total` - Geocoded locations
9. `cm_active_canvass_sessions` - Active sessions
10. `cm_external_service_up` - Service health (0/1)
11. `cm_listmonk_subscribers_total` - Listmonk subscribers
12. `cm_media_videos_total` - Total videos

Plus standard HTTP metrics:

- `http_request_duration_seconds`
- `http_requests_total`

---
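The metrics above are served in the Prometheus text exposition format (`# HELP`/`# TYPE` comment lines followed by `name{labels} value` samples). The sketch below is not part of the Changemaker codebase — it is an illustrative parser for the sample-bearing lines, to show what a scrape actually reads; real scrapers also handle escaping, timestamps, and histogram families.

```typescript
// Minimal parser for the Prometheus text exposition format (illustrative only).
function parseMetrics(body: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const line of body.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue; // skip HELP/TYPE lines
    const lastSpace = trimmed.lastIndexOf(" ");
    const name = trimmed.slice(0, lastSpace); // metric name plus any labels
    const value = Number(trimmed.slice(lastSpace + 1));
    samples.set(name, value);
  }
  return samples;
}

const sample = [
  "# HELP cm_api_uptime_seconds API uptime in seconds",
  "# TYPE cm_api_uptime_seconds gauge",
  "cm_api_uptime_seconds 123.45",
  'cm_external_service_up{service="listmonk"} 1',
].join("\n");

const parsed = parseMetrics(sample);
console.log(parsed.get("cm_api_uptime_seconds")); // 123.45
```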
## Prometheus Not Scraping

### Target Down

**Severity:** 🔴 Critical

#### Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

```
Target: api (localhost:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused
```

No data in Grafana dashboards.

#### Common Causes

1. **Service not running** - API container stopped
2. **Metrics endpoint missing** - `/metrics` endpoint not registered
3. **Network issue** - Prometheus can't reach the service
4. **Authentication required** - Metrics endpoint requires auth

#### Solutions
**Solution 1: Check the service is running**

```bash
# Is the API running?
docker compose ps api

# Should show "Up"
# If not:
docker compose up -d api
```

**Solution 2: Test the metrics endpoint**

```bash
# From the host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From the Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics
```

**Solution 3: Check the Prometheus config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']  # Use the Docker service name, not localhost
```

**Solution 4: Verify the network**

```bash
# Both on the same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode

# Should both show "changemaker-lite"
```

**Solution 5: Check metrics are registered**

In the API logs:

```bash
docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
```

#### Prevention

- **Health checks** - Monitor Prometheus target health
- **Service dependencies** - Ensure services start in order
- **Network config** - Use Docker service names
- **Testing** - Test the /metrics endpoint on deploy

---
### Scrape Timeout

**Severity:** 🟡 Medium

#### Symptoms

```
Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded
```

Scrapes are taking too long or timing out.

#### Solutions
**Solution 1: Increase the scrape timeout**

In `configs/prometheus/prometheus.yml` (note that `scrape_timeout` must not exceed `scrape_interval`):

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 15s  # Raised from the 10s default; cannot exceed scrape_interval

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s  # Scrape this slow target less frequently
    scrape_timeout: 20s
    static_configs:
      - targets: ['api:4000']
```

Reload the config:

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus
```

**Solution 2: Optimize metrics generation**

```typescript
// In api/src/utils/metrics.ts
// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query the database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));
```
**Solution 3: Reduce metric cardinality**

```typescript
// Bad - high cardinality (creates a time series per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId'] // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role'] // Only 5 roles
});
```

#### Prevention

- **Cache expensive metrics** - Don't query the DB on every scrape
- **Reasonable timeouts** - 10-30s timeouts
- **Low cardinality** - Avoid high-cardinality labels
- **Optimize queries** - Keep metric collection queries fast

---
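A quick way to reason about the cardinality advice above: every unique label-value combination becomes its own time series, so the series count is the product of the distinct values of each label. A hypothetical helper (not part of the codebase) makes the difference concrete:

```typescript
// Rough series-count estimate for a labelled metric: cardinality is the
// product of the number of distinct values of each label.
function estimatedSeries(labelValueCounts: number[]): number {
  return labelValueCounts.reduce((acc, n) => acc * n, 1);
}

// requests_by_role with 5 roles -> 5 series
console.log(estimatedSeries([5])); // 5

// method (4) x status (5) x userId (10,000) -> 200,000 series
console.log(estimatedSeries([4, 5, 10000])); // 200000
```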
### Authentication Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error: 401 Unauthorized when scraping /metrics
```

#### Solutions

The Changemaker Lite V2 metrics endpoint is **public** (no auth required). If you see auth errors:

**Solution 1: Remove auth middleware from /metrics**

In `api/src/server.ts`:

```typescript
// The metrics endpoint must be registered BEFORE the authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);
```

**Solution 2: Configure basic auth in Prometheus**

If you DO want to protect /metrics, in `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```

#### Prevention

- **Public metrics** - Keep /metrics public for simplicity
- **Network isolation** - Use Docker networks for security
- **IP whitelist** - Only allow the Prometheus IP

---
## Grafana Issues

### Dashboards Not Loading

**Severity:** 🟠 High

#### Symptoms

Grafana shows blank dashboards or "No data" panels.

#### Solutions

**Solution 1: Check Grafana is running**

```bash
docker compose --profile monitoring ps grafana

# Should show "Up"
# If not:
docker compose --profile monitoring up -d grafana
```

**Solution 2: Verify the Prometheus datasource**

1. Open Grafana: http://localhost:3001
2. Log in (admin/admin)
3. Settings → Data Sources
4. Click Prometheus
5. URL should be: `http://prometheus:9090`
6. Click "Save & Test"
7. Should show "Data source is working"

**Solution 3: Check dashboard provisioning**

```bash
# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json
```

**Solution 4: Import a dashboard manually**

If auto-provisioning fails:

1. Grafana → Dashboards → Import
2. Upload JSON from `configs/grafana/dashboards/`
3. Select the Prometheus datasource
4. Click Import

**Solution 5: Check for data**

```bash
# Test the query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
```

#### Prevention

- **Dashboard versioning** - Keep dashboards in git
- **Auto-provisioning** - Use provisioning instead of manual import
- **Testing** - Test dashboards after changes
- **Documentation** - Document dashboard variables

---
### Datasource Errors

**Severity:** 🟠 High

#### Symptoms

```
Error: Failed to query Prometheus
Error: connection refused
```

Red error bars on Grafana panels.

#### Solutions

**Solution 1: Test the Prometheus connection**

```bash
# From the Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}
```

**Solution 2: Check Prometheus is running**

```bash
docker compose --profile monitoring ps prometheus

# Should show "Up"
```

**Solution 3: Verify the datasource URL**

In the Grafana datasource settings:

- URL: `http://prometheus:9090` (NOT `http://localhost:9090`)
- Access: Server (NOT Browser)

**Solution 4: Check the Docker network**

```bash
# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
```

#### Prevention

- **Health checks** - Monitor datasource health
- **Service dependencies** - Start Prometheus before Grafana
- **Error handling** - Graceful error messages

---
### Query Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error executing query: parse error at char X: unexpected identifier
```

Panel shows "Error loading data".

#### Solutions

**Solution 1: Validate PromQL syntax**

Common errors:

```promql
# Bad - label value not quoted
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - no such function
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)
```

**Solution 2: Test the query in Explore**

1. Grafana → Explore
2. Enter the query
3. Run it
4. Fix errors before adding it to a dashboard

**Solution 3: Check the metric exists**

```bash
# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for a metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_
```

**Solution 4: Use the metric browser**

In the Grafana query editor:

1. Click the "Metrics" button
2. Browse the available metrics
3. Select a metric (auto-fills the query)

#### Prevention

- **Query validation** - Validate before saving
- **Testing** - Test queries in Explore
- **Documentation** - Document available metrics
- **Examples** - Provide query examples

---
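The `label/__name__/values` endpoint used in Solution 3 returns JSON of the shape `{"status":"success","data":["metric_a",...]}`. The sketch below mirrors the `jq | grep cm_` pipeline in code; the interface and function names are hypothetical, introduced only for illustration:

```typescript
// Shape of the Prometheus /api/v1/label/__name__/values response.
interface LabelValuesResponse {
  status: string;
  data: string[];
}

// Return the metric names that start with the given prefix.
function metricsWithPrefix(resp: LabelValuesResponse, prefix: string): string[] {
  if (resp.status !== "success") return [];
  return resp.data.filter((name) => name.startsWith(prefix));
}

const resp: LabelValuesResponse = {
  status: "success",
  data: ["up", "cm_api_uptime_seconds", "cm_users_total", "go_goroutines"],
};
console.log(metricsWithPrefix(resp, "cm_"));
// -> ["cm_api_uptime_seconds", "cm_users_total"]
```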
## Alertmanager Issues

### Alerts Not Firing

**Severity:** 🟠 High

#### Symptoms

Alert conditions are met but the alert is not triggering.

#### Solutions

**Solution 1: Check the alert rules**

In the Prometheus UI (localhost:9090):

1. Click "Alerts"
2. Find your alert
3. Check its state:
   - Inactive: Condition not met
   - Pending: Condition met, but waiting out the `for:` duration
   - Firing: Alert active

**Solution 2: Verify the alert rule syntax**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m  # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"
```

**Solution 3: Check the Alertmanager config**

```bash
# Test Alertmanager
curl http://localhost:9093/api/v1/alerts

# Should return the alert list
```

**Solution 4: View the Prometheus logs**

```bash
docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing
```

**Solution 5: Reload the alert rules**

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check the rules loaded
curl http://localhost:9090/api/v1/rules
```

#### Prevention

- **Test alert conditions** - Trigger alerts manually to test
- **Reasonable thresholds** - Not too sensitive, not too lenient
- **Documentation** - Document alert thresholds
- **Regular review** - Review alert effectiveness

---
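An alert with `for: 1m` never fires exactly one minute after the outage starts: the condition must first be observed by a scrape, then picked up by a rule evaluation, then held for the `for:` duration. The helper below is a back-of-envelope model of that worst case, not an exact Prometheus formula:

```typescript
// Rough worst-case delay (seconds) before an alert fires. Assumes the
// condition becomes true just after a scrape, the rule group is evaluated
// every `evalInterval`, and the rule has a `for:` clause of `forDuration`.
function worstCaseFiringDelay(
  scrapeInterval: number, // seconds between scrapes
  evalInterval: number,   // rule group `interval:`
  forDuration: number     // the rule's `for:` clause, in seconds
): number {
  return scrapeInterval + evalInterval + forDuration;
}

// 15s scrapes, 30s rule interval, for: 1m -> up to ~105s before APIDown fires
console.log(worstCaseFiringDelay(15, 30, 60)); // 105
```

This is useful for sanity-checking expectations: if a notification arrives 90 seconds after an outage begins, the pipeline is working as configured, not lagging.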
### Notifications Not Sent

**Severity:** 🟡 Medium

#### Symptoms

The alert is firing in Prometheus but no notification is received.

#### Solutions

**Solution 1: Check the Alertmanager config**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'
```

**Solution 2: Test an Alertmanager notification**

```bash
# Send a test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "Test",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Test alert"
    }
  }]'

# Check whether the notification was sent
docker compose logs alertmanager | grep -i "notification\|email"
```

**Solution 3: Check the SMTP config**

See [Email Issues](email-issues.md#smtp-configuration) for SMTP troubleshooting.

**Solution 4: Use alternative notification channels**

```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
```

#### Prevention

- **Test notifications** - Run regular notification tests
- **Multiple channels** - Email + Slack + webhook
- **Fallback receivers** - Keep a backup notification method
- **Documentation** - Document the notification setup

---
### Routing Errors

**Severity:** 🟡 Medium

#### Symptoms

Alerts are going to the wrong receiver or being silenced incorrectly.

#### Solutions

**Solution 1: Check the routing rules**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'
```

**Solution 2: Test routing**

```bash
# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used
```

**Solution 3: View active silences**

In the Alertmanager UI (localhost:9093):

1. Click "Silences"
2. Check whether the alert is silenced
3. Expire or delete the silence if it is wrong

**Solution 4: Check inhibition rules**

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
    # Critical alerts inhibit warnings for the same instance
```

#### Prevention

- **Clear routing logic** - Simple, understandable rules
- **Test routing** - Test before deploying
- **Documentation** - Document routing rules
- **Regular review** - Review silences and inhibitions

---
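The routing rules above resolve by first match: Alertmanager walks the child routes in order and uses the first one whose matchers all equal the alert's labels, falling back to the parent receiver. The sketch below models only that behavior — it deliberately ignores `continue:`, regex matchers, and nested routes, so treat it as an aid for reading configs, not the real algorithm:

```typescript
// Simplified first-match routing, mirroring the config in Solution 1.
interface Route {
  receiver: string;
  match: Record<string, string>;
}

function resolveReceiver(
  defaultReceiver: string,
  routes: Route[],
  labels: Record<string, string>
): string {
  for (const route of routes) {
    const matches = Object.entries(route.match).every(
      ([key, value]) => labels[key] === value
    );
    if (matches) return route.receiver;
  }
  return defaultReceiver; // no child route matched
}

const routes: Route[] = [
  { receiver: "pager", match: { severity: "critical" } },
  { receiver: "email", match: { severity: "warning" } },
];

console.log(resolveReceiver("default", routes, { alertname: "APIDown", severity: "critical" })); // "pager"
console.log(resolveReceiver("default", routes, { severity: "info" })); // "default"
```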
## Metrics Issues

### Missing Metrics

**Severity:** 🟡 Medium

#### Symptoms

An expected metric is not appearing in Prometheus or Grafana.

#### Solutions

**Solution 1: Check the metric is registered**

In the API code (`api/src/utils/metrics.ts`):

```typescript
import { Counter } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter); // Must register!
```

**Solution 2: Check the metric is collected**

```bash
# Test the /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42
```

**Solution 3: Check the scrape config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:  # Don't accidentally drop the metric
      - source_labels: [__name__]
        regex: 'cm_.*'  # Keep cm_* metrics
        action: keep
```

**Solution 4: Verify the metric type**

```typescript
// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc(); // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Queue depth' });
gauge.set(42); // Set the value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration' });
histogram.observe(0.5); // Record a duration
```

#### Prevention

- **Register all metrics** - Don't forget `register.registerMetric()`
- **Test the endpoint** - Check /metrics shows the metric
- **Naming convention** - Use the `cm_*` prefix for custom metrics
- **Documentation** - Document all custom metrics

---
### Incorrect Values

**Severity:** 🟡 Medium

#### Symptoms

A metric is showing wrong or unexpected values.

#### Solutions

**Solution 1: Check the metric logic**

```typescript
// Wrong - gauge never updated, always reports 0
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total users' });

// Right - gauge updated on each scrape
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});
```

**Solution 2: Check the metric type**

```typescript
// Wrong - using a Counter for a value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue depth' });
queueSize.inc(50);  // Add 50
queueSize.inc(-20); // Try to subtract 20 - throws an error!

// Right - use a Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue depth' });
queueSize.set(50); // Set to 50
queueSize.set(30); // Set to 30 (can decrease)
```

**Solution 3: Check label values**

```typescript
// Label values should always be strings
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 }); // Risky - number instead of string
// Mixing types for the same label can produce inconsistent series
```

**Solution 4: Check query aggregation**

```promql
# Collapses every label - one total across all series
sum(cm_requests_total)

# Keeps a per-status breakdown
sum by (status) (cm_requests_total)
```

#### Prevention

- **Correct metric type** - Counter vs Gauge vs Histogram
- **Type consistency** - Keep label values the same type (strings)
- **Testing** - Test metric values with sample data
- **Validation** - Validate that metric values are reasonable

---
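One defensive option for the label-type pitfall in Solution 3 is to stringify label values before they ever reach the metric. The helper below is hypothetical — not part of the Changemaker codebase — and simply normalizes a label object:

```typescript
// Coerce every label value to a string so callers can't accidentally
// mix `200` and `"200"` for the same label.
function normalizeLabels(
  labels: Record<string, string | number>
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(labels)) {
    out[key] = String(value);
  }
  return out;
}

console.log(normalizeLabels({ method: "GET", status: 200 }));
// -> { method: "GET", status: "200" }
```

Wrapping `counter.inc(normalizeLabels(labels))` at every call site keeps the exposed series consistent regardless of what type the caller passed.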
### Stale Metrics

**Severity:** 🟢 Low

#### Symptoms

Metric values are not updating and show old data.

#### Solutions

**Solution 1: Check the collection frequency**

```typescript
// Metrics with a collect() hook are only updated when scraped
const gauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue depth',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  }
});
```

**Solution 2: Force a metric update**

```typescript
// Update the metric on the event, not just on scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});
```

**Solution 3: Check the scrape interval**

In `configs/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s  # Scrape every 15 seconds

# Lower the interval for fresher data (at the cost of more load):
# global:
#   scrape_interval: 5s  # Scrape every 5 seconds
```

#### Prevention

- **Appropriate intervals** - Balance freshness against overhead
- **Event-driven updates** - Update on change, not just on scrape
- **Cache expensive metrics** - Don't query the DB on every scrape
- **Staleness markers** - Set metrics to NaN when the value is stale

---
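The "staleness markers" idea in the Prevention list can be sketched as a small guard: report `NaN` when a cached value is older than a threshold, so dashboards show a gap instead of a frozen number. This is an illustrative helper, not existing Changemaker code:

```typescript
// Return the cached value if it is fresh, otherwise NaN so the
// series visibly goes stale instead of silently freezing.
function freshOrNaN(
  value: number,
  lastUpdatedMs: number,
  nowMs: number,
  maxAgeMs: number
): number {
  return nowMs - lastUpdatedMs > maxAgeMs ? NaN : value;
}

console.log(freshOrNaN(42, 0, 30_000, 60_000)); // 42 (30s old, still fresh)
console.log(freshOrNaN(42, 0, 90_000, 60_000)); // NaN (90s old, stale)
```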
## Performance Issues

### High Memory Usage

**Severity:** 🟠 High

#### Symptoms

The Prometheus container is using excessive memory (multiple GB).

#### Solutions

**Solution 1: Reduce the retention period**

In `docker-compose.yml`:

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'   # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB' # Add a size limit
```

Restart:

```bash
docker compose --profile monitoring restart prometheus
```

**Solution 2: Reduce metric cardinality**

```typescript
// Bad - creates a time series per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - creates a time series per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});
```

**Solution 3: Drop unnecessary metrics**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Drop Go/process metrics
        action: drop
```

**Solution 4: Increase the memory limit**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G  # Increase from 2G
```

#### Prevention

- **Low cardinality** - Avoid high-cardinality labels
- **Appropriate retention** - 7-30 days is usually enough
- **Regular cleanup** - Drop unused metrics
- **Monitor memory** - Alert on high usage

---
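To connect the cardinality and memory advice: Prometheus keeps every active series in its in-memory head block, and a commonly cited rule of thumb is a few KiB of RAM per active series. The exact figure varies by version and churn, so the helper below (with an assumed ~3 KiB/series) is strictly an order-of-magnitude check, not a sizing guarantee:

```typescript
// Rough head-block memory estimate in MiB, assuming ~3 KiB per active
// time series. Treat the constant as an assumption, not a spec.
function estimatedHeadMemoryMiB(
  activeSeries: number,
  bytesPerSeries: number = 3072
): number {
  return (activeSeries * bytesPerSeries) / (1024 * 1024);
}

// 100k active series at ~3 KiB each -> roughly 293 MiB of head memory
console.log(Math.round(estimatedHeadMemoryMiB(100_000))); // 293
```

If the estimate for your series count is far below what the container actually uses, look for cardinality explosions (Solution 2) before raising the memory limit.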
### Slow Queries

**Severity:** 🟡 Medium

#### Symptoms

Grafana dashboards are slow to load. Queries are taking 10+ seconds.

#### Solutions

**Solution 1: Optimize the query**

```promql
# Slow - calculates the rate over a huge window
rate(cm_requests_total[1y])

# Fast - only the last 5 minutes
rate(cm_requests_total[5m])

# Per-second request rate summed across series
sum(rate(cm_requests_total[5m]))

# Roughly equivalent alternative - increase() over the window,
# divided by the window length in seconds
sum(increase(cm_requests_total[5m])) / 300
```

**Solution 2: Use recording rules**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate the expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use it in the dashboard:
# job:cm_request_rate:sum  # Fast!
```

**Solution 3: Reduce the time range**

In Grafana:

- Change the dashboard time range from "Last 30 days" to "Last 24 hours"
- Queries are faster with less data

**Solution 4: Increase Prometheus resources**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0'  # More CPU for queries
        memory: 4G
```

#### Prevention

- **Efficient queries** - Keep queries simple
- **Recording rules** - Pre-calculate expensive queries
- **Appropriate time ranges** - Don't query months of data
- **Cardinality** - Prometheus indexes automatically, but high cardinality still slows queries

---
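To see why the `[1y]` query above is expensive, it helps to know roughly what `rate()` does: it derives a per-second increase from the samples inside the window, so a wider window means far more samples to touch. The sketch below is a naive version using only the first and last samples; real PromQL also handles counter resets and extrapolates to the window boundaries:

```typescript
// A sample is one (timestamp, value) point from a counter's time series.
interface Sample {
  timestampSec: number;
  value: number;
}

// Naive per-second rate: value delta over time delta for the window.
function naiveRate(samples: Sample[]): number {
  if (samples.length < 2) return 0; // not enough points to compute a rate
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.timestampSec - first.timestampSec);
}

// Counter went from 100 to 400 over a 300s window -> 1 request/second
console.log(naiveRate([
  { timestampSec: 0, value: 100 },
  { timestampSec: 300, value: 400 },
])); // 1
```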
## Useful Commands

### Prometheus Operations

```bash
# Check targets
curl http://localhost:9090/api/v1/targets

# Query a metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query a range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```

### Grafana Operations

```bash
# Test a datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl 'http://admin:admin@localhost:3001/api/search?type=dash-db'

# Export a dashboard (wrapped so the file can be POSTed back as-is)
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID | jq '{dashboard: .dashboard, overwrite: true}' > dashboard.json

# Import a dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```

### Alertmanager Operations

```bash
# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send a test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create a silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'
```

---
## Related Documentation

### Monitoring Documentation

- [Monitoring Issues](monitoring-issues.md) - This guide
- [Observability Dashboard](../user-guides/observability-dashboard.md) - Using the dashboard
- [Monitoring Guide](../deployment/monitoring.md) - Setup and configuration

### Other Troubleshooting

- [Common Errors](common-errors.md) - General errors
- [Performance Optimization](performance-optimization.md) - Performance tuning

### External Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [PromQL Tutorial](https://prometheus.io/docs/prometheus/latest/querying/basics/)

---

**Last Updated:** February 2026
**Version:** V2.0
**Status:** Complete