# Monitoring and Observability Issues

This guide covers Prometheus, Grafana, and observability stack problems in Changemaker Lite V2.

## Overview

### Monitoring Stack

Changemaker Lite V2 uses **profile-based monitoring** (optional):

```bash
# Start with monitoring
docker compose --profile monitoring up -d
```

**Components:**

- **Prometheus** - Metrics collection and storage (port 9090)
- **Grafana** - Metrics visualization (port 3001)
- **Alertmanager** - Alert routing and notification (port 9093)
- **cAdvisor** - Container metrics (port 8080)
- **Node Exporter** - Host metrics (port 9100)
- **Redis Exporter** - Redis metrics (port 9121)

### Custom Metrics

12 custom `cm_*` Prometheus metrics:

1. `cm_api_uptime_seconds` - API uptime
2. `cm_database_uptime_seconds` - Database uptime
3. `cm_email_queue_size` - Email queue depth
4. `cm_geocoding_queue_size` - Geocoding queue depth
5. `cm_users_total` - Total users
6. `cm_campaigns_total` - Total campaigns
7. `cm_locations_total` - Total locations
8. `cm_geocoded_locations_total` - Geocoded locations
9. `cm_active_canvass_sessions` - Active sessions
10. `cm_external_service_up` - Service health (0/1)
11. `cm_listmonk_subscribers_total` - Listmonk subscribers
12. `cm_media_videos_total` - Total videos

Plus standard HTTP metrics:

- `http_request_duration_seconds`
- `http_requests_total`

---

## Prometheus Not Scraping

### Target Down

**Severity:** 🔴 Critical

#### Symptoms

Prometheus UI (localhost:9090) shows targets as "DOWN":

```
Target: api (localhost:4000/metrics)
State: DOWN
Error: Get "http://api:4000/metrics": connection refused
```

No data in Grafana dashboards.

#### Common Causes

1. **Service not running** - API container stopped
2. **Metrics endpoint missing** - /metrics endpoint not registered
3. **Network issue** - Prometheus can't reach the service
4. **Authentication required** - Metrics endpoint requires auth

#### Solutions

**Solution 1: Check service is running**

```bash
# Is API running?
docker compose ps api

# Should show "Up"

# If not:
docker compose up -d api
```

**Solution 2: Test metrics endpoint**

```bash
# From host
curl http://localhost:4000/metrics

# Should return Prometheus metrics:
# # HELP cm_api_uptime_seconds API uptime in seconds
# # TYPE cm_api_uptime_seconds gauge
# cm_api_uptime_seconds 123.45

# From Prometheus container
docker compose exec prometheus wget -O- http://api:4000/metrics
```

**Solution 3: Check Prometheus config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']  # Use service name, not localhost
```

**Solution 4: Verify network**

```bash
# Both on same network?
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
docker inspect changemaker-lite-api-1 | grep NetworkMode

# Should both show "changemaker-lite"
```

**Solution 5: Check metrics are registered**

In API logs:

```bash
docker compose logs api | grep -i "metrics\|prometheus"

# Should show:
# Metrics endpoint registered at /metrics
# Prometheus metrics initialized
```

#### Prevention

- **Health checks** - Monitor Prometheus target health
- **Service dependencies** - Ensure services start in order
- **Network config** - Use Docker service names
- **Testing** - Test /metrics endpoint on deploy

---

### Scrape Timeout

**Severity:** 🟡 Medium

#### Symptoms

```
Target: api
State: UP
Last Scrape: 5.2s (slow)
Last Error: context deadline exceeded
```

Scrapes taking too long or timing out.
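
Before raising any timeouts, it helps to confirm which scrapes are actually slow. Prometheus records a `scrape_duration_seconds` sample for every target, so a quick API query (a diagnostic sketch assuming the localhost:9090 port mapping used above and that `jq` is installed on the host) shows where the time is going:

```bash
# Per-target scrape durations in seconds, slowest first
curl -s 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' \
  | jq -r '.data.result[] | "\(.metric.job)\t\(.value[1])"' \
  | sort -k2 -rn
```

Durations close to the configured `scrape_timeout` point at a slow /metrics endpoint rather than a Prometheus problem.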

#### Solutions

**Solution 1: Increase scrape timeout**

In `configs/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s      # Default; must not exceed scrape_interval

scrape_configs:
  - job_name: 'api'
    scrape_interval: 30s   # Scrape less frequently
    scrape_timeout: 20s    # Allow slower scrapes for this job
    static_configs:
      - targets: ['api:4000']
```

Reload config:

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Or restart
docker compose restart prometheus
```

**Solution 2: Optimize metrics generation**

```typescript
// In api/src/utils/metrics.ts

// Cache expensive metrics
let cachedUserCount = 0;
let lastUserCountUpdate = 0;

register.registerMetric(new Gauge({
  name: 'cm_users_total',
  help: 'Total number of users',
  async collect() {
    const now = Date.now();
    // Only query the database every 60 seconds
    if (now - lastUserCountUpdate > 60000) {
      cachedUserCount = await prisma.user.count();
      lastUserCountUpdate = now;
    }
    this.set(cachedUserCount);
  }
}));
```

**Solution 3: Reduce metric cardinality**

```typescript
// Bad - high cardinality (creates a time series per user)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']  // Don't do this!
});

// Good - low cardinality
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']    // Only 5 roles
});
```

#### Prevention

- **Cache expensive metrics** - Don't query the DB on every scrape
- **Reasonable timeouts** - 10-30s timeouts
- **Low cardinality** - Avoid high-cardinality labels
- **Optimize queries** - Fast metric queries

---

### Authentication Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error: 401 Unauthorized when scraping /metrics
```

#### Solutions

The Changemaker Lite V2 metrics endpoint is **public** (no auth required). If you see auth errors:

**Solution 1: Remove auth middleware from /metrics**

In `api/src/server.ts`:

```typescript
// Metrics endpoint should be registered BEFORE the authenticate middleware
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Auth middleware comes after
app.use(authenticate);
```

**Solution 2: Configure basic auth in Prometheus**

If you DO want to protect /metrics, in `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    basic_auth:
      username: 'prometheus'
      password: 'your-password'
```

#### Prevention

- **Public metrics** - Keep /metrics public for simplicity
- **Network isolation** - Use Docker networks for security
- **IP whitelist** - Only allow the Prometheus IP

---

## Grafana Issues

### Dashboards Not Loading

**Severity:** 🟠 High

#### Symptoms

Grafana shows blank dashboards or "No data" panels.

#### Solutions

**Solution 1: Check Grafana is running**

```bash
docker compose --profile monitoring ps grafana

# Should show "Up"

# If not:
docker compose --profile monitoring up -d grafana
```

**Solution 2: Verify Prometheus datasource**

1. Open Grafana: http://localhost:3001
2. Login (admin/admin)
3. Settings → Data Sources
4. Click Prometheus
5. URL should be: `http://prometheus:9090`
6. Click "Save & Test"
7. Should show "Data source is working"

**Solution 3: Check dashboard provisioning**

```bash
# List provisioned dashboards
docker compose exec grafana ls -la /etc/grafana/provisioning/dashboards/

# Should show:
# dashboard-provider.yml
# changemaker-api.json
# changemaker-queue.json
# changemaker-external-services.json
```

**Solution 4: Import dashboard manually**

If auto-provisioning fails:

1. Grafana → Dashboards → Import
2. Upload JSON from `configs/grafana/dashboards/`
3. Select Prometheus datasource
4. Click Import
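
If the UI import also fails, or you want to script it, the same JSON can be pushed through the Grafana HTTP API. A minimal sketch, assuming the default admin/admin credentials, the port 3001 mapping, one of the provisioned dashboard files listed above, and `jq` on the host:

```bash
# Wrap the dashboard JSON in the payload the import endpoint expects, then POST it
jq -n --slurpfile d configs/grafana/dashboards/changemaker-api.json \
  '{dashboard: $d[0], overwrite: true, folderId: 0}' \
  | curl -s -X POST http://admin:admin@localhost:3001/api/dashboards/db \
      -H 'Content-Type: application/json' \
      -d @-
```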

**Solution 5: Check for data**

```bash
# Test query in Grafana Explore
# Query: cm_api_uptime_seconds

# Or test in Prometheus:
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'
```

#### Prevention

- **Dashboard versioning** - Keep dashboards in git
- **Auto-provisioning** - Use provisioning instead of manual import
- **Testing** - Test dashboards after changes
- **Documentation** - Document dashboard variables

---

### Datasource Errors

**Severity:** 🟠 High

#### Symptoms

```
Error: Failed to query Prometheus
Error: connection refused
```

Red error bars on Grafana panels.

#### Solutions

**Solution 1: Test Prometheus connection**

```bash
# From Grafana container
docker compose exec grafana wget -O- http://prometheus:9090/api/v1/query?query=up

# Should return JSON:
# {"status":"success","data":{"resultType":"vector","result":[...]}}
```

**Solution 2: Check Prometheus is running**

```bash
docker compose --profile monitoring ps prometheus

# Should show "Up"
```

**Solution 3: Verify datasource URL**

In Grafana datasource settings:

- URL: `http://prometheus:9090` (NOT `http://localhost:9090`)
- Access: Server (NOT Browser)

**Solution 4: Check Docker network**

```bash
# Same network?
docker inspect changemaker-lite-grafana-1 | grep NetworkMode
docker inspect changemaker-lite-prometheus-1 | grep NetworkMode
```

#### Prevention

- **Health checks** - Monitor datasource health
- **Service dependencies** - Start Prometheus before Grafana
- **Error handling** - Graceful error messages

---

### Query Errors

**Severity:** 🟡 Medium

#### Symptoms

```
Error executing query: parse error at char X: unexpected identifier
```

Panel shows "Error loading data".

#### Solutions

**Solution 1: Validate PromQL syntax**

Common errors:

```promql
# Bad - unquoted label value
cm_api_uptime_seconds{job=api}

# Good
cm_api_uptime_seconds{job="api"}

# Bad - wrong function name
average(cm_api_uptime_seconds)

# Good
avg(cm_api_uptime_seconds)
```

**Solution 2: Test query in Explore**

1. Grafana → Explore
2. Enter query
3. Run
4. Fix errors before adding to dashboard

**Solution 3: Check metric exists**

```bash
# List all metrics
curl http://localhost:9090/api/v1/label/__name__/values | jq

# Search for metric
curl http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep cm_
```

**Solution 4: Use metric browser**

In Grafana query editor:

1. Click "Metrics" button
2. Browse available metrics
3. Select metric (auto-fills query)

#### Prevention

- **Query validation** - Validate before saving
- **Testing** - Test queries in Explore
- **Documentation** - Document available metrics
- **Examples** - Provide query examples

---

## Alertmanager Issues

### Alerts Not Firing

**Severity:** 🟠 High

#### Symptoms

Conditions met but alert not triggering.

#### Solutions

**Solution 1: Check alert rules**

In Prometheus UI (localhost:9090):

1. Click "Alerts"
2. Find your alert
3. Check its state:
   - Inactive: Condition not met
   - Pending: Condition met, waiting for the `for:` duration
   - Firing: Alert active
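
The same state information is available from the Prometheus HTTP API, which is handy when you only have a terminal. A quick check, assuming the localhost:9090 mapping and `jq`:

```bash
# Show each alerting rule and its current state (inactive / pending / firing)
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.type == "alerting") | {name, state}'

# List alerts that are currently pending or firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts'
```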

**Solution 2: Verify alert rule syntax**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: changemaker_alerts
    interval: 30s
    rules:
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m  # Must be down for 1 minute before firing
        labels:
          severity: critical
        annotations:
          summary: "API is down"
          description: "API has been down for 1 minute"
```

**Solution 3: Check Alertmanager config**

```bash
# Test Alertmanager
curl http://localhost:9093/api/v1/alerts

# Should return alert list
```

**Solution 4: View Prometheus logs**

```bash
docker compose logs prometheus | grep -i alert

# Shows:
# Loaded alert rules
# Alert X is firing
```

**Solution 5: Reload alert rules**

```bash
# Reload Prometheus config
docker compose exec prometheus kill -HUP 1

# Check rules loaded
curl http://localhost:9090/api/v1/rules
```

#### Prevention

- **Test alert conditions** - Trigger manually to test
- **Reasonable thresholds** - Not too sensitive or too lenient
- **Documentation** - Document alert thresholds
- **Regular review** - Review alert effectiveness

---

### Notifications Not Sent

**Severity:** 🟡 Medium

#### Symptoms

Alert firing in Prometheus but no notification received.

#### Solutions

**Solution 1: Check Alertmanager config**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'email'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'your-email@gmail.com'
        auth_password: 'your-app-password'
```

**Solution 2: Test Alertmanager notification**

```bash
# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": { "alertname": "Test", "severity": "critical" },
    "annotations": { "summary": "Test alert" }
  }]'

# Check if notification sent
docker compose logs alertmanager | grep -i "notification\|email"
```

**Solution 3: Check SMTP config**

See [Email Issues](email-issues.md#smtp-configuration) for SMTP troubleshooting.

**Solution 4: Use alternative notification channels**

```yaml
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'

  - name: 'webhook'
    webhook_configs:
      - url: 'http://your-webhook-url.com/alerts'
```

#### Prevention

- **Test notifications** - Regular notification tests
- **Multiple channels** - Email + Slack + webhook
- **Fallback receivers** - Backup notification method
- **Documentation** - Document notification setup

---

### Routing Errors

**Severity:** 🟡 Medium

#### Symptoms

Alerts going to the wrong receiver or being silenced incorrectly.

#### Solutions

**Solution 1: Check routing rules**

In `configs/alertmanager/alertmanager.yml`:

```yaml
route:
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'
    - match:
        severity: warning
      receiver: 'email'
```

**Solution 2: Test routing**

```bash
# Use amtool to test routing
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=TestAlert severity=critical

# Shows which receiver will be used
```

**Solution 3: View active silences**

In Alertmanager UI (localhost:9093):

1. Click "Silences"
2. Check if the alert is silenced
3. Expire or delete the silence if it is wrong
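
Silences can also be listed and removed from the command line with `amtool`, which is bundled in the Alertmanager image. A sketch, assuming Alertmanager listens on its default port inside the container:

```bash
# List active silences (note the silence IDs in the output)
docker compose exec alertmanager amtool silence query \
  --alertmanager.url=http://localhost:9093

# Expire a silence that should not be active (replace <silence-id>)
docker compose exec alertmanager amtool silence expire <silence-id> \
  --alertmanager.url=http://localhost:9093
```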

**Solution 4: Check inhibition rules**

```yaml
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'instance']
    # Critical alerts inhibit warnings for the same instance
```

#### Prevention

- **Clear routing logic** - Simple, understandable rules
- **Test routing** - Test before deploying
- **Documentation** - Document routing rules
- **Regular review** - Review silences and inhibitions

---

## Metrics Issues

### Missing Metrics

**Severity:** 🟡 Medium

#### Symptoms

Expected metric not appearing in Prometheus or Grafana.

#### Solutions

**Solution 1: Check metric is registered**

In API code (`api/src/utils/metrics.ts`):

```typescript
import { register, Counter } from 'prom-client';

const requestCounter = new Counter({
  name: 'cm_my_metric_total',
  help: 'Description of metric'
});

register.registerMetric(requestCounter);  // Must register!
```

**Solution 2: Check metric is collected**

```bash
# Test /metrics endpoint
curl http://localhost:4000/metrics | grep cm_my_metric

# Should show:
# # HELP cm_my_metric_total Description of metric
# # TYPE cm_my_metric_total counter
# cm_my_metric_total 42
```

**Solution 3: Check scrape config**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Don't accidentally drop the metric
      - source_labels: [__name__]
        regex: 'cm_.*'
        action: keep  # Keeps cm_* metrics; everything else from this job is dropped
```

**Solution 4: Verify metric type**

```typescript
// Counter - only increases (counts)
const counter = new Counter({ name: 'cm_requests_total', help: 'Total requests' });
counter.inc();  // Increment

// Gauge - can go up or down (current value)
const gauge = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
gauge.set(42);  // Set value

// Histogram - distribution of values
const histogram = new Histogram({ name: 'cm_request_duration_seconds', help: 'Request duration' });
histogram.observe(0.5);  // Record duration
```

#### Prevention

- **Register all metrics** - Don't forget register.registerMetric()
- **Test endpoint** - Check /metrics shows the metric
- **Naming convention** - Use the cm_* prefix for custom metrics
- **Documentation** - Document all custom metrics

---

### Incorrect Values

**Severity:** 🟡 Medium

#### Symptoms

Metric showing wrong or unexpected values.

#### Solutions

**Solution 1: Check metric logic**

```typescript
// Wrong - gauge never updated
const gauge = new Gauge({ name: 'cm_users_total', help: 'Total users' });
// Never set, always 0

// Right - gauge updated on each scrape
const gauge = new Gauge({
  name: 'cm_users_total',
  help: 'Total users',
  async collect() {
    const count = await prisma.user.count();
    this.set(count);
  }
});
```

**Solution 2: Check metric type**

```typescript
// Wrong - using a Counter for a value that can decrease
const queueSize = new Counter({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.inc(50);   // Add 50
queueSize.inc(-20);  // Try to subtract 20 - ERROR!

// Right - use a Gauge
const queueSize = new Gauge({ name: 'cm_queue_size', help: 'Queue size' });
queueSize.set(50);  // Set to 50
queueSize.set(30);  // Set to 30 (can decrease)
```

**Solution 3: Check label values**

```typescript
// Labels must match exactly
const counter = new Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'status']
});

counter.inc({ method: 'GET', status: '200' });
// Creates: requests_total{method="GET",status="200"} 1

counter.inc({ method: 'GET', status: 200 });
// Avoid - pass label values as strings and keep the type consistent,
// otherwise it is easy to end up with mismatched or duplicated series
```

**Solution 4: Check query aggregation**

```promql
# Wrong - sums across all labels
sum(cm_requests_total)

# Right - sum by a specific label
sum by (status) (cm_requests_total)
```

#### Prevention

- **Correct metric type** - Counter vs Gauge vs Histogram
- **Type consistency** - Label values always the same type
- **Testing** - Test metric values with sample data
- **Validation** - Validate that metric values are reasonable

---

### Stale Metrics

**Severity:** 🟢 Low

#### Symptoms

Metric values not updating, showing old data.

#### Solutions

**Solution 1: Check collection frequency**

```typescript
// Metrics with a collect() callback only update when scraped
const gauge = new Gauge({
  name: 'cm_queue_size',
  help: 'Queue size',
  async collect() {
    // This runs on every Prometheus scrape (every 15s)
    const size = await getQueueSize();
    this.set(size);
  }
});
```

**Solution 2: Force metric update**

```typescript
// Update the metric on the event, not just on scrape
eventEmitter.on('queueSizeChanged', (size) => {
  queueSizeGauge.set(size);
});
```

**Solution 3: Check scrape interval**

In `configs/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s  # Scrape every 15 seconds

# Decrease the interval for more frequent updates
global:
  scrape_interval: 5s   # Scrape every 5 seconds
```

#### Prevention

- **Appropriate intervals** - Balance freshness vs overhead
- **Event-driven updates** - Update on change, not just on scrape
- **Cache expensive metrics** - Don't query the DB on every scrape
- **Staleness markers** - Set metrics to NaN when stale

---

## Performance Issues

### High Memory Usage

**Severity:** 🟠 High

#### Symptoms

Prometheus container using excessive memory (multiple GB).
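
Before tuning anything, confirm where the memory is going. `docker stats` shows the container's live usage, and Prometheus' TSDB status endpoint reports how many active series it is holding - a high series count usually points at a cardinality problem. A quick check, assuming the container name used earlier in this guide and a reasonably recent Prometheus release:

```bash
# Live memory usage of the Prometheus container
docker stats --no-stream changemaker-lite-prometheus-1

# Active series count and the metrics contributing the most series
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '{numSeries: .data.headStats.numSeries, topMetrics: .data.seriesCountByMetricName[:5]}'
```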

#### Solutions

**Solution 1: Reduce retention period**

In `docker-compose.yml`:

```yaml
prometheus:
  command:
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.retention.time=7d'    # Reduce from 15d to 7d
    - '--storage.tsdb.retention.size=10GB'  # Add size limit
```

Restart:

```bash
docker compose --profile monitoring restart prometheus
```

**Solution 2: Reduce metric cardinality**

```typescript
// Bad - creates a time series per user (thousands)
new Counter({
  name: 'requests_by_user',
  help: 'Requests per user',
  labelNames: ['userId']
});

// Good - creates a time series per role (5)
new Counter({
  name: 'requests_by_role',
  help: 'Requests per role',
  labelNames: ['role']
});
```

**Solution 3: Drop unnecessary metrics**

In `configs/prometheus/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:4000']
    metric_relabel_configs:
      # Drop metrics we don't use
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'  # Drop Go/process metrics
        action: drop
```

**Solution 4: Increase memory limit**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G  # Increase from 2G
```

#### Prevention

- **Low cardinality** - Avoid high-cardinality labels
- **Appropriate retention** - 7-30 days is usually enough
- **Regular cleanup** - Drop unused metrics
- **Monitor memory** - Alert on high usage

---

### Slow Queries

**Severity:** 🟡 Medium

#### Symptoms

Grafana dashboards slow to load. Queries taking 10+ seconds.

#### Solutions

**Solution 1: Optimize query**

```promql
# Slow - calculates rate over a full year of samples
rate(cm_requests_total[1y])

# Fast - only the last 5 minutes
rate(cm_requests_total[5m])

# Expensive when there are many time series - pre-calculate it
# with a recording rule instead (see Solution 2)
sum(rate(cm_requests_total[5m]))
```

**Solution 2: Use recording rules**

In `configs/prometheus/alerts.yml`:

```yaml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-calculate the expensive query every 30s
      - record: job:cm_request_rate:sum
        expr: sum(rate(cm_requests_total[5m])) by (job)

# Then use in the dashboard:
#   job:cm_request_rate:sum   # Fast!
```
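
After adding a recording rule, confirm that Prometheus has loaded it and is producing the new series. A quick check using the rule name above and the `promtool` binary bundled in the Prometheus image:

```bash
# Validate the rules file, then reload Prometheus
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
docker compose exec prometheus kill -HUP 1

# The recorded series should return data within one evaluation interval (30s here)
curl -s 'http://localhost:9090/api/v1/query?query=job:cm_request_rate:sum' | jq '.data.result'
```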

**Solution 3: Reduce time range**

In Grafana:

- Change dashboard time range from "Last 30 days" to "Last 24 hours"
- Queries are faster with less data

**Solution 4: Increase Prometheus resources**

```yaml
prometheus:
  deploy:
    resources:
      limits:
        cpus: '2.0'  # More CPU for queries
        memory: 4G
```

#### Prevention

- **Efficient queries** - Keep queries simple
- **Recording rules** - Pre-calculate expensive queries
- **Appropriate time ranges** - Don't query months of data
- **Indexing** - Prometheus auto-indexes, but cardinality affects performance

---

## Useful Commands

### Prometheus Operations

```bash
# Check targets
curl http://localhost:9090/api/v1/targets

# Query metric
curl 'http://localhost:9090/api/v1/query?query=cm_api_uptime_seconds'

# Query range
curl 'http://localhost:9090/api/v1/query_range?query=cm_api_uptime_seconds&start=2026-02-13T00:00:00Z&end=2026-02-13T23:59:59Z&step=15s'

# Reload config
docker compose exec prometheus kill -HUP 1

# Check config
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Check rules
docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```

### Grafana Operations

```bash
# Test datasource
curl http://admin:admin@localhost:3001/api/datasources/1/health

# List dashboards
curl http://admin:admin@localhost:3001/api/search?type=dash-db

# Export dashboard (wrapped so it can be re-imported directly)
curl http://admin:admin@localhost:3001/api/dashboards/uid/YOUR_UID \
  | jq '{dashboard: .dashboard, overwrite: true}' > dashboard.json

# Import dashboard
curl -X POST http://admin:admin@localhost:3001/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```

### Alertmanager Operations

Note: these examples use the v1 HTTP API; newer Alertmanager releases (0.27+) remove it, so use the `/api/v2/...` equivalents there.

```bash
# Check alerts
curl http://localhost:9093/api/v1/alerts

# Send test alert
curl -X POST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"Test","severity":"critical"},"annotations":{"summary":"Test"}}]'

# List silences
curl http://localhost:9093/api/v1/silences

# Create silence
curl -X POST http://localhost:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"Test"}],"startsAt":"2026-02-13T00:00:00Z","endsAt":"2026-02-14T00:00:00Z","createdBy":"admin","comment":"Test silence"}'
```

---

## Related Documentation

### Monitoring Documentation

- [Monitoring Issues](monitoring-issues.md) - This guide
- [Observability Dashboard](../user-guides/observability-dashboard.md) - Using the dashboard
- [Monitoring Guide](../deployment/monitoring.md) - Setup and configuration

### Other Troubleshooting

- [Common Errors](common-errors.md) - General errors
- [Performance Optimization](performance-optimization.md) - Performance tuning

### External Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [PromQL Tutorial](https://prometheus.io/docs/prometheus/latest/querying/basics/)

---

**Last Updated:** February 2026
**Version:** V2.0
**Status:** Complete