# Docker Health Check Configuration

## Overview

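Docker health checks run a command inside each container on a schedule and record whether the service is `starting`, `healthy`, or `unhealthy`. Changemaker Lite V2 defines health checks for 7 critical services.

**Benefits:**
- Early detection of failed services (`docker compose ps` shows the health state)
- Dependency management (`depends_on` with `service_healthy`)
- Automatic recovery when paired with something that acts on health status (plain Docker Compose marks a container `unhealthy` but does not restart it; Docker Swarm replaces unhealthy tasks)
- Monitoring integration (health status can be exposed to Prometheus through an exporter)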

---

## Services with Health Checks

| Service | Healthcheck Command | Interval | Timeout | Retries | Start Period |
|---------|---------------------|----------|---------|---------|--------------|
| **api** | `wget http://localhost:4000/api/health` | 15s | 5s | 3 | 30s |
| **media-api** | `wget http://127.0.0.1:4100/health` | 15s | 5s | 3 | 30s |
| **admin** | `wget http://127.0.0.1:3000/` | 30s | 5s | 3 | 20s |
| **v2-postgres** | `pg_isready -U changemaker` | 10s | 5s | 5 | - |
| **redis** | `redis-cli -a $REDIS_PASSWORD ping` | 10s | 5s | 5 | - |
| **gitea-app** | `curl http://localhost:3000/` | 30s | 5s | 3 | 30s |
| **n8n** | `wget http://localhost:5678/healthz` | 30s | 5s | 3 | 30s |
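
The interval, timeout, and retries columns together bound how long a failed service can go unnoticed. The sketch below works through that arithmetic. It assumes Docker waits one `interval` after each check finishes and that, in the worst case, every failing check runs for the full `timeout`. The `detectionWindowSec` helper and its field names are illustrative and not part of the project.

```typescript
// Illustrative helper (not project code): estimates how long a dead service
// can remain marked "healthy" under a given health check configuration.
interface HealthcheckTiming {
  intervalSec: number;
  timeoutSec: number;
  retries: number;
}

function detectionWindowSec(t: HealthcheckTiming): { best: number; worst: number } {
  // Best case: the service dies just before a check and each failure is instant.
  const best = (t.retries - 1) * t.intervalSec;
  // Worst case: the service dies right after a passing check and each
  // failing check hangs until the timeout.
  const worst = t.retries * (t.intervalSec + t.timeoutSec);
  return { best, worst };
}

// Values taken from the table above.
const api = detectionWindowSec({ intervalSec: 15, timeoutSec: 5, retries: 3 });
const postgres = detectionWindowSec({ intervalSec: 10, timeoutSec: 5, retries: 5 });

console.log(api);      // { best: 30, worst: 60 }
console.log(postgres); // { best: 40, worst: 75 }
```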

---

## Health Check Configuration

### API (Express)

**docker-compose.yml**:
```yaml
api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **test**: Runs `wget` (BusyBox `wget`, included in Alpine images) against the `/api/health` endpoint
- **interval**: Check every 15 seconds
- **timeout**: Fail the check if there is no response within 5 seconds
- **retries**: Mark unhealthy after 3 consecutive failures
- **start_period**: 30s grace period on startup (allows migrations to run); failed checks in this window do not count toward `retries`

**Health endpoint** (api/src/server.ts):
```typescript
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
```

**Health states**:
- **starting**: From container start until the first check passes (failures during the 30s `start_period` are not counted)
- **healthy**: Most recent check passed
- **unhealthy**: 3 consecutive failures
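
The transitions between these states can be modelled in a few lines of TypeScript. This is a simplified sketch of the documented behaviour, not Docker's implementation. The `applyProbe` and `simulate` names are hypothetical, and the defaults mirror the API settings above (`retries: 3`, `start_period: 30s`).

```typescript
type HealthStatus = "starting" | "healthy" | "unhealthy";

interface HealthState {
  status: HealthStatus;
  failingStreak: number;
}

interface Probe {
  atSec: number; // seconds since container start
  ok: boolean;   // true if the check command exited 0
}

// Simplified model of the state transitions (not Docker's source code).
function applyProbe(
  state: HealthState,
  probe: Probe,
  retries: number,
  startPeriodSec: number,
): HealthState {
  if (probe.ok) {
    // Any passing check marks the container healthy and resets the streak.
    return { status: "healthy", failingStreak: 0 };
  }
  // Failures while still starting, inside the grace period, are ignored.
  const inGracePeriod = state.status === "starting" && probe.atSec < startPeriodSec;
  if (inGracePeriod) {
    return state;
  }
  const failingStreak = state.failingStreak + 1;
  return {
    status: failingStreak >= retries ? "unhealthy" : state.status,
    failingStreak,
  };
}

function simulate(probes: Probe[], retries = 3, startPeriodSec = 30): HealthState {
  let state: HealthState = { status: "starting", failingStreak: 0 };
  for (const probe of probes) {
    state = applyProbe(state, probe, retries, startPeriodSec);
  }
  return state;
}

const timeline: Probe[] = [
  { atSec: 15, ok: false }, // ignored: inside start_period
  { atSec: 30, ok: true },
  { atSec: 45, ok: false },
  { atSec: 60, ok: false },
  { atSec: 75, ok: false }, // third consecutive failure
];

console.log(simulate(timeline.slice(0, 1)).status); // starting
console.log(simulate(timeline.slice(0, 2)).status); // healthy
console.log(simulate(timeline).status);             // unhealthy
```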

---

### Media API (Fastify)

**docker-compose.yml**:
```yaml
media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Health endpoint** (api/src/media-server.ts):
```typescript
app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});
```

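**Note**: Uses `127.0.0.1` instead of `localhost`. On Alpine, `localhost` can resolve to the IPv6 address `::1` first, and BusyBox `wget` then fails with "Connection refused" if the server listens only on IPv4.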

---

### Admin (Vite Dev Server)

**docker-compose.yml**:
```yaml
admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s
```

**Explanation**:
- **30s interval**: Less critical than backend (frontend can tolerate brief downtime)
- **20s start period**: Vite dev server starts quickly
- **Root path**: Checks Vite is serving HTML (no dedicated /health endpoint)

---

### V2 PostgreSQL

**docker-compose.yml**:
```yaml
v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:
- **pg_isready**: Built-in PostgreSQL health check utility
- **10s interval**: Fast detection of database issues
- **5 retries**: More tolerant (database startup can be slow)
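- **No start_period**: Defaults to 0s, so failures count from the first check; 5 retries at 10s intervals still give PostgreSQL roughly 50 seconds to begin accepting connections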

**pg_isready output**:
```bash
# Healthy
/var/run/postgresql:5432 - accepting connections

# Unhealthy
/var/run/postgresql:5432 - rejecting connections
```
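
`pg_isready` also reports its result through the exit status, which is what Docker evaluates: zero passes the check, anything else fails it. The lookup below lists the statuses documented for `pg_isready`. The `describePgIsReady` helper is hypothetical and only useful for turning the `ExitCode` from a health log entry into a readable message.

```typescript
// Exit statuses documented for pg_isready.
const PG_ISREADY_STATUS = new Map<number, string>([
  [0, "accepting connections"],
  [1, "rejecting connections (for example, during startup)"],
  [2, "no response to the connection attempt"],
  [3, "no attempt made (for example, invalid parameters)"],
]);

// Hypothetical helper: Docker only distinguishes zero from non-zero,
// so the description is purely for human-readable diagnostics.
function describePgIsReady(exitCode: number): { healthy: boolean; meaning: string } {
  const meaning = PG_ISREADY_STATUS.get(exitCode);
  return {
    healthy: exitCode === 0,
    meaning: meaning === undefined ? "unknown exit status" : meaning,
  };
}

console.log(describePgIsReady(0)); // healthy
console.log(describePgIsReady(2)); // not healthy: no response
```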

---

### Redis

**docker-compose.yml**:
```yaml
redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:
- **redis-cli ping**: Returns `PONG` if healthy
- **-a ${REDIS_PASSWORD}**: Authenticates with password (required since Security Audit)
- **10s interval**: Fast detection for critical cache service

**PING output**:
```bash
# Healthy
PONG

# Unhealthy
(error) NOAUTH Authentication required
```

---

### Gitea

**docker-compose.yml**:
```yaml
gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **curl**: Debian-based image (no `wget`)
- **-f**: Exit non-zero on HTTP error responses (status 400 and above)
- **30s interval**: Supporting service (less critical)

**Important**: Gitea uses `curl` (not `wget`) because it's a Debian image, not Alpine.

---

### n8n

**docker-compose.yml**:
```yaml
n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **/healthz**: n8n's built-in health endpoint
- **30s interval**: Workflow automation (not user-facing)

---

## Dependency Chains

### API Depends on Database + Redis

**docker-compose.yml**:
```yaml
api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
```

**Effect**: Compose does not start the API container until PostgreSQL and Redis both report healthy.

**Startup sequence**:
1. PostgreSQL and Redis start in parallel → health checks begin (every 10s)
2. Each service is marked healthy on its first successful check
3. If a service fails 5 consecutive checks, it is marked unhealthy and `docker compose up` fails with a dependency error
4. API starts once both dependencies are healthy

---

### Media API Depends on Database

**docker-compose.yml**:
```yaml
media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: Media API waits for PostgreSQL to be healthy.

---

### NocoDB Depends on Database

**docker-compose.yml**:
```yaml
nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: NocoDB waits for its metadata database to be ready.
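
The gating rule behind `condition: service_healthy` can be sketched as a small function. The dependency map mirrors the three Compose snippets in this section. The `startable` helper is hypothetical and only models the rule; Compose applies it internally.

```typescript
type Health = "starting" | "healthy" | "unhealthy";

// depends_on edges from the Compose snippets in this section
// (every edge uses condition: service_healthy).
const dependsOn: Record<string, string[]> = {
  "v2-postgres": [],
  redis: [],
  api: ["v2-postgres", "redis"],
  "media-api": ["v2-postgres"],
  "nocodb-v2": ["v2-postgres"],
};

// Hypothetical helper: given the health of services that are already running,
// return the services that are now allowed to start.
function startable(health: Record<string, Health>): string[] {
  return Object.keys(dependsOn)
    .filter((svc) => !(svc in health)) // not started yet
    .filter((svc) => (dependsOn[svc] || []).every((dep) => health[dep] === "healthy"))
    .sort();
}

// Nothing running yet: only services without dependencies can start.
console.log(startable({}));
// PostgreSQL healthy, Redis still starting: the API must keep waiting.
console.log(startable({ "v2-postgres": "healthy", redis: "starting" }));
// Both healthy: the API is released as well.
console.log(startable({ "v2-postgres": "healthy", redis: "healthy" }));
```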

---

## Monitoring Healthcheck Status

### View Health Status

```bash
# All services (shows health in STATUS column)
docker compose ps

# Example output:
# NAME                    STATUS
# changemaker-v2-api      Up 2 hours (healthy)
# changemaker-v2-postgres Up 2 hours (healthy)
# redis-changemaker       Up 2 hours (healthy)
```

**Health states**:
- `(healthy)`: Most recent check passed
- `(unhealthy)`: Consecutive failures reached the `retries` limit
- `(health: starting)`: No check has passed yet (failures inside `start_period` are not counted)

---

### Filter Unhealthy Services

```bash
# Show only unhealthy
docker compose ps | grep unhealthy

# Count unhealthy (docker ps has a health filter; compose ps --status has no "unhealthy" value)
docker ps -q --filter health=unhealthy | wc -l
```

---

### Inspect Health Check Details

```bash
# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'

# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}
```

**Key fields**:
- **Status**: `healthy`, `unhealthy`, or `starting`
- **FailingStreak**: Consecutive failed checks
- **Log**: Last 5 health check results
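
For scripting, the same structure can be reduced to a one-line summary. The sketch below assumes the JSON comes from `docker inspect changemaker-v2-api | jq -c '.[0].State.Health'` and uses only the fields shown above. `summarizeHealth` is a hypothetical helper name.

```typescript
interface HealthLogEntry {
  Start: string;
  End: string;
  ExitCode: number;
  Output: string;
}

interface ContainerHealth {
  Status: "starting" | "healthy" | "unhealthy";
  FailingStreak: number;
  Log: HealthLogEntry[] | null;
}

// Hypothetical helper: condenses the State.Health object into one line.
function summarizeHealth(json: string): string {
  const health = JSON.parse(json) as ContainerHealth;
  const log = health.Log || [];
  const last = log[log.length - 1];
  if (last === undefined) {
    return `${health.Status} (no checks recorded yet)`;
  }
  const durationMs = Date.parse(last.End) - Date.parse(last.Start);
  const output = last.Output.trim() || "(no output)";
  return (
    `${health.Status}, failing streak ${health.FailingStreak}, ` +
    `last check exit ${last.ExitCode} in ${durationMs}ms: ${output}`
  );
}

// Sample taken from the example output above.
const sample = JSON.stringify({
  Status: "healthy",
  FailingStreak: 0,
  Log: [
    { Start: "2026-02-13T14:30:00Z", End: "2026-02-13T14:30:01Z", ExitCode: 0, Output: "" },
  ],
});

console.log(summarizeHealth(sample));
// healthy, failing streak 0, last check exit 0 in 1000ms: (no output)
```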

---

### Health Check Logs

```bash
# View health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'

# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}

# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}
```

---

## Custom Health Checks

### Advanced API Health Check

**Check database + Redis connectivity**:

**api/src/server.ts**:
```typescript
app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };

  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }

  const healthy = checks.database && checks.redis;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```

**docker-compose.yml** (no change needed; `wget` exits non-zero on the 503 response, so the same check now fails when a dependency is down):
```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

---

### Readiness vs Liveness

- **Readiness**: Service is ready to accept traffic (used by Kubernetes)
- **Liveness**: Service is running (Docker health checks)

**Example** (separate endpoints):
```typescript
// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});

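// Readiness (comprehensive check). checkDatabase() and checkRedis() are
// application helpers, not shown here, that resolve to true or false.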
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});
```

- **Docker uses liveness** (`/api/health`).
- **Load balancer uses readiness** (`/api/ready`).
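
The `checkDatabase()` and `checkRedis()` helpers are not defined in this document. One possible shape, offered as a sketch rather than the project's implementation, wraps each probe in a timeout so a hung connection cannot stall the readiness endpoint. `withTimeout`, `check`, and the 2000 ms default are assumptions introduced here.

```typescript
// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Turns any probe (for example a `SELECT 1` query or a Redis PING) into a boolean.
async function check(probe: () => Promise<unknown>, ms = 2000): Promise<boolean> {
  try {
    await withTimeout(probe(), ms);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical wiring, assuming `prisma` and `redis` clients like those in the
// Advanced API Health Check example:
//   const checkDatabase = () => check(() => prisma.$queryRaw`SELECT 1`);
//   const checkRedis = () => check(() => redis.ping());
```

Keeping the probe timeout well under the 5s health check `timeout` leaves room for the endpoint to answer with a 503 instead of timing out itself.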

---

## Troubleshooting

### Service Marked Unhealthy

**Diagnosis**:
```bash
# Check logs
docker compose logs --tail=50 api

# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'

# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health
```

**Common causes**:
- Service crashed (check logs)
- Health endpoint broken (test manually)
- Timeout too short (increase in docker-compose.yml)
- Database migration running (increase start_period)

---

### Container Restarting Loop

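**Symptoms**: Container repeatedly restarts, often showing `(unhealthy)` or `(health: starting)` between attempts. Docker's restart policy reacts to the container process exiting, not to the unhealthy status itself, so a loop means the process is crashing or an external watcher is restarting it.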

**Diagnosis**:
```bash
# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'

# Check logs for errors
docker compose logs api | grep -i error
```

**Common causes**:
- Health check too aggressive (increase retries/interval)
- Service genuinely broken (fix code issue)
- Resource limits too low (increase memory/CPU)

**Solution**:
```yaml
# Temporarily disable health check
healthcheck:
  disable: true

# Or increase tolerance
healthcheck:
  retries: 10
  start_period: 60s
```

---

### Health Check Command Not Found

**Symptoms**: Health check fails with "wget: not found" or "curl: not found"

**Cause**: Using wrong command for image type (Alpine vs Debian)

**Solution**:

**Alpine images** (api, media-api, redis, v2-postgres):
```yaml
test: ["CMD", "wget", "-q", "--spider", "http://..."]
```

**Debian images** (gitea-app):
```yaml
test: ["CMD", "curl", "-f", "http://..."]
```

---

### Start Period Too Short

**Symptoms**: Service marked unhealthy immediately on startup

**Cause**: Database migrations or slow startup exceed start_period

**Solution**:
```yaml
# Increase start_period
healthcheck:
  start_period: 60s  # Was 30s
```

**Monitor startup time**:
```bash
# Measure time to first healthy
SECONDS=0  # reset bash's built-in timer so it counts from here
docker compose up -d api && \
  while ! docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
  echo "Startup took $SECONDS seconds"
```

---

## Production Recommendations

### Timeout Configuration

**Critical services** (database, redis, api):
- interval: 10-15s
- timeout: 5s
- retries: 3-5
- start_period: 30-60s

**Supporting services** (n8n, gitea, mailhog):
- interval: 30-60s
- timeout: 10s
- retries: 3
- start_period: 30s
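
These ranges can be checked mechanically. The sketch below encodes the two tiers and reports any field that falls outside them. The `deviations` helper and the field names are hypothetical, and a missing `start_period` is treated as Docker's default of 0s.

```typescript
interface Healthcheck {
  intervalSec: number;
  timeoutSec: number;
  retries: number;
  startPeriodSec: number;
}

type Tier = "critical" | "supporting";
type Range = [number, number]; // inclusive [min, max]

// Recommended ranges from the lists above.
const RECOMMENDED: Record<Tier, Record<keyof Healthcheck, Range>> = {
  critical: { intervalSec: [10, 15], timeoutSec: [5, 5], retries: [3, 5], startPeriodSec: [30, 60] },
  supporting: { intervalSec: [30, 60], timeoutSec: [10, 10], retries: [3, 3], startPeriodSec: [30, 30] },
};

// Hypothetical helper: lists every setting outside the recommended range.
function deviations(tier: Tier, hc: Healthcheck): string[] {
  const ranges = RECOMMENDED[tier];
  return (Object.keys(ranges) as (keyof Healthcheck)[])
    .filter((key) => hc[key] < ranges[key][0] || hc[key] > ranges[key][1])
    .map((key) => `${key}=${hc[key]} outside ${ranges[key][0]}-${ranges[key][1]}`);
}

// api settings from the services table: within range.
console.log(deviations("critical", { intervalSec: 15, timeoutSec: 5, retries: 3, startPeriodSec: 30 }));
// v2-postgres has no start_period configured (treated as 0s here).
console.log(deviations("critical", { intervalSec: 10, timeoutSec: 5, retries: 5, startPeriodSec: 0 }));
```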

---

### Restart Policies

**Combine with restart policies**:
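```yaml
api:
  restart: unless-stopped  # Restart when the container process exits
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

**Effect**: If the service process exits, Docker restarts the container and health checks begin again from `starting`. A container that is marked unhealthy while its process keeps running is not restarted by the restart policy; that requires Docker Swarm or an external watcher.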

---

### Monitoring Integration

**Prometheus exporter** (future; `docker_healthcheck_status` is a placeholder metric name until an exporter is chosen):
```text
# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1
```

**Alert on unhealthy**:
```yaml
- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"
```
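
Until an exporter is in place, the metric could be produced by a small script. The sketch below renders health states in the Prometheus text exposition format. The metric name follows the example above and is not emitted by any existing exporter, and `renderHealthMetrics` is a hypothetical helper. It maps `starting` to 0, so the `for: 5m` clause in the alert is what keeps normal start-ups from firing it.

```typescript
type Health = "starting" | "healthy" | "unhealthy";

// Escapes a label value per the Prometheus text exposition format.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Hypothetical helper: one gauge sample per container, 1 = healthy, 0 = otherwise.
function renderHealthMetrics(statuses: Record<string, Health>): string {
  const lines = [
    "# HELP docker_healthcheck_status 1 if the container health check is passing, 0 otherwise.",
    "# TYPE docker_healthcheck_status gauge",
  ];
  for (const container of Object.keys(statuses)) {
    const value = statuses[container] === "healthy" ? 1 : 0;
    lines.push(`docker_healthcheck_status{container="${escapeLabel(container)}"} ${value}`);
  }
  return lines.join("\n") + "\n";
}

const metrics = renderHealthMetrics({
  "changemaker-v2-api": "healthy",
  "redis-changemaker": "unhealthy",
});
console.log(metrics);
```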

---

## Testing Health Checks

### Manual Test

```bash
# Start service
docker compose up -d api

# Watch health status
watch -n2 'docker compose ps api'

# Should see:
# (health: starting) → (healthy)
```

---

### Simulate Failure

```bash
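# Assumes /api/health checks the database (see Advanced API Health Check);
# the basic endpoint always returns 200 and would stay healthy.

# Stop backend service
docker compose stop v2-postgres

# Wait for 3 failed checks (3 x 15s interval, plus margin)
sleep 60

# Check API status
docker compose ps api
# Should show (unhealthy)

# Restart backend
docker compose start v2-postgres

# Wait for the next check, then confirm recovery
sleep 20
docker compose ps api
# Should show (healthy) after one successful check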
```

---

## Related Documentation

- **[Docker Compose](docker-compose.md)** — Service orchestration
- **[Monitoring Stack](monitoring-stack.md)** — Health metrics
- **[Troubleshooting](../troubleshooting/common-issues.md)** — Debug failing services