
# Docker Health Check Configuration
## Overview
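Docker health checks let the engine probe each service on a schedule and report it as `starting`, `healthy`, or `unhealthy`. Changemaker Lite V2 defines health checks for 7 critical services.
**Benefits:**
- Automatic detection of unhealthy containers (Docker Engine only reports the status; restarting on `unhealthy` requires Swarm or an external watchdog)
- Dependency management (`depends_on` with `service_healthy`)
- Monitoring integration (health status can be exported to Prometheus through an exporter)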
---
## Services with Health Checks
| Service | Healthcheck Command | Interval | Timeout | Retries | Start Period |
|---------|---------------------|----------|---------|---------|--------------|
| **api** | `wget http://localhost:4000/api/health` | 15s | 5s | 3 | 30s |
| **media-api** | `wget http://127.0.0.1:4100/health` | 15s | 5s | 3 | 30s |
| **admin** | `wget http://127.0.0.1:3000/` | 30s | 5s | 3 | 20s |
| **v2-postgres** | `pg_isready -U changemaker` | 10s | 5s | 5 | - |
| **redis** | `redis-cli -a $REDIS_PASSWORD ping` | 10s | 5s | 5 | - |
| **gitea-app** | `curl http://localhost:3000/` | 30s | 5s | 3 | 30s |
| **n8n** | `wget http://localhost:5678/healthz` | 30s | 5s | 3 | 30s |
---
## Health Check Configuration
### API (Express)
**docker-compose.yml**:
```yaml
api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```
**Explanation**:
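- **test**: Runs `wget` (shipped with Alpine's BusyBox) against the `/api/health` endpoint; `--spider` checks the URL without downloading the body
- **interval**: Check every 15 seconds
- **timeout**: Fail the check if there is no response within 5 seconds
- **retries**: Mark unhealthy after 3 consecutive failures
- **start_period**: 30s grace period on startup; failed checks in this window do not count toward `retries` (allows migrations to run)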
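To make the interaction between `retries` and `start_period` concrete, the sketch below replays a sequence of check results and reports the state after each one. It is a simplified model of the documented behavior, not Docker's implementation; `simulateHealth` and its field names are hypothetical, and it assumes the first check runs one interval after the container starts.

```typescript
type HealthState = "starting" | "healthy" | "unhealthy";

interface HealthcheckConfig {
  intervalSec: number;
  retries: number;
  startPeriodSec: number;
}

// Replays check results (true = exit code 0) and returns the state after each one.
function simulateHealth(config: HealthcheckConfig, results: boolean[]): HealthState[] {
  let state: HealthState = "starting";
  let failingStreak = 0;
  let started = false; // becomes true after the first successful check
  const states: HealthState[] = [];

  for (let i = 0; i < results.length; i++) {
    const elapsedSec = (i + 1) * config.intervalSec; // assumed check schedule
    if (results[i]) {
      started = true;
      failingStreak = 0;
      state = "healthy";
    } else if (!started && elapsedSec <= config.startPeriodSec) {
      // Failure inside the grace period: not counted toward retries.
    } else {
      failingStreak += 1;
      if (failingStreak >= config.retries) state = "unhealthy";
    }
    states.push(state);
  }
  return states;
}

const apiConfig: HealthcheckConfig = { intervalSec: 15, retries: 3, startPeriodSec: 30 };
// Two failures while migrations run (ignored), one pass, then three failures in a row.
console.log(simulateHealth(apiConfig, [false, false, true, false, false, false]));
// starting, starting, healthy, healthy, healthy, unhealthy
```

Note that a single passing check is enough to leave `starting`; `retries` only controls how many consecutive failures flip the container to `unhealthy`.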
**Health endpoint** (api/src/server.ts):
```typescript
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
```
**Health states**:
- **starting**: No check has passed yet (failures within the 30s start_period are not counted)
- **healthy**: Most recent check passed
- **unhealthy**: 3 consecutive failures
---
### Media API (Fastify)
**docker-compose.yml**:
```yaml
media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```
**Health endpoint** (api/src/media-server.ts):
```typescript
app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});
```
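**Note**: Uses `127.0.0.1` instead of `localhost`. Inside an Alpine container `localhost` can resolve to the IPv6 address `::1` first, and BusyBox `wget` then fails if the service listens only on IPv4.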
---
### Admin (Vite Dev Server)
**docker-compose.yml**:
```yaml
admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s
```
**Explanation**:
- **30s interval**: Less critical than backend (frontend can tolerate brief downtime)
- **20s start period**: Vite dev server starts quickly
- **Root path**: Checks Vite is serving HTML (no dedicated /health endpoint)
---
### V2 PostgreSQL
**docker-compose.yml**:
```yaml
v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5
```
**Explanation**:
- **pg_isready**: Built-in PostgreSQL health check utility
- **10s interval**: Fast detection of database issues
- **5 retries**: More tolerant (database startup can be slow)
- **No start_period**: Defaults to 0s; the 5 retries at 10s intervals give PostgreSQL roughly 50s to start accepting connections before it is marked unhealthy
**pg_isready output**:
```bash
# Healthy
/var/run/postgresql:5432 - accepting connections
# Unhealthy
/var/run/postgresql:5432 - rejecting connections
```
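Docker decides the result of a check from the command's exit code rather than its output, with 0 counting as a pass. `pg_isready` documents four exit codes, and the sketch below maps them to readable statuses. The function name `describePgIsReady` and the status labels are illustrative, not part of any tool.

```typescript
type PgStatus = "accepting" | "rejecting" | "no-response" | "no-attempt";

// Maps the documented pg_isready exit codes to a status and a pass/fail flag.
function describePgIsReady(exitCode: number): { status: PgStatus; healthy: boolean } {
  switch (exitCode) {
    case 0:
      return { status: "accepting", healthy: true };
    case 1:
      return { status: "rejecting", healthy: false }; // e.g. server still starting up
    case 2:
      return { status: "no-response", healthy: false }; // nothing answered the connection attempt
    case 3:
      return { status: "no-attempt", healthy: false }; // e.g. invalid parameters
    default:
      throw new Error(`unexpected pg_isready exit code: ${exitCode}`);
  }
}

console.log(describePgIsReady(0)); // { status: 'accepting', healthy: true }
console.log(describePgIsReady(2)); // { status: 'no-response', healthy: false }
```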
---
### Redis
**docker-compose.yml**:
```yaml
redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
```
**Explanation**:
- **redis-cli ping**: Returns `PONG` if healthy
- **-a ${REDIS_PASSWORD}**: Authenticates with password (required since Security Audit)
- **10s interval**: Fast detection for critical cache service
**PING output**:
```bash
# Healthy
PONG
# Unhealthy
(error) NOAUTH Authentication required
```
---
### Gitea
**docker-compose.yml**:
```yaml
gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```
**Explanation**:
- **curl**: Used instead of `wget` because this is a Debian-based image, not Alpine, and does not ship `wget`
- **-f**: Exit non-zero on HTTP error responses (status 400 and above)
- **30s interval**: Supporting service (less critical)
---
### n8n
**docker-compose.yml**:
```yaml
n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```
**Explanation**:
- **/healthz**: n8n's built-in health endpoint
- **30s interval**: Workflow automation (not user-facing)
---
## Dependency Chains
### API Depends on Database + Redis
**docker-compose.yml**:
```yaml
api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
```
**Effect**: API container waits for PostgreSQL + Redis to be healthy before starting.
**Startup sequence**:
1. PostgreSQL and Redis start in parallel → health checks begin
2. Each is marked healthy on its first successful check (`retries: 5` only sets how many consecutive failures mark it unhealthy)
3. API starts once both dependencies are healthy
---
### Media API Depends on Database
**docker-compose.yml**:
```yaml
media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy
```
**Effect**: Media API waits for PostgreSQL to be healthy.
---
### NocoDB Depends on Database
**docker-compose.yml**:
```yaml
nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy
```
**Effect**: NocoDB waits for its metadata database to be ready.
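Taken together, the three `depends_on` blocks in this section imply a start order. The sketch below groups services into "waves", where a service can start only after every dependency in an earlier wave is healthy. It models the ordering only and is not how Compose is implemented; `startWaves` is a hypothetical helper, and the graph is copied by hand from the snippets above.

```typescript
type DependsOn = Record<string, string[]>;

// Groups services into waves: a service is placed in a wave only when all of
// its service_healthy dependencies sit in an earlier wave.
function startWaves(dependsOn: DependsOn): string[][] {
  const remaining = new Set(Object.keys(dependsOn));
  const healthy = new Set<string>();
  const waves: string[][] = [];

  while (remaining.size > 0) {
    const wave = Array.from(remaining)
      .filter((svc) => dependsOn[svc].every((dep) => healthy.has(dep)))
      .sort();
    if (wave.length === 0) throw new Error("dependency cycle or unknown service");
    for (const svc of wave) {
      remaining.delete(svc);
      healthy.add(svc);
    }
    waves.push(wave);
  }
  return waves;
}

const graph: DependsOn = {
  "v2-postgres": [],
  redis: [],
  api: ["v2-postgres", "redis"],
  "media-api": ["v2-postgres"],
  "nocodb-v2": ["v2-postgres"],
};

// Wave 1: redis, v2-postgres. Wave 2: api, media-api, nocodb-v2.
console.log(startWaves(graph));
```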
---
## Monitoring Healthcheck Status
### View Health Status
```bash
# All services (shows health in STATUS column)
docker compose ps
# Example output:
# NAME STATUS
# changemaker-v2-api Up 2 hours (healthy)
# changemaker-v2-postgres Up 2 hours (healthy)
# redis-changemaker Up 2 hours (healthy)
```
**Health states**:
- `(healthy)`: Most recent check passed
- `(unhealthy)`: Consecutive failures reached the `retries` limit
- `(health: starting)`: No check has passed yet (typically within start_period)
---
### Filter Unhealthy Services
```bash
# Show only unhealthy
docker compose ps | grep unhealthy
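# Count unhealthy (`docker compose ps --status` has no "unhealthy" value,
# so filter with `docker ps`; this counts all containers on the host)
docker ps -q --filter health=unhealthy | wc -l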
```
---
### Inspect Health Check Details
```bash
# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'
# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}
```
**Key fields**:
- **Status**: `healthy`, `unhealthy`, or `starting`
- **FailingStreak**: Consecutive failed checks
- **Log**: Last 5 health check results
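For scripting against this output, the fields above can be given a type and condensed into a one-line summary. The sketch below covers only the fields shown in the example output; `summarizeHealth` is a hypothetical helper that expects the JSON string produced by the `jq` command above.

```typescript
interface HealthLogEntry {
  Start: string;
  End: string;
  ExitCode: number;
  Output: string;
}

interface ContainerHealth {
  Status: "starting" | "healthy" | "unhealthy";
  FailingStreak: number;
  Log: HealthLogEntry[] | null;
}

// Condenses the State.Health JSON into a single line for logs or alerts.
function summarizeHealth(json: string): string {
  const health = JSON.parse(json) as ContainerHealth;
  const log = health.Log ?? [];
  const last = log[log.length - 1];
  if (!last) return `${health.Status} (no checks recorded yet)`;
  const durationMs = Date.parse(last.End) - Date.parse(last.Start);
  return `${health.Status}, streak=${health.FailingStreak}, last exit=${last.ExitCode}, took ${durationMs}ms`;
}

const sample = JSON.stringify({
  Status: "healthy",
  FailingStreak: 0,
  Log: [{ Start: "2026-02-13T14:30:00Z", End: "2026-02-13T14:30:01Z", ExitCode: 0, Output: "" }],
});
console.log(summarizeHealth(sample)); // healthy, streak=0, last exit=0, took 1000ms
```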
---
### Health Check Logs
```bash
# View health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'
# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}
# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}
```
---
## Custom Health Checks
### Advanced API Health Check
**Check database + Redis connectivity**:
**api/src/server.ts**:
```typescript
app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };
  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }
  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }
  const healthy = checks.database && checks.redis;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```
**docker-compose.yml** (no change needed — still checks `/api/health`):
```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```
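One thing to watch with a dependency-aware endpoint is that a hung database or Redis call can hold the response open past the 5s Docker `timeout`, so the check fails without a useful message. A possible mitigation, sketched below, is to bound each dependency check with its own shorter deadline. `withTimeout` is a hypothetical helper, and the two promises stand in for the `prisma` and `redis` calls used in the handler.

```typescript
// Resolves to true if the check fulfils in time, false if it rejects or
// does not settle within timeoutMs.
function withTimeout(check: Promise<unknown>, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    check.then(
      () => {
        clearTimeout(timer);
        resolve(true);
      },
      () => {
        clearTimeout(timer);
        resolve(false);
      }
    );
  });
}

// Stand-ins for redis.ping() and prisma.$queryRaw`SELECT 1` (assumptions for this demo).
const fastCheck: Promise<string> = Promise.resolve("PONG");
const hangingCheck = new Promise<never>(() => {
  /* never settles, like a stalled connection */
});

Promise.all([withTimeout(fastCheck, 2000), withTimeout(hangingCheck, 50)]).then(
  ([redis, database]) => console.log({ redis, database }) // { redis: true, database: false }
);
```

In the handler this would replace each `try`/`catch` pair with something like `checks.redis = await withTimeout(redis.ping(), 2000)`, keeping the total well under the 5s limit.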
---
### Readiness vs Liveness
**Readiness**: Service is ready to accept traffic (used by Kubernetes)
**Liveness**: Service is running (Docker health checks)
**Example** (separate endpoints):
```typescript
// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});
// Readiness (comprehensive check)
// checkDatabase() and checkRedis() are assumed helpers that resolve to a boolean
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});
```
**Docker uses liveness** (`/api/health`).
**Load balancer uses readiness** (`/api/ready`).
---
## Troubleshooting
### Service Marked Unhealthy
**Diagnosis**:
```bash
# Check logs
docker compose logs --tail=50 api
# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'
# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health
```
**Common causes**:
- Service crashed (check logs)
- Health endpoint broken (test manually)
- Timeout too short (increase in docker-compose.yml)
- Database migration running (increase start_period)
---
### Container Restarting Loop
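**Symptoms**: Container repeatedly marked unhealthy → restart → unhealthy. Docker Engine does not restart a container only because it is unhealthy, so a loop like this means either the process keeps exiting (and the `restart:` policy brings it back) or Swarm or an external watchdog is acting on the health status.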
**Diagnosis**:
```bash
# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'
# Check logs for errors
docker compose logs api | grep -i error
```
**Common causes**:
- Health check too aggressive (increase retries/interval)
- Service genuinely broken (fix code issue)
- Resource limits too low (increase memory/CPU)
**Solution**:
```yaml
# Temporarily disable health check
healthcheck:
  disable: true
# Or increase tolerance
healthcheck:
  retries: 10
  start_period: 60s
```
---
### Health Check Command Not Found
**Symptoms**: Health check fails with "wget: not found" or "curl: not found"
**Cause**: Using wrong command for image type (Alpine vs Debian)
**Solution**:
**Alpine images** (api, media-api, redis, v2-postgres):
```yaml
test: ["CMD", "wget", "-q", "--spider", "http://..."]
```
**Debian images** (gitea-app):
```yaml
test: ["CMD", "curl", "-f", "http://..."]
```
---
### Start Period Too Short
**Symptoms**: Service marked unhealthy immediately on startup
**Cause**: Database migrations or slow startup exceed start_period
**Solution**:
```yaml
# Increase start_period
healthcheck:
  start_period: 60s # Was 30s
```
**Monitor startup time**:
```bash
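# Measure time to first healthy (bash)
# Reset the shell timer first, and match "(healthy)" so "(unhealthy)" does not end the loop
SECONDS=0
docker compose up -d api && \
while ! docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
echo "Startup took $SECONDS seconds"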
```
---
## Production Recommendations
### Timeout Configuration
**Critical services** (database, redis, api):
- interval: 10-15s
- timeout: 5s
- retries: 3-5
- start_period: 30-60s
**Supporting services** (n8n, gitea, mailhog):
- interval: 30-60s
- timeout: 10s
- retries: 3
- start_period: 30s
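These settings trade detection speed against tolerance for brief hiccups. As a rough guide, the sketch below estimates how long a hard failure can go unnoticed for a given configuration. It assumes each check starts one interval after the previous one finishes; `detectionWindowSec` is a hypothetical helper, and the figures are estimates rather than guarantees.

```typescript
interface Timing {
  intervalSec: number;
  timeoutSec: number;
  retries: number;
}

// Estimates the time between a service failing and Docker marking it unhealthy.
function detectionWindowSec(t: Timing): { best: number; worst: number } {
  // Best case: the failure lands just before a check and every check fails instantly.
  const best = (t.retries - 1) * t.intervalSec;
  // Worst case: the failure lands just after a check and every check runs into the timeout.
  const worst = t.retries * (t.intervalSec + t.timeoutSec);
  return { best, worst };
}

console.log(detectionWindowSec({ intervalSec: 15, timeoutSec: 5, retries: 3 })); // { best: 30, worst: 60 }
console.log(detectionWindowSec({ intervalSec: 30, timeoutSec: 10, retries: 3 })); // { best: 60, worst: 120 }
```

With the API settings above (15s, 5s, 3 retries) a failure is flagged within about 30 to 60 seconds; the supporting-service profile roughly doubles that.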
---
### Restart Policies
**Combine with restart policies**:
```yaml
api:
  restart: unless-stopped # Restarts the container when its process exits
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```
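**Effect**: If the process exits, Docker restarts the container and health checks resume. An `unhealthy` status alone does not trigger a restart under Docker Compose; that requires Swarm or a watchdog container that restarts unhealthy containers.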
---
### Monitoring Integration
**Prometheus exporter** (future):
```bash
# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1
```
**Alert on unhealthy**:
```yaml
- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"
```
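Since the exporter is listed as future work, the following is only a sketch of what its output stage might look like: a function that turns container health statuses into Prometheus text exposition format using the `docker_healthcheck_status` metric name from the example above. `renderMetrics` and the input shape are assumptions; collecting the statuses (for example from `docker inspect`) is out of scope here.

```typescript
type Status = "starting" | "healthy" | "unhealthy";

// Renders one gauge sample per container: 1 = healthy, 0 = anything else.
// Container names are assumed to contain no quotes or backslashes.
function renderMetrics(statuses: Record<string, Status>): string {
  const lines = [
    "# HELP docker_healthcheck_status 1 if the container health check is passing, 0 otherwise",
    "# TYPE docker_healthcheck_status gauge",
  ];
  for (const container of Object.keys(statuses).sort()) {
    const value = statuses[container] === "healthy" ? 1 : 0;
    lines.push(`docker_healthcheck_status{container="${container}"} ${value}`);
  }
  return lines.join("\n") + "\n";
}

console.log(
  renderMetrics({
    "changemaker-v2-api": "healthy",
    "changemaker-v2-postgres": "unhealthy",
  })
);
```

Mapping `starting` to 0 means the alert rule above would also fire for a container stuck in startup for more than 5 minutes, which may or may not be desirable.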
---
## Testing Health Checks
### Manual Test
```bash
# Start service
docker compose up -d api
# Watch health status
watch -n2 'docker compose ps api'
# Should see:
# (health: starting) → (healthy)
```
---
### Simulate Failure
```bash
# Stop backend service
docker compose stop v2-postgres
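# Wait for 3 failed checks (3 x 15s interval, plus up to 5s timeout each)
sleep 60
# Check API status
docker compose ps api
# Should show (unhealthy), provided /api/health checks the database
# (see "Advanced API Health Check"; the basic endpoint keeps returning 200)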
# Restart backend
docker compose start v2-postgres
# API should recover
docker compose ps api
# Should show (healthy) after successful check
```
---
## Related Documentation
- **[Docker Compose](docker-compose.md)** — Service orchestration
- **[Monitoring Stack](monitoring-stack.md)** — Health metrics
- **[Troubleshooting](../troubleshooting/common-issues.md)** — Debug failing services