# Docker Health Check Configuration

## Overview

Docker health checks provide automatic service monitoring and, combined with a restart mechanism, recovery of failed services. Changemaker Lite V2 includes health checks for 7 critical services.

**Benefits:**
- Automatic detection of unhealthy containers (restarting them requires Docker Swarm or an external watcher; a Compose `restart` policy on its own reacts only when the container exits)
- Dependency management (`depends_on` with `service_healthy`)
- Monitoring integration (Prometheus can scrape health status through an exporter)

---

## Services with Health Checks

| Service | Healthcheck Command | Interval | Timeout | Retries | Start Period |
|---------|---------------------|----------|---------|---------|--------------|
| **api** | `wget http://localhost:4000/api/health` | 15s | 5s | 3 | 30s |
| **media-api** | `wget http://127.0.0.1:4100/health` | 15s | 5s | 3 | 30s |
| **admin** | `wget http://127.0.0.1:3000/` | 30s | 5s | 3 | 20s |
| **v2-postgres** | `pg_isready -U changemaker` | 10s | 5s | 5 | - |
| **redis** | `redis-cli -a $REDIS_PASSWORD ping` | 10s | 5s | 5 | - |
| **gitea-app** | `curl http://localhost:3000/` | 30s | 5s | 3 | 30s |
| **n8n** | `wget http://localhost:5678/healthz` | 30s | 5s | 3 | 30s |

---

## Health Check Configuration

### API (Express)

**docker-compose.yml**:
```yaml
api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **test**: Runs `wget` (available by default in Alpine images) against the `/api/health` endpoint; `--spider` checks the URL without saving the response
- **interval**: Check every 15 seconds
- **timeout**: Count the check as failed if it has not completed within 5 seconds
- **retries**: Mark unhealthy after 3 consecutive failures
- **start_period**: 30s grace period on startup (allows migrations to run); failed checks in this window do not count toward `retries`

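To make the interaction between `interval`, `timeout`, and `retries` concrete, the sketch below estimates how long a hard failure takes to surface as `unhealthy`. It is a rough model, not Docker's scheduler: the `detectionWindow` helper and the `HealthcheckTiming` type are hypothetical names introduced here, and it assumes each probe either fails instantly (`fast`) or runs into the full timeout (`slow`).

```typescript
// Rough timing model for a Docker health check (all values in seconds).
interface HealthcheckTiming {
  interval: number;
  timeout: number;
  retries: number;
}

// Approximate time from a hard failure until the container is reported unhealthy.
// fast: every probe fails immediately; slow: every probe runs into the timeout.
function detectionWindow(t: HealthcheckTiming): { fast: number; slow: number } {
  return {
    fast: t.retries * t.interval,
    slow: t.retries * (t.interval + t.timeout),
  };
}

// Values taken from the services table in this document.
const apiTiming: HealthcheckTiming = { interval: 15, timeout: 5, retries: 3 };
const postgresTiming: HealthcheckTiming = { interval: 10, timeout: 5, retries: 5 };

console.log(detectionWindow(apiTiming)); // { fast: 45, slow: 60 }
console.log(detectionWindow(postgresTiming)); // { fast: 50, slow: 75 }
```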
**Health endpoint** (api/src/server.ts):
```typescript
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
```

**Health states**:
- **starting**: No check has succeeded yet; failures within start_period (30s) are not counted
- **healthy**: Most recent check passed
- **unhealthy**: 3 consecutive failures

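The state transitions can be sketched as a small simulation. This is an illustrative model of the rules listed above, not Docker's implementation; `replay`, `Probe`, and the probe timestamps are hypothetical names and values chosen for the example.

```typescript
type HealthState = "starting" | "healthy" | "unhealthy";

interface Probe {
  atSeconds: number; // seconds since container start
  ok: boolean;
}

// Replays probe results and returns the health state after each probe.
function replay(probes: Probe[], retries: number, startPeriod: number): HealthState[] {
  let state: HealthState = "starting";
  let failingStreak = 0;
  const history: HealthState[] = [];
  for (const probe of probes) {
    if (probe.ok) {
      failingStreak = 0;
      state = "healthy";
    } else if (state !== "starting" || probe.atSeconds >= startPeriod) {
      // Failures inside the start period are ignored while still starting.
      failingStreak += 1;
      if (failingStreak >= retries) state = "unhealthy";
    }
    history.push(state);
  }
  return history;
}

const history = replay(
  [
    { atSeconds: 15, ok: false }, // ignored: inside the start period
    { atSeconds: 30, ok: true },
    { atSeconds: 45, ok: false },
    { atSeconds: 60, ok: false },
    { atSeconds: 75, ok: false }, // third consecutive failure
    { atSeconds: 90, ok: true },
  ],
  3,
  30,
);
console.log(history.join(" -> "));
// starting -> healthy -> healthy -> healthy -> unhealthy -> healthy
```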

---

### Media API (Fastify)

**docker-compose.yml**:
```yaml
media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Health endpoint** (api/src/media-server.ts):
```typescript
app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});
```

**Note**: Uses `127.0.0.1` instead of `localhost`. In Alpine containers `localhost` can resolve to the IPv6 address `::1`, so `wget` fails when the service listens only on IPv4.

---

### Admin (Vite Dev Server)

**docker-compose.yml**:
```yaml
admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s
```

**Explanation**:
- **30s interval**: Less critical than backend (frontend can tolerate brief downtime)
- **20s start period**: Vite dev server starts quickly
- **Root path**: Checks Vite is serving HTML (no dedicated /health endpoint)

---

### V2 PostgreSQL

**docker-compose.yml**:
```yaml
v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:
- **pg_isready**: Built-in PostgreSQL health check utility
- **10s interval**: Fast detection of database issues
- **5 retries**: More tolerant (database startup can be slow)
- **No start_period**: Failures count from the first check, so the 5 retries at a 10s interval provide the startup tolerance (roughly 50s)

**pg_isready output**:
```bash
# Healthy
/var/run/postgresql:5432 - accepting connections

# Unhealthy
/var/run/postgresql:5432 - rejecting connections
```

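Docker only looks at the exit code of `pg_isready`, not at the printed message. As a small reference sketch, the lookup below pairs the exit codes documented for `pg_isready` with their meaning; the `describePgIsReady` helper is a hypothetical name used for illustration.

```typescript
// Exit codes documented for pg_isready.
const PG_ISREADY_STATUS: Record<number, string> = {
  0: "accepting connections",
  1: "rejecting connections (for example during startup)",
  2: "no response to the connection attempt",
  3: "no attempt made (for example invalid parameters)",
};

// Only exit code 0 counts as a passing health check.
function describePgIsReady(exitCode: number): { healthy: boolean; meaning: string } {
  return {
    healthy: exitCode === 0,
    meaning: PG_ISREADY_STATUS[exitCode] ?? "unknown exit code",
  };
}

console.log(describePgIsReady(0)); // { healthy: true, meaning: 'accepting connections' }
console.log(describePgIsReady(2)); // { healthy: false, meaning: 'no response to the connection attempt' }
```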

---

### Redis

**docker-compose.yml**:
```yaml
redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:
- **redis-cli ping**: Returns `PONG` if healthy
- **-a ${REDIS_PASSWORD}**: Authenticates with password (required since Security Audit); Compose substitutes the variable from the environment or `.env` file when it reads the configuration
- **10s interval**: Fast detection for critical cache service

**PING output**:
```bash
# Healthy
PONG

# Unhealthy
(error) NOAUTH Authentication required
```

---

### Gitea

**docker-compose.yml**:
```yaml
gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **curl**: Debian-based image (no `wget`)
- **-f**: Exit with a non-zero code on HTTP error responses (status 400 or higher)
- **30s interval**: Supporting service (less critical)

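The effect of `-f` can be modelled in a few lines. This is a sketch of the behaviour rather than anything taken from the project: `curlFailExitCode` is a hypothetical helper, and 22 is the exit code curl reports for HTTP errors when `-f` is set. Note that a redirect (3xx) still counts as success because the check does not follow redirects and the status is below 400.

```typescript
// Models how `curl -f` turns an HTTP status into a process exit code.
// Docker treats exit code 0 as a passing check and anything else as a failure.
function curlFailExitCode(httpStatus: number): number {
  return httpStatus >= 400 ? 22 : 0;
}

for (const status of [200, 302, 404, 503]) {
  const exitCode = curlFailExitCode(status);
  console.log(`HTTP ${status} -> exit ${exitCode} (${exitCode === 0 ? "pass" : "fail"})`);
}
// HTTP 200 -> exit 0 (pass)
// HTTP 302 -> exit 0 (pass)
// HTTP 404 -> exit 22 (fail)
// HTTP 503 -> exit 22 (fail)
```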
**Important**: Gitea uses `curl` (not `wget`) because it's a Debian image, not Alpine.

---

### n8n

**docker-compose.yml**:
```yaml
n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:
- **/healthz**: n8n's built-in health endpoint
- **30s interval**: Workflow automation (not user-facing)

---

## Dependency Chains

### API Depends on Database + Redis

**docker-compose.yml**:
```yaml
api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
```

**Effect**: API container waits for PostgreSQL + Redis to be healthy before starting.

**Startup sequence**:
1. PostgreSQL and Redis start → health checks begin for both
2. Each is marked healthy after its first successful check (`retries: 5` only controls how many consecutive failures mark a service unhealthy)
3. API starts (both dependencies healthy)

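The ordering that `service_healthy` conditions produce can be sketched as a grouping into start "waves". This is an illustration, not how Compose is implemented: `startWaves` is a hypothetical helper, and the graph simply restates the `depends_on` entries shown in this document.

```typescript
// Service name -> services that must be healthy before it starts.
const dependsOn: Record<string, string[]> = {
  "v2-postgres": [],
  redis: [],
  api: ["v2-postgres", "redis"],
  "media-api": ["v2-postgres"],
  "nocodb-v2": ["v2-postgres"],
};

// Groups services into waves: a service can start once all of its
// dependencies appeared in an earlier wave.
function startWaves(graph: Record<string, string[]>): string[][] {
  const started = new Set<string>();
  const waves: string[][] = [];
  let remaining = Object.keys(graph);
  while (remaining.length > 0) {
    const wave = remaining.filter((name) => graph[name].every((dep) => started.has(dep)));
    if (wave.length === 0) throw new Error("dependency cycle or unknown dependency");
    wave.forEach((name) => started.add(name));
    remaining = remaining.filter((name) => !started.has(name));
    waves.push(wave);
  }
  return waves;
}

console.log(startWaves(dependsOn));
// [ [ 'v2-postgres', 'redis' ], [ 'api', 'media-api', 'nocodb-v2' ] ]
```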

---

### Media API Depends on Database

**docker-compose.yml**:
```yaml
media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: Media API waits for PostgreSQL to be healthy.

---

### NocoDB Depends on Database

**docker-compose.yml**:
```yaml
nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: NocoDB waits for its metadata database to be ready.

---

## Monitoring Healthcheck Status

### View Health Status

```bash
# All services (shows health in STATUS column)
docker compose ps

# Example output:
# NAME                      STATUS
# changemaker-v2-api        Up 2 hours (healthy)
# changemaker-v2-postgres   Up 2 hours (healthy)
# redis-changemaker         Up 2 hours (healthy)
```

**Health states**:
- `(healthy)`: Most recent check passed
- `(unhealthy)`: `retries` consecutive checks failed
- `(health: starting)`: No successful check yet (typically within start_period)

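When scripting against this output, it helps to extract the health marker explicitly, because the text `healthy` is also a substring of `unhealthy`. A minimal sketch, assuming the STATUS strings look like the example above; `healthFromStatus` is a hypothetical helper.

```typescript
type Health = "healthy" | "unhealthy" | "starting" | "none";

// Extracts the health marker from a STATUS string such as "Up 2 hours (healthy)".
// Containers without a health check have no marker and map to "none".
function healthFromStatus(status: string): Health {
  const match = status.match(/\((healthy|unhealthy|health: starting)\)/);
  if (!match) return "none";
  if (match[1] === "health: starting") return "starting";
  return match[1] as Health;
}

console.log(healthFromStatus("Up 2 hours (healthy)")); // healthy
console.log(healthFromStatus("Up 5 minutes (unhealthy)")); // unhealthy
console.log(healthFromStatus("Up 3 seconds (health: starting)")); // starting
console.log(healthFromStatus("Up 2 hours")); // none
```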

---

### Filter Unhealthy Services

```bash
# Show only unhealthy
docker compose ps | grep unhealthy

# Count unhealthy (all containers on this host; `docker compose ps --status`
# does not accept health states)
docker ps -q --filter health=unhealthy | wc -l
```

---

### Inspect Health Check Details

```bash
# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'

# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}
```

**Key fields**:
- **Status**: `healthy`, `unhealthy`, or `starting`
- **FailingStreak**: Consecutive failed checks
- **Log**: Last 5 health check results

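For scripts that consume this JSON, the fields can be typed and condensed into a one-line summary. The sketch below assumes the shape shown in the example output; the `ContainerHealth` type and `summarize` function are hypothetical names, and the sample object repeats the example values.

```typescript
interface HealthLogEntry {
  Start: string;
  End: string;
  ExitCode: number;
  Output: string;
}

interface ContainerHealth {
  Status: "healthy" | "unhealthy" | "starting";
  FailingStreak: number;
  Log: HealthLogEntry[];
}

// Condenses State.Health into a single line, using the most recent log entry.
function summarize(health: ContainerHealth): string {
  const last = health.Log[health.Log.length - 1];
  if (!last) return `${health.Status} (no checks recorded yet)`;
  const durationMs = Date.parse(last.End) - Date.parse(last.Start);
  return `${health.Status}, failing streak ${health.FailingStreak}, last check exit ${last.ExitCode} in ${durationMs} ms`;
}

const sample: ContainerHealth = {
  Status: "healthy",
  FailingStreak: 0,
  Log: [
    { Start: "2026-02-13T14:30:00Z", End: "2026-02-13T14:30:01Z", ExitCode: 0, Output: "" },
  ],
};

console.log(summarize(sample));
// healthy, failing streak 0, last check exit 0 in 1000 ms
```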

---

### Health Check Logs

```bash
# View health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'

# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}

# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}
```

---

## Custom Health Checks

### Advanced API Health Check

**Check database + Redis connectivity**:

**api/src/server.ts**:
```typescript
app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };

  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }

  const healthy = checks.database && checks.redis;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```

**docker-compose.yml** (no change needed; still checks `/api/health`):
```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

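The response-building part of the handler above can be pulled into a pure function, which makes the 200/503 decision easy to unit test without a database or Redis. This is one possible refactoring sketch, not the project's code; `buildHealthResponse` and `CheckResults` are hypothetical names. The 503 matters because `wget` exits non-zero on an HTTP error response, which is what makes the Docker check fail.

```typescript
type CheckResults = Record<string, boolean>;

// Derives the HTTP status and JSON body from individual dependency checks.
// Any failed check degrades the whole service.
function buildHealthResponse(checks: CheckResults, now: Date = new Date()) {
  const healthy = Object.keys(checks).every((name) => checks[name]);
  return {
    statusCode: healthy ? 200 : 503,
    body: {
      status: healthy ? "ok" : "degraded",
      checks,
      timestamp: now.toISOString(),
    },
  };
}

console.log(buildHealthResponse({ database: true, redis: true }).statusCode); // 200
console.log(buildHealthResponse({ database: true, redis: false }).statusCode); // 503
```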

---

### Readiness vs Liveness

- **Readiness**: Service is ready to accept traffic (a distinction Kubernetes makes with separate probes)
- **Liveness**: Service is running and responding (what the Docker health checks here verify)

**Example** (separate endpoints):
```typescript
// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});

// Readiness (comprehensive check)
// checkDatabase() and checkRedis() are helpers not shown here;
// each is assumed to resolve to a boolean.
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});
```

- **Docker uses liveness** (`/api/health`).
- **Load balancer uses readiness** (`/api/ready`).

---

## Troubleshooting

### Service Marked Unhealthy

**Diagnosis**:
```bash
# Check logs
docker compose logs --tail=50 api

# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'

# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health
```

**Common causes**:
- Service crashed (check logs)
- Health endpoint broken (test manually)
- Timeout too short (increase in docker-compose.yml)
- Database migration running (increase start_period)

---

### Container Restarting Loop

**Symptoms**: Container repeatedly marked unhealthy → restart → unhealthy

**Diagnosis**:
```bash
# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'

# Check logs for errors
docker compose logs api | grep -i error
```

**Common causes**:
- Health check too aggressive (increase retries/interval)
- Service genuinely broken (fix code issue)
- Resource limits too low (increase memory/CPU)

**Solution**:
```yaml
# Temporarily disable health check
healthcheck:
  disable: true

# Or increase tolerance
healthcheck:
  retries: 10
  start_period: 60s
```

---

### Health Check Command Not Found

**Symptoms**: Health check fails with "wget: not found" or "curl: not found"

**Cause**: Using wrong command for image type (Alpine vs Debian)

**Solution**:

**Alpine images** (api, media-api, redis, v2-postgres):
```yaml
test: ["CMD", "wget", "-q", "--spider", "http://..."]
```

**Debian images** (gitea-app):
```yaml
test: ["CMD", "curl", "-f", "http://..."]
```

---

### Start Period Too Short

**Symptoms**: Service marked unhealthy shortly after startup

**Cause**: Database migrations or slow startup exceed start_period

**Solution**:
```yaml
# Increase start_period
healthcheck:
  start_period: 60s  # Was 30s
```

**Monitor startup time**:
```bash
# Measure time to first healthy
# Reset bash's SECONDS counter so it measures only this startup.
SECONDS=0
docker compose up -d api && \
  # Match "(healthy)" exactly; plain "healthy" would also match "(unhealthy)".
  while ! docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
  echo "Startup took $SECONDS seconds"
```

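Once a few startup times have been measured, a `start_period` can be derived from them. The sketch below uses an assumed rule of thumb (slowest observed startup plus 50% headroom, rounded up to a multiple of 5 seconds); the rule, the `suggestStartPeriod` helper, and the sample measurements are illustrative and not taken from the project.

```typescript
// Suggests a start_period (in seconds) from measured startup times.
// headroom is a safety factor applied to the slowest measurement.
function suggestStartPeriod(measuredSeconds: number[], headroom = 1.5): number {
  if (measuredSeconds.length === 0) throw new Error("need at least one measurement");
  const slowest = Math.max(...measuredSeconds);
  return Math.ceil((slowest * headroom) / 5) * 5;
}

// Hypothetical measurements from three cold starts.
console.log(`start_period: ${suggestStartPeriod([22, 31, 27])}s`); // start_period: 50s
```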

---

## Production Recommendations

### Timeout Configuration

**Critical services** (database, redis, api):
- interval: 10-15s
- timeout: 5s
- retries: 3-5
- start_period: 30-60s

**Supporting services** (n8n, gitea, mailhog):
- interval: 30-60s
- timeout: 10s
- retries: 3
- start_period: 30s

---

### Restart Policies

**Combine with restart policies**:
```yaml
api:
  restart: unless-stopped  # Auto-restart when the container exits
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

**Effect**: If the container exits (for example after a crash), Docker restarts it and health checks resume. A container that is only marked unhealthy keeps running until something stops or restarts it, such as Docker Swarm or an external watcher.

---

### Monitoring Integration

**Prometheus exporter** (future):
```bash
# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1
```

**Alert on unhealthy**:
```yaml
- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"
```

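Since the exporter is still planned, the mapping from health state to metric value is open. One possible sketch: render `healthy` as 1 and everything else as 0, so the alert rule above fires only after 5 minutes of a non-healthy state. The `toMetricLine` helper and the choice to report `starting` as 0 are assumptions for illustration; the metric name is the one used in the example above.

```typescript
type HealthStatus = "healthy" | "unhealthy" | "starting";

// Renders one line in the Prometheus text exposition format.
function toMetricLine(container: string, status: HealthStatus): string {
  const value = status === "healthy" ? 1 : 0;
  return `docker_healthcheck_status{container="${container}"} ${value}`;
}

console.log(toMetricLine("changemaker-v2-api", "healthy"));
// docker_healthcheck_status{container="changemaker-v2-api"} 1
console.log(toMetricLine("changemaker-v2-api", "unhealthy"));
// docker_healthcheck_status{container="changemaker-v2-api"} 0
```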

---

## Testing Health Checks

### Manual Test

```bash
# Start service
docker compose up -d api

# Watch health status
watch -n2 'docker compose ps api'

# Should see:
# (health: starting) → (healthy)
```

---

### Simulate Failure

```bash
# Stop backend service
docker compose stop v2-postgres

# Wait for 3 failed checks (3 × 15s API health check interval, plus margin)
sleep 60

# Check API status
docker compose ps api
# Should show (unhealthy) after 3 failures (45s), provided /api/health
# checks the database (see Advanced API Health Check)

# Restart backend
docker compose start v2-postgres

# API should recover
docker compose ps api
# Should show (healthy) after successful check
```

---

## Related Documentation

- **[Docker Compose](docker-compose.md)**: Service orchestration
- **[Monitoring Stack](monitoring-stack.md)**: Health metrics
- **[Troubleshooting](../troubleshooting/common-issues.md)**: Debug failing services