# Docker Health Check Configuration

## Overview

Docker health checks let Docker probe each service on a schedule and report it as `starting`, `healthy`, or `unhealthy`. Changemaker Lite V2 defines health checks for 7 critical services.

**Benefits:**

- Automatic detection of unhealthy containers (on plain Docker Engine a restart additionally needs an orchestrator such as Swarm or a watchdog container; a `restart:` policy alone reacts only to container exit)
- Dependency management (`depends_on` with `service_healthy`)
- Monitoring integration (health status can be exported for Prometheus to scrape)

---

## Services with Health Checks

| Service | Healthcheck Command | Interval | Timeout | Retries | Start Period |
|---------|---------------------|----------|---------|---------|--------------|
| **api** | `wget http://localhost:4000/api/health` | 15s | 5s | 3 | 30s |
| **media-api** | `wget http://127.0.0.1:4100/health` | 15s | 5s | 3 | 30s |
| **admin** | `wget http://127.0.0.1:3000/` | 30s | 5s | 3 | 20s |
| **v2-postgres** | `pg_isready -U changemaker` | 10s | 5s | 5 | - |
| **redis** | `redis-cli -a $REDIS_PASSWORD ping` | 10s | 5s | 5 | - |
| **gitea-app** | `curl http://localhost:3000/` | 30s | 5s | 3 | 30s |
| **n8n** | `wget http://localhost:5678/healthz` | 30s | 5s | 3 | 30s |

---

## Health Check Configuration

### API (Express)

**docker-compose.yml**:

```yaml
api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:

- **test**: Runs `wget` (shipped with Alpine images) against the `/api/health` endpoint
- **interval**: Check every 15 seconds
- **timeout**: Fail the check if there is no response within 5 seconds
- **retries**: Mark unhealthy after 3 consecutive failures
- **start_period**: 30s grace period on startup (allows migrations to run); failures during this window do not count toward `retries`

**Health endpoint** (api/src/server.ts):

```typescript
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});
```

**Health states**:

- **starting**: Initial state until the first check passes; failures inside the 30s start_period are ignored
- **healthy**: The most recent check passed
- **unhealthy**: 3 consecutive failures

---

### Media API (Fastify)

**docker-compose.yml**:

```yaml
media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Health endpoint** (api/src/media-server.ts):

```typescript
app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});
```

**Note**: Uses `127.0.0.1` instead of `localhost`. In Alpine-based containers `localhost` can resolve to the IPv6 address `::1` first, and the check then fails if the server listens only on IPv4.

---

### Admin (Vite Dev Server)

**docker-compose.yml**:

```yaml
admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s
```

**Explanation**:

- **30s interval**: Less critical than the backend (the frontend can tolerate brief downtime)
- **20s start period**: The Vite dev server starts quickly
- **Root path**: Checks that Vite is serving HTML (there is no dedicated /health endpoint)

---

### V2 PostgreSQL

**docker-compose.yml**:

```yaml
v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:

- **pg_isready**: Built-in PostgreSQL health check utility
- **10s interval**: Fast detection of database issues
- **5 retries**: More tolerant (database startup can be slow)
- **No start_period**: Slow startup is absorbed by the 5 retries instead (roughly 50s of failed checks before the container is marked unhealthy)

**pg_isready output**:

```bash
# Healthy
/var/run/postgresql:5432 - accepting connections

# Unhealthy
/var/run/postgresql:5432 - rejecting connections
```

---

### Redis

**docker-compose.yml**:

```yaml
redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
```

**Explanation**:

- **redis-cli ping**: Returns `PONG` if healthy
- **-a ${REDIS_PASSWORD}**: Authenticates with the password (required since the Security Audit)
- **10s interval**: Fast detection for a critical cache service

**PING output**:

```bash
# Healthy
PONG

# Unhealthy
(error) NOAUTH Authentication required
```

---

### Gitea

**docker-compose.yml**:

```yaml
gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:

- **curl**: Debian-based image (no `wget`)
- **-f**: Exit non-zero on HTTP error responses (status 400 or higher)
- **30s interval**: Supporting service (less critical)

**Important**: Gitea uses `curl` (not `wget`) because it's a Debian image, not Alpine.

---

### n8n

**docker-compose.yml**:

```yaml
n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s
```

**Explanation**:

- **/healthz**: n8n's built-in health endpoint
- **30s interval**: Workflow automation (not user-facing)

---

## Dependency Chains

### API Depends on Database + Redis

**docker-compose.yml**:

```yaml
api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
```

**Effect**: The API container waits for PostgreSQL and Redis to be healthy before starting.

**Startup sequence**:

1. PostgreSQL starts → health checks begin
2. After the first successful check → marked healthy
3. Redis starts → health checks begin
4. After the first successful check → marked healthy
5. API starts (both dependencies healthy)

A single passing check is enough to mark a container healthy; `retries: 5` only sets how many consecutive failures mark it unhealthy.

---

### Media API Depends on Database

**docker-compose.yml**:

```yaml
media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: Media API waits for PostgreSQL to be healthy.

---

### NocoDB Depends on Database

**docker-compose.yml**:

```yaml
nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy
```

**Effect**: NocoDB waits for its metadata database to be ready.
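
The three `depends_on` declarations above imply a start order. The following is a minimal sketch of how such an order can be derived with a depth-first topological sort. It is a simplified model for reasoning about the chain, not Compose's actual implementation; `startupOrder` and the `DependsOn` type are hypothetical names introduced only for this illustration.

```typescript
// Hypothetical model: service name -> services it waits on (service_healthy).
type DependsOn = Record<string, string[]>;

// Returns the services in an order where every dependency precedes its dependents.
// Throws if the declarations contain a cycle.
function startupOrder(dependsOn: DependsOn): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  const visit = (service: string, path: string[]): void => {
    if (state[service] === "done") return;
    if (state[service] === "visiting") {
      throw new Error(`dependency cycle: ${[...path, service].join(" -> ")}`);
    }
    state[service] = "visiting";
    // Services without an entry (e.g. v2-postgres) have no dependencies.
    for (const dep of dependsOn[service] ?? []) {
      visit(dep, [...path, service]);
    }
    state[service] = "done";
    order.push(service);
  };

  for (const service of Object.keys(dependsOn)) {
    visit(service, []);
  }
  return order;
}

// The dependency chains documented in this section.
const order = startupOrder({
  api: ["v2-postgres", "redis"],
  "media-api": ["v2-postgres"],
  "nocodb-v2": ["v2-postgres"],
});
console.log(order.join(" -> "));
// prints "v2-postgres -> redis -> api -> media-api -> nocodb-v2"
```

The sketch only orders the services. The waiting itself is done by Compose, which holds each dependent back until its dependencies report `healthy`.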
---

## Monitoring Healthcheck Status

### View Health Status

```bash
# All services (shows health in STATUS column)
docker compose ps

# Example output:
# NAME                      STATUS
# changemaker-v2-api        Up 2 hours (healthy)
# changemaker-v2-postgres   Up 2 hours (healthy)
# redis-changemaker         Up 2 hours (healthy)
```

**Health states**:

- `(healthy)`: The most recent check passed
- `(unhealthy)`: The configured number of consecutive checks (`retries`) failed
- `(health: starting)`: No check has passed yet (failures inside start_period are not counted)

---

### Filter Unhealthy Services

```bash
# Show only unhealthy
docker compose ps | grep unhealthy

# Count unhealthy (health is a `docker ps` filter, not a `docker compose ps --status` value)
docker ps -q --filter health=unhealthy | wc -l
```

---

### Inspect Health Check Details

```bash
# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'

# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}
```

**Key fields**:

- **Status**: `healthy`, `unhealthy`, or `starting`
- **FailingStreak**: Consecutive failed checks
- **Log**: Last 5 health check results

---

### Health Check Logs

```bash
# View the most recent health check result
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'

# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}

# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}
```

---

## Custom Health Checks

### Advanced API Health Check

**Check database + Redis connectivity**:

**api/src/server.ts**:

```typescript
// Assumes `prisma` (Prisma client) and `redis` (client exposing ping())
// are already initialised elsewhere in this module.
app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };

  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }

  // Healthy only if every dependency responded
  const healthy =
    checks.database && checks.redis;

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
```

**docker-compose.yml** (no change needed; it still checks `/api/health`):

```yaml
healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

`wget --spider` exits non-zero on the 503 response, so a failed database or Redis check now marks the API container unhealthy.

---

### Readiness vs Liveness

**Readiness**: Service is ready to accept traffic (used by Kubernetes)

**Liveness**: Service is running (Docker health checks)

**Example** (separate endpoints):

```typescript
// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});

// Readiness (comprehensive check)
// checkDatabase() and checkRedis() are placeholders that resolve to a boolean.
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});
```

**Docker uses liveness** (`/api/health`).

**Load balancer uses readiness** (`/api/ready`).
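
A readiness probe that hangs on a stalled dependency will itself run into the 5s health check timeout. One way to bound each dependency check is sketched below, assuming each check can be expressed as a function returning a promise. `probeWithTimeout`, `readiness`, and the 2000ms default are hypothetical choices for illustration, not part of the Changemaker codebase.

```typescript
type Probe = () => Promise<unknown>;

interface ReadinessResult {
  statusCode: number;
  ready: boolean;
  checks: Record<string, boolean>;
}

// Resolves to true if `probe` settles successfully within `ms`,
// and to false if it throws, rejects, or takes longer than `ms`.
async function probeWithTimeout(probe: Probe, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`probe timed out after ${ms}ms`)), ms);
  });
  try {
    await Promise.race([probe(), timeout]);
    return true;
  } catch {
    return false;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Runs all probes in parallel and maps the outcome to an HTTP status code:
// 200 when every probe passed, 503 otherwise.
async function readiness(
  probes: Record<string, Probe>,
  ms = 2000, // assumed budget, chosen to stay well below the 5s Docker timeout
): Promise<ReadinessResult> {
  const names = Object.keys(probes);
  const results = await Promise.all(
    names.map((name) => probeWithTimeout(probes[name], ms)),
  );
  const checks: Record<string, boolean> = {};
  names.forEach((name, i) => {
    checks[name] = results[i];
  });
  const ready = results.every(Boolean);
  return { statusCode: ready ? 200 : 503, ready, checks };
}
```

In the `/api/ready` handler, the probes could wrap the same calls the advanced health check uses (for example the `SELECT 1` query and `redis.ping()`), with the handler sending `statusCode` and the remaining fields as the JSON body.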
---

## Troubleshooting

### Service Marked Unhealthy

**Diagnosis**:

```bash
# Check logs
docker compose logs --tail=50 api

# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'

# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health
```

**Common causes**:

- Service crashed (check logs)
- Health endpoint broken (test manually)
- Timeout too short (increase in docker-compose.yml)
- Database migration running (increase start_period)

---

### Container Restarting Loop

**Symptoms**: Container is repeatedly marked unhealthy, restarted (by a crash, an orchestrator, or a watchdog), then marked unhealthy again

**Diagnosis**:

```bash
# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'

# Check logs for errors
docker compose logs api | grep -i error
```

**Common causes**:

- Health check too aggressive (increase retries/interval)
- Service genuinely broken (fix code issue)
- Resource limits too low (increase memory/CPU)

**Solution**:

```yaml
# Option 1: temporarily disable the health check
healthcheck:
  disable: true

# Option 2 (instead of option 1): increase tolerance
healthcheck:
  retries: 10
  start_period: 60s
```

---

### Health Check Command Not Found

**Symptoms**: Health check fails with "wget: not found" or "curl: not found"

**Cause**: Using the wrong command for the image type (Alpine vs Debian)

**Solution**:

**Alpine images** (api, media-api, redis, v2-postgres):

```yaml
test: ["CMD", "wget", "-q", "--spider", "http://..."]
```

**Debian images** (gitea-app):

```yaml
test: ["CMD", "curl", "-f", "http://..."]
```

---

### Start Period Too Short

**Symptoms**: Service marked unhealthy shortly after startup

**Cause**: Database migrations or slow startup exceed start_period

**Solution**:

```yaml
# Increase start_period
healthcheck:
  start_period: 60s  # Was 30s
```

**Monitor startup time**:

```bash
# Measure time to first healthy (SECONDS is a bash built-in counter; reset it first)
SECONDS=0
docker compose up -d api && \
while ! \
  docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
echo "Startup took $SECONDS seconds"
```

The pattern is `(healthy)` with parentheses because a plain `healthy` would also match `(unhealthy)`.

---

## Production Recommendations

### Timeout Configuration

**Critical services** (database, redis, api):

- interval: 10-15s
- timeout: 5s
- retries: 3-5
- start_period: 30-60s

**Supporting services** (n8n, gitea, mailhog):

- interval: 30-60s
- timeout: 10s
- retries: 3
- start_period: 30s

---

### Restart Policies

**Combine with restart policies**:

```yaml
api:
  restart: unless-stopped  # Restarts the container when its process exits
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```

**Effect**: If the service process crashes, Docker restarts the container and health checks resume. On plain Docker Engine an `unhealthy` status alone does not trigger a restart; that requires Docker Swarm or a watchdog container that acts on the health status.

---

### Monitoring Integration

**Prometheus exporter** (future):

```bash
# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1
```

**Alert on unhealthy**:

```yaml
- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"
```

---

## Testing Health Checks

### Manual Test

```bash
# Start service
docker compose up -d api

# Watch health status
watch -n2 'docker compose ps api'

# Should see:
# (health: starting) → (healthy)
```

---

### Simulate Failure

```bash
# Requires the advanced /api/health endpoint that checks the database;
# the minimal endpoint keeps returning 200 while PostgreSQL is down.

# Stop backend service
docker compose stop v2-postgres

# Wait for 3 consecutive failed checks (3 x 15s interval, plus margin)
sleep 60

# Check API status
docker compose ps api
# Should show (unhealthy)

# Restart backend
docker compose start v2-postgres

# API should recover
docker compose ps api
# Should show (healthy) after the next successful check
```

---

## Related Documentation

- **[Docker Compose](docker-compose.md)**: Service orchestration
- **[Monitoring Stack](monitoring-stack.md)**: Health metrics
- **[Troubleshooting](../troubleshooting/common-issues.md)**: Debug failing services