
Docker Health Check Configuration

Overview

Docker health checks provide automatic service monitoring and restart capabilities. Changemaker Lite V2 includes health checks for 7 critical services.

Benefits:

  • Fast detection of unhealthy containers (automatic restart additionally requires an orchestrator or sidecar that acts on health status)
  • Dependency management (depends_on with service_healthy)
  • Monitoring integration (Prometheus can scrape health status)
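
To see at a glance which services in the compose file define health checks, the rendered config can be queried. This is a sketch assuming a Compose CLI recent enough to support `--format json`, plus `jq` on the host:

```shell
# Print the names of services that declare a healthcheck
docker compose config --format json \
  | jq -r '.services | to_entries[] | select(.value.healthcheck) | .key'
```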

Services with Health Checks

Service     | Healthcheck Command                   | Interval | Timeout | Retries | Start Period
----------- | ------------------------------------- | -------- | ------- | ------- | ------------
api         | wget http://localhost:4000/api/health | 15s      | 5s      | 3       | 30s
media-api   | wget http://127.0.0.1:4100/health     | 15s      | 5s      | 3       | 30s
admin       | wget http://127.0.0.1:3000/           | 30s      | 5s      | 3       | 20s
v2-postgres | pg_isready -U changemaker             | 10s      | 5s      | 5       | -
redis       | redis-cli -a $REDIS_PASSWORD ping     | 10s      | 5s      | 5       | -
gitea-app   | curl http://localhost:3000/           | 30s      | 5s      | 3       | 30s
n8n         | wget http://localhost:5678/healthz    | 30s      | 5s      | 3       | 30s

Health Check Configuration

API (Express)

docker-compose.yml:

api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • test: Runs wget (Alpine image standard) to check /api/health endpoint
  • interval: Check every 15 seconds
  • timeout: Fail if no response in 5 seconds
  • retries: Mark unhealthy after 3 consecutive failures
  • start_period: 30s grace period on startup (allows migrations to run)
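
Taken together, these settings bound how long an outage can go undetected. A rough sketch (one failed probe per interval, ignoring the final probe's 5s timeout):

```shell
# Back-of-envelope: worst-case delay before Docker marks the api
# service unhealthy, using the settings above.
interval=15
retries=3
echo "worst case: $(( interval * retries ))s"   # 45s for the api service
```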

Health endpoint (api/src/server.ts):

app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

Health states:

  • starting: Within start_period (30s)
  • healthy: Check passed
  • unhealthy: 3 consecutive failures

Media API (Fastify)

docker-compose.yml:

media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

Health endpoint (api/src/media-server.ts):

app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});

Note: Uses 127.0.0.1 instead of localhost. In Alpine images, localhost may resolve to the IPv6 address ::1 first, and the probe fails if the service only listens on IPv4.


Admin (Vite Dev Server)

docker-compose.yml:

admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s

Explanation:

  • 30s interval: Less critical than backend (frontend can tolerate brief downtime)
  • 20s start period: Vite dev server starts quickly
  • Root path: Checks Vite is serving HTML (no dedicated /health endpoint)

V2 PostgreSQL

docker-compose.yml:

v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5

Explanation:

  • pg_isready: Built-in PostgreSQL health check utility
  • 10s interval: Fast detection of database issues
  • 5 retries: More tolerant (database startup can be slow)
  • No start_period: with 5 retries at 10s intervals, PostgreSQL already gets roughly 50 seconds to finish starting before being marked unhealthy

pg_isready output:

# Healthy
/var/run/postgresql:5432 - accepting connections

# Unhealthy
/var/run/postgresql:5432 - rejecting connections
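
The same probe can be run by hand to read the result directly (service name as in this compose file; exit-code meanings per pg_isready's documentation):

```shell
docker compose exec v2-postgres pg_isready -U changemaker
# pg_isready exit codes: 0 = accepting connections, 1 = rejecting
# (e.g. still starting up), 2 = no response, 3 = no attempt made
echo "exit code: $?"
```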

Redis

docker-compose.yml:

redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5

Explanation:

  • redis-cli ping: Returns PONG if healthy
  • -a ${REDIS_PASSWORD}: Authenticates with password (required since Security Audit)
  • 10s interval: Fast detection for critical cache service

PING output:

# Healthy
PONG

# Unhealthy
(error) NOAUTH Authentication required
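
One caveat: passing the password with -a makes redis-cli print a security warning on stderr. The check still passes (the exit code is 0), but for manual probing the warning can be avoided by supplying the password via the REDISCLI_AUTH environment variable instead. A sketch, not the compose file's current check; it assumes REDIS_PASSWORD is exported in the host shell (e.g. from .env):

```shell
docker compose exec -e REDISCLI_AUTH="$REDIS_PASSWORD" redis redis-cli ping
```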

Gitea

docker-compose.yml:

gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • curl: Debian-based image (no wget)
  • -f: Fail on HTTP errors (non-200 response)
  • 30s interval: Supporting service (less critical)

Important: Gitea uses curl (not wget) because it's a Debian image, not Alpine.


n8n

docker-compose.yml:

n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • /healthz: n8n's built-in health endpoint
  • 30s interval: Workflow automation (not user-facing)

Dependency Chains

API Depends on Database + Redis

docker-compose.yml:

api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy

Effect: API container waits for PostgreSQL + Redis to be healthy before starting.

Startup sequence:

  1. PostgreSQL starts → health checks begin
  2. After 5 successful checks → marked healthy
  3. Redis starts → health checks begin
  4. After 5 successful checks → marked healthy
  5. API starts (both dependencies healthy)

Media API Depends on Database

docker-compose.yml:

media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy

Effect: Media API waits for PostgreSQL to be healthy.


NocoDB Depends on Database

docker-compose.yml:

nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy

Effect: NocoDB waits for its metadata database to be ready.


Monitoring Healthcheck Status

View Health Status

# All services (shows health in STATUS column)
docker compose ps

# Example output:
# NAME                    STATUS
# changemaker-v2-api      Up 2 hours (healthy)
# changemaker-v2-postgres Up 2 hours (healthy)
# redis-changemaker       Up 2 hours (healthy)

Health states:

  • (healthy): All checks passing
  • (unhealthy): Multiple checks failed
  • (health: starting): Within start_period

Filter Unhealthy Services

# Show only unhealthy
docker compose ps | grep unhealthy

# Count unhealthy (docker compose ps --status filters on container state,
# not health; use docker ps with a health filter instead)
docker ps --filter "health=unhealthy" -q | wc -l

Inspect Health Check Details

# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'

# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}

Key fields:

  • Status: healthy, unhealthy, or starting
  • FailingStreak: Consecutive failed checks
  • Log: Last 5 health check results
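
These fields also make a compact scripted probe. A sketch that exits non-zero unless the api container is healthy (container name from the inspect example above; Go-template syntax is docker inspect's -f flag):

```shell
status=$(docker inspect -f '{{.State.Health.Status}}' changemaker-v2-api)
[ "$status" = "healthy" ] || { echo "api is $status" >&2; exit 1; }
```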

Health Check Logs

# View health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'

# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}

# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}

Custom Health Checks

Advanced API Health Check

Check database + Redis connectivity:

api/src/server.ts:

app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };

  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }

  const healthy = checks.database && checks.redis;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

docker-compose.yml (no change needed — still checks /api/health):

healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]

Readiness vs Liveness

Readiness: Service is ready to accept traffic (used by Kubernetes)
Liveness: Service is running (Docker health checks)

Example (separate endpoints):

// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});

// Readiness (comprehensive check)
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});

Docker uses liveness (/api/health).
Load balancer uses readiness (/api/ready).


Troubleshooting

Service Marked Unhealthy

Diagnosis:

# Check logs
docker compose logs --tail=50 api

# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'

# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health

Common causes:

  • Service crashed (check logs)
  • Health endpoint broken (test manually)
  • Timeout too short (increase in docker-compose.yml)
  • Database migration running (increase start_period)

Container Restarting Loop

Symptoms: Container repeatedly marked unhealthy → restart → unhealthy

Diagnosis:

# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'

# Check logs for errors
docker compose logs api | grep -i error

Common causes:

  • Health check too aggressive (increase retries/interval)
  • Service genuinely broken (fix code issue)
  • Resource limits too low (increase memory/CPU)

Solution:

# Temporarily disable health check
healthcheck:
  disable: true

# Or increase tolerance
healthcheck:
  retries: 10
  start_period: 60s

Health Check Command Not Found

Symptoms: Health check fails with "wget: not found" or "curl: not found"

Cause: Using wrong command for image type (Alpine vs Debian)

Solution:

Alpine images (api, media-api, redis, v2-postgres):

test: ["CMD", "wget", "-q", "--spider", "http://..."]

Debian images (gitea-app):

test: ["CMD", "curl", "-f", "http://..."]

Start Period Too Short

Symptoms: Service marked unhealthy immediately on startup

Cause: Database migrations or slow startup exceed start_period

Solution:

# Increase start_period
healthcheck:
  start_period: 60s  # Was 30s

Monitor startup time:

# Measure time to first healthy (bash; reset $SECONDS so it measures this run)
SECONDS=0
docker compose up -d api && \
  while ! docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
  echo "Startup took $SECONDS seconds"

Production Recommendations

Timeout Configuration

Critical services (database, redis, api):

  • interval: 10-15s
  • timeout: 5s
  • retries: 3-5
  • start_period: 30-60s

Supporting services (n8n, gitea, mailhog):

  • interval: 30-60s
  • timeout: 10s
  • retries: 3
  • start_period: 30s
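
To keep these tiers consistent across services, the shared numbers can live in a YAML anchor. Compose ignores top-level x- extension fields, so this is purely organizational (a sketch using the critical-tier values above):

```yaml
x-critical-healthcheck: &critical-hc
  interval: 15s
  timeout: 5s
  retries: 5
  start_period: 60s

services:
  api:
    healthcheck:
      <<: *critical-hc
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```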

Restart Policies

Combine with restart policies:

api:
  restart: unless-stopped  # Auto-restart on failure
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]

Effect: A container whose process crashes or exits is restarted automatically, and health checks resume after the restart. Note that restart policies react to the process exiting, not to health status alone.
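
Because plain Docker (without Swarm) does not restart a container merely for being unhealthy, a sidecar such as willfarrell/autoheal can close that gap by watching health events and restarting flagged containers. A hedged sketch (image and environment variable per that project's README, not part of this compose file):

```yaml
autoheal:
  image: willfarrell/autoheal
  restart: unless-stopped
  environment:
    - AUTOHEAL_CONTAINER_LABEL=all  # or a label name to opt containers in
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
```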


Monitoring Integration

Prometheus exporter (future):

# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1

Alert on unhealthy:

- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"

Testing Health Checks

Manual Test

# Start service
docker compose up -d api

# Watch health status
watch -n2 'docker compose ps api'

# Should see:
# (health: starting) → (healthy)

Simulate Failure

# Stop backend service
docker compose stop v2-postgres

# Wait 15s (API health check interval)
sleep 15

# Check API status
docker compose ps api
# Shows (unhealthy) after 3 failed checks (~45s). Note: this requires the
# advanced /api/health above that verifies database connectivity; the
# minimal endpoint returns ok whether or not the database is up.

# Restart backend
docker compose start v2-postgres

# API should recover
docker compose ps api
# Should show (healthy) after successful check