
Docker Health Check Configuration

Overview

Docker health checks provide automatic service monitoring and restart capabilities. Changemaker Lite V2 includes health checks for 7 critical services.

Benefits:

  • Fast detection of unhealthy containers (automatic restart additionally requires an orchestrator or sidecar that acts on health status)
  • Dependency management (depends_on with service_healthy)
  • Monitoring integration (Prometheus can scrape health status)
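
To see at a glance which services in the compose file define health checks, the rendered config can be queried. This is a sketch assuming a Compose CLI recent enough to support `--format json`, plus `jq` on the host:

```shell
# Print the names of services that declare a healthcheck
docker compose config --format json \
  | jq -r '.services | to_entries[] | select(.value.healthcheck) | .key'
```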

Services with Health Checks

Service     | Healthcheck Command                   | Interval | Timeout | Retries | Start Period
----------- | ------------------------------------- | -------- | ------- | ------- | ------------
api         | wget http://localhost:4000/api/health | 15s      | 5s      | 3       | 30s
media-api   | wget http://127.0.0.1:4100/health     | 15s      | 5s      | 3       | 30s
admin       | wget http://127.0.0.1:3000/           | 30s      | 5s      | 3       | 20s
v2-postgres | pg_isready -U changemaker             | 10s      | 5s      | 5       | -
redis       | redis-cli -a $REDIS_PASSWORD ping     | 10s      | 5s      | 5       | -
gitea-app   | curl http://localhost:3000/           | 30s      | 5s      | 3       | 30s
n8n         | wget http://localhost:5678/healthz    | 30s      | 5s      | 3       | 30s

Health Check Configuration

API (Express)

docker-compose.yml:

api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • test: Runs wget (Alpine image standard) to check /api/health endpoint
  • interval: Check every 15 seconds
  • timeout: Fail if no response in 5 seconds
  • retries: Mark unhealthy after 3 consecutive failures
  • start_period: 30s grace period on startup (allows migrations to run)
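
Taken together, these settings bound how long an outage can go undetected. A rough sketch (one failed probe per interval, ignoring the final probe's 5s timeout):

```shell
# Back-of-envelope: worst-case delay before Docker marks the api
# service unhealthy, using the settings above.
interval=15
retries=3
echo "worst case: $(( interval * retries ))s"   # 45s for the api service
```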

Health endpoint (api/src/server.ts):

app.get('/api/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

Health states:

  • starting: Within start_period (30s)
  • healthy: Check passed
  • unhealthy: 3 consecutive failures

Media API (Fastify)

docker-compose.yml:

media-api:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:4100/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s

Health endpoint (api/src/media-server.ts):

app.get('/health', async (req, reply) => {
  return { status: 'ok' };
});

Note: Uses 127.0.0.1 instead of localhost. In Alpine images, localhost may resolve to the IPv6 address ::1 first, and the probe fails if the service only listens on IPv4.


Admin (Vite Dev Server)

docker-compose.yml:

admin:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://127.0.0.1:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s

Explanation:

  • 30s interval: Less critical than backend (frontend can tolerate brief downtime)
  • 20s start period: Vite dev server starts quickly
  • Root path: Checks Vite is serving HTML (no dedicated /health endpoint)

V2 PostgreSQL

docker-compose.yml:

v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker"]
    interval: 10s
    timeout: 5s
    retries: 5

Explanation:

  • pg_isready: Built-in PostgreSQL health check utility
  • 10s interval: Fast detection of database issues
  • 5 retries: More tolerant (database startup can be slow)
  • No start_period: with 5 retries at 10s intervals, PostgreSQL already gets roughly 50 seconds to finish starting before being marked unhealthy

pg_isready output:

# Healthy
/var/run/postgresql:5432 - accepting connections

# Unhealthy
/var/run/postgresql:5432 - rejecting connections
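
The same probe can be run by hand to read the result directly (service name as in this compose file; exit-code meanings per pg_isready's documentation):

```shell
docker compose exec v2-postgres pg_isready -U changemaker
# pg_isready exit codes: 0 = accepting connections, 1 = rejecting
# (e.g. still starting up), 2 = no response, 3 = no attempt made
echo "exit code: $?"
```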

Redis

docker-compose.yml:

redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5

Explanation:

  • redis-cli ping: Returns PONG if healthy
  • -a ${REDIS_PASSWORD}: Authenticates with password (required since Security Audit)
  • 10s interval: Fast detection for critical cache service

PING output:

# Healthy
PONG

# Unhealthy
(error) NOAUTH Authentication required
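
One caveat: passing the password with -a makes redis-cli print a security warning on stderr. The check still passes (the exit code is 0), but for manual probing the warning can be avoided by supplying the password via the REDISCLI_AUTH environment variable instead. A sketch, not the compose file's current check; it assumes REDIS_PASSWORD is exported in the host shell (e.g. from .env):

```shell
docker compose exec -e REDISCLI_AUTH="$REDIS_PASSWORD" redis redis-cli ping
```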

Gitea

docker-compose.yml:

gitea-app:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3000/"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • curl: Debian-based image (no wget)
  • -f: Fail on HTTP errors (non-200 response)
  • 30s interval: Supporting service (less critical)

Important: Gitea uses curl (not wget) because it's a Debian image, not Alpine.


n8n

docker-compose.yml:

n8n:
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:5678/healthz"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

Explanation:

  • /healthz: n8n's built-in health endpoint
  • 30s interval: Workflow automation (not user-facing)

Dependency Chains

API Depends on Database + Redis

docker-compose.yml:

api:
  depends_on:
    v2-postgres:
      condition: service_healthy
    redis:
      condition: service_healthy

Effect: API container waits for PostgreSQL + Redis to be healthy before starting.

Startup sequence:

  1. PostgreSQL starts → health checks begin
  2. After 5 successful checks → marked healthy
  3. Redis starts → health checks begin
  4. After 5 successful checks → marked healthy
  5. API starts (both dependencies healthy)

Media API Depends on Database

docker-compose.yml:

media-api:
  depends_on:
    v2-postgres:
      condition: service_healthy

Effect: Media API waits for PostgreSQL to be healthy.


NocoDB Depends on Database

docker-compose.yml:

nocodb-v2:
  depends_on:
    v2-postgres:
      condition: service_healthy

Effect: NocoDB waits for its metadata database to be ready.


Monitoring Healthcheck Status

View Health Status

# All services (shows health in STATUS column)
docker compose ps

# Example output:
# NAME                    STATUS
# changemaker-v2-api      Up 2 hours (healthy)
# changemaker-v2-postgres Up 2 hours (healthy)
# redis-changemaker       Up 2 hours (healthy)

Health states:

  • (healthy): All checks passing
  • (unhealthy): Multiple checks failed
  • (health: starting): Within start_period

Filter Unhealthy Services

# Show only unhealthy
docker compose ps | grep unhealthy

# Count unhealthy (docker compose ps --status filters on container state,
# not health; use docker ps with a health filter instead)
docker ps --filter "health=unhealthy" -q | wc -l

Inspect Health Check Details

# Full health info for API
docker inspect changemaker-v2-api | jq '.[0].State.Health'

# Example output:
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2026-02-13T14:30:00Z",
      "End": "2026-02-13T14:30:01Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}

Key fields:

  • Status: healthy, unhealthy, or starting
  • FailingStreak: Consecutive failed checks
  • Log: Last 5 health check results
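
These fields also make a compact scripted probe. A sketch that exits non-zero unless the api container is healthy (container name from the inspect example above; Go-template syntax is docker inspect's -f flag):

```shell
status=$(docker inspect -f '{{.State.Health.Status}}' changemaker-v2-api)
[ "$status" = "healthy" ] || { echo "api is $status" >&2; exit 1; }
```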

Health Check Logs

# View health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1]'

# Example (success):
{
  "Start": "2026-02-13T14:30:00Z",
  "End": "2026-02-13T14:30:01Z",
  "ExitCode": 0,
  "Output": ""
}

# Example (failure):
{
  "Start": "2026-02-13T14:35:00Z",
  "End": "2026-02-13T14:35:05Z",
  "ExitCode": 1,
  "Output": "wget: can't connect to remote host (127.0.0.1): Connection refused"
}

Custom Health Checks

Advanced API Health Check

Check database + Redis connectivity:

api/src/server.ts:

app.get('/api/health', async (req, res) => {
  const checks = {
    database: false,
    redis: false,
  };

  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = true;
  } catch (err) {
    console.error('DB health check failed:', err);
  }

  try {
    await redis.ping();
    checks.redis = true;
  } catch (err) {
    console.error('Redis health check failed:', err);
  }

  const healthy = checks.database && checks.redis;
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

docker-compose.yml (no change needed — still checks /api/health):

healthcheck:
  test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]

Readiness vs Liveness

Readiness: Service is ready to accept traffic (used by Kubernetes)
Liveness: Service is running (Docker health checks)

Example (separate endpoints):

// Liveness (minimal check)
app.get('/api/health', (req, res) => {
  res.json({ status: 'ok' });
});

// Readiness (comprehensive check)
app.get('/api/ready', async (req, res) => {
  const dbReady = await checkDatabase();
  const redisReady = await checkRedis();
  const ready = dbReady && redisReady;
  res.status(ready ? 200 : 503).json({ ready, dbReady, redisReady });
});

Docker uses liveness (/api/health).
Load balancer uses readiness (/api/ready).


Troubleshooting

Service Marked Unhealthy

Diagnosis:

# Check logs
docker compose logs --tail=50 api

# Check health check output
docker inspect changemaker-v2-api | jq '.[0].State.Health.Log[-1].Output'

# Manually run health check
docker compose exec api wget -O- http://localhost:4000/api/health

Common causes:

  • Service crashed (check logs)
  • Health endpoint broken (test manually)
  • Timeout too short (increase in docker-compose.yml)
  • Database migration running (increase start_period)

Container Restarting Loop

Symptoms: Container repeatedly marked unhealthy → restart → unhealthy

Diagnosis:

# Check restart count
docker inspect changemaker-v2-api | jq '.[0].RestartCount'

# Check logs for errors
docker compose logs api | grep -i error

Common causes:

  • Health check too aggressive (increase retries/interval)
  • Service genuinely broken (fix code issue)
  • Resource limits too low (increase memory/CPU)

Solution:

# Temporarily disable health check
healthcheck:
  disable: true

# Or increase tolerance
healthcheck:
  retries: 10
  start_period: 60s

Health Check Command Not Found

Symptoms: Health check fails with "wget: not found" or "curl: not found"

Cause: Using wrong command for image type (Alpine vs Debian)

Solution:

Alpine images (api, media-api, redis, v2-postgres):

test: ["CMD", "wget", "-q", "--spider", "http://..."]

Debian images (gitea-app):

test: ["CMD", "curl", "-f", "http://..."]

Start Period Too Short

Symptoms: Service marked unhealthy immediately on startup

Cause: Database migrations or slow startup exceed start_period

Solution:

# Increase start_period
healthcheck:
  start_period: 60s  # Was 30s

Monitor startup time:

# Measure time to first healthy (bash; reset $SECONDS so it measures this run)
SECONDS=0
docker compose up -d api && \
  while ! docker compose ps api | grep -q '(healthy)'; do sleep 1; done && \
  echo "Startup took $SECONDS seconds"

Production Recommendations

Timeout Configuration

Critical services (database, redis, api):

  • interval: 10-15s
  • timeout: 5s
  • retries: 3-5
  • start_period: 30-60s

Supporting services (n8n, gitea, mailhog):

  • interval: 30-60s
  • timeout: 10s
  • retries: 3
  • start_period: 30s
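
To keep these tiers consistent across services, the shared numbers can live in a YAML anchor. Compose ignores top-level x- extension fields, so this is purely organizational (a sketch using the critical-tier values above):

```yaml
x-critical-healthcheck: &critical-hc
  interval: 15s
  timeout: 5s
  retries: 5
  start_period: 60s

services:
  api:
    healthcheck:
      <<: *critical-hc
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]
```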

Restart Policies

Combine with restart policies:

api:
  restart: unless-stopped  # Auto-restart on failure
  healthcheck:
    test: ["CMD", "wget", "-q", "--spider", "http://localhost:4000/api/health"]

Effect: A container whose process crashes or exits is restarted automatically, and health checks resume after the restart. Note that restart policies react to the process exiting, not to health status alone.
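
Because plain Docker (without Swarm) does not restart a container merely for being unhealthy, a sidecar such as willfarrell/autoheal can close that gap by watching health events and restarting flagged containers. A hedged sketch (image and environment variable per that project's README, not part of this compose file):

```yaml
autoheal:
  image: willfarrell/autoheal
  restart: unless-stopped
  environment:
    - AUTOHEAL_CONTAINER_LABEL=all  # or a label name to opt containers in
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
```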


Monitoring Integration

Prometheus exporter (future):

# Expose health check status as metrics
docker_healthcheck_status{container="changemaker-v2-api"} 1

Alert on unhealthy:

- alert: ContainerUnhealthy
  expr: docker_healthcheck_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} unhealthy"

Testing Health Checks

Manual Test

# Start service
docker compose up -d api

# Watch health status
watch -n2 'docker compose ps api'

# Should see:
# (health: starting) → (healthy)

Simulate Failure

# Stop backend service
docker compose stop v2-postgres

# Wait 15s (API health check interval)
sleep 15

# Check API status
docker compose ps api
# Shows (unhealthy) after 3 failed checks (~45s). Note: this requires the
# advanced /api/health above that verifies database connectivity; the
# minimal endpoint returns ok whether or not the database is up.

# Restart backend
docker compose start v2-postgres

# API should recover
docker compose ps api
# Should show (healthy) after successful check