Production Networking Preparation Plan

Context

The Changemaker Lite V2 application needs to be prepared for production deployment. The current architecture is development-focused with Docker Compose orchestration, Nginx reverse proxy, and Pangolin tunnel integration for SSL/TLS termination. The user wants a comprehensive understanding of the networking setup and identification of production readiness gaps before going live.

Why this is needed:

  • Current setup is optimized for local development (HTTP-only, MailHog, default passwords)
  • Production deployment requires SSL/TLS via Pangolin tunnel, real SMTP, security hardening
  • Need to identify all gaps between dev and production configurations
  • Need actionable checklist for production cutover

What prompted this:

  • User preparing to deploy production instance on betteredmonton.org domain
  • Need to understand networking architecture, security posture, and deployment requirements
  • Ensure all 12 subdomains route correctly through Pangolin tunnel

Intended outcome:

  • Comprehensive documentation of current networking architecture
  • Identified production readiness gaps with severity ratings
  • Prioritized checklist for production deployment
  • Configuration changes needed for production hardening

Current State Assessment

Network Architecture

Single-Bridge Network Design:

  • All 25+ services on one Docker bridge network (changemaker-lite)
  • Services communicate via container hostnames (DNS: 127.0.0.11)
  • Nginx acts as single reverse proxy for all external traffic
  • Pangolin tunnel (Newt container) provides SSL/TLS termination

Service Topology:

Internet → Pangolin Tunnel (HTTPS) → Newt Container → Nginx (HTTP:80) → Backend Services

Critical Services:

  • Express API (port 4000) - Main V2 API with Prisma ORM
  • Fastify Media API (port 4100) - Video library management
  • Admin GUI (port 3000) - React admin interface
  • PostgreSQL V2 (port 5433 localhost-only) - Primary database
  • Redis (port 6379) - Cache, rate limiting, BullMQ backend
  • Nginx (ports 80/443) - Reverse proxy with 13 subdomain routes plus the apex domain

Subdomain Routing Matrix

| Subdomain | Backend Container | Port | Purpose | Security Headers |
|---|---|---|---|---|
| app.betteredmonton.org | Admin GUI | 3000 | Admin interface | SAMEORIGIN |
| api.betteredmonton.org | Express + Media API | 4000/4100 | Main API + Media routes | SAMEORIGIN |
| betteredmonton.org (root) | MkDocs Site | 80 | Public documentation | Default |
| db.betteredmonton.org | NocoDB | 8080 | Data browser | CSP iframe |
| docs.betteredmonton.org | MkDocs Dev | 8000 | Live preview | CSP iframe + WS |
| code.betteredmonton.org | Code Server | 8080 | Web IDE | CSP iframe + WS |
| git.betteredmonton.org | Gitea | 3000 | Git hosting | CSP iframe |
| n8n.betteredmonton.org | n8n | 5678 | Workflow automation | CSP iframe + WS |
| listmonk.betteredmonton.org | Listmonk | 9000 | Newsletter platform | SAMEORIGIN |
| mail.betteredmonton.org | MailHog | 8025 | Email capture (dev) | CSP iframe + WS |
| qr.betteredmonton.org | Mini QR | 8080 | QR code generator | CSP iframe |
| draw.betteredmonton.org | Excalidraw | 80 | Collaborative whiteboard | CSP iframe + WS |
| grafana.betteredmonton.org | Grafana | 3000 | Monitoring dashboard | SAMEORIGIN |
| home.betteredmonton.org | Homepage | 3000 | Service dashboard | SAMEORIGIN |

Embed Proxy Ports (bypass security headers for iframe embedding):

  • Ports 8881-8886 → Strip X-Frame-Options and Content-Security-Policy headers
  • Used by Admin GUI to embed third-party services (NocoDB, n8n, Gitea, MailHog, Mini QR, Excalidraw)

SSL/TLS & Tunnel Configuration

Current Setup:

  • Nginx: HTTP-only (port 80), no SSL/TLS configuration
  • Pangolin Tunnel: Handles all HTTPS termination externally
  • Newt Container: Establishes encrypted tunnel to Pangolin server
  • Certificate Management: Delegated entirely to Pangolin (zero config in Nginx)

Pangolin Environment Variables:

PANGOLIN_API_URL=https://api.bnkserve.org/v1     # Self-hosted Pangolin instance
PANGOLIN_API_KEY=                                 # Bearer token authentication
PANGOLIN_ORG_ID=                                  # Organization identifier
PANGOLIN_SITE_ID=                                 # Created during initial setup
PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org  # Tunnel entry point
PANGOLIN_NEWT_ID=                                 # Generated tunnel identity
PANGOLIN_NEWT_SECRET=                             # Tunnel authentication secret

Automated Setup (Feb 2026):

  • One-command deployment via /api/pangolin/setup-automated endpoint
  • Central resource config: configs/pangolin/resources.yml (12 services)
  • Atomic .env updates + Newt container restart + tunnel verification
  • Reduces setup time from 15min → 2min (87% reduction)

Security Posture

Strengths:

  • JWT access/refresh token rotation (atomic transactions)
  • Password policy enforced at schema level (12+ chars, complexity requirements)
  • Rate limiting on auth endpoints (10/min per IP)
  • Redis authentication required (requirepass enforced)
  • User enumeration prevention (401 for all auth failures)
  • Database secrets encrypted with ENCRYPTION_KEY
  • HSTS header with 1-year max-age + includeSubDomains
  • CSP headers for iframe protection on sensitive services
  • PostgreSQL bound to localhost only (not exposed to network)
  • Security audit completed Feb 2026 (13 findings addressed)

Critical Gaps:

  • No HTTP → HTTPS redirect in Nginx (relies on Pangolin)
  • Embed proxy ports (8881-8886) bypass ALL security headers (XSS risk)
  • No nginx-level rate limiting (only application-level)
  • Grafana admin password defaults to "admin"
  • Gotify admin password defaults to "admin"
  • N8N default credentials in .env.example
  • EMAIL_TEST_MODE=true by default (routes to MailHog in production)
  • NODE_TLS_REJECT_UNAUTHORIZED not pinned (a stray =0 in the environment would silently accept self-signed certificates)

Database & Caching

PostgreSQL V2 (changemaker-v2-postgres):

  • Port binding: 127.0.0.1:5433:5432 (localhost-only, production-safe)
  • Connection: postgresql://changemaker:${V2_POSTGRES_PASSWORD}@changemaker-v2-postgres:5432/changemaker_v2
  • Used by: Express API (Prisma), Media API (Prisma), NocoDB (separate nocodb_meta DB)
  • Healthcheck: pg_isready with 10s interval

Listmonk PostgreSQL (listmonk-db):

  • Port binding: 127.0.0.1:5432:5432 (localhost-only)
  • Isolated database lifecycle (separate from V2)
  • Two-user architecture: Web admin + API user (plaintext tokens)

Redis (redis-changemaker):

  • Port binding: 6379:6379 (exposed to host network)
  • Authentication: requirepass ${REDIS_PASSWORD} enforced
  • Connection: redis://:${REDIS_PASSWORD}@redis-changemaker:6379
  • Used for: Cache, BullMQ queues, rate limiting, geocoding cache
  • SECURITY NOTE: redis-exporter uses unauthenticated connection string (potential risk)

Email Configuration

Development (Current):

  • EMAIL_TEST_MODE=true → All emails route to MailHog (localhost:1025)
  • MailHog Web UI: http://mail.betteredmonton.org (dev only)
  • No external SMTP configured

Production Requirements:

  • EMAIL_TEST_MODE=false → Route to real SMTP server
  • SMTP credentials: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS
  • Encrypt SMTP password with ENCRYPTION_KEY (stored in DB)
  • Configure Listmonk SMTP separately (newsletter sending)
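The API performs this encryption itself using ENCRYPTION_KEY; the exact cipher and storage format are defined in the API code. Purely as an illustration of the roundtrip (not the API's actual scheme), with openssl:

```shell
# Illustration only: symmetric encrypt/decrypt of an SMTP password with a
# hex key, analogous to what the API does internally with ENCRYPTION_KEY.
# Cipher choice and encoding here are ours, not the API's actual format.
ENCRYPTION_KEY=$(openssl rand -hex 32)

ciphertext=$(printf 'smtp-secret' | openssl enc -aes-256-cbc -pbkdf2 -a -pass "pass:$ENCRYPTION_KEY")
plaintext=$(echo "$ciphertext" | openssl enc -aes-256-cbc -pbkdf2 -a -d -pass "pass:$ENCRYPTION_KEY")

echo "$plaintext"   # the original secret round-trips
```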

Email Systems:

  1. Campaign Emails (BullMQ queue) → Main SMTP
  2. System Emails (password reset, shift confirmations) → Main SMTP
  3. Newsletter Emails (Listmonk) → Listmonk SMTP (can be same or separate)

Monitoring & Observability

Prometheus Metrics:

  • 12 custom cm_* metrics (API uptime, queue size, sessions, etc.)
  • HTTP request metrics (duration, status codes, paths)
  • Redis, PostgreSQL, container metrics via exporters
  • Scrape interval: 15s

Grafana Dashboards:

  • 3 pre-configured dashboards (API metrics, system metrics, canvass activity)
  • Data source: Prometheus
  • Default admin: admin/admin (must change for production)

Alertmanager:

  • Alert routing configured
  • Requires Gotify setup for notifications (default: admin/admin)

Services Behind --profile monitoring:

  • Prometheus (9090)
  • Grafana (3001)
  • Alertmanager (9093)
  • cAdvisor (8080)
  • Node Exporter (9100)
  • Redis Exporter (9121)
  • Gotify (8889)

Backup & Disaster Recovery

Current Backup Script (scripts/backup.sh):

  • PostgreSQL V2 dump (pg_dump)
  • Listmonk database dump
  • Uploads directory archive (tar.gz)
  • Optional S3 upload (requires S3_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

Critical Gaps:

  • No automated backup scheduling (cron not configured)
  • No backup retention policy
  • No disaster recovery playbook
  • No restore procedure documentation
  • No backup monitoring/alerting

Production Readiness Gaps

Critical Severity (Must Fix Before Production)

  1. Default Admin Passwords

    • Services: Grafana, Gotify, N8N, NocoDB, Listmonk
    • Impact: Unauthorized access to admin dashboards, data exfiltration
    • Fix: Change all default passwords in .env before deployment
    • Verification: Attempt login with default credentials (should fail)
  2. Email Test Mode Enabled

    • Issue: EMAIL_TEST_MODE=true routes all production emails to MailHog
    • Impact: Users never receive password reset, shift confirmation, campaign emails
    • Fix: Set EMAIL_TEST_MODE=false + configure real SMTP credentials
    • Verification: Send test email, verify receipt in external inbox
  3. Missing ENCRYPTION_KEY

    • Issue: Required for encrypting DB secrets (SMTP passwords, API tokens)
    • Impact: Application won't start in production if unset
    • Fix: Generate via openssl rand -hex 32, add to .env
    • Verification: Restart API, check logs for encryption errors
  4. Embed Proxy XSS Risk

    • Issue: Ports 8881-8886 strip all security headers (X-Frame-Options, CSP)
    • Impact: If one service is compromised, attacker can iframe it from malicious site
    • Fix: Restrict embed proxy ports to localhost-only OR implement IP whitelist
    • Verification: Attempt to access embed proxy from external IP (should fail)

High Severity (Fix Before Launch)

  1. No HTTP → HTTPS Redirect

    • Issue: Users can access http://betteredmonton.org without forced redirect
    • Impact: Mixed content warnings, insecure authentication cookies
    • Fix: Add nginx redirect block for all subdomains
    • Verification: curl -I http://app.betteredmonton.org should return 301 redirect
  2. No Automated Backups

    • Issue: Manual backup script requires cron scheduling
    • Impact: Data loss if server fails before manual backup
    • Fix: Add cron job: 0 */6 * * * /path/to/backup.sh (every 6 hours)
    • Verification: Check cron logs for backup execution (/var/log/cron on RHEL-family systems; /var/log/syslog or journalctl -u cron on Debian/Ubuntu)
  3. Redis Exporter Unauthenticated

    • Issue: REDIS_ADDR=redis:6379 (no password)
    • Impact: If exporter runs on separate network segment, Redis exposed
    • Fix: Change to REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis:6379
    • Verification: Check redis-exporter logs, ensure no auth errors
  4. No Disaster Recovery Documentation

    • Issue: Restore procedure not documented
    • Impact: Extended downtime during recovery, data corruption risk
    • Fix: Document step-by-step restore process (DB import, volume restore, env config)
    • Verification: Perform disaster recovery drill on staging environment

Medium Severity (Address Within 30 Days)

  1. Single Bridge Network

    • Issue: All services on same network; lateral movement easy if one compromised
    • Impact: If one service is exploited, attacker can reach databases/Redis
    • Fix: Split into separate networks (app-net, data-net, services-net)
    • Verification: Verify service isolation via docker network inspect
  2. No Nginx Rate Limiting

    • Issue: Rate limiting only at application level (Express middleware)
    • Impact: DDoS attacks can saturate Nginx/network before reaching API rate limiter
    • Fix: Add nginx limit_req zones for /api/* paths
    • Verification: Send 1000 req/sec, verify 429 responses from Nginx
  3. No Log Aggregation

    • Issue: Logs scattered across Docker containers
    • Impact: Difficult to debug multi-service issues, no centralized audit trail
    • Fix: Implement ELK stack or similar (Elasticsearch, Logstash, Kibana)
    • Verification: Search logs from all services in one UI
  4. No TLS Certificate Monitoring

    • Issue: Pangolin manages certs, but no alerting on renewal failures
    • Impact: Site goes offline when cert expires
    • Fix: Add Prometheus alert for cert expiry (30 days before)
    • Verification: Simulate expired cert, verify alert fires

Low Severity (Nice to Have)

  1. No Service Mesh

    • Issue: No observability of inter-service communication
    • Impact: Difficult to debug network issues between containers
    • Fix: Implement Linkerd or Istio for traffic management
    • Verification: View service-to-service latency in Grafana
  2. No Container Resource Limits

    • Issue: Docker Compose doesn't set CPU/memory limits
    • Impact: One service can starve others of resources
    • Fix: Add deploy.resources.limits to docker-compose.yml
    • Verification: Monitor resource usage under load
  3. No Listmonk HTTPS

    • Issue: API-to-Listmonk communication uses HTTP (inside Docker network)
    • Impact: If network is compromised, credentials visible in plaintext
    • Fix: Configure Listmonk with internal TLS certificate
    • Verification: Inspect network traffic, verify encryption

Implementation Plan

Phase 1: Pre-Deployment Security Hardening (2-3 hours)

File: .env (production environment variables)

Changes Required:

  1. Generate Secrets

    # Run on production server
    openssl rand -hex 32  # JWT_ACCESS_SECRET
    openssl rand -hex 32  # JWT_REFRESH_SECRET
    openssl rand -hex 32  # ENCRYPTION_KEY (must differ from JWT secrets)
    openssl rand -hex 16  # LISTMONK_API_TOKEN
    
  2. Update Environment Variables

    • EMAIL_TEST_MODE=false
    • NODE_TLS_REJECT_UNAUTHORIZED=1 (explicitly enforce certificate validation; Node only disables it when the value is 0)
    • GRAFANA_ADMIN_PASSWORD=<strong_password>
    • GOTIFY_ADMIN_PASSWORD=<strong_password>
    • N8N_USER_PASSWORD=<strong_password>
    • NC_ADMIN_PASSWORD=<strong_password>
    • LISTMONK_WEB_ADMIN_PASSWORD=<strong_password>
    • V2_POSTGRES_PASSWORD=<strong_password>
    • REDIS_PASSWORD=<strong_password>
    • LISTMONK_DB_PASSWORD=<strong_password>
    • GITEA_DB_PASSWD=<strong_password>
    • GITEA_DB_ROOT_PASSWORD=<strong_password>
    • N8N_ENCRYPTION_KEY=<strong_password>
  3. Configure Production SMTP

    • SMTP_HOST=<smtp.provider.com>
    • SMTP_PORT=<465 or 587>
    • SMTP_USER=<username>
    • SMTP_PASS=<password> (will be encrypted by API on first startup)
    • SMTP_SECURE=true (for port 465) or false (for STARTTLS on 587)
  4. Listmonk SMTP Configuration

    • LISTMONK_SMTP_HOST=<smtp.provider.com>
    • LISTMONK_SMTP_PORT=<465 or 587>
    • LISTMONK_SMTP_TLS_TYPE=STARTTLS (for 587) or TLS (for 465)
    • LISTMONK_SMTP_AUTH_PROTOCOL=login
    • LISTMONK_SMTP_USERNAME=<username>
    • LISTMONK_SMTP_PASSWORD=<password>
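The individual openssl commands in step 1 can be consolidated so the whole secret set is generated in one pass. A sketch (gen_hex is our helper, not part of the repo; the od pipeline produces the same hex shape as openssl rand -hex):

```shell
# gen_hex N: print N random bytes as lowercase hex, matching the output
# shape of `openssl rand -hex N`. Helper name is ours, not from the repo.
gen_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Emit the full secret set; append the output to the production .env
for var in JWT_ACCESS_SECRET JWT_REFRESH_SECRET ENCRYPTION_KEY; do
  echo "$var=$(gen_hex 32)"
done
echo "LISTMONK_API_TOKEN=$(gen_hex 16)"
```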

Verification:

# Check all required env vars are set
grep "CHANGE_THIS" .env  # Should return nothing
grep "admin" .env | grep -v ADMIN_EMAIL  # Should return nothing (no default admin passwords)

# Test SMTP connection
docker compose exec api node -e "
  const nodemailer = require('nodemailer');
  const transport = nodemailer.createTransport({
    host: process.env.SMTP_HOST,
    port: parseInt(process.env.SMTP_PORT),
    secure: process.env.SMTP_SECURE === 'true',
    auth: {
      user: process.env.SMTP_USER,
      pass: process.env.SMTP_PASS
    }
  });
  transport.verify().then(console.log).catch(console.error);
"

Phase 2: Nginx Production Hardening (1 hour)

File: nginx/conf.d/default.conf (or new production.conf)

Changes Required:

  1. Add HTTP → HTTPS Redirect

    server {
        listen 80;
        server_name *.betteredmonton.org betteredmonton.org;
    
        # Health check endpoints (allow HTTP)
        location /health {
            proxy_pass http://changemaker-v2-api:4000;
        }
    
        # Redirect all other traffic to HTTPS
        location / {
            return 301 https://$host$request_uri;
        }
    }
    
  2. Add Nginx Rate Limiting

    # Add to http block in nginx.conf
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;
    
    # Add to api.conf location blocks
    location /api/auth/ {
        limit_req zone=auth_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://changemaker-v2-api:4000;
    }
    
    location /api/ {
        limit_req zone=api_limit burst=200 nodelay;
        limit_req_status 429;
        proxy_pass http://changemaker-v2-api:4000;
    }
    
  3. Restrict Embed Proxy Ports to Localhost

    # Add to each embed proxy server block
    server {
        listen 8881;
        server_name localhost;
    
        # Reject non-localhost connections
        allow 127.0.0.1;
        deny all;
    
        location / {
            proxy_pass http://changemaker-v2-nocodb:8080;
            proxy_hide_header X-Frame-Options;
            proxy_hide_header Content-Security-Policy;
        }
    }
    
  4. Add Custom Error Pages

    # Add to each server block (location directives are not valid in the http block)
    error_page 502 503 504 /5xx.html;
    location = /5xx.html {
        root /usr/share/nginx/html;
        internal;
    }
    
    error_page 429 /429.html;
    location = /429.html {
        root /usr/share/nginx/html;
        internal;
    }
    

Verification:

# Test HTTP redirect
curl -I http://app.betteredmonton.org | grep "301"  # Should see 301 Moved Permanently

# Test rate limiting (the auth zone is easiest to trip: 10r/m with burst=20)
for i in {1..40}; do curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.betteredmonton.org/api/auth/login; done
# Should see 4xx responses, then 429s once the burst allowance is exhausted

# Test embed proxy localhost restriction
curl -I http://<server_ip>:8881  # Should return 403 Forbidden
curl -I http://localhost:8881  # Should return 200 OK

Phase 3: Pangolin Tunnel Configuration (30 minutes)

File: .env (Pangolin environment variables)

Prerequisites:

  • Pangolin organization created at https://api.bnkserve.org
  • API key obtained from organization settings
  • DNS records created (see below)

Steps:

  1. Configure Pangolin Environment Variables

    PANGOLIN_API_URL=https://api.bnkserve.org/v1
    PANGOLIN_API_KEY=<your_api_key>
    PANGOLIN_ORG_ID=<your_org_id>
    PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org
    
  2. Run Automated Setup

    # Option 1: Via API endpoint
    curl -X POST http://localhost:4000/api/pangolin/setup-automated \
      -H "Authorization: Bearer <admin_jwt_token>" \
      -H "Content-Type: application/json" \
      -d '{
        "siteName": "Changemaker Lite Production",
        "domain": "betteredmonton.org"
      }'
    
    # Option 2: Via CLI wrapper
    ./scripts/pangolin-setup.sh
    
  3. Verify Tunnel Connectivity

    # Check Newt container logs
    docker compose logs -f newt
    # Should see "Connected to Pangolin server" and "Tunnel established"
    
    # Test external access
    curl -I https://app.betteredmonton.org
    # Should return 200 OK with HTTPS
    

DNS Configuration Required:

Create 13 CNAME records pointing to the Pangolin endpoint:

app.betteredmonton.org      CNAME   pangolin.bnkserve.org
api.betteredmonton.org      CNAME   pangolin.bnkserve.org
db.betteredmonton.org       CNAME   pangolin.bnkserve.org
docs.betteredmonton.org     CNAME   pangolin.bnkserve.org
code.betteredmonton.org     CNAME   pangolin.bnkserve.org
git.betteredmonton.org      CNAME   pangolin.bnkserve.org
n8n.betteredmonton.org      CNAME   pangolin.bnkserve.org
listmonk.betteredmonton.org CNAME   pangolin.bnkserve.org
mail.betteredmonton.org     CNAME   pangolin.bnkserve.org
qr.betteredmonton.org       CNAME   pangolin.bnkserve.org
draw.betteredmonton.org     CNAME   pangolin.bnkserve.org
grafana.betteredmonton.org  CNAME   pangolin.bnkserve.org
home.betteredmonton.org     CNAME   pangolin.bnkserve.org

Phase 4: Backup Automation (30 minutes)

File: New cron job configuration

Steps:

  1. Create Backup Directory

    mkdir -p /var/backups/changemaker-lite
    chmod 750 /var/backups/changemaker-lite
    
  2. Test Manual Backup

    cd /home/bunker-admin/changemaker.lite
    ./scripts/backup.sh
    # Should create timestamped backup files in ./backups/
    
  3. Configure S3 Upload (Optional)

    # Add to .env
    S3_BUCKET=changemaker-lite-backups
    AWS_ACCESS_KEY_ID=<your_access_key>
    AWS_SECRET_ACCESS_KEY=<your_secret_key>
    AWS_REGION=us-east-1  # Or your preferred region
    
  4. Add Cron Job

    # Edit crontab
    crontab -e
    
    # Add the following lines:
    # Backup every 6 hours at minute 0
    0 */6 * * * cd /home/bunker-admin/changemaker.lite && ./scripts/backup.sh >> /var/log/changemaker-backup.log 2>&1
    
    # Clean up old backups (keep last 7 days)
    0 3 * * * find /home/bunker-admin/changemaker.lite/backups -type f -mtime +7 -delete
    
  5. Setup Backup Monitoring Alert

    # Add to configs/prometheus/alerts.yml
    - alert: BackupJobFailed
      expr: time() - cm_backup_last_success_timestamp > 21600  # 6 hours
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Backup job has not run successfully in over 6 hours"
        description: "Last successful backup was {{ $value | humanizeDuration }} ago"
    
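The alert in step 5 assumes a cm_backup_last_success_timestamp metric that scripts/backup.sh does not currently export. One common pattern is the node_exporter textfile collector; a sketch to append to the backup script (the directory and the --collector.textfile.directory flag are assumptions about your node_exporter setup):

```shell
# emit_backup_metric [DIR]: write the timestamp metric that the
# BackupJobFailed alert scrapes. Assumes node_exporter runs with
# --collector.textfile.directory pointed at DIR.
emit_backup_metric() {
  local dir="${1:-/var/lib/node_exporter/textfile}"
  mkdir -p "$dir"
  # Write to a temp file and rename so node_exporter never reads a partial file
  printf '# TYPE cm_backup_last_success_timestamp gauge\ncm_backup_last_success_timestamp %s\n' \
    "$(date +%s)" > "$dir/backup.prom.tmp"
  mv "$dir/backup.prom.tmp" "$dir/backup.prom"
}
```

Call emit_backup_metric as the last line of scripts/backup.sh so the metric only advances on a successful run.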

Verification:

# Wait for cron execution (or run manually)
./scripts/backup.sh

# Check backup files exist
ls -lh backups/
# Should see 3 files: changemaker_v2-YYYYMMDD-HHMMSS.sql, listmonk-YYYYMMDD-HHMMSS.sql, uploads-YYYYMMDD-HHMMSS.tar.gz

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/
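The three-file check can also be scripted for CI or a post-backup hook. A sketch (the function name is ours; the filename patterns match the backup artifacts described above):

```shell
# verify_backup_set [DIR]: confirm a backup run produced all three artifacts.
verify_backup_set() {
  local dir="${1:-backups}" missing=0
  for pattern in 'changemaker_v2-*.sql' 'listmonk-*.sql' 'uploads-*.tar.gz'; do
    # Unquoted on purpose so the glob expands against the backup directory
    if ! ls $dir/$pattern >/dev/null 2>&1; then
      echo "MISSING: $pattern"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "OK: backup set complete"
  return "$missing"
}
```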

Phase 5: Docker Compose Updates (1 hour)

File: docker-compose.yml

Changes Required:

  1. Fix Redis Exporter Authentication

    redis-exporter:
      environment:
        - REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379
    
  2. Add Container Resource Limits (Optional)

    api:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 2G
          reservations:
            cpus: '1'
            memory: 1G
    
    media-api:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G  # Higher for video processing
          reservations:
            cpus: '1'
            memory: 2G
    
    v2-postgres:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G
          reservations:
            cpus: '1'
            memory: 2G
    
  3. Add Volume Size Limits (Optional)

    # NOTE: the local volume driver does not enforce a size option on bind
    # mounts; real quotas need filesystem support (e.g. XFS project quotas).
    # Until then, rely on the DiskSpaceRunningLow alert in Phase 6.
    volumes:
      v2-postgres-data:
        driver_opts:
          type: none
          device: /var/lib/docker/volumes/v2-postgres-data
          o: bind
    

Verification:

# Recreate containers with new config
docker compose down
docker compose up -d

# Verify Redis exporter connects with auth
docker compose logs redis-exporter | grep "successfully"

# Check resource limits are applied
docker stats --no-stream | grep changemaker

Phase 6: Monitoring & Alerting Setup (1-2 hours)

File: configs/prometheus/alerts.yml

Additional Alerts to Add:

groups:
  - name: production_critical
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 2592000  # 30 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections ({{ $value }})"
          description: "PostgreSQL has {{ $value }} active connections"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"
          description: "Redis is using {{ $value | humanizePercentage }} of allocated memory"
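Note that the SSLCertExpiringSoon rule relies on probe_ssl_earliest_cert_expiry, which is produced by the Prometheus blackbox_exporter rather than any exporter in the current monitoring profile. If you adopt that rule, a scrape job along these lines is also needed (the job name, module, and container address here are assumptions, not existing config):

```yaml
# configs/prometheus/prometheus.yml (sketch) -- requires a running blackbox_exporter
scrape_configs:
  - job_name: blackbox_tls
    metrics_path: /probe
    params:
      module: [http_2xx]   # module must be defined in blackbox.yml
    static_configs:
      - targets:
          - https://app.betteredmonton.org
          - https://api.betteredmonton.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # container name is an assumption
```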

Gotify Configuration:

# Start Gotify container
docker compose --profile monitoring up -d gotify

# Access Gotify UI at http://localhost:8889
# Change admin password (default: admin/admin)

# Create application token for Alertmanager
# Copy token to configs/alertmanager/alertmanager.yml:

receivers:
  - name: 'gotify'
    webhook_configs:
      - url: 'http://gotify-changemaker:80/message?token=<your_app_token>'
        send_resolved: true

Verification:

# Start monitoring stack
docker compose --profile monitoring up -d

# Check Prometheus targets are up
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Should return empty (all targets healthy)

# Test alert firing
docker compose stop api
# Wait 1 minute, check Alertmanager UI at http://localhost:9093
# Should see "APIDown" alert firing

# Verify Gotify receives notification
# Check Gotify UI, should see new message

# Restart API
docker compose start api

Phase 7: Final Production Verification (1-2 hours)

Production Deployment Checklist:

Security:

  • All default passwords changed (Grafana, Gotify, N8N, NocoDB, Listmonk)
  • JWT access/refresh secrets generated via openssl rand -hex 32 (two distinct values)
  • ENCRYPTION_KEY generated and different from JWT secrets
  • EMAIL_TEST_MODE=false set in production .env
  • NODE_TLS_REJECT_UNAUTHORIZED=1 for strict TLS validation (never 0)
  • Redis password set and authenticated
  • PostgreSQL passwords strong (20+ characters)
  • Nginx rate limiting enabled
  • Embed proxy ports restricted to localhost

Networking:

  • All 13 DNS CNAME records created
  • Pangolin tunnel configured and connected
  • HTTP → HTTPS redirect working
  • All subdomains resolve via HTTPS
  • SSL certificates valid (checked via browser)
  • WebSocket connections work (test n8n, MkDocs, Code Server)

Email:

  • Production SMTP configured (host, port, user, pass)
  • Test email sent and received
  • Listmonk SMTP configured separately
  • Password reset email works
  • Shift confirmation email works

Backup:

  • Backup script tested manually
  • S3 credentials configured (if using)
  • Cron job added for automated backups (every 6 hours)
  • Old backup cleanup cron added (7 day retention)
  • Backup monitoring alert configured

Monitoring:

  • Prometheus collecting metrics from all services
  • Grafana dashboards showing data
  • Alertmanager configured with Gotify
  • SSL expiry alert configured (30 days warning)
  • Disk space alert configured (10% threshold)
  • Backup job alert configured (6 hour SLA)
  • Test alert sent to Gotify

Application:

  • Admin login works (JWT token issued)
  • Admin dashboard loads all components
  • API health check returns 200 OK
  • Media upload works (test 100MB+ video)
  • Geocoding works (test address lookup)
  • Map loads locations correctly
  • Campaign email sending works (test queue)
  • Listmonk sync works (if enabled)
  • Canvass map GPS tracking works (volunteer portal)

Performance:

  • Nginx rate limiting prevents abuse (test with 1000 req/sec)
  • Database connection pooling configured
  • Redis cache hit ratio >80% (check Grafana)
  • Page load times <2 seconds (test with network throttling)
  • Video upload completes within timeout (10GB max)

Disaster Recovery:

  • Full backup restored on staging environment
  • Database migration verified (Prisma migrations applied)
  • Environment variables match production
  • All services start cleanly after restore

Verification Steps

Post-Deployment Tests

1. SSL/TLS Verification

# Check all subdomains have valid SSL
for subdomain in app api db docs code git n8n listmonk mail qr draw grafana home; do
  echo "Testing $subdomain.betteredmonton.org"
  curl -vI https://$subdomain.betteredmonton.org 2>&1 | grep -E "(HTTP|subject:|issuer:)"
done

# Should see:
# - HTTP/2 200 (or 301 redirect)
# - Valid certificate issuer (Let's Encrypt or Pangolin)
# - No certificate errors

2. Authentication Flow

# Test login endpoint
curl -X POST https://api.betteredmonton.org/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<admin_password>"}' \
  | jq '.accessToken'

# Should return JWT token

# Test token refresh
curl -X POST https://api.betteredmonton.org/api/auth/refresh \
  -H "Content-Type: application/json" \
  -d '{"refreshToken":"<refresh_token>"}' \
  | jq '.accessToken'

# Should return new access token

3. Email Delivery

# Trigger password reset email
curl -X POST https://api.betteredmonton.org/api/auth/forgot-password \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com"}'

# Check external email inbox for reset link
# Verify email arrives within 2 minutes

4. Rate Limiting

# Test auth endpoint rate limit (10/min)
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code} " https://api.betteredmonton.org/api/auth/login \
    -H "Content-Type: application/json" \
    -d '{"email":"test","password":"test"}'
done

# Should see: 401 401 401 ... 429 429 429 (after 10 requests)

5. Database Connectivity

# Check API can connect to database
curl https://api.betteredmonton.org/api/health | jq '.database'
# Should return: "healthy"

# Check Redis connectivity
curl https://api.betteredmonton.org/api/health | jq '.redis'
# Should return: "healthy"

6. Media Upload

# Test video upload (requires auth token)
curl -X POST https://api.betteredmonton.org/media/videos/upload \
  -H "Authorization: Bearer <admin_jwt>" \
  -F "file=@test-video.mp4" \
  -F "title=Test Upload" \
  | jq '.id'

# Should return video ID

7. Monitoring Endpoints

# Prometheus targets
curl -s https://grafana.betteredmonton.org/api/datasources/proxy/1/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Should show all targets with health: "up"

# Grafana health
curl https://grafana.betteredmonton.org/api/health
# Should return: {"database":"ok","version":"..."}

8. Backup Verification

# Trigger manual backup
./scripts/backup.sh

# Check backup files created
ls -lh backups/ | tail -3

# Should see 3 new files with current timestamp

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/ | tail -3

Critical Files Reference

Configuration Files:

  • docker-compose.yml - Service orchestration (25+ services)
  • .env - Environment variables (100+ vars, not committed)
  • .env.example - Template with all required variables
  • nginx/nginx.conf - Global Nginx config + security headers
  • nginx/conf.d/api.conf - API + Media API reverse proxy
  • nginx/conf.d/services.conf - 12 service subdomains + embed proxies
  • configs/pangolin/resources.yml - Tunnel resource definitions
  • configs/prometheus/prometheus.yml - Metrics collection config
  • configs/prometheus/alerts.yml - Alert rules
  • configs/grafana/*.json - Pre-configured dashboards
  • configs/alertmanager/alertmanager.yml - Alert routing

Database Schema:

  • api/prisma/schema.prisma - Main database schema (30+ models)
  • api/prisma/migrations/ - Migration history
  • api/prisma/seed.ts - Initial data seeding

Deployment Scripts:

  • scripts/backup.sh - PostgreSQL + Listmonk + uploads backup
  • scripts/pangolin-setup.sh - CLI wrapper for automated tunnel setup

Environment Validation:

  • api/src/config/env.ts - Zod schema for all environment variables (100+ vars)

Rollback Procedure

If deployment fails or critical issues arise:

1. Immediate Rollback (5 minutes)

# Stop all containers
docker compose down

# Restore previous .env file
cp .env.backup .env

# Restart with old configuration
docker compose up -d

2. Database Rollback (15 minutes)

# Stop API to prevent new writes
docker compose stop api media-api

# Restore from latest backup (WITH (FORCE) terminates lingering connections; PostgreSQL 13+)
docker compose exec v2-postgres psql -U changemaker -d postgres -c "DROP DATABASE IF EXISTS changemaker_v2 WITH (FORCE);"
docker compose exec v2-postgres psql -U changemaker -d postgres -c "CREATE DATABASE changemaker_v2;"
docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < backups/changemaker_v2-YYYYMMDD-HHMMSS.sql

# Restart services
docker compose start api media-api

3. Full System Restore (30 minutes)

# Stop all services
docker compose down -v  # WARNING: Removes all volumes

# Restore PostgreSQL data
# NOTE: the volume tarballs below assume filesystem-level backups were taken;
# scripts/backup.sh only produces SQL dumps (for those, use the Database
# Rollback steps above instead)
tar -xzf backups/postgres-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore Redis data (if backed up)
tar -xzf backups/redis-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore uploads
tar -xzf backups/uploads-YYYYMMDD-HHMMSS.tar.gz -C ./media/

# Restart all services
docker compose up -d

4. Verify Rollback Success

# Check all services healthy
docker compose ps | grep -v "Up"  # Should show only the header line (every service Up)

# Test admin login
curl -X POST http://localhost:4000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<password>"}'

# Verify database has data
curl http://localhost:4000/api/health | jq '.database'

Post-Production Maintenance

Daily Tasks:

  • Monitor Grafana dashboards for anomalies
  • Check Gotify alerts for critical issues
  • Verify backups completed successfully (check logs)

Weekly Tasks:

  • Review API error logs for patterns
  • Check disk space usage (alert should fire if <10%)
  • Verify SSL certificate validity (30 days remaining)
  • Test disaster recovery on staging environment

Monthly Tasks:

  • Review access logs for suspicious activity
  • Update Docker images to latest versions (after testing on staging)
  • Audit user accounts and remove inactive users
  • Review and rotate API keys if necessary

Quarterly Tasks:

  • Conduct full security audit (penetration testing)
  • Review and update rate limiting thresholds based on traffic
  • Analyze backup storage costs and adjust retention policy
  • Test full disaster recovery procedure with restore drill

Summary

This plan provides a comprehensive pathway from development to production for the Changemaker Lite V2 networking infrastructure. The architecture is fundamentally sound with:

Strengths:

  • Single bridge network simplifies communication
  • Pangolin tunnel handles SSL/TLS externally (zero Nginx cert management)
  • Comprehensive security headers and policies
  • Automated backup script exists
  • Monitoring stack with Prometheus/Grafana ready
  • Rate limiting on critical endpoints

Critical Path for Production:

  1. Phase 1: Security hardening (change passwords, configure SMTP) - MUST DO
  2. Phase 3: Pangolin tunnel setup - MUST DO
  3. Phase 4: Backup automation - SHOULD DO
  4. Phase 6: Monitoring alerts - SHOULD DO
  5. Phase 2: Nginx hardening - SHOULD DO (the HTTP → HTTPS redirect is rated High severity above)

The remaining phases (network segmentation, resource limits, log aggregation) can be deferred to post-launch improvements without blocking production deployment.

Estimated Total Implementation Time: 6-10 hours (can be split across multiple days)

Estimated Downtime During Deployment: <5 minutes (only during final container restart)