Production Networking Preparation Plan

Context

The Changemaker Lite V2 application needs to be prepared for production deployment. The current architecture is development-focused with Docker Compose orchestration, Nginx reverse proxy, and Pangolin tunnel integration for SSL/TLS termination. The user wants a comprehensive understanding of the networking setup and identification of production readiness gaps before going live.

Why this is needed:

  • Current setup is optimized for local development (HTTP-only, MailHog, default passwords)
  • Production deployment requires SSL/TLS via Pangolin tunnel, real SMTP, security hardening
  • Need to identify all gaps between dev and production configurations
  • Need actionable checklist for production cutover

What prompted this:

  • User preparing to deploy production instance on betteredmonton.org domain
  • Need to understand networking architecture, security posture, and deployment requirements
  • Ensure all 12 subdomains route correctly through Pangolin tunnel

Intended outcome:

  • Comprehensive documentation of current networking architecture
  • Identified production readiness gaps with severity ratings
  • Prioritized checklist for production deployment
  • Configuration changes needed for production hardening

Current State Assessment

Network Architecture

Single-Bridge Network Design:

  • All 25+ services on one Docker bridge network (changemaker-lite)
  • Services communicate via container hostnames (DNS: 127.0.0.11)
  • Nginx acts as single reverse proxy for all external traffic
  • Pangolin tunnel (Newt container) provides SSL/TLS termination

Service Topology:

Internet → Pangolin Tunnel (HTTPS) → Newt Container → Nginx (HTTP:80) → Backend Services

Critical Services:

  • Express API (port 4000) - Main V2 API with Prisma ORM
  • Fastify Media API (port 4100) - Video library management
  • Admin GUI (port 3000) - React admin interface
  • PostgreSQL V2 (port 5433 localhost-only) - Primary database
  • Redis (port 6379) - Cache, rate limiting, BullMQ backend
  • Nginx (ports 80/443) - Reverse proxy with 13 subdomain routes plus the apex domain

Subdomain Routing Matrix

| Subdomain | Backend Container | Port | Purpose | Security Headers |
|---|---|---|---|---|
| app.betteredmonton.org | Admin GUI | 3000 | Admin interface | SAMEORIGIN |
| api.betteredmonton.org | Express + Media API | 4000/4100 | Main API + Media routes | SAMEORIGIN |
| betteredmonton.org (root) | MkDocs Site | 80 | Public documentation | Default |
| db.betteredmonton.org | NocoDB | 8080 | Data browser | CSP iframe |
| docs.betteredmonton.org | MkDocs Dev | 8000 | Live preview | CSP iframe + WS |
| code.betteredmonton.org | Code Server | 8080 | Web IDE | CSP iframe + WS |
| git.betteredmonton.org | Gitea | 3000 | Git hosting | CSP iframe |
| n8n.betteredmonton.org | n8n | 5678 | Workflow automation | CSP iframe + WS |
| listmonk.betteredmonton.org | Listmonk | 9000 | Newsletter platform | SAMEORIGIN |
| mail.betteredmonton.org | MailHog | 8025 | Email capture (dev) | CSP iframe + WS |
| qr.betteredmonton.org | Mini QR | 8080 | QR code generator | CSP iframe |
| draw.betteredmonton.org | Excalidraw | 80 | Collaborative whiteboard | CSP iframe + WS |
| grafana.betteredmonton.org | Grafana | 3000 | Monitoring dashboard | SAMEORIGIN |
| home.betteredmonton.org | Homepage | 3000 | Service dashboard | SAMEORIGIN |

Embed Proxy Ports (bypass security headers for iframe embedding):

  • Ports 8881-8886 → Strip X-Frame-Options and Content-Security-Policy headers
  • Used by Admin GUI to embed third-party services (NocoDB, n8n, Gitea, MailHog, Mini QR, Excalidraw)

SSL/TLS & Tunnel Configuration

Current Setup:

  • Nginx: HTTP-only (port 80), no SSL/TLS configuration
  • Pangolin Tunnel: Handles all HTTPS termination externally
  • Newt Container: Establishes encrypted tunnel to Pangolin server
  • Certificate Management: Delegated entirely to Pangolin (zero config in Nginx)

Pangolin Environment Variables:

PANGOLIN_API_URL=https://api.bnkserve.org/v1     # Self-hosted Pangolin instance
PANGOLIN_API_KEY=                                 # Bearer token authentication
PANGOLIN_ORG_ID=                                  # Organization identifier
PANGOLIN_SITE_ID=                                 # Created during initial setup
PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org  # Tunnel entry point
PANGOLIN_NEWT_ID=                                 # Generated tunnel identity
PANGOLIN_NEWT_SECRET=                             # Tunnel authentication secret

Automated Setup (Feb 2026):

  • One-command deployment via /api/pangolin/setup-automated endpoint
  • Central resource config: configs/pangolin/resources.yml (12 services)
  • Atomic .env updates + Newt container restart + tunnel verification
  • Reduces setup time from 15min → 2min (87% reduction)

Security Posture

Strengths:

  • JWT access/refresh token rotation (atomic transactions)
  • Password policy enforced at schema level (12+ chars, complexity requirements)
  • Rate limiting on auth endpoints (10/min per IP)
  • Redis authentication required (requirepass enforced)
  • User enumeration prevention (401 for all auth failures)
  • Database secrets encrypted with ENCRYPTION_KEY
  • HSTS header with 1-year max-age + includeSubDomains
  • CSP headers for iframe protection on sensitive services
  • PostgreSQL bound to localhost only (not exposed to network)
  • Security audit completed Feb 2026 (13 findings addressed)

Critical Gaps:

  • No HTTP → HTTPS redirect in Nginx (relies on Pangolin)
  • Embed proxy ports (8881-8886) bypass ALL security headers (XSS risk)
  • No nginx-level rate limiting (only application-level)
  • Grafana admin password defaults to "admin"
  • Gotify admin password defaults to "admin"
  • N8N default credentials in .env.example
  • EMAIL_TEST_MODE=true by default (routes to MailHog in production)
  • NODE_TLS_REJECT_UNAUTHORIZED not pinned (a stray =0 in the environment would silently accept self-signed certificates)

Database & Caching

PostgreSQL V2 (changemaker-v2-postgres):

  • Port binding: 127.0.0.1:5433:5432 (localhost-only, production-safe)
  • Connection: postgresql://changemaker:${V2_POSTGRES_PASSWORD}@changemaker-v2-postgres:5432/changemaker_v2
  • Used by: Express API (Prisma), Media API (Prisma), NocoDB (separate nocodb_meta DB)
  • Healthcheck: pg_isready with 10s interval

Listmonk PostgreSQL (listmonk-db):

  • Port binding: 127.0.0.1:5432:5432 (localhost-only)
  • Isolated database lifecycle (separate from V2)
  • Two-user architecture: Web admin + API user (plaintext tokens)

Redis (redis-changemaker):

  • Port binding: 6379:6379 (exposed to host network)
  • Authentication: requirepass ${REDIS_PASSWORD} enforced
  • Connection: redis://:${REDIS_PASSWORD}@redis-changemaker:6379
  • Used for: Cache, BullMQ queues, rate limiting, geocoding cache
  • SECURITY NOTE: redis-exporter uses unauthenticated connection string (potential risk)

Email Configuration

Development (Current):

  • EMAIL_TEST_MODE=true → All emails route to MailHog (localhost:1025)
  • MailHog Web UI: http://mail.betteredmonton.org (dev only)
  • No external SMTP configured

Production Requirements:

  • EMAIL_TEST_MODE=false → Route to real SMTP server
  • SMTP credentials: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS
  • Encrypt SMTP password with ENCRYPTION_KEY (stored in DB)
  • Configure Listmonk SMTP separately (newsletter sending)
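The API performs this encryption itself using ENCRYPTION_KEY; the exact cipher and storage format are defined in the API code. Purely as an illustration of the roundtrip (not the API's actual scheme), with openssl:

```shell
# Illustration only: symmetric encrypt/decrypt of an SMTP password with a
# hex key, analogous to what the API does internally with ENCRYPTION_KEY.
# Cipher choice and encoding here are ours, not the API's actual format.
ENCRYPTION_KEY=$(openssl rand -hex 32)

ciphertext=$(printf 'smtp-secret' | openssl enc -aes-256-cbc -pbkdf2 -a -pass "pass:$ENCRYPTION_KEY")
plaintext=$(echo "$ciphertext" | openssl enc -aes-256-cbc -pbkdf2 -a -d -pass "pass:$ENCRYPTION_KEY")

echo "$plaintext"   # the original secret round-trips
```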

Email Systems:

  1. Campaign Emails (BullMQ queue) → Main SMTP
  2. System Emails (password reset, shift confirmations) → Main SMTP
  3. Newsletter Emails (Listmonk) → Listmonk SMTP (can be same or separate)

Monitoring & Observability

Prometheus Metrics:

  • 12 custom cm_* metrics (API uptime, queue size, sessions, etc.)
  • HTTP request metrics (duration, status codes, paths)
  • Redis, PostgreSQL, container metrics via exporters
  • Scrape interval: 15s

Grafana Dashboards:

  • 3 pre-configured dashboards (API metrics, system metrics, canvass activity)
  • Data source: Prometheus
  • Default admin: admin/admin (must change for production)

Alertmanager:

  • Alert routing configured
  • Requires Gotify setup for notifications (default: admin/admin)

Services Behind --profile monitoring:

  • Prometheus (9090)
  • Grafana (3001)
  • Alertmanager (9093)
  • cAdvisor (8080)
  • Node Exporter (9100)
  • Redis Exporter (9121)
  • Gotify (8889)

Backup & Disaster Recovery

Current Backup Script (scripts/backup.sh):

  • PostgreSQL V2 dump (pg_dump)
  • Listmonk database dump
  • Uploads directory archive (tar.gz)
  • Optional S3 upload (requires S3_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

Critical Gaps:

  • No automated backup scheduling (cron not configured)
  • No backup retention policy
  • No disaster recovery playbook
  • No restore procedure documentation
  • No backup monitoring/alerting

Production Readiness Gaps

Critical Severity (Must Fix Before Production)

  1. Default Admin Passwords

    • Services: Grafana, Gotify, N8N, NocoDB, Listmonk
    • Impact: Unauthorized access to admin dashboards, data exfiltration
    • Fix: Change all default passwords in .env before deployment
    • Verification: Attempt login with default credentials (should fail)
  2. Email Test Mode Enabled

    • Issue: EMAIL_TEST_MODE=true routes all production emails to MailHog
    • Impact: Users never receive password reset, shift confirmation, campaign emails
    • Fix: Set EMAIL_TEST_MODE=false + configure real SMTP credentials
    • Verification: Send test email, verify receipt in external inbox
  3. Missing ENCRYPTION_KEY

    • Issue: Required for encrypting DB secrets (SMTP passwords, API tokens)
    • Impact: Application won't start in production if unset
    • Fix: Generate via openssl rand -hex 32, add to .env
    • Verification: Restart API, check logs for encryption errors
  4. Embed Proxy XSS Risk

    • Issue: Ports 8881-8886 strip all security headers (X-Frame-Options, CSP)
    • Impact: If one service is compromised, attacker can iframe it from malicious site
    • Fix: Restrict embed proxy ports to localhost-only OR implement IP whitelist
    • Verification: Attempt to access embed proxy from external IP (should fail)

High Severity (Fix Before Launch)

  1. No HTTP → HTTPS Redirect

    • Issue: Users can access http://betteredmonton.org without forced redirect
    • Impact: Mixed content warnings, insecure authentication cookies
    • Fix: Add nginx redirect block for all subdomains
    • Verification: curl -I http://app.betteredmonton.org should return 301 redirect
  2. No Automated Backups

    • Issue: Manual backup script requires cron scheduling
    • Impact: Data loss if server fails before manual backup
    • Fix: Add cron job: 0 */6 * * * /path/to/backup.sh (every 6 hours)
    • Verification: Check cron logs for backup execution (/var/log/cron on RHEL-family systems; /var/log/syslog or journalctl -u cron on Debian/Ubuntu)
  3. Redis Exporter Unauthenticated

    • Issue: REDIS_ADDR=redis:6379 (no password)
    • Impact: If exporter runs on separate network segment, Redis exposed
    • Fix: Change to REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis:6379
    • Verification: Check redis-exporter logs, ensure no auth errors
  4. No Disaster Recovery Documentation

    • Issue: Restore procedure not documented
    • Impact: Extended downtime during recovery, data corruption risk
    • Fix: Document step-by-step restore process (DB import, volume restore, env config)
    • Verification: Perform disaster recovery drill on staging environment

Medium Severity (Address Within 30 Days)

  1. Single Bridge Network

    • Issue: All services on same network; lateral movement easy if one compromised
    • Impact: If one service is exploited, attacker can reach databases/Redis
    • Fix: Split into separate networks (app-net, data-net, services-net)
    • Verification: Verify service isolation via docker network inspect
  2. No Nginx Rate Limiting

    • Issue: Rate limiting only at application level (Express middleware)
    • Impact: DDoS attacks can saturate Nginx/network before reaching API rate limiter
    • Fix: Add nginx limit_req zones for /api/* paths
    • Verification: Send 1000 req/sec, verify 429 responses from Nginx
  3. No Log Aggregation

    • Issue: Logs scattered across Docker containers
    • Impact: Difficult to debug multi-service issues, no centralized audit trail
    • Fix: Implement ELK stack or similar (Elasticsearch, Logstash, Kibana)
    • Verification: Search logs from all services in one UI
  4. No TLS Certificate Monitoring

    • Issue: Pangolin manages certs, but no alerting on renewal failures
    • Impact: Site goes offline when cert expires
    • Fix: Add Prometheus alert for cert expiry (30 days before)
    • Verification: Simulate expired cert, verify alert fires

Low Severity (Nice to Have)

  1. No Service Mesh

    • Issue: No observability of inter-service communication
    • Impact: Difficult to debug network issues between containers
    • Fix: Implement Linkerd or Istio for traffic management
    • Verification: View service-to-service latency in Grafana
  2. No Container Resource Limits

    • Issue: Docker Compose doesn't set CPU/memory limits
    • Impact: One service can starve others of resources
    • Fix: Add deploy.resources.limits to docker-compose.yml
    • Verification: Monitor resource usage under load
  3. No Listmonk HTTPS

    • Issue: API-to-Listmonk communication uses HTTP (inside Docker network)
    • Impact: If network is compromised, credentials visible in plaintext
    • Fix: Configure Listmonk with internal TLS certificate
    • Verification: Inspect network traffic, verify encryption

Implementation Plan

Phase 1: Pre-Deployment Security Hardening (2-3 hours)

File: .env (production environment variables)

Changes Required:

  1. Generate Secrets

    # Run on production server
    openssl rand -hex 32  # JWT_ACCESS_SECRET
    openssl rand -hex 32  # JWT_REFRESH_SECRET
    openssl rand -hex 32  # ENCRYPTION_KEY (must differ from JWT secrets)
    openssl rand -hex 16  # LISTMONK_API_TOKEN
    
  2. Update Environment Variables

    • EMAIL_TEST_MODE=false
    • NODE_TLS_REJECT_UNAUTHORIZED=1 (explicitly enforce certificate validation; Node only disables it when the value is 0)
    • GRAFANA_ADMIN_PASSWORD=<strong_password>
    • GOTIFY_ADMIN_PASSWORD=<strong_password>
    • N8N_USER_PASSWORD=<strong_password>
    • NC_ADMIN_PASSWORD=<strong_password>
    • LISTMONK_WEB_ADMIN_PASSWORD=<strong_password>
    • V2_POSTGRES_PASSWORD=<strong_password>
    • REDIS_PASSWORD=<strong_password>
    • LISTMONK_DB_PASSWORD=<strong_password>
    • GITEA_DB_PASSWD=<strong_password>
    • GITEA_DB_ROOT_PASSWORD=<strong_password>
    • N8N_ENCRYPTION_KEY=<strong_password>
  3. Configure Production SMTP

    • SMTP_HOST=<smtp.provider.com>
    • SMTP_PORT=<465 or 587>
    • SMTP_USER=<username>
    • SMTP_PASS=<password> (will be encrypted by API on first startup)
    • SMTP_SECURE=true (for port 465) or false (for STARTTLS on 587)
  4. Listmonk SMTP Configuration

    • LISTMONK_SMTP_HOST=<smtp.provider.com>
    • LISTMONK_SMTP_PORT=<465 or 587>
    • LISTMONK_SMTP_TLS_TYPE=STARTTLS (for 587) or TLS (for 465)
    • LISTMONK_SMTP_AUTH_PROTOCOL=login
    • LISTMONK_SMTP_USERNAME=<username>
    • LISTMONK_SMTP_PASSWORD=<password>
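The individual openssl commands in step 1 can be consolidated so the whole secret set is generated in one pass. A sketch (gen_hex is our helper, not part of the repo; the od pipeline produces the same hex shape as openssl rand -hex):

```shell
# gen_hex N: print N random bytes as lowercase hex, matching the output
# shape of `openssl rand -hex N`. Helper name is ours, not from the repo.
gen_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Emit the full secret set; append the output to the production .env
for var in JWT_ACCESS_SECRET JWT_REFRESH_SECRET ENCRYPTION_KEY; do
  echo "$var=$(gen_hex 32)"
done
echo "LISTMONK_API_TOKEN=$(gen_hex 16)"
```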

Verification:

# Check all required env vars are set
grep "CHANGE_THIS" .env  # Should return nothing
grep "admin" .env | grep -v ADMIN_EMAIL  # Should return nothing (no default admin passwords)

# Test SMTP connection
docker compose exec api node -e "
  const nodemailer = require('nodemailer');
  const transport = nodemailer.createTransport({
    host: process.env.SMTP_HOST,
    port: parseInt(process.env.SMTP_PORT),
    secure: process.env.SMTP_SECURE === 'true',
    auth: {
      user: process.env.SMTP_USER,
      pass: process.env.SMTP_PASS
    }
  });
  transport.verify().then(console.log).catch(console.error);
"

Phase 2: Nginx Production Hardening (1 hour)

File: nginx/conf.d/default.conf (or new production.conf)

Changes Required:

  1. Add HTTP → HTTPS Redirect

    server {
        listen 80;
        server_name *.betteredmonton.org betteredmonton.org;
    
        # Health check endpoints (allow HTTP)
        location /health {
            proxy_pass http://changemaker-v2-api:4000;
        }
    
        # Redirect all other traffic to HTTPS
        location / {
            return 301 https://$host$request_uri;
        }
    }
    
  2. Add Nginx Rate Limiting

    # Add to http block in nginx.conf
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
    limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;
    
    # Add to api.conf location blocks
    location /api/auth/ {
        limit_req zone=auth_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://changemaker-v2-api:4000;
    }
    
    location /api/ {
        limit_req zone=api_limit burst=200 nodelay;
        limit_req_status 429;
        proxy_pass http://changemaker-v2-api:4000;
    }
    
  3. Restrict Embed Proxy Ports to Localhost

    # Add to each embed proxy server block
    server {
        listen 8881;
        server_name localhost;
    
        # Reject non-localhost connections
        allow 127.0.0.1;
        deny all;
    
        location / {
            proxy_pass http://changemaker-v2-nocodb:8080;
            proxy_hide_header X-Frame-Options;
            proxy_hide_header Content-Security-Policy;
        }
    }
    
  4. Add Custom Error Pages

    # Add to each server block (location directives are not valid in the http block)
    error_page 502 503 504 /5xx.html;
    location = /5xx.html {
        root /usr/share/nginx/html;
        internal;
    }
    
    error_page 429 /429.html;
    location = /429.html {
        root /usr/share/nginx/html;
        internal;
    }
    

Verification:

# Test HTTP redirect
curl -I http://app.betteredmonton.org | grep "301"  # Should see 301 Moved Permanently

# Test rate limiting (the auth zone is easiest to trip: 10r/m with burst=20)
for i in {1..40}; do curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.betteredmonton.org/api/auth/login; done
# Should see 4xx responses, then 429s once the burst allowance is exhausted

# Test embed proxy localhost restriction
curl -I http://<server_ip>:8881  # Should return 403 Forbidden
curl -I http://localhost:8881  # Should return 200 OK

Phase 3: Pangolin Tunnel Configuration (30 minutes)

File: .env (Pangolin environment variables)

Prerequisites:

  • Pangolin organization created at https://api.bnkserve.org
  • API key obtained from organization settings
  • DNS records created (see below)

Steps:

  1. Configure Pangolin Environment Variables

    PANGOLIN_API_URL=https://api.bnkserve.org/v1
    PANGOLIN_API_KEY=<your_api_key>
    PANGOLIN_ORG_ID=<your_org_id>
    PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org
    
  2. Run Automated Setup

    # Option 1: Via API endpoint
    curl -X POST http://localhost:4000/api/pangolin/setup-automated \
      -H "Authorization: Bearer <admin_jwt_token>" \
      -H "Content-Type: application/json" \
      -d '{
        "siteName": "Changemaker Lite Production",
        "domain": "betteredmonton.org"
      }'
    
    # Option 2: Via CLI wrapper
    ./scripts/pangolin-setup.sh
    
  3. Verify Tunnel Connectivity

    # Check Newt container logs
    docker compose logs -f newt
    # Should see "Connected to Pangolin server" and "Tunnel established"
    
    # Test external access
    curl -I https://app.betteredmonton.org
    # Should return 200 OK with HTTPS
    

DNS Configuration Required:

Create 13 CNAME records pointing to the Pangolin endpoint:

app.betteredmonton.org      CNAME   pangolin.bnkserve.org
api.betteredmonton.org      CNAME   pangolin.bnkserve.org
db.betteredmonton.org       CNAME   pangolin.bnkserve.org
docs.betteredmonton.org     CNAME   pangolin.bnkserve.org
code.betteredmonton.org     CNAME   pangolin.bnkserve.org
git.betteredmonton.org      CNAME   pangolin.bnkserve.org
n8n.betteredmonton.org      CNAME   pangolin.bnkserve.org
listmonk.betteredmonton.org CNAME   pangolin.bnkserve.org
mail.betteredmonton.org     CNAME   pangolin.bnkserve.org
qr.betteredmonton.org       CNAME   pangolin.bnkserve.org
draw.betteredmonton.org     CNAME   pangolin.bnkserve.org
grafana.betteredmonton.org  CNAME   pangolin.bnkserve.org
home.betteredmonton.org     CNAME   pangolin.bnkserve.org

Phase 4: Backup Automation (30 minutes)

File: New cron job configuration

Steps:

  1. Create Backup Directory

    mkdir -p /var/backups/changemaker-lite
    chmod 750 /var/backups/changemaker-lite
    
  2. Test Manual Backup

    cd /home/bunker-admin/changemaker.lite
    ./scripts/backup.sh
    # Should create timestamped backup files in ./backups/
    
  3. Configure S3 Upload (Optional)

    # Add to .env
    S3_BUCKET=changemaker-lite-backups
    AWS_ACCESS_KEY_ID=<your_access_key>
    AWS_SECRET_ACCESS_KEY=<your_secret_key>
    AWS_REGION=us-east-1  # Or your preferred region
    
  4. Add Cron Job

    # Edit crontab
    crontab -e
    
    # Add the following lines:
    # Backup every 6 hours at minute 0
    0 */6 * * * cd /home/bunker-admin/changemaker.lite && ./scripts/backup.sh >> /var/log/changemaker-backup.log 2>&1
    
    # Clean up old backups (keep last 7 days)
    0 3 * * * find /home/bunker-admin/changemaker.lite/backups -type f -mtime +7 -delete
    
  5. Setup Backup Monitoring Alert

    # Add to configs/prometheus/alerts.yml
    - alert: BackupJobFailed
      expr: time() - cm_backup_last_success_timestamp > 21600  # 6 hours
      for: 1h
      labels:
        severity: critical
      annotations:
        summary: "Backup job has not run successfully in over 6 hours"
        description: "Last successful backup was {{ $value | humanizeDuration }} ago"
    
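The alert in step 5 assumes a cm_backup_last_success_timestamp metric that scripts/backup.sh does not currently export. One common pattern is the node_exporter textfile collector; a sketch to append to the backup script (the directory and the --collector.textfile.directory flag are assumptions about your node_exporter setup):

```shell
# emit_backup_metric [DIR]: write the timestamp metric that the
# BackupJobFailed alert scrapes. Assumes node_exporter runs with
# --collector.textfile.directory pointed at DIR.
emit_backup_metric() {
  local dir="${1:-/var/lib/node_exporter/textfile}"
  mkdir -p "$dir"
  # Write to a temp file and rename so node_exporter never reads a partial file
  printf '# TYPE cm_backup_last_success_timestamp gauge\ncm_backup_last_success_timestamp %s\n' \
    "$(date +%s)" > "$dir/backup.prom.tmp"
  mv "$dir/backup.prom.tmp" "$dir/backup.prom"
}
```

Call emit_backup_metric as the last line of scripts/backup.sh so the metric only advances on a successful run.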

Verification:

# Wait for cron execution (or run manually)
./scripts/backup.sh

# Check backup files exist
ls -lh backups/
# Should see 3 files: changemaker_v2-YYYYMMDD-HHMMSS.sql, listmonk-YYYYMMDD-HHMMSS.sql, uploads-YYYYMMDD-HHMMSS.tar.gz

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/
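The three-file check can also be scripted for CI or a post-backup hook. A sketch (the function name is ours; the filename patterns match the backup artifacts described above):

```shell
# verify_backup_set [DIR]: confirm a backup run produced all three artifacts.
verify_backup_set() {
  local dir="${1:-backups}" missing=0
  for pattern in 'changemaker_v2-*.sql' 'listmonk-*.sql' 'uploads-*.tar.gz'; do
    # Unquoted on purpose so the glob expands against the backup directory
    if ! ls $dir/$pattern >/dev/null 2>&1; then
      echo "MISSING: $pattern"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "OK: backup set complete"
  return "$missing"
}
```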

Phase 5: Docker Compose Updates (1 hour)

File: docker-compose.yml

Changes Required:

  1. Fix Redis Exporter Authentication

    redis-exporter:
      environment:
        - REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379
    
  2. Add Container Resource Limits (Optional)

    api:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 2G
          reservations:
            cpus: '1'
            memory: 1G
    
    media-api:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G  # Higher for video processing
          reservations:
            cpus: '1'
            memory: 2G
    
    v2-postgres:
      deploy:
        resources:
          limits:
            cpus: '2'
            memory: 4G
          reservations:
            cpus: '1'
            memory: 2G
    
  3. Add Volume Size Limits (Optional)

    # NOTE: the local volume driver does not enforce a size option on bind
    # mounts; real quotas need filesystem support (e.g. XFS project quotas).
    # Until then, rely on the DiskSpaceRunningLow alert in Phase 6.
    volumes:
      v2-postgres-data:
        driver_opts:
          type: none
          device: /var/lib/docker/volumes/v2-postgres-data
          o: bind
    

Verification:

# Recreate containers with new config
docker compose down
docker compose up -d

# Verify Redis exporter connects with auth
docker compose logs redis-exporter | grep "successfully"

# Check resource limits are applied
docker stats --no-stream | grep changemaker

Phase 6: Monitoring & Alerting Setup (1-2 hours)

File: configs/prometheus/alerts.yml

Additional Alerts to Add:

groups:
  - name: production_critical
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 2592000  # 30 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections ({{ $value }})"
          description: "PostgreSQL has {{ $value }} active connections"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"
          description: "Redis is using {{ $value | humanizePercentage }} of allocated memory"
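Note that the SSLCertExpiringSoon rule relies on probe_ssl_earliest_cert_expiry, which is produced by the Prometheus blackbox_exporter rather than any exporter in the current monitoring profile. If you adopt that rule, a scrape job along these lines is also needed (the job name, module, and container address here are assumptions, not existing config):

```yaml
# configs/prometheus/prometheus.yml (sketch) -- requires a running blackbox_exporter
scrape_configs:
  - job_name: blackbox_tls
    metrics_path: /probe
    params:
      module: [http_2xx]   # module must be defined in blackbox.yml
    static_configs:
      - targets:
          - https://app.betteredmonton.org
          - https://api.betteredmonton.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # container name is an assumption
```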

Gotify Configuration:

# Start Gotify container
docker compose --profile monitoring up -d gotify

# Access Gotify UI at http://localhost:8889
# Change admin password (default: admin/admin)

# Create application token for Alertmanager
# Copy token to configs/alertmanager/alertmanager.yml:

receivers:
  - name: 'gotify'
    webhook_configs:
      - url: 'http://gotify-changemaker:80/message?token=<your_app_token>'
        send_resolved: true

Verification:

# Start monitoring stack
docker compose --profile monitoring up -d

# Check Prometheus targets are up
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Should return empty (all targets healthy)

# Test alert firing
docker compose stop api
# Wait 1 minute, check Alertmanager UI at http://localhost:9093
# Should see "APIDown" alert firing

# Verify Gotify receives notification
# Check Gotify UI, should see new message

# Restart API
docker compose start api

Phase 7: Final Production Verification (1-2 hours)

Production Deployment Checklist:

Security:

  • All default passwords changed (Grafana, Gotify, N8N, NocoDB, Listmonk)
  • JWT access/refresh secrets generated via openssl rand -hex 32 (two distinct values)
  • ENCRYPTION_KEY generated and different from JWT secrets
  • EMAIL_TEST_MODE=false set in production .env
  • NODE_TLS_REJECT_UNAUTHORIZED=1 for strict TLS validation (never 0)
  • Redis password set and authenticated
  • PostgreSQL passwords strong (20+ characters)
  • Nginx rate limiting enabled
  • Embed proxy ports restricted to localhost

Networking:

  • All 13 DNS CNAME records created
  • Pangolin tunnel configured and connected
  • HTTP → HTTPS redirect working
  • All subdomains resolve via HTTPS
  • SSL certificates valid (checked via browser)
  • WebSocket connections work (test n8n, MkDocs, Code Server)

Email:

  • Production SMTP configured (host, port, user, pass)
  • Test email sent and received
  • Listmonk SMTP configured separately
  • Password reset email works
  • Shift confirmation email works

Backup:

  • Backup script tested manually
  • S3 credentials configured (if using)
  • Cron job added for automated backups (every 6 hours)
  • Old backup cleanup cron added (7 day retention)
  • Backup monitoring alert configured

Monitoring:

  • Prometheus collecting metrics from all services
  • Grafana dashboards showing data
  • Alertmanager configured with Gotify
  • SSL expiry alert configured (30 days warning)
  • Disk space alert configured (10% threshold)
  • Backup job alert configured (6 hour SLA)
  • Test alert sent to Gotify

Application:

  • Admin login works (JWT token issued)
  • Admin dashboard loads all components
  • API health check returns 200 OK
  • Media upload works (test 100MB+ video)
  • Geocoding works (test address lookup)
  • Map loads locations correctly
  • Campaign email sending works (test queue)
  • Listmonk sync works (if enabled)
  • Canvass map GPS tracking works (volunteer portal)

Performance:

  • Nginx rate limiting prevents abuse (test with 1000 req/sec)
  • Database connection pooling configured
  • Redis cache hit ratio >80% (check Grafana)
  • Page load times <2 seconds (test with network throttling)
  • Video upload completes within timeout (10GB max)

Disaster Recovery:

  • Full backup restored on staging environment
  • Database migration verified (Prisma migrations applied)
  • Environment variables match production
  • All services start cleanly after restore

Verification Steps

Post-Deployment Tests

1. SSL/TLS Verification

# Check all subdomains have valid SSL
for subdomain in app api db docs code git n8n listmonk mail qr draw grafana home; do
  echo "Testing $subdomain.betteredmonton.org"
  curl -vI https://$subdomain.betteredmonton.org 2>&1 | grep -E "(HTTP|subject:|issuer:)"
done

# Should see:
# - HTTP/2 200 (or 301 redirect)
# - Valid certificate issuer (Let's Encrypt or Pangolin)
# - No certificate errors

2. Authentication Flow

# Test login endpoint
curl -X POST https://api.betteredmonton.org/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<admin_password>"}' \
  | jq '.accessToken'

# Should return JWT token

# Test token refresh
curl -X POST https://api.betteredmonton.org/api/auth/refresh \
  -H "Content-Type: application/json" \
  -d '{"refreshToken":"<refresh_token>"}' \
  | jq '.accessToken'

# Should return new access token

3. Email Delivery

# Trigger password reset email
curl -X POST https://api.betteredmonton.org/api/auth/forgot-password \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com"}'

# Check external email inbox for reset link
# Verify email arrives within 2 minutes

4. Rate Limiting

# Test auth endpoint rate limit (10/min)
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code} " https://api.betteredmonton.org/api/auth/login \
    -H "Content-Type: application/json" \
    -d '{"email":"test","password":"test"}'
done

# Should see: 401 401 401 ... 429 429 429 (after 10 requests)

5. Database Connectivity

# Check API can connect to database
curl https://api.betteredmonton.org/api/health | jq '.database'
# Should return: "healthy"

# Check Redis connectivity
curl https://api.betteredmonton.org/api/health | jq '.redis'
# Should return: "healthy"

6. Media Upload

# Test video upload (requires auth token)
curl -X POST https://api.betteredmonton.org/media/videos/upload \
  -H "Authorization: Bearer <admin_jwt>" \
  -F "file=@test-video.mp4" \
  -F "title=Test Upload" \
  | jq '.id'

# Should return video ID

7. Monitoring Endpoints

# Prometheus targets
curl -s https://grafana.betteredmonton.org/api/datasources/proxy/1/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Should show all targets with health: "up"

# Grafana health
curl https://grafana.betteredmonton.org/api/health
# Should return: {"database":"ok","version":"..."}

8. Backup Verification

# Trigger manual backup
./scripts/backup.sh

# Check backup files created
ls -lh backups/ | tail -3

# Should see 3 new files with current timestamp

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/ | tail -3

Critical Files Reference

Configuration Files:

  • docker-compose.yml - Service orchestration (25+ services)
  • .env - Environment variables (100+ vars, not committed)
  • .env.example - Template with all required variables
  • nginx/nginx.conf - Global Nginx config + security headers
  • nginx/conf.d/api.conf - API + Media API reverse proxy
  • nginx/conf.d/services.conf - 12 service subdomains + embed proxies
  • configs/pangolin/resources.yml - Tunnel resource definitions
  • configs/prometheus/prometheus.yml - Metrics collection config
  • configs/prometheus/alerts.yml - Alert rules
  • configs/grafana/*.json - Pre-configured dashboards
  • configs/alertmanager/alertmanager.yml - Alert routing

Database Schema:

  • api/prisma/schema.prisma - Main database schema (30+ models)
  • api/prisma/migrations/ - Migration history
  • api/prisma/seed.ts - Initial data seeding

Deployment Scripts:

  • scripts/backup.sh - PostgreSQL + Listmonk + uploads backup
  • scripts/pangolin-setup.sh - CLI wrapper for automated tunnel setup

Environment Validation:

  • api/src/config/env.ts - Zod schema for all environment variables (100+ vars)

Rollback Procedure

If deployment fails or critical issues arise:

1. Immediate Rollback (5 minutes)

# Stop all containers
docker compose down

# Restore previous .env file
cp .env.backup .env

# Restart with old configuration
docker compose up -d

2. Database Rollback (15 minutes)

# Stop API to prevent new writes
docker compose stop api media-api

# Restore from latest backup (WITH (FORCE) terminates lingering connections; PostgreSQL 13+)
docker compose exec v2-postgres psql -U changemaker -d postgres -c "DROP DATABASE IF EXISTS changemaker_v2 WITH (FORCE);"
docker compose exec v2-postgres psql -U changemaker -d postgres -c "CREATE DATABASE changemaker_v2;"
docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < backups/changemaker_v2-YYYYMMDD-HHMMSS.sql

# Restart services
docker compose start api media-api

3. Full System Restore (30 minutes)

# Stop all services
docker compose down -v  # WARNING: Removes all volumes

# Restore PostgreSQL data
# NOTE: the volume tarballs below assume filesystem-level backups were taken;
# scripts/backup.sh only produces SQL dumps (for those, use the Database
# Rollback steps above instead)
tar -xzf backups/postgres-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore Redis data (if backed up)
tar -xzf backups/redis-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore uploads
tar -xzf backups/uploads-YYYYMMDD-HHMMSS.tar.gz -C ./media/

# Restart all services
docker compose up -d

4. Verify Rollback Success

# Check all services healthy
docker compose ps | grep -v "Up"  # Should show only the header line (every service Up)

# Test admin login
curl -X POST http://localhost:4000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<password>"}'

# Verify database has data
curl http://localhost:4000/api/health | jq '.database'

Post-Production Maintenance

Daily Tasks:

  • Monitor Grafana dashboards for anomalies
  • Check Gotify alerts for critical issues
  • Verify backups completed successfully (check logs)

Weekly Tasks:

  • Review API error logs for patterns
  • Check disk space usage (alert should fire if <10%)
  • Verify SSL certificate validity (30 days remaining)
  • Test disaster recovery on staging environment

Monthly Tasks:

  • Review access logs for suspicious activity
  • Update Docker images to latest versions (after testing on staging)
  • Audit user accounts and remove inactive users
  • Review and rotate API keys if necessary

Quarterly Tasks:

  • Conduct full security audit (penetration testing)
  • Review and update rate limiting thresholds based on traffic
  • Analyze backup storage costs and adjust retention policy
  • Test full disaster recovery procedure with restore drill

Summary

This plan provides a comprehensive pathway from development to production for the Changemaker Lite V2 networking infrastructure. The architecture is fundamentally sound with:

Strengths:

  • Single bridge network simplifies communication
  • Pangolin tunnel handles SSL/TLS externally (zero Nginx cert management)
  • Comprehensive security headers and policies
  • Automated backup script exists
  • Monitoring stack with Prometheus/Grafana ready
  • Rate limiting on critical endpoints

Critical Path for Production:

  1. Phase 1: Security hardening (change passwords, configure SMTP) - MUST DO
  2. Phase 3: Pangolin tunnel setup - MUST DO
  3. Phase 4: Backup automation - SHOULD DO
  4. Phase 6: Monitoring alerts - SHOULD DO
  5. Phase 2: Nginx hardening - SHOULD DO (the HTTP → HTTPS redirect is rated High severity above)

The remaining phases (network segmentation, resource limits, log aggregation) can be deferred to post-launch improvements without blocking production deployment.

Estimated Total Implementation Time: 6-10 hours (can be split across multiple days)

Estimated Downtime During Deployment: <5 minutes (only during final container restart)