Production Networking Preparation Plan
Context
The Changemaker Lite V2 application needs to be prepared for production deployment. The current architecture is development-focused with Docker Compose orchestration, Nginx reverse proxy, and Pangolin tunnel integration for SSL/TLS termination. The user wants a comprehensive understanding of the networking setup and identification of production readiness gaps before going live.
Why this is needed:
- Current setup is optimized for local development (HTTP-only, MailHog, default passwords)
- Production deployment requires SSL/TLS via Pangolin tunnel, real SMTP, security hardening
- Need to identify all gaps between dev and production configurations
- Need actionable checklist for production cutover
What prompted this:
- User preparing to deploy production instance on betteredmonton.org domain
- Need to understand networking architecture, security posture, and deployment requirements
- Ensure all 12 subdomains route correctly through Pangolin tunnel
Intended outcome:
- Comprehensive documentation of current networking architecture
- Identified production readiness gaps with severity ratings
- Prioritized checklist for production deployment
- Configuration changes needed for production hardening
Current State Assessment
Network Architecture
Single-Bridge Network Design:
- All 25+ services on one Docker bridge network (`changemaker-lite`)
- Services communicate via container hostnames (Docker embedded DNS at `127.0.0.11`)
- Nginx acts as the single reverse proxy for all external traffic
- Pangolin tunnel (Newt container) provides SSL/TLS termination
Service Topology:
Internet → Pangolin Tunnel (HTTPS) → Newt Container → Nginx (HTTP:80) → Backend Services
Critical Services:
- Express API (port 4000) - Main V2 API with Prisma ORM
- Fastify Media API (port 4100) - Video library management
- Admin GUI (port 3000) - React admin interface
- PostgreSQL V2 (port 5433 localhost-only) - Primary database
- Redis (port 6379) - Cache, rate limiting, BullMQ backend
- Nginx (ports 80/443) - Reverse proxy with 12 subdomain routes
Subdomain Routing Matrix
| Subdomain | Backend | Container Port | Purpose | Security Headers |
|---|---|---|---|---|
| app.betteredmonton.org | Admin GUI | 3000 | Admin interface | SAMEORIGIN |
| api.betteredmonton.org | Express + Media API | 4000/4100 | Main API + Media routes | SAMEORIGIN |
| betteredmonton.org (root) | MkDocs Site | 80 | Public documentation | Default |
| db.betteredmonton.org | NocoDB | 8080 | Data browser | CSP iframe |
| docs.betteredmonton.org | MkDocs Dev | 8000 | Live preview | CSP iframe + WS |
| code.betteredmonton.org | Code Server | 8080 | Web IDE | CSP iframe + WS |
| git.betteredmonton.org | Gitea | 3000 | Git hosting | CSP iframe |
| n8n.betteredmonton.org | n8n | 5678 | Workflow automation | CSP iframe + WS |
| listmonk.betteredmonton.org | Listmonk | 9000 | Newsletter platform | SAMEORIGIN |
| mail.betteredmonton.org | MailHog | 8025 | Email capture (dev) | CSP iframe + WS |
| qr.betteredmonton.org | Mini QR | 8080 | QR code generator | CSP iframe |
| draw.betteredmonton.org | Excalidraw | 80 | Collaborative whiteboard | CSP iframe + WS |
| grafana.betteredmonton.org | Grafana | 3000 | Monitoring dashboard | SAMEORIGIN |
| home.betteredmonton.org | Homepage | 3000 | Service dashboard | SAMEORIGIN |
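Each row in the matrix corresponds to one Nginx server block along these lines. This is an illustrative sketch only; the upstream container name `changemaker-v2-admin-gui` is an assumption, not taken from the repo:

```nginx
# Sketch of the app.betteredmonton.org route (upstream name is hypothetical)
server {
    listen 80;
    server_name app.betteredmonton.org;

    # Security header from the matrix
    add_header X-Frame-Options "SAMEORIGIN" always;

    location / {
        proxy_pass http://changemaker-v2-admin-gui:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

Rows marked "CSP iframe + WS" would additionally carry `Upgrade`/`Connection` headers for WebSocket proxying.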
Embed Proxy Ports (bypass security headers for iframe embedding):
- Ports 8881-8886 strip the `X-Frame-Options` and `Content-Security-Policy` headers
- Used by the Admin GUI to embed third-party services (NocoDB, n8n, Gitea, MailHog, Mini QR, Excalidraw)
SSL/TLS & Tunnel Configuration
Current Setup:
- Nginx: HTTP-only (port 80), no SSL/TLS configuration
- Pangolin Tunnel: Handles all HTTPS termination externally
- Newt Container: Establishes encrypted tunnel to Pangolin server
- Certificate Management: Delegated entirely to Pangolin (zero config in Nginx)
Pangolin Environment Variables:
```
PANGOLIN_API_URL=https://api.bnkserve.org/v1      # Self-hosted Pangolin instance
PANGOLIN_API_KEY=                                 # Bearer token authentication
PANGOLIN_ORG_ID=                                  # Organization identifier
PANGOLIN_SITE_ID=                                 # Created during initial setup
PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org   # Tunnel entry point
PANGOLIN_NEWT_ID=                                 # Generated tunnel identity
PANGOLIN_NEWT_SECRET=                             # Tunnel authentication secret
```
Automated Setup (Feb 2026):
- One-command deployment via the `/api/pangolin/setup-automated` endpoint
- Central resource config: `configs/pangolin/resources.yml` (12 services)
- Atomic .env updates + Newt container restart + tunnel verification
- Reduces setup time from 15min → 2min (87% reduction)
Security Posture
Strengths:
- ✅ JWT access/refresh token rotation (atomic transactions)
- ✅ Password policy enforced at schema level (12+ chars, complexity requirements)
- ✅ Rate limiting on auth endpoints (10/min per IP)
- ✅ Redis authentication required (
requirepassenforced) - ✅ User enumeration prevention (401 for all auth failures)
- ✅ Database secrets encrypted with
ENCRYPTION_KEY - ✅ HSTS header with 1-year max-age + includeSubDomains
- ✅ CSP headers for iframe protection on sensitive services
- ✅ PostgreSQL bound to localhost only (not exposed to network)
- ✅ Security audit completed Feb 2026 (13 findings addressed)
Critical Gaps:
- ❌ No HTTP → HTTPS redirect in Nginx (relies on Pangolin)
- ❌ Embed proxy ports (8881-8886) bypass ALL security headers (XSS risk)
- ❌ No nginx-level rate limiting (only application-level)
- ❌ Grafana admin password defaults to "admin"
- ❌ Gotify admin password defaults to "admin"
- ❌ N8N default credentials in .env.example
- ❌ EMAIL_TEST_MODE=true by default (routes to MailHog in production)
- ❌ NODE_TLS_REJECT_UNAUTHORIZED not explicitly set (could accept self-signed certs)
Database & Caching
PostgreSQL V2 (changemaker-v2-postgres):
- Port binding: `127.0.0.1:5433:5432` (localhost-only, production-safe)
- Connection: `postgresql://changemaker:${V2_POSTGRES_PASSWORD}@changemaker-v2-postgres:5432/changemaker_v2`
- Used by: Express API (Prisma), Media API (Prisma), NocoDB (separate `nocodb_meta` DB)
- Healthcheck: `pg_isready` with 10s interval
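In docker-compose.yml, a `pg_isready` healthcheck with the 10s interval noted above typically looks like this (a sketch; the timeout and retry values are illustrative assumptions):

```yaml
v2-postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U changemaker -d changemaker_v2"]
    interval: 10s
    timeout: 5s
    retries: 5
```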
Listmonk PostgreSQL (listmonk-db):
- Port binding: `127.0.0.1:5432:5432` (localhost-only)
- Isolated database lifecycle (separate from V2)
- Two-user architecture: Web admin + API user (plaintext tokens)
Redis (redis-changemaker):
- Port binding: `6379:6379` (exposed to host network)
- Authentication: `requirepass ${REDIS_PASSWORD}` enforced
- Connection: `redis://:${REDIS_PASSWORD}@redis-changemaker:6379`
- Used for: cache, BullMQ queues, rate limiting, geocoding cache
- SECURITY NOTE: redis-exporter uses unauthenticated connection string (potential risk)
Email Configuration
Development (Current):
- `EMAIL_TEST_MODE=true` → all emails route to MailHog (localhost:1025)
- MailHog Web UI: `http://mail.betteredmonton.org` (dev only)
- No external SMTP configured
Production Requirements:
- `EMAIL_TEST_MODE=false` → route to real SMTP server
- SMTP credentials: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`
- Encrypt SMTP password with `ENCRYPTION_KEY` (stored in DB)
- Configure Listmonk SMTP separately (newsletter sending)
Email Systems:
- Campaign Emails (BullMQ queue) → Main SMTP
- System Emails (password reset, shift confirmations) → Main SMTP
- Newsletter Emails (Listmonk) → Listmonk SMTP (can be same or separate)
Monitoring & Observability
Prometheus Metrics:
- 12 custom `cm_*` metrics (API uptime, queue size, sessions, etc.)
- HTTP request metrics (duration, status codes, paths)
- Redis, PostgreSQL, container metrics via exporters
- Scrape interval: 15s
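For reference, the custom `cm_*` metrics are served in the Prometheus text exposition format, which can be sketched without any library. The metric name, value, and labels below are illustrative; the real API presumably uses a client library such as prom-client:

```javascript
// Minimal Prometheus text-exposition formatter (illustrative only).
function formatGauge(name, help, value, labels = {}) {
  const labelStr = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(",");
  const series = labelStr ? `${name}{${labelStr}}` : name;
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} gauge`,
    `${series} ${value}`,
  ].join("\n");
}

const body = formatGauge("cm_active_sessions", "Currently active sessions", 42, {
  service: "api",
});
console.log(body);
// # HELP cm_active_sessions Currently active sessions
// # TYPE cm_active_sessions gauge
// cm_active_sessions{service="api"} 42
```

Prometheus scrapes exactly this plain-text shape from the `/metrics` endpoint every 15 seconds.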
Grafana Dashboards:
- 3 pre-configured dashboards (API metrics, system metrics, canvass activity)
- Data source: Prometheus
- Default admin: `admin`/`admin` (must change for production)
Alertmanager:
- Alert routing configured
- Requires Gotify setup for notifications (default: admin/admin)
Services Behind --profile monitoring:
- Prometheus (9090)
- Grafana (3001)
- Alertmanager (9093)
- cAdvisor (8080)
- Node Exporter (9100)
- Redis Exporter (9121)
- Gotify (8889)
Backup & Disaster Recovery
Current Backup Script (scripts/backup.sh):
- PostgreSQL V2 dump (pg_dump)
- Listmonk database dump
- Uploads directory archive (tar.gz)
- Optional S3 upload (requires `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
Critical Gaps:
- ❌ No automated backup scheduling (cron not configured)
- ❌ No backup retention policy
- ❌ No disaster recovery playbook
- ❌ No restore procedure documentation
- ❌ No backup monitoring/alerting
Production Readiness Gaps
Critical Severity (Must Fix Before Production)
1. Default Admin Passwords
   - Services: Grafana, Gotify, N8N, NocoDB, Listmonk
   - Impact: Unauthorized access to admin dashboards, data exfiltration
   - Fix: Change all default passwords in `.env` before deployment
   - Verification: Attempt login with default credentials (should fail)

2. Email Test Mode Enabled
   - Issue: `EMAIL_TEST_MODE=true` routes all production emails to MailHog
   - Impact: Users never receive password reset, shift confirmation, or campaign emails
   - Fix: Set `EMAIL_TEST_MODE=false` + configure real SMTP credentials
   - Verification: Send test email, verify receipt in external inbox

3. Missing ENCRYPTION_KEY
   - Issue: Required for encrypting DB secrets (SMTP passwords, API tokens)
   - Impact: Application won't start in production if unset
   - Fix: Generate via `openssl rand -hex 32`, add to `.env`
   - Verification: Restart API, check logs for encryption errors

4. Embed Proxy XSS Risk
   - Issue: Ports 8881-8886 strip all security headers (`X-Frame-Options`, CSP)
   - Impact: If one service is compromised, an attacker can iframe it from a malicious site
   - Fix: Restrict embed proxy ports to localhost-only OR implement an IP whitelist
   - Verification: Attempt to access an embed proxy from an external IP (should fail)
High Severity (Fix Before Launch)
1. No HTTP → HTTPS Redirect
   - Issue: Users can access `http://betteredmonton.org` without a forced redirect
   - Impact: Mixed content warnings, insecure authentication cookies
   - Fix: Add an nginx redirect block for all subdomains
   - Verification: `curl -I http://app.betteredmonton.org` should return a 301 redirect

2. No Automated Backups
   - Issue: Manual backup script requires cron scheduling
   - Impact: Data loss if the server fails before a manual backup
   - Fix: Add cron job: `0 */6 * * * /path/to/backup.sh` (every 6 hours)
   - Verification: Check `/var/log/cron` for backup execution logs

3. Redis Exporter Unauthenticated
   - Issue: `REDIS_ADDR=redis:6379` (no password)
   - Impact: If the exporter runs on a separate network segment, Redis is exposed
   - Fix: Change to `REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis:6379`
   - Verification: Check redis-exporter logs, ensure no auth errors

4. No Disaster Recovery Documentation
   - Issue: Restore procedure not documented
   - Impact: Extended downtime during recovery, data corruption risk
   - Fix: Document the step-by-step restore process (DB import, volume restore, env config)
   - Verification: Perform a disaster recovery drill on a staging environment
Medium Severity (Address Within 30 Days)
1. Single Bridge Network
   - Issue: All services share one network; lateral movement is easy if one is compromised
   - Impact: If one service is exploited, the attacker can reach the databases/Redis
   - Fix: Split into separate networks (app-net, data-net, services-net)
   - Verification: Verify service isolation via `docker network inspect`
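The suggested split could look like this in docker-compose.yml. This is a sketch: the network names come from the fix above, but which services join which network is an assumption:

```yaml
networks:
  app-net: {}       # Nginx <-> application services
  data-net: {}      # application services <-> databases/Redis
  services-net: {}  # auxiliary services (Gitea, n8n, ...)

services:
  nginx:
    networks: [app-net, services-net]
  api:
    networks: [app-net, data-net]   # reaches Nginx and the databases
  v2-postgres:
    networks: [data-net]            # unreachable from services-net
```

With this layout a compromised auxiliary service on services-net has no route to PostgreSQL at all.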
2. No Nginx Rate Limiting
   - Issue: Rate limiting only at the application level (Express middleware)
   - Impact: DDoS attacks can saturate Nginx/network before reaching the API rate limiter
   - Fix: Add nginx `limit_req` zones for `/api/*` paths
   - Verification: Send 1000 req/sec, verify 429 responses from Nginx

3. No Log Aggregation
   - Issue: Logs scattered across Docker containers
   - Impact: Difficult to debug multi-service issues, no centralized audit trail
   - Fix: Implement the ELK stack or similar (Elasticsearch, Logstash, Kibana)
   - Verification: Search logs from all services in one UI

4. No TLS Certificate Monitoring
   - Issue: Pangolin manages certs, but there is no alerting on renewal failures
   - Impact: Site goes offline when a cert expires
   - Fix: Add a Prometheus alert for cert expiry (30 days before)
   - Verification: Simulate an expired cert, verify the alert fires
Low Severity (Nice to Have)
1. No Service Mesh
   - Issue: No observability of inter-service communication
   - Impact: Difficult to debug network issues between containers
   - Fix: Implement Linkerd or Istio for traffic management
   - Verification: View service-to-service latency in Grafana

2. No Container Resource Limits
   - Issue: Docker Compose doesn't set CPU/memory limits
   - Impact: One service can starve others of resources
   - Fix: Add `deploy.resources.limits` to docker-compose.yml
   - Verification: Monitor resource usage under load

3. No Listmonk HTTPS
   - Issue: API-to-Listmonk communication uses HTTP (inside the Docker network)
   - Impact: If the network is compromised, credentials are visible in plaintext
   - Fix: Configure Listmonk with an internal TLS certificate
   - Verification: Inspect network traffic, verify encryption
Implementation Plan
Phase 1: Pre-Deployment Security Hardening (2-3 hours)
File: .env (production environment variables)
Changes Required:
1. Generate Secrets

   ```bash
   # Run on production server
   openssl rand -hex 32   # JWT_ACCESS_SECRET
   openssl rand -hex 32   # JWT_REFRESH_SECRET
   openssl rand -hex 32   # ENCRYPTION_KEY (must differ from JWT secrets)
   openssl rand -hex 16   # LISTMONK_API_TOKEN
   ```

2. Update Environment Variables

   ```
   EMAIL_TEST_MODE=false
   NODE_TLS_REJECT_UNAUTHORIZED=            # empty string for strict validation
   GRAFANA_ADMIN_PASSWORD=<strong_password>
   GOTIFY_ADMIN_PASSWORD=<strong_password>
   N8N_USER_PASSWORD=<strong_password>
   NC_ADMIN_PASSWORD=<strong_password>
   LISTMONK_WEB_ADMIN_PASSWORD=<strong_password>
   V2_POSTGRES_PASSWORD=<strong_password>
   REDIS_PASSWORD=<strong_password>
   LISTMONK_DB_PASSWORD=<strong_password>
   GITEA_DB_PASSWD=<strong_password>
   GITEA_DB_ROOT_PASSWORD=<strong_password>
   N8N_ENCRYPTION_KEY=<strong_password>
   ```

3. Configure Production SMTP

   ```
   SMTP_HOST=<smtp.provider.com>
   SMTP_PORT=<465 or 587>
   SMTP_USER=<username>
   SMTP_PASS=<password>   # will be encrypted by the API on first startup
   SMTP_SECURE=true       # for port 465; false for STARTTLS on 587
   ```

4. Listmonk SMTP Configuration

   ```
   LISTMONK_SMTP_HOST=<smtp.provider.com>
   LISTMONK_SMTP_PORT=<465 or 587>
   LISTMONK_SMTP_TLS_TYPE=STARTTLS   # for 587; TLS for 465
   LISTMONK_SMTP_AUTH_PROTOCOL=login
   LISTMONK_SMTP_USERNAME=<username>
   LISTMONK_SMTP_PASSWORD=<password>
   ```
Verification:
```bash
# Check all required env vars are set
grep "CHANGE_THIS" .env   # Should return nothing

# Check for default admin passwords (match values, not variable names)
grep "=admin" .env | grep -v ADMIN_EMAIL   # Should return nothing

# Test SMTP connection
docker compose exec api node -e "
const nodemailer = require('nodemailer');
const transport = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: parseInt(process.env.SMTP_PORT),
  secure: process.env.SMTP_SECURE === 'true',
  auth: {
    user: process.env.SMTP_USER,
    pass: process.env.SMTP_PASS
  }
});
transport.verify().then(console.log).catch(console.error);
"
```
Phase 2: Nginx Production Hardening (1 hour)
File: nginx/conf.d/default.conf (or new production.conf)
Changes Required:
1. Add HTTP → HTTPS Redirect

   ```nginx
   server {
       listen 80;
       server_name *.betteredmonton.org betteredmonton.org;

       # Health check endpoints (allow HTTP)
       location /health {
           proxy_pass http://changemaker-v2-api:4000;
       }

       # Redirect all other traffic to HTTPS
       location / {
           return 301 https://$host$request_uri;
       }
   }
   ```

2. Add Nginx Rate Limiting

   ```nginx
   # Add to http block in nginx.conf
   limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
   limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;

   # Add to api.conf location blocks
   location /api/auth/ {
       limit_req zone=auth_limit burst=20 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }

   location /api/ {
       limit_req zone=api_limit burst=200 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }
   ```

3. Restrict Embed Proxy Ports to Localhost

   ```nginx
   # Add to each embed proxy server block
   server {
       listen 8881;
       server_name localhost;

       # Reject non-localhost connections
       allow 127.0.0.1;
       deny all;

       location / {
           proxy_pass http://changemaker-v2-nocodb:8080;
           proxy_hide_header X-Frame-Options;
           proxy_hide_header Content-Security-Policy;
       }
   }
   ```

4. Add Custom Error Pages

   ```nginx
   # Add to each server block (location is not valid in the http context)
   error_page 502 503 504 /5xx.html;
   location = /5xx.html {
       root /usr/share/nginx/html;
       internal;
   }

   error_page 429 /429.html;
   location = /429.html {
       root /usr/share/nginx/html;
       internal;
   }
   ```
Verification:
```bash
# Test HTTP redirect
curl -I http://app.betteredmonton.org | grep "301"   # Should see 301 Moved Permanently

# Test rate limiting
for i in {1..100}; do curl -s -o /dev/null -w "%{http_code}\n" http://api.betteredmonton.org/api/health; done
# Should see mostly 200s, then 429s

# Test embed proxy localhost restriction
curl -I http://<server_ip>:8881   # Should return 403 Forbidden
curl -I http://localhost:8881     # Should return 200 OK
```
Phase 3: Pangolin Tunnel Configuration (30 minutes)
File: .env (Pangolin environment variables)
Prerequisites:
- Pangolin organization created at `https://api.bnkserve.org`
- API key obtained from organization settings
- DNS records created (see below)
Steps:
1. Configure Pangolin Environment Variables

   ```
   PANGOLIN_API_URL=https://api.bnkserve.org/v1
   PANGOLIN_API_KEY=<your_api_key>
   PANGOLIN_ORG_ID=<your_org_id>
   PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org
   ```

2. Run Automated Setup

   ```bash
   # Option 1: Via API endpoint
   curl -X POST http://localhost:4000/api/pangolin/setup-automated \
     -H "Authorization: Bearer <admin_jwt_token>" \
     -H "Content-Type: application/json" \
     -d '{
       "siteName": "Changemaker Lite Production",
       "domain": "betteredmonton.org"
     }'

   # Option 2: Via CLI wrapper
   ./scripts/pangolin-setup.sh
   ```

3. Verify Tunnel Connectivity

   ```bash
   # Check Newt container logs
   docker compose logs -f newt
   # Should see "Connected to Pangolin server" and "Tunnel established"

   # Test external access
   curl -I https://app.betteredmonton.org
   # Should return 200 OK with HTTPS
   ```
DNS Configuration Required:
Create 13 CNAME records pointing to the Pangolin endpoint:

```
app.betteredmonton.org      CNAME  pangolin.bnkserve.org
api.betteredmonton.org      CNAME  pangolin.bnkserve.org
db.betteredmonton.org       CNAME  pangolin.bnkserve.org
docs.betteredmonton.org     CNAME  pangolin.bnkserve.org
code.betteredmonton.org     CNAME  pangolin.bnkserve.org
git.betteredmonton.org      CNAME  pangolin.bnkserve.org
n8n.betteredmonton.org      CNAME  pangolin.bnkserve.org
listmonk.betteredmonton.org CNAME  pangolin.bnkserve.org
mail.betteredmonton.org     CNAME  pangolin.bnkserve.org
qr.betteredmonton.org       CNAME  pangolin.bnkserve.org
draw.betteredmonton.org     CNAME  pangolin.bnkserve.org
grafana.betteredmonton.org  CNAME  pangolin.bnkserve.org
home.betteredmonton.org     CNAME  pangolin.bnkserve.org
```
Phase 4: Backup Automation (30 minutes)
File: New cron job configuration
Steps:
1. Create Backup Directory

   ```bash
   mkdir -p /var/backups/changemaker-lite
   chmod 750 /var/backups/changemaker-lite
   ```

2. Test Manual Backup

   ```bash
   cd /home/bunker-admin/changemaker.lite
   ./scripts/backup.sh
   # Should create timestamped backup files in ./backups/
   ```

3. Configure S3 Upload (Optional)

   ```bash
   # Add to .env
   S3_BUCKET=changemaker-lite-backups
   AWS_ACCESS_KEY_ID=<your_access_key>
   AWS_SECRET_ACCESS_KEY=<your_secret_key>
   AWS_REGION=us-east-1   # Or your preferred region
   ```

4. Add Cron Job

   ```bash
   # Edit crontab
   crontab -e

   # Add the following lines:
   # Backup every 6 hours at minute 0
   0 */6 * * * cd /home/bunker-admin/changemaker.lite && ./scripts/backup.sh >> /var/log/changemaker-backup.log 2>&1
   # Clean up old backups (keep last 7 days)
   0 3 * * * find /home/bunker-admin/changemaker.lite/backups -type f -mtime +7 -delete
   ```

5. Set Up Backup Monitoring Alert

   ```yaml
   # Add to configs/prometheus/alerts.yml
   - alert: BackupJobFailed
     expr: time() - cm_backup_last_success_timestamp > 21600   # 6 hours
     for: 1h
     labels:
       severity: critical
     annotations:
       summary: "Backup job has not run successfully in over 6 hours"
       description: "Last successful backup was {{ $value | humanizeDuration }} ago"
   ```
Verification:
```bash
# Wait for cron execution (or run manually)
./scripts/backup.sh

# Check backup files exist
ls -lh backups/
# Should see 3 files: changemaker_v2-YYYYMMDD-HHMMSS.sql, listmonk-YYYYMMDD-HHMMSS.sql, uploads-YYYYMMDD-HHMMSS.tar.gz

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/
```
Phase 5: Docker Compose Updates (1 hour)
File: docker-compose.yml
Changes Required:
1. Fix Redis Exporter Authentication

   ```yaml
   redis-exporter:
     environment:
       - REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379
   ```

2. Add Container Resource Limits (Optional)

   ```yaml
   api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 2G
         reservations:
           cpus: '1'
           memory: 1G

   media-api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G   # Higher for video processing
         reservations:
           cpus: '1'
           memory: 2G

   v2-postgres:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G
         reservations:
           cpus: '1'
           memory: 2G
   ```

3. Add Volume Size Limits (Optional)

   ```yaml
   volumes:
     v2-postgres-data:
       driver_opts:
         type: none
         device: /var/lib/docker/volumes/v2-postgres-data
         o: bind,size=50G
   ```
Verification:
```bash
# Recreate containers with new config
docker compose down
docker compose up -d

# Verify Redis exporter connects with auth
docker compose logs redis-exporter | grep "successfully"

# Check resource limits are applied
docker stats --no-stream | grep changemaker
```
Phase 6: Monitoring & Alerting Setup (1-2 hours)
File: configs/prometheus/alerts.yml
Additional Alerts to Add:
```yaml
groups:
  - name: production_critical
    rules:
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 2592000   # 30 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections ({{ $value }})"
          description: "PostgreSQL has {{ $value }} active connections"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"
          description: "Redis is using {{ $value | humanizePercentage }} of allocated memory"
```
Gotify Configuration:
```bash
# Start Gotify container
docker compose --profile monitoring up -d gotify

# Access Gotify UI at http://localhost:8889
# Change admin password (default: admin/admin)
# Create application token for Alertmanager
```

Copy the token into `configs/alertmanager/alertmanager.yml`:

```yaml
receivers:
  - name: 'gotify'
    webhook_configs:
      - url: 'http://gotify-changemaker:80/message?token=<your_app_token>'
        send_resolved: true
```
Verification:
```bash
# Start monitoring stack
docker compose --profile monitoring up -d

# Check Prometheus targets are up
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Should return empty (all targets healthy)

# Test alert firing
docker compose stop api
# Wait 1 minute, check Alertmanager UI at http://localhost:9093
# Should see "APIDown" alert firing

# Verify Gotify receives notification
# Check Gotify UI, should see new message

# Restart API
docker compose start api
```
Phase 7: Final Production Verification (1-2 hours)
Production Deployment Checklist:
Security:
- All default passwords changed (Grafana, Gotify, N8N, NocoDB, Listmonk)
- JWT secrets (access + refresh) generated via `openssl rand -hex 32`
- ENCRYPTION_KEY generated and different from the JWT secrets
- `EMAIL_TEST_MODE=false` set in production .env
- `NODE_TLS_REJECT_UNAUTHORIZED=` (empty) for strict TLS validation
- Redis password set and authenticated
- PostgreSQL passwords strong (20+ characters)
- Nginx rate limiting enabled
- Embed proxy ports restricted to localhost
Networking:
- All 13 DNS CNAME records created
- Pangolin tunnel configured and connected
- HTTP → HTTPS redirect working
- All subdomains resolve via HTTPS
- SSL certificates valid (checked via browser)
- WebSocket connections work (test n8n, MkDocs, Code Server)
Email:
- Production SMTP configured (host, port, user, pass)
- Test email sent and received
- Listmonk SMTP configured separately
- Password reset email works
- Shift confirmation email works
Backup:
- Backup script tested manually
- S3 credentials configured (if using)
- Cron job added for automated backups (every 6 hours)
- Old backup cleanup cron added (7 day retention)
- Backup monitoring alert configured
Monitoring:
- Prometheus collecting metrics from all services
- Grafana dashboards showing data
- Alertmanager configured with Gotify
- SSL expiry alert configured (30 days warning)
- Disk space alert configured (10% threshold)
- Backup job alert configured (6 hour SLA)
- Test alert sent to Gotify
Application:
- Admin login works (JWT token issued)
- Admin dashboard loads all components
- API health check returns 200 OK
- Media upload works (test 100MB+ video)
- Geocoding works (test address lookup)
- Map loads locations correctly
- Campaign email sending works (test queue)
- Listmonk sync works (if enabled)
- Canvass map GPS tracking works (volunteer portal)
Performance:
- Nginx rate limiting prevents abuse (test with 1000 req/sec)
- Database connection pooling configured
- Redis cache hit ratio >80% (check Grafana)
- Page load times <2 seconds (test with network throttling)
- Video upload completes within timeout (10GB max)
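The Redis cache hit ratio item can be checked in Grafana with a PromQL query over the redis_exporter counters. This is a sketch assuming the standard redis_exporter metric names:

```promql
# Fraction of keyspace lookups served from cache over the last 5 minutes
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```

A sustained value above 0.8 satisfies the >80% target; a lower value suggests cache keys are expiring too aggressively or the working set exceeds Redis memory.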
Disaster Recovery:
- Full backup restored on staging environment
- Database migration verified (Prisma migrations applied)
- Environment variables match production
- All services start cleanly after restore
Verification Steps
Post-Deployment Tests
1. SSL/TLS Verification
```bash
# Check all subdomains have valid SSL (-v is needed to print certificate details)
for subdomain in app api db docs code git n8n listmonk mail qr draw grafana home; do
  echo "Testing $subdomain.betteredmonton.org"
  curl -vI https://$subdomain.betteredmonton.org 2>&1 | grep -E "(HTTP|subject:|issuer:)"
done

# Should see:
# - HTTP/2 200 (or 301 redirect)
# - Valid certificate issuer (Let's Encrypt or Pangolin)
# - No certificate errors
```
2. Authentication Flow
```bash
# Test login endpoint
curl -X POST https://api.betteredmonton.org/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<admin_password>"}' \
  | jq '.accessToken'
# Should return JWT token

# Test token refresh
curl -X POST https://api.betteredmonton.org/api/auth/refresh \
  -H "Content-Type: application/json" \
  -d '{"refreshToken":"<refresh_token>"}' \
  | jq '.accessToken'
# Should return new access token
```
3. Email Delivery
```bash
# Trigger password reset email
curl -X POST https://api.betteredmonton.org/api/auth/forgot-password \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com"}'

# Check external email inbox for reset link
# Verify email arrives within 2 minutes
```
4. Rate Limiting
```bash
# Test auth endpoint rate limit (10/min)
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code} " https://api.betteredmonton.org/api/auth/login \
    -H "Content-Type: application/json" \
    -d '{"email":"test","password":"test"}'
done
# Should see: 401 401 401 ... 429 429 429 (after 10 requests)
```
5. Database Connectivity
```bash
# Check API can connect to database
curl https://api.betteredmonton.org/api/health | jq '.database'
# Should return: "healthy"

# Check Redis connectivity
curl https://api.betteredmonton.org/api/health | jq '.redis'
# Should return: "healthy"
```
6. Media Upload
```bash
# Test video upload (requires auth token)
curl -X POST https://api.betteredmonton.org/media/videos/upload \
  -H "Authorization: Bearer <admin_jwt>" \
  -F "file=@test-video.mp4" \
  -F "title=Test Upload" \
  | jq '.id'
# Should return video ID
```
7. Monitoring Endpoints
```bash
# Prometheus targets
curl -s https://grafana.betteredmonton.org/api/datasources/proxy/1/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Should show all targets with health: "up"

# Grafana health
curl https://grafana.betteredmonton.org/api/health
# Should return: {"database":"ok","version":"..."}
```
8. Backup Verification
```bash
# Trigger manual backup
./scripts/backup.sh

# Check backup files created
ls -lh backups/ | tail -3
# Should see 3 new files with current timestamp

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/ | tail -3
```
Critical Files Reference
Configuration Files:
- `docker-compose.yml` - Service orchestration (25+ services)
- `.env` - Environment variables (100+ vars, not committed)
- `.env.example` - Template with all required variables
- `nginx/nginx.conf` - Global Nginx config + security headers
- `nginx/conf.d/api.conf` - API + Media API reverse proxy
- `nginx/conf.d/services.conf` - 12 service subdomains + embed proxies
- `configs/pangolin/resources.yml` - Tunnel resource definitions
- `configs/prometheus/prometheus.yml` - Metrics collection config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/*.json` - Pre-configured dashboards
- `configs/alertmanager/alertmanager.yml` - Alert routing
Database Schema:
- `api/prisma/schema.prisma` - Main database schema (30+ models)
- `api/prisma/migrations/` - Migration history
- `api/prisma/seed.ts` - Initial data seeding
Deployment Scripts:
- `scripts/backup.sh` - PostgreSQL + Listmonk + uploads backup
- `scripts/pangolin-setup.sh` - CLI wrapper for automated tunnel setup
Environment Validation:
- `api/src/config/env.ts` - Zod schema for all environment variables (100+ vars)
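`env.ts` validates the environment with Zod at startup; the shape of that validation can be sketched dependency-free. This is a hand-rolled stand-in, not the actual schema, and the three rules shown are illustrative:

```javascript
// Dependency-free stand-in for the Zod schema in api/src/config/env.ts.
// Each rule mirrors the kind of constraint the plan describes.
function validateEnv(env) {
  const errors = [];
  if (!/^[0-9a-f]{64}$/.test(env.ENCRYPTION_KEY || "")) {
    errors.push("ENCRYPTION_KEY must be 32 bytes of hex (openssl rand -hex 32)");
  }
  if (env.EMAIL_TEST_MODE !== "true" && env.EMAIL_TEST_MODE !== "false") {
    errors.push("EMAIL_TEST_MODE must be 'true' or 'false'");
  }
  const port = Number(env.SMTP_PORT);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push("SMTP_PORT must be a valid port number");
  }
  if (errors.length) throw new Error(errors.join("; "));
  // Return a typed config object, coercing strings where needed.
  return { ...env, SMTP_PORT: port };
}

const ok = validateEnv({
  ENCRYPTION_KEY: "a".repeat(64),
  EMAIL_TEST_MODE: "false",
  SMTP_PORT: "587",
});
console.log(ok.SMTP_PORT); // 587
```

Failing fast like this at startup is what makes the "application won't start if ENCRYPTION_KEY is unset" behavior possible, rather than discovering a bad value mid-request.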
Rollback Procedure
If deployment fails or critical issues arise:
1. Immediate Rollback (5 minutes)
```bash
# Stop all containers
docker compose down

# Restore previous .env file
cp .env.backup .env

# Restart with old configuration
docker compose up -d
```
2. Database Rollback (15 minutes)
```bash
# Stop API to prevent new writes
docker compose stop api media-api

# Restore from latest backup
docker compose exec v2-postgres psql -U changemaker -d postgres -c "DROP DATABASE changemaker_v2;"
docker compose exec v2-postgres psql -U changemaker -d postgres -c "CREATE DATABASE changemaker_v2;"
docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < backups/changemaker_v2-YYYYMMDD-HHMMSS.sql

# Restart services
docker compose start api media-api
```
3. Full System Restore (30 minutes)
```bash
# Stop all services
docker compose down -v   # WARNING: Removes all volumes

# Restore PostgreSQL data
tar -xzf backups/postgres-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore Redis data (if backed up)
tar -xzf backups/redis-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore uploads
tar -xzf backups/uploads-YYYYMMDD-HHMMSS.tar.gz -C ./media/

# Restart all services
docker compose up -d
```
4. Verify Rollback Success
```bash
# Check all services healthy
docker compose ps | grep -v "Up"   # Should only print the header row

# Test admin login
curl -X POST http://localhost:4000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<password>"}'

# Verify database has data
curl http://localhost:4000/api/health | jq '.database'
```
Post-Production Maintenance
Daily Tasks:
- Monitor Grafana dashboards for anomalies
- Check Gotify alerts for critical issues
- Verify backups completed successfully (check logs)
Weekly Tasks:
- Review API error logs for patterns
- Check disk space usage (alert should fire if <10%)
- Verify SSL certificate validity (30 days remaining)
- Test disaster recovery on staging environment
Monthly Tasks:
- Review access logs for suspicious activity
- Update Docker images to latest versions (after testing on staging)
- Audit user accounts and remove inactive users
- Review and rotate API keys if necessary
Quarterly Tasks:
- Conduct full security audit (penetration testing)
- Review and update rate limiting thresholds based on traffic
- Analyze backup storage costs and adjust retention policy
- Test full disaster recovery procedure with restore drill
Summary
This plan provides a comprehensive pathway from development to production for the Changemaker Lite V2 networking infrastructure. The architecture is fundamentally sound with:
Strengths:
- Single bridge network simplifies communication
- Pangolin tunnel handles SSL/TLS externally (zero Nginx cert management)
- Comprehensive security headers and policies
- Automated backup script exists
- Monitoring stack with Prometheus/Grafana ready
- Rate limiting on critical endpoints
Critical Path for Production:
- Phase 1: Security hardening (change passwords, configure SMTP) - MUST DO
- Phase 3: Pangolin tunnel setup - MUST DO
- Phase 4: Backup automation - SHOULD DO
- Phase 6: Monitoring alerts - SHOULD DO
- Phase 2: Nginx hardening - SHOULD DO (the embed-proxy restriction closes a Critical-severity gap)
The remaining phases (network segmentation, resource limits, log aggregation) can be deferred to post-launch improvements without blocking production deployment.
Estimated Total Implementation Time: 6-10 hours (can be split across multiple days)
Estimated Downtime During Deployment: <5 minutes (only during final container restart)