
# Production Networking Preparation Plan

## Context

The Changemaker Lite V2 application needs to be prepared for production deployment. The current architecture is development-focused with Docker Compose orchestration, Nginx reverse proxy, and Pangolin tunnel integration for SSL/TLS termination. The user wants a comprehensive understanding of the networking setup and identification of production readiness gaps before going live.

**Why this is needed:**
- Current setup is optimized for local development (HTTP-only, MailHog, default passwords)
- Production deployment requires SSL/TLS via Pangolin tunnel, real SMTP, security hardening
- Need to identify all gaps between dev and production configurations
- Need actionable checklist for production cutover

**What prompted this:**
- User preparing to deploy production instance on betteredmonton.org domain
- Need to understand networking architecture, security posture, and deployment requirements
- Ensure every subdomain in the routing matrix (13, plus the root domain) routes correctly through the Pangolin tunnel

**Intended outcome:**
- Comprehensive documentation of current networking architecture
- Identified production readiness gaps with severity ratings
- Prioritized checklist for production deployment
- Configuration changes needed for production hardening

---

## Current State Assessment

### Network Architecture

**Single-Bridge Network Design:**
- All 25+ services on one Docker bridge network (`changemaker-lite`)
- Services communicate via container hostnames (DNS: `127.0.0.11`)
- Nginx acts as single reverse proxy for all external traffic
- Pangolin terminates SSL/TLS externally; the Newt container maintains the encrypted tunnel to it

**Service Topology:**
```
Internet → Pangolin Tunnel (HTTPS) → Newt Container → Nginx (HTTP:80) → Backend Services
```

**Critical Services:**
- **Express API** (port 4000) - Main V2 API with Prisma ORM
- **Fastify Media API** (port 4100) - Video library management
- **Admin GUI** (port 3000) - React admin interface
- **PostgreSQL V2** (port 5433 localhost-only) - Primary database
- **Redis** (port 6379) - Cache, rate limiting, BullMQ backend
- **Nginx** (ports 80/443) - Reverse proxy with 12 subdomain routes

### Subdomain Routing Matrix

| Subdomain | Backend | Container Port | Purpose | Security Headers |
|-----------|---------|----------------|---------|------------------|
| `app.betteredmonton.org` | Admin GUI | 3000 | Admin interface | SAMEORIGIN |
| `api.betteredmonton.org` | Express + Media API | 4000/4100 | Main API + Media routes | SAMEORIGIN |
| `betteredmonton.org` (root) | MkDocs Site | 80 | Public documentation | Default |
| `db.betteredmonton.org` | NocoDB | 8080 | Data browser | CSP iframe |
| `docs.betteredmonton.org` | MkDocs Dev | 8000 | Live preview | CSP iframe + WS |
| `code.betteredmonton.org` | Code Server | 8080 | Web IDE | CSP iframe + WS |
| `git.betteredmonton.org` | Gitea | 3000 | Git hosting | CSP iframe |
| `n8n.betteredmonton.org` | n8n | 5678 | Workflow automation | CSP iframe + WS |
| `listmonk.betteredmonton.org` | Listmonk | 9000 | Newsletter platform | SAMEORIGIN |
| `mail.betteredmonton.org` | MailHog | 8025 | Email capture (dev) | CSP iframe + WS |
| `qr.betteredmonton.org` | Mini QR | 8080 | QR code generator | CSP iframe |
| `draw.betteredmonton.org` | Excalidraw | 80 | Collaborative whiteboard | CSP iframe + WS |
| `grafana.betteredmonton.org` | Grafana | 3000 | Monitoring dashboard | SAMEORIGIN |
| `home.betteredmonton.org` | Homepage | 3000 | Service dashboard | SAMEORIGIN |

**Embed Proxy Ports** (bypass security headers for iframe embedding):
- Ports 8881-8886 → Strip `X-Frame-Options` and `Content-Security-Policy` headers
- Used by Admin GUI to embed third-party services (NocoDB, n8n, Gitea, MailHog, Mini QR, Excalidraw)

### SSL/TLS & Tunnel Configuration

**Current Setup:**
- **Nginx**: HTTP-only (port 80), no SSL/TLS configuration
- **Pangolin Tunnel**: Handles all HTTPS termination externally
- **Newt Container**: Establishes encrypted tunnel to Pangolin server
- **Certificate Management**: Delegated entirely to Pangolin (zero config in Nginx)

**Pangolin Environment Variables:**
```bash
PANGOLIN_API_URL=https://api.bnkserve.org/v1     # Self-hosted Pangolin instance
PANGOLIN_API_KEY=                                 # Bearer token authentication
PANGOLIN_ORG_ID=                                  # Organization identifier
PANGOLIN_SITE_ID=                                 # Created during initial setup
PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org  # Tunnel entry point
PANGOLIN_NEWT_ID=                                 # Generated tunnel identity
PANGOLIN_NEWT_SECRET=                             # Tunnel authentication secret
```

**Automated Setup** (Feb 2026):
- One-command deployment via `/api/pangolin/setup-automated` endpoint
- Central resource config: `configs/pangolin/resources.yml` (12 services)
- Atomic .env updates + Newt container restart + tunnel verification
- Reduces setup time from 15min → 2min (87% reduction)

### Security Posture

**Strengths:**
- ✅ JWT access/refresh token rotation (atomic transactions)
- ✅ Password policy enforced at schema level (12+ chars, complexity requirements)
- ✅ Rate limiting on auth endpoints (10/min per IP)
- ✅ Redis authentication required (`requirepass` enforced)
- ✅ User enumeration prevention (401 for all auth failures)
- ✅ Database secrets encrypted with `ENCRYPTION_KEY`
- ✅ HSTS header with 1-year max-age + includeSubDomains
- ✅ CSP headers for iframe protection on sensitive services
- ✅ PostgreSQL bound to localhost only (not exposed to network)
- ✅ Security audit completed Feb 2026 (13 findings addressed)

**Critical Gaps:**
- ❌ No HTTP → HTTPS redirect in Nginx (relies on Pangolin)
- ❌ Embed proxy ports (8881-8886) strip `X-Frame-Options` and `Content-Security-Policy` (clickjacking and XSS risk)
- ❌ No nginx-level rate limiting (only application-level)
- ❌ Grafana admin password defaults to "admin"
- ❌ Gotify admin password defaults to "admin"
- ❌ N8N default credentials in .env.example
- ❌ EMAIL_TEST_MODE=true by default (routes to MailHog in production)
- ❌ NODE_TLS_REJECT_UNAUTHORIZED not explicitly audited (Node.js validates certificates unless this is set to `0`, which would accept self-signed certs)

### Database & Caching

**PostgreSQL V2** (`changemaker-v2-postgres`):
- Port binding: `127.0.0.1:5433:5432` (localhost-only, production-safe)
- Connection: `postgresql://changemaker:${V2_POSTGRES_PASSWORD}@changemaker-v2-postgres:5432/changemaker_v2`
- Used by: Express API (Prisma), Media API (Prisma), NocoDB (separate `nocodb_meta` DB)
- Healthcheck: `pg_isready` with 10s interval

**Listmonk PostgreSQL** (`listmonk-db`):
- Port binding: `127.0.0.1:5432:5432` (localhost-only)
- Isolated database lifecycle (separate from V2)
- Two-user architecture: Web admin + API user (plaintext tokens)

**Redis** (`redis-changemaker`):
- Port binding: `6379:6379` (published on all host interfaces, unlike the localhost-only PostgreSQL bindings; protected only by `requirepass`)
- Authentication: `requirepass ${REDIS_PASSWORD}` enforced
- Connection: `redis://:${REDIS_PASSWORD}@redis-changemaker:6379`
- Used for: Cache, BullMQ queues, rate limiting, geocoding cache
- **SECURITY NOTE**: redis-exporter uses unauthenticated connection string (potential risk)

### Email Configuration

**Development (Current):**
- `EMAIL_TEST_MODE=true` → All emails route to MailHog (localhost:1025)
- MailHog Web UI: `http://mail.betteredmonton.org` (dev only)
- No external SMTP configured

**Production Requirements:**
- `EMAIL_TEST_MODE=false` → Route to real SMTP server
- SMTP credentials: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`
- Encrypt SMTP password with `ENCRYPTION_KEY` (stored in DB)
- Configure Listmonk SMTP separately (newsletter sending)

**Email Systems:**
1. **Campaign Emails** (BullMQ queue) → Main SMTP
2. **System Emails** (password reset, shift confirmations) → Main SMTP
3. **Newsletter Emails** (Listmonk) → Listmonk SMTP (can be same or separate)

### Monitoring & Observability

**Prometheus Metrics:**
- 12 custom `cm_*` metrics (API uptime, queue size, sessions, etc.)
- HTTP request metrics (duration, status codes, paths)
- Redis, PostgreSQL, container metrics via exporters
- Scrape interval: 15s

**Grafana Dashboards:**
- 3 pre-configured dashboards (API metrics, system metrics, canvass activity)
- Data source: Prometheus
- Default admin: `admin/admin` (must change for production)

**Alertmanager:**
- Alert routing configured
- Requires Gotify setup for notifications (default: admin/admin)

**Services Behind `--profile monitoring`:**
- Prometheus (9090)
- Grafana (3001)
- Alertmanager (9093)
- cAdvisor (8080)
- Node Exporter (9100)
- Redis Exporter (9121)
- Gotify (8889)

### Backup & Disaster Recovery

**Current Backup Script** (`scripts/backup.sh`):
- PostgreSQL V2 dump (pg_dump)
- Listmonk database dump
- Uploads directory archive (tar.gz)
- Optional S3 upload (requires `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)

**Critical Gaps:**
- ❌ No automated backup scheduling (cron not configured)
- ❌ No backup retention policy
- ❌ No disaster recovery playbook
- ❌ No restore procedure documentation
- ❌ No backup monitoring/alerting

---

## Production Readiness Gaps

### Critical Severity (Must Fix Before Production)

1. **Default Admin Passwords**
   - **Services:** Grafana, Gotify, N8N, NocoDB, Listmonk
   - **Impact:** Unauthorized access to admin dashboards, data exfiltration
   - **Fix:** Change all default passwords in `.env` before deployment
   - **Verification:** Attempt login with default credentials (should fail)

2. **Email Test Mode Enabled**
   - **Issue:** `EMAIL_TEST_MODE=true` routes all production emails to MailHog
   - **Impact:** Users never receive password reset, shift confirmation, campaign emails
   - **Fix:** Set `EMAIL_TEST_MODE=false` + configure real SMTP credentials
   - **Verification:** Send test email, verify receipt in external inbox

3. **Missing ENCRYPTION_KEY**
   - **Issue:** Required for encrypting DB secrets (SMTP passwords, API tokens)
   - **Impact:** Application won't start in production if unset
   - **Fix:** Generate via `openssl rand -hex 32`, add to `.env`
   - **Verification:** Restart API, check logs for encryption errors

4. **Embed Proxy Clickjacking/XSS Risk**
   - **Issue:** Ports 8881-8886 strip `X-Frame-Options` and `Content-Security-Policy` from proxied responses
   - **Impact:** Any external site can frame these services (clickjacking), and removing CSP also removes its XSS mitigations
   - **Fix:** Restrict embed proxy ports to localhost-only OR implement IP whitelist
   - **Verification:** Attempt to access embed proxy from external IP (should fail)
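
Several of the critical items above can be caught with a quick preflight check before deployment. The following is a minimal sketch, not part of the repository: the function name `check_env_file` is hypothetical, the variable names are the ones this plan uses, and it assumes simple unquoted `KEY=value` lines.

```bash
# Hypothetical preflight check: prints one line per problem found in an env
# file and returns non-zero if any problem exists.
check_env_file() (
  file="$1"
  problems=0

  # EMAIL_TEST_MODE=true would route production mail to MailHog.
  if grep -qE '^EMAIL_TEST_MODE=true$' "$file"; then
    echo "EMAIL_TEST_MODE is still true"
    problems=$((problems + 1))
  fi

  # ENCRYPTION_KEY must be present and non-empty.
  if ! grep -qE '^ENCRYPTION_KEY=.+' "$file"; then
    echo "ENCRYPTION_KEY is missing or empty"
    problems=$((problems + 1))
  fi

  # Flag any *PASSWORD variable left at a well-known default value.
  for line in $(grep -E '^[A-Z0-9_]*PASSWORD=(admin|password|changeme)$' "$file"); do
    echo "default password: ${line%%=*}"
    problems=$((problems + 1))
  done

  [ "$problems" -eq 0 ]
)

# Example: check_env_file .env || echo "fix the items above before deploying"
```

The list of default values is an assumption; extend it to match whatever placeholders `.env.example` actually ships.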

### High Severity (Fix Before Launch)

5. **No HTTP → HTTPS Redirect**
   - **Issue:** Users can access `http://betteredmonton.org` without forced redirect
   - **Impact:** Mixed content warnings, insecure authentication cookies
   - **Fix:** Add nginx redirect block for all subdomains
   - **Verification:** `curl -I http://app.betteredmonton.org` should return 301 redirect

6. **No Automated Backups**
   - **Issue:** Manual backup script requires cron scheduling
   - **Impact:** Data loss if server fails before manual backup
   - **Fix:** Add cron job: `0 */6 * * * /path/to/backup.sh` (every 6 hours)
   - **Verification:** Check `/var/log/changemaker-backup.log` (the cron job's log target in Phase 4) for backup execution logs

7. **Redis Exporter Unauthenticated**
   - **Issue:** `REDIS_ADDR=redis:6379` (no password)
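   - **Impact:** Redis enforces `requirepass`, so an exporter without credentials likely cannot scrape; Redis metrics and alerts would be blind
   - **Fix:** Change to `REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379` (hostname as used in Phase 5)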
   - **Verification:** Check redis-exporter logs, ensure no auth errors

8. **No Disaster Recovery Documentation**
   - **Issue:** Restore procedure not documented
   - **Impact:** Extended downtime during recovery, data corruption risk
   - **Fix:** Document step-by-step restore process (DB import, volume restore, env config)
   - **Verification:** Perform disaster recovery drill on staging environment

### Medium Severity (Address Within 30 Days)

9. **Single Bridge Network**
   - **Issue:** All services on same network; lateral movement easy if one compromised
   - **Impact:** If one service is exploited, attacker can reach databases/Redis
   - **Fix:** Split into separate networks (app-net, data-net, services-net)
   - **Verification:** Verify service isolation via `docker network inspect`

10. **No Nginx Rate Limiting**
    - **Issue:** Rate limiting only at application level (Express middleware)
    - **Impact:** DDoS attacks can saturate Nginx/network before reaching API rate limiter
    - **Fix:** Add nginx `limit_req` zones for `/api/*` paths
    - **Verification:** Send 1000 req/sec, verify 429 responses from Nginx

11. **No Log Aggregation**
    - **Issue:** Logs scattered across Docker containers
    - **Impact:** Difficult to debug multi-service issues, no centralized audit trail
    - **Fix:** Implement ELK stack or similar (Elasticsearch, Logstash, Kibana)
    - **Verification:** Search logs from all services in one UI

12. **No TLS Certificate Monitoring**
    - **Issue:** Pangolin manages certs, but no alerting on renewal failures
    - **Impact:** Site goes offline when cert expires
    - **Fix:** Add Prometheus alert for cert expiry (30 days before)
    - **Verification:** Simulate expired cert, verify alert fires

### Low Severity (Nice to Have)

13. **No Service Mesh**
    - **Issue:** No observability of inter-service communication
    - **Impact:** Difficult to debug network issues between containers
    - **Fix:** Implement Linkerd or Istio for traffic management (both are Kubernetes-oriented, so this implies moving off plain Docker Compose)
    - **Verification:** View service-to-service latency in Grafana

14. **No Container Resource Limits**
    - **Issue:** Docker Compose doesn't set CPU/memory limits
    - **Impact:** One service can starve others of resources
    - **Fix:** Add `deploy.resources.limits` to docker-compose.yml
    - **Verification:** Monitor resource usage under load

15. **No Listmonk HTTPS**
    - **Issue:** API-to-Listmonk communication uses HTTP (inside Docker network)
    - **Impact:** If network is compromised, credentials visible in plaintext
    - **Fix:** Configure Listmonk with internal TLS certificate
    - **Verification:** Inspect network traffic, verify encryption

---

## Implementation Plan

### Phase 1: Pre-Deployment Security Hardening (2-3 hours)

**File:** `.env` (production environment variables)

**Changes Required:**

1. **Generate Secrets**
   ```bash
   # Run on production server
   openssl rand -hex 32  # JWT_ACCESS_SECRET
   openssl rand -hex 32  # JWT_REFRESH_SECRET
   openssl rand -hex 32  # ENCRYPTION_KEY (must differ from JWT secrets)
   openssl rand -hex 16  # LISTMONK_API_TOKEN
   ```

2. **Update Environment Variables**
   - `EMAIL_TEST_MODE=false`
   - `NODE_TLS_REJECT_UNAUTHORIZED=` (leave empty or unset; only `0` disables certificate validation)
   - `GRAFANA_ADMIN_PASSWORD=<strong_password>`
   - `GOTIFY_ADMIN_PASSWORD=<strong_password>`
   - `N8N_USER_PASSWORD=<strong_password>`
   - `NC_ADMIN_PASSWORD=<strong_password>`
   - `LISTMONK_WEB_ADMIN_PASSWORD=<strong_password>`
   - `V2_POSTGRES_PASSWORD=<strong_password>`
   - `REDIS_PASSWORD=<strong_password>`
   - `LISTMONK_DB_PASSWORD=<strong_password>`
   - `GITEA_DB_PASSWD=<strong_password>`
   - `GITEA_DB_ROOT_PASSWORD=<strong_password>`
   - `N8N_ENCRYPTION_KEY=<strong_password>`

3. **Configure Production SMTP**
   - `SMTP_HOST=<smtp.provider.com>`
   - `SMTP_PORT=<465 or 587>`
   - `SMTP_USER=<username>`
   - `SMTP_PASS=<password>` (will be encrypted by API on first startup)
   - `SMTP_SECURE=true` (for port 465) or `false` (for STARTTLS on 587)

4. **Listmonk SMTP Configuration**
   - `LISTMONK_SMTP_HOST=<smtp.provider.com>`
   - `LISTMONK_SMTP_PORT=<465 or 587>`
   - `LISTMONK_SMTP_TLS_TYPE=STARTTLS` (for 587) or `TLS` (for 465)
   - `LISTMONK_SMTP_AUTH_PROTOCOL=login`
   - `LISTMONK_SMTP_USERNAME=<username>`
   - `LISTMONK_SMTP_PASSWORD=<password>`
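
The secret-generation step can be scripted so that the three keys are guaranteed to differ. This is a sketch under stated assumptions: `gen_secret` and `all_distinct` are hypothetical helper names, and the `/dev/urandom` fallback is only there for hosts without `openssl`. The values still have to be copied into `.env` by hand.

```bash
# Generate a 32-byte secret as 64 hex characters.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback: read 32 random bytes and print them as hex.
    od -An -tx1 -N32 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Succeed only if all arguments are pairwise distinct.
all_distinct() {
  [ "$(printf '%s\n' "$@" | sort -u | grep -c .)" -eq "$#" ]
}

JWT_ACCESS_SECRET=$(gen_secret)
JWT_REFRESH_SECRET=$(gen_secret)
ENCRYPTION_KEY=$(gen_secret)

if all_distinct "$JWT_ACCESS_SECRET" "$JWT_REFRESH_SECRET" "$ENCRYPTION_KEY"; then
  echo "three distinct secrets generated"
fi
```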

**Verification:**
```bash
# Check all required env vars are set
grep "CHANGE_THIS" .env  # Should return nothing
grep -nE '=(admin|changeme|password)$' .env  # Should return nothing (no default passwords left as values)

# Test SMTP connection
docker compose exec api node -e "
  const nodemailer = require('nodemailer');
  const transport = nodemailer.createTransport({
    host: process.env.SMTP_HOST,
    port: parseInt(process.env.SMTP_PORT),
    secure: process.env.SMTP_SECURE === 'true',
    auth: {
      user: process.env.SMTP_USER,
      pass: process.env.SMTP_PASS
    }
  });
  transport.verify().then(console.log).catch(console.error);
"
```

---

### Phase 2: Nginx Production Hardening (1 hour)

**File:** `nginx/conf.d/default.conf` (or new `production.conf`)

**Changes Required:**

1. **Add HTTP → HTTPS Redirect**
   ```nginx
   server {
       listen 80;
       server_name *.betteredmonton.org betteredmonton.org;

       # Health check endpoints (allow HTTP)
       location /health {
           proxy_pass http://changemaker-v2-api:4000;
       }

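       # Redirect plain-HTTP client requests to HTTPS.
       # Pangolin terminates TLS and forwards HTTP to port 80, so an
       # unconditional redirect would loop; this assumes the tunnel sets
       # X-Forwarded-Proto. Server blocks with exact server_name matches take
       # precedence over this wildcard block and need the same check.
       location / {
           if ($http_x_forwarded_proto = "http") {
               return 301 https://$host$request_uri;
           }
       }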
   }
   ```

2. **Add Nginx Rate Limiting**
   ```nginx
   # Add to http block in nginx.conf
   limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
   limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;

   # Add to api.conf location blocks
   location /api/auth/ {
       limit_req zone=auth_limit burst=20 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }

   location /api/ {
       limit_req zone=api_limit burst=200 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }
   ```

3. **Restrict Embed Proxy Ports to Localhost**
   ```nginx
   # Add to each embed proxy server block
   server {
       listen 8881;
       server_name localhost;

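       # Reject non-localhost connections.
       # Note: with Nginx in a container, requests from the Docker host arrive
       # from the bridge gateway IP, not 127.0.0.1; alternatively bind the
       # published port to 127.0.0.1 in docker-compose.yml.
       allow 127.0.0.1;
       deny all;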

       location / {
           proxy_pass http://changemaker-v2-nocodb:8080;
           proxy_hide_header X-Frame-Options;
           proxy_hide_header Content-Security-Policy;
       }
   }
   ```

4. **Add Custom Error Pages**
   ```nginx
   # Add to each server block (`location` is not valid directly in the http block)
   error_page 502 503 504 /5xx.html;
   location = /5xx.html {
       root /usr/share/nginx/html;
       internal;
   }

   error_page 429 /429.html;
   location = /429.html {
       root /usr/share/nginx/html;
       internal;
   }
   ```

**Verification:**
```bash
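# Test HTTP redirect
curl -sI http://app.betteredmonton.org | grep "301"  # Should see 301 Moved Permanently

# Test rate limiting on the auth zone (10r/m, burst=20): send 30 rapid requests
for i in {1..30}; do curl -s -o /dev/null -w "%{http_code}\n" -X POST https://api.betteredmonton.org/api/auth/login; done
# Should see non-429 responses first, then 429s once the burst of 20 is used up
# (the application-level limiter may return 429 sooner).
# Sequential requests cannot exceed the api_limit zone (100r/s, burst=200);
# use a concurrent load tool to exercise it.

# Test embed proxy localhost restriction
curl -I http://<server_ip>:8881  # Should return 403 Forbidden
curl -I http://localhost:8881  # Should return 200 OK when run from inside the Nginx container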
```

---

### Phase 3: Pangolin Tunnel Configuration (30 minutes)

**File:** `.env` (Pangolin environment variables)

**Prerequisites:**
- Pangolin organization created at `https://api.bnkserve.org`
- API key obtained from organization settings
- DNS records created (see below)

**Steps:**

1. **Configure Pangolin Environment Variables**
   ```bash
   PANGOLIN_API_URL=https://api.bnkserve.org/v1
   PANGOLIN_API_KEY=<your_api_key>
   PANGOLIN_ORG_ID=<your_org_id>
   PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org
   ```

2. **Run Automated Setup**
   ```bash
   # Option 1: Via API endpoint
   curl -X POST http://localhost:4000/api/pangolin/setup-automated \
     -H "Authorization: Bearer <admin_jwt_token>" \
     -H "Content-Type: application/json" \
     -d '{
       "siteName": "Changemaker Lite Production",
       "domain": "betteredmonton.org"
     }'

   # Option 2: Via CLI wrapper
   ./scripts/pangolin-setup.sh
   ```

3. **Verify Tunnel Connectivity**
   ```bash
   # Check Newt container logs
   docker compose logs -f newt
   # Should see "Connected to Pangolin server" and "Tunnel established"

   # Test external access
   curl -I https://app.betteredmonton.org
   # Should return 200 OK with HTTPS
   ```

**DNS Configuration Required:**

Create 13 CNAME records pointing to the Pangolin endpoint:
```
app.betteredmonton.org      CNAME   pangolin.bnkserve.org
api.betteredmonton.org      CNAME   pangolin.bnkserve.org
db.betteredmonton.org       CNAME   pangolin.bnkserve.org
docs.betteredmonton.org     CNAME   pangolin.bnkserve.org
code.betteredmonton.org     CNAME   pangolin.bnkserve.org
git.betteredmonton.org      CNAME   pangolin.bnkserve.org
n8n.betteredmonton.org      CNAME   pangolin.bnkserve.org
listmonk.betteredmonton.org CNAME   pangolin.bnkserve.org
mail.betteredmonton.org     CNAME   pangolin.bnkserve.org
qr.betteredmonton.org       CNAME   pangolin.bnkserve.org
draw.betteredmonton.org     CNAME   pangolin.bnkserve.org
grafana.betteredmonton.org  CNAME   pangolin.bnkserve.org
home.betteredmonton.org     CNAME   pangolin.bnkserve.org
```
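
The record list above can be generated, and later checked, from a single subdomain list so the two never drift apart. This is a minimal sketch: the function names are hypothetical, and the optional check assumes `dig` is installed on the machine running it.

```bash
# Subdomains and target taken from the record list above.
SUBDOMAINS="app api db docs code git n8n listmonk mail qr draw grafana home"
DOMAIN="betteredmonton.org"
TARGET="pangolin.bnkserve.org"

# Print one zone-file style CNAME line per subdomain.
print_cname_records() {
  for sub in $SUBDOMAINS; do
    printf '%s.%s\tCNAME\t%s\n' "$sub" "$DOMAIN" "$TARGET"
  done
}

# Optional check: report subdomains whose CNAME does not point at the target.
# dig prints the target with a trailing dot.
check_cname_records() {
  for sub in $SUBDOMAINS; do
    answer=$(dig +short CNAME "$sub.$DOMAIN")
    [ "$answer" = "$TARGET." ] || echo "MISMATCH: $sub.$DOMAIN -> ${answer:-none}"
  done
}

print_cname_records
```

Run `check_cname_records` after the records have had time to propagate; it prints nothing when every record matches.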

---

### Phase 4: Backup Automation (30 minutes)

**File:** New cron job configuration

**Steps:**

1. **Create Backup Directory**
   ```bash
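   # Create the directory that backup.sh and the cleanup cron job below use
   mkdir -p /home/bunker-admin/changemaker.lite/backups
   chmod 750 /home/bunker-admin/changemaker.lite/backups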
   ```

2. **Test Manual Backup**
   ```bash
   cd /home/bunker-admin/changemaker.lite
   ./scripts/backup.sh
   # Should create timestamped backup files in ./backups/
   ```

3. **Configure S3 Upload (Optional)**
   ```bash
   # Add to .env
   S3_BUCKET=changemaker-lite-backups
   AWS_ACCESS_KEY_ID=<your_access_key>
   AWS_SECRET_ACCESS_KEY=<your_secret_key>
   AWS_REGION=us-east-1  # Or your preferred region
   ```

4. **Add Cron Job**
   ```bash
   # Edit crontab
   crontab -e

   # Add the following lines:
   # Backup every 6 hours at minute 0
   0 */6 * * * cd /home/bunker-admin/changemaker.lite && ./scripts/backup.sh >> /var/log/changemaker-backup.log 2>&1

   # Clean up old backups (keep last 7 days)
   0 3 * * * find /home/bunker-admin/changemaker.lite/backups -type f -mtime +7 -delete
   ```

5. **Setup Backup Monitoring Alert**
   ```yaml
   # Add to configs/prometheus/alerts.yml
   - alert: BackupJobFailed
     expr: time() - cm_backup_last_success_timestamp > 21600  # 6 hours
     for: 1h
     labels:
       severity: critical
     annotations:
       summary: "Backup job has not run successfully in over 6 hours"
       description: "Last successful backup was {{ $value | humanizeDuration }} ago"
   ```
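
The alert above depends on a `cm_backup_last_success_timestamp` metric, and this plan does not say what produces it. One possible approach, sketched here as an assumption, is the node_exporter textfile collector: the helper name `write_backup_metric` and the default directory are hypothetical, and node_exporter would need to run with `--collector.textfile.directory` pointing at that directory (mounted into its container).

```bash
# Hypothetical default; must match node_exporter's textfile directory.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}"

# Record the current time as the last successful backup.
write_backup_metric() {
  dir="${1:-$TEXTFILE_DIR}"
  tmp="$dir/cm_backup.prom.$$"
  {
    echo "# HELP cm_backup_last_success_timestamp Unix time of the last successful backup."
    echo "# TYPE cm_backup_last_success_timestamp gauge"
    echo "cm_backup_last_success_timestamp $(date +%s)"
  } > "$tmp"
  # Rename in place so the collector never reads a half-written file.
  mv "$tmp" "$dir/cm_backup.prom"
}

# Example cron usage: update the metric only when the backup succeeds.
# ./scripts/backup.sh && write_backup_metric
```

Chaining with `&&` means a failed backup leaves the old timestamp in place, which is what lets the alert fire.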

**Verification:**
```bash
# Wait for cron execution (or run manually)
./scripts/backup.sh

# Check backup files exist
ls -lh backups/
# Should see 3 files: changemaker_v2-YYYYMMDD-HHMMSS.sql, listmonk-YYYYMMDD-HHMMSS.sql, uploads-YYYYMMDD-HHMMSS.tar.gz

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/
```
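
The file check above can be automated. The sketch below assumes the three file name patterns listed in the verification comment; `verify_backup_set` is a hypothetical helper, not something `scripts/backup.sh` provides.

```bash
# Succeed only if the newest file of each expected kind exists in the backup
# directory and is non-empty. Prints one line per missing or empty kind.
verify_backup_set() (
  dir="$1"
  status=0
  for pattern in 'changemaker_v2-*.sql' 'listmonk-*.sql' 'uploads-*.tar.gz'; do
    # The pattern is left unquoted on purpose so the shell expands the glob.
    newest=$(ls -t "$dir"/$pattern 2>/dev/null | head -n 1)
    if [ -z "$newest" ] || [ ! -s "$newest" ]; then
      echo "missing or empty: $pattern"
      status=1
    fi
  done
  exit "$status"
)

# Example: verify_backup_set backups || echo "backup set incomplete"
```

A non-empty file is only a weak signal; a restore drill on staging remains the real test.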

---

### Phase 5: Docker Compose Updates (1 hour)

**File:** `docker-compose.yml`

**Changes Required:**

1. **Fix Redis Exporter Authentication**
   ```yaml
   redis-exporter:
     environment:
       - REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379
   ```

2. **Add Container Resource Limits (Optional)**
   ```yaml
   api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 2G
         reservations:
           cpus: '1'
           memory: 1G

   media-api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G  # Higher for video processing
         reservations:
           cpus: '1'
           memory: 2G

   v2-postgres:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G
         reservations:
           cpus: '1'
           memory: 2G
   ```

3. **Add Volume Size Limits (Optional)**
   ```yaml
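   volumes:
     v2-postgres-data:
       driver_opts:
         type: none
         # A bind mount cannot enforce `size=`; to cap usage, point `device` at a
         # dedicated fixed-size filesystem (for example a 50G partition or LVM
         # volume) mounted on the host. The path below is a placeholder.
         device: /mnt/v2-postgres-data
         o: bind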
   ```

**Verification:**
```bash
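# Recreate containers with new config (redis-exporter is behind the monitoring profile)
docker compose --profile monitoring down
docker compose --profile monitoring up -d

# Verify Redis exporter connects with auth (assumes port 9121 is published to the host)
curl -s http://localhost:9121/metrics | grep '^redis_up'
# Should print: redis_up 1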

# Check resource limits are applied
docker stats --no-stream | grep changemaker
```

---

### Phase 6: Monitoring & Alerting Setup (1-2 hours)

**File:** `configs/prometheus/alerts.yml`

**Additional Alerts to Add:**

```yaml
groups:
  - name: production_critical
    rules:
      - alert: SSLCertExpiringSoon
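        # probe_ssl_earliest_cert_expiry is exported by blackbox_exporter, which is
        # not in the monitoring profile listed earlier and must be added first
        expr: probe_ssl_earliest_cert_expiry - time() < 2592000  # 30 days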
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: DatabaseConnectionsHigh
        expr: sum(pg_stat_activity_count) > 80  # summed across databases and states
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections ({{ $value }})"
          description: "PostgreSQL has {{ $value }} active connections"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0  # skip when maxmemory is unset (0)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"
          description: "Redis is using {{ $value | humanizePercentage }} of allocated memory"
```

**Gotify Configuration:**

```bash
# Start Gotify container
docker compose --profile monitoring up -d gotify

# Access Gotify UI at http://localhost:8889
# Change admin password (default: admin/admin)

# Create application token for Alertmanager
# Copy token to configs/alertmanager/alertmanager.yml:
```

```yaml
receivers:
  - name: 'gotify'
    webhook_configs:
      - url: 'http://gotify-changemaker:80/message?token=<your_app_token>'
        send_resolved: true
```

**Verification:**
```bash
# Start monitoring stack
docker compose --profile monitoring up -d

# Check Prometheus targets are up
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Should return empty (all targets healthy)

# Test alert firing
docker compose stop api
# Wait 1 minute, check Alertmanager UI at http://localhost:9093
# Should see "APIDown" alert firing

# Verify Gotify receives notification
# Check Gotify UI, should see new message

# Restart API
docker compose start api
```

---

### Phase 7: Final Production Verification (1-2 hours)

**Production Deployment Checklist:**

**Security:**
- [ ] All default passwords changed (Grafana, Gotify, N8N, NocoDB, Listmonk)
- [ ] JWT access and refresh secrets generated via `openssl rand -hex 32` (two distinct values)
- [ ] ENCRYPTION_KEY generated and different from JWT secrets
- [ ] `EMAIL_TEST_MODE=false` set in production .env
- [ ] `NODE_TLS_REJECT_UNAUTHORIZED=` (empty) for strict TLS validation
- [ ] Redis password set and authenticated
- [ ] PostgreSQL passwords strong (20+ characters)
- [ ] Nginx rate limiting enabled
- [ ] Embed proxy ports restricted to localhost

**Networking:**
- [ ] All 13 DNS CNAME records created
- [ ] Pangolin tunnel configured and connected
- [ ] HTTP → HTTPS redirect working
- [ ] All subdomains resolve via HTTPS
- [ ] SSL certificates valid (checked via browser)
- [ ] WebSocket connections work (test n8n, MkDocs, Code Server)

**Email:**
- [ ] Production SMTP configured (host, port, user, pass)
- [ ] Test email sent and received
- [ ] Listmonk SMTP configured separately
- [ ] Password reset email works
- [ ] Shift confirmation email works

**Backup:**
- [ ] Backup script tested manually
- [ ] S3 credentials configured (if using)
- [ ] Cron job added for automated backups (every 6 hours)
- [ ] Old backup cleanup cron added (7 day retention)
- [ ] Backup monitoring alert configured

**Monitoring:**
- [ ] Prometheus collecting metrics from all services
- [ ] Grafana dashboards showing data
- [ ] Alertmanager configured with Gotify
- [ ] SSL expiry alert configured (30 days warning)
- [ ] Disk space alert configured (10% threshold)
- [ ] Backup job alert configured (6 hour SLA)
- [ ] Test alert sent to Gotify

**Application:**
- [ ] Admin login works (JWT token issued)
- [ ] Admin dashboard loads all components
- [ ] API health check returns 200 OK
- [ ] Media upload works (test 100MB+ video)
- [ ] Geocoding works (test address lookup)
- [ ] Map loads locations correctly
- [ ] Campaign email sending works (test queue)
- [ ] Listmonk sync works (if enabled)
- [ ] Canvass map GPS tracking works (volunteer portal)

**Performance:**
- [ ] Nginx rate limiting prevents abuse (test with 1000 req/sec)
- [ ] Database connection pooling configured
- [ ] Redis cache hit ratio >80% (check Grafana)
- [ ] Page load times <2 seconds (test with network throttling)
- [ ] Video upload completes within timeout (10GB max)

**Disaster Recovery:**
- [ ] Full backup restored on staging environment
- [ ] Database migration verified (Prisma migrations applied)
- [ ] Environment variables match production
- [ ] All services start cleanly after restore

---

## Verification Steps

### Post-Deployment Tests

**1. SSL/TLS Verification**
```bash
# Check all subdomains have valid SSL
for subdomain in app api db docs code git n8n listmonk mail qr draw grafana home; do
  echo "Testing $subdomain.betteredmonton.org"
  # -v prints the TLS handshake (subject/issuer) on stderr; -I requests headers only
  curl -sIv "https://$subdomain.betteredmonton.org" 2>&1 | grep -iE "(^HTTP/|subject:|issuer:)"
done

# Should see:
# - HTTP/2 200 (or 301 redirect)
# - Valid certificate issuer (Let's Encrypt or Pangolin)
# - No certificate errors
```

**2. Authentication Flow**
```bash
# Test login endpoint
curl -X POST https://api.betteredmonton.org/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<admin_password>"}' \
  | jq '.accessToken'

# Should return JWT token

# Test token refresh
curl -X POST https://api.betteredmonton.org/api/auth/refresh \
  -H "Content-Type: application/json" \
  -d '{"refreshToken":"<refresh_token>"}' \
  | jq '.accessToken'

# Should return new access token
```

**3. Email Delivery**
```bash
# Trigger password reset email
curl -X POST https://api.betteredmonton.org/api/auth/forgot-password \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com"}'

# Check external email inbox for reset link
# Verify email arrives within 2 minutes
```

**4. Rate Limiting**
```bash
# Test auth endpoint rate limit (10/min)
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code} " https://api.betteredmonton.org/api/auth/login \
    -H "Content-Type: application/json" \
    -d '{"email":"test","password":"test"}'
done

# Should see: 401 401 401 ... 429 429 429 (after 10 requests)
```
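
Reading a row of status codes by eye gets error-prone for longer runs. The helpers below are a sketch with hypothetical names: they take the space-separated codes printed by the loop above on stdin, summarize them, and report whether any request was rate limited.

```bash
# Print a count per HTTP status code, one "code: count" line each.
summarize_codes() {
  tr -s ' \t' '\n' | grep -E '^[0-9]{3}$' | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Succeed if at least one request was rejected with 429.
saw_rate_limit() {
  tr -s ' \t' '\n' | grep -qx '429'
}

# Example: save the loop output to a file, then
#   summarize_codes < codes.txt
#   saw_rate_limit < codes.txt && echo "rate limiting is active"
```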

**5. Database Connectivity**
```bash
# Check API can connect to database
curl https://api.betteredmonton.org/api/health | jq '.database'
# Should return: "healthy"

# Check Redis connectivity
curl https://api.betteredmonton.org/api/health | jq '.redis'
# Should return: "healthy"
```

**6. Media Upload**
```bash
# Test video upload (requires auth token)
curl -X POST https://api.betteredmonton.org/media/videos/upload \
  -H "Authorization: Bearer <admin_jwt>" \
  -F "file=@test-video.mp4" \
  -F "title=Test Upload" \
  | jq '.id'

# Should return video ID
```

**7. Monitoring Endpoints**
```bash
# Prometheus targets (the Grafana API requires authentication)
curl -s -u "admin:<grafana_admin_password>" https://grafana.betteredmonton.org/api/datasources/proxy/1/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Should show all targets with health: "up"

# Grafana health
curl https://grafana.betteredmonton.org/api/health
# Should return: {"database":"ok","version":"..."}
```

**8. Backup Verification**
```bash
# Trigger manual backup
./scripts/backup.sh

# Check backup files created
ls -lh backups/ | tail -3

# Should see 3 new files with current timestamp

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/ | tail -3
```
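
The "3 new files with current timestamp" check can be scripted rather than read off `ls`. The sketch below counts files in the backup directory modified within a recent window. The function names and the 10-minute default are assumptions, not part of `scripts/backup.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count regular files directly inside DIR that were
# modified less than MAX_AGE_MIN minutes ago.
recent_backup_count() {
  local dir="$1" max_age_min="$2"
  find "$dir" -maxdepth 1 -type f -mmin "-$max_age_min" | wc -l | tr -d ' '
}

# Succeed only if at least MIN_FILES fresh files exist (defaults: 3 files, 10 minutes).
backups_are_fresh() {
  local dir="$1" min_files="${2:-3}" max_age_min="${3:-10}"
  local count
  count=$(recent_backup_count "$dir" "$max_age_min")
  [ "$count" -ge "$min_files" ]
}

# Usage (run right after ./scripts/backup.sh):
#   backups_are_fresh backups 3 10 && echo "backup OK" || echo "backup missing or stale"
```

A non-zero exit status makes this easy to wire into cron or an alerting wrapper.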

---

## Critical Files Reference

**Configuration Files:**
- `docker-compose.yml` - Service orchestration (25+ services)
- `.env` - Environment variables (100+ vars, not committed)
- `.env.example` - Template with all required variables
- `nginx/nginx.conf` - Global Nginx config + security headers
- `nginx/conf.d/api.conf` - API + Media API reverse proxy
- `nginx/conf.d/services.conf` - 12 service subdomains + embed proxies
- `configs/pangolin/resources.yml` - Tunnel resource definitions
- `configs/prometheus/prometheus.yml` - Metrics collection config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/*.json` - Pre-configured dashboards
- `configs/alertmanager/alertmanager.yml` - Alert routing

**Database Schema:**
- `api/prisma/schema.prisma` - Main database schema (30+ models)
- `api/prisma/migrations/` - Migration history
- `api/prisma/seed.ts` - Initial data seeding

**Deployment Scripts:**
- `scripts/backup.sh` - PostgreSQL + Listmonk + uploads backup
- `scripts/pangolin-setup.sh` - CLI wrapper for automated tunnel setup

**Environment Validation:**
- `api/src/config/env.ts` - Zod schema for all environment variables (100+ vars)

---

## Rollback Procedure

If deployment fails or critical issues arise:

**1. Immediate Rollback (5 minutes)**
```bash
# Stop all containers
docker compose down

# Restore previous .env file
cp .env.backup .env

# Restart with old configuration
docker compose up -d
```
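
The immediate rollback only works if `.env.backup` was created before the deployment changed `.env`. A minimal sketch of that pre-deployment step follows. The function name `snapshot_env` is hypothetical, and restricting the copy to mode `600` is an assumption made because the file holds secrets.

```bash
#!/usr/bin/env bash
# Hypothetical pre-deployment step: copy the working .env to .env.backup so the
# immediate rollback has something to restore. Refuses to run if .env is
# missing, and makes the copy readable and writable by the owner only.
snapshot_env() {
  local dir="${1:-.}"
  if [ ! -f "$dir/.env" ]; then
    echo "no .env found in $dir" >&2
    return 1
  fi
  cp -p "$dir/.env" "$dir/.env.backup"
  chmod 600 "$dir/.env.backup"
}

# Usage (from the project root, before editing .env for production):
#   snapshot_env .
```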

**2. Database Rollback (15 minutes)**
```bash
# Stop API to prevent new writes
docker compose stop api media-api

# Restore from latest backup
docker compose exec v2-postgres psql -U changemaker -d postgres -c "DROP DATABASE changemaker_v2;"
docker compose exec v2-postgres psql -U changemaker -d postgres -c "CREATE DATABASE changemaker_v2;"
docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < backups/changemaker_v2-YYYYMMDD-HHMMSS.sql

# Restart services
docker compose start api media-api
```
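
The restore command uses a placeholder filename (`changemaker_v2-YYYYMMDD-HHMMSS.sql`). Because that timestamp format sorts lexicographically, the newest dump can be picked automatically. This is a sketch under that naming assumption, and `latest_db_backup` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest changemaker_v2-YYYYMMDD-HHMMSS.sql dump
# in DIR. Returns non-zero if no matching file exists.
latest_db_backup() {
  local dir="${1:-backups}" f latest=""
  for f in "$dir"/changemaker_v2-*.sql; do
    [ -e "$f" ] || continue   # glob matched nothing
    latest="$f"               # globs expand in sorted order, so the last one wins
  done
  [ -n "$latest" ] && echo "$latest"
}

# Usage:
#   dump=$(latest_db_backup backups) || { echo "no dump found" >&2; exit 1; }
#   docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < "$dump"
```

Confirm the selected file by eye before dropping the database; the newest dump is not always the one you want to restore.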

**3. Full System Restore (30 minutes)**
```bash
# Stop all services
docker compose down -v  # WARNING: Removes all volumes

# Restore PostgreSQL data
tar -xzf backups/postgres-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore Redis data (if backed up)
tar -xzf backups/redis-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore uploads
tar -xzf backups/uploads-YYYYMMDD-HHMMSS.tar.gz -C ./media/

# Restart all services
docker compose up -d
```

**4. Verify Rollback Success**
```bash
# Check all services healthy
docker compose ps -a | grep -v "Up"  # Should print only the header row

# Test admin login
curl -X POST http://localhost:4000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<password>"}'

# Verify the API can reach the database (confirms connectivity, not that data was restored)
curl http://localhost:4000/api/health | jq '.database'
```
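
For scripted verification, the service check is easier to act on as an exit status. The sketch below reads the `docker compose ps` table on stdin, skips the header row, and fails if any remaining line lacks "Up". The helper name `services_not_up` is hypothetical, and it assumes the default table output with the status text on each service's line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `docker compose ps -a` table output on stdin, skip
# the header row, and print every line that does not contain "Up".
# Exits non-zero if any such line exists.
services_not_up() {
  awk 'NR > 1 && !/Up/ { print; bad = 1 } END { exit bad }'
}

# Usage:
#   docker compose ps -a | services_not_up && echo "all services up"
```

Like the `grep` check, this matches "Up" anywhere on the line, so a container whose name contains "Up" would slip through.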

---

## Post-Production Maintenance

**Daily Tasks:**
- Monitor Grafana dashboards for anomalies
- Check Gotify alerts for critical issues
- Verify backups completed successfully (check logs)

**Weekly Tasks:**
- Review API error logs for patterns
- Check disk space usage (alert should fire when less than 10% is free)
- Verify SSL certificate validity (at least 30 days remaining)
- Test disaster recovery on staging environment

**Monthly Tasks:**
- Review access logs for suspicious activity
- Update Docker images to latest versions (after testing on staging)
- Audit user accounts and remove inactive users
- Review and rotate API keys if necessary

**Quarterly Tasks:**
- Conduct full security audit (penetration testing)
- Review and update rate limiting thresholds based on traffic
- Analyze backup storage costs and adjust retention policy
- Test full disaster recovery procedure with restore drill
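
Two of the weekly checks (disk space and certificate validity) lend themselves to a small script. The following is a sketch, not part of the existing `scripts/` directory: both function names are hypothetical, the free-space figure is derived from the `Capacity` column of `df -P`, so it is approximate on filesystems with reserved blocks, and the certificate check assumes `openssl` is installed on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the approximate percentage of free space on the
# filesystem that contains PATH (default: /).
disk_free_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print 100 - $5 }'
}

# Hypothetical helper: succeed if the certificate served by HOST on port 443
# remains valid for at least DAYS more days (default: 30).
cert_valid_for_days() {
  local host="$1" days="${2:-30}"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null
}

# Usage:
#   [ "$(disk_free_pct /)" -ge 10 ] || echo "WARNING: less than 10% disk free"
#   cert_valid_for_days api.betteredmonton.org 30 || echo "WARNING: certificate expires within 30 days"
```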

---

## Summary

This plan lays out the path from development to production for the Changemaker Lite V2 networking infrastructure. The architecture is fundamentally sound:

**Strengths:**
- Single bridge network simplifies communication
- Pangolin tunnel handles SSL/TLS externally (zero Nginx cert management)
- Comprehensive security headers and policies
- Automated backup script exists
- Monitoring stack with Prometheus/Grafana ready
- Rate limiting on critical endpoints

**Critical Path for Production:**
1. Phase 1: Security hardening (change passwords, configure SMTP) - **MUST DO**
2. Phase 3: Pangolin tunnel setup - **MUST DO**
3. Phase 4: Backup automation - **SHOULD DO**
4. Phase 6: Monitoring alerts - **SHOULD DO**
5. Phase 2: Nginx hardening - **NICE TO HAVE**

The remaining phases (network segmentation, resource limits, log aggregation) can be deferred to post-launch improvements without blocking production deployment.

**Estimated Total Implementation Time:** 6-10 hours (can be split across multiple days)

**Estimated Downtime During Deployment:** <5 minutes (only during final container restart)