# Production Networking Preparation Plan

## Context

The Changemaker Lite V2 application needs to be prepared for production deployment. The current architecture is development-focused, with Docker Compose orchestration, an Nginx reverse proxy, and Pangolin tunnel integration for SSL/TLS termination. The user wants a comprehensive understanding of the networking setup and identification of production readiness gaps before going live.

**Why this is needed:**
- Current setup is optimized for local development (HTTP-only, MailHog, default passwords)
- Production deployment requires SSL/TLS via Pangolin tunnel, real SMTP, security hardening
- Need to identify all gaps between dev and production configurations
- Need actionable checklist for production cutover

**What prompted this:**
- User preparing to deploy production instance on betteredmonton.org domain
- Need to understand networking architecture, security posture, and deployment requirements
- Ensure every subdomain in the routing matrix routes correctly through the Pangolin tunnel

**Intended outcome:**
- Comprehensive documentation of current networking architecture
- Identified production readiness gaps with severity ratings
- Prioritized checklist for production deployment
- Configuration changes needed for production hardening

---

## Current State Assessment

### Network Architecture

**Single-Bridge Network Design:**
- All 25+ services on one Docker bridge network (`changemaker-lite`)
- Services communicate via container hostnames (Docker's embedded DNS at `127.0.0.11`)
- Nginx acts as the single reverse proxy for all external traffic
- Pangolin terminates SSL/TLS externally; the Newt container carries traffic over the encrypted tunnel

**Service Topology:**
```
Internet → Pangolin Tunnel (HTTPS) → Newt Container → Nginx (HTTP:80) → Backend Services
```

**Critical Services:**
- **Express API** (port 4000) - Main V2 API with Prisma ORM
- **Fastify Media API** (port 4100) - Video library management
- **Admin GUI** (port 3000) - React admin interface
- **PostgreSQL V2** (port 5433 localhost-only) - Primary database
- **Redis** (port 6379) - Cache, rate limiting, BullMQ backend
- **Nginx** (ports 80/443) - Reverse proxy for all subdomain routes (see the matrix below)

### Subdomain Routing Matrix

| Subdomain | Backend | Container Port | Purpose | Security Headers |
|-----------|---------|----------------|---------|------------------|
| `app.betteredmonton.org` | Admin GUI | 3000 | Admin interface | SAMEORIGIN |
| `api.betteredmonton.org` | Express + Media API | 4000/4100 | Main API + Media routes | SAMEORIGIN |
| `betteredmonton.org` (root) | MkDocs Site | 80 | Public documentation | Default |
| `db.betteredmonton.org` | NocoDB | 8080 | Data browser | CSP iframe |
| `docs.betteredmonton.org` | MkDocs Dev | 8000 | Live preview | CSP iframe + WS |
| `code.betteredmonton.org` | Code Server | 8080 | Web IDE | CSP iframe + WS |
| `git.betteredmonton.org` | Gitea | 3000 | Git hosting | CSP iframe |
| `n8n.betteredmonton.org` | n8n | 5678 | Workflow automation | CSP iframe + WS |
| `listmonk.betteredmonton.org` | Listmonk | 9000 | Newsletter platform | SAMEORIGIN |
| `mail.betteredmonton.org` | MailHog | 8025 | Email capture (dev) | CSP iframe + WS |
| `qr.betteredmonton.org` | Mini QR | 8080 | QR code generator | CSP iframe |
| `draw.betteredmonton.org` | Excalidraw | 80 | Collaborative whiteboard | CSP iframe + WS |
| `grafana.betteredmonton.org` | Grafana | 3000 | Monitoring dashboard | SAMEORIGIN |
| `home.betteredmonton.org` | Homepage | 3000 | Service dashboard | SAMEORIGIN |

**Embed Proxy Ports** (bypass security headers for iframe embedding):
- Ports 8881-8886 → Strip `X-Frame-Options` and `Content-Security-Policy` headers
- Used by Admin GUI to embed third-party services (NocoDB, n8n, Gitea, MailHog, Mini QR, Excalidraw)

### SSL/TLS & Tunnel Configuration

**Current Setup:**
- **Nginx**: HTTP-only (port 80), no SSL/TLS configuration
- **Pangolin Tunnel**: Handles all HTTPS termination externally
- **Newt Container**: Establishes encrypted tunnel to Pangolin server
- **Certificate Management**: Delegated entirely to Pangolin (zero config in Nginx)

**Pangolin Environment Variables:**
```bash
PANGOLIN_API_URL=https://api.bnkserve.org/v1      # Self-hosted Pangolin instance
PANGOLIN_API_KEY=                                 # Bearer token authentication
PANGOLIN_ORG_ID=                                  # Organization identifier
PANGOLIN_SITE_ID=                                 # Created during initial setup
PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org   # Tunnel entry point
PANGOLIN_NEWT_ID=                                 # Generated tunnel identity
PANGOLIN_NEWT_SECRET=                             # Tunnel authentication secret
```
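
Before the Newt container is started, it can help to confirm that none of these variables is still empty. The following is a minimal sketch, assuming the variables live in a `.env`-style file of `KEY=value` lines with optional trailing comments; the function name `missing_pangolin_vars` is hypothetical and not part of the project.

```bash
# Hypothetical helper: print each Pangolin variable that is missing or empty
# in the given env file (assumed format: KEY=value, optional "  # comment").
missing_pangolin_vars() {
    env_file="$1"
    for name in PANGOLIN_API_URL PANGOLIN_API_KEY PANGOLIN_ORG_ID \
                PANGOLIN_SITE_ID PANGOLIN_ENDPOINT PANGOLIN_NEWT_ID \
                PANGOLIN_NEWT_SECRET; do
        # Take the last assignment, drop an inline comment, trim trailing spaces.
        value=$(grep "^${name}=" "$env_file" | tail -n 1 \
            | sed -e "s/^${name}=//" -e 's/[[:space:]]#.*$//' -e 's/[[:space:]]*$//')
        [ -n "$value" ] || echo "$name"
    done
}
```

Calling `missing_pangolin_vars .env` prints nothing when every variable has a value; any name it prints still needs to be filled in, either by hand or by the automated setup.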

**Automated Setup** (Feb 2026):
- One-command deployment via `/api/pangolin/setup-automated` endpoint
- Central resource config: `configs/pangolin/resources.yml` (12 services)
- Atomic .env updates + Newt container restart + tunnel verification
- Reduces setup time from 15min → 2min (87% reduction)

### Security Posture

**Strengths:**
- ✅ JWT access/refresh token rotation (atomic transactions)
- ✅ Password policy enforced at schema level (12+ chars, complexity requirements)
- ✅ Rate limiting on auth endpoints (10/min per IP)
- ✅ Redis authentication required (`requirepass` enforced)
- ✅ User enumeration prevention (401 for all auth failures)
- ✅ Database secrets encrypted with `ENCRYPTION_KEY`
- ✅ HSTS header with 1-year max-age + includeSubDomains
- ✅ CSP headers for iframe protection on sensitive services
- ✅ PostgreSQL bound to localhost only (not exposed to network)
- ✅ Security audit completed Feb 2026 (13 findings addressed)

**Critical Gaps:**
- ❌ No HTTP → HTTPS redirect in Nginx (relies on Pangolin)
- ❌ Embed proxy ports (8881-8886) strip `X-Frame-Options` and CSP (clickjacking risk; removing CSP also weakens XSS defenses)
- ❌ No nginx-level rate limiting (only application-level)
- ❌ Grafana admin password defaults to "admin"
- ❌ Gotify admin password defaults to "admin"
- ❌ N8N default credentials in .env.example
- ❌ EMAIL_TEST_MODE=true by default (routes to MailHog in production)
- ❌ NODE_TLS_REJECT_UNAUTHORIZED not explicitly set (a value of `0` would disable certificate validation and accept self-signed certs)

### Database & Caching

**PostgreSQL V2** (`changemaker-v2-postgres`):
- Port binding: `127.0.0.1:5433:5432` (localhost-only, production-safe)
- Connection: `postgresql://changemaker:${V2_POSTGRES_PASSWORD}@changemaker-v2-postgres:5432/changemaker_v2`
- Used by: Express API (Prisma), Media API (Prisma), NocoDB (separate `nocodb_meta` DB)
- Healthcheck: `pg_isready` with 10s interval

**Listmonk PostgreSQL** (`listmonk-db`):
- Port binding: `127.0.0.1:5432:5432` (localhost-only)
- Isolated database lifecycle (separate from V2)
- Two-user architecture: Web admin + API user (plaintext tokens)

**Redis** (`redis-changemaker`):
- Port binding: `6379:6379` (exposed to host network)
- Authentication: `requirepass ${REDIS_PASSWORD}` enforced
- Connection: `redis://:${REDIS_PASSWORD}@redis-changemaker:6379`
- Used for: Cache, BullMQ queues, rate limiting, geocoding cache
- **SECURITY NOTE**: redis-exporter uses unauthenticated connection string (potential risk)

### Email Configuration

**Development (Current):**
- `EMAIL_TEST_MODE=true` → All emails route to MailHog (localhost:1025)
- MailHog Web UI: `http://mail.betteredmonton.org` (dev only)
- No external SMTP configured

**Production Requirements:**
- `EMAIL_TEST_MODE=false` → Route to real SMTP server
- SMTP credentials: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASS`
- Encrypt SMTP password with `ENCRYPTION_KEY` (stored in DB)
- Configure Listmonk SMTP separately (newsletter sending)

**Email Systems:**
1. **Campaign Emails** (BullMQ queue) → Main SMTP
2. **System Emails** (password reset, shift confirmations) → Main SMTP
3. **Newsletter Emails** (Listmonk) → Listmonk SMTP (can be same or separate)

### Monitoring & Observability

**Prometheus Metrics:**
- 12 custom `cm_*` metrics (API uptime, queue size, sessions, etc.)
- HTTP request metrics (duration, status codes, paths)
- Redis, PostgreSQL, container metrics via exporters
- Scrape interval: 15s

**Grafana Dashboards:**
- 3 pre-configured dashboards (API metrics, system metrics, canvass activity)
- Data source: Prometheus
- Default admin: `admin/admin` (must change for production)

**Alertmanager:**
- Alert routing configured
- Requires Gotify setup for notifications (default: admin/admin)

**Services Behind `--profile monitoring`:**
- Prometheus (9090)
- Grafana (3001)
- Alertmanager (9093)
- cAdvisor (8080)
- Node Exporter (9100)
- Redis Exporter (9121)
- Gotify (8889)

### Backup & Disaster Recovery

**Current Backup Script** (`scripts/backup.sh`):
- PostgreSQL V2 dump (pg_dump)
- Listmonk database dump
- Uploads directory archive (tar.gz)
- Optional S3 upload (requires `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)

**Critical Gaps:**
- ❌ No automated backup scheduling (cron not configured)
- ❌ No backup retention policy
- ❌ No disaster recovery playbook
- ❌ No restore procedure documentation
- ❌ No backup monitoring/alerting

---

## Production Readiness Gaps

### Critical Severity (Must Fix Before Production)

1. **Default Admin Passwords**
   - **Services:** Grafana, Gotify, N8N, NocoDB, Listmonk
   - **Impact:** Unauthorized access to admin dashboards, data exfiltration
   - **Fix:** Change all default passwords in `.env` before deployment
   - **Verification:** Attempt login with default credentials (should fail)

2. **Email Test Mode Enabled**
   - **Issue:** `EMAIL_TEST_MODE=true` routes all production emails to MailHog
   - **Impact:** Users never receive password reset, shift confirmation, or campaign emails
   - **Fix:** Set `EMAIL_TEST_MODE=false` + configure real SMTP credentials
   - **Verification:** Send test email, verify receipt in external inbox

3. **Missing ENCRYPTION_KEY**
   - **Issue:** Required for encrypting DB secrets (SMTP passwords, API tokens)
   - **Impact:** Application won't start in production if unset
   - **Fix:** Generate via `openssl rand -hex 32`, add to `.env`
   - **Verification:** Restart API, confirm the logs show no encryption errors

4. **Embed Proxy XSS Risk**
   - **Issue:** Ports 8881-8886 strip all security headers (`X-Frame-Options`, CSP)
   - **Impact:** With framing protections removed, a malicious site can iframe these services (clickjacking), and any CSP-based XSS mitigation is lost
   - **Fix:** Restrict embed proxy ports to localhost-only OR implement IP whitelist
   - **Verification:** Attempt to access embed proxy from external IP (should fail)
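
The default-password and missing-key items above can be screened mechanically before deployment. The sketch below assumes a `KEY=value` env file and a small, illustrative list of default values (`admin`, `password`, `changeme`, anything containing `change_this`); the function name `find_default_secrets` and that list are assumptions, not part of the project.

```bash
# Hypothetical helper: list credential-like variables in an env file whose
# value is empty or a well-known default. The default list is an assumption.
find_default_secrets() {
    env_file="$1"
    # Only inspect names that look like credentials.
    grep -E '^[A-Z0-9_]*(PASSWORD|PASSWD|SECRET|KEY|TOKEN)=' "$env_file" \
        | awk '{
            eq = index($0, "=")
            name = substr($0, 1, eq - 1)
            value = tolower(substr($0, eq + 1))   # everything after the first "="
            if (value == "" || value == "admin" || value == "password" ||
                value == "changeme" || value ~ /change_this/)
                print name
        }'
}
```

An empty result from `find_default_secrets .env` is a necessary but not sufficient check: it says nothing about password strength, only that the obvious defaults are gone.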

### High Severity (Fix Before Launch)

5. **No HTTP → HTTPS Redirect**
   - **Issue:** Users can access `http://betteredmonton.org` without forced redirect
   - **Impact:** Mixed content warnings, insecure authentication cookies
   - **Fix:** Add an nginx redirect for all subdomains, keyed on the forwarded protocol because Nginx itself only receives HTTP from the tunnel
   - **Verification:** `curl -I http://app.betteredmonton.org` should return 301 redirect

6. **No Automated Backups**
   - **Issue:** Manual backup script requires cron scheduling
   - **Impact:** Data loss if server fails before manual backup
   - **Fix:** Add cron job: `0 */6 * * * /path/to/backup.sh` (every 6 hours)
   - **Verification:** Check the cron log (`/var/log/cron`, or `/var/log/syslog` on Debian/Ubuntu) for backup execution entries

7. **Redis Exporter Unauthenticated**
   - **Issue:** `REDIS_ADDR=redis:6379` (no password)
   - **Impact:** Redis enforces `requirepass`, so an exporter without credentials cannot authenticate and Redis metrics and alerts go missing
   - **Fix:** Change to `REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379`
   - **Verification:** Check redis-exporter logs, ensure no auth errors

8. **No Disaster Recovery Documentation**
   - **Issue:** Restore procedure not documented
   - **Impact:** Extended downtime during recovery, data corruption risk
   - **Fix:** Document step-by-step restore process (DB import, volume restore, env config)
   - **Verification:** Perform disaster recovery drill on staging environment
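
For the automated-backup item above, a simple freshness check can complement the cron log. This is a minimal sketch, assuming backups are written as regular files somewhere under a single directory; the function name `backup_is_fresh` is hypothetical.

```bash
# Hypothetical check: succeed if DIR contains a regular file modified within
# the last MAX_HOURS hours. Assumes backups are plain files under DIR.
backup_is_fresh() {
    dir="$1"
    max_hours="$2"
    recent=$(find "$dir" -type f -mmin "-$((max_hours * 60))" 2>/dev/null | head -n 1)
    [ -n "$recent" ]
}
```

With the 6-hour schedule, `backup_is_fresh ./backups 7 || echo "backup is stale"` leaves one hour of slack for the job to finish.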

### Medium Severity (Address Within 30 Days)

9. **Single Bridge Network**
   - **Issue:** All services share one network, so lateral movement is easy if one is compromised
   - **Impact:** If one service is exploited, attacker can reach databases/Redis
   - **Fix:** Split into separate networks (app-net, data-net, services-net)
   - **Verification:** Verify service isolation via `docker network inspect`

10. **No Nginx Rate Limiting**
    - **Issue:** Rate limiting only at application level (Express middleware)
    - **Impact:** DDoS attacks can saturate Nginx/network before reaching API rate limiter
    - **Fix:** Add nginx `limit_req` zones for `/api/*` paths
    - **Verification:** Send 1000 req/sec, verify 429 responses from Nginx

11. **No Log Aggregation**
    - **Issue:** Logs scattered across Docker containers
    - **Impact:** Difficult to debug multi-service issues, no centralized audit trail
    - **Fix:** Implement ELK stack or similar (Elasticsearch, Logstash, Kibana)
    - **Verification:** Search logs from all services in one UI

12. **No TLS Certificate Monitoring**
    - **Issue:** Pangolin manages certs, but no alerting on renewal failures
    - **Impact:** Site goes offline when cert expires
    - **Fix:** Add Prometheus alert for cert expiry (30 days before)
    - **Verification:** Simulate expired cert, verify alert fires
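
Until a Prometheus-based certificate alert exists, expiry can be spot-checked from the command line. The sketch below relies on `openssl x509 -checkend`, which takes a number of seconds and exits non-zero if the certificate expires within that window; the function name `cert_valid_for_days` is hypothetical and the certificate is assumed to be a local PEM file.

```bash
# Hypothetical check: succeed if the PEM certificate in FILE will still be
# valid DAYS days from now. `-checkend` expects seconds, hence the conversion.
cert_valid_for_days() {
    cert_file="$1"
    days="$2"
    openssl x509 -checkend "$((days * 86400))" -noout -in "$cert_file" >/dev/null 2>&1
}
```

A live certificate can be saved to a file first, for example with `openssl s_client -connect app.betteredmonton.org:443 -servername app.betteredmonton.org </dev/null | openssl x509 > app.pem`, and then tested with `cert_valid_for_days app.pem 30`.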

### Low Severity (Nice to Have)

13. **No Service Mesh**
    - **Issue:** No observability of inter-service communication
    - **Impact:** Difficult to debug network issues between containers
    - **Fix:** Implement Linkerd or Istio for traffic management
    - **Verification:** View service-to-service latency in Grafana

14. **No Container Resource Limits**
    - **Issue:** Docker Compose doesn't set CPU/memory limits
    - **Impact:** One service can starve others of resources
    - **Fix:** Add `deploy.resources.limits` to docker-compose.yml
    - **Verification:** Monitor resource usage under load

15. **No Listmonk HTTPS**
    - **Issue:** API-to-Listmonk communication uses HTTP (inside Docker network)
    - **Impact:** If network is compromised, credentials visible in plaintext
    - **Fix:** Configure Listmonk with internal TLS certificate
    - **Verification:** Inspect network traffic, verify encryption

---

## Implementation Plan

### Phase 1: Pre-Deployment Security Hardening (2-3 hours)

**File:** `.env` (production environment variables)

**Changes Required:**

1. **Generate Secrets**
   ```bash
   # Run on production server
   openssl rand -hex 32  # JWT_ACCESS_SECRET
   openssl rand -hex 32  # JWT_REFRESH_SECRET
   openssl rand -hex 32  # ENCRYPTION_KEY (must differ from JWT secrets)
   openssl rand -hex 16  # LISTMONK_API_TOKEN
   ```

2. **Update Environment Variables**
   - `EMAIL_TEST_MODE=false`
   - `NODE_TLS_REJECT_UNAUTHORIZED=` (leave unset or empty for strict validation; only the value `0` disables certificate checks)
   - `GRAFANA_ADMIN_PASSWORD=<strong_password>`
   - `GOTIFY_ADMIN_PASSWORD=<strong_password>`
   - `N8N_USER_PASSWORD=<strong_password>`
   - `NC_ADMIN_PASSWORD=<strong_password>`
   - `LISTMONK_WEB_ADMIN_PASSWORD=<strong_password>`
   - `V2_POSTGRES_PASSWORD=<strong_password>`
   - `REDIS_PASSWORD=<strong_password>`
   - `LISTMONK_DB_PASSWORD=<strong_password>`
   - `GITEA_DB_PASSWD=<strong_password>`
   - `GITEA_DB_ROOT_PASSWORD=<strong_password>`
   - `N8N_ENCRYPTION_KEY=<strong_password>`

3. **Configure Production SMTP**
   - `SMTP_HOST=<smtp.provider.com>`
   - `SMTP_PORT=<465 or 587>`
   - `SMTP_USER=<username>`
   - `SMTP_PASS=<password>` (will be encrypted by API on first startup)
   - `SMTP_SECURE=true` (for implicit TLS on port 465) or `false` (for STARTTLS on 587)

4. **Listmonk SMTP Configuration**
   - `LISTMONK_SMTP_HOST=<smtp.provider.com>`
   - `LISTMONK_SMTP_PORT=<465 or 587>`
   - `LISTMONK_SMTP_TLS_TYPE=STARTTLS` (for 587) or `TLS` (for 465)
   - `LISTMONK_SMTP_AUTH_PROTOCOL=login`
   - `LISTMONK_SMTP_USERNAME=<username>`
   - `LISTMONK_SMTP_PASSWORD=<password>`
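
The port-to-TLS pairing in steps 3 and 4 is easy to get backwards, so it can be encoded once and reused. This is an illustrative sketch of the mapping stated above; both function names are hypothetical, and ports other than 465 and 587 are deliberately rejected.

```bash
# Map an SMTP port to the settings described in steps 3 and 4.
# 465 -> implicit TLS; 587 -> STARTTLS. Other ports are rejected.
smtp_secure_for_port() {
    case "$1" in
        465) echo "true" ;;
        587) echo "false" ;;
        *)   echo "unsupported port: $1" >&2; return 1 ;;
    esac
}

listmonk_tls_type_for_port() {
    case "$1" in
        465) echo "TLS" ;;
        587) echo "STARTTLS" ;;
        *)   echo "unsupported port: $1" >&2; return 1 ;;
    esac
}
```

For example, `SMTP_SECURE=$(smtp_secure_for_port "$SMTP_PORT")` keeps the two values consistent when the port changes.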

**Verification:**
```bash
# Check that no placeholder values remain
grep "CHANGE_THIS" .env  # Should return nothing

# Check that no variable is still set to a common default value
# (matching on the value, since names such as GRAFANA_ADMIN_PASSWORD contain "admin")
grep -Ei '=(admin|password|changeme)$' .env  # Should return nothing

# Test SMTP connection
docker compose exec api node -e "
const nodemailer = require('nodemailer');
const transport = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: parseInt(process.env.SMTP_PORT, 10),
  secure: process.env.SMTP_SECURE === 'true',
  auth: {
    user: process.env.SMTP_USER,
    pass: process.env.SMTP_PASS
  }
});
transport.verify().then(console.log).catch(console.error);
"
```

---

### Phase 2: Nginx Production Hardening (1 hour)

**File:** `nginx/conf.d/default.conf` (or new `production.conf`)

**Changes Required:**

1. **Add HTTP → HTTPS Redirect**
   ```nginx
   # Add inside each existing subdomain server block (listen 80).
   #
   # Pangolin terminates TLS and forwards plain HTTP to Nginx, so an
   # unconditional "return 301 https://..." on port 80 would redirect
   # HTTPS visitors as well and loop. Redirect only when the tunnel reports
   # that the client connected over HTTP.
   # Assumption: the tunnel sets the X-Forwarded-Proto request header.
   if ($http_x_forwarded_proto = "http") {
       return 301 https://$host$request_uri;
   }

   # Internal health checks (e.g. /health on changemaker-v2-api:4000) carry
   # no X-Forwarded-Proto header and are therefore not redirected.
   ```

2. **Add Nginx Rate Limiting**
   ```nginx
   # Add to http block in nginx.conf
   # Note: behind the tunnel, $binary_remote_addr is the address of the last
   # proxy hop (the Newt container) unless the real client IP is restored,
   # e.g. with the realip module. Otherwise all clients share one limit.
   limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
   limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;

   # Add to api.conf location blocks
   location /api/auth/ {
       limit_req zone=auth_limit burst=20 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }

   location /api/ {
       limit_req zone=api_limit burst=200 nodelay;
       limit_req_status 429;
       proxy_pass http://changemaker-v2-api:4000;
   }
   ```

3. **Restrict Embed Proxy Ports to Localhost**
   ```nginx
   # Add to each embed proxy server block
   server {
       listen 8881;
       server_name localhost;

       # Reject non-localhost connections.
       # Note: when Nginx runs in a container, connections from the host
       # usually arrive from the Docker bridge gateway rather than 127.0.0.1.
       # Publishing the port as 127.0.0.1:8881:8881 in docker-compose.yml is
       # an alternative way to keep it off external interfaces.
       allow 127.0.0.1;
       deny all;

       location / {
           proxy_pass http://changemaker-v2-nocodb:8080;
           proxy_hide_header X-Frame-Options;
           proxy_hide_header Content-Security-Policy;
       }
   }
   ```

4. **Add Custom Error Pages**
   ```nginx
   # Add to each server block (location directives are not allowed at http level)
   error_page 502 503 504 /5xx.html;
   location = /5xx.html {
       root /usr/share/nginx/html;
       internal;
   }

   error_page 429 /429.html;
   location = /429.html {
       root /usr/share/nginx/html;
       internal;
   }
   ```
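
To reason about the `limit_req` values in step 2: with `nodelay`, Nginx accepts up to `burst` excess requests immediately and then admits further requests only as fast as the configured rate. A rough upper bound on the requests accepted from one key in a time window is therefore `1 + burst + rate × window`. The sketch below is only that approximation (Nginx itself tracks the rate at millisecond granularity); the function name is hypothetical.

```bash
# Approximate upper bound on requests accepted from one key within a window,
# for `limit_req ... burst=B nodelay` and a rate given in requests per minute.
max_accepted_requests() {
    rate_per_min="$1"
    burst="$2"
    window_sec="$3"
    echo $(( 1 + burst + rate_per_min * window_sec / 60 ))
}
```

For `auth_limit` (10 r/m, `burst=20`), `max_accepted_requests 10 20 60` gives 31 requests in the first minute, which is looser than the application-level limit of 10/min; lowering `burst` tightens it. For `api_limit` (100 r/s, i.e. 6000 r/m, `burst=200`), `max_accepted_requests 6000 200 1` gives 301 in the first second.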

**Verification:**
```bash
# Test HTTP redirect
curl -I http://app.betteredmonton.org | grep "301"  # Should see 301 Moved Permanently

# Test rate limiting. Requests must be sent in parallel: a sequential loop
# stays far below the 100 r/s limit (plus burst=200) and never triggers a 429.
seq 1 1000 | xargs -P 100 -I{} curl -s -o /dev/null -w "%{http_code}\n" \
  http://api.betteredmonton.org/api/health | sort | uniq -c
# Should see 200s, plus 429s once the burst allowance is exhausted
# (raise the request count or parallelism if no 429s appear)

# Test embed proxy localhost restriction
curl -I http://<server_ip>:8881  # Should return 403 Forbidden
curl -I http://localhost:8881    # Should return 200 OK
```

---

### Phase 3: Pangolin Tunnel Configuration (30 minutes)

**File:** `.env` (Pangolin environment variables)

**Prerequisites:**
- Pangolin organization created at `https://api.bnkserve.org`
- API key obtained from organization settings
- DNS records created (see below)

**Steps:**

1. **Configure Pangolin Environment Variables**
   ```bash
   PANGOLIN_API_URL=https://api.bnkserve.org/v1
   PANGOLIN_API_KEY=<your_api_key>
   PANGOLIN_ORG_ID=<your_org_id>
   PANGOLIN_ENDPOINT=https://pangolin.bnkserve.org
   ```

2. **Run Automated Setup**
   ```bash
   # Option 1: Via API endpoint
   curl -X POST http://localhost:4000/api/pangolin/setup-automated \
     -H "Authorization: Bearer <admin_jwt_token>" \
     -H "Content-Type: application/json" \
     -d '{
       "siteName": "Changemaker Lite Production",
       "domain": "betteredmonton.org"
     }'

   # Option 2: Via CLI wrapper
   ./scripts/pangolin-setup.sh
   ```

3. **Verify Tunnel Connectivity**
   ```bash
   # Check Newt container logs
   docker compose logs -f newt
   # Should see "Connected to Pangolin server" and "Tunnel established"

   # Test external access
   curl -I https://app.betteredmonton.org
   # Should return 200 OK with HTTPS
   ```

**DNS Configuration Required:**

Create one CNAME record per subdomain (13 in total), each pointing to the Pangolin endpoint:
```
app.betteredmonton.org       CNAME  pangolin.bnkserve.org
api.betteredmonton.org       CNAME  pangolin.bnkserve.org
db.betteredmonton.org        CNAME  pangolin.bnkserve.org
docs.betteredmonton.org      CNAME  pangolin.bnkserve.org
code.betteredmonton.org      CNAME  pangolin.bnkserve.org
git.betteredmonton.org       CNAME  pangolin.bnkserve.org
n8n.betteredmonton.org       CNAME  pangolin.bnkserve.org
listmonk.betteredmonton.org  CNAME  pangolin.bnkserve.org
mail.betteredmonton.org      CNAME  pangolin.bnkserve.org
qr.betteredmonton.org        CNAME  pangolin.bnkserve.org
draw.betteredmonton.org      CNAME  pangolin.bnkserve.org
grafana.betteredmonton.org   CNAME  pangolin.bnkserve.org
home.betteredmonton.org      CNAME  pangolin.bnkserve.org
```
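
Because the records above differ only in the subdomain label, the list can be generated rather than typed, which keeps it in step with the routing matrix. A minimal sketch follows; the function name `cname_records` is hypothetical and the output is plain text for review, not a call to any DNS provider API.

```bash
# Print one "name CNAME target" line per subdomain.
# Usage: cname_records DOMAIN TARGET SUBDOMAIN...
cname_records() {
    domain="$1"
    target="$2"
    shift 2
    for sub in "$@"; do
        printf '%s.%s CNAME %s\n' "$sub" "$domain" "$target"
    done
}
```

For example, `cname_records betteredmonton.org pangolin.bnkserve.org app api db` prints the first three records from the list above.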

---

### Phase 4: Backup Automation (30 minutes)

**File:** New cron job configuration

**Steps:**

1. **Create Backup Directory**
   ```bash
   mkdir -p /var/backups/changemaker-lite
   chmod 750 /var/backups/changemaker-lite
   ```

2. **Test Manual Backup**
   ```bash
   cd /home/bunker-admin/changemaker.lite
   ./scripts/backup.sh
   # Should create timestamped backup files in ./backups/
   ```

3. **Configure S3 Upload (Optional)**
   ```bash
   # Add to .env
   S3_BUCKET=changemaker-lite-backups
   AWS_ACCESS_KEY_ID=<your_access_key>
   AWS_SECRET_ACCESS_KEY=<your_secret_key>
   AWS_REGION=us-east-1  # Or your preferred region
   ```

4. **Add Cron Job**
   ```bash
   # Edit crontab
   crontab -e

   # Add the following lines:
   # Backup every 6 hours at minute 0
   0 */6 * * * cd /home/bunker-admin/changemaker.lite && ./scripts/backup.sh >> /var/log/changemaker-backup.log 2>&1

   # Clean up old backups (keep last 7 days)
   0 3 * * * find /home/bunker-admin/changemaker.lite/backups -type f -mtime +7 -delete
   ```

5. **Set Up Backup Monitoring Alert**
   ```yaml
   # Add to configs/prometheus/alerts.yml
   - alert: BackupJobFailed
     expr: time() - cm_backup_last_success_timestamp > 21600  # 6 hours
     for: 1h
     labels:
       severity: critical
     annotations:
       summary: "Backup job has not run successfully in over 6 hours"
       description: "Last successful backup was {{ $value | humanizeDuration }} ago"
   ```
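
The `BackupJobFailed` alert can only evaluate if something exports `cm_backup_last_success_timestamp`, and the plan does not say what does. One option, sketched here as an assumption, is Node Exporter's textfile collector: the backup job writes a small `.prom` file after each successful run. The function name and the output path are hypothetical, and Node Exporter would need to be started with its textfile collector directory pointing at that path.

```bash
# Hypothetical helper: record the time of a successful backup in Prometheus
# text exposition format. The file is written to a temporary name and then
# renamed, so the collector never reads a half-written file.
write_backup_success_metric() {
    out_file="$1"   # a .prom file inside the textfile collector directory
    tmp_file="${out_file}.$$"
    {
        echo '# HELP cm_backup_last_success_timestamp Unix time of the last successful backup.'
        echo '# TYPE cm_backup_last_success_timestamp gauge'
        echo "cm_backup_last_success_timestamp $(date +%s)"
    } > "$tmp_file" && mv "$tmp_file" "$out_file"
}
```

Chaining it after the backup, as in `./scripts/backup.sh && write_backup_success_metric <textfile_dir>/backup.prom`, means the timestamp only advances when the script exits successfully.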

**Verification:**
```bash
# Wait for cron execution (or run manually)
./scripts/backup.sh

# Check backup files exist
ls -lh backups/
# Should see 3 files: changemaker_v2-YYYYMMDD-HHMMSS.sql, listmonk-YYYYMMDD-HHMMSS.sql, uploads-YYYYMMDD-HHMMSS.tar.gz

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/
```

---

### Phase 5: Docker Compose Updates (1 hour)

**File:** `docker-compose.yml`

**Changes Required:**

1. **Fix Redis Exporter Authentication**
   ```yaml
   redis-exporter:
     environment:
       - REDIS_ADDR=redis://:${REDIS_PASSWORD}@redis-changemaker:6379
   ```

2. **Add Container Resource Limits (Optional)**
   ```yaml
   api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 2G
         reservations:
           cpus: '1'
           memory: 1G

   media-api:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G  # Higher for video processing
         reservations:
           cpus: '1'
           memory: 2G

   v2-postgres:
     deploy:
       resources:
         limits:
           cpus: '2'
           memory: 4G
         reservations:
           cpus: '1'
           memory: 2G
   ```

3. **Add Volume Size Limits (Optional)**
   ```yaml
   volumes:
     v2-postgres-data:
       driver: local
       driver_opts:
         type: none
         device: /var/lib/docker/volumes/v2-postgres-data
         o: bind
         # Note: a bind mount does not enforce a size limit, so the intended
         # 50G cap has to come from the host (for example a dedicated
         # partition or a filesystem quota on the device path).
   ```

**Verification:**
```bash
# Recreate containers with new config
docker compose down
docker compose up -d

# Verify Redis exporter connects with auth
docker compose logs redis-exporter | grep "successfully"

# Check resource limits are applied
docker stats --no-stream | grep changemaker
```

---

### Phase 6: Monitoring & Alerting Setup (1-2 hours)

**File:** `configs/prometheus/alerts.yml`

**Additional Alerts to Add:**

```yaml
groups:
  - name: production_critical
    rules:
      # Requires a prober that exports probe_ssl_earliest_cert_expiry
      # (e.g. blackbox_exporter), which is not in the monitoring profile listed earlier.
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 2592000  # 30 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}"

      - alert: DiskSpaceRunningLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      - alert: DatabaseConnectionsHigh
        expr: pg_stat_activity_count > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of database connections ({{ $value }})"
          description: "PostgreSQL has {{ $value }} active connections"

      # The "> 0" guard avoids a division by zero when Redis has no maxmemory set.
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"
          description: "Redis is using {{ $value | humanizePercentage }} of allocated memory"
```

**Gotify Configuration:**

```bash
# Start Gotify container
docker compose --profile monitoring up -d gotify

# Access Gotify UI at http://localhost:8889
# Change admin password (default: admin/admin)

# Create application token for Alertmanager
# Copy token to configs/alertmanager/alertmanager.yml:

receivers:
  - name: 'gotify'
    webhook_configs:
      - url: 'http://gotify-changemaker:80/message?token=<your_app_token>'
        send_resolved: true
```

**Verification:**
```bash
# Start monitoring stack
docker compose --profile monitoring up -d

# Check Prometheus targets are up
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# Should return empty (all targets healthy)

# Test alert firing
docker compose stop api
# Wait 1 minute, check Alertmanager UI at http://localhost:9093
# Should see "APIDown" alert firing

# Verify Gotify receives notification
# Check Gotify UI, should see new message

# Restart API
docker compose start api
```

---

### Phase 7: Final Production Verification (1-2 hours)

**Production Deployment Checklist:**

**Security:**
- [ ] All default passwords changed (Grafana, Gotify, N8N, NocoDB, Listmonk)
- [ ] JWT access and refresh secrets generated via `openssl rand -hex 32` (two distinct values)
- [ ] ENCRYPTION_KEY generated and different from JWT secrets
- [ ] `EMAIL_TEST_MODE=false` set in production .env
- [ ] `NODE_TLS_REJECT_UNAUTHORIZED=` (unset or empty) for strict TLS validation
- [ ] Redis password set and authenticated
- [ ] PostgreSQL passwords strong (20+ characters)
- [ ] Nginx rate limiting enabled
- [ ] Embed proxy ports restricted to localhost

**Networking:**
- [ ] All DNS CNAME records created (one per subdomain listed in Phase 3)
- [ ] Pangolin tunnel configured and connected
- [ ] HTTP → HTTPS redirect working
- [ ] All subdomains resolve via HTTPS
- [ ] SSL certificates valid (checked via browser)
- [ ] WebSocket connections work (test n8n, MkDocs, Code Server)

**Email:**
- [ ] Production SMTP configured (host, port, user, pass)
- [ ] Test email sent and received
- [ ] Listmonk SMTP configured separately
- [ ] Password reset email works
- [ ] Shift confirmation email works

**Backup:**
- [ ] Backup script tested manually
- [ ] S3 credentials configured (if using)
- [ ] Cron job added for automated backups (every 6 hours)
- [ ] Old backup cleanup cron added (7 day retention)
- [ ] Backup monitoring alert configured

**Monitoring:**
- [ ] Prometheus collecting metrics from all services
- [ ] Grafana dashboards showing data
- [ ] Alertmanager configured with Gotify
- [ ] SSL expiry alert configured (30 days warning)
- [ ] Disk space alert configured (10% threshold)
- [ ] Backup job alert configured (6 hour SLA)
- [ ] Test alert sent to Gotify

**Application:**
- [ ] Admin login works (JWT token issued)
- [ ] Admin dashboard loads all components
- [ ] API health check returns 200 OK
- [ ] Media upload works (test 100MB+ video)
- [ ] Geocoding works (test address lookup)
- [ ] Map loads locations correctly
- [ ] Campaign email sending works (test queue)
- [ ] Listmonk sync works (if enabled)
- [ ] Canvass map GPS tracking works (volunteer portal)

**Performance:**
- [ ] Nginx rate limiting prevents abuse (test with 1000 req/sec)
- [ ] Database connection pooling configured
- [ ] Redis cache hit ratio >80% (check Grafana)
- [ ] Page load times <2 seconds (test with network throttling)
- [ ] Video upload completes within timeout (10GB max)

**Disaster Recovery:**
- [ ] Full backup restored on staging environment
- [ ] Database migration verified (Prisma migrations applied)
- [ ] Environment variables match production
- [ ] All services start cleanly after restore

---

## Verification Steps

### Post-Deployment Tests

**1. SSL/TLS Verification**
```bash
# Check all subdomains have valid SSL
for subdomain in app api db docs code git n8n listmonk mail qr draw grafana home; do
  echo "Testing $subdomain.betteredmonton.org"
  # -v is needed: without it curl does not print the certificate subject/issuer
  curl -sIv "https://$subdomain.betteredmonton.org" 2>&1 | grep -iE "(^HTTP|subject:|issuer:)"
done

# Should see:
# - HTTP/2 200 (or 301 redirect)
# - Valid certificate issuer (Let's Encrypt or Pangolin)
# - No certificate errors
```
|
|
|
|
**2. Authentication Flow**
|
|
```bash
|
|
# Test login endpoint
|
|
curl -X POST https://api.betteredmonton.org/api/auth/login \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"email":"admin@betteredmonton.org","password":"<admin_password>"}' \
|
|
| jq '.accessToken'
|
|
|
|
# Should return JWT token
|
|
|
|
# Test token refresh
|
|
curl -X POST https://api.betteredmonton.org/api/auth/refresh \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"refreshToken":"<refresh_token>"}' \
|
|
| jq '.accessToken'
|
|
|
|
# Should return new access token
|
|
```
|
|
|
|
**3. Email Delivery**
```bash
# Trigger password reset email
curl -X POST https://api.betteredmonton.org/api/auth/forgot-password \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com"}'

# Check external email inbox for reset link
# Verify email arrives within 2 minutes
```

**4. Rate Limiting**
```bash
# Test auth endpoint rate limit (10/min)
for i in {1..15}; do
  curl -s -o /dev/null -w "%{http_code} " https://api.betteredmonton.org/api/auth/login \
    -H "Content-Type: application/json" \
    -d '{"email":"test","password":"test"}'
done

# Should see: 401 401 401 ... 429 429 429 (after 10 requests)
```

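The output of the loop above is easy to misread by eye, so a small helper can make the check explicit. This is a minimal sketch: `first_429_index` and `check_rate_limit` are hypothetical names, not part of the project. It only assumes that a rate-limited request returns 429, and it passes when the first 429 appears no later than request `LIMIT + 1`. Depending on where the limit is enforced and whether a burst allowance is configured, 429s may legitimately start earlier, which this check also accepts.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# first_429_index CODE...
# Prints the 1-based position of the first 429 in a list of HTTP status
# codes, or 0 if no request was rate limited.
first_429_index() {
  local i=0 code
  for code in "$@"; do
    i=$((i + 1))
    if [ "$code" = "429" ]; then
      echo "$i"
      return 0
    fi
  done
  echo 0
}

# check_rate_limit LIMIT CODE...
# Succeeds when at least one 429 was seen and the first one arrived
# no later than request LIMIT + 1.
check_rate_limit() {
  local limit=$1 first
  shift
  first=$(first_429_index "$@")
  [ "$first" -gt 0 ] && [ "$first" -le $((limit + 1)) ]
}

# Example, with the space-separated codes printed by the loop above:
#   codes="401 401 401 401 401 401 401 401 401 401 429 429 429 429 429"
#   check_rate_limit 10 $codes && echo "rate limit enforced"
```
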
**5. Database Connectivity**
```bash
# Check API can connect to database
curl https://api.betteredmonton.org/api/health | jq '.database'
# Should return: "healthy"

# Check Redis connectivity
curl https://api.betteredmonton.org/api/health | jq '.redis'
# Should return: "healthy"
```

**6. Media Upload**
```bash
# Test video upload (requires auth token)
curl -X POST https://api.betteredmonton.org/media/videos/upload \
  -H "Authorization: Bearer <admin_jwt>" \
  -F "file=@test-video.mp4" \
  -F "title=Test Upload" \
  | jq '.id'

# Should return video ID
```

**7. Monitoring Endpoints**
```bash
# Prometheus targets (queried through the Grafana datasource proxy, which
# needs Grafana credentials unless anonymous access is enabled)
curl -s -H "Authorization: Bearer <grafana_api_token>" \
  https://grafana.betteredmonton.org/api/datasources/proxy/1/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Should show all targets with health: "up"

# Grafana health
curl https://grafana.betteredmonton.org/api/health
# Should return: {"database":"ok","version":"..."}
```

**8. Backup Verification**
```bash
# Trigger manual backup
./scripts/backup.sh

# Check backup files created
ls -lh backups/ | tail -3

# Should see 3 new files with current timestamp

# If S3 configured, verify upload
aws s3 ls s3://changemaker-lite-backups/ | tail -3
```

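Eyeballing `ls` output works for a manual run, but the "3 new files" expectation can also be checked mechanically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the backups land directly in one directory, and "new" means modified within the last few minutes. It relies on `find -maxdepth` and `-mmin`, which GNU and BSD `find` both provide.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# count_recent_backups DIR MINUTES
# Prints how many regular files directly inside DIR were modified
# less than MINUTES minutes ago.
count_recent_backups() {
  local dir=$1 minutes=$2
  find "$dir" -maxdepth 1 -type f -mmin "-$minutes" | wc -l | tr -d ' '
}

# verify_backup_run DIR EXPECTED MINUTES
# Succeeds when at least EXPECTED files in DIR are newer than MINUTES.
verify_backup_run() {
  local dir=$1 expected=$2 minutes=$3 found
  found=$(count_recent_backups "$dir" "$minutes")
  if [ "$found" -ge "$expected" ]; then
    echo "OK: $found recent backup file(s) in $dir"
  else
    echo "FAIL: expected $expected recent file(s) in $dir, found $found" >&2
    return 1
  fi
}

# Example, right after running ./scripts/backup.sh:
#   verify_backup_run backups 3 10
```
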
---

## Critical Files Reference

**Configuration Files:**
- `docker-compose.yml` - Service orchestration (25+ services)
- `.env` - Environment variables (100+ vars, not committed)
- `.env.example` - Template with all required variables
- `nginx/nginx.conf` - Global Nginx config + security headers
- `nginx/conf.d/api.conf` - API + Media API reverse proxy
- `nginx/conf.d/services.conf` - 12 service subdomains + embed proxies
- `configs/pangolin/resources.yml` - Tunnel resource definitions
- `configs/prometheus/prometheus.yml` - Metrics collection config
- `configs/prometheus/alerts.yml` - Alert rules
- `configs/grafana/*.json` - Pre-configured dashboards
- `configs/alertmanager/alertmanager.yml` - Alert routing

**Database Schema:**
- `api/prisma/schema.prisma` - Main database schema (30+ models)
- `api/prisma/migrations/` - Migration history
- `api/prisma/seed.ts` - Initial data seeding

**Deployment Scripts:**
- `scripts/backup.sh` - PostgreSQL + Listmonk + uploads backup
- `scripts/pangolin-setup.sh` - CLI wrapper for automated tunnel setup

**Environment Validation:**
- `api/src/config/env.ts` - Zod schema for all environment variables (100+ vars)

---

## Rollback Procedure

If deployment fails or critical issues arise:

**1. Immediate Rollback (5 minutes)**
```bash
# Stop all containers
docker compose down

# Restore previous .env file
cp .env.backup .env

# Restart with old configuration
docker compose up -d
```

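The immediate rollback only works if `.env.backup` was captured before the deployment changed `.env`. A minimal sketch of that pre-deployment step follows; `snapshot_env` is a hypothetical helper, and the extra timestamped copy (named in the same `YYYYMMDD-HHMMSS` style as the backup files) is an assumption added so that repeated deployments do not overwrite the only known-good copy.

```bash
#!/usr/bin/env bash
# Sketch only: snapshot_env is an illustrative helper, not a project script.

# snapshot_env [DIR]
# Copies DIR/.env to DIR/.env.backup (the file the rollback step restores)
# and to a timestamped DIR/.env.backup.YYYYMMDD-HHMMSS copy.
# Fails without touching anything if DIR/.env does not exist.
snapshot_env() {
  local dir=${1:-.} stamp
  if [ ! -f "$dir/.env" ]; then
    echo "no .env found in $dir" >&2
    return 1
  fi
  stamp=$(date +%Y%m%d-%H%M%S)
  cp -p "$dir/.env" "$dir/.env.backup" &&
    cp -p "$dir/.env" "$dir/.env.backup.$stamp" &&
    chmod 600 "$dir/.env.backup" "$dir/.env.backup.$stamp"  # file holds secrets
}

# Example, run from the project root before editing .env:
#   snapshot_env .
```
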
**2. Database Rollback (15 minutes)**
```bash
# Stop API to prevent new writes
docker compose stop api media-api

# Restore from latest backup
docker compose exec v2-postgres psql -U changemaker -d postgres -c "DROP DATABASE changemaker_v2;"
docker compose exec v2-postgres psql -U changemaker -d postgres -c "CREATE DATABASE changemaker_v2;"
docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < backups/changemaker_v2-YYYYMMDD-HHMMSS.sql

# Restart services
docker compose start api media-api
```

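The restore command above uses a placeholder timestamp. Because `YYYYMMDD-HHMMSS` names sort chronologically, the newest dump can be picked by filename alone. This is a minimal sketch: `latest_backup` is a hypothetical helper, and it assumes the dumps follow the `changemaker_v2-YYYYMMDD-HHMMSS.sql` pattern shown above. It skips zero-byte files so that a failed dump is never selected.

```bash
#!/usr/bin/env bash
# Sketch only: latest_backup is an illustrative helper, not a project script.

# latest_backup DIR [PREFIX]
# Prints the path of the newest non-empty PREFIX-YYYYMMDD-HHMMSS.sql in DIR.
latest_backup() {
  local dir=$1 prefix=${2:-changemaker_v2} latest="" f
  for f in "$dir/$prefix"-*.sql; do
    [ -s "$f" ] || continue  # skips empty files and the unmatched glob pattern
    latest=$f                # globs expand in sorted order, so the last one wins
  done
  if [ -z "$latest" ]; then
    echo "no usable $prefix-*.sql backup in $dir" >&2
    return 1
  fi
  echo "$latest"
}

# Example:
#   dump=$(latest_backup backups) &&
#     docker compose exec -T v2-postgres psql -U changemaker -d changemaker_v2 < "$dump"
```
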
**3. Full System Restore (30 minutes)**
```bash
# Stop all services
docker compose down -v  # WARNING: Removes all volumes

# Restore PostgreSQL data
tar -xzf backups/postgres-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore Redis data (if backed up)
tar -xzf backups/redis-data-YYYYMMDD-HHMMSS.tar.gz -C /var/lib/docker/volumes/

# Restore uploads
tar -xzf backups/uploads-YYYYMMDD-HHMMSS.tar.gz -C ./media/

# Restart all services
docker compose up -d
```

**4. Verify Rollback Success**
```bash
# Check all services are up (-a also lists stopped or exited containers)
docker compose ps -a | grep -v "Up"  # Should print only the header row

# Test admin login
curl -X POST http://localhost:4000/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"admin@betteredmonton.org","password":"<password>"}'

# Verify the API reports a healthy database connection
curl http://localhost:4000/api/health | jq '.database'
```

---

## Post-Production Maintenance

**Daily Tasks:**
- Monitor Grafana dashboards for anomalies
- Check Gotify alerts for critical issues
- Verify backups completed successfully (check logs)

**Weekly Tasks:**
- Review API error logs for patterns
- Check disk space usage (an alert should fire when free space drops below 10%)
- Verify SSL certificate validity (at least 30 days remaining)
- Test disaster recovery on staging environment

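The weekly certificate check can be scripted rather than done by hand. The sketch below is one possible approach, not the project's tooling: all helper names are hypothetical, it assumes `openssl` is installed on the machine running the check, and the date conversion uses GNU `date -d` (BSD/macOS `date` needs different flags). The 30-day threshold comes from the task list above.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# days_remaining EXPIRY_EPOCH NOW_EPOCH
# Whole days between two Unix timestamps (negative once expiry has passed).
days_remaining() {
  echo $(( ($1 - $2) / 86400 ))
}

# cert_expiry_epoch HOST
# Prints the notAfter date of HOST's certificate as a Unix timestamp.
cert_expiry_epoch() {
  local host=$1 end
  end=$(openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "$end" ] || return 1
  date -u -d "$end" +%s  # GNU date syntax
}

# check_cert HOST [MIN_DAYS]
# Succeeds when HOST's certificate is valid for at least MIN_DAYS more days.
check_cert() {
  local host=$1 min_days=${2:-30} expiry left
  expiry=$(cert_expiry_epoch "$host") || {
    echo "$host: could not read certificate" >&2
    return 1
  }
  left=$(days_remaining "$expiry" "$(date -u +%s)")
  echo "$host: $left day(s) remaining"
  [ "$left" -ge "$min_days" ]
}

# Example:
#   for s in app api grafana; do
#     check_cert "$s.betteredmonton.org" 30 || echo "ATTENTION: $s"
#   done
```
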
**Monthly Tasks:**
- Review access logs for suspicious activity
- Update Docker images to latest versions (after testing on staging)
- Audit user accounts and remove inactive users
- Review and rotate API keys if necessary

**Quarterly Tasks:**
- Conduct full security audit (penetration testing)
- Review and update rate limiting thresholds based on traffic
- Analyze backup storage costs and adjust retention policy
- Test full disaster recovery procedure with restore drill

---

## Summary

This plan lays out the path from development to production for the Changemaker Lite V2 networking infrastructure. The architecture is fundamentally sound.

**Strengths:**
- Single bridge network simplifies communication
- Pangolin tunnel handles SSL/TLS externally (zero Nginx cert management)
- Comprehensive security headers and policies
- Automated backup script exists
- Monitoring stack with Prometheus/Grafana ready
- Rate limiting on critical endpoints

**Critical Path for Production (in priority order):**
1. Phase 1: Security hardening (change passwords, configure SMTP) - **MUST DO**
2. Phase 3: Pangolin tunnel setup - **MUST DO**
3. Phase 4: Backup automation - **SHOULD DO**
4. Phase 6: Monitoring alerts - **SHOULD DO**
5. Phase 2: Nginx hardening - **NICE TO HAVE**

The remaining phases (network segmentation, resource limits, log aggregation) can be deferred to post-launch improvements without blocking production deployment.

**Estimated Total Implementation Time:** 6-10 hours (can be split across multiple days)

**Estimated Downtime During Deployment:** <5 minutes (only during final container restart)