changemaker.lite/bunker-ops/ROLLOUT_PLAN.md
2026-02-18 17:15:31 -07:00


# Bunker Ops — Staged Rollout Plan
Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.

---
## Current State (Completed)
### Phase 0: Foundation ✅
**Repo changes (v2 branch):**
- `INSTANCE_LABEL`, `BUNKER_OPS_ENABLED`, `BUNKER_OPS_REMOTE_WRITE_URL` env vars added
- Prometheus metrics tagged with `instance` label
- Redis-exporter auth fixed (correct container name + password)
- Backup script pushes metrics when Bunker Ops is enabled
- `docker-compose.override.yml` in `.gitignore`
**Ansible skeleton (`bunker-ops/`):**
- `ansible.cfg` — SSH pipelining, yaml callback, vault password path
- Inventory structure with example host_vars and group defaults
- 3 roles: `common` (OS/Docker/UFW), `changemaker` (full deploy), `monitoring` (Prometheus/remote_write)
- 5 playbooks: `deploy`, `upgrade`, `backup`, `configure`, `monitoring`
- 2 scripts: `bootstrap-vault.sh` (secret generation), `add-instance.sh` (instance scaffolding)
- `env.j2` template mapping all 100+ `.env` variables to Ansible vars
---
## Phase 1: First Managed Instance (Week 1-2)
**Goal:** Validate the full Ansible pipeline end-to-end on a single real instance.
### 1.1 Prepare a test server
- Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
- Set up SSH key access for a `deploy` user with passwordless sudo
- Ensure ports 80, 443, SSH are reachable
### 1.2 Scaffold the instance
```bash
cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass
./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1
```
### 1.3 Run the full deploy
```bash
ansible-playbook playbooks/deploy.yml --limit test-01
```
### 1.4 Validate
- [ ] All containers running (`docker compose ps`)
- [ ] API responds at `/api/health`
- [ ] Admin GUI loads and login works
- [ ] Prisma migrations applied cleanly
- [ ] Backup cron is installed (`crontab -l`)
- [ ] UFW is active with correct rules
- [ ] fail2ban is running
### 1.5 Test day-2 operations
- [ ] `configure.yml` — change a feature flag, verify API restarts
- [ ] `upgrade.yml` — make a Git commit, run upgrade, verify new code is live
- [ ] `backup.yml` — trigger backup, verify archive created
- [ ] Secret rotation — change Redis password in vault, reconfigure, verify connectivity
### 1.6 Fix and iterate
Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:
- Docker image pull timing
- Prisma migration ordering
- Directory permission edge cases
- OS-specific package availability

**Deliverable:** One fully Ansible-managed instance running in production.

---
## Phase 2: Pangolin Tunnel Integration (Week 2-3)
**Goal:** Automate the full Pangolin tunnel setup within Ansible.
### 2.1 Add Pangolin setup task
Create `roles/changemaker/tasks/pangolin.yml`:
- Call Pangolin API to create a site (if `cml_pangolin_api_url` is set)
- Store returned `PANGOLIN_SITE_ID`, `PANGOLIN_NEWT_ID`, `PANGOLIN_NEWT_SECRET` in vault
- Sync resource definitions from `configs/pangolin/resources.yml`
- Set all resources to "Not Protected"
- Restart the Newt container
This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
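A minimal sketch of what `pangolin.yml` could look like, built on `ansible.builtin.uri`. The endpoint paths, the `vault_pangolin_api_token` variable, and the response shape are placeholders, not the real Pangolin API — they only illustrate the create-if-missing pattern:

```yaml
# Illustrative only: Pangolin endpoint paths and response fields are assumed.
- name: Look up existing Pangolin sites (idempotency guard)
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"   # hypothetical listing endpoint
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
  register: pangolin_sites
  when: cml_pangolin_api_url is defined

- name: Create the site only if it does not exist yet
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
  when:
    - cml_pangolin_api_url is defined
    - pangolin_sites.json | selectattr('name', 'equalto', inventory_hostname)
      | list | length == 0

- name: Restart the Newt container
  community.docker.docker_container:
    name: newt
    restart: true
```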
### 2.2 Validate tunnel works
- [ ] Instance accessible via `https://app.<domain>` through Pangolin
- [ ] API accessible via `https://api.<domain>`
- [ ] All 12 subdomains route correctly
- [ ] CORS headers present
### 2.3 Idempotency
Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.

**Deliverable:** Single-command deployment from bare server to publicly accessible instance.

---
## Phase 3: Onboard Existing Instances (Week 3-4)
**Goal:** Migrate manually-installed instances to Ansible management.
### 3.1 Import strategy
For each existing instance that was set up with `config.sh`:
1. **Scaffold host_vars:**
```bash
./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
```
2. **Import existing secrets** from the server's `.env` into the vault:
```bash
# SSH in and extract current secrets:
ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"
# Copy into vault.yml (replace generated values with existing ones)
ansible-vault edit inventory/host_vars/<hostname>/vault.yml
```
3. **Test with `--check --diff`** first:
```bash
ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
```
This shows what `.env` lines would change without actually changing anything.
4. **Apply configuration management:**
```bash
ansible-playbook playbooks/configure.yml --limit <hostname>
```
### 3.2 Avoid disruption
- **Do NOT re-run the `common` role** on production servers that are already set up. Use `--tags env,deploy` to skip OS provisioning.
- **Do NOT re-run the seed** on instances with existing data. The seed task has `failed_when: false` for safety, but verify.
- **Backup first** — always run `playbooks/backup.yml` before importing an existing instance.
### 3.3 Instance inventory target
| Instance | Domain | Status | Tier |
|----------|--------|--------|------|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |
Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.

**Deliverable:** All existing production instances under Ansible management (Tier 1).

---
## Phase 4: Central Observability Server (Week 4-6)
**Goal:** Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.
### 4.1 Create `roles/bunker-ops/`
New role for the central server:
```
roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml
```
**Docker Compose stack:**
| Service | Image | Purpose |
|---------|-------|---------|
| VictoriaMetrics | `victoriametrics/victoria-metrics` | Receives `remote_write` from instances, 12-month retention |
| Grafana | `grafana/grafana` | Fleet dashboards, VM as datasource |
| Uptime Kuma | `louislam/uptime-kuma` | HTTP health monitors per instance |
| Nginx | `nginx:alpine` | TLS termination, auth on write endpoint |
**Key configuration:**
- VictoriaMetrics listens on `:8428` for both writes (`/api/v1/write`) and queries (`/api/v1/query`)
- Nginx authenticates `remote_write` requests with Bearer token
- Grafana auto-provisioned with VictoriaMetrics as default datasource
- Uptime Kuma monitors `https://api.<domain>/api/health` for each instance
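A condensed sketch of what the `docker-compose.yml.j2` for this stack might render to. Volume names, mount paths, and the provisioning directory are illustrative; the VictoriaMetrics flags (`-retentionPeriod` in months, `-httpListenAddr`) are real:

```yaml
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - -retentionPeriod=12        # months, per the 12-month retention target
      - -httpListenAddr=:8428
    volumes:
      - vm-data:/victoria-metrics-data
  grafana:
    image: grafana/grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
  nginx:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
volumes:
  vm-data:
  grafana-data:
  kuma-data:
```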
### 4.2 Create `playbooks/central.yml`
```yaml
- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops
```
### 4.3 Authentication for remote_write
- Generate a shared write token: `openssl rand -hex 32`
- Store in central server's Nginx config (validates incoming `Authorization: Bearer <token>`)
- Distribute same token to all Tier 2 instances via `vault_bunker_ops_remote_write_token`
- This ensures only authorized instances can push metrics
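One way to express the Bearer check in `nginx.conf.j2` (a sketch; the upstream name and the token variable are placeholders for whatever the role templates in):

```nginx
location /api/v1/write {
    # Reject remote_write pushes that don't carry the shared fleet token
    if ($http_authorization != "Bearer {{ bunker_ops_remote_write_token }}") {
        return 401;
    }
    proxy_pass http://victoriametrics:8428;
}
```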
### 4.4 Deploy and verify
```bash
ansible-playbook playbooks/central.yml
```
Verify:
- [ ] VictoriaMetrics accepts a test write in Prometheus text format: `curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1'` (the `/api/v1/write` endpoint itself expects snappy-compressed protobuf, so it can't be exercised with plain curl)
- [ ] Grafana accessible at `https://grafana.ops.bnkserve.org`
- [ ] Uptime Kuma accessible and monitoring test instance

**Deliverable:** Central server running VictoriaMetrics + Grafana + Uptime Kuma.

---
## Phase 5: Fleet Dashboards (Week 6-7)
**Goal:** Build three Grafana dashboards for fleet-wide visibility.
### 5.1 Fleet Overview Dashboard
File: `files/grafana/fleet-overview.json`
**Panels:**
- **Stat row:** Total instances up/down — `count(up{job="changemaker-v2-api"} == 1)`
- **Instance table:** All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
- **Time series — Canvass visits:** `sum(rate(cm_canvass_visits_total[5m])) by (instance)`
- **Time series — Emails sent:** `sum(rate(cm_emails_sent_total[5m])) by (instance)`
- **Time series — HTTP request rate:** `sum(rate(http_requests_total[5m])) by (instance)`
- **Gauge — Fleet email queue:** `sum(cm_email_queue_size) by (instance)`
**Variables:**
- `$instance` — Multi-select, populated from `label_values(up{job="changemaker-v2-api"}, instance)`
### 5.2 Instance Drill-Down Dashboard
File: `files/grafana/instance-drilldown.json`
**Variables:**
- `$instance` — Single-select
**Panel groups:**
- **Health:** API uptime, HTTP error rate, p50/p95/p99 latency
- **Influence:** Emails sent/failed, queue depth, response submissions
- **Canvass:** Active sessions, visits by outcome, shift signups
- **Geocoding:** Cache hit rate, request rate by provider, duration
- **System:** CPU usage, memory, disk I/O, network (from `node_*` metrics)
This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.
### 5.3 Backup Status Dashboard
File: `files/grafana/backup-status.json`
**Panels:**
- **Gauge — Time since last backup:** `time() - cm_backup_last_success_timestamp` per instance. Green < 24h, yellow 24-48h, red > 48h.
- **Table — Backup sizes:** `cm_backup_size_bytes` per instance with sparkline trend
- **Alert rule — BackupStale:** Fires when any instance hasn't backed up in 25 hours (1h grace past daily cron)
### 5.4 Auto-provisioning
Grafana dashboards auto-provisioned from JSON files via a `dashboards.yml` provisioner config, same pattern as the existing per-instance Grafana setup.
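The provisioner config can be as small as the following (the dashboard path is illustrative; the schema is Grafana's standard file provider):

```yaml
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: bunker-ops
    folder: Fleet
    type: file
    options:
      path: /var/lib/grafana/dashboards   # where fleet-overview.json etc. land
```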
**Deliverable:** Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.

---
## Phase 6: Promote Instances to Tier 2 (Week 7-8)
**Goal:** Enable fleet observability on all managed instances.
### 6.1 For each instance
1. Update `host_vars/<hostname>/main.yml`:
```yaml
bunker_ops_enabled: true
bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
```
2. Add write token to `host_vars/<hostname>/vault.yml`:
```yaml
vault_bunker_ops_remote_write_token: "<shared-token>"
```
3. Apply:
```bash
ansible-playbook playbooks/monitoring.yml --limit <hostname>
```
### 6.2 Verify data flow
- Check VictoriaMetrics for incoming data: `curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'`
- Check Grafana fleet overview shows the new instance
- Verify backup metrics appear after next backup run
### 6.3 Bandwidth audit
Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.
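These figures can be sanity-checked with quick shell arithmetic. The ~60 bytes per sample is an assumed average for on-the-wire size after protocol overhead, not a measurement:

```shell
# Rough bandwidth estimate: 50 series scraped every 15s, ~60 B/sample assumed
series=50
interval=15          # seconds between samples
bytes_per_sample=60  # assumption, not measured
instances=10

samples_per_min=$(( series * 60 / interval ))                       # 200
mb_per_day=$(( samples_per_min * bytes_per_sample * 1440 / 1000000 ))
echo "per instance: ${mb_per_day} MB/day; fleet: $(( mb_per_day * instances )) MB/day"
```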
**Deliverable:** All instances reporting to central dashboards.

---
## Phase 7: Alerting & Notifications (Week 8-9)
**Goal:** Central alerting for fleet-wide issues.
### 7.1 Alert rules on central VictoriaMetrics
Create `roles/bunker-ops/templates/alerts.yml.j2`:
| Alert | Condition | Severity |
|-------|-----------|----------|
| `InstanceDown` | `up{job="changemaker-v2-api"} == 0` for 5m | critical |
| `HighErrorRate` | `rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1` | warning |
| `EmailQueueBacklog` | `cm_email_queue_size > 100` for 15m | warning |
| `BackupStale` | `time() - cm_backup_last_success_timestamp > 90000` (25h) | critical |
| `DiskSpaceLow` | `node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1` | critical |
| `HighMemoryUsage` | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1` for 10m | warning |
| `CanvassSessionAbandoned` | `cm_active_canvass_sessions > 20` for 1h | info |
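In vmalert's rule format (the same schema as Prometheus rule files), the first and fourth rows of the table look roughly like the sketch below. Note that because this lives in a `.j2` template, the `{{ $labels.* }}` annotations need `{% raw %}` guards so Ansible's Jinja pass doesn't consume them:

```yaml
groups:
  - name: fleet
    rules:
      - alert: InstanceDown
        expr: up{job="changemaker-v2-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} API is down"
      - alert: BackupStale
        expr: time() - cm_backup_last_success_timestamp > 90000
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not backed up in 25h"
```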
### 7.2 Notification channels
Central Alertmanager routes alerts to:
- **Gotify** — Push notifications to admin phone
- **Email** — Summary digests to fleet admin email
- **Webhook** — Optional Rocket.Chat / Slack integration
### 7.3 Silence rules
- Suppress `InstanceDown` during planned maintenance windows
- Group alerts by instance to avoid notification storms
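Grouping by instance maps onto Alertmanager's route config roughly as follows (receiver names are placeholders for whatever Gotify/email receivers get defined):

```yaml
route:
  group_by: [instance]      # one notification per instance, not per alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: gotify
  routes:
    - matchers: ['severity="critical"']
      receiver: gotify
    - matchers: ['severity="warning"']
      receiver: email-digest
```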
**Deliverable:** Automated alerts for instance health, backups, and resource exhaustion.

---
## Phase 8: Upgrade Automation & CI (Week 9-11)
**Goal:** Streamline the upgrade pipeline.
### 8.1 Gitea webhook → n8n → Ansible
When a new commit is pushed to the `v2` branch on the central Gitea:
1. **Gitea** fires a webhook to **n8n**
2. **n8n** workflow triggers `ansible-playbook playbooks/upgrade.yml`
3. Rolling upgrade proceeds (25% batches)
4. Health checks gate each batch
5. n8n sends a summary notification
### 8.2 Canary deployment
Add a `canary` group to inventory:
```yaml
all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...
```
New `playbooks/canary-upgrade.yml`:
1. Upgrade canary instance first
2. Wait 30 minutes
3. Run health checks
4. If healthy, proceed with `upgrade.yml` on remaining instances
5. If unhealthy, alert and stop
### 8.3 Rollback playbook
Create `playbooks/rollback.yml`:
- `git checkout <previous-tag>` on the instance
- `docker compose up -d --build`
- Run health checks
- Requires knowing the previous good commit (store in a fact file per host)
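A sketch of what `rollback.yml` could look like under these assumptions — the fact-file path `/opt/changemaker-lite/.last_good_commit` (written by the upgrade playbook) and the health-check URL/port are placeholders:

```yaml
- name: Roll back to last known-good commit
  hosts: changemaker_instances
  become: true
  serial: 1
  tasks:
    - name: Read the commit recorded during the last healthy upgrade
      ansible.builtin.slurp:
        src: /opt/changemaker-lite/.last_good_commit   # assumed fact file
      register: last_good

    - name: Check out that commit
      ansible.builtin.command:
        cmd: "git checkout {{ last_good.content | b64decode | trim }}"
        chdir: /opt/changemaker-lite

    - name: Rebuild and restart the stack
      ansible.builtin.command:
        cmd: docker compose up -d --build
        chdir: /opt/changemaker-lite

    - name: Wait for the API to come back
      ansible.builtin.uri:
        url: "http://localhost:3000/api/health"   # placeholder port
      register: health
      retries: 10
      delay: 15
      until: health.status == 200
```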
**Deliverable:** Semi-automated upgrade pipeline with canary gates and rollback capability.

---
## Phase 9: Self-Service Instance Provisioning (Week 11-13)
**Goal:** Enable clients to request and receive a new instance with minimal operator intervention.
### 9.1 Provisioning API
Build a lightweight FastAPI or Express service on the central server:
**Endpoints:**
- `POST /api/instances` — Create a new instance (accepts domain, features, tier)
- `GET /api/instances` — List all instances with status
- `GET /api/instances/:id/status` — Health + metrics summary
- `DELETE /api/instances/:id` — Decommission
**Workflow:**
1. API receives request with domain, SSH host, feature flags
2. Runs `add-instance.sh` to scaffold host_vars
3. Triggers `ansible-playbook playbooks/deploy.yml --limit <hostname>`
4. Monitors deployment progress
5. Returns status when deployment completes
### 9.2 Fleet admin dashboard
A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):
- Instance list with health status
- One-click upgrade, backup, configure
- New instance wizard
- Grafana iframe embeds for metrics
### 9.3 DNS automation
If using Pangolin for all instances:
- Pangolin handles DNS + TLS automatically
- The provisioning API creates Pangolin resources as part of deploy
If using Cloudflare or other DNS:
- Add a `roles/dns/` role with Cloudflare API integration
- Automatically create A/CNAME records for all subdomains

**Deliverable:** Operator can provision a new instance with a single API call or form submission.

---
## Phase 10: Multi-Tenant Hardening (Week 13-16)
**Goal:** Security and isolation for a fleet of independent client instances.
### 10.1 Network isolation
Each instance runs on its own server — already isolated at the OS level. Additional hardening:
- UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
- No inter-instance SSH access
- Central server can SSH to instances, not vice versa
### 10.2 Secret rotation schedule
Automate periodic secret rotation:
| Secret | Rotation frequency | Method |
|--------|-------------------|--------|
| JWT access secret | Quarterly | vault edit + configure playbook |
| Database passwords | Annually | vault edit + full redeploy |
| Redis password | Annually | vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |
Create a `playbooks/rotate-secrets.yml` that generates new secrets and applies them.
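One possible shape for the rotation playbook, shown for the Redis password. This sketch stops short of rewriting the vault file in place: it prints an `ansible-vault encrypt_string` block for the operator to paste into `vault.yml` before running `configure.yml`. Variable names follow the conventions used elsewhere in this plan:

```yaml
- name: Rotate the Redis password for one instance
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Generate a fresh secret
      ansible.builtin.set_fact:
        new_password: "{{ lookup('password', '/dev/null length=32 chars=ascii_letters,digits') }}"

    - name: Produce an encrypted block for the host's vault.yml
      ansible.builtin.command: >-
        ansible-vault encrypt_string --name vault_redis_password
        {{ new_password | quote }}
      register: encrypted

    - name: Show the block to paste, then run configure.yml --limit <hostname>
      ansible.builtin.debug:
        msg: "{{ encrypted.stdout }}"
```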
### 10.3 Audit logging
- Ansible logs all operations to a central log file
- Each playbook run produces a summary (host, timestamp, changes made)
- Integrate with Git: all inventory changes are committed to a private repo
### 10.4 Compliance documentation
For each instance, Ansible can generate:
- Inventory of services and versions
- Security posture report (UFW rules, fail2ban status, TLS cert expiry)
- Backup compliance (last backup date, retention policy)
- Data residency confirmation (server location, no PII in metrics)

**Deliverable:** Hardened fleet with automated rotation, audit trail, and compliance artifacts.

---
## Timeline Summary
| Phase | Duration | Milestone |
|-------|----------|-----------|
| 0: Foundation | ✅ Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |
**Total: ~16 weeks from foundation to fully hardened fleet.**

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.

---
## FOSS Stack Summary
Every component is Free and Open Source Software:
| Component | License | Role in Stack |
|-----------|---------|---------------|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |
No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.

---
## Risk Register
| Risk | Impact | Mitigation |
|------|--------|------------|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | `remote_write` WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | `serial: 1` for deploys, `--check --diff` before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run `migrate deploy` (applies pending only), never `migrate dev` in production |
| Instance disk full | Backup fails, containers crash | `BackupStale` + `DiskSpaceLow` alerts, retention cleanup |