# Bunker Ops — Staged Rollout Plan

Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.

---

## Current State (Completed)

### Phase 0: Foundation ✅

**Repo changes (v2 branch):**

- `INSTANCE_LABEL`, `BUNKER_OPS_ENABLED`, `BUNKER_OPS_REMOTE_WRITE_URL` env vars added
- Prometheus metrics tagged with `instance` label
- Redis-exporter auth fixed (correct container name + password)
- Backup script pushes metrics when Bunker Ops is enabled
- `docker-compose.override.yml` in `.gitignore`

**Ansible skeleton (`bunker-ops/`):**

- `ansible.cfg` — SSH pipelining, yaml callback, vault password path
- Inventory structure with example host_vars and group defaults
- 3 roles: `common` (OS/Docker/UFW), `changemaker` (full deploy), `monitoring` (Prometheus/remote_write)
- 5 playbooks: `deploy`, `upgrade`, `backup`, `configure`, `monitoring`
- 2 scripts: `bootstrap-vault.sh` (secret generation), `add-instance.sh` (instance scaffolding)
- `env.j2` template mapping all 100+ `.env` variables to Ansible vars

---

## Phase 1: First Managed Instance (Week 1-2)

**Goal:** Validate the full Ansible pipeline end-to-end on a single real instance.

### 1.1 Prepare a test server

- Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
- Set up SSH key access for a `deploy` user with passwordless sudo
- Ensure ports 80, 443, and SSH are reachable

### 1.2 Scaffold the instance

```bash
cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass
./scripts/add-instance.sh test-01 test.cmlite.org --tier 1
```

### 1.3 Run the full deploy

```bash
ansible-playbook playbooks/deploy.yml --limit test-01
```

### 1.4 Validate

- [ ] All containers running (`docker compose ps`)
- [ ] API responds at `/api/health`
- [ ] Admin GUI loads and login works
- [ ] Prisma migrations applied cleanly
- [ ] Backup cron is installed (`crontab -l`)
- [ ] UFW is active with correct rules
- [ ] fail2ban is running

### 1.5 Test day-2 operations

- [ ] `configure.yml` — change a feature flag, verify the API restarts
- [ ] `upgrade.yml` — make a Git commit, run the upgrade, verify the new code is live
- [ ] `backup.yml` — trigger a backup, verify the archive is created
- [ ] Secret rotation — change the Redis password in the vault, reconfigure, verify connectivity

### 1.6 Fix and iterate

Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:

- Docker image pull timing
- Prisma migration ordering
- Directory permission edge cases
- OS-specific package availability

**Deliverable:** One fully Ansible-managed instance running in production.

---

## Phase 2: Pangolin Tunnel Integration (Week 2-3)

**Goal:** Automate the full Pangolin tunnel setup within Ansible.

### 2.1 Add Pangolin setup task

Create `roles/changemaker/tasks/pangolin.yml`:

- Call the Pangolin API to create a site (if `cml_pangolin_api_url` is set)
- Store the returned `PANGOLIN_SITE_ID`, `PANGOLIN_NEWT_ID`, and `PANGOLIN_NEWT_SECRET` in the vault
- Sync resource definitions from `configs/pangolin/resources.yml`
- Set all resources to "Not Protected"
- Restart the Newt container

This replaces the manual Pangolin setup flow that currently lives in the admin GUI.

### 2.2 Validate the tunnel works

- [ ] Instance accessible via `https://app.<domain>` through Pangolin
- [ ] API accessible via `https://api.<domain>`
- [ ] All 12 subdomains route correctly
- [ ] CORS headers present

### 2.3 Idempotency

Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing sites and resources before creating new ones, roughly as sketched below.
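A minimal sketch of that existence check using `ansible.builtin.uri`. The endpoint path, response shape, and the `vault_pangolin_api_token` / `cml_domain` variable names are assumptions about the Pangolin API, not confirmed details; only `cml_pangolin_api_url` comes from the plan above.

```yaml
# roles/changemaker/tasks/pangolin.yml (excerpt, sketch only).
# Endpoint paths, response shape, and the token/domain variable names are assumptions.
- name: Look up existing Pangolin sites
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"                    # hypothetical endpoint
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"   # hypothetical var
    return_content: true
  register: pangolin_sites

- name: Create the site only if it does not already exist
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"                    # hypothetical endpoint
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
      domain: "{{ cml_domain }}"                               # hypothetical var
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
    status_code: [200, 201]
  when: >-
    pangolin_sites.json
    | selectattr('name', 'equalto', inventory_hostname)
    | list | length == 0
```

The same pattern (list first, create only what is missing) applies to the resource sync step.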
**Deliverable:** Single-command deployment from bare server to publicly accessible instance.

---

## Phase 3: Onboard Existing Instances (Week 3-4)

**Goal:** Migrate manually-installed instances to Ansible management.

### 3.1 Import strategy

For each existing instance that was set up with `config.sh`:

1. **Scaffold host_vars:**

   ```bash
   ./scripts/add-instance.sh <instance> <domain> --tier 1
   ```

2. **Import existing secrets** from the server's `.env` into the vault:

   ```bash
   # SSH in and extract the current secrets:
   ssh deploy@<host> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"

   # Copy into vault.yml (replace generated values with the existing ones)
   ansible-vault edit inventory/host_vars/<instance>/vault.yml
   ```

3. **Test with `--check --diff` first:**

   ```bash
   ansible-playbook playbooks/configure.yml --limit <instance> --check --diff
   ```

   This shows which `.env` lines would change without actually changing anything.

4. **Apply configuration management:**

   ```bash
   ansible-playbook playbooks/configure.yml --limit <instance>
   ```

### 3.2 Avoid disruption

- **Do NOT re-run the `common` role** on production servers that are already set up. Use `--tags env,deploy` to skip OS provisioning.
- **Do NOT re-run the seed** on instances with existing data. The seed task has `failed_when: false` for safety, but verify.
- **Backup first** — always run `playbooks/backup.yml` before importing an existing instance.

### 3.3 Instance inventory target

| Instance | Domain | Status | Tier |
|----------|--------|--------|------|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |

Populate this table as instances are onboarded. Aim for 3-5 instances managed by the end of Phase 3.

**Deliverable:** All existing production instances under Ansible management (Tier 1).
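For the secret import in step 2 of 3.1, the populated `vault.yml` is just flat key/value YAML before encryption. The variable names below are illustrative assumptions; the authoritative names are whatever `env.j2` actually consumes.

```yaml
# inventory/host_vars/edmonton-prod/vault.yml (illustrative shape only).
# Variable names are assumptions; copy the values verbatim from the server's existing .env.
vault_postgres_password: "existing value from .env"
vault_redis_password: "existing value from .env"
vault_jwt_access_secret: "existing value from .env"
vault_smtp_password: "existing value from .env"
```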
---

## Phase 4: Central Observability Server (Week 4-6)

**Goal:** Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.

### 4.1 Create `roles/bunker-ops/`

New role for the central server:

```
roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml
```

**Docker Compose stack:**

| Service | Image | Purpose |
|---------|-------|---------|
| VictoriaMetrics | `victoriametrics/victoria-metrics` | Receives `remote_write` from instances, 12-month retention |
| Grafana | `grafana/grafana` | Fleet dashboards, VM as datasource |
| Uptime Kuma | `louislam/uptime-kuma` | HTTP health monitors per instance |
| Nginx | `nginx:alpine` | TLS termination, auth on write endpoint |

**Key configuration:**

- VictoriaMetrics listens on `:8428`; instances push to `/api/v1/write`, and Grafana queries it through the Prometheus-compatible API
- Nginx authenticates `remote_write` requests with a Bearer token
- Grafana auto-provisioned with VictoriaMetrics as the default datasource
- Uptime Kuma monitors `https://api.<domain>/api/health` for each instance

### 4.2 Create `playbooks/central.yml`

```yaml
- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops
```

### 4.3 Authentication for remote_write

- Generate a shared write token: `openssl rand -hex 32`
- Store it in the central server's Nginx config (validates incoming `Authorization: Bearer <token>` headers)
- Distribute the same token to all Tier 2 instances via `vault_bunker_ops_remote_write_token`
- This ensures only authorized instances can push metrics

### 4.4 Deploy and verify

```bash
ansible-playbook playbooks/central.yml
```

Verify:

- [ ] VictoriaMetrics accepts a test write (Prometheus text format via the import endpoint): `curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1'`
- [ ] Grafana accessible at `https://grafana.ops.bnkserve.org`
- [ ] Uptime Kuma accessible and monitoring the test instance

**Deliverable:** Central server running VictoriaMetrics + Grafana + Uptime Kuma.

---

## Phase 5: Fleet Dashboards (Week 6-7)

**Goal:** Build three Grafana dashboards for fleet-wide visibility.

### 5.1 Fleet Overview Dashboard

File: `files/grafana/fleet-overview.json`

**Panels:**

- **Stat row:** Total instances up/down — `count(up{job="changemaker-v2-api"} == 1)`
- **Instance table:** All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
- **Time series — Canvass visits:** `sum(rate(cm_canvass_visits_total[5m])) by (instance)`
- **Time series — Emails sent:** `sum(rate(cm_emails_sent_total[5m])) by (instance)`
- **Time series — HTTP request rate:** `sum(rate(http_requests_total[5m])) by (instance)`
- **Gauge — Fleet email queue:** `sum(cm_email_queue_size) by (instance)`

**Variables:**

- `$instance` — Multi-select, populated from `label_values(up{job="changemaker-v2-api"}, instance)`

### 5.2 Instance Drill-Down Dashboard

File: `files/grafana/instance-drilldown.json`

**Variables:**

- `$instance` — Single-select

**Panel groups:**

- **Health:** API uptime, HTTP error rate, p50/p95/p99 latency
- **Influence:** Emails sent/failed, queue depth, response submissions
- **Canvass:** Active sessions, visits by outcome, shift signups
- **Geocoding:** Cache hit rate, request rate by provider, duration
- **System:** CPU usage, memory, disk I/O, network (from `node_*` metrics)

This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.

### 5.3 Backup Status Dashboard

File: `files/grafana/backup-status.json`

**Panels:**

- **Gauge — Time since last backup:** `time() - cm_backup_last_success_timestamp` per instance. Green < 24h, yellow < 48h, red > 48h.
- **Table — Backup sizes:** `cm_backup_size_bytes` per instance with sparkline trend
- **Alert rule — BackupStale:** Fires when any instance hasn't backed up in 25 hours (1h grace past the daily cron)
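The `BackupStale` rule referenced above can be written as an ordinary Prometheus-format rule file that vmalert evaluates (Phase 7 defines the full rule set); a sketch of the rendered rule, using the expression and threshold from this plan:

```yaml
# Sketch of the rendered BackupStale rule (Prometheus rule-file format, vmalert-compatible).
groups:
  - name: fleet-backups
    rules:
      - alert: BackupStale
        # 90000s = 24h daily cron + 1h grace, matching the threshold above
        expr: time() - cm_backup_last_success_timestamp > 90000
        labels:
          severity: critical
        annotations:
          summary: "Backup stale on {{ $labels.instance }}"
          description: "No successful backup recorded for more than 25 hours."
```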
### 5.4 Auto-provisioning

Grafana dashboards are auto-provisioned from the JSON files via a `dashboards.yml` provisioner config, following the same pattern as the existing per-instance Grafana setup.

**Deliverable:** Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.

---

## Phase 6: Promote Instances to Tier 2 (Week 7-8)

**Goal:** Enable fleet observability on all managed instances.

### 6.1 For each instance

1. Update `host_vars/<instance>/main.yml`:

   ```yaml
   bunker_ops_enabled: true
   bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
   ```

2. Add the write token to `host_vars/<instance>/vault.yml`:

   ```yaml
   vault_bunker_ops_remote_write_token: "<token>"
   ```

3. Apply:

   ```bash
   ansible-playbook playbooks/monitoring.yml --limit <instance>
   ```

### 6.2 Verify data flow

- Check VictoriaMetrics for incoming data: `curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<instance>"}'`
- Check that the Grafana fleet overview shows the new instance
- Verify backup metrics appear after the next backup run

### 6.3 Bandwidth audit

Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.

**Deliverable:** All instances reporting to central dashboards.

---

## Phase 7: Alerting & Notifications (Week 8-9)

**Goal:** Central alerting for fleet-wide issues.

### 7.1 Alert rules on central VictoriaMetrics

Create `roles/bunker-ops/templates/alerts.yml.j2`:

| Alert | Condition | Severity |
|-------|-----------|----------|
| `InstanceDown` | `up{job="changemaker-v2-api"} == 0` for 5m | critical |
| `HighErrorRate` | `rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1` | warning |
| `EmailQueueBacklog` | `cm_email_queue_size > 100` for 15m | warning |
| `BackupStale` | `time() - cm_backup_last_success_timestamp > 90000` (25h) | critical |
| `DiskSpaceLow` | `node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1` | critical |
| `HighMemoryUsage` | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1` for 10m | warning |
| `CanvassSessionAbandoned` | `cm_active_canvass_sessions > 20` for 1h | info |

### 7.2 Notification channels

Central Alertmanager routes alerts to:

- **Gotify** — Push notifications to admin phone
- **Email** — Summary digests to the fleet admin email
- **Webhook** — Optional Rocket.Chat / Slack integration

### 7.3 Silence rules

- Suppress `InstanceDown` during planned maintenance windows
- Group alerts by instance to avoid notification storms

**Deliverable:** Automated alerts for instance health, backups, and resource exhaustion.

---

## Phase 8: Upgrade Automation & CI (Week 9-11)

**Goal:** Streamline the upgrade pipeline.

### 8.1 Gitea webhook → n8n → Ansible

When a new commit is pushed to the `v2` branch on the central Gitea:

1. **Gitea** fires a webhook to **n8n**
2. **n8n** workflow triggers `ansible-playbook playbooks/upgrade.yml`
3. Rolling upgrade proceeds (25% batches)
4. Health checks gate each batch
5. n8n sends a summary notification

### 8.2 Canary deployment

Add a `canary` group to the inventory:

```yaml
all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...
```

New `playbooks/canary-upgrade.yml` (sketched below):

1. Upgrade the canary instance first
2. Wait 30 minutes
3. Run health checks
4. If healthy, proceed with `upgrade.yml` on the remaining instances
5. If unhealthy, alert and stop
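A structural sketch of that playbook follows. The role task file, the `cml_domain` variable name, and the health URL are assumptions about how the existing upgrade logic is organized; only the group names, the 30-minute soak, and the handoff to `upgrade.yml` come from the steps above.

```yaml
# playbooks/canary-upgrade.yml (structural sketch only).
# Assumes the upgrade steps live in an "upgrade" task file inside the changemaker
# role and that cml_domain holds the instance's public domain.
- name: Upgrade and soak the canary
  hosts: canary
  become: true
  any_errors_fatal: true
  tasks:
    - name: Apply the upgrade to the canary
      ansible.builtin.include_role:
        name: changemaker
        tasks_from: upgrade                              # assumed task file name

    - name: Soak for 30 minutes
      ansible.builtin.pause:
        minutes: 30

    - name: Health-check the canary API
      ansible.builtin.uri:
        url: "https://api.{{ cml_domain }}/api/health"   # assumed variable name
      register: canary_health
      until: canary_health.status == 200
      retries: 3
      delay: 30

# A failed canary ends the run here, so the fleet-wide upgrade below never starts.
- import_playbook: upgrade.yml
```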
### 8.3 Rollback playbook

Create `playbooks/rollback.yml`:

- `git checkout <commit>` on the instance
- `docker compose up -d --build`
- Run health checks
- Requires knowing the previous good commit (store it in a fact file per host)

**Deliverable:** Semi-automated upgrade pipeline with canary gates and rollback capability.

---

## Phase 9: Self-Service Instance Provisioning (Week 11-13)

**Goal:** Enable clients to request and receive a new instance with minimal operator intervention.

### 9.1 Provisioning API

Build a lightweight FastAPI or Express service on the central server:

**Endpoints:**

- `POST /api/instances` — Create a new instance (accepts domain, features, tier)
- `GET /api/instances` — List all instances with status
- `GET /api/instances/:id/status` — Health + metrics summary
- `DELETE /api/instances/:id` — Decommission

**Workflow:**

1. API receives a request with domain, SSH host, and feature flags
2. Runs `add-instance.sh` to scaffold host_vars
3. Triggers `ansible-playbook playbooks/deploy.yml --limit <instance>`
4. Monitors deployment progress
5. Returns status when the deployment completes

### 9.2 Fleet admin dashboard

A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):

- Instance list with health status
- One-click upgrade, backup, configure
- New instance wizard
- Grafana iframe embeds for metrics

### 9.3 DNS automation

If using Pangolin for all instances:

- Pangolin handles DNS + TLS automatically
- The provisioning API creates Pangolin resources as part of deploy

If using Cloudflare or other DNS:

- Add a `roles/dns/` role with Cloudflare API integration
- Automatically create A/CNAME records for all subdomains

**Deliverable:** Operator can provision a new instance with a single API call or form submission.

---

## Phase 10: Multi-Tenant Hardening (Week 13-16)

**Goal:** Security and isolation for a fleet of independent client instances.

### 10.1 Network isolation

Each instance runs on its own server — already isolated at the OS level. Additional hardening:

- UFW rules restrict outbound traffic to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
- No inter-instance SSH access
- Central server can SSH to instances, not vice versa

### 10.2 Secret rotation schedule

Automate periodic secret rotation:

| Secret | Rotation frequency | Method |
|--------|-------------------|--------|
| JWT access secret | Quarterly | vault edit + configure playbook |
| Database passwords | Annually | vault edit + full redeploy |
| Redis password | Annually | vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |

Create a `playbooks/rotate-secrets.yml` that generates new secrets and applies them.
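One possible shape for that playbook, assuming the operator has already written the new values into the relevant `vault.yml` files and that the rendered `.env` lives at `/opt/changemaker-lite/.env` (the path used in Phase 3); the template path and service names here are assumptions:

```yaml
# playbooks/rotate-secrets.yml (sketch). Assumes the new secret values are already
# in each instance's vault.yml; this play only re-renders .env and restarts the
# services that read it. Template path and service names are assumptions.
- name: Apply rotated secrets across the fleet
  hosts: changemaker_instances
  become: true
  serial: 1                                   # one instance at a time
  tasks:
    - name: Re-render .env from the vaulted variables
      ansible.builtin.template:
        src: ../roles/changemaker/templates/env.j2
        dest: /opt/changemaker-lite/.env
        owner: root
        group: root
        mode: "0600"
      register: env_file

    - name: Restart services to pick up the new secrets
      ansible.builtin.command:
        cmd: docker compose up -d --force-recreate api
        chdir: /opt/changemaker-lite
      when: env_file is changed
```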
### 10.3 Audit logging

- Ansible logs all operations to a central log file
- Each playbook run produces a summary (host, timestamp, changes made)
- Integrate with Git: all inventory changes are committed to a private repo

### 10.4 Compliance documentation

For each instance, Ansible can generate:

- Inventory of services and versions
- Security posture report (UFW rules, fail2ban status, TLS cert expiry)
- Backup compliance (last backup date, retention policy)
- Data residency confirmation (server location, no PII in metrics)

**Deliverable:** Hardened fleet with automated rotation, audit trail, and compliance artifacts.

---

## Timeline Summary

| Phase | Duration | Milestone |
|-------|----------|-----------|
| 0: Foundation | ✅ Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |

**Total: ~16 weeks from foundation to fully hardened fleet.**

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.

---

## FOSS Stack Summary

Every component is Free and Open Source Software:

| Component | License | Role in Stack |
|-----------|---------|---------------|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |

No proprietary SaaS dependencies. The entire fleet can run air-gapped after the initial image pulls.

---

## Risk Register

| Risk | Impact | Mitigation |
|------|--------|------------|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | `remote_write` WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | `serial: 1` for deploys, `--check --diff` before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run `migrate deploy` (applies pending only), never `migrate dev` in production |
| Instance disk full | Backup fails, containers crash | `BackupStale` + `DiskSpaceLow` alerts, retention cleanup |
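One mitigation from the table, pre-pulling images before an upgrade window, is cheap to script; a sketch, where the image list is illustrative only (the real list comes from each instance's compose file):

```yaml
# Sketch: pre-pull images ahead of an upgrade to avoid Docker Hub rate limits.
# The image list below is illustrative only.
- name: Pre-pull container images before an upgrade window
  hosts: changemaker_instances
  become: true
  vars:
    prepull_images:
      - nginx:alpine
      - redis:7
  tasks:
    - name: Pull each image in advance
      community.docker.docker_image:
        name: "{{ item }}"
        source: pull
      loop: "{{ prepull_images }}"
```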