Bunker Ops — Staged Rollout Plan
Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.
Current State (Completed)
Phase 0: Foundation ✅
Repo changes (v2 branch):
- `INSTANCE_LABEL`, `BUNKER_OPS_ENABLED`, `BUNKER_OPS_REMOTE_WRITE_URL` env vars added
- Prometheus metrics tagged with an `instance` label
- Redis-exporter auth fixed (correct container name + password)
- Backup script pushes metrics when Bunker Ops is enabled
- `docker-compose.override.yml` in `.gitignore`

Ansible skeleton (`bunker-ops/`):
- `ansible.cfg` — SSH pipelining, yaml callback, vault password path
- Inventory structure with example host_vars and group defaults
- 3 roles: `common` (OS/Docker/UFW), `changemaker` (full deploy), `monitoring` (Prometheus/remote_write)
- 5 playbooks: `deploy`, `upgrade`, `backup`, `configure`, `monitoring`
- 2 scripts: `bootstrap-vault.sh` (secret generation), `add-instance.sh` (instance scaffolding)
- `env.j2` template mapping all 100+ `.env` variables to Ansible vars
Phase 1: First Managed Instance (Week 1-2)
Goal: Validate the full Ansible pipeline end-to-end on a single real instance.
1.1 Prepare a test server
- Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
- Set up SSH key access for a `deploy` user with passwordless sudo
- Ensure ports 80, 443, and SSH are reachable
1.2 Scaffold the instance
```bash
cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass
./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1
```
1.3 Run the full deploy
```bash
ansible-playbook playbooks/deploy.yml --limit test-01
```
1.4 Validate
- All containers running (`docker compose ps`)
- API responds at `/api/health`
- Admin GUI loads and login works
- Prisma migrations applied cleanly
- Backup cron is installed (`crontab -l`)
- UFW is active with correct rules
- fail2ban is running
1.5 Test day-2 operations
- `configure.yml` — change a feature flag, verify API restarts
- `upgrade.yml` — make a Git commit, run upgrade, verify new code is live
- `backup.yml` — trigger backup, verify archive created
- Secret rotation — change Redis password in vault, reconfigure, verify connectivity
1.6 Fix and iterate
Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:
- Docker image pull timing
- Prisma migration ordering
- Directory permission edge cases
- OS-specific package availability
Deliverable: One fully Ansible-managed instance running in production.
Phase 2: Pangolin Tunnel Integration (Week 2-3)
Goal: Automate the full Pangolin tunnel setup within Ansible.
2.1 Add Pangolin setup task
Create `roles/changemaker/tasks/pangolin.yml`:
- Call the Pangolin API to create a site (if `cml_pangolin_api_url` is set)
- Store the returned `PANGOLIN_SITE_ID`, `PANGOLIN_NEWT_ID`, `PANGOLIN_NEWT_SECRET` in the vault
- Sync resource definitions from `configs/pangolin/resources.yml`
- Set all resources to "Not Protected"
- Restart the Newt container
This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
2.2 Validate tunnel works
- Instance accessible via `https://app.<domain>` through Pangolin
- API accessible via `https://api.<domain>`
- All 12 subdomains route correctly
- CORS headers present
2.3 Idempotency
Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.
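A minimal sketch of what that check-before-create logic could look like with `ansible.builtin.uri`. The endpoint paths, response shape, and variables such as `vault_pangolin_api_token` and `cml_domain` are illustrative assumptions, not the documented Pangolin API:

```yaml
# Hypothetical idempotency sketch for roles/changemaker/tasks/pangolin.yml.
# Endpoint paths and response fields are placeholders — adjust to the real Pangolin API.
- name: Look up existing Pangolin sites
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
    return_content: true
  register: pangolin_sites

- name: Create the site only if it does not already exist
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
      domain: "{{ cml_domain }}"
    status_code: [200, 201]
  # Assumes the list endpoint returns a JSON array of site objects with a "domain" field
  when: pangolin_sites.json | selectattr('domain', 'equalto', cml_domain) | list | length == 0
```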
Deliverable: Single-command deployment from bare server to publicly accessible instance.
Phase 3: Onboard Existing Instances (Week 3-4)
Goal: Migrate manually-installed instances to Ansible management.
3.1 Import strategy
For each existing instance that was set up with `config.sh`:

1. Scaffold host_vars:

   ```bash
   ./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
   ```

2. Import existing secrets from the server's `.env` into the vault:

   ```bash
   # SSH in and extract current secrets:
   ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"
   # Copy into vault.yml (replace generated values with existing ones)
   ansible-vault edit inventory/host_vars/<hostname>/vault.yml
   ```

3. Test with `--check --diff` first:

   ```bash
   ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
   ```

   This shows which `.env` lines would change without actually changing anything.

4. Apply configuration management:

   ```bash
   ansible-playbook playbooks/configure.yml --limit <hostname>
   ```
3.2 Avoid disruption
- Do NOT re-run the `common` role on production servers that are already set up. Use `--tags env,deploy` to skip OS provisioning.
- Do NOT re-run the seed on instances with existing data. The seed task has `failed_when: false` for safety, but verify.
- Backup first — always run `playbooks/backup.yml` before importing an existing instance.
3.3 Instance inventory target
| Instance | Domain | Status | Tier |
|---|---|---|---|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |
Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.
Deliverable: All existing production instances under Ansible management (Tier 1).
Phase 4: Central Observability Server (Week 4-6)
Goal: Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.
4.1 Create roles/bunker-ops/
New role for the central server:
```
roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml
```
Docker Compose stack:
| Service | Image | Purpose |
|---|---|---|
| VictoriaMetrics | `victoriametrics/victoria-metrics` | Receives remote_write from instances, 12-month retention |
| Grafana | `grafana/grafana` | Fleet dashboards, VM as datasource |
| Uptime Kuma | `louislam/uptime-kuma` | HTTP health monitors per instance |
| Nginx | `nginx:alpine` | TLS termination, auth on write endpoint |
Key configuration (a Compose sketch follows this list):
- VictoriaMetrics listens on `:8428` for writes, `:8428/select` for queries
- Nginx authenticates `remote_write` requests with a Bearer token
- Grafana auto-provisioned with VictoriaMetrics as the default datasource
- Uptime Kuma monitors `https://api.<domain>/api/health` for each instance
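A hedged sketch of what `templates/docker-compose.yml.j2` could contain for this stack; the ports, volume names, and variables (`bunker_ops_retention`, `vault_grafana_admin_password`) are illustrative assumptions, not the finished role:

```yaml
# Sketch of the central-server Compose stack; values are placeholders.
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - "-retentionPeriod={{ bunker_ops_retention | default('12') }}"  # months
    volumes:
      - vm-data:/victoria-metrics-data
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "{{ vault_grafana_admin_password }}"
    volumes:
      - grafana-data:/var/lib/grafana
    restart: unless-stopped

  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    restart: unless-stopped

volumes:
  vm-data:
  grafana-data:
  kuma-data:
```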
4.2 Create playbooks/central.yml
```yaml
- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops
```
4.3 Authentication for remote_write
- Generate a shared write token: `openssl rand -hex 32`
- Store it in the central server's Nginx config (validates incoming `Authorization: Bearer <token>`)
- Distribute the same token to all Tier 2 instances via `vault_bunker_ops_remote_write_token`
- This ensures only authorized instances can push metrics
4.4 Deploy and verify
```bash
ansible-playbook playbooks/central.yml
```
Verify:
- VictoriaMetrics accepts a test write: `curl -X POST 'https://ops.bnkserve.org/api/v1/write' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1'` (note: `/api/v1/write` expects snappy-compressed remote_write protobuf, so for a plain-text smoke test VictoriaMetrics' `/api/v1/import/prometheus` endpoint is the easier target)
- Grafana accessible at `https://grafana.ops.bnkserve.org`
- Uptime Kuma accessible and monitoring the test instance
Deliverable: Central server running VictoriaMetrics + Grafana + Uptime Kuma.
Phase 5: Fleet Dashboards (Week 6-7)
Goal: Build three Grafana dashboards for fleet-wide visibility.
5.1 Fleet Overview Dashboard
File: `files/grafana/fleet-overview.json`
Panels:
- Stat row: Total instances up/down — `count(up{job="changemaker-v2-api"} == 1)`
- Instance table: All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
- Time series — Canvass visits: `sum(rate(cm_canvass_visits_total[5m])) by (instance)`
- Time series — Emails sent: `sum(rate(cm_emails_sent_total[5m])) by (instance)`
- Time series — HTTP request rate: `sum(rate(http_requests_total[5m])) by (instance)`
- Gauge — Fleet email queue: `sum(cm_email_queue_size) by (instance)`
Variables:
- `$instance` — Multi-select, populated from `label_values(up{job="changemaker-v2-api"}, instance)`
5.2 Instance Drill-Down Dashboard
File: `files/grafana/instance-drilldown.json`
Variables:
- `$instance` — Single-select
Panel groups:
- Health: API uptime, HTTP error rate, p50/p95/p99 latency
- Influence: Emails sent/failed, queue depth, response submissions
- Canvass: Active sessions, visits by outcome, shift signups
- Geocoding: Cache hit rate, request rate by provider, duration
- System: CPU usage, memory, disk I/O, network (from `node_*` metrics)
This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.
5.3 Backup Status Dashboard
File: `files/grafana/backup-status.json`
Panels:
- Gauge — Time since last backup: `time() - cm_backup_last_success_timestamp` per instance. Green < 24h, yellow < 48h, red > 48h.
- Table — Backup sizes: `cm_backup_size_bytes` per instance with a sparkline trend
- Alert rule — BackupStale: fires when any instance hasn't backed up in 25 hours (1h grace past the daily cron)
5.4 Auto-provisioning
Grafana dashboards are auto-provisioned from JSON files via a `dashboards.yml` provisioner config, following the same pattern as the existing per-instance Grafana setup.
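A minimal sketch of that provisioner file, using Grafana's standard dashboard-provisioning format; the provider name, folder, and JSON path are assumptions mirroring the per-instance setup:

```yaml
# Sketch of dashboards.yml — folder name and path are placeholders.
apiVersion: 1
providers:
  - name: bunker-ops-fleet
    folder: Fleet
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: false
```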
Deliverable: Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.
Phase 6: Promote Instances to Tier 2 (Week 7-8)
Goal: Enable fleet observability on all managed instances.
6.1 For each instance
1. Update `host_vars/<hostname>/main.yml`:

   ```yaml
   bunker_ops_enabled: true
   bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
   ```

2. Add the write token to `host_vars/<hostname>/vault.yml`:

   ```yaml
   vault_bunker_ops_remote_write_token: "<shared-token>"
   ```

3. Apply:

   ```bash
   ansible-playbook playbooks/monitoring.yml --limit <hostname>
   ```
6.2 Verify data flow
- Check VictoriaMetrics for incoming data: `curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'`
- Check the Grafana fleet overview shows the new instance
- Verify backup metrics appear after next backup run
6.3 Bandwidth audit
Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.
Deliverable: All instances reporting to central dashboards.
Phase 7: Alerting & Notifications (Week 8-9)
Goal: Central alerting for fleet-wide issues.
7.1 Alert rules on central VictoriaMetrics
Create `roles/bunker-ops/templates/alerts.yml.j2` (sketched after the table):
| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | `up{job="changemaker-v2-api"} == 0` for 5m | critical |
| HighErrorRate | `rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1` | warning |
| EmailQueueBacklog | `cm_email_queue_size > 100` for 15m | warning |
| BackupStale | `time() - cm_backup_last_success_timestamp > 90000` (25h) | critical |
| DiskSpaceLow | `node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1` | critical |
| HighMemoryUsage | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1` for 10m | warning |
| CanvassSessionAbandoned | `cm_active_canvass_sessions > 20` for 1h | info |
7.2 Notification channels
Central Alertmanager routes alerts to:
- Gotify — Push notifications to admin phone
- Email — Summary digests to fleet admin email
- Webhook — Optional Rocket.Chat / Slack integration
7.3 Silence rules
- Suppress `InstanceDown` during planned maintenance windows
- Group alerts by instance to avoid notification storms (see the routing sketch below)
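A hedged sketch of an Alertmanager config implementing that routing and grouping. The SMTP host, addresses, and webhook URL are placeholders; Gotify does not speak the Alertmanager webhook format natively, so a small bridge (or an n8n workflow) would sit behind the webhook receiver.

```yaml
# Sketch only — hosts and addresses are placeholders.
global:
  smtp_smarthost: "mail.example.org:587"
  smtp_from: "alerts@example.org"

route:
  receiver: fleet-email
  group_by: ["instance", "alertname"]   # group per instance to avoid notification storms
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: gotify-push

receivers:
  - name: fleet-email
    email_configs:
      - to: "fleet-admin@example.org"
  - name: gotify-push
    webhook_configs:
      - url: "https://ops.example.org/gotify-bridge"   # bridge that forwards to Gotify
```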
Deliverable: Automated alerts for instance health, backups, and resource exhaustion.
Phase 8: Upgrade Automation & CI (Week 9-11)
Goal: Streamline the upgrade pipeline.
8.1 Gitea webhook → n8n → Ansible
When a new commit is pushed to the v2 branch on the central Gitea:
- Gitea fires a webhook to n8n
- n8n workflow triggers `ansible-playbook playbooks/upgrade.yml`
- Rolling upgrade proceeds (25% batches)
- Health checks gate each batch
- n8n sends a summary notification
8.2 Canary deployment
Add a canary group to inventory:
```yaml
all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        # ...
```
New `playbooks/canary-upgrade.yml` (sketched after this list):
- Upgrade canary instance first
- Wait 30 minutes
- Run health checks
- If healthy, proceed with `upgrade.yml` on remaining instances
- If unhealthy, alert and stop
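A minimal sketch of the canary playbook. It assumes the `changemaker` role performs the upgrade steps and that `cml_domain` is defined per host; an operator would still run `upgrade.yml` for the rest of the fleet after the canary soaks.

```yaml
# Sketch of playbooks/canary-upgrade.yml — role and variable names are assumptions.
- name: Upgrade the canary instance
  hosts: canary
  become: true
  roles:
    - changemaker

- name: Soak and health-check the canary
  hosts: canary
  gather_facts: false
  tasks:
    - name: Wait 30 minutes before touching the rest of the fleet
      ansible.builtin.pause:
        minutes: 30

    - name: Verify the public API health endpoint from the controller
      ansible.builtin.uri:
        url: "https://api.{{ cml_domain }}/api/health"
      register: canary_health
      until: canary_health.status == 200
      retries: 5
      delay: 15
      delegate_to: localhost
      become: false
```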
8.3 Rollback playbook
Create `playbooks/rollback.yml` (sketched after this list):
- `git checkout <previous-tag>` on the instance
- `docker compose up -d --build`
- Run health checks
- Requires knowing the previous good commit (store it in a fact file per host)
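A hedged sketch of that rollback playbook. It assumes the last known-good commit is stored as a local fact (e.g. under `/etc/ansible/facts.d/changemaker.fact`), that the app lives in `/opt/changemaker-lite`, and that the API listens locally on port 3000 — all of these are assumptions, not the repo's actual layout.

```yaml
# Sketch of playbooks/rollback.yml — paths, ports, and fact names are placeholders.
- name: Roll an instance back to the last known-good commit
  hosts: changemaker_instances
  become: true
  serial: 1
  tasks:
    - name: Check out the previous good commit
      ansible.builtin.command:
        cmd: "git checkout {{ ansible_local.changemaker.last_good_commit }}"
        chdir: /opt/changemaker-lite
      register: checkout_result
      changed_when: "'HEAD is now at' in checkout_result.stderr"

    - name: Rebuild and restart the stack
      ansible.builtin.command:
        cmd: docker compose up -d --build
        chdir: /opt/changemaker-lite
      changed_when: true

    - name: Wait for the API to come back
      ansible.builtin.uri:
        url: "http://localhost:3000/api/health"
      register: rollback_health
      until: rollback_health.status == 200
      retries: 10
      delay: 15
```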
Deliverable: Semi-automated upgrade pipeline with canary gates and rollback capability.
Phase 9: Self-Service Instance Provisioning (Week 11-13)
Goal: Enable clients to request and receive a new instance with minimal operator intervention.
9.1 Provisioning API
Build a lightweight FastAPI or Express service on the central server:
Endpoints:
- `POST /api/instances` — Create a new instance (accepts domain, features, tier)
- `GET /api/instances` — List all instances with status
- `GET /api/instances/:id/status` — Health + metrics summary
- `DELETE /api/instances/:id` — Decommission
Workflow:
- API receives request with domain, SSH host, feature flags
- Runs `add-instance.sh` to scaffold host_vars
- Triggers `ansible-playbook playbooks/deploy.yml --limit <hostname>`
- Monitors deployment progress
- Returns status when deployment completes
9.2 Fleet admin dashboard
A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):
- Instance list with health status
- One-click upgrade, backup, configure
- New instance wizard
- Grafana iframe embeds for metrics
9.3 DNS automation
If using Pangolin for all instances:
- Pangolin handles DNS + TLS automatically
- The provisioning API creates Pangolin resources as part of deploy
If using Cloudflare or other DNS:
- Add a `roles/dns/` role with Cloudflare API integration (a task sketch follows this list)
- Automatically create A/CNAME records for all subdomains
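A minimal sketch of such a task using the `community.general.cloudflare_dns` module; the subdomain list and the variables `cml_domain`, `cml_server_ip`, and `vault_cloudflare_api_token` are illustrative assumptions:

```yaml
# Sketch of a roles/dns/ task — extend the loop to cover all 12 subdomains.
- name: Create A records for instance subdomains
  community.general.cloudflare_dns:
    zone: "{{ cml_domain }}"
    record: "{{ item }}"
    type: A
    value: "{{ cml_server_ip }}"
    api_token: "{{ vault_cloudflare_api_token }}"
    state: present
  loop:
    - app
    - api
    - grafana
    # ... remaining subdomains
  delegate_to: localhost
  become: false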
Deliverable: Operator can provision a new instance with a single API call or form submission.
Phase 10: Multi-Tenant Hardening (Week 13-16)
Goal: Security and isolation for a fleet of independent client instances.
10.1 Network isolation
Each instance runs on its own server — already isolated at the OS level. Additional hardening:
- UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
- No inter-instance SSH access
- Central server can SSH to instances, not vice versa
10.2 Secret rotation schedule
Automate periodic secret rotation:
| Secret | Rotation frequency | Method |
|---|---|---|
| JWT access secret | Quarterly | vault edit + configure playbook |
| Database passwords | Annually | vault edit + full redeploy |
| Redis password | Annually | vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |
Create a `playbooks/rotate-secrets.yml` that generates new secrets and applies them.
10.3 Audit logging
- Ansible logs all operations to a central log file
- Each playbook run produces a summary (host, timestamp, changes made)
- Integrate with Git: all inventory changes are committed to a private repo
10.4 Compliance documentation
For each instance, Ansible can generate (a collection sketch follows this list):
- Inventory of services and versions
- Security posture report (UFW rules, fail2ban status, TLS cert expiry)
- Backup compliance (last backup date, retention policy)
- Data residency confirmation (server location, no PII in metrics)
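A hedged sketch of how such a posture snapshot could be collected; the report path and the exact commands captured are assumptions, not an existing playbook in the repo.

```yaml
# Sketch: gather a simple security-posture snapshot per instance onto the controller.
- name: Collect compliance facts
  hosts: changemaker_instances
  become: true
  tasks:
    - name: Ensure the local reports directory exists
      ansible.builtin.file:
        path: reports
        state: directory
      delegate_to: localhost
      become: false
      run_once: true

    - name: Capture UFW status
      ansible.builtin.command: ufw status verbose
      register: ufw_status
      changed_when: false

    - name: Capture fail2ban status
      ansible.builtin.command: fail2ban-client status
      register: f2b_status
      changed_when: false

    - name: Write a posture report on the controller
      ansible.builtin.copy:
        dest: "reports/{{ inventory_hostname }}-posture.txt"
        content: |
          UFW:
          {{ ufw_status.stdout }}
          fail2ban:
          {{ f2b_status.stdout }}
      delegate_to: localhost
      become: false
```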
Deliverable: Hardened fleet with automated rotation, audit trail, and compliance artifacts.
Timeline Summary
| Phase | Duration | Milestone |
|---|---|---|
| 0: Foundation | ✅ Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |
Total: ~16 weeks from foundation to fully hardened fleet.
Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.
FOSS Stack Summary
Every component is Free and Open Source Software:
| Component | License | Role in Stack |
|---|---|---|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |
No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.
Risk Register
| Risk | Impact | Mitigation |
|---|---|---|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | remote_write WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | serial: 1 for deploys, --check --diff before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run migrate deploy (applies pending only), never migrate dev in production |
| Instance disk full | Backup fails, containers crash | BackupStale + DiskSpaceLow alerts, retention cleanup |