
Bunker Ops — Staged Rollout Plan

Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.


Current State (Completed)

Phase 0: Foundation

Repo changes (v2 branch):

  • INSTANCE_LABEL, BUNKER_OPS_ENABLED, BUNKER_OPS_REMOTE_WRITE_URL env vars added
  • Prometheus metrics tagged with instance label
  • Redis-exporter auth fixed (correct container name + password)
  • Backup script pushes metrics when Bunker Ops is enabled
  • docker-compose.override.yml in .gitignore

Ansible skeleton (bunker-ops/):

  • ansible.cfg — SSH pipelining, yaml callback, vault password path
  • Inventory structure with example host_vars and group defaults
  • 3 roles: common (OS/Docker/UFW), changemaker (full deploy), monitoring (Prometheus/remote_write)
  • 5 playbooks: deploy, upgrade, backup, configure, monitoring
  • 2 scripts: bootstrap-vault.sh (secret generation), add-instance.sh (instance scaffolding)
  • env.j2 template mapping all 100+ .env variables to Ansible vars

Phase 1: First Managed Instance (Week 1-2)

Goal: Validate the full Ansible pipeline end-to-end on a single real instance.

1.1 Prepare a test server

  • Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
  • Set up SSH key access for a deploy user with passwordless sudo
  • Ensure ports 80, 443, SSH are reachable

1.2 Scaffold the instance

cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass

./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1

1.3 Run the full deploy

ansible-playbook playbooks/deploy.yml --limit test-01

1.4 Validate

  • All containers running (docker compose ps)
  • API responds at /api/health
  • Admin GUI loads and login works
  • Prisma migrations applied cleanly
  • Backup cron is installed (crontab -l)
  • UFW is active with correct rules
  • fail2ban is running
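
The checks above can be scripted as a small Ansible play. A minimal sketch, assuming the install path used elsewhere in this plan; the playbook name and the API port (3000) are illustrative:

# playbooks/validate.yml (sketch; API port is an assumption)
- name: Smoke-test a freshly deployed instance
  hosts: changemaker_instances
  become: true
  tasks:
    - name: At least one container is running (refine to an exact count if desired)
      ansible.builtin.command: docker compose ps --status running --quiet
      args:
        chdir: /opt/changemaker-lite
      register: running_containers
      changed_when: false
      failed_when: running_containers.stdout_lines | length == 0

    - name: API health endpoint responds
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
        status_code: 200

    - name: Backup cron is installed
      ansible.builtin.command: crontab -l
      register: cron_list
      changed_when: false
      failed_when: "'backup' not in cron_list.stdout"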

1.5 Test day-2 operations

  • configure.yml — change a feature flag, verify API restarts
  • upgrade.yml — make a Git commit, run upgrade, verify new code is live
  • backup.yml — trigger backup, verify archive created
  • Secret rotation — change Redis password in vault, reconfigure, verify connectivity

1.6 Fix and iterate

Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:

  • Docker image pull timing
  • Prisma migration ordering
  • Directory permission edge cases
  • OS-specific package availability

Deliverable: One fully Ansible-managed instance running in production.


Phase 2: Pangolin Tunnel Integration (Week 2-3)

Goal: Automate the full Pangolin tunnel setup within Ansible.

2.1 Add Pangolin setup task

Create roles/changemaker/tasks/pangolin.yml:

  • Call Pangolin API to create a site (if cml_pangolin_api_url is set)
  • Store returned PANGOLIN_SITE_ID, PANGOLIN_NEWT_ID, PANGOLIN_NEWT_SECRET in vault
  • Sync resource definitions from configs/pangolin/resources.yml
  • Set all resources to "Not Protected"
  • Restart the Newt container

This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
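
A sketch of what that task file might look like. The Pangolin endpoint paths, response shape, and vault_pangolin_api_token are assumptions; only cml_pangolin_api_url and the stored variable names come from this plan. The existence check is the idempotency guard that 2.3 below calls for:

# roles/changemaker/tasks/pangolin.yml (sketch; endpoint paths are assumed)
- name: Look up an existing Pangolin site for this instance
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites?name={{ inventory_hostname }}"
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
  register: existing_site
  when: cml_pangolin_api_url is defined

- name: Create the site only if none exists yet
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
    status_code: [200, 201]
  register: new_site
  # the returned PANGOLIN_SITE_ID / NEWT credentials would be written to the vault here
  when:
    - cml_pangolin_api_url is defined
    - existing_site.json | default([]) | length == 0

- name: Restart Newt so it picks up new credentials
  ansible.builtin.command: docker compose restart newt
  args:
    chdir: /opt/changemaker-lite
  when: new_site is not skipped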

2.2 Validate tunnel works

  • Instance accessible via https://app.<domain> through Pangolin
  • API accessible via https://api.<domain>
  • All 12 subdomains route correctly
  • CORS headers present

2.3 Idempotency

Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.

Deliverable: Single-command deployment from bare server to publicly accessible instance.


Phase 3: Onboard Existing Instances (Week 3-4)

Goal: Migrate manually installed instances to Ansible management.

3.1 Import strategy

For each existing instance that was set up with config.sh:

  1. Scaffold host_vars:

    ./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
    
  2. Import existing secrets from the server's .env into the vault:

    # SSH in and extract current secrets:
    ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"
    # Copy into vault.yml (replace generated values with existing ones)
    ansible-vault edit inventory/host_vars/<hostname>/vault.yml
    
  3. Test with --check --diff first:

    ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
    

    This shows what .env lines would change without actually changing anything.

  4. Apply configuration management:

    ansible-playbook playbooks/configure.yml --limit <hostname>
    

3.2 Avoid disruption

  • Do NOT re-run the common role on production servers that are already set up. Use --tags env,deploy to skip OS provisioning.
  • Do NOT re-run the seed on instances with existing data. The seed task has failed_when: false for safety, but verify.
  • Backup first — always run playbooks/backup.yml before importing an existing instance.

3.3 Instance inventory target

| Instance | Domain | Status | Tier |
|---|---|---|---|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |

Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.

Deliverable: All existing production instances under Ansible management (Tier 1).


Phase 4: Central Observability Server (Week 4-6)

Goal: Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.

4.1 Create roles/bunker-ops/

New role for the central server:

roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml

Docker Compose stack:

| Service | Image | Purpose |
|---|---|---|
| VictoriaMetrics | victoriametrics/victoria-metrics | Receives remote_write from instances, 12-month retention |
| Grafana | grafana/grafana | Fleet dashboards, VM as datasource |
| Uptime Kuma | louislam/uptime-kuma | HTTP health monitors per instance |
| Nginx | nginx:alpine | TLS termination, auth on write endpoint |

Key configuration:

  • VictoriaMetrics listens on :8428; instances push to /api/v1/write, and Grafana reads via the Prometheus-compatible /api/v1/query endpoints
  • Nginx authenticates remote_write requests with Bearer token
  • Grafana auto-provisioned with VictoriaMetrics as default datasource
  • Uptime Kuma monitors https://api.<domain>/api/health for each instance
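
A trimmed sketch of the Compose template. Images and the retention flag follow the table above; volume names and the Grafana URL are illustrative:

# roles/bunker-ops/templates/docker-compose.yml.j2 (trimmed sketch)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - "-retentionPeriod=12"      # months
      - "-httpListenAddr=:8428"
    volumes:
      - vm-data:/victoria-metrics-data
  grafana:
    image: grafana/grafana
    environment:
      GF_SERVER_ROOT_URL: "https://grafana.ops.bnkserve.org"
    volumes:
      - grafana-data:/var/lib/grafana
  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
volumes:
  vm-data:
  grafana-data:
  kuma-data: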

4.2 Create playbooks/central.yml

- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops

4.3 Authentication for remote_write

  • Generate a shared write token: openssl rand -hex 32
  • Store in central server's Nginx config (validates incoming Authorization: Bearer <token>)
  • Distribute same token to all Tier 2 instances via vault_bunker_ops_remote_write_token
  • This ensures only authorized instances can push metrics

4.4 Deploy and verify

ansible-playbook playbooks/central.yml

Verify:

  • VictoriaMetrics accepts a test write: curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1' (the /api/v1/write endpoint expects snappy-compressed remote_write protobuf, so plain-text test data goes through the Prometheus import endpoint)
  • Grafana accessible at https://grafana.ops.bnkserve.org
  • Uptime Kuma accessible and monitoring test instance

Deliverable: Central server running VictoriaMetrics + Grafana + Uptime Kuma.


Phase 5: Fleet Dashboards (Week 6-7)

Goal: Build three Grafana dashboards for fleet-wide visibility.

5.1 Fleet Overview Dashboard

File: files/grafana/fleet-overview.json

Panels:

  • Stat row: Total instances up/down — count(up{job="changemaker-v2-api"} == 1)
  • Instance table: All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
  • Time series — Canvass visits: sum(rate(cm_canvass_visits_total[5m])) by (instance)
  • Time series — Emails sent: sum(rate(cm_emails_sent_total[5m])) by (instance)
  • Time series — HTTP request rate: sum(rate(http_requests_total[5m])) by (instance)
  • Gauge — Fleet email queue: sum(cm_email_queue_size) by (instance)

Variables:

  • $instance — Multi-select, populated from label_values(up{job="changemaker-v2-api"}, instance)

5.2 Instance Drill-Down Dashboard

File: files/grafana/instance-drilldown.json

Variables:

  • $instance — Single-select

Panel groups:

  • Health: API uptime, HTTP error rate, p50/p95/p99 latency
  • Influence: Emails sent/failed, queue depth, response submissions
  • Canvass: Active sessions, visits by outcome, shift signups
  • Geocoding: Cache hit rate, request rate by provider, duration
  • System: CPU usage, memory, disk I/O, network (from node_* metrics)

This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.

5.3 Backup Status Dashboard

File: files/grafana/backup-status.json

Panels:

  • Gauge — Time since last backup: time() - cm_backup_last_success_timestamp per instance. Green < 24h, yellow 24-48h, red > 48h.
  • Table — Backup sizes: cm_backup_size_bytes per instance with sparkline trend
  • Alert rule — BackupStale: Fires when any instance hasn't backed up in 25 hours (1h grace past daily cron)

5.4 Auto-provisioning

Grafana dashboards are auto-provisioned from the JSON files via a dashboards.yml provisioner config, following the same pattern as the existing per-instance Grafana setup.
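
The provisioner config is small. A sketch, assuming the JSON files are mounted at /var/lib/grafana/dashboards inside the container:

# grafana/provisioning/dashboards/dashboards.yml (sketch)
apiVersion: 1
providers:
  - name: bunker-ops
    type: file
    updateIntervalSeconds: 60
    options:
      path: /var/lib/grafana/dashboards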

Deliverable: Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.


Phase 6: Promote Instances to Tier 2 (Week 7-8)

Goal: Enable fleet observability on all managed instances.

6.1 For each instance

  1. Update host_vars/<hostname>/main.yml:

    bunker_ops_enabled: true
    bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
    
  2. Add write token to host_vars/<hostname>/vault.yml:

    vault_bunker_ops_remote_write_token: "<shared-token>"
    
  3. Apply:

    ansible-playbook playbooks/monitoring.yml --limit <hostname>
    

6.2 Verify data flow

  • Check VictoriaMetrics for incoming data: curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'
  • Check Grafana fleet overview shows the new instance
  • Verify backup metrics appear after next backup run

6.3 Bandwidth audit

Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.

Deliverable: All instances reporting to central dashboards.


Phase 7: Alerting & Notifications (Week 8-9)

Goal: Central alerting for fleet-wide issues.

7.1 Alert rules on central VictoriaMetrics

Create roles/bunker-ops/templates/alerts.yml.j2:

| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | up{job="changemaker-v2-api"} == 0 for 5m | critical |
| HighErrorRate | rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1 | warning |
| EmailQueueBacklog | cm_email_queue_size > 100 for 15m | warning |
| BackupStale | time() - cm_backup_last_success_timestamp > 90000 (25h) | critical |
| DiskSpaceLow | node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1 | critical |
| HighMemoryUsage | node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for 10m | warning |
| CanvassSessionAbandoned | cm_active_canvass_sessions > 20 for 1h | info |
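
In Prometheus/vmalert rule syntax, the first rows of the table translate roughly as follows (a sketch; metric and label names match the queries used elsewhere in this plan):

# roles/bunker-ops/templates/alerts.yml.j2 (sketch of the first three rules)
groups:
  - name: fleet
    rules:
      - alert: InstanceDown
        expr: up{job="changemaker-v2-api"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: BackupStale
        expr: time() - cm_backup_last_success_timestamp > 90000   # 25h
        labels:
          severity: critical
      - alert: EmailQueueBacklog
        expr: cm_email_queue_size > 100
        for: 15m
        labels:
          severity: warning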

7.2 Notification channels

Central Alertmanager routes alerts to:

  • Gotify — Push notifications to admin phone
  • Email — Summary digests to fleet admin email
  • Webhook — Optional Rocket.Chat / Slack integration

7.3 Silence rules

  • Suppress InstanceDown during planned maintenance windows
  • Group alerts by instance to avoid notification storms
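
A sketch of the corresponding Alertmanager config. Alertmanager has no native Gotify receiver, so the Gotify channel is reached through a generic webhook (the bridge URL and admin address are illustrative); grouping by instance implements the storm-avoidance rule above:

# alertmanager.yml (sketch; SMTP globals omitted)
route:
  receiver: fleet-admin
  group_by: ["instance", "alertname"]
  group_wait: 30s
  group_interval: 5m
receivers:
  - name: fleet-admin
    email_configs:
      - to: "admin@example.org"                  # illustrative address
    webhook_configs:
      - url: "http://gotify-bridge:8080/alert"   # Gotify via a webhook bridge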

Deliverable: Automated alerts for instance health, backups, and resource exhaustion.


Phase 8: Upgrade Automation & CI (Week 9-11)

Goal: Streamline the upgrade pipeline.

8.1 Gitea webhook → n8n → Ansible

When a new commit is pushed to the v2 branch on the central Gitea:

  1. Gitea fires a webhook to n8n
  2. n8n workflow triggers ansible-playbook playbooks/upgrade.yml
  3. Rolling upgrade proceeds (25% batches)
  4. Health checks gate each batch
  5. n8n sends a summary notification

8.2 Canary deployment

Add a canary group to inventory:

all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...

New playbooks/canary-upgrade.yml:

  1. Upgrade canary instance first
  2. Wait 30 minutes
  3. Run health checks
  4. If healthy, proceed with upgrade.yml on remaining instances
  5. If unhealthy, alert and stop
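
A sketch of the canary playbook. It assumes the upgrade steps live in the changemaker role so they can be reused per group, and that cml_domain is the per-host domain variable; the health URL mirrors Phase 1:

# playbooks/canary-upgrade.yml (sketch)
- name: Upgrade the canary first
  hosts: canary
  become: true
  roles:
    - changemaker

- name: Soak, then judge the canary
  hosts: canary
  tasks:
    - name: Wait 30 minutes of soak time
      ansible.builtin.pause:
        minutes: 30
    - name: Canary must be healthy before the fleet proceeds (failure stops the run)
      ansible.builtin.uri:
        url: "https://api.{{ cml_domain }}/api/health"
        status_code: 200

- name: Roll out to the rest of the fleet in 25% batches
  hosts: changemaker_instances:!canary
  become: true
  serial: "25%"
  roles:
    - changemaker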

8.3 Rollback playbook

Create playbooks/rollback.yml:

  • git checkout <previous-tag> on the instance
  • docker compose up -d --build
  • Run health checks
  • Requires knowing the previous good commit (store in a fact file per host)
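
A sketch of the rollback flow. cm_last_good_commit is a hypothetical per-host fact (persisted by upgrade.yml, per the note above); the install path and API port follow earlier phases:

# playbooks/rollback.yml (sketch)
- name: Roll an instance back to the last known-good commit
  hosts: changemaker_instances
  become: true
  tasks:
    - name: Check out the previous good commit
      ansible.builtin.command: git checkout {{ cm_last_good_commit }}
      args:
        chdir: /opt/changemaker-lite
    - name: Rebuild and restart containers
      ansible.builtin.command: docker compose up -d --build
      args:
        chdir: /opt/changemaker-lite
    - name: Verify the API is healthy after rollback
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
        status_code: 200
      register: health
      retries: 10
      delay: 15
      until: health.status == 200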

Deliverable: Semi-automated upgrade pipeline with canary gates and rollback capability.


Phase 9: Self-Service Instance Provisioning (Week 11-13)

Goal: Enable clients to request and receive a new instance with minimal operator intervention.

9.1 Provisioning API

Build a lightweight FastAPI or Express service on the central server:

Endpoints:

  • POST /api/instances — Create a new instance (accepts domain, features, tier)
  • GET /api/instances — List all instances with status
  • GET /api/instances/:id/status — Health + metrics summary
  • DELETE /api/instances/:id — Decommission

Workflow:

  1. API receives request with domain, SSH host, feature flags
  2. Runs add-instance.sh to scaffold host_vars
  3. Triggers ansible-playbook playbooks/deploy.yml --limit <hostname>
  4. Monitors deployment progress
  5. Returns status when deployment completes
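
Expressed as an OpenAPI sketch (kept in YAML to match the rest of the stack's config language; request/response schemas beyond the endpoints listed above are assumptions):

# provisioning-api/openapi.yml (sketch)
openapi: 3.0.3
info:
  title: Bunker Ops Provisioning API
  version: 0.1.0
paths:
  /api/instances:
    post:
      summary: Create a new instance (scaffolds host_vars, runs deploy.yml)
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [domain, ssh_host]
              properties:
                domain: { type: string }
                ssh_host: { type: string }
                tier: { type: integer, default: 1 }
                features: { type: array, items: { type: string } }
      responses:
        "202": { description: Deployment started }
    get:
      summary: List all instances with status
      responses:
        "200": { description: Instance list }
  /api/instances/{id}/status:
    get:
      summary: Health and metrics summary for one instance
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      responses:
        "200": { description: Status summary }
  /api/instances/{id}:
    delete:
      summary: Decommission an instance
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      responses:
        "204": { description: Decommissioned }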

9.2 Fleet admin dashboard

A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):

  • Instance list with health status
  • One-click upgrade, backup, configure
  • New instance wizard
  • Grafana iframe embeds for metrics

9.3 DNS automation

If using Pangolin for all instances:

  • Pangolin handles DNS + TLS automatically
  • The provisioning API creates Pangolin resources as part of deploy

If using Cloudflare or other DNS:

  • Add a roles/dns/ role with Cloudflare API integration
  • Automatically create A/CNAME records for all subdomains
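
With Cloudflare, the role can loop over the public subdomains using the community cloudflare_dns module. A sketch; the subdomain list is abbreviated (the plan mentions 12) and the variable names are assumptions:

# roles/dns/tasks/main.yml (sketch)
- name: Create A records for the public subdomains
  community.general.cloudflare_dns:
    zone: "{{ cml_domain }}"
    record: "{{ item }}"
    type: A
    value: "{{ ansible_host }}"
    api_token: "{{ vault_cloudflare_api_token }}"
    state: present
  loop:
    - app
    - api
    - grafana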

Deliverable: Operator can provision a new instance with a single API call or form submission.


Phase 10: Multi-Tenant Hardening (Week 13-16)

Goal: Security and isolation for a fleet of independent client instances.

10.1 Network isolation

Each instance runs on its own server — already isolated at the OS level. Additional hardening:

  • UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
  • No inter-instance SSH access
  • Central server can SSH to instances, not vice versa

10.2 Secret rotation schedule

Automate periodic secret rotation:

| Secret | Rotation frequency | Method |
|---|---|---|
| JWT access secret | Quarterly | Vault edit + configure playbook |
| Database passwords | Annually | Vault edit + full redeploy |
| Redis password | Annually | Vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |

Create a playbooks/rotate-secrets.yml that generates new secrets and applies them.
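
A sketch of the rotation flow for one secret. The password lookup generates a fresh value on each run; writing it back into vault.yml is left to the operator or a wrapper script, and env.j2 is assumed to consume the variable through the usual mapping:

# playbooks/rotate-secrets.yml (sketch for the Redis password)
- name: Rotate the Redis password
  hosts: changemaker_instances
  become: true
  vars:
    new_redis_password: "{{ lookup('password', '/dev/null length=32 chars=ascii_letters,digits') }}"
  tasks:
    - name: Render .env with the new password
      ansible.builtin.template:
        src: env.j2
        dest: /opt/changemaker-lite/.env
        mode: "0600"
      notify: restart stack
    - name: Remind the operator to persist the new value
      ansible.builtin.debug:
        msg: "Store the new Redis password in vault.yml before the next run"
  handlers:
    - name: restart stack
      ansible.builtin.command: docker compose up -d
      args:
        chdir: /opt/changemaker-lite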

10.3 Audit logging

  • Ansible logs all operations to a central log file
  • Each playbook run produces a summary (host, timestamp, changes made)
  • Integrate with Git: all inventory changes are committed to a private repo

10.4 Compliance documentation

For each instance, Ansible can generate:

  • Inventory of services and versions
  • Security posture report (UFW rules, fail2ban status, TLS cert expiry)
  • Backup compliance (last backup date, retention policy)
  • Data residency confirmation (server location, no PII in metrics)

Deliverable: Hardened fleet with automated rotation, audit trail, and compliance artifacts.


Timeline Summary

| Phase | Duration | Milestone |
|---|---|---|
| 0: Foundation | Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |

Total: ~16 weeks from foundation to fully hardened fleet.

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.


FOSS Stack Summary

Every component is Free and Open Source Software:

| Component | License | Role in stack |
|---|---|---|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |

No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.


Risk Register

| Risk | Impact | Mitigation |
|---|---|---|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | remote_write WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | serial: 1 for deploys, --check --diff before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run migrate deploy (applies pending only), never migrate dev in production |
| Instance disk full | Backup fails, containers crash | BackupStale + DiskSpaceLow alerts, retention cleanup |