
Bunker Ops — Staged Rollout Plan

Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.


Current State (Completed)

Phase 0: Foundation

Repo changes (v2 branch):

  • INSTANCE_LABEL, BUNKER_OPS_ENABLED, BUNKER_OPS_REMOTE_WRITE_URL env vars added
  • Prometheus metrics tagged with instance label
  • Redis-exporter auth fixed (correct container name + password)
  • Backup script pushes metrics when Bunker Ops is enabled
  • docker-compose.override.yml in .gitignore

Ansible skeleton (bunker-ops/):

  • ansible.cfg — SSH pipelining, yaml callback, vault password path
  • Inventory structure with example host_vars and group defaults
  • 3 roles: common (OS/Docker/UFW), changemaker (full deploy), monitoring (Prometheus/remote_write)
  • 5 playbooks: deploy, upgrade, backup, configure, monitoring
  • 2 scripts: bootstrap-vault.sh (secret generation), add-instance.sh (instance scaffolding)
  • env.j2 template mapping all 100+ .env variables to Ansible vars

Phase 1: First Managed Instance (Week 1-2)

Goal: Validate the full Ansible pipeline end-to-end on a single real instance.

1.1 Prepare a test server

  • Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
  • Set up SSH key access for a deploy user with passwordless sudo
  • Ensure ports 80, 443, SSH are reachable

1.2 Scaffold the instance

cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass

./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1

1.3 Run the full deploy

ansible-playbook playbooks/deploy.yml --limit test-01

1.4 Validate

  • All containers running (docker compose ps)
  • API responds at /api/health
  • Admin GUI loads and login works
  • Prisma migrations applied cleanly
  • Backup cron is installed (crontab -l)
  • UFW is active with correct rules
  • fail2ban is running
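
The checks above can be scripted as a small Ansible play. A minimal sketch, assuming the install path used elsewhere in this plan; the playbook name and the API port (3000) are illustrative:

# playbooks/validate.yml (sketch; API port is an assumption)
- name: Smoke-test a freshly deployed instance
  hosts: changemaker_instances
  become: true
  tasks:
    - name: At least one container is running (refine to an exact count if desired)
      ansible.builtin.command: docker compose ps --status running --quiet
      args:
        chdir: /opt/changemaker-lite
      register: running_containers
      changed_when: false
      failed_when: running_containers.stdout_lines | length == 0

    - name: API health endpoint responds
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
        status_code: 200

    - name: Backup cron is installed
      ansible.builtin.command: crontab -l
      register: cron_list
      changed_when: false
      failed_when: "'backup' not in cron_list.stdout"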

1.5 Test day-2 operations

  • configure.yml — change a feature flag, verify API restarts
  • upgrade.yml — make a Git commit, run upgrade, verify new code is live
  • backup.yml — trigger backup, verify archive created
  • Secret rotation — change Redis password in vault, reconfigure, verify connectivity

1.6 Fix and iterate

Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:

  • Docker image pull timing
  • Prisma migration ordering
  • Directory permission edge cases
  • OS-specific package availability

Deliverable: One fully Ansible-managed instance running in production.


Phase 2: Pangolin Tunnel Integration (Week 2-3)

Goal: Automate the full Pangolin tunnel setup within Ansible.

2.1 Add Pangolin setup task

Create roles/changemaker/tasks/pangolin.yml:

  • Call Pangolin API to create a site (if cml_pangolin_api_url is set)
  • Store returned PANGOLIN_SITE_ID, PANGOLIN_NEWT_ID, PANGOLIN_NEWT_SECRET in vault
  • Sync resource definitions from configs/pangolin/resources.yml
  • Set all resources to "Not Protected"
  • Restart the Newt container

This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
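
A sketch of what that task file might look like. The Pangolin endpoint paths, response shape, and vault_pangolin_api_token are assumptions; only cml_pangolin_api_url and the stored variable names come from this plan. The existence check is the idempotency guard that 2.3 below calls for:

# roles/changemaker/tasks/pangolin.yml (sketch; endpoint paths are assumed)
- name: Look up an existing Pangolin site for this instance
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites?name={{ inventory_hostname }}"
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
  register: existing_site
  when: cml_pangolin_api_url is defined

- name: Create the site only if none exists yet
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
    status_code: [200, 201]
  register: new_site
  # the returned PANGOLIN_SITE_ID / NEWT credentials would be written to the vault here
  when:
    - cml_pangolin_api_url is defined
    - existing_site.json | default([]) | length == 0

- name: Restart Newt so it picks up new credentials
  ansible.builtin.command: docker compose restart newt
  args:
    chdir: /opt/changemaker-lite
  when: new_site is not skipped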

2.2 Validate tunnel works

  • Instance accessible via https://app.<domain> through Pangolin
  • API accessible via https://api.<domain>
  • All 12 subdomains route correctly
  • CORS headers present

2.3 Idempotency

Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.

Deliverable: Single-command deployment from bare server to publicly accessible instance.


Phase 3: Onboard Existing Instances (Week 3-4)

Goal: Migrate manually installed instances to Ansible management.

3.1 Import strategy

For each existing instance that was set up with config.sh:

  1. Scaffold host_vars:

    ./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
    
  2. Import existing secrets from the server's .env into the vault:

    # SSH in and extract current secrets:
    ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"
    # Copy into vault.yml (replace generated values with existing ones)
    ansible-vault edit inventory/host_vars/<hostname>/vault.yml
    
  3. Test with --check --diff first:

    ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
    

    This shows what .env lines would change without actually changing anything.

  4. Apply configuration management:

    ansible-playbook playbooks/configure.yml --limit <hostname>
    

3.2 Avoid disruption

  • Do NOT re-run the common role on production servers that are already set up. Use --tags env,deploy to skip OS provisioning.
  • Do NOT re-run the seed on instances with existing data. The seed task has failed_when: false for safety, but verify.
  • Backup first — always run playbooks/backup.yml before importing an existing instance.

3.3 Instance inventory target

| Instance | Domain | Status | Tier |
|---|---|---|---|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |

Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.

Deliverable: All existing production instances under Ansible management (Tier 1).


Phase 4: Central Observability Server (Week 4-6)

Goal: Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.

4.1 Create roles/bunker-ops/

New role for the central server:

roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml

Docker Compose stack:

| Service | Image | Purpose |
|---|---|---|
| VictoriaMetrics | victoriametrics/victoria-metrics | Receives remote_write from instances, 12-month retention |
| Grafana | grafana/grafana | Fleet dashboards, VM as datasource |
| Uptime Kuma | louislam/uptime-kuma | HTTP health monitors per instance |
| Nginx | nginx:alpine | TLS termination, auth on write endpoint |

Key configuration:

  • VictoriaMetrics listens on :8428; instances push to /api/v1/write, and Grafana reads via the Prometheus-compatible /api/v1/query endpoints
  • Nginx authenticates remote_write requests with Bearer token
  • Grafana auto-provisioned with VictoriaMetrics as default datasource
  • Uptime Kuma monitors https://api.<domain>/api/health for each instance
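
A trimmed sketch of the Compose template. Images and the retention flag follow the table above; volume names and the Grafana URL are illustrative:

# roles/bunker-ops/templates/docker-compose.yml.j2 (trimmed sketch)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - "-retentionPeriod=12"      # months
      - "-httpListenAddr=:8428"
    volumes:
      - vm-data:/victoria-metrics-data
  grafana:
    image: grafana/grafana
    environment:
      GF_SERVER_ROOT_URL: "https://grafana.ops.bnkserve.org"
    volumes:
      - grafana-data:/var/lib/grafana
  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
volumes:
  vm-data:
  grafana-data:
  kuma-data: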

4.2 Create playbooks/central.yml

- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops

4.3 Authentication for remote_write

  • Generate a shared write token: openssl rand -hex 32
  • Store in central server's Nginx config (validates incoming Authorization: Bearer <token>)
  • Distribute same token to all Tier 2 instances via vault_bunker_ops_remote_write_token
  • This ensures only authorized instances can push metrics

4.4 Deploy and verify

ansible-playbook playbooks/central.yml

Verify:

  • VictoriaMetrics accepts a test write: curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1' (the /api/v1/write endpoint expects snappy-compressed remote_write protobuf, so plain-text test data goes through the Prometheus import endpoint)
  • Grafana accessible at https://grafana.ops.bnkserve.org
  • Uptime Kuma accessible and monitoring test instance

Deliverable: Central server running VictoriaMetrics + Grafana + Uptime Kuma.


Phase 5: Fleet Dashboards (Week 6-7)

Goal: Build three Grafana dashboards for fleet-wide visibility.

5.1 Fleet Overview Dashboard

File: files/grafana/fleet-overview.json

Panels:

  • Stat row: Total instances up/down — count(up{job="changemaker-v2-api"} == 1)
  • Instance table: All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
  • Time series — Canvass visits: sum(rate(cm_canvass_visits_total[5m])) by (instance)
  • Time series — Emails sent: sum(rate(cm_emails_sent_total[5m])) by (instance)
  • Time series — HTTP request rate: sum(rate(http_requests_total[5m])) by (instance)
  • Gauge — Fleet email queue: sum(cm_email_queue_size) by (instance)

Variables:

  • $instance — Multi-select, populated from label_values(up{job="changemaker-v2-api"}, instance)

5.2 Instance Drill-Down Dashboard

File: files/grafana/instance-drilldown.json

Variables:

  • $instance — Single-select

Panel groups:

  • Health: API uptime, HTTP error rate, p50/p95/p99 latency
  • Influence: Emails sent/failed, queue depth, response submissions
  • Canvass: Active sessions, visits by outcome, shift signups
  • Geocoding: Cache hit rate, request rate by provider, duration
  • System: CPU usage, memory, disk I/O, network (from node_* metrics)

This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.

5.3 Backup Status Dashboard

File: files/grafana/backup-status.json

Panels:

  • Gauge — Time since last backup: time() - cm_backup_last_success_timestamp per instance. Green < 24h, yellow 24-48h, red > 48h.
  • Table — Backup sizes: cm_backup_size_bytes per instance with sparkline trend
  • Alert rule — BackupStale: Fires when any instance hasn't backed up in 25 hours (1h grace past daily cron)

5.4 Auto-provisioning

Grafana dashboards are auto-provisioned from the JSON files via a dashboards.yml provisioner config, following the same pattern as the existing per-instance Grafana setup.
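
The provisioner config is small. A sketch, assuming the JSON files are mounted at /var/lib/grafana/dashboards inside the container:

# grafana/provisioning/dashboards/dashboards.yml (sketch)
apiVersion: 1
providers:
  - name: bunker-ops
    type: file
    updateIntervalSeconds: 60
    options:
      path: /var/lib/grafana/dashboards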

Deliverable: Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.


Phase 6: Promote Instances to Tier 2 (Week 7-8)

Goal: Enable fleet observability on all managed instances.

6.1 For each instance

  1. Update host_vars/<hostname>/main.yml:

    bunker_ops_enabled: true
    bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
    
  2. Add write token to host_vars/<hostname>/vault.yml:

    vault_bunker_ops_remote_write_token: "<shared-token>"
    
  3. Apply:

    ansible-playbook playbooks/monitoring.yml --limit <hostname>
    

6.2 Verify data flow

  • Check VictoriaMetrics for incoming data: curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'
  • Check Grafana fleet overview shows the new instance
  • Verify backup metrics appear after next backup run

6.3 Bandwidth audit

Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.

Deliverable: All instances reporting to central dashboards.


Phase 7: Alerting & Notifications (Week 8-9)

Goal: Central alerting for fleet-wide issues.

7.1 Alert rules on central VictoriaMetrics

Create roles/bunker-ops/templates/alerts.yml.j2:

| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | up{job="changemaker-v2-api"} == 0 for 5m | critical |
| HighErrorRate | rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1 | warning |
| EmailQueueBacklog | cm_email_queue_size > 100 for 15m | warning |
| BackupStale | time() - cm_backup_last_success_timestamp > 90000 (25h) | critical |
| DiskSpaceLow | node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1 | critical |
| HighMemoryUsage | node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1 for 10m | warning |
| CanvassSessionAbandoned | cm_active_canvass_sessions > 20 for 1h | info |
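
In Prometheus/vmalert rule syntax, the first rows of the table translate roughly as follows (a sketch; metric and label names match the queries used elsewhere in this plan):

# roles/bunker-ops/templates/alerts.yml.j2 (sketch of the first three rules)
groups:
  - name: fleet
    rules:
      - alert: InstanceDown
        expr: up{job="changemaker-v2-api"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: BackupStale
        expr: time() - cm_backup_last_success_timestamp > 90000   # 25h
        labels:
          severity: critical
      - alert: EmailQueueBacklog
        expr: cm_email_queue_size > 100
        for: 15m
        labels:
          severity: warning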

7.2 Notification channels

Central Alertmanager routes alerts to:

  • Gotify — Push notifications to admin phone
  • Email — Summary digests to fleet admin email
  • Webhook — Optional Rocket.Chat / Slack integration

7.3 Silence rules

  • Suppress InstanceDown during planned maintenance windows
  • Group alerts by instance to avoid notification storms
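
A sketch of the corresponding Alertmanager config. Alertmanager has no native Gotify receiver, so the Gotify channel is reached through a generic webhook (the bridge URL and admin address are illustrative); grouping by instance implements the storm-avoidance rule above:

# alertmanager.yml (sketch; SMTP globals omitted)
route:
  receiver: fleet-admin
  group_by: ["instance", "alertname"]
  group_wait: 30s
  group_interval: 5m
receivers:
  - name: fleet-admin
    email_configs:
      - to: "admin@example.org"                  # illustrative address
    webhook_configs:
      - url: "http://gotify-bridge:8080/alert"   # Gotify via a webhook bridge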

Deliverable: Automated alerts for instance health, backups, and resource exhaustion.


Phase 8: Upgrade Automation & CI (Week 9-11)

Goal: Streamline the upgrade pipeline.

8.1 Gitea webhook → n8n → Ansible

When a new commit is pushed to the v2 branch on the central Gitea:

  1. Gitea fires a webhook to n8n
  2. n8n workflow triggers ansible-playbook playbooks/upgrade.yml
  3. Rolling upgrade proceeds (25% batches)
  4. Health checks gate each batch
  5. n8n sends a summary notification

8.2 Canary deployment

Add a canary group to inventory:

all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...

New playbooks/canary-upgrade.yml:

  1. Upgrade canary instance first
  2. Wait 30 minutes
  3. Run health checks
  4. If healthy, proceed with upgrade.yml on remaining instances
  5. If unhealthy, alert and stop
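
A sketch of the canary playbook. It assumes the upgrade steps live in the changemaker role so they can be reused per group, and that cml_domain is the per-host domain variable; the health URL mirrors Phase 1:

# playbooks/canary-upgrade.yml (sketch)
- name: Upgrade the canary first
  hosts: canary
  become: true
  roles:
    - changemaker

- name: Soak, then judge the canary
  hosts: canary
  tasks:
    - name: Wait 30 minutes of soak time
      ansible.builtin.pause:
        minutes: 30
    - name: Canary must be healthy before the fleet proceeds (failure stops the run)
      ansible.builtin.uri:
        url: "https://api.{{ cml_domain }}/api/health"
        status_code: 200

- name: Roll out to the rest of the fleet in 25% batches
  hosts: changemaker_instances:!canary
  become: true
  serial: "25%"
  roles:
    - changemaker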

8.3 Rollback playbook

Create playbooks/rollback.yml:

  • git checkout <previous-tag> on the instance
  • docker compose up -d --build
  • Run health checks
  • Requires knowing the previous good commit (store in a fact file per host)
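
A sketch of the rollback flow. cm_last_good_commit is a hypothetical per-host fact (persisted by upgrade.yml, per the note above); the install path and API port follow earlier phases:

# playbooks/rollback.yml (sketch)
- name: Roll an instance back to the last known-good commit
  hosts: changemaker_instances
  become: true
  tasks:
    - name: Check out the previous good commit
      ansible.builtin.command: git checkout {{ cm_last_good_commit }}
      args:
        chdir: /opt/changemaker-lite
    - name: Rebuild and restart containers
      ansible.builtin.command: docker compose up -d --build
      args:
        chdir: /opt/changemaker-lite
    - name: Verify the API is healthy after rollback
      ansible.builtin.uri:
        url: http://localhost:3000/api/health
        status_code: 200
      register: health
      retries: 10
      delay: 15
      until: health.status == 200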

Deliverable: Semi-automated upgrade pipeline with canary gates and rollback capability.


Phase 9: Self-Service Instance Provisioning (Week 11-13)

Goal: Enable clients to request and receive a new instance with minimal operator intervention.

9.1 Provisioning API

Build a lightweight FastAPI or Express service on the central server:

Endpoints:

  • POST /api/instances — Create a new instance (accepts domain, features, tier)
  • GET /api/instances — List all instances with status
  • GET /api/instances/:id/status — Health + metrics summary
  • DELETE /api/instances/:id — Decommission

Workflow:

  1. API receives request with domain, SSH host, feature flags
  2. Runs add-instance.sh to scaffold host_vars
  3. Triggers ansible-playbook playbooks/deploy.yml --limit <hostname>
  4. Monitors deployment progress
  5. Returns status when deployment completes
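
Expressed as an OpenAPI sketch (kept in YAML to match the rest of the stack's config language; request/response schemas beyond the endpoints listed above are assumptions):

# provisioning-api/openapi.yml (sketch)
openapi: 3.0.3
info:
  title: Bunker Ops Provisioning API
  version: 0.1.0
paths:
  /api/instances:
    post:
      summary: Create a new instance (scaffolds host_vars, runs deploy.yml)
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [domain, ssh_host]
              properties:
                domain: { type: string }
                ssh_host: { type: string }
                tier: { type: integer, default: 1 }
                features: { type: array, items: { type: string } }
      responses:
        "202": { description: Deployment started }
    get:
      summary: List all instances with status
      responses:
        "200": { description: Instance list }
  /api/instances/{id}/status:
    get:
      summary: Health and metrics summary for one instance
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      responses:
        "200": { description: Status summary }
  /api/instances/{id}:
    delete:
      summary: Decommission an instance
      parameters:
        - { name: id, in: path, required: true, schema: { type: string } }
      responses:
        "204": { description: Decommissioned }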

9.2 Fleet admin dashboard

A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):

  • Instance list with health status
  • One-click upgrade, backup, configure
  • New instance wizard
  • Grafana iframe embeds for metrics

9.3 DNS automation

If using Pangolin for all instances:

  • Pangolin handles DNS + TLS automatically
  • The provisioning API creates Pangolin resources as part of deploy

If using Cloudflare or other DNS:

  • Add a roles/dns/ role with Cloudflare API integration
  • Automatically create A/CNAME records for all subdomains
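
With Cloudflare, the role can loop over the public subdomains using the community cloudflare_dns module. A sketch; the subdomain list is abbreviated (the plan mentions 12) and the variable names are assumptions:

# roles/dns/tasks/main.yml (sketch)
- name: Create A records for the public subdomains
  community.general.cloudflare_dns:
    zone: "{{ cml_domain }}"
    record: "{{ item }}"
    type: A
    value: "{{ ansible_host }}"
    api_token: "{{ vault_cloudflare_api_token }}"
    state: present
  loop:
    - app
    - api
    - grafana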

Deliverable: Operator can provision a new instance with a single API call or form submission.


Phase 10: Multi-Tenant Hardening (Week 13-16)

Goal: Security and isolation for a fleet of independent client instances.

10.1 Network isolation

Each instance runs on its own server — already isolated at the OS level. Additional hardening:

  • UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
  • No inter-instance SSH access
  • Central server can SSH to instances, not vice versa

10.2 Secret rotation schedule

Automate periodic secret rotation:

| Secret | Rotation frequency | Method |
|---|---|---|
| JWT access secret | Quarterly | Vault edit + configure playbook |
| Database passwords | Annually | Vault edit + full redeploy |
| Redis password | Annually | Vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |

Create a playbooks/rotate-secrets.yml that generates new secrets and applies them.
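
A sketch of the rotation flow for one secret. The password lookup generates a fresh value on each run; writing it back into vault.yml is left to the operator or a wrapper script, and env.j2 is assumed to consume the variable through the usual mapping:

# playbooks/rotate-secrets.yml (sketch for the Redis password)
- name: Rotate the Redis password
  hosts: changemaker_instances
  become: true
  vars:
    new_redis_password: "{{ lookup('password', '/dev/null length=32 chars=ascii_letters,digits') }}"
  tasks:
    - name: Render .env with the new password
      ansible.builtin.template:
        src: env.j2
        dest: /opt/changemaker-lite/.env
        mode: "0600"
      notify: restart stack
    - name: Remind the operator to persist the new value
      ansible.builtin.debug:
        msg: "Store the new Redis password in vault.yml before the next run"
  handlers:
    - name: restart stack
      ansible.builtin.command: docker compose up -d
      args:
        chdir: /opt/changemaker-lite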

10.3 Audit logging

  • Ansible logs all operations to a central log file
  • Each playbook run produces a summary (host, timestamp, changes made)
  • Integrate with Git: all inventory changes are committed to a private repo

10.4 Compliance documentation

For each instance, Ansible can generate:

  • Inventory of services and versions
  • Security posture report (UFW rules, fail2ban status, TLS cert expiry)
  • Backup compliance (last backup date, retention policy)
  • Data residency confirmation (server location, no PII in metrics)

Deliverable: Hardened fleet with automated rotation, audit trail, and compliance artifacts.


Timeline Summary

| Phase | Duration | Milestone |
|---|---|---|
| 0: Foundation | Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |

Total: ~16 weeks from foundation to fully hardened fleet.

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.


FOSS Stack Summary

Every component is Free and Open Source Software:

| Component | License | Role in stack |
|---|---|---|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |

No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.


Risk Register

| Risk | Impact | Mitigation |
|---|---|---|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | remote_write WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | serial: 1 for deploys, --check --diff before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run migrate deploy (applies pending only), never migrate dev in production |
| Instance disk full | Backup fails, containers crash | BackupStale + DiskSpaceLow alerts, retention cleanup |