changemaker.lite/bunker-ops/ROLLOUT_PLAN.md
2026-02-18 17:15:31 -07:00


# Bunker Ops — Staged Rollout Plan
Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.

---
## Current State (Completed)
### Phase 0: Foundation ✅
**Repo changes (v2 branch):**
- `INSTANCE_LABEL`, `BUNKER_OPS_ENABLED`, `BUNKER_OPS_REMOTE_WRITE_URL` env vars added
- Prometheus metrics tagged with `instance` label
- Redis-exporter auth fixed (correct container name + password)
- Backup script pushes metrics when Bunker Ops is enabled
- `docker-compose.override.yml` in `.gitignore`
**Ansible skeleton (`bunker-ops/`):**
- `ansible.cfg` — SSH pipelining, yaml callback, vault password path
- Inventory structure with example host_vars and group defaults
- 3 roles: `common` (OS/Docker/UFW), `changemaker` (full deploy), `monitoring` (Prometheus/remote_write)
- 5 playbooks: `deploy`, `upgrade`, `backup`, `configure`, `monitoring`
- 2 scripts: `bootstrap-vault.sh` (secret generation), `add-instance.sh` (instance scaffolding)
- `env.j2` template mapping all 100+ `.env` variables to Ansible vars
---
## Phase 1: First Managed Instance (Week 1-2)
**Goal:** Validate the full Ansible pipeline end-to-end on a single real instance.
### 1.1 Prepare a test server
- Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
- Set up SSH key access for a `deploy` user with passwordless sudo
- Ensure ports 80, 443, SSH are reachable
### 1.2 Scaffold the instance
```bash
cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass
./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1
```
### 1.3 Run the full deploy
```bash
ansible-playbook playbooks/deploy.yml --limit test-01
```
### 1.4 Validate
- [ ] All containers running (`docker compose ps`)
- [ ] API responds at `/api/health`
- [ ] Admin GUI loads and login works
- [ ] Prisma migrations applied cleanly
- [ ] Backup cron is installed (`crontab -l`)
- [ ] UFW is active with correct rules
- [ ] fail2ban is running
### 1.5 Test day-2 operations
- [ ] `configure.yml` — change a feature flag, verify API restarts
- [ ] `upgrade.yml` — make a Git commit, run upgrade, verify new code is live
- [ ] `backup.yml` — trigger backup, verify archive created
- [ ] Secret rotation — change Redis password in vault, reconfigure, verify connectivity
### 1.6 Fix and iterate
Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:
- Docker image pull timing
- Prisma migration ordering
- Directory permission edge cases
- OS-specific package availability

**Deliverable:** One fully Ansible-managed instance running in production.

---
## Phase 2: Pangolin Tunnel Integration (Week 2-3)
**Goal:** Automate the full Pangolin tunnel setup within Ansible.
### 2.1 Add Pangolin setup task
Create `roles/changemaker/tasks/pangolin.yml`:
- Call Pangolin API to create a site (if `cml_pangolin_api_url` is set)
- Store returned `PANGOLIN_SITE_ID`, `PANGOLIN_NEWT_ID`, `PANGOLIN_NEWT_SECRET` in vault
- Sync resource definitions from `configs/pangolin/resources.yml`
- Set all resources to "Not Protected"
- Restart the Newt container
This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
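A minimal sketch of what `pangolin.yml` could look like, built on `ansible.builtin.uri`. The endpoint paths, the `vault_pangolin_api_token` variable, and the response shape are placeholders, not the real Pangolin API — they only illustrate the create-if-missing pattern:

```yaml
# Illustrative only: Pangolin endpoint paths and response fields are assumed.
- name: Look up existing Pangolin sites (idempotency guard)
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"   # hypothetical listing endpoint
    headers:
      Authorization: "Bearer {{ vault_pangolin_api_token }}"
  register: pangolin_sites
  when: cml_pangolin_api_url is defined

- name: Create the site only if it does not exist yet
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
  when:
    - cml_pangolin_api_url is defined
    - pangolin_sites.json | selectattr('name', 'equalto', inventory_hostname)
      | list | length == 0

- name: Restart the Newt container
  community.docker.docker_container:
    name: newt
    restart: true
```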
### 2.2 Validate tunnel works
- [ ] Instance accessible via `https://app.<domain>` through Pangolin
- [ ] API accessible via `https://api.<domain>`
- [ ] All 12 subdomains route correctly
- [ ] CORS headers present
### 2.3 Idempotency
Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.

**Deliverable:** Single-command deployment from bare server to publicly accessible instance.

---
## Phase 3: Onboard Existing Instances (Week 3-4)
**Goal:** Migrate manually-installed instances to Ansible management.
### 3.1 Import strategy
For each existing instance that was set up with `config.sh`:
1. **Scaffold host_vars:**
```bash
./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
```
2. **Import existing secrets** from the server's `.env` into the vault:
```bash
# SSH in and extract current secrets:
ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"
# Copy into vault.yml (replace generated values with existing ones)
ansible-vault edit inventory/host_vars/<hostname>/vault.yml
```
3. **Test with `--check --diff`** first:
```bash
ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
```
This shows what `.env` lines would change without actually changing anything.
4. **Apply configuration management:**
```bash
ansible-playbook playbooks/configure.yml --limit <hostname>
```
### 3.2 Avoid disruption
- **Do NOT re-run the `common` role** on production servers that are already set up. Use `--tags env,deploy` to skip OS provisioning.
- **Do NOT re-run the seed** on instances with existing data. The seed task has `failed_when: false` for safety, but verify.
- **Backup first** — always run `playbooks/backup.yml` before importing an existing instance.
### 3.3 Instance inventory target
| Instance | Domain | Status | Tier |
|----------|--------|--------|------|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |
Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.

**Deliverable:** All existing production instances under Ansible management (Tier 1).

---
## Phase 4: Central Observability Server (Week 4-6)
**Goal:** Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.
### 4.1 Create `roles/bunker-ops/`
New role for the central server:
```
roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml
```
**Docker Compose stack:**
| Service | Image | Purpose |
|---------|-------|---------|
| VictoriaMetrics | `victoriametrics/victoria-metrics` | Receives `remote_write` from instances, 12-month retention |
| Grafana | `grafana/grafana` | Fleet dashboards, VM as datasource |
| Uptime Kuma | `louislam/uptime-kuma` | HTTP health monitors per instance |
| Nginx | `nginx:alpine` | TLS termination, auth on write endpoint |
**Key configuration:**
- VictoriaMetrics listens on `:8428` for both writes (`/api/v1/write`) and queries (`/api/v1/query`)
- Nginx authenticates `remote_write` requests with Bearer token
- Grafana auto-provisioned with VictoriaMetrics as default datasource
- Uptime Kuma monitors `https://api.<domain>/api/health` for each instance
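A condensed sketch of what the `docker-compose.yml.j2` for this stack might render to. Volume names, mount paths, and the provisioning directory are illustrative; the VictoriaMetrics flags (`-retentionPeriod` in months, `-httpListenAddr`) are real:

```yaml
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - -retentionPeriod=12        # months, per the 12-month retention target
      - -httpListenAddr=:8428
    volumes:
      - vm-data:/victoria-metrics-data
  grafana:
    image: grafana/grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
  nginx:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
volumes:
  vm-data:
  grafana-data:
  kuma-data:
```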
### 4.2 Create `playbooks/central.yml`
```yaml
- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops
```
### 4.3 Authentication for remote_write
- Generate a shared write token: `openssl rand -hex 32`
- Store in central server's Nginx config (validates incoming `Authorization: Bearer <token>`)
- Distribute same token to all Tier 2 instances via `vault_bunker_ops_remote_write_token`
- This ensures only authorized instances can push metrics
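One way to express the Bearer check in `nginx.conf.j2` (a sketch; the upstream name and the token variable are placeholders for whatever the role templates in):

```nginx
location /api/v1/write {
    # Reject remote_write pushes that don't carry the shared fleet token
    if ($http_authorization != "Bearer {{ bunker_ops_remote_write_token }}") {
        return 401;
    }
    proxy_pass http://victoriametrics:8428;
}
```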
### 4.4 Deploy and verify
```bash
ansible-playbook playbooks/central.yml
```
Verify:
- [ ] VictoriaMetrics accepts a test write in Prometheus text format: `curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1'` (the `/api/v1/write` endpoint itself expects snappy-compressed protobuf, so it can't be exercised with plain curl)
- [ ] Grafana accessible at `https://grafana.ops.bnkserve.org`
- [ ] Uptime Kuma accessible and monitoring test instance

**Deliverable:** Central server running VictoriaMetrics + Grafana + Uptime Kuma.

---
## Phase 5: Fleet Dashboards (Week 6-7)
**Goal:** Build three Grafana dashboards for fleet-wide visibility.
### 5.1 Fleet Overview Dashboard
File: `files/grafana/fleet-overview.json`
**Panels:**
- **Stat row:** Total instances up/down — `count(up{job="changemaker-v2-api"} == 1)`
- **Instance table:** All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
- **Time series — Canvass visits:** `sum(rate(cm_canvass_visits_total[5m])) by (instance)`
- **Time series — Emails sent:** `sum(rate(cm_emails_sent_total[5m])) by (instance)`
- **Time series — HTTP request rate:** `sum(rate(http_requests_total[5m])) by (instance)`
- **Gauge — Fleet email queue:** `sum(cm_email_queue_size) by (instance)`
**Variables:**
- `$instance` — Multi-select, populated from `label_values(up{job="changemaker-v2-api"}, instance)`
### 5.2 Instance Drill-Down Dashboard
File: `files/grafana/instance-drilldown.json`
**Variables:**
- `$instance` — Single-select
**Panel groups:**
- **Health:** API uptime, HTTP error rate, p50/p95/p99 latency
- **Influence:** Emails sent/failed, queue depth, response submissions
- **Canvass:** Active sessions, visits by outcome, shift signups
- **Geocoding:** Cache hit rate, request rate by provider, duration
- **System:** CPU usage, memory, disk I/O, network (from `node_*` metrics)
This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.
### 5.3 Backup Status Dashboard
File: `files/grafana/backup-status.json`
**Panels:**
- **Gauge — Time since last backup:** `time() - cm_backup_last_success_timestamp` per instance. Green < 24h, yellow 24-48h, red > 48h.
- **Table — Backup sizes:** `cm_backup_size_bytes` per instance with sparkline trend
- **Alert rule — BackupStale:** Fires when any instance hasn't backed up in 25 hours (1h grace past daily cron)
### 5.4 Auto-provisioning
Grafana dashboards auto-provisioned from JSON files via a `dashboards.yml` provisioner config, same pattern as the existing per-instance Grafana setup.
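The provisioner config can be as small as the following (the dashboard path is illustrative; the schema is Grafana's standard file provider):

```yaml
# /etc/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: bunker-ops
    folder: Fleet
    type: file
    options:
      path: /var/lib/grafana/dashboards   # where fleet-overview.json etc. land
```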
**Deliverable:** Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.

---
## Phase 6: Promote Instances to Tier 2 (Week 7-8)
**Goal:** Enable fleet observability on all managed instances.
### 6.1 For each instance
1. Update `host_vars/<hostname>/main.yml`:
```yaml
bunker_ops_enabled: true
bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
```
2. Add write token to `host_vars/<hostname>/vault.yml`:
```yaml
vault_bunker_ops_remote_write_token: "<shared-token>"
```
3. Apply:
```bash
ansible-playbook playbooks/monitoring.yml --limit <hostname>
```
### 6.2 Verify data flow
- Check VictoriaMetrics for incoming data: `curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'`
- Check Grafana fleet overview shows the new instance
- Verify backup metrics appear after next backup run
### 6.3 Bandwidth audit
Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.
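These figures can be sanity-checked with quick shell arithmetic. The ~60 bytes per sample is an assumed average for on-the-wire size after protocol overhead, not a measurement:

```shell
# Rough bandwidth estimate: 50 series scraped every 15s, ~60 B/sample assumed
series=50
interval=15          # seconds between samples
bytes_per_sample=60  # assumption, not measured
instances=10

samples_per_min=$(( series * 60 / interval ))                       # 200
mb_per_day=$(( samples_per_min * bytes_per_sample * 1440 / 1000000 ))
echo "per instance: ${mb_per_day} MB/day; fleet: $(( mb_per_day * instances )) MB/day"
```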
**Deliverable:** All instances reporting to central dashboards.

---
## Phase 7: Alerting & Notifications (Week 8-9)
**Goal:** Central alerting for fleet-wide issues.
### 7.1 Alert rules on central VictoriaMetrics
Create `roles/bunker-ops/templates/alerts.yml.j2`:
| Alert | Condition | Severity |
|-------|-----------|----------|
| `InstanceDown` | `up{job="changemaker-v2-api"} == 0` for 5m | critical |
| `HighErrorRate` | `rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1` | warning |
| `EmailQueueBacklog` | `cm_email_queue_size > 100` for 15m | warning |
| `BackupStale` | `time() - cm_backup_last_success_timestamp > 90000` (25h) | critical |
| `DiskSpaceLow` | `node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1` | critical |
| `HighMemoryUsage` | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1` for 10m | warning |
| `CanvassSessionAbandoned` | `cm_active_canvass_sessions > 20` for 1h | info |
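In vmalert's rule format (the same schema as Prometheus rule files), the first and fourth rows of the table look roughly like the sketch below. Note that because this lives in a `.j2` template, the `{{ $labels.* }}` annotations need `{% raw %}` guards so Ansible's Jinja pass doesn't consume them:

```yaml
groups:
  - name: fleet
    rules:
      - alert: InstanceDown
        expr: up{job="changemaker-v2-api"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} API is down"
      - alert: BackupStale
        expr: time() - cm_backup_last_success_timestamp > 90000
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has not backed up in 25h"
```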
### 7.2 Notification channels
Central Alertmanager routes alerts to:
- **Gotify** — Push notifications to admin phone
- **Email** — Summary digests to fleet admin email
- **Webhook** — Optional Rocket.Chat / Slack integration
### 7.3 Silence rules
- Suppress `InstanceDown` during planned maintenance windows
- Group alerts by instance to avoid notification storms
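Grouping by instance maps onto Alertmanager's route config roughly as follows (receiver names are placeholders for whatever Gotify/email receivers get defined):

```yaml
route:
  group_by: [instance]      # one notification per instance, not per alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: gotify
  routes:
    - matchers: ['severity="critical"']
      receiver: gotify
    - matchers: ['severity="warning"']
      receiver: email-digest
```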
**Deliverable:** Automated alerts for instance health, backups, and resource exhaustion.

---
## Phase 8: Upgrade Automation & CI (Week 9-11)
**Goal:** Streamline the upgrade pipeline.
### 8.1 Gitea webhook → n8n → Ansible
When a new commit is pushed to the `v2` branch on the central Gitea:
1. **Gitea** fires a webhook to **n8n**
2. **n8n** workflow triggers `ansible-playbook playbooks/upgrade.yml`
3. Rolling upgrade proceeds (25% batches)
4. Health checks gate each batch
5. n8n sends a summary notification
### 8.2 Canary deployment
Add a `canary` group to inventory:
```yaml
all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...
```
New `playbooks/canary-upgrade.yml`:
1. Upgrade canary instance first
2. Wait 30 minutes
3. Run health checks
4. If healthy, proceed with `upgrade.yml` on remaining instances
5. If unhealthy, alert and stop
### 8.3 Rollback playbook
Create `playbooks/rollback.yml`:
- `git checkout <previous-tag>` on the instance
- `docker compose up -d --build`
- Run health checks
- Requires knowing the previous good commit (store in a fact file per host)
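A sketch of what `rollback.yml` could look like under these assumptions — the fact-file path `/opt/changemaker-lite/.last_good_commit` (written by the upgrade playbook) and the health-check URL/port are placeholders:

```yaml
- name: Roll back to last known-good commit
  hosts: changemaker_instances
  become: true
  serial: 1
  tasks:
    - name: Read the commit recorded during the last healthy upgrade
      ansible.builtin.slurp:
        src: /opt/changemaker-lite/.last_good_commit   # assumed fact file
      register: last_good

    - name: Check out that commit
      ansible.builtin.command:
        cmd: "git checkout {{ last_good.content | b64decode | trim }}"
        chdir: /opt/changemaker-lite

    - name: Rebuild and restart the stack
      ansible.builtin.command:
        cmd: docker compose up -d --build
        chdir: /opt/changemaker-lite

    - name: Wait for the API to come back
      ansible.builtin.uri:
        url: "http://localhost:3000/api/health"   # placeholder port
      register: health
      retries: 10
      delay: 15
      until: health.status == 200
```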
**Deliverable:** Semi-automated upgrade pipeline with canary gates and rollback capability.

---
## Phase 9: Self-Service Instance Provisioning (Week 11-13)
**Goal:** Enable clients to request and receive a new instance with minimal operator intervention.
### 9.1 Provisioning API
Build a lightweight FastAPI or Express service on the central server:
**Endpoints:**
- `POST /api/instances` — Create a new instance (accepts domain, features, tier)
- `GET /api/instances` — List all instances with status
- `GET /api/instances/:id/status` — Health + metrics summary
- `DELETE /api/instances/:id` — Decommission
**Workflow:**
1. API receives request with domain, SSH host, feature flags
2. Runs `add-instance.sh` to scaffold host_vars
3. Triggers `ansible-playbook playbooks/deploy.yml --limit <hostname>`
4. Monitors deployment progress
5. Returns status when deployment completes
### 9.2 Fleet admin dashboard
A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):
- Instance list with health status
- One-click upgrade, backup, configure
- New instance wizard
- Grafana iframe embeds for metrics
### 9.3 DNS automation
If using Pangolin for all instances:
- Pangolin handles DNS + TLS automatically
- The provisioning API creates Pangolin resources as part of deploy
If using Cloudflare or other DNS:
- Add a `roles/dns/` role with Cloudflare API integration
- Automatically create A/CNAME records for all subdomains

**Deliverable:** Operator can provision a new instance with a single API call or form submission.

---
## Phase 10: Multi-Tenant Hardening (Week 13-16)
**Goal:** Security and isolation for a fleet of independent client instances.
### 10.1 Network isolation
Each instance runs on its own server — already isolated at the OS level. Additional hardening:
- UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
- No inter-instance SSH access
- Central server can SSH to instances, not vice versa
### 10.2 Secret rotation schedule
Automate periodic secret rotation:
| Secret | Rotation frequency | Method |
|--------|-------------------|--------|
| JWT access secret | Quarterly | vault edit + configure playbook |
| Database passwords | Annually | vault edit + full redeploy |
| Redis password | Annually | vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |
Create a `playbooks/rotate-secrets.yml` that generates new secrets and applies them.
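One possible shape for the rotation playbook, shown for the Redis password. This sketch stops short of rewriting the vault file in place: it prints an `ansible-vault encrypt_string` block for the operator to paste into `vault.yml` before running `configure.yml`. Variable names follow the conventions used elsewhere in this plan:

```yaml
- name: Rotate the Redis password for one instance
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Generate a fresh secret
      ansible.builtin.set_fact:
        new_password: "{{ lookup('password', '/dev/null length=32 chars=ascii_letters,digits') }}"

    - name: Produce an encrypted block for the host's vault.yml
      ansible.builtin.command: >-
        ansible-vault encrypt_string --name vault_redis_password
        {{ new_password | quote }}
      register: encrypted

    - name: Show the block to paste, then run configure.yml --limit <hostname>
      ansible.builtin.debug:
        msg: "{{ encrypted.stdout }}"
```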
### 10.3 Audit logging
- Ansible logs all operations to a central log file
- Each playbook run produces a summary (host, timestamp, changes made)
- Integrate with Git: all inventory changes are committed to a private repo
### 10.4 Compliance documentation
For each instance, Ansible can generate:
- Inventory of services and versions
- Security posture report (UFW rules, fail2ban status, TLS cert expiry)
- Backup compliance (last backup date, retention policy)
- Data residency confirmation (server location, no PII in metrics)

**Deliverable:** Hardened fleet with automated rotation, audit trail, and compliance artifacts.

---
## Timeline Summary
| Phase | Duration | Milestone |
|-------|----------|-----------|
| 0: Foundation | ✅ Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |
**Total: ~16 weeks from foundation to fully hardened fleet.**

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 are operational maturity.

---
## FOSS Stack Summary
Every component is Free and Open Source Software:
| Component | License | Role in Stack |
|-----------|---------|---------------|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |
No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.

---
## Risk Register
| Risk | Impact | Mitigation |
|------|--------|------------|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | `remote_write` WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | `serial: 1` for deploys, `--check --diff` before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run `migrate deploy` (applies pending only), never `migrate dev` in production |
| Instance disk full | Backup fails, containers crash | `BackupStale` + `DiskSpaceLow` alerts, retention cleanup |