# Bunker Ops — Staged Rollout Plan

Full plan for rolling out the fleet management and observability system across Changemaker Lite instances.

---
## Current State (Completed)

### Phase 0: Foundation ✅

**Repo changes (v2 branch):**

- `INSTANCE_LABEL`, `BUNKER_OPS_ENABLED`, `BUNKER_OPS_REMOTE_WRITE_URL` env vars added
- Prometheus metrics tagged with `instance` label
- Redis-exporter auth fixed (correct container name + password)
- Backup script pushes metrics when Bunker Ops is enabled
- `docker-compose.override.yml` in `.gitignore`

**Ansible skeleton (`bunker-ops/`):**

- `ansible.cfg` — SSH pipelining, yaml callback, vault password path
- Inventory structure with example host_vars and group defaults
- 3 roles: `common` (OS/Docker/UFW), `changemaker` (full deploy), `monitoring` (Prometheus/remote_write)
- 5 playbooks: `deploy`, `upgrade`, `backup`, `configure`, `monitoring`
- 2 scripts: `bootstrap-vault.sh` (secret generation), `add-instance.sh` (instance scaffolding)
- `env.j2` template mapping all 100+ `.env` variables to Ansible vars

---
## Phase 1: First Managed Instance (Week 1-2)

**Goal:** Validate the full Ansible pipeline end-to-end on a single real instance.

### 1.1 Prepare a test server

- Provision a fresh Ubuntu 24.04 VM (e.g., a low-cost VPS or local Proxmox VM)
- Set up SSH key access for a `deploy` user with passwordless sudo
- Ensure ports 80, 443, and 22 (SSH) are reachable

### 1.2 Scaffold the instance

```bash
cd bunker-ops
echo "$(openssl rand -base64 32)" > .vault_pass
chmod 600 .vault_pass

./scripts/add-instance.sh test-01 test.cmlite.org <server-ip> --tier 1
```

### 1.3 Run the full deploy

```bash
ansible-playbook playbooks/deploy.yml --limit test-01
```

### 1.4 Validate

- [ ] All containers running (`docker compose ps`)
- [ ] API responds at `/api/health`
- [ ] Admin GUI loads and login works
- [ ] Prisma migrations applied cleanly
- [ ] Backup cron is installed (`crontab -l`)
- [ ] UFW is active with correct rules
- [ ] fail2ban is running

### 1.5 Test day-2 operations

- [ ] `configure.yml` — change a feature flag, verify API restarts
- [ ] `upgrade.yml` — make a Git commit, run upgrade, verify new code is live
- [ ] `backup.yml` — trigger backup, verify archive created
- [ ] Secret rotation — change Redis password in vault, reconfigure, verify connectivity

### 1.6 Fix and iterate

Document anything that fails. Update roles, templates, and defaults. The Ansible skeleton is a starting framework — real deployments will surface edge cases in:

- Docker image pull timing
- Prisma migration ordering
- Directory permission edge cases
- OS-specific package availability

**Deliverable:** One fully Ansible-managed instance running in production.

---
## Phase 2: Pangolin Tunnel Integration (Week 2-3)

**Goal:** Automate the full Pangolin tunnel setup within Ansible.

### 2.1 Add Pangolin setup task

Create `roles/changemaker/tasks/pangolin.yml`:

- Call the Pangolin API to create a site (if `cml_pangolin_api_url` is set)
- Store the returned `PANGOLIN_SITE_ID`, `PANGOLIN_NEWT_ID`, `PANGOLIN_NEWT_SECRET` in the vault
- Sync resource definitions from `configs/pangolin/resources.yml`
- Set all resources to "Not Protected"
- Restart the Newt container

This replaces the manual Pangolin setup flow that currently lives in the admin GUI.
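
A minimal sketch of what `pangolin.yml` could look like. The API path and payload are illustrative placeholders rather than the documented Pangolin API, and `cml_pangolin_api_token` and `cml_domain` are hypothetical variables used only for this example:

```yaml
# roles/changemaker/tasks/pangolin.yml — sketch only.
# Endpoint path and body fields are placeholders; adjust to the real Pangolin API.
- name: Create Pangolin site for this instance
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    headers:
      Authorization: "Bearer {{ cml_pangolin_api_token }}"   # hypothetical variable
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
      domain: "{{ cml_domain }}"                              # hypothetical variable
    status_code: [200, 201]
  register: pangolin_site
  when: cml_pangolin_api_url | default('') | length > 0

- name: Restart the Newt tunnel container to pick up new credentials
  ansible.builtin.command: docker compose restart newt
  args:
    chdir: /opt/changemaker-lite
  when: pangolin_site is not skipped
```
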
### 2.2 Validate tunnel works

- [ ] Instance accessible via `https://app.<domain>` through Pangolin
- [ ] API accessible via `https://api.<domain>`
- [ ] All 12 subdomains route correctly
- [ ] CORS headers present

### 2.3 Idempotency

Ensure re-running the playbook doesn't duplicate Pangolin resources. The task should check for existing site/resources before creating new ones.
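
A sketch of that guard, assuming a hypothetical `GET .../sites` listing endpoint that returns a JSON array with a `name` field (the real path and response shape may differ):

```yaml
# Idempotency guard — look up existing sites before creating one (illustrative).
- name: Look up existing Pangolin sites
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    headers:
      Authorization: "Bearer {{ cml_pangolin_api_token }}"   # hypothetical variable
    return_content: true
  register: pangolin_sites

- name: Create the site only if it does not already exist
  ansible.builtin.uri:
    url: "{{ cml_pangolin_api_url }}/sites"
    method: POST
    headers:
      Authorization: "Bearer {{ cml_pangolin_api_token }}"
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
    status_code: [200, 201]
  when: >-
    pangolin_sites.json
    | selectattr('name', 'equalto', inventory_hostname)
    | list | length == 0
```
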
**Deliverable:** Single-command deployment from bare server to publicly accessible instance.

---
## Phase 3: Onboard Existing Instances (Week 3-4)

**Goal:** Migrate manually-installed instances to Ansible management.

### 3.1 Import strategy

For each existing instance that was set up with `config.sh`:

1. **Scaffold host_vars:**

   ```bash
   ./scripts/add-instance.sh <hostname> <domain> <ip> --tier 1
   ```

2. **Import existing secrets** from the server's `.env` into the vault:

   ```bash
   # SSH in and extract current secrets:
   ssh deploy@<ip> "grep -E '(PASSWORD|SECRET|KEY|TOKEN)' /opt/changemaker-lite/.env"

   # Copy into vault.yml (replace generated values with existing ones)
   ansible-vault edit inventory/host_vars/<hostname>/vault.yml
   ```

3. **Test with `--check --diff`** first:

   ```bash
   ansible-playbook playbooks/configure.yml --limit <hostname> --check --diff
   ```

   This shows what `.env` lines would change without actually changing anything.

4. **Apply configuration management:**

   ```bash
   ansible-playbook playbooks/configure.yml --limit <hostname>
   ```

### 3.2 Avoid disruption

- **Do NOT re-run the `common` role** on production servers that are already set up. Use `--tags env,deploy` to skip OS provisioning.
- **Do NOT re-run the seed** on instances with existing data. The seed task has `failed_when: false` for safety, but verify.
- **Backup first** — always run `playbooks/backup.yml` before importing an existing instance.

### 3.3 Instance inventory target

| Instance | Domain | Status | Tier |
|----------|--------|--------|------|
| test-01 | test.cmlite.org | Phase 1 deploy | 1 |
| edmonton-prod | betteredmonton.org | Import from config.sh | 1 |
| ... | ... | ... | ... |

Populate this table as instances are onboarded. Aim for 3-5 instances managed by end of Phase 3.

**Deliverable:** All existing production instances under Ansible management (Tier 1).

---
## Phase 4: Central Observability Server (Week 4-6)

**Goal:** Deploy the Bunker Ops central server with VictoriaMetrics, Grafana, and Uptime Kuma.

### 4.1 Create `roles/bunker-ops/`

New role for the central server:

```
roles/bunker-ops/
├── tasks/main.yml
├── templates/
│   ├── docker-compose.yml.j2
│   └── nginx.conf.j2
├── defaults/main.yml
└── handlers/main.yml
```

**Docker Compose stack:**

| Service | Image | Purpose |
|---------|-------|---------|
| VictoriaMetrics | `victoriametrics/victoria-metrics` | Receives `remote_write` from instances, 12-month retention |
| Grafana | `grafana/grafana` | Fleet dashboards, VM as datasource |
| Uptime Kuma | `louislam/uptime-kuma` | HTTP health monitors per instance |
| Nginx | `nginx:alpine` | TLS termination, auth on write endpoint |

**Key configuration:**

- VictoriaMetrics listens on `:8428`; instances push to `/api/v1/write` and Grafana queries the Prometheus-compatible API on the same port
- Nginx authenticates `remote_write` requests with a Bearer token
- Grafana auto-provisioned with VictoriaMetrics as the default datasource
- Uptime Kuma monitors `https://api.<domain>/api/health` for each instance
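
A minimal sketch of `docker-compose.yml.j2` for this stack; the ports, volume names, and mount paths are assumptions for illustration, not settled configuration:

```yaml
# Sketch of roles/bunker-ops/templates/docker-compose.yml.j2 (values illustrative).
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - "-retentionPeriod=12"        # 12 months of retention
      - "-httpListenAddr=:8428"
    volumes:
      - vm-data:/victoria-metrics-data
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped

  uptime-kuma:
    image: louislam/uptime-kuma
    volumes:
      - kuma-data:/app/data
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"                    # TLS termination and Bearer-token auth
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
    depends_on:
      - victoriametrics
      - grafana
    restart: unless-stopped

volumes:
  vm-data:
  grafana-data:
  kuma-data:
```
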
### 4.2 Create `playbooks/central.yml`

```yaml
- name: Deploy Bunker Ops Central
  hosts: bunker_ops_central
  become: true
  roles:
    - common
    - bunker-ops
```

### 4.3 Authentication for remote_write

- Generate a shared write token: `openssl rand -hex 32`
- Store in central server's Nginx config (validates incoming `Authorization: Bearer <token>`)
- Distribute same token to all Tier 2 instances via `vault_bunker_ops_remote_write_token`
- This ensures only authorized instances can push metrics
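
On the instance side, the per-instance Prometheus presents this token in its `remote_write` block; a minimal sketch of that client configuration (the URL matches the central server named elsewhere in this plan):

```yaml
# Instance prometheus.yml excerpt — remote_write presenting the shared Bearer token.
remote_write:
  - url: https://ops.bnkserve.org/api/v1/write
    authorization:
      type: Bearer
      credentials: "<vault_bunker_ops_remote_write_token>"
```
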
### 4.4 Deploy and verify

```bash
ansible-playbook playbooks/central.yml
```

Verify:

- [ ] VictoriaMetrics accepts a test write: `curl -X POST 'https://ops.bnkserve.org/api/v1/import/prometheus' -H 'Authorization: Bearer <token>' --data-binary 'test_metric{instance="test"} 1'` (the plain-text import endpoint; `/api/v1/write` itself expects the binary Prometheus remote_write format)
- [ ] Grafana accessible at `https://grafana.ops.bnkserve.org`
- [ ] Uptime Kuma accessible and monitoring the test instance

**Deliverable:** Central server running VictoriaMetrics + Grafana + Uptime Kuma.

---
## Phase 5: Fleet Dashboards (Week 6-7)

**Goal:** Build three Grafana dashboards for fleet-wide visibility.

### 5.1 Fleet Overview Dashboard

File: `files/grafana/fleet-overview.json`

**Panels:**

- **Stat row:** Total instances up/down — `count(up{job="changemaker-v2-api"} == 1)`
- **Instance table:** All instances with columns for status, p95 latency, email queue depth, active canvass sessions, last backup age
- **Time series — Canvass visits:** `sum(rate(cm_canvass_visits_total[5m])) by (instance)`
- **Time series — Emails sent:** `sum(rate(cm_emails_sent_total[5m])) by (instance)`
- **Time series — HTTP request rate:** `sum(rate(http_requests_total[5m])) by (instance)`
- **Gauge — Fleet email queue:** `sum(cm_email_queue_size) by (instance)`

**Variables:**

- `$instance` — Multi-select, populated from `label_values(up{job="changemaker-v2-api"}, instance)`

### 5.2 Instance Drill-Down Dashboard

File: `files/grafana/instance-drilldown.json`

**Variables:**

- `$instance` — Single-select

**Panel groups:**

- **Health:** API uptime, HTTP error rate, p50/p95/p99 latency
- **Influence:** Emails sent/failed, queue depth, response submissions
- **Canvass:** Active sessions, visits by outcome, shift signups
- **Geocoding:** Cache hit rate, request rate by provider, duration
- **System:** CPU usage, memory, disk I/O, network (from `node_*` metrics)

This mirrors the existing per-instance Grafana dashboards but sources data from VictoriaMetrics.

### 5.3 Backup Status Dashboard

File: `files/grafana/backup-status.json`

**Panels:**

- **Gauge — Time since last backup:** `time() - cm_backup_last_success_timestamp` per instance. Green < 24h, yellow < 48h, red > 48h.
- **Table — Backup sizes:** `cm_backup_size_bytes` per instance with sparkline trend
- **Alert rule — BackupStale:** Fires when any instance hasn't backed up in 25 hours (1h grace past the daily cron)

### 5.4 Auto-provisioning

Grafana dashboards are auto-provisioned from JSON files via a `dashboards.yml` provisioner config, same pattern as the existing per-instance Grafana setup.
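
A minimal provisioner sketch, assuming the dashboard JSON files are mounted at `/etc/grafana/provisioning/dashboards/json` (the path and provider name are assumptions of this example):

```yaml
# grafana/provisioning/dashboards/dashboards.yml — file-based dashboard provisioning.
apiVersion: 1
providers:
  - name: bunker-ops-fleet
    folder: Fleet
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json
```
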
**Deliverable:** Three operational Grafana dashboards showing fleet health, per-instance detail, and backup status.

---
## Phase 6: Promote Instances to Tier 2 (Week 7-8)

**Goal:** Enable fleet observability on all managed instances.

### 6.1 For each instance

1. Update `host_vars/<hostname>/main.yml`:

   ```yaml
   bunker_ops_enabled: true
   bunker_ops_remote_write_url: "https://ops.bnkserve.org/api/v1/write"
   ```

2. Add the write token to `host_vars/<hostname>/vault.yml`:

   ```yaml
   vault_bunker_ops_remote_write_token: "<shared-token>"
   ```

3. Apply:

   ```bash
   ansible-playbook playbooks/monitoring.yml --limit <hostname>
   ```

### 6.2 Verify data flow

- Check VictoriaMetrics for incoming data: `curl 'https://ops.bnkserve.org/api/v1/query?query=up{instance="<domain>"}'`
- Check that the Grafana fleet overview shows the new instance
- Verify backup metrics appear after the next backup run

### 6.3 Bandwidth audit

Each instance sends ~50 time series at 15s intervals ≈ 200 samples/minute ≈ 12KB/min ≈ 17MB/day. With 10 instances: ~170MB/day. VictoriaMetrics compresses efficiently — expect ~2GB/month total storage for a 10-instance fleet.

**Deliverable:** All instances reporting to central dashboards.

---
## Phase 7: Alerting & Notifications (Week 8-9)

**Goal:** Central alerting for fleet-wide issues.

### 7.1 Alert rules on central VictoriaMetrics

Create `roles/bunker-ops/templates/alerts.yml.j2` (a sketch follows the table):

| Alert | Condition | Severity |
|-------|-----------|----------|
| `InstanceDown` | `up{job="changemaker-v2-api"} == 0` for 5m | critical |
| `HighErrorRate` | `rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1` | warning |
| `EmailQueueBacklog` | `cm_email_queue_size > 100` for 15m | warning |
| `BackupStale` | `time() - cm_backup_last_success_timestamp > 90000` (25h) | critical |
| `DiskSpaceLow` | `node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1` | critical |
| `HighMemoryUsage` | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1` for 10m | warning |
| `CanvassSessionAbandoned` | `cm_active_canvass_sessions > 20` for 1h | info |
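
In Prometheus-compatible rule syntax (evaluated by vmalert or Prometheus against VictoriaMetrics), the first and fourth rows might look like the excerpt below; the remaining rules follow the same pattern. Annotations and Jinja escaping inside the `.j2` template are omitted here:

```yaml
# Excerpt of alerts.yml.j2 — two of the rules above in Prometheus rule syntax.
groups:
  - name: fleet
    rules:
      - alert: InstanceDown
        expr: up{job="changemaker-v2-api"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: BackupStale
        expr: time() - cm_backup_last_success_timestamp > 90000   # 25h
        labels:
          severity: critical
```
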
### 7.2 Notification channels

Central Alertmanager routes alerts to:

- **Gotify** — Push notifications to admin phone
- **Email** — Summary digests to fleet admin email
- **Webhook** — Optional Rocket.Chat / Slack integration

### 7.3 Silence rules

- Suppress `InstanceDown` during planned maintenance windows
- Group alerts by instance to avoid notification storms
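
A sketch of the Alertmanager routing block covering both points; the receiver, webhook URL, and maintenance window shown are placeholders for this example:

```yaml
# alertmanager.yml excerpt — group by instance, mute during maintenance windows.
route:
  receiver: fleet-admin
  group_by: ['instance']           # one notification per instance, not per alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['alertname="InstanceDown"']
      mute_time_intervals: [planned-maintenance]
      receiver: fleet-admin

time_intervals:
  - name: planned-maintenance      # example window: Sunday 02:00-04:00
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: "02:00"
            end_time: "04:00"

receivers:
  - name: fleet-admin
    webhook_configs:
      - url: http://notify-bridge:8080/alert   # placeholder; e.g. a Gotify/chat bridge
```
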
**Deliverable:** Automated alerts for instance health, backups, and resource exhaustion.

---
## Phase 8: Upgrade Automation & CI (Week 9-11)

**Goal:** Streamline the upgrade pipeline.

### 8.1 Gitea webhook → n8n → Ansible

When a new commit is pushed to the `v2` branch on the central Gitea:

1. **Gitea** fires a webhook to **n8n**
2. The **n8n** workflow triggers `ansible-playbook playbooks/upgrade.yml`
3. The rolling upgrade proceeds in 25% batches (see the sketch after this list)
4. Health checks gate each batch
5. n8n sends a summary notification
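
A sketch of how `upgrade.yml` can express the 25% batches and the health gate; the play structure, API port, and git step here are assumptions, not the existing playbook:

```yaml
# Sketch of the rolling-upgrade play — 25% of hosts per batch, stop on any failure.
- name: Rolling upgrade of Changemaker instances
  hosts: changemaker_instances
  become: true
  serial: "25%"                  # upgrade a quarter of the fleet at a time
  max_fail_percentage: 0         # abort remaining batches if any host fails
  tasks:
    - name: Pull the latest v2 branch
      ansible.builtin.command: git pull origin v2
      args:
        chdir: /opt/changemaker-lite

    - name: Rebuild and restart the stack
      ansible.builtin.command: docker compose up -d --build
      args:
        chdir: /opt/changemaker-lite

    - name: Gate the batch on the API health check
      ansible.builtin.uri:
        url: "http://localhost:3000/api/health"   # port is an assumption
        status_code: 200
      register: health
      retries: 10
      delay: 15
      until: health.status == 200
```
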
### 8.2 Canary deployment

Add a `canary` group to inventory:

```yaml
all:
  children:
    canary:
      hosts:
        test-01:
    changemaker_instances:
      hosts:
        edmonton-prod:
        calgary-prod:
        ...
```

New `playbooks/canary-upgrade.yml`:

1. Upgrade the canary instance first
2. Wait 30 minutes
3. Run health checks
4. If healthy, proceed with `upgrade.yml` on the remaining instances
5. If unhealthy, alert and stop

### 8.3 Rollback playbook

Create `playbooks/rollback.yml` (sketch below):

- `git checkout <previous-tag>` on the instance
- `docker compose up -d --build`
- Run health checks
- Requires knowing the previous good commit (store it in a fact file per host)
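
A rollback sketch along those lines; the fact-file path, API port, and `target` variable are placeholders, and the last known-good ref is assumed to have been recorded by the upgrade play:

```yaml
# Sketch of playbooks/rollback.yml — roll an instance back to its last known-good ref.
- name: Roll back a Changemaker instance
  hosts: "{{ target | default('changemaker_instances') }}"
  become: true
  serial: 1
  tasks:
    - name: Read the last known-good ref recorded by the upgrade play
      ansible.builtin.slurp:
        src: /opt/changemaker-lite/.last_good_ref    # placeholder path
      register: last_good

    - name: Check out the previous good ref
      ansible.builtin.command: "git checkout {{ last_good.content | b64decode | trim }}"
      args:
        chdir: /opt/changemaker-lite

    - name: Rebuild and restart the stack
      ansible.builtin.command: docker compose up -d --build
      args:
        chdir: /opt/changemaker-lite

    - name: Confirm the API is healthy
      ansible.builtin.uri:
        url: "http://localhost:3000/api/health"       # port is an assumption
        status_code: 200
      register: health
      retries: 10
      delay: 15
      until: health.status == 200
```
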
**Deliverable:** Semi-automated upgrade pipeline with canary gates and rollback capability.

---
## Phase 9: Self-Service Instance Provisioning (Week 11-13)

**Goal:** Enable clients to request and receive a new instance with minimal operator intervention.

### 9.1 Provisioning API

Build a lightweight FastAPI or Express service on the central server:

**Endpoints:**

- `POST /api/instances` — Create a new instance (accepts domain, features, tier)
- `GET /api/instances` — List all instances with status
- `GET /api/instances/:id/status` — Health + metrics summary
- `DELETE /api/instances/:id` — Decommission

**Workflow:**

1. API receives request with domain, SSH host, feature flags
2. Runs `add-instance.sh` to scaffold host_vars
3. Triggers `ansible-playbook playbooks/deploy.yml --limit <hostname>`
4. Monitors deployment progress
5. Returns status when deployment completes
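
From the operator's side, provisioning then reduces to a single call; the host, auth header, and request body fields below are placeholders for whatever the service ends up accepting:

```bash
# Example request against the provisioning API (host, token, and fields are placeholders).
curl -X POST https://ops.bnkserve.org/api/instances \
  -H 'Authorization: Bearer <admin-token>' \
  -H 'Content-Type: application/json' \
  -d '{"domain": "newclient.example.org", "tier": 1, "features": {"canvass": true}}'
```
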
### 9.2 Fleet admin dashboard

A simple web UI (could be a dedicated page in the central Grafana or a standalone React app):

- Instance list with health status
- One-click upgrade, backup, configure
- New instance wizard
- Grafana iframe embeds for metrics

### 9.3 DNS automation

If using Pangolin for all instances:

- Pangolin handles DNS + TLS automatically
- The provisioning API creates Pangolin resources as part of deploy

If using Cloudflare or other DNS:

- Add a `roles/dns/` role with Cloudflare API integration
- Automatically create A/CNAME records for all subdomains

**Deliverable:** Operator can provision a new instance with a single API call or form submission.

---
## Phase 10: Multi-Tenant Hardening (Week 13-16)

**Goal:** Security and isolation for a fleet of independent client instances.

### 10.1 Network isolation

Each instance runs on its own server — already isolated at the OS level. Additional hardening:

- UFW rules restrict outbound to essential services only (Docker Hub, Git, SMTP, Pangolin, VictoriaMetrics)
- No inter-instance SSH access
- Central server can SSH to instances, not vice versa

### 10.2 Secret rotation schedule

Automate periodic secret rotation:

| Secret | Rotation frequency | Method |
|--------|--------------------|--------|
| JWT access secret | Quarterly | vault edit + configure playbook |
| Database passwords | Annually | vault edit + full redeploy |
| Redis password | Annually | vault edit + configure playbook |
| Pangolin tokens | On-demand | Re-run Pangolin setup |
| Remote write token | Annually | Update central + all instances |

Create a `playbooks/rotate-secrets.yml` that generates new secrets and applies them.
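
Until that playbook exists, the "vault edit + configure playbook" method from the table is two commands per instance (the vault variable name in the comment is illustrative):

```bash
# Rotate a secret by hand: replace the value in the vault, then re-apply config.
ansible-vault edit inventory/host_vars/<hostname>/vault.yml   # e.g. set a new vault_redis_password
ansible-playbook playbooks/configure.yml --limit <hostname>
```
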
### 10.3 Audit logging

- Ansible logs all operations to a central log file
- Each playbook run produces a summary (host, timestamp, changes made)
- Integrate with Git: all inventory changes are committed to a private repo

### 10.4 Compliance documentation

For each instance, Ansible can generate:

- Inventory of services and versions
- Security posture report (UFW rules, fail2ban status, TLS cert expiry)
- Backup compliance (last backup date, retention policy)
- Data residency confirmation (server location, no PII in metrics)

**Deliverable:** Hardened fleet with automated rotation, audit trail, and compliance artifacts.

---
## Timeline Summary

| Phase | Duration | Milestone |
|-------|----------|-----------|
| 0: Foundation | ✅ Done | Ansible skeleton + repo changes |
| 1: First instance | Week 1-2 | End-to-end deploy validated |
| 2: Pangolin integration | Week 2-3 | Single-command public deployment |
| 3: Import existing | Week 3-4 | All instances under management |
| 4: Central server | Week 4-6 | VictoriaMetrics + Grafana running |
| 5: Fleet dashboards | Week 6-7 | 3 operational dashboards |
| 6: Tier 2 promotion | Week 7-8 | All instances reporting centrally |
| 7: Alerting | Week 8-9 | Automated health + backup alerts |
| 8: CI/Upgrade automation | Week 9-11 | Canary + rolling upgrades |
| 9: Self-service | Week 11-13 | Provisioning API + admin UI |
| 10: Multi-tenant hardening | Week 13-16 | Rotation, audit, compliance |

**Total: ~16 weeks from foundation to fully hardened fleet.**

Phases 1-3 are the critical path — they validate the core pipeline and bring existing instances under management. Phases 4-7 add observability. Phases 8-10 add operational maturity.

---
## FOSS Stack Summary

Every component is Free and Open Source Software:

| Component | License | Role in Stack |
|-----------|---------|---------------|
| Ansible | GPL-3.0 | Deployment automation & configuration management |
| VictoriaMetrics | Apache-2.0 | Centralized time-series database (Prometheus-compatible) |
| Grafana | AGPL-3.0 | Fleet dashboards & visualization |
| Uptime Kuma | MIT | HTTP health monitoring |
| Prometheus | Apache-2.0 | Per-instance metrics collection (existing) |
| Alertmanager | Apache-2.0 | Alert routing & deduplication |
| Docker + Compose | Apache-2.0 | Container orchestration |
| Ubuntu | Various FOSS | Host operating system |
| UFW / iptables | GPL | Firewall |
| fail2ban | GPL-2.0 | Brute-force protection |
| OpenSSL | Apache-2.0 | Secret generation |

No proprietary SaaS dependencies. The entire fleet can run air-gapped after initial image pulls.

---
## Risk Register

| Risk | Impact | Mitigation |
|------|--------|------------|
| Vault password lost | Cannot decrypt any secrets | Store in password manager + offline backup |
| Central server down | No fleet dashboards (instances unaffected) | `remote_write` WAL retries for ~2h; instances self-sufficient |
| SSH key compromise | Attacker gains access to managed servers | Rotate keys, use separate deploy user, enable 2FA on SSH |
| Ansible playbook bug | Bad config deployed to fleet | `serial: 1` for deploys, `--check --diff` before apply, canary phase |
| Docker Hub rate limits | Image pulls fail during upgrade | Use a registry mirror or pre-pull images |
| Prisma migration conflict | Database schema mismatch | Always run `migrate deploy` (applies pending only), never `migrate dev` in production |
| Instance disk full | Backup fails, containers crash | `BackupStale` + `DiskSpaceLow` alerts, retention cleanup |