changemaker.lite/docs/SESSION_HANDOFF_2026-05-20.md
bunker-admin 731e70ee42 docs: session handoff for the upgrade-flow redesign work
Captures the full state of the 2026-05-20/21 working session for the
next agent or future-self: fleet status, what landed in v2.10.2,
remaining Phase B + C work from the plan, surgical-update procedures
for the 6 remaining tenants (proven on pia 2026-05-21), bug inventory,
and "don't repeat my mistakes" notes.

Plan reference: /home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md

Force-added because docs/ is gitignored but the handoff needs to be
discoverable in-repo (same pattern as COMPETITIVE_ANALYSIS.md).

Bunker Admin
2026-05-21 13:42:08 -06:00

267 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Session Handoff: Upgrade Flow Redesign (2026-05-20 → 2026-05-21)
> Carries forward all context from a long working session into the next conversation. If you're a fresh agent: read this top-to-bottom before touching anything.
---
## Quick state of the fleet
| Tenant | Type | Version | Agent patched | Surgical script update | Notes |
|---|---|---|---|---|---|
| bnkops (n4) | source | main @ 1b80e82 | ✅ | ⏳ pending | Management node; CCP backend runs here in parallel |
| marcelle (n5, cursedknowledge.org) | release | v2.9.15 | ✅ | ⏳ pending | Test bench; first end-to-end CCP upgrade test ran here (succeeded after manual Phase 6 recovery) |
| trbh (n6) | source | main @ 1b80e82 | ✅ | ⏳ pending | mkdocs content RESTORED from `stash@{0}` — site serves "That Really Blonde Human" correctly |
| pia (n3, pia-bnkops) | release | v2.9.10 | ✅ | ✅ **completed 2026-05-21** | First successful surgical update — proof the procedure works |
| pridecorner (n1) | source | main @ 1b80e82 | ✅ | ⏳ pending | Has 3 March 9 upgrade-* stashes still on disk (audit done; recovery deferred to another agent) |
| soroush (n7) | source | main @ 1b80e82 | ✅ | ⏳ pending | Was earliest-fixed tonight |
| linda (n2, lindalindsay.org) | release-converted | v2.9.14 | ✅ | ⏳ pending | Was source-install with broken `.git`; converted to release mode (VERSION file written) |
**Public sites verified working at session end**: trbh.org, docs.trbh.org, bnkops.com, pridecorner.ca, soroushsamavat.org, publicinterestalberta.org, lindalindsay.org, cursedknowledge.org.
**Known caveat**: docs.bnkops.com returns HTTP 000 externally (Pangolin tunnel routing issue, pre-existing, NOT caused by this session). bnkops mkdocs container serves correct content locally.
---
## What landed in source (committed + pushed to origin/main)
| Commit | Description |
|---|---|
| `1b80e82` | `fix(ccp-agent): whitelist /app/instance for git safe.directory` — ccp-agent Dockerfile |
| `e88ac79` | `fix(ccp-agent): export COMPOSE_PROJECT_NAME so upgrade.sh sees correct project` — docker-compose.yml + .prod.yml |
| `9613c3e` | `fix(upgrade): Phase 1 of upgrade-flow redesign (Approach A)` — upgrade.sh + scripts/lib/mkdocs-snapshot.sh + scripts/upgrade-stash-cleanup.sh + .gitignore |
| `a7d3dd7` | `chore(release): ship scripts/lib/ + classify upgrade-stash-cleanup.sh` — build-release.sh |
**Release**: v2.10.2 tagged on `a7d3dd7`, uploaded to Gitea Releases as the new "latest" (`/releases/latest` returns v2.10.2 — the timestamp issue from earlier in session is fixed via build-release.sh's `target_commitish` workaround).
**Earlier in session**: tonight also produced commit `a531f9b` (ccp-agent missing bash/curl/jq/python3 + writable mount) and v2.10.1 release. v2.10.2 supersedes v2.10.1.
---
## The plan — Approach A (DONE) + B + C (pending)
Full design lives at `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`.
### Approach A — ✅ Done
Three fixes to existing `scripts/upgrade.sh` shipping in v2.10.2:
1. **Phase 6 self-destruct fix** — Phase 6's broad `docker compose up -d` no longer recreates ccp-agent (which would SIGKILL the running script). Instead, ccp-agent restart is deferred to AFTER `write_result` writes the final `result.json`, via a detached `nohup ... & disown` subshell.
2. **mkdocs/ snapshot fallback**`scripts/lib/mkdocs-snapshot.sh` is sourced by upgrade.sh's Phase 2. Before any other backup or pull operation, it tarballs the entire `mkdocs/` directory into `mkdocs-backup-<timestamp>.tar.gz` in the install root. Retains last 5. Discoverable via `ls`. Restoration is one-liner:
```bash
tar xzf "$(ls -t mkdocs-backup-*.tar.gz | head -1)" -C . && \
docker compose restart mkdocs mkdocs-site-server
```
3. **`upgrade-stash-cleanup.sh`** — interactive utility to drop accumulated `upgrade-*` git stashes. Warns LOUDLY if any stash contains `mkdocs/mkdocs.yml` so operators verify recovery before dropping.
### Approach B — ⏳ Pending (1-2 days)
Add `--image-only` upgrade mode. Production images are hermetic (bake compiled code + Prisma migrations + entrypoint runs migrations on container start). Therefore `docker compose pull && docker compose up -d` IS a complete code+schema upgrade. **No filesystem mutation outside Docker** → tenant content implicitly safe.
New files to create:
- `scripts/image-upgrade.sh` (~150 lines; sources `scripts/lib/mkdocs-snapshot.sh` for the fallback)
- `changemaker-control-panel/agent/src/routes/upgrade.routes.ts` → new endpoint `POST /instance/:slug/upgrade/start-image-only`
- `changemaker-control-panel/api/src/services/upgrade.service.ts` → `startImageUpgrade(instanceId, userId, { imageTag })`
- `changemaker-control-panel/api/src/services/remote-driver.ts` → `startImageUpgrade()`
- `changemaker-control-panel/api/src/modules/instances/instances.routes.ts` → `POST /:id/upgrade-images`
- CCP admin UI: "Quick Upgrade (image-only)" button on `InstanceDetailPage.tsx`
### Approach C — ⏳ Pending (3-5 days)
CCP-driven template re-render for orchestration-changing upgrades. Reuses existing `template-engine.ts` and `reconfigureInstance` pattern. Only writes templated files (compose, nginx, configs/pangolin); never touches `mkdocs/` or `configs/code-server/data/`. See plan for details.
---
## How to apply v2.10.2 fixes to remaining tenants
**For PIA: already done** — used as the proof-of-concept on 2026-05-21. mkdocs.yml md5 unchanged, file count unchanged. ~5 minutes per tenant.
**For the other 6 tenants**, use the surgical update — DO NOT run a raw `git pull origin main` (it would resurrect tenant-deleted files via merge logic):
### Source installs (bnkops, trbh, pridecorner, soroush)
```bash
# bnkops, trbh, soroush use ~/changemaker.lite
# pridecorner uses ~/cmlite/changemaker.lite
cd ~/changemaker.lite # or ~/cmlite/changemaker.lite
git fetch origin main
mkdir -p scripts/lib
git checkout origin/main -- \
scripts/upgrade.sh \
scripts/upgrade-stash-cleanup.sh \
scripts/lib/mkdocs-snapshot.sh \
scripts/build-release.sh \
docker-compose.yml \
.gitignore
# Sanity: tenant content should still be ahead/divergent (not touched)
git status mkdocs/ configs/ # should show no NEW changes from this update
```
### Release installs (marcelle, linda) — used pia approach
```bash
# marcelle: ~/changemaker.lite, ssh bunker-admin@100.90.78.47
# linda: ~/changemaker.lite.canonical, ssh bunker-admin@n2-linda.taile33572.ts.net
cd ~/changemaker.lite # or ~/changemaker.lite.canonical
curl -fSL https://gitea.bnkops.com/admin/changemaker.lite/releases/download/v2.10.2/changemaker-lite-v2.10.2.tar.gz \
-o /tmp/v2.10.2.tar.gz
mkdir -p scripts/lib
tar -xzf /tmp/v2.10.2.tar.gz --strip-components=1 \
changemaker-lite/scripts/upgrade.sh \
changemaker-lite/scripts/upgrade-stash-cleanup.sh \
changemaker-lite/scripts/lib/mkdocs-snapshot.sh \
changemaker-lite/docker-compose.yml
chmod +x scripts/upgrade.sh scripts/upgrade-stash-cleanup.sh scripts/lib/mkdocs-snapshot.sh
rm -f /tmp/v2.10.2.tar.gz
# Do NOT update VERSION — only scripts changed, rest of install stays at current version.
```
### Verification per tenant
```bash
# Before update: capture
md5sum mkdocs/mkdocs.yml
find mkdocs/docs -type f | wc -l
# Run the appropriate surgical update above
# After update: re-verify (should match)
md5sum mkdocs/mkdocs.yml
find mkdocs/docs -type f | wc -l
# Confirm new upgrade.sh
grep -c 'deferred ccp-agent\|Deferred ccp-agent' scripts/upgrade.sh # expect 2
# Optional: smoke-test the snapshot helper
PROJECT_DIR=$(pwd) bash -c '. scripts/lib/mkdocs-snapshot.sh; snapshot_mkdocs'
ls -lh mkdocs-backup-*.tar.gz
```
---
## Bug inventory — what we know
### Fixed in v2.10.2
| Bug | Memory file | Status |
|---|---|---|
| Gitea release `created_unix=0` (lightweight tag + Gitea 1.23.x quirk) | `feedback_gitea_release_tag_timing.md` | Fixed in `build-release.sh` — uses `target_commitish` + removes remote tag first |
| ccp-agent image missing bash/curl/jq/python3 + git safe.directory | `feedback_ccp_agent_image_deps.md` | Fixed in agent Dockerfile + rolled out to all 7 tenants |
| ccp-agent compose mount was `:ro` (blocked status.json writes) | (in `feedback_ccp_agent_image_deps.md`) | Fixed in both compose files |
| CCP upgrade Phase 5 collision: `COMPOSE_PROJECT_NAME` mismatch | `feedback_upgrade_compose_project_name.md` | Fixed via env-var addition in compose env block (e88ac79) — also needs `.env` entry on tenants installed before v2.10.2 |
| upgrade.sh Phase 6 self-destruct | `feedback_upgrade_sh_bugs.md` | Fixed in v2.10.2 — deferred ccp-agent restart |
### Open
- **upgrade.sh `git stash → git pull` stash-no-pop** — Pride Corner has 3 stashes from March 9 holding mkdocs.yml customizations. Existing `save_user_paths`/`restore_user_paths` in upgrade.sh handles the common case; the snapshot fallback (v2.10.2) covers edge cases. Pridecorner-specific recovery handled by another agent.
- **Agent-side `detached: true` spawn** — Defense-in-depth. Skip unless Phase 6 self-destruct re-emerges.
---
## Tenant content protection layers (all in v2.10.2)
1. **`save_user_paths`/`restore_user_paths`** in upgrade.sh — preserves working-tree state of `mkdocs/docs/`, `mkdocs/mkdocs.yml`, `mkdocs/site/`, `configs/`, `nginx/conf.d/services.conf` across `git pull`.
2. **`git stash` + auto-resolve on USER_PATHS** — modified tracked files stash + pop with `git checkout --theirs` on USER_PATH conflicts.
3. **Pre-upgrade mkdocs snapshot** — tarball of `mkdocs/` to install root before any other phase runs. Fallback for everything else.
---
## Tonight's recovery work — already applied
These tenants had content damage from earlier in the session; recovery was completed:
- **trbh** — mkdocs.yml + 143 M files restored from `stash@{0}`; 538 D-entry files re-deleted. Public sites serve correct branding.
- **bnkops** — same pattern, 100 M files restored + 82 D-entry re-deletions. Public sites serve correct branding.
- **marcelle** — manual recovery from Phase 6 self-destruct test (file rollback + service restart). On v2.10.1 currently. Operating normally.
`stash@{0}` is preserved on trbh and bnkops as forensic record + safety net.
---
## CCP access
```
URL: http://n4-bnkops.taile33572.ts.net:5100 (UI)
http://n4-bnkops.taile33572.ts.net:5000 (API)
User: admin@thebunkerops.ca
Password: NRTgHdC7Zxxs2P2UmNwnEbn3jTwU8uJN (seed; rotate if you want)
Role: SUPER_ADMIN
```
---
## Test bench (marcelle)
```
SSH: ssh bunker-admin@100.90.78.47
Install dir: ~/changemaker.lite
Domain: cursedknowledge.org
Admin: admin@cursedknowledge.org / @TheBunker2025!
CCP slug: changemakerlite
CCP id: 71b5bc4a-c47e-4435-b460-e9bc303b76ed
```
Marcelle is the test bench per `docs/TEST_SERVER.md`. Use it for ALL upgrade experiments before touching production tenants.
---
## Per-tenant quick reference
| Tenant | SSH | Install dir | CCP id |
|---|---|---|---|
| bnkops | bunker-admin@n4-bnkops.taile33572.ts.net | ~/changemaker.lite | 21238536-7c04-4a3b-a073-38390a939046 |
| marcelle | bunker-admin@100.90.78.47 | ~/changemaker.lite | 71b5bc4a-c47e-4435-b460-e9bc303b76ed |
| trbh | bunker-admin@n6-trbh.taile33572.ts.net | ~/changemaker.lite | c066dc23-64a5-4684-96a7-992e65c1b82c |
| pia | pia-bnkops@n3-pia.taile33572.ts.net | ~/changemaker.lite | 92a11622-d357-4ab4-b21e-60c030c1b026 |
| pridecorner | bunker-admin@n1-pridecorner.taile33572.ts.net | ~/cmlite/changemaker.lite | a30de94b-ef28-42b6-a71d-112669526a62 |
| soroush | bunker-admin@n7-soroush.taile33572.ts.net | ~/changemaker.lite | 0c70f94c-1319-41e1-867c-5674f17cadda |
| linda | bunker-admin@n2-linda.taile33572.ts.net | ~/changemaker.lite.canonical | 6dcc19a1-f4fd-45df-be77-5bf62f8110c8 |
---
## Most important "don't repeat my mistakes" notes
1. **Never `git stash + git pull --ff-only origin main` on a tenant** outside of upgrade.sh. The stash silently displaces tenant content. If you must update files on a source-installed tenant, use targeted `git checkout origin/main -- <specific-file>` instead.
2. **Never blindly trigger CCP "Upgrade Now"** on a tenant still running pre-v2.10.2 upgrade.sh — it will Phase 6 self-destruct. Apply surgical script update first (instructions above), THEN trigger CCP upgrade.
3. **mkdocs/docs/ contains upstream tracked files** (default screenshots, demo docs, blog posts). Tenants typically delete these locally without committing. ANY operation that brings origin/main's tracked tree into the working tree (git pull, tarball extract) will resurrect them. v2.10.2's snapshot fallback gives you a recovery path; the surgical update procedure (this doc) avoids the issue entirely.
4. **mkdocs/mkdocs.yml is tracked, tenant-customized** with branding. Lives under USER_PATHS so v2.10.2's upgrade.sh protects it. But if you do raw git operations outside the script, it's exposed.
5. **CCP backend on n4 is decoupled from per-tenant ccp-agent**. Restarting a tenant's ccp-agent does NOT affect CCP itself. Verified during bnkops patch (CCP backend stayed at 41h uptime while ccp-agent recreated).
---
## Memory files (in `/home/bunker-admin/.claude/projects/-home-bunker-admin-changemaker-lite/memory/`)
Latest session work documented in:
- `feedback_gitea_release_tag_timing.md`
- `feedback_ccp_agent_image_deps.md`
- `feedback_upgrade_compose_project_name.md`
- `feedback_upgrade_sh_bugs.md`
- `feedback_session_2026_05_20_damage_report.md`
Plus the architectural plan: `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`
---
## Where to start the next session
Recommended sequence:
1. **Apply surgical update to remaining 6 tenants** (~30-45 min, low risk; pia procedure already proven). Order: marcelle, linda (release), then soroush, trbh, bnkops, pridecorner (source).
2. **Test CCP-driven upgrade on marcelle** after surgical update lands. This will verify the deferred ccp-agent restart works end-to-end through the CCP path (the test we couldn't complete tonight because Phase 6 kept self-destructing).
3. **Implement Approach B** per the plan — image-only upgrade mode. Estimated 1-2 days.
4. **Implement Approach C** — CCP template re-render. 3-5 days.
If only one thing happens next session: **do step 1**. Six surgical updates × ~5 minutes each. The rest of the fleet stays vulnerable to Phase 6 self-destruct until they're on v2.10.2's upgrade.sh.