diff --git a/docs/SESSION_HANDOFF_2026-05-20.md b/docs/SESSION_HANDOFF_2026-05-20.md
new file mode 100644
index 0000000..102980c
--- /dev/null
+++ b/docs/SESSION_HANDOFF_2026-05-20.md
@@ -0,0 +1,266 @@
+# Session Handoff: Upgrade Flow Redesign (2026-05-20 → 2026-05-21)
+
+> Carries forward all context from a long working session into the next conversation. If you're a fresh agent: read this top-to-bottom before touching anything.
+
+---
+
+## Quick state of the fleet
+
+| Tenant | Type | Version | Agent patched | Surgical script update | Notes |
+|---|---|---|---|---|---|
+| bnkops (n4) | source | main @ 1b80e82 | ✅ | ⏳ pending | Management node; CCP backend runs here in parallel |
+| marcelle (n5, cursedknowledge.org) | release | v2.9.15 | ✅ | ⏳ pending | Test bench; first end-to-end CCP upgrade test ran here (succeeded after manual Phase 6 recovery) |
+| trbh (n6) | source | main @ 1b80e82 | ✅ | ⏳ pending | mkdocs content RESTORED from `stash@{0}`; site serves "That Really Blonde Human" correctly |
+| pia (n3, pia-bnkops) | release | v2.9.10 | ✅ | ✅ **completed 2026-05-21** | First successful surgical update; proof the procedure works |
+| pridecorner (n1) | source | main @ 1b80e82 | ✅ | ⏳ pending | Has 3 `upgrade-*` stashes from March 9 still on disk (audit done; recovery deferred to another agent) |
+| soroush (n7) | source | main @ 1b80e82 | ✅ | ⏳ pending | First tenant fixed tonight |
+| linda (n2, lindalindsay.org) | release-converted | v2.9.14 | ✅ | ⏳ pending | Was a source install with a broken `.git`; converted to release mode (VERSION file written) |
+
+**Public sites verified working at session end**: trbh.org, docs.trbh.org, bnkops.com, pridecorner.ca, soroushsamavat.org, publicinterestalberta.org, lindalindsay.org, cursedknowledge.org.
+
+**Known caveat**: docs.bnkops.com returns HTTP 000 externally (Pangolin tunnel routing issue, pre-existing, NOT caused by this session). The bnkops mkdocs container serves correct content locally.
+
+---
+
+## What landed in source (committed + pushed to origin/main)
+
+| Commit | Description |
+|---|---|
+| `1b80e82` | `fix(ccp-agent): whitelist /app/instance for git safe.directory` (ccp-agent Dockerfile) |
+| `e88ac79` | `fix(ccp-agent): export COMPOSE_PROJECT_NAME so upgrade.sh sees correct project` (docker-compose.yml + .prod.yml) |
+| `9613c3e` | `fix(upgrade): Phase 1 of upgrade-flow redesign (Approach A)` (upgrade.sh + scripts/lib/mkdocs-snapshot.sh + scripts/upgrade-stash-cleanup.sh + .gitignore) |
+| `a7d3dd7` | `chore(release): ship scripts/lib/ + classify upgrade-stash-cleanup.sh` (build-release.sh) |
+
+**Release**: v2.10.2 tagged on `a7d3dd7`, uploaded to Gitea Releases as the new "latest" (`/releases/latest` returns v2.10.2; the timestamp issue from earlier in the session is fixed via build-release.sh's `target_commitish` workaround).
+
+**Earlier in session**: tonight also produced commit `a531f9b` (ccp-agent missing bash/curl/jq/python3 + writable mount) and the v2.10.1 release. v2.10.2 supersedes v2.10.1.
+
+---
+
+## The plan: Approach A (DONE) + B + C (pending)
+
+Full design lives at `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`.
+
+### Approach A: ✅ Done
+
+Three fixes to existing `scripts/upgrade.sh` shipping in v2.10.2:
+
+1. **Phase 6 self-destruct fix**: Phase 6's broad `docker compose up -d` no longer recreates ccp-agent (which would SIGKILL the running script). Instead, the ccp-agent restart is deferred until AFTER `write_result` writes the final `result.json`, via a detached `nohup ... & disown` subshell.
+
+2. **mkdocs/ snapshot fallback**: `scripts/lib/mkdocs-snapshot.sh` is sourced by upgrade.sh's Phase 2. Before any other backup or pull operation, it tarballs the entire `mkdocs/` directory into `mkdocs-backup-.tar.gz` in the install root. It retains the last 5 snapshots, which are discoverable via `ls`. Restoration is a one-liner:
+   ```bash
+   tar xzf "$(ls -t mkdocs-backup-*.tar.gz | head -1)" -C . \
+     && docker compose restart mkdocs mkdocs-site-server
+   ```
+
+3. **`upgrade-stash-cleanup.sh`**: interactive utility to drop accumulated `upgrade-*` git stashes. Warns LOUDLY if any stash contains `mkdocs/mkdocs.yml` so operators verify recovery before dropping.
+
+### Approach B: ⏳ Pending (1-2 days)
+
+Add an `--image-only` upgrade mode. Production images are hermetic: they bake in compiled code and Prisma migrations, and the entrypoint runs migrations on container start. Therefore `docker compose pull && docker compose up -d` IS a complete code+schema upgrade. **No filesystem mutation outside Docker** → tenant content is implicitly safe.
+
+Files to create or change:
+- `scripts/image-upgrade.sh` (~150 lines; sources `scripts/lib/mkdocs-snapshot.sh` for the fallback)
+- `changemaker-control-panel/agent/src/routes/upgrade.routes.ts` → new endpoint `POST /instance/:slug/upgrade/start-image-only`
+- `changemaker-control-panel/api/src/services/upgrade.service.ts` → `startImageUpgrade(instanceId, userId, { imageTag })`
+- `changemaker-control-panel/api/src/services/remote-driver.ts` → `startImageUpgrade()`
+- `changemaker-control-panel/api/src/modules/instances/instances.routes.ts` → `POST /:id/upgrade-images`
+- CCP admin UI: "Quick Upgrade (image-only)" button on `InstanceDetailPage.tsx`
+
+### Approach C: ⏳ Pending (3-5 days)
+
+CCP-driven template re-render for orchestration-changing upgrades. Reuses the existing `template-engine.ts` and `reconfigureInstance` pattern. Only writes templated files (compose, nginx, configs/pangolin); never touches `mkdocs/` or `configs/code-server/data/`. See plan for details.
+
+---
+
+## How to apply v2.10.2 fixes to remaining tenants
+
+**For PIA: already done.** It was used as the proof-of-concept on 2026-05-21. mkdocs.yml md5 unchanged, file count unchanged. ~5 minutes per tenant.
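The pia check (same `mkdocs.yml` md5, same file count) can be reduced to a single comparable string. The sketch below is only an illustration: `tenant_fingerprint` is a hypothetical helper name, not something shipped in v2.10.2, and it assumes GNU `md5sum` is available on the tenant.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of v2.10.2): print "<md5 of mkdocs.yml> <file count>"
# for an install root, so "before" and "after" can be compared as plain strings.
tenant_fingerprint() {
  local root="${1:-.}"
  local sum count
  # Fail early if mkdocs.yml is missing or unreadable.
  sum=$(md5sum "$root/mkdocs/mkdocs.yml") || return 1
  sum=${sum%% *}                      # keep only the hash field
  count=$(find "$root/mkdocs/docs" -type f | wc -l | tr -d ' ')
  printf '%s %s\n' "$sum" "$count"
}

# Usage (run from the install root):
#   before=$(tenant_fingerprint .)
#   ... apply the surgical update ...
#   after=$(tenant_fingerprint .)
#   [ "$before" = "$after" ] && echo "tenant content unchanged" || echo "CONTENT CHANGED"
```

A string comparison keeps the pass/fail decision mechanical, which matters when the same check is repeated across several tenants in one sitting.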
+
+**For the other 6 tenants**, use the surgical update. DO NOT run a raw `git pull origin main` (it would resurrect tenant-deleted files via merge logic):
+
+### Source installs (bnkops, trbh, pridecorner, soroush)
+
+```bash
+# bnkops, trbh, soroush use ~/changemaker.lite
+# pridecorner uses ~/cmlite/changemaker.lite
+cd ~/changemaker.lite # or ~/cmlite/changemaker.lite
+
+git fetch origin main
+
+mkdir -p scripts/lib
+git checkout origin/main -- \
+  scripts/upgrade.sh \
+  scripts/upgrade-stash-cleanup.sh \
+  scripts/lib/mkdocs-snapshot.sh \
+  scripts/build-release.sh \
+  docker-compose.yml \
+  .gitignore
+
+# Sanity: tenant content should still be ahead/divergent (not touched)
+git status mkdocs/ configs/ # should show no NEW changes from this update
+```
+
+### Release installs (marcelle, linda): same approach as pia
+
+```bash
+# marcelle: ~/changemaker.lite, ssh bunker-admin@100.90.78.47
+# linda: ~/changemaker.lite.canonical, ssh bunker-admin@n2-linda.taile33572.ts.net
+cd ~/changemaker.lite # or ~/changemaker.lite.canonical
+
+curl -fSL https://gitea.bnkops.com/admin/changemaker.lite/releases/download/v2.10.2/changemaker-lite-v2.10.2.tar.gz \
+  -o /tmp/v2.10.2.tar.gz
+
+mkdir -p scripts/lib
+tar -xzf /tmp/v2.10.2.tar.gz --strip-components=1 \
+  changemaker-lite/scripts/upgrade.sh \
+  changemaker-lite/scripts/upgrade-stash-cleanup.sh \
+  changemaker-lite/scripts/lib/mkdocs-snapshot.sh \
+  changemaker-lite/docker-compose.yml
+
+chmod +x scripts/upgrade.sh scripts/upgrade-stash-cleanup.sh scripts/lib/mkdocs-snapshot.sh
+rm -f /tmp/v2.10.2.tar.gz
+
+# Do NOT update VERSION: only scripts changed, the rest of the install stays at its current version.
+```
+
+### Verification per tenant
+
+```bash
+# Before update: capture
+md5sum mkdocs/mkdocs.yml
+find mkdocs/docs -type f | wc -l
+
+# Run the appropriate surgical update above
+
+# After update: re-verify (should match)
+md5sum mkdocs/mkdocs.yml
+find mkdocs/docs -type f | wc -l
+
+# Confirm new upgrade.sh
+grep -c 'deferred ccp-agent\|Deferred ccp-agent' scripts/upgrade.sh # expect 2
+
+# Optional: smoke-test the snapshot helper
+PROJECT_DIR=$(pwd) bash -c '. scripts/lib/mkdocs-snapshot.sh; snapshot_mkdocs'
+ls -lh mkdocs-backup-*.tar.gz
+```
+
+---
+
+## Bug inventory: what we know
+
+### Fixed in v2.10.2
+
+| Bug | Memory file | Status |
+|---|---|---|
+| Gitea release `created_unix=0` (lightweight tag + Gitea 1.23.x quirk) | `feedback_gitea_release_tag_timing.md` | Fixed in `build-release.sh`: uses `target_commitish` + removes remote tag first |
+| ccp-agent image missing bash/curl/jq/python3 + git safe.directory | `feedback_ccp_agent_image_deps.md` | Fixed in agent Dockerfile + rolled out to all 7 tenants |
+| ccp-agent compose mount was `:ro` (blocked status.json writes) | (in `feedback_ccp_agent_image_deps.md`) | Fixed in both compose files |
+| CCP upgrade Phase 5 collision: `COMPOSE_PROJECT_NAME` mismatch | `feedback_upgrade_compose_project_name.md` | Fixed via env-var addition in compose env block (e88ac79); also needs a `.env` entry on tenants installed before v2.10.2 |
+| upgrade.sh Phase 6 self-destruct | `feedback_upgrade_sh_bugs.md` | Fixed in v2.10.2: deferred ccp-agent restart |
+
+### Open
+
+- **upgrade.sh `git stash → git pull` stash-no-pop**: Pride Corner has 3 stashes from March 9 holding mkdocs.yml customizations. Existing `save_user_paths`/`restore_user_paths` in upgrade.sh handles the common case; the snapshot fallback (v2.10.2) covers edge cases. Pridecorner-specific recovery is handled by another agent.
+- **Agent-side `detached: true` spawn**: Defense-in-depth. Skip unless Phase 6 self-destruct re-emerges.
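For the open stash item, a read-only audit can show which leftover stashes still hold `mkdocs/mkdocs.yml` before anything is dropped. This is a minimal sketch, not the logic of `upgrade-stash-cleanup.sh`: the function name `stashes_touching` is hypothetical, and matching on an `upgrade-` substring in the stash message is an assumption about how upgrade.sh names its stashes.

```shell
#!/usr/bin/env bash
# Sketch only: list stashes whose message contains a marker (assumed "upgrade-")
# and whose diff touches the given path. Read-only; nothing is dropped or applied.
stashes_touching() {
  local path="${1:-mkdocs/mkdocs.yml}" marker="${2:-upgrade-}"
  local ref subject
  # %gd = stash ref (stash@{N}), %gs = stash message
  git stash list --format='%gd %gs' | while read -r ref subject; do
    case "$subject" in
      *"$marker"*) ;;        # message matches, inspect it
      *) continue ;;         # unrelated stash, skip
    esac
    if git stash show --name-only "$ref" | grep -xF -- "$path" >/dev/null; then
      printf '%s\t%s\n' "$ref" "$subject"
    fi
  done
}

# Usage (from the install root of a source-installed tenant):
#   stashes_touching                       # stashes holding mkdocs/mkdocs.yml
#   stashes_touching nginx/conf.d/services.conf
```

Any stash this prints is a candidate for recovery rather than cleanup, so it is worth running before an interactive drop.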
+
+---
+
+## Tenant content protection layers (all in v2.10.2)
+
+1. **`save_user_paths`/`restore_user_paths`** in upgrade.sh: preserves working-tree state of `mkdocs/docs/`, `mkdocs/mkdocs.yml`, `mkdocs/site/`, `configs/`, `nginx/conf.d/services.conf` across `git pull`.
+2. **`git stash` + auto-resolve on USER_PATHS**: modified tracked files are stashed and popped, with `git checkout --theirs` applied to conflicts on USER_PATHS.
+3. **Pre-upgrade mkdocs snapshot**: tarball of `mkdocs/` written to the install root before any other phase runs. Fallback for everything else.
+
+---
+
+## Tonight's recovery work (already applied)
+
+These tenants had content damage from earlier in the session; recovery was completed:
+
+- **trbh**: mkdocs.yml + 143 M files restored from `stash@{0}`; 538 D-entry files re-deleted. Public sites serve correct branding.
+- **bnkops**: same pattern, 100 M files restored + 82 D-entry re-deletions. Public sites serve correct branding.
+- **marcelle**: manual recovery from Phase 6 self-destruct test (file rollback + service restart). On v2.10.1 currently. Operating normally.
+
+`stash@{0}` is preserved on trbh and bnkops as a forensic record + safety net.
+
+---
+
+## CCP access
+
+```
+URL:      http://n4-bnkops.taile33572.ts.net:5100 (UI)
+          http://n4-bnkops.taile33572.ts.net:5000 (API)
+User:     admin@thebunkerops.ca
+Password: NRTgHdC7Zxxs2P2UmNwnEbn3jTwU8uJN (seed; rotate if you want)
+Role:     SUPER_ADMIN
+```
+
+---
+
+## Test bench (marcelle)
+
+```
+SSH:         ssh bunker-admin@100.90.78.47
+Install dir: ~/changemaker.lite
+Domain:      cursedknowledge.org
+Admin:       admin@cursedknowledge.org / @TheBunker2025!
+CCP slug:    changemakerlite
+CCP id:      71b5bc4a-c47e-4435-b460-e9bc303b76ed
+```
+
+Marcelle is the test bench per `docs/TEST_SERVER.md`. Use it for ALL upgrade experiments before touching production tenants.
+
+---
+
+## Per-tenant quick reference
+
+| Tenant | SSH | Install dir | CCP id |
+|---|---|---|---|
+| bnkops | bunker-admin@n4-bnkops.taile33572.ts.net | ~/changemaker.lite | 21238536-7c04-4a3b-a073-38390a939046 |
+| marcelle | bunker-admin@100.90.78.47 | ~/changemaker.lite | 71b5bc4a-c47e-4435-b460-e9bc303b76ed |
+| trbh | bunker-admin@n6-trbh.taile33572.ts.net | ~/changemaker.lite | c066dc23-64a5-4684-96a7-992e65c1b82c |
+| pia | pia-bnkops@n3-pia.taile33572.ts.net | ~/changemaker.lite | 92a11622-d357-4ab4-b21e-60c030c1b026 |
+| pridecorner | bunker-admin@n1-pridecorner.taile33572.ts.net | ~/cmlite/changemaker.lite | a30de94b-ef28-42b6-a71d-112669526a62 |
+| soroush | bunker-admin@n7-soroush.taile33572.ts.net | ~/changemaker.lite | 0c70f94c-1319-41e1-867c-5674f17cadda |
+| linda | bunker-admin@n2-linda.taile33572.ts.net | ~/changemaker.lite.canonical | 6dcc19a1-f4fd-45df-be77-5bf62f8110c8 |
+
+---
+
+## Most important "don't repeat my mistakes" notes
+
+1. **Never `git stash + git pull --ff-only origin main` on a tenant** outside of upgrade.sh. The stash silently displaces tenant content. If you must update files on a source-installed tenant, use targeted `git checkout origin/main -- ` instead.
+
+2. **Never blindly trigger CCP "Upgrade Now"** on a tenant still running pre-v2.10.2 upgrade.sh: it will Phase 6 self-destruct. Apply the surgical script update first (instructions above), THEN trigger the CCP upgrade.
+
+3. **mkdocs/docs/ contains upstream tracked files** (default screenshots, demo docs, blog posts). Tenants typically delete these locally without committing. ANY operation that brings origin/main's tracked tree into the working tree (git pull, tarball extract) will resurrect them. v2.10.2's snapshot fallback gives you a recovery path; the surgical update procedure (this doc) avoids the issue entirely.
+
+4. **mkdocs/mkdocs.yml is tracked, tenant-customized** with branding. Lives under USER_PATHS so v2.10.2's upgrade.sh protects it.
+   But if you do raw git operations outside the script, it's exposed.
+
+5. **CCP backend on n4 is decoupled from per-tenant ccp-agent**. Restarting a tenant's ccp-agent does NOT affect CCP itself. Verified during the bnkops patch (CCP backend stayed at 41h uptime while ccp-agent was recreated).
+
+---
+
+## Memory files (in `/home/bunker-admin/.claude/projects/-home-bunker-admin-changemaker-lite/memory/`)
+
+Latest session work documented in:
+- `feedback_gitea_release_tag_timing.md`
+- `feedback_ccp_agent_image_deps.md`
+- `feedback_upgrade_compose_project_name.md`
+- `feedback_upgrade_sh_bugs.md`
+- `feedback_session_2026_05_20_damage_report.md`
+
+Plus the architectural plan: `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`
+
+---
+
+## Where to start the next session
+
+Recommended sequence:
+
+1. **Apply surgical update to remaining 6 tenants** (~30-45 min, low risk; pia procedure already proven). Order: marcelle, linda (release), then soroush, trbh, bnkops, pridecorner (source).
+2. **Test CCP-driven upgrade on marcelle** after the surgical update lands. This will verify the deferred ccp-agent restart works end-to-end through the CCP path (the test we couldn't complete tonight because Phase 6 kept self-destructing).
+3. **Implement Approach B** per the plan: image-only upgrade mode. Estimated 1-2 days.
+4. **Implement Approach C**: CCP template re-render. 3-5 days.
+
+If only one thing happens next session: **do step 1**. Six surgical updates × ~5 minutes each. The rest of the fleet stays vulnerable to Phase 6 self-destruct until they're on v2.10.2's upgrade.sh.
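For step 1, a dry-run listing can keep the tenant order and install directories straight. The sketch below only prints the pre-check command for each tenant and runs nothing remotely. The SSH targets and directories are copied from the per-tenant quick reference and the order from the recommended sequence; the function name `print_precheck_plan` is hypothetical.

```shell
#!/usr/bin/env bash
# Dry run only: print the "before update" capture command for each remaining
# tenant, in the recommended order. No SSH connection is opened here.
print_precheck_plan() {
  local name target dir
  while read -r name target dir; do
    [ -n "$name" ] || continue
    printf '%-12s ssh %s "cd %s && md5sum mkdocs/mkdocs.yml && find mkdocs/docs -type f | wc -l"\n' \
      "$name" "$target" "$dir"
  done <<'EOF'
marcelle bunker-admin@100.90.78.47 ~/changemaker.lite
linda bunker-admin@n2-linda.taile33572.ts.net ~/changemaker.lite.canonical
soroush bunker-admin@n7-soroush.taile33572.ts.net ~/changemaker.lite
trbh bunker-admin@n6-trbh.taile33572.ts.net ~/changemaker.lite
bnkops bunker-admin@n4-bnkops.taile33572.ts.net ~/changemaker.lite
pridecorner bunker-admin@n1-pridecorner.taile33572.ts.net ~/cmlite/changemaker.lite
EOF
}

print_precheck_plan
```

Printing instead of executing leaves each SSH step as a deliberate manual action, in line with the low-risk, one-tenant-at-a-time procedure described above.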