Adds the third upgrade path alongside Approach A (full upgrade.sh) and B
(image-only). For releases that change orchestration (new services, new
nginx routes, new compose env vars) in addition to image versions, CCP
re-renders templates server-side, sends the rendered files to the tenant
via the existing mTLS agent, then composePull + composeUp. Tenant content
(mkdocs/, custom configs/) is never touched.
Pieces:
PHASE 1 — Schema + per-instance imageTag
- prisma/schema.prisma: new Instance.imageTag column (NULL = fall back
to env.IMAGE_TAG default).
- prisma/migrations/20260522093400_add_instance_image_tag/: SQL.
- services/template-engine.ts:
- buildTemplateContext now uses instance.imageTag || env.IMAGE_TAG.
- InstanceForTemplate interface gains imageTag: string | null.
PHASE 2 — Pre-flight diff (read-only "what would change?")
- agent/services/file.service.ts: new diffFiles() helper with a small
inline LCS-based unified-diff (no new deps). Returns per-file status
('unchanged' | 'modified' | 'created') + truncated unified diff.
- agent/routes/files.routes.ts: POST /instance/:slug/files/diff.
- api/services/execution-driver.ts: diffFiles added to interface.
- api/services/local-driver.ts + remote-driver.ts: diffFiles methods
(local mirrors agent helper inline; remote POSTs to the agent endpoint).
- api/services/upgrade.service.ts: previewReleaseUpgrade() — renders
templates in-memory with the proposed imageTag, filters out .env for
isRegistered=true tenants, calls driver.diffFiles, computes envCoverage
(which env vars the new compose needs vs which the tenant's .env has).
PHASE 3 — Apply path (the actual upgrade)
- api/services/upgrade.service.ts: startReleaseUpgrade() and the inner
runReleaseUpgrade() runner. Distinct from runRemoteUpgrade because CCP
does the work directly via the mTLS driver (no agent-side script).
Flow: persist imageTag in DB → render → writeFiles → composePull →
composeUp → composePs verify. Status reported via InstanceUpgrade
rows (same shape the existing CCP polling UI already uses).
- Failure handling: instance.imageTag stays at the new value on failure
so operator can retry. Manual rollback only.
PHASE 4 — Routes + schemas
- instances.schemas.ts: startReleaseUpgradeSchema (imageTag regex).
- instances.routes.ts:
- POST /:id/upgrade-release (apply)
- POST /:id/upgrade-release/preview (read-only diff)
PHASE 5 — CCP admin UI
- admin/pages/InstanceDetailPage.tsx: third "Upgrade to Release" button
next to Quick Upgrade + Upgrade Now. Opens a modal with imageTag input,
Preview button (calls /preview), and Apply button. Preview modal shows:
- Red alert if envCoverage.missingInTenantEnv is non-empty (compose
needs vars the tenant's .env doesn't define).
- Per-file status tags (unchanged / modified / created) + truncated
unified diff for modified files.
- admin/types/api.ts: Instance.imageTag added.
Constraints applied:
- Remote-only initial scope: throws "currently supported only for remote
instances" if instance.isRemote === false.
- isRegistered=true tenants (install.sh fleet): .env is filtered out
of the render set (CCP can't render env without secrets in DB), the
tenant's existing .env stays as-is. envCoverage warns the operator
if the new compose references env vars their .env doesn't define.
- Shared in-progress guard with Approach A/B (one upgrade at a time).
Per the plan: see ~/.claude/plans/insight-temporal-bachman.md.
All three projects type-check cleanly (api, agent, admin).
Bunker Admin
Add a "Quick Upgrade" path that pulls latest container images and recreates
only the core app services (api, admin, media-api, nginx) without touching
any tracked files. Tenant content (mkdocs/, configs/, scripts/) is implicitly
preserved because the script never writes outside docker.
Faster (~2 min vs ~4-5 min for full upgrade) and structurally safer for
releases that don't change orchestration/templates.
Pieces:
- scripts/image-upgrade.sh: new ~350-line script. Phases: pre-flight +
mkdocs snapshot, image pull, targeted recreate (broad up -d would cascade
on misconfigured infra containers — proven on marcelle), light health
checks, deferred ccp-agent restart. Writes the same progress.json +
result.json schema as upgrade.sh so the CCP poll loop is unchanged.
- agent/src/routes/upgrade.routes.ts: POST /instance/:slug/upgrade/start-image-only.
Same lock + staleness guards as the existing /upgrade/start endpoint.
- api/src/services/remote-driver.ts: RemoteDriver.startImageUpgrade().
- api/src/services/upgrade.service.ts: startImageUpgrade() entry point;
reuses runRemoteUpgrade with mode='image-only' (only the initial agent
call differs — result schema and polling are identical).
- api/src/modules/instances/instances.routes.ts: POST /:id/upgrade-images
+ startImageUpgradeSchema.
- admin/src/pages/InstanceDetailPage.tsx: secondary "Quick Upgrade" button
next to "Upgrade Now" on the Updates tab. Tooltip explains when to use it.
Tested locally on marcelle (v2.10.2 idempotent run): 1m 49s, mkdocs.yml md5
unchanged, file count unchanged, only api/admin/media-api/nginx touched.
Subtle bug found and fixed: `set -o pipefail` + `grep -q` shorts pipe and
SIGPIPEs the writer — captured services list once instead.
Bunker Admin
The agent container runs as root but the bind-mounted instance directory
is owned by the host user (UID 1000 = `node` in the container). Modern
git refuses to operate on such repos without an explicit safe.directory
entry, breaking upgrade-check.sh's `git fetch/log` calls on source-installed
tenants. Verified empirically on soroush after the previous fix landed.
Bunker Admin
Three related fixes uncovered during a marcelle CCP registration test:
1. ccp-agent image was missing bash + curl + jq + python3, so every
spawn('bash', ...) in upgrade.routes.ts and backup.routes.ts failed
silently with ENOENT. CCP kept reading stale status.json files from
disk, masking that no agent had successfully checked for updates in
weeks. apk-add the missing tools.
2. ccp-agent's /app/instance mount was :ro, blocking the agent from
writing data/upgrade/status.json (and result/progress/backups).
Agent already has docker.sock — removing :ro is not a security
escalation. Patched both docker-compose.yml and docker-compose.prod.yml.
3. Gitea 1.23.x only initializes Release.CreatedUnix inside its
createTag() helper, which is skipped if the tag already exists on
origin. The old DEV_WORKFLOW pattern (push tag, then run
build-release.sh --upload) was triggering this — releases got
created_unix=0 and lost /releases/latest sort order to v2.9.14.
build-release.sh now removes the remote tag first and POSTs with
target_commitish so Gitea creates the tag and release atomically.
After these fixes, CCP's "Check for Updates" path returns truthful
data end-to-end (verified on marcelle: v2.9.15 -> v2.10.1, 1 behind).
Bunker Admin
Problem: the agent polled /poll every 30s while waiting for admin
approval. At 10 req/15min, the 11th poll hit 429 after ~5 min and
every subsequent one also failed — recovery required an agent
restart. A human-paced approval SLA is longer than 5 minutes.
CCP side (agents.routes.ts):
Split the one-size-fits-all agentRegistrationLimiter into two.
/register stays tight (10/15min — invite-code brute force is the
real attack surface). /poll gets a new agentPollLimiter at 180/15min
(one poll per ~5s upper bound), scoped to registrationId+slug so
blast radius is bounded.
Agent side (server.ts):
Replaced fixed 30s setInterval with a self-scheduling setTimeout
loop that backs off exponentially on HTTP 429 (30s → 60s → 120s →
300s cap) and resets to 30s on any 2xx. Stop-flag protects against
re-entry after approval. Fixes the "agent wedged at 429, restart to
recover" workaround.
Bunker Admin