Add a "Quick Upgrade" path that pulls latest container images and recreates
only the core app services (api, admin, media-api, nginx) without touching
any tracked files. Tenant content (mkdocs/, configs/, scripts/) is implicitly
preserved because the script never writes outside docker.
Faster (~2 min vs ~4-5 min for full upgrade) and structurally safer for
releases that don't change orchestration/templates.
Pieces:
- scripts/image-upgrade.sh: new ~350-line script. Phases: pre-flight +
mkdocs snapshot, image pull, targeted recreate (broad up -d would cascade
on misconfigured infra containers — proven on marcelle), light health
checks, deferred ccp-agent restart. Writes the same progress.json +
result.json schema as upgrade.sh so the CCP poll loop is unchanged.
- agent/src/routes/upgrade.routes.ts: POST /instance/:slug/upgrade/start-image-only.
Same lock + staleness guards as the existing /upgrade/start endpoint.
- api/src/services/remote-driver.ts: RemoteDriver.startImageUpgrade().
- api/src/services/upgrade.service.ts: startImageUpgrade() entry point;
reuses runRemoteUpgrade with mode='image-only' (only the initial agent
call differs — result schema and polling are identical).
- api/src/modules/instances/instances.routes.ts: POST /:id/upgrade-images
+ startImageUpgradeSchema.
- admin/src/pages/InstanceDetailPage.tsx: secondary "Quick Upgrade" button
next to "Upgrade Now" on the Updates tab. Tooltip explains when to use it.
Tested locally on marcelle (v2.10.2 idempotent run): 1m 49s, mkdocs.yml md5
unchanged, file count unchanged, only api/admin/media-api/nginx touched.
Subtle bug found and fixed: `set -o pipefail` + `grep -q` shorts pipe and
SIGPIPEs the writer — captured services list once instead.
Bunker Admin
The agent container runs as root but the bind-mounted instance directory
is owned by the host user (UID 1000 = `node` in the container). Modern
git refuses to operate on such repos without an explicit safe.directory
entry, breaking upgrade-check.sh's `git fetch/log` calls on source-installed
tenants. Verified empirically on soroush after the previous fix landed.
Bunker Admin
Three related fixes uncovered during a marcelle CCP registration test:
1. ccp-agent image was missing bash + curl + jq + python3, so every
spawn('bash', ...) in upgrade.routes.ts and backup.routes.ts failed
silently with ENOENT. CCP kept reading stale status.json files from
disk, masking that no agent had successfully checked for updates in
weeks. apk-add the missing tools.
2. ccp-agent's /app/instance mount was :ro, blocking the agent from
writing data/upgrade/status.json (and result/progress/backups).
Agent already has docker.sock — removing :ro is not a security
escalation. Patched both docker-compose.yml and docker-compose.prod.yml.
3. Gitea 1.23.x only initializes Release.CreatedUnix inside its
createTag() helper, which is skipped if the tag already exists on
origin. The old DEV_WORKFLOW pattern (push tag, then run
build-release.sh --upload) was triggering this — releases got
created_unix=0 and lost /releases/latest sort order to v2.9.14.
build-release.sh now removes the remote tag first and POSTs with
target_commitish so Gitea creates the tag and release atomically.
After these fixes, CCP's "Check for Updates" path returns truthful
data end-to-end (verified on marcelle: v2.9.15 -> v2.10.1, 1 behind).
Bunker Admin
Problem: the agent polled /poll every 30s while waiting for admin
approval. At 10 req/15min, the 11th poll hit 429 after ~5 min and
every subsequent one also failed — recovery required an agent
restart. A human-paced approval SLA is longer than 5 minutes.
CCP side (agents.routes.ts):
Split the one-size-fits-all agentRegistrationLimiter into two.
/register stays tight (10/15min — invite-code brute force is the
real attack surface). /poll gets a new agentPollLimiter at 180/15min
(one poll per ~5s upper bound), scoped to registrationId+slug so
blast radius is bounded.
Agent side (server.ts):
Replaced fixed 30s setInterval with a self-scheduling setTimeout
loop that backs off exponentially on HTTP 429 (30s → 60s → 120s →
300s cap) and resets to 30s on any 2xx. Stop-flag protects against
re-entry after approval. Fixes the "agent wedged at 429, restart to
recover" workaround.
Bunker Admin