chore(approach-c): Phase 0 initial template overlay + session handoff

This session shipped:
- Approach B end-to-end (commit 4a3d9d7): full rollout to all 7 tenants;
  marcelle E2E validated twice (121s + 100s).
- v2.10.2 surgical update applied to 6 remaining tenants.

This commit lands the kickoff for Approach C (template re-render path):

scripts/templates changes:
- docker-compose.yml.hbs.OLD-style-pre-approach-c: preserved old CCP
  template (Handlebars-heavy, dynamic container names, secrets rendered
  at template-time).
- docker-compose.yml.hbs: REWRITTEN as a near-mirror of canonical
  docker-compose.prod.yml. Minimal Handlebars overlay:
  - Header comment lists {{name}}, {{slug}}, {{composeProject}}.
  - 5 image refs: ${IMAGE_TAG:-latest} -> {{imageTag}}, so CCP can
    per-instance override once Phase 1 lands the Instance.imageTag column.
  All other variation flows through env-var substitution from tenant's
  .env. Container names are now hardcoded (matching prod), feature flags
  are deferred to COMPOSE_PROFILES gating (matching prod).

Why a rewrite: the old CCP template and prod compose used fundamentally
different conventions (dynamic vs hardcoded names, render-time vs
substitute-time secrets, Handlebars vs profiles gating). Sync-by-addition
couldn't reconcile them. The rewrite makes Approach C re-render safe for
the install.sh-installed fleet (marcelle, linda, pia and future).

docs/SESSION_HANDOFF_2026-05-21.md: full session handoff covering fleet
state, Approach B rollout, Approach C plan, and where to start next
session. force-added because /docs is gitignored (same precedent as
docs/SESSION_HANDOFF_2026-05-20.md from prior session).

Phase 0 remaining work (next session):
- Audit env.hbs against new compose env-var expectations
- Sync static config files (nginx/, configs/prometheus/, etc.)
- Build api/scripts/render-for-instance.ts harness
- Iterate template until rendered output is per-instance-only diff
  against marcelle/linda/pia actual compose.

Then Phases 1-6 per plan in subsequent sessions (~11-14 hours total).

Bunker Admin
feat(upgrade): Approach B - image-only upgrade mode
2026-05-21 19:32:21 -06:00 · 2026-05-21 15:20:35 -06:00 · 2026-05-21 13:42:08 -06:00 · 2026-05-21 10:36:28 -06:00 · 2026-05-20 20:43:34 -06:00 · 2026-05-20 15:57:30 -06:00
19 changed files with 3899 additions and 923 deletions
--- a/.gitignore
+++ b/.gitignore
@ -64,6 +64,11 @@ core.*
 /backups/
 .upgrade.lock
 # Pre-upgrade mkdocs snapshots (created by scripts/lib/mkdocs-snapshot.sh).
 # These are the tenant-content rescue archives written before every upgrade;
 # discoverable in the install root via `ls`. Retention: last 5 (see helper).
 /mkdocs-backup-*.tar.gz
 # Release tarballs (generated by build-release.sh)
 /releases/
--- a/changemaker-control-panel/admin/src/pages/InstanceDetailPage.tsx
+++ b/changemaker-control-panel/admin/src/pages/InstanceDetailPage.tsx
@ -39,6 +39,7 @@ import {
  CloudOutlined,
  DisconnectOutlined,
  UploadOutlined,
  ThunderboltOutlined,
  BellOutlined,
  CheckCircleOutlined,
  WarningOutlined,
@ -563,6 +564,24 @@ export default function InstanceDetailPage() {
    }
  };
  // Image-only upgrade (Approach B): pulls images + recreates core app services
  // without touching tracked files. Faster + safer than full upgrade for releases
  // that don't change compose/templates.
  const handleStartImageUpgrade = async () => {
    setUpgradingInstance(true);
    try {
      const { data } = await api.post(`/instances/${id}/upgrade-images`, {});
      setCurrentUpgrade(data.data);
      message.success('Image-only upgrade started');
    } catch (err: unknown) {
      const resp = (err as { response?: { data?: { error?: { message?: string } } } })?.response
        ?.data?.error;
      message.error(resp?.message || 'Failed to start image-only upgrade');
    } finally {
      setUpgradingInstance(false);
    }
  };
  // Event handlers
  const handleAcknowledgeEvent = async (eventId: string) => {
    try {
@ -1632,25 +1651,41 @@ export default function InstanceDetailPage() {
                  closable
                />
              )}
-              <div style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center' }}>
-                <Typography.Text type="secondary">
-                  Pulls latest code, runs migrations, and restarts services. CCP backup is recommended before upgrading.
+              <div style={{ display: 'flex', justifyContent: 'space-between', alignItems: 'center', gap: 16 }}>
+                <Typography.Text type="secondary" style={{ flex: 1 }}>
+                  Full upgrade pulls the latest code, runs migrations, and restarts services. Quick upgrade only pulls images and recreates the core app — tenant content stays untouched and it&apos;s ~2 min faster. Use Quick when the release notes say no orchestration changes.
                </Typography.Text>
-                <Popconfirm
-                  title="Start upgrade?"
-                  description="This will pull the latest code, run database migrations, and restart all services. Brief downtime is expected."
-                  onConfirm={handleStartUpgrade}
-                  disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
-                >
-                  <Button
-                    type="primary"
-                    icon={<UploadOutlined />}
-                    loading={upgradingInstance}
-                    disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
-                  >
-                    Upgrade Now
-                  </Button>
-                </Popconfirm>
+                <Space>
+                  <Popconfirm
+                    title="Start quick (image-only) upgrade?"
+                    description="Pulls new container images and recreates the API/Admin/Media/Nginx services. No filesystem changes — mkdocs and configs are not touched. Brief downtime is expected."
+                    onConfirm={handleStartImageUpgrade}
+                    disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
+                  >
+                    <Button
+                      icon={<ThunderboltOutlined />}
+                      loading={upgradingInstance}
+                      disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
+                    >
+                      Quick Upgrade
+                    </Button>
+                  </Popconfirm>
+                  <Popconfirm
+                    title="Start full upgrade?"
+                    description="This will pull the latest code, run database migrations, and restart all services. Brief downtime is expected."
+                    onConfirm={handleStartUpgrade}
+                    disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
+                  >
+                    <Button
+                      type="primary"
+                      icon={<UploadOutlined />}
+                      loading={upgradingInstance}
+                      disabled={instance.status !== 'RUNNING' && instance.status !== 'STOPPED'}
+                    >
+                      Upgrade Now
+                    </Button>
+                  </Popconfirm>
+                </Space>
              </div>
            </Space>
          )}
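The catch block in `handleStartImageUpgrade` digs the API error message out of an axios-style error with an inline cast. A minimal sketch of the same lookup as a reusable helper follows; `extractApiErrorMessage` and `ApiErrorShape` are hypothetical names that do not appear in this commit, and the error shape is assumed from the cast in the diff.

```typescript
// Hypothetical helper (not part of this commit). The shape below is inferred
// from the inline cast in handleStartImageUpgrade.
type ApiErrorShape = { response?: { data?: { error?: { message?: string } } } };

// Returns the API-provided error message if one is present, otherwise the
// caller's fallback. Tolerates null, plain Error objects, and partial shapes.
function extractApiErrorMessage(err: unknown, fallback: string): string {
  const message = (err as ApiErrorShape | null | undefined)?.response?.data?.error?.message;
  return message || fallback;
}
```

With a helper like this, the handler's catch block could reduce to `message.error(extractApiErrorMessage(err, 'Failed to start image-only upgrade'))`.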
--- a/changemaker-control-panel/agent/Dockerfile
+++ b/changemaker-control-panel/agent/Dockerfile
@ -8,7 +8,16 @@ COPY src/ ./src/
 RUN npx tsc
 FROM node:20-alpine
-RUN apk add --no-cache docker-cli docker-cli-compose git rsync
+# bash + curl + jq + python3 are required by the changemaker scripts the agent
 # shells out to (upgrade-check.sh, upgrade.sh, backup.sh). Without them, every
 # /upgrade/* and /backup/* call returns "command not found" failures.
 RUN apk add --no-cache docker-cli docker-cli-compose git rsync bash curl jq python3
 # Agent runs as root, but the bind-mounted /app/instance is owned by the host
 # user (UID 1000 = `node` inside the container). Modern git refuses to operate
 # on repos with mismatched ownership without an explicit safe.directory entry.
 # Wildcard whitelist all paths — the agent only mounts a single host directory
 # anyway (the instance's project root).
 RUN git config --system --add safe.directory '*'
 WORKDIR /app
 COPY package*.json ./
 RUN npm ci --production
--- a/changemaker-control-panel/agent/src/routes/upgrade.routes.ts
+++ b/changemaker-control-panel/agent/src/routes/upgrade.routes.ts
@ -188,6 +188,85 @@ router.post('/instance/:slug/upgrade/start', async (req: Request, res: Response)
  res.status(202).json({ started: true });
 });
 // POST /instance/:slug/upgrade/start-image-only — Run image-upgrade.sh in background
 //
 // Image-only upgrade: pulls latest images + recreates services without touching
 // tracked files (no git pull, no tarball extract, no VERSION mutation). Tenant
 // content is implicitly safe because the script never writes outside data/upgrade.
 // See scripts/image-upgrade.sh for full rationale.
 //
 // Schema-compatible with /upgrade/start: writes the same progress.json + result.json
 // so the CCP poll loop in runRemoteUpgrade() works unchanged.
 router.post('/instance/:slug/upgrade/start-image-only', async (req: Request, res: Response) => {
  const slug = param(req, 'slug');
  const entry = await getSlugEntry(slug);
  const { imageTag } = req.body || {};
  // SECURITY: imageTag flows into bash via --image-tag. Constrain to a safe
  // subset of docker tag chars (semver, SHA, named tags). Reject anything
  // that could shell-escape.
  if (imageTag && !/^[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}$/.test(String(imageTag))) {
    res.status(400).json({ error: 'VALIDATION', message: 'Invalid imageTag' });
    return;
  }
  const scriptPath = path.join(entry.basePath, 'scripts', 'image-upgrade.sh');
  try {
    await fs.access(scriptPath);
  } catch {
    res.status(404).json({ error: 'NOT_FOUND', message: 'image-upgrade.sh not found' });
    return;
  }
  // Same concurrency guards as the full /upgrade/start endpoint — uses the
  // same lock + on-disk staleness check + backup/restore mutex.
  if (isSlugLocked(slug, 'upgrade') || await isUpgradeRunningOnDisk(entry.basePath)) {
    res.status(409).json({ error: 'SLUG_BUSY', message: 'An upgrade is already in progress' });
    return;
  }
  if (isSlugLocked(slug, 'backup') || isSlugLocked(slug, 'restore')) {
    res.status(409).json({ error: 'SLUG_BUSY', message: 'A backup or restore is currently running' });
    return;
  }
  // Clear stale progress/result files (same convention as /upgrade/start)
  const progressPath = path.join(entry.basePath, 'data', 'upgrade', 'progress.json');
  const resultPath = path.join(entry.basePath, 'data', 'upgrade', 'result.json');
  await fs.mkdir(path.dirname(progressPath), { recursive: true });
  await fs.rm(progressPath, { force: true });
  await fs.rm(resultPath, { force: true });
  const args: string[] = [scriptPath, '--api-mode'];
  if (imageTag) args.push('--image-tag', String(imageTag));
  void withSlugLock(slug, 'upgrade', async () => {
    logger.info(`[image-upgrade] ${slug}: spawning ${args.join(' ')} (cwd=${entry.basePath})`);
    try {
      await new Promise<void>((resolve, reject) => {
        const proc = spawn('bash', args, {
          cwd: entry.basePath,
          env: { ...process.env, COMPOSE_ANSI: 'never' },
          stdio: ['ignore', 'ignore', 'ignore'],
        });
        proc.on('error', reject);
        proc.on('close', (code) => {
          if (code === 0) resolve();
          else reject(new Error(`image-upgrade.sh exited with code ${code}`));
        });
      });
      logger.info(`[image-upgrade] ${slug}: image-upgrade.sh completed`);
    } catch (err) {
      logger.error(`[image-upgrade] ${slug}: ${(err as Error).message}`);
    }
  }).catch((err) => {
    if (!(err instanceof SlugBusyError)) {
      logger.error(`[image-upgrade] ${slug}: lock or background error: ${(err as Error).message}`);
    }
  });
  res.status(202).json({ started: true, mode: 'image-only' });
 });
 // GET /instance/:slug/upgrade/progress — Read progress.json
 router.get('/instance/:slug/upgrade/progress', async (req: Request, res: Response) => {
  const entry = await getSlugEntry(param(req, 'slug'));
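The `imageTag` guard in the new route can be exercised in isolation. The sketch below reuses the regex from the diff verbatim; the function names `isSafeImageTag` and `buildImageUpgradeArgs` are hypothetical. It is slightly stricter than the route as written: it requires an actual string rather than coercing with `String(...)`, which is an assumption about the desired behaviour, not something the commit does.

```typescript
// Regex copied from the route: first char alphanumeric, then up to 127 chars
// from [a-zA-Z0-9_.-], so at most 128 characters in total.
const IMAGE_TAG_PATTERN = /^[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}$/;

// Hypothetical helper: true only for string values that match the pattern.
function isSafeImageTag(tag: unknown): boolean {
  return typeof tag === 'string' && IMAGE_TAG_PATTERN.test(tag);
}

// Hypothetical helper: builds the argv passed to `bash`, mirroring the route.
// An omitted tag yields no --image-tag flag; an invalid tag throws.
function buildImageUpgradeArgs(scriptPath: string, imageTag?: string): string[] {
  const args = [scriptPath, '--api-mode'];
  if (imageTag !== undefined) {
    if (!isSafeImageTag(imageTag)) throw new Error('Invalid imageTag');
    args.push('--image-tag', imageTag);
  }
  return args;
}
```

Tags such as `v2.10.2`, `latest`, or `sha-4a3d9d7` pass; anything containing spaces, `;`, `$`, `/`, or a leading `-` is rejected, so the value cannot be read as an extra flag or shell syntax by the script.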
--- a/changemaker-control-panel/api/src/modules/instances/instances.routes.ts
+++ b/changemaker-control-panel/api/src/modules/instances/instances.routes.ts
@ -4,7 +4,7 @@ import rateLimit from 'express-rate-limit';
 import { prisma } from '../../lib/prisma';
 import { authenticate, requireRole } from '../../middleware/auth';
 import { validate } from '../../middleware/validate';
-import { createInstanceSchema, updateInstanceSchema, registerInstanceSchema, reconfigureInstanceSchema, configureTunnelSchema, importInstancesSchema, startUpgradeSchema, setupRemoteTunnelSchema } from './instances.schemas';
+import { createInstanceSchema, updateInstanceSchema, registerInstanceSchema, reconfigureInstanceSchema, configureTunnelSchema, importInstancesSchema, startUpgradeSchema, startImageUpgradeSchema, setupRemoteTunnelSchema } from './instances.schemas';
 import * as instancesService from './instances.service';
 import * as healthService from '../../services/health.service';
 import * as backupService from '../../services/backup.service';
@ -362,6 +362,25 @@ router.post(
  }
 );
 // Image-only upgrade (Approach B). Faster + safer than full upgrade for
 // releases that don't change orchestration/templates. See upgrade.service.ts
 // startImageUpgrade for full rationale.
 router.post(
  '/:id/upgrade-images',
  requireRole('SUPER_ADMIN', 'OPERATOR'),
  validate(startImageUpgradeSchema),
  async (req: Request, res: Response) => {
    const { imageTag } = req.body || {};
    const upgrade = await upgradeService.startImageUpgrade(
      req.params.id as string,
      req.user!.id,
      req.ip,
      { imageTag }
    );
    res.status(201).json({ data: upgrade });
  }
 );
 router.get(
  '/:id/upgrade-status',
  requireRole('SUPER_ADMIN', 'OPERATOR'),
--- a/changemaker-control-panel/api/src/modules/instances/instances.schemas.ts
+++ b/changemaker-control-panel/api/src/modules/instances/instances.schemas.ts
@ -121,6 +121,17 @@ export const startUpgradeSchema = z.object({
    .optional(),
 });
 // Approach B: image-only upgrade. Pulls images + recreates core app services
 // without touching tracked files. imageTag is optional — if omitted, the
 // agent uses whatever IMAGE_TAG the install's .env / compose env defines
 // (typically `latest`). Tag must be a valid Docker tag.
 export const startImageUpgradeSchema = z.object({
  imageTag: z
    .string()
    .regex(/^[a-zA-Z0-9][a-zA-Z0-9_.-]{0,127}$/, 'Invalid imageTag')
    .optional(),
 });
 export const setupRemoteTunnelSchema = z.object({
  // Empty string or omitted → resources use standard subdomains (app., api., etc.)
  // A value like "ck" → creates ck-app., ck-api., etc. for multi-tenant domains
--- a/changemaker-control-panel/api/src/services/remote-driver.ts
+++ b/changemaker-control-panel/api/src/services/remote-driver.ts
@ -82,6 +82,10 @@ export interface StartAgentUpgradeOptions {
  branch?: string;
 }
 export interface StartAgentImageUpgradeOptions {
  imageTag?: string;
 }
 interface AgentRequestOptions {
  method: 'GET' | 'POST' | 'DELETE';
  path: string;
@ -574,6 +578,21 @@ export class RemoteDriver implements ExecutionDriver {
    });
  }
  /**
   * Trigger image-upgrade.sh --api-mode on the remote (Approach B: image-only
   * upgrade — pulls images + recreates core app services without touching
   * the install tree). Fire-and-forget; returns 202 immediately. Uses the
   * same progress/result polling endpoints as startUpgrade.
   */
  async startImageUpgrade(options: StartAgentImageUpgradeOptions = {}): Promise<void> {
    await this.request({
      method: 'POST',
      path: `/instance/${this.slug}/upgrade/start-image-only`,
      body: options,
      timeoutMs: 30_000,
    });
  }
  /**
   * Read the agent's data/upgrade/progress.json. Returns the default zero-state
   * if no progress has been written yet.
--- a/changemaker-control-panel/api/src/services/upgrade.service.ts
+++ b/changemaker-control-panel/api/src/services/upgrade.service.ts
@ -205,6 +205,10 @@ export interface StartUpgradeOptions {
  branch?: string;
 }
 export interface StartImageUpgradeOptions {
  imageTag?: string;
 }
 /**
 * Start an upgrade for an instance. Returns the created InstanceUpgrade record.
 * The actual upgrade runs asynchronously (fire-and-forget).
@ -298,6 +302,86 @@ export async function startUpgrade(
  return upgrade;
 }
 /**
 * Start an IMAGE-ONLY upgrade (Approach B). Pulls latest images + recreates
 * core app services without touching tracked files. Faster (~2 min vs ~4-5
 * min for full upgrade) and safer because no filesystem mutation outside
 * docker — tenant content (mkdocs/, configs/) is implicitly preserved.
 *
 * Use this for releases that only bump container code or schema. For
 * releases that change compose orchestration, nginx config, or other
 * tracked files, use startUpgrade() instead.
 *
 * Remote-only for now: local mode would need a `runImageUpgrade` runner
 * which we haven't built (all our instances are remote via mTLS agent).
 */
 export async function startImageUpgrade(
  instanceId: string,
  userId: string,
  ipAddress?: string,
  options?: StartImageUpgradeOptions
 ) {
  const instance = await prisma.instance.findUnique({ where: { id: instanceId } });
  if (!instance) throw new Error('Instance not found');
  if (!instance.isRemote) {
    throw new Error('Image-only upgrade is currently supported only for remote instances');
  }
  if (instance.status !== InstanceStatus.RUNNING && instance.status !== InstanceStatus.STOPPED) {
    throw new Error(`Cannot upgrade instance in ${instance.status} state`);
  }
  // Reuse the same in-progress guard as startUpgrade: only one upgrade
  // (of either type) at a time per instance.
  const active = await prisma.instanceUpgrade.findFirst({
    where: {
      instanceId,
      status: { in: [UpgradeStatus.PENDING, UpgradeStatus.IN_PROGRESS] },
    },
  });
  if (active) {
    throw new Error('An upgrade is already in progress for this instance');
  }
  // Create upgrade record. branch is unused for image-only but keep it
  // populated with current branch for audit trail consistency.
  const upgrade = await prisma.instanceUpgrade.create({
    data: {
      instanceId,
      status: UpgradeStatus.PENDING,
      previousCommit: instance.gitCommit,
      branch: instance.gitBranch,
      triggeredById: userId,
    },
  });
  // Audit log
  await prisma.auditLog.create({
    data: {
      userId,
      instanceId,
      action: AuditAction.INSTANCE_UPGRADE,
      details: {
        upgradeId: upgrade.id,
        previousCommit: instance.gitCommit,
        source: 'remote',
        mode: 'image-only',
        options: options || {},
      } as unknown as Prisma.InputJsonValue,
      ipAddress,
    },
  });
  // Fire-and-forget: reuse runRemoteUpgrade with mode='image-only'. Same
  // poll loop and result handling — only the initial agent call differs.
  runRemoteUpgrade(upgrade.id, instance, undefined, 'image-only', options).catch((err) => {
    logger.error(`[image-upgrade] Remote image upgrade orchestration failed for ${instance.slug}: ${err}`);
  });
  return upgrade;
 }
 /**
 * Async REMOTE upgrade runner.
 *
@ -316,7 +400,9 @@ export async function startUpgrade(
 async function runRemoteUpgrade(
  upgradeId: string,
  instance: Instance,
-  options?: StartUpgradeOptions
+  options?: StartUpgradeOptions,
  mode: 'full' | 'image-only' = 'full',
  imageOnlyOptions?: StartImageUpgradeOptions
 ) {
  const slug = instance.slug;
@ -333,18 +419,27 @@ async function runRemoteUpgrade(
      where: { id: upgradeId },
      data: {
        status: UpgradeStatus.IN_PROGRESS,
-        progressMessage: 'Starting remote upgrade...',
+        progressMessage: mode === 'image-only'
          ? 'Starting image-only upgrade...'
          : 'Starting remote upgrade...',
      },
    });
    // Tell the agent to start. The agent has its own mutex + stale-progress
    // check, so this can return 409 if a previous upgrade is still running.
-    logger.info(`[upgrade] ${slug}: triggering remote upgrade.sh start`);
-    await driver.startUpgrade({
-      skipBackup: options?.skipBackup,
-      useRegistry: options?.useRegistry,
-      branch: options?.branch,
-    });
+    if (mode === 'image-only') {
+      logger.info(`[upgrade] ${slug}: triggering remote image-upgrade.sh start`);
+      await driver.startImageUpgrade({
+        imageTag: imageOnlyOptions?.imageTag,
+      });
+    } else {
+      logger.info(`[upgrade] ${slug}: triggering remote upgrade.sh start`);
+      await driver.startUpgrade({
+        skipBackup: options?.skipBackup,
+        useRegistry: options?.useRegistry,
+        branch: options?.branch,
+      });
+    }
    // Poll progress + result. We treat /result returning 200 as the signal
    // that upgrade.sh exited (successfully or with code != 0 — the script
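`runRemoteUpgrade` now takes a `mode` plus two separate option bags, only one of which is meaningful per call. As a possible alternative (not what this commit does), the mode and its options could travel together as a discriminated union. In the sketch below, `UpgradeRequest`, `AgentCallPlan`, and `planAgentCall` are hypothetical names, and the `boolean` types for `skipBackup` and `useRegistry` are assumptions since their declarations are not shown in the diff.

```typescript
// Option shapes; skipBackup/useRegistry types are assumed, not from the diff.
interface FullUpgradeOptions { skipBackup?: boolean; useRegistry?: boolean; branch?: string }
interface ImageUpgradeOptions { imageTag?: string }

// The mode selects which options type is allowed, so a caller cannot pass
// image-only options to a full upgrade or vice versa.
type UpgradeRequest =
  | { mode: 'full'; options?: FullUpgradeOptions }
  | { mode: 'image-only'; options?: ImageUpgradeOptions };

type AgentCallPlan =
  | { driverMethod: 'startUpgrade'; args: FullUpgradeOptions; progressMessage: string }
  | { driverMethod: 'startImageUpgrade'; args: ImageUpgradeOptions; progressMessage: string };

// Pure function: decides which driver call to make and which progress message
// to record, using the same strings as the diff.
function planAgentCall(request: UpgradeRequest): AgentCallPlan {
  if (request.mode === 'image-only') {
    return {
      driverMethod: 'startImageUpgrade',
      args: { imageTag: request.options?.imageTag },
      progressMessage: 'Starting image-only upgrade...',
    };
  }
  return {
    driverMethod: 'startUpgrade',
    args: {
      skipBackup: request.options?.skipBackup,
      useRegistry: request.options?.useRegistry,
      branch: request.options?.branch,
    },
    progressMessage: 'Starting remote upgrade...',
  };
}
```

Keeping the decision in a pure function also makes the branch testable without a live agent; the poll loop that follows would stay unchanged either way.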
--- a/changemaker-control-panel/templates/docker-compose.yml.hbs
+++ b/changemaker-control-panel/templates/docker-compose.yml.hbs
--- a/changemaker-control-panel/templates/docker-compose.yml.hbs.OLD-style-pre-approach-c
+++ b/changemaker-control-panel/templates/docker-compose.yml.hbs.OLD-style-pre-approach-c
--- a/docker-compose.prod.yml
+++ b/docker-compose.prod.yml
@ -976,6 +976,39 @@ services:
      retries: 10
      start_period: 30s
  # Gancio Config Init — Writes /home/node/data/config.json from .env if missing.
  # Gancio refuses to start when its DB has tables but the data volume has no
  # config.json ("Non empty db! Please move your current db elsewhere than retry"),
  # which causes an infinite restart loop. This sidecar runs on every `up` and is
  # a no-op when config.json is already present. See docker-compose.yml for the
  # full rationale; the two files must stay in parity per scripts/validate-compose-parity.sh.
  gancio-config-init:
    image: ${GITEA_REGISTRY:-gitea.bnkops.com/admin}/alpine:3
    container_name: gancio-config-init
    restart: "no"
    volumes:
      - gancio-data:/data
    environment:
      - GANCIO_BASE_URL=${GANCIO_BASE_URL:-https://events.cmlite.org}
      - V2_POSTGRES_USER=${V2_POSTGRES_USER:-changemaker}
      - V2_POSTGRES_PASSWORD=${V2_POSTGRES_PASSWORD:?V2_POSTGRES_PASSWORD must be set in .env}
    entrypoint: ["sh", "-c"]
    command:
      - |
        set -e
        if [ -s /data/config.json ]; then
          echo "Gancio config.json present — skipping"
          exit 0
        fi
        echo "Gancio config.json missing — regenerating from .env"
        printf '{"baseurl":"%s","server":{"host":"0.0.0.0","port":13120},"db":{"dialect":"postgres","host":"changemaker-v2-postgres","port":5432,"database":"gancio","username":"%s","password":"%s"}}' \
          "$$GANCIO_BASE_URL" "$$V2_POSTGRES_USER" "$$V2_POSTGRES_PASSWORD" > /data/config.json
        chown 1000:1000 /data/config.json
        echo "Gancio config.json regenerated"
    logging: *default-logging
    networks:
      - changemaker-lite
  # Gancio — Event management platform (uses shared PostgreSQL)
  gancio:
    image: ${GITEA_REGISTRY:-gitea.bnkops.com/admin}/gancio:1.28.2
@ -984,6 +1017,8 @@ services:
    depends_on:
      v2-postgres:
        condition: service_healthy
      gancio-config-init:
        condition: service_completed_successfully
    ports:
      - "127.0.0.1:${GANCIO_PORT:-8092}:13120"
    healthcheck:
@ -1392,9 +1427,10 @@ services:
      - /var/run/docker.sock:/var/run/docker.sock
      - ccp-agent-data:/var/lib/ccp-agent
      - ccp-agent-certs:/etc/ccp-agent
-      # Mount the instance directory so the agent can read compose files and run
-      # `docker compose -p <project>` commands against the real project on disk.
-      - .:/app/instance:ro
+      # Mount the instance directory so the agent can read compose files and
+      # write status.json + backups (writable; agent already has docker.sock,
+      # so file write access is not an additional security escalation).
+      - .:/app/instance
    environment:
      - AGENT_PORT=7443
      - AGENT_DATA_DIR=/var/lib/ccp-agent
@ -1406,7 +1442,12 @@ services:
      - INSTANCE_BASE_PATH=/app/instance
      # Pass the host's compose project name so the agent runs `docker compose -p <project>`
      # against the right project (not basename of INSTANCE_BASE_PATH, which is "instance").
      # COMPOSE_PROJECT is read by the agent's TypeScript for slug derivation;
      # COMPOSE_PROJECT_NAME is what Docker Compose itself reads when upgrade.sh
      # shells out to `docker compose ...` — without it, compose defaults to
      # basename(cwd)="instance" and collides with the host's existing containers.
      - COMPOSE_PROJECT=${COMPOSE_PROJECT_NAME:-changemaker-lite}
      - COMPOSE_PROJECT_NAME=${COMPOSE_PROJECT_NAME:-changemaker-lite}
    logging: *default-logging
    networks:
      - changemaker-lite
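The `gancio-config-init` sidecar above assembles `config.json` with `printf '%s'`, which inserts values verbatim; a password containing a double quote or backslash would therefore yield malformed or altered JSON. The sketch below shows the same document built through a serializer. It is illustrative only: the sidecar runs `sh` in an Alpine image with no Node runtime, and `buildGancioConfig` and `GancioDbSettings` are hypothetical names.

```typescript
interface GancioDbSettings { username: string; password: string }

// Builds the same config document as the sidecar's printf template, with the
// fixed host/port/database values copied from the compose file. JSON.stringify
// escapes quotes and backslashes in the interpolated values.
function buildGancioConfig(baseUrl: string, db: GancioDbSettings): string {
  return JSON.stringify({
    baseurl: baseUrl,
    server: { host: '0.0.0.0', port: 13120 },
    db: {
      dialect: 'postgres',
      host: 'changemaker-v2-postgres',
      port: 5432,
      database: 'gancio',
      username: db.username,
      password: db.password,
    },
  });
}
```

For values without special characters the output is byte-identical to what the printf template produces, so the two approaches only diverge on inputs that need escaping.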
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -998,6 +998,40 @@ services:
      start_period: 30s
  # Gancio — Event management platform (uses shared PostgreSQL)
  # Gancio Config Init — Writes /home/node/data/config.json from .env if missing.
  # Gancio refuses to start when its DB has tables but the data volume has no
  # config.json ("Non empty db! Please move your current db elsewhere than retry"),
  # which causes an infinite restart loop. This sidecar runs on every `up` and is
  # a no-op when config.json is already present. Reversible: removing this
  # service has no effect on healthy stacks; it only matters when the volume
  # loses config.json (volume rename, partial restore, manual volume rm, etc.).
  gancio-config-init:
    image: alpine:3
    container_name: gancio-config-init
    restart: "no"
    volumes:
      - gancio-data:/data
    environment:
      - GANCIO_BASE_URL=${GANCIO_BASE_URL:-https://events.cmlite.org}
      - V2_POSTGRES_USER=${V2_POSTGRES_USER:-changemaker}
      - V2_POSTGRES_PASSWORD=${V2_POSTGRES_PASSWORD:?V2_POSTGRES_PASSWORD must be set in .env}
    entrypoint: ["sh", "-c"]
    command:
      - |
        set -e
        if [ -s /data/config.json ]; then
          echo "Gancio config.json present — skipping"
          exit 0
        fi
        echo "Gancio config.json missing — regenerating from .env"
        printf '{"baseurl":"%s","server":{"host":"0.0.0.0","port":13120},"db":{"dialect":"postgres","host":"changemaker-v2-postgres","port":5432,"database":"gancio","username":"%s","password":"%s"}}' \
          "$$GANCIO_BASE_URL" "$$V2_POSTGRES_USER" "$$V2_POSTGRES_PASSWORD" > /data/config.json
        chown 1000:1000 /data/config.json
        echo "Gancio config.json regenerated"
    logging: *default-logging
    networks:
      - changemaker-lite
  gancio:
    image: cisti/gancio:1.28.2
    container_name: gancio-changemaker
@ -1005,6 +1039,8 @@ services:
    depends_on:
      v2-postgres:
        condition: service_healthy
      gancio-config-init:
        condition: service_completed_successfully
    ports:
      - "127.0.0.1:${GANCIO_PORT:-8092}:13120"
    healthcheck:
@ -1414,7 +1450,10 @@ services:
      - /var/run/docker.sock:/var/run/docker.sock
      - ccp-agent-data:/var/lib/ccp-agent
      - ccp-agent-certs:/etc/ccp-agent
-      - .:/app/instance:ro
+      # Writable: agent must write data/upgrade/{status,progress,result}.json
      # and data/backups/*.tar.gz. Agent already has docker.sock — file write
      # access is not an additional security escalation.
      - .:/app/instance
    environment:
      - AGENT_PORT=7443
      - AGENT_DATA_DIR=/var/lib/ccp-agent
@ -1426,7 +1465,12 @@ services:
      - INSTANCE_BASE_PATH=/app/instance
      # Pass the host's compose project name so the agent runs `docker compose -p <project>`
      # against the right project (not basename of INSTANCE_BASE_PATH, which is "instance").
      # COMPOSE_PROJECT is read by the agent's TypeScript for slug derivation;
      # COMPOSE_PROJECT_NAME is what Docker Compose itself reads when upgrade.sh
      # shells out to `docker compose ...` — without it, compose defaults to
      # basename(cwd)="instance" and collides with the host's existing containers.
      - COMPOSE_PROJECT=${COMPOSE_PROJECT_NAME:-changemaker-lite}
      - COMPOSE_PROJECT_NAME=${COMPOSE_PROJECT_NAME:-changemaker-lite}
    logging: *default-logging
    networks:
      - changemaker-lite
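The comment on `COMPOSE_PROJECT_NAME` describes a fallback: when the variable is unset, Compose derives the project name from the working directory, which inside the agent container is `/app/instance`. A simplified sketch of that fallback follows. It models only the environment variable and the directory basename; the real tool also honours the `-p` flag and a top-level `name:` key and normalizes the directory name, all of which are omitted here. `resolveComposeProject` is a hypothetical name.

```typescript
// Simplified model of project-name resolution: env var first, then the last
// path segment of the working directory. Not a reimplementation of Compose.
function resolveComposeProject(
  env: Record<string, string | undefined>,
  cwd: string,
): string {
  const fromEnv = env.COMPOSE_PROJECT_NAME;
  if (fromEnv) return fromEnv;
  // Strip trailing slashes, then take the final segment.
  const segments = cwd.replace(/\/+$/, '').split('/');
  return segments[segments.length - 1] ?? '';
}
```

Under this model, an agent started without the variable resolves to `instance`, while one started with `COMPOSE_PROJECT_NAME=changemaker-lite` targets the host's real project, which is the collision the compose change avoids.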
--- a/docs/SESSION_HANDOFF_2026-05-20.md
+++ b/docs/SESSION_HANDOFF_2026-05-20.md
@ -0,0 +1,266 @@
 # Session Handoff: Upgrade Flow Redesign (2026-05-20 → 2026-05-21)
 > Carries forward all context from a long working session into the next conversation. If you're a fresh agent: read this top-to-bottom before touching anything.
 ---
 ## Quick state of the fleet
 | Tenant | Type | Version | Agent patched | Surgical script update | Notes |
 |---|---|---|---|---|---|
 | bnkops (n4) | source | main @ 1b80e82 | ✅ | ⏳ pending | Management node; CCP backend runs here in parallel |
 | marcelle (n5, cursedknowledge.org) | release | v2.9.15 | ✅ | ⏳ pending | Test bench; first end-to-end CCP upgrade test ran here (succeeded after manual Phase 6 recovery) |
 | trbh (n6) | source | main @ 1b80e82 | ✅ | ⏳ pending | mkdocs content RESTORED from `stash@{0}` — site serves "That Really Blonde Human" correctly |
 | pia (n3, pia-bnkops) | release | v2.9.10 | ✅ | ✅ **completed 2026-05-21** | First successful surgical update — proof the procedure works |
 | pridecorner (n1) | source | main @ 1b80e82 | ✅ | ⏳ pending | Has 3 March 9 upgrade-* stashes still on disk (audit done; recovery deferred to another agent) |
 | soroush (n7) | source | main @ 1b80e82 | ✅ | ⏳ pending | Was earliest-fixed tonight |
 | linda (n2, lindalindsay.org) | release-converted | v2.9.14 | ✅ | ⏳ pending | Was source-install with broken `.git`; converted to release mode (VERSION file written) |
 **Public sites verified working at session end**: trbh.org, docs.trbh.org, bnkops.com, pridecorner.ca, soroushsamavat.org, publicinterestalberta.org, lindalindsay.org, cursedknowledge.org.
 **Known caveat**: docs.bnkops.com returns HTTP 000 externally (Pangolin tunnel routing issue, pre-existing, NOT caused by this session). bnkops mkdocs container serves correct content locally.
 ---
 ## What landed in source (committed + pushed to origin/main)
 | Commit | Description |
 |---|---|
 | `1b80e82` | `fix(ccp-agent): whitelist /app/instance for git safe.directory` — ccp-agent Dockerfile |
 | `e88ac79` | `fix(ccp-agent): export COMPOSE_PROJECT_NAME so upgrade.sh sees correct project` — docker-compose.yml + .prod.yml |
 | `9613c3e` | `fix(upgrade): Phase 1 of upgrade-flow redesign (Approach A)` — upgrade.sh + scripts/lib/mkdocs-snapshot.sh + scripts/upgrade-stash-cleanup.sh + .gitignore |
 | `a7d3dd7` | `chore(release): ship scripts/lib/ + classify upgrade-stash-cleanup.sh` — build-release.sh |
 **Release**: v2.10.2 tagged on `a7d3dd7`, uploaded to Gitea Releases as the new "latest" (`/releases/latest` returns v2.10.2 — the timestamp issue from earlier in session is fixed via build-release.sh's `target_commitish` workaround).
 **Earlier in session**: tonight also produced commit `a531f9b` (ccp-agent missing bash/curl/jq/python3 + writable mount) and v2.10.1 release. v2.10.2 supersedes v2.10.1.
 ---
 ## The plan — Approach A (DONE) + B + C (pending)
 Full design lives at `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`.
 ### Approach A — ✅ Done
 Three fixes to existing `scripts/upgrade.sh` shipping in v2.10.2:
 1. **Phase 6 self-destruct fix** — Phase 6's broad `docker compose up -d` no longer recreates ccp-agent (which would SIGKILL the running script). Instead, ccp-agent restart is deferred to AFTER `write_result` writes the final `result.json`, via a detached `nohup ... & disown` subshell.
 2. **mkdocs/ snapshot fallback** — `scripts/lib/mkdocs-snapshot.sh` is sourced by upgrade.sh's Phase 2. Before any other backup or pull operation, it tarballs the entire `mkdocs/` directory into `mkdocs-backup-<timestamp>.tar.gz` in the install root. Retains last 5. Discoverable via `ls`. Restoration is a one-liner:
   ```bash
   tar xzf "$(ls -t mkdocs-backup-*.tar.gz | head -1)" -C . && \
   docker compose restart mkdocs mkdocs-site-server
   ```
 3. **`upgrade-stash-cleanup.sh`** — interactive utility to drop accumulated `upgrade-*` git stashes. Warns LOUDLY if any stash contains `mkdocs/mkdocs.yml` so operators verify recovery before dropping.
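
The snapshot-and-retain behaviour from item 2 can be sketched as below. This is a minimal sketch under stated assumptions, not the contents of `scripts/lib/mkdocs-snapshot.sh`: the function name `snapshot_dir`, its arguments, and the nanosecond timestamp (GNU `date`) are hypothetical and only illustrate "tarball into the install root, keep the newest 5".

```bash
#!/usr/bin/env bash
set -euo pipefail

# snapshot_dir ROOT NAME [KEEP]  (hypothetical helper, not the shipped one)
# Archives ROOT/NAME into ROOT/NAME-backup-<timestamp>.tar.gz and keeps only
# the newest KEEP archives (default 5).
snapshot_dir() {
  local root="$1" name="$2" keep="${3:-5}"
  local stamp archive old
  stamp="$(date +%Y%m%d_%H%M%S_%N)"   # assumption: GNU date (%N = nanoseconds)
  archive="$root/${name}-backup-${stamp}.tar.gz"
  tar -czf "$archive" -C "$root" "$name"
  # Archive names embed the timestamp, so a reverse name sort is newest-first;
  # everything after the first KEEP entries is pruned.
  while IFS= read -r old; do
    rm -f -- "$old"
  done < <(ls -1 "$root/${name}"-backup-*.tar.gz | sort -r | tail -n +"$((keep + 1))")
  echo "$archive"
}

# Demo against a throwaway directory: 7 snapshots, 5 survive.
demo="$(mktemp -d)"
mkdir -p "$demo/mkdocs/docs"
echo 'site_name: Tenant' > "$demo/mkdocs/mkdocs.yml"
for _ in 1 2 3 4 5 6 7; do
  snapshot_dir "$demo" mkdocs >/dev/null
done
kept="$(ls -1 "$demo"/mkdocs-backup-*.tar.gz | wc -l)"
echo "archives kept: $((kept))"
```

Sorting by name rather than by mtime keeps the pruning deterministic even when several snapshots land within the same second.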
 ### Approach B — ⏳ Pending (1-2 days)
 Add `--image-only` upgrade mode. Production images are hermetic: they bake in compiled code and Prisma migrations, and the entrypoint runs migrations on container start. Therefore `docker compose pull && docker compose up -d` IS a complete code+schema upgrade. **No filesystem mutation outside Docker** → tenant content implicitly safe.
 Files to create or modify:
 - `scripts/image-upgrade.sh` (~150 lines; sources `scripts/lib/mkdocs-snapshot.sh` for the fallback)
 - `changemaker-control-panel/agent/src/routes/upgrade.routes.ts` → new endpoint `POST /instance/:slug/upgrade/start-image-only`
 - `changemaker-control-panel/api/src/services/upgrade.service.ts` → `startImageUpgrade(instanceId, userId, { imageTag })`
 - `changemaker-control-panel/api/src/services/remote-driver.ts` → `startImageUpgrade()`
 - `changemaker-control-panel/api/src/modules/instances/instances.routes.ts` → `POST /:id/upgrade-images`
 - CCP admin UI: "Quick Upgrade (image-only)" button on `InstanceDetailPage.tsx`
 ### Approach C — ⏳ Pending (3-5 days)
 CCP-driven template re-render for orchestration-changing upgrades. Reuses existing `template-engine.ts` and `reconfigureInstance` pattern. Only writes templated files (compose, nginx, configs/pangolin); never touches `mkdocs/` or `configs/code-server/data/`. See plan for details.
 ---
 ## How to apply v2.10.2 fixes to remaining tenants
 **For PIA: already done** — used as the proof-of-concept on 2026-05-21. mkdocs.yml md5 unchanged, file count unchanged. ~5 minutes per tenant.
 **For the other 6 tenants**, use the surgical update — DO NOT run a raw `git pull origin main` (it would resurrect tenant-deleted files via merge logic):
 ### Source installs (bnkops, trbh, pridecorner, soroush)
 ```bash
 # bnkops, trbh, soroush use ~/changemaker.lite
 # pridecorner uses ~/cmlite/changemaker.lite
 cd ~/changemaker.lite  # or ~/cmlite/changemaker.lite
 git fetch origin main
 mkdir -p scripts/lib
 git checkout origin/main -- \
  scripts/upgrade.sh \
  scripts/upgrade-stash-cleanup.sh \
  scripts/lib/mkdocs-snapshot.sh \
  scripts/build-release.sh \
  docker-compose.yml \
  .gitignore
 # Sanity: tenant content should still be ahead/divergent (not touched)
 git status mkdocs/ configs/  # should show no NEW changes from this update
 ```
 ### Release installs (marcelle, linda): same approach as pia
 ```bash
 # marcelle: ~/changemaker.lite, ssh bunker-admin@100.90.78.47
 # linda: ~/changemaker.lite.canonical, ssh bunker-admin@n2-linda.taile33572.ts.net
 cd ~/changemaker.lite  # or ~/changemaker.lite.canonical
 curl -fSL https://gitea.bnkops.com/admin/changemaker.lite/releases/download/v2.10.2/changemaker-lite-v2.10.2.tar.gz \
  -o /tmp/v2.10.2.tar.gz
 mkdir -p scripts/lib
 tar -xzf /tmp/v2.10.2.tar.gz --strip-components=1 \
  changemaker-lite/scripts/upgrade.sh \
  changemaker-lite/scripts/upgrade-stash-cleanup.sh \
  changemaker-lite/scripts/lib/mkdocs-snapshot.sh \
  changemaker-lite/docker-compose.yml
 chmod +x scripts/upgrade.sh scripts/upgrade-stash-cleanup.sh scripts/lib/mkdocs-snapshot.sh
 rm -f /tmp/v2.10.2.tar.gz
 # Do NOT update VERSION — only scripts changed, rest of install stays at current version.
 ```
 ### Verification per tenant
 ```bash
 # Before update: capture
 md5sum mkdocs/mkdocs.yml
 find mkdocs/docs -type f | wc -l
 # Run the appropriate surgical update above
 # After update: re-verify (should match)
 md5sum mkdocs/mkdocs.yml  
 find mkdocs/docs -type f | wc -l
 # Confirm new upgrade.sh
 grep -c 'deferred ccp-agent\|Deferred ccp-agent' scripts/upgrade.sh  # expect 2
 # Optional: smoke-test the snapshot helper
 PROJECT_DIR=$(pwd) bash -c '. scripts/lib/mkdocs-snapshot.sh; snapshot_mkdocs'
 ls -lh mkdocs-backup-*.tar.gz
 ```
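
The before/after comparison above can also be captured in a single value per tenant. The sketch below is only an illustration: `content_fingerprint` is a hypothetical helper (not part of the release), and the demo runs against a temporary directory rather than a real install.

```bash
#!/usr/bin/env bash
set -euo pipefail

# content_fingerprint INSTALL_DIR  (hypothetical helper)
# Prints "<md5 of mkdocs/mkdocs.yml> <number of files under mkdocs/docs>".
content_fingerprint() {
  local dir="$1" sum count
  sum="$(md5sum "$dir/mkdocs/mkdocs.yml" | awk '{print $1}')"
  count="$(find "$dir/mkdocs/docs" -type f | wc -l)"
  echo "$sum $((count))"
}

# Demo install tree (stand-in for a tenant's install dir).
demo="$(mktemp -d)"
mkdir -p "$demo/mkdocs/docs"
echo 'site_name: Tenant' > "$demo/mkdocs/mkdocs.yml"
echo 'page' > "$demo/mkdocs/docs/index.md"

before="$(content_fingerprint "$demo")"
# ... the surgical update would run here ...
after="$(content_fingerprint "$demo")"

if [[ "$before" == "$after" ]]; then
  echo "tenant content unchanged"
else
  echo "TENANT CONTENT CHANGED: $before -> $after" >&2
fi
```

On a real tenant the same function would be called with the install directory (for example `content_fingerprint .`) before and after the update.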
 ---
 ## Bug inventory — what we know
 ### Fixed in v2.10.2
 | Bug | Memory file | Status |
 |---|---|---|
 | Gitea release `created_unix=0` (lightweight tag + Gitea 1.23.x quirk) | `feedback_gitea_release_tag_timing.md` | Fixed in `build-release.sh` — uses `target_commitish` + removes remote tag first |
 | ccp-agent image missing bash/curl/jq/python3 + git safe.directory | `feedback_ccp_agent_image_deps.md` | Fixed in agent Dockerfile + rolled out to all 7 tenants |
 | ccp-agent compose mount was `:ro` (blocked status.json writes) | (in `feedback_ccp_agent_image_deps.md`) | Fixed in both compose files |
 | CCP upgrade Phase 5 collision: `COMPOSE_PROJECT_NAME` mismatch | `feedback_upgrade_compose_project_name.md` | Fixed via env-var addition in compose env block (e88ac79) — also needs `.env` entry on tenants installed before v2.10.2 |
 | upgrade.sh Phase 6 self-destruct | `feedback_upgrade_sh_bugs.md` | Fixed in v2.10.2 — deferred ccp-agent restart |
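
The deferred restart in the last row (write the final result first, then restart ccp-agent from a process that survives the script) follows a general pattern that can be sketched as below. This is a sketch under assumptions, not the code in upgrade.sh: `defer_detached` is a hypothetical helper, and `touch "$marker"` stands in for the real ccp-agent restart command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# defer_detached DELAY CMD [ARGS...]  (hypothetical helper)
# Runs CMD after DELAY seconds in a background process that ignores SIGHUP
# and is removed from this shell's job table, so it can outlive the caller.
defer_detached() {
  local delay="$1"; shift
  nohup bash -c 'sleep "$1"; shift; "$@"' _ "$delay" "$@" >/dev/null 2>&1 &
  disown
}

result_file="$(mktemp)"
marker="${result_file}.restarted"

# 1. Write the final result while the script is still alive.
echo '{"success": true}' > "$result_file"

# 2. Only then schedule the restart (stand-in command here).
defer_detached 1 touch "$marker"

echo "result written; restart scheduled"
```

The ordering is the point: the result file exists before anything that could kill the script is triggered.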
 ### Open
 - **upgrade.sh `git stash → git pull` stash-no-pop** — Pride Corner has 3 stashes from March 9 holding mkdocs.yml customizations. Existing `save_user_paths`/`restore_user_paths` in upgrade.sh handles the common case; the snapshot fallback (v2.10.2) covers edge cases. Pridecorner-specific recovery handled by another agent.
 - **Agent-side `detached: true` spawn** — Defense-in-depth. Skip unless Phase 6 self-destruct re-emerges.
 ---
 ## Tenant content protection layers (all in v2.10.2)
 1. **`save_user_paths`/`restore_user_paths`** in upgrade.sh — preserves working-tree state of `mkdocs/docs/`, `mkdocs/mkdocs.yml`, `mkdocs/site/`, `configs/`, `nginx/conf.d/services.conf` across `git pull`.
 2. **`git stash` + auto-resolve on USER_PATHS**: modified tracked files are stashed, then popped with `git checkout --theirs` applied to any USER_PATHS conflicts.
 3. **Pre-upgrade mkdocs snapshot** — tarball of `mkdocs/` to install root before any other phase runs. Fallback for everything else.
 ---
 ## Tonight's recovery work — already applied
 These tenants had content damage from earlier in the session; recovery was completed:
 - **trbh** — mkdocs.yml + 143 M files restored from `stash@{0}`; 538 D-entry files re-deleted. Public sites serve correct branding.
 - **bnkops** — same pattern, 100 M files restored + 82 D-entry re-deletions. Public sites serve correct branding.
 - **marcelle** — manual recovery from Phase 6 self-destruct test (file rollback + service restart). On v2.10.1 currently. Operating normally.
 `stash@{0}` is preserved on trbh and bnkops as forensic record + safety net.
 ---
 ## CCP access
 ```
 URL:       http://n4-bnkops.taile33572.ts.net:5100  (UI)
           http://n4-bnkops.taile33572.ts.net:5000  (API)
 User:      admin@thebunkerops.ca
 Password:  NRTgHdC7Zxxs2P2UmNwnEbn3jTwU8uJN  (seed; rotate if you want)
 Role:      SUPER_ADMIN
 ```
 ---
 ## Test bench (marcelle)
 ```
 SSH:           ssh bunker-admin@100.90.78.47
 Install dir:   ~/changemaker.lite
 Domain:        cursedknowledge.org
 Admin:         admin@cursedknowledge.org / @TheBunker2025!
 CCP slug:      changemakerlite
 CCP id:        71b5bc4a-c47e-4435-b460-e9bc303b76ed
 ```
 Marcelle is the test bench per `docs/TEST_SERVER.md`. Use it for ALL upgrade experiments before touching production tenants.
 ---
 ## Per-tenant quick reference
 | Tenant | SSH | Install dir | CCP id |
 |---|---|---|---|
 | bnkops | bunker-admin@n4-bnkops.taile33572.ts.net | ~/changemaker.lite | 21238536-7c04-4a3b-a073-38390a939046 |
 | marcelle | bunker-admin@100.90.78.47 | ~/changemaker.lite | 71b5bc4a-c47e-4435-b460-e9bc303b76ed |
 | trbh | bunker-admin@n6-trbh.taile33572.ts.net | ~/changemaker.lite | c066dc23-64a5-4684-96a7-992e65c1b82c |
 | pia | pia-bnkops@n3-pia.taile33572.ts.net | ~/changemaker.lite | 92a11622-d357-4ab4-b21e-60c030c1b026 |
 | pridecorner | bunker-admin@n1-pridecorner.taile33572.ts.net | ~/cmlite/changemaker.lite | a30de94b-ef28-42b6-a71d-112669526a62 |
 | soroush | bunker-admin@n7-soroush.taile33572.ts.net | ~/changemaker.lite | 0c70f94c-1319-41e1-867c-5674f17cadda |
 | linda | bunker-admin@n2-linda.taile33572.ts.net | ~/changemaker.lite.canonical | 6dcc19a1-f4fd-45df-be77-5bf62f8110c8 |
 ---
 ## Most important "don't repeat my mistakes" notes
 1. **Never run `git stash` followed by `git pull --ff-only origin main` on a tenant** outside of upgrade.sh. The stash silently displaces tenant content. If you must update files on a source-installed tenant, use targeted `git checkout origin/main -- <specific-file>` instead.
 2. **Never blindly trigger CCP "Upgrade Now"** on a tenant still running pre-v2.10.2 upgrade.sh — it will Phase 6 self-destruct. Apply surgical script update first (instructions above), THEN trigger CCP upgrade.
 3. **mkdocs/docs/ contains upstream tracked files** (default screenshots, demo docs, blog posts). Tenants typically delete these locally without committing. ANY operation that brings origin/main's tracked tree into the working tree (git pull, tarball extract) will resurrect them. v2.10.2's snapshot fallback gives you a recovery path; the surgical update procedure (this doc) avoids the issue entirely.
 4. **mkdocs/mkdocs.yml is tracked, tenant-customized** with branding. Lives under USER_PATHS so v2.10.2's upgrade.sh protects it. But if you do raw git operations outside the script, it's exposed.
 5. **CCP backend on n4 is decoupled from per-tenant ccp-agent**. Restarting a tenant's ccp-agent does NOT affect CCP itself. Verified during bnkops patch (CCP backend stayed at 41h uptime while ccp-agent recreated).
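
Notes 1 and 3 can be reproduced in a throwaway repository. The sketch below builds a hypothetical upstream and tenant clone in a temp directory (all paths and file contents are made up for the demo) and shows that a targeted `git checkout origin/main -- <file>` updates only the named file, while a tracked file the tenant deleted locally stays deleted.

```bash
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
# Wrapper so the demo does not depend on global git identity settings.
g() { git -c user.name=demo -c user.email=demo@example.invalid "$@"; }

# Upstream repo with one script and one default demo page.
mkdir -p "$work/upstream/scripts" "$work/upstream/mkdocs/docs"
cd "$work/upstream"
g init -q
g symbolic-ref HEAD refs/heads/main
echo 'echo v1' > scripts/upgrade.sh
echo 'demo page' > mkdocs/docs/demo.md
g add .
g commit -q -m 'initial'

# Tenant clone: the demo page is deleted locally and never committed.
g clone -q "$work/upstream" "$work/tenant"
rm "$work/tenant/mkdocs/docs/demo.md"

# Upstream ships a fixed script.
echo 'echo v2' > scripts/upgrade.sh
g commit -q -am 'fix upgrade.sh'

# Surgical update on the tenant: fetch, then check out only the named path.
cd "$work/tenant"
g fetch -q origin main
g checkout origin/main -- scripts/upgrade.sh

script_content="$(cat scripts/upgrade.sh)"
if [[ -e mkdocs/docs/demo.md ]]; then
  demo_state="resurrected"
else
  demo_state="still deleted"
fi
echo "upgrade.sh: $script_content; demo.md: $demo_state"
```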
 ---
 ## Memory files (in `/home/bunker-admin/.claude/projects/-home-bunker-admin-changemaker-lite/memory/`)
 Latest session work documented in:
 - `feedback_gitea_release_tag_timing.md`
 - `feedback_ccp_agent_image_deps.md`
 - `feedback_upgrade_compose_project_name.md`
 - `feedback_upgrade_sh_bugs.md`
 - `feedback_session_2026_05_20_damage_report.md`
 Plus the architectural plan: `/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md`
 ---
 ## Where to start the next session
 Recommended sequence:
 1. **Apply surgical update to remaining 6 tenants** (~30-45 min, low risk; pia procedure already proven). Order: marcelle, linda (release), then soroush, trbh, bnkops, pridecorner (source).
 2. **Test CCP-driven upgrade on marcelle** after surgical update lands. This will verify the deferred ccp-agent restart works end-to-end through the CCP path (the test we couldn't complete tonight because Phase 6 kept self-destructing).
 3. **Implement Approach B** per the plan — image-only upgrade mode. Estimated 1-2 days.
 4. **Implement Approach C** — CCP template re-render. 3-5 days.
 If only one thing happens next session: **do step 1**. Six surgical updates × ~5 minutes each. The rest of the fleet stays vulnerable to Phase 6 self-destruct until they're on v2.10.2's upgrade.sh.
--- a/docs/SESSION_HANDOFF_2026-05-21.md
+++ b/docs/SESSION_HANDOFF_2026-05-21.md
@@ -0,0 +1,169 @@
 # Session Handoff: Approach B Rollout + Approach C Planning (2026-05-21)
 Carries forward all context from a long working session. If you're a fresh agent: read this top-to-bottom before touching anything.
 ---
 ## What landed in this session (commits on origin/main)
 | Commit | Description |
 |---|---|
 | `4a3d9d7` | `feat(upgrade): Approach B - image-only upgrade mode` — 7 files, 666 insertions. scripts/image-upgrade.sh + CCP agent endpoint + CCP backend (driver/service/route/schema) + admin UI "Quick Upgrade" button. |
 | `<this commit>` | docs: session handoff + Approach C Phase 0 initial template overlay |
 Plus several non-tracked deploys:
 - v2.10.2 surgical update applied to remaining 6 tenants (soroush, linda, marcelle, bnkops, trbh, pridecorner — pia was done previously). All verified mkdocs untouched, upgrade.sh sha matches `b9f37d59...`.
 - Fleet rollout of Approach B: new `image-upgrade.sh` script delivered + new `ccp-agent` image (with `/upgrade/start-image-only` endpoint) deployed to all 7 tenants. bnkops's ccp-agent was rebuilt from source (it is built locally rather than pulled from the registry).
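
For the ccp-agent image delivery, this handoff's gotchas list names `docker save | ssh ... docker load` as the registry-less path and `StrictHostKeyChecking=accept-new` for first connections. A minimal sketch of how the two combine is shown below; `ship_image` and its `DRY_RUN` switch are hypothetical, and the image name and host are taken from elsewhere in this document purely as example arguments.

```bash
#!/usr/bin/env bash
set -euo pipefail

# ship_image IMAGE SSH_TARGET  (hypothetical helper)
# Streams a local image to a remote docker daemon without a registry.
# With DRY_RUN=true it only prints the pipeline it would run.
ship_image() {
  local image="$1" target="$2"
  # accept-new: trust a host key on first contact, still reject changed keys.
  local ssh_opts=(-o StrictHostKeyChecking=accept-new)
  if [[ "${DRY_RUN:-false}" == "true" ]]; then
    echo "docker save $image | ssh ${ssh_opts[*]} $target docker load"
    return 0
  fi
  docker save "$image" | ssh "${ssh_opts[@]}" "$target" docker load
}

DRY_RUN=true ship_image \
  gitea.bnkops.com/admin/changemaker-ccp-agent:latest \
  bunker-admin@100.90.78.47
```

Running it with `DRY_RUN=true` first gives a chance to check the target before any image data is streamed.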
 ---
 ## Fleet state at session end
 | Tenant | Surgical update v2.10.2 | image-upgrade.sh | New ccp-agent with image-only endpoint |
 |---|---|---|---|
 | pia | ✅ (prior session) | ✅ | ✅ |
 | soroush | ✅ | ✅ | ✅ |
 | linda | ✅ | ✅ | ✅ |
 | marcelle | ✅ + tested both A and B E2E | ✅ | ✅ |
 | bnkops | ✅ | ✅ | ✅ (rebuilt locally) |
 | trbh | ✅ | ✅ | ✅ |
 | pridecorner | ✅ | ✅ | ✅ |
 Marcelle E2E test results:
 - **Approach A (full upgrade)**: v2.10.1 → v2.10.2 in 250s, COMPLETED, no SIGKILL on script. Phase 6 deferred ccp-agent restart fix worked end-to-end through CCP path.
 - **Approach B (Quick Upgrade) run 1**: 121s, COMPLETED, mkdocs.yml md5 unchanged.
 - **Approach B (Quick Upgrade) run 2**: 100s (cached pull), COMPLETED, mkdocs unchanged again — confirms idempotency.
 ---
 ## Fleet backup (Phase 0 work — defensive)
 All 7 tenants backed up to `/media/bunker-admin/BACKUP/fleet/<node>/2026-05-21-pre-v2.10.2/`:
 | Node | Tenant | Size |
 |---|---|---|
 | n1 | pridecorner | 182MB (includes 3 stash patches from March 9) |
 | n2 | linda | 26MB |
 | n3 | pia | 45MB (post-surgical state) |
 | n4 | bnkops | 4.4GB (huge — 2277 mkdocs/docs files) |
 | n5 | marcelle | 28MB |
 | n6 | trbh | 336MB |
 | n7 | soroush | 76MB |
 Each tenant dir has `mkdocs.tar.gz`, `configs-and-nginx.tar.gz`, `config-files.tar.gz`, `host-state.txt`, `git-state.txt` (source installs only), and `MANIFEST.txt`.
 ---
 ## Approach C planning + initial overlay
 **Decision: rewrite `docker-compose.yml.hbs` in prod-compose style** to make CCP-driven template re-render safe for the install.sh fleet.
 ### Why a rewrite (not sync-by-addition)
 Discovered the CCP template and `docker-compose.prod.yml` use fundamentally different conventions:
 | | Old template (`.hbs`) | Canonical prod |
 |---|---|---|
 | Container names | `{{containerPrefix}}-postgres` (dynamic) | `changemaker-v2-postgres` (hardcoded) |
 | Secrets | `{{secrets.postgresPassword}}` (Handlebars-rendered) | `${POSTGRES_PASSWORD}` (env-substituted) |
 | Optional services | `{{#if enableX}}` blocks | Always-defined, gated via `COMPOSE_PROFILES` |
 | Ports | `{{ports.api}}` | Hardcoded |
 Sync-by-addition can't reconcile these. Rewrite is cleaner long-term.
 ### Initial overlay committed this session
 `changemaker-control-panel/templates/docker-compose.yml.hbs.OLD-style-pre-approach-c` — preserved old template for reference.
 `changemaker-control-panel/templates/docker-compose.yml.hbs` — now a near-mirror of `changemaker.lite/docker-compose.prod.yml` (1493 lines + Handlebars header):
 - Header comment includes `{{name}}`, `{{slug}}`, `{{composeProject}}` for traceability.
 - In 5 image refs, `${IMAGE_TAG:-latest}` was replaced with `{{imageTag}}` so CCP can override the tag per instance via `Instance.imageTag` once Phase 1 lands.
 - All other variation flows through env-var substitution from tenant's `.env`.
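
For context on what the `{{imageTag}}` overlay replaces: Compose's `${IMAGE_TAG:-latest}` interpolation follows the same rule as shell parameter expansion, falling back to the default when the variable is unset or empty. The sketch below shows that substitute-time behaviour in plain bash; `resolve_tag` is a hypothetical name used only for the demo.

```bash
#!/usr/bin/env bash
set -euo pipefail

# resolve_tag: mirrors the ${IMAGE_TAG:-latest} rule (hypothetical demo helper).
resolve_tag() { echo "${IMAGE_TAG:-latest}"; }

unset IMAGE_TAG
tag_unset="$(resolve_tag)"      # variable not set at all

IMAGE_TAG=""
tag_empty="$(resolve_tag)"      # set but empty: the default still applies

IMAGE_TAG="v2.10.2"
tag_set="$(resolve_tag)"        # explicit value wins

echo "unset=$tag_unset empty=$tag_empty set=$tag_set"
```

With the overlay, the tag is instead fixed when CCP renders the template, so a tenant's `.env` no longer decides it for those 5 image refs.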
 ### Remaining Approach C work (next session)
 See `/home/bunker-admin/.claude/plans/insight-temporal-bachman.md` for the full plan. Quick summary of what's next:
 **Phase 0 completion (next session):**
 - Audit `env.hbs` against the new compose's expected env vars. Add missing.
 - Sync static config files in `templates/`: nginx/, configs/prometheus/, configs/alertmanager/, configs/grafana/. They may have drifted too.
 - Write a one-off render harness (`api/scripts/render-for-instance.ts`) that loads an instance row, builds context, renders templates to scratch dir.
 - Render against marcelle, linda, pia. Diff against their actual files. Iterate the template until diff is per-instance values only (`COMPOSE_PROJECT_NAME`, ports, secrets — not structure).
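
The "diff is per-instance values only" check in the last step could be approximated with a masked diff, sketched below. Everything here is an assumption for illustration: `mask_instance_values`, `structural_diff`, and the key list are hypothetical, and the real harness (`api/scripts/render-for-instance.ts`) is not yet written.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Keys whose values are expected to differ per instance (assumed list).
PER_INSTANCE_KEYS='COMPOSE_PROJECT_NAME|POSTGRES_PASSWORD|API_PORT'

# Replace the value of every per-instance key with a fixed placeholder.
mask_instance_values() {
  sed -E "s/^([[:space:]]*-?[[:space:]]*(${PER_INSTANCE_KEYS})[:=]).*/\1<per-instance>/" "$1"
}

# structural_diff RENDERED ACTUAL
# Exits 0 when the files differ only in per-instance values, 1 otherwise.
structural_diff() {
  diff <(mask_instance_values "$1") <(mask_instance_values "$2")
}

demo="$(mktemp -d)"
printf 'services:\n  api:\n    environment:\n      - API_PORT=4000\n' > "$demo/rendered.yml"
printf 'services:\n  api:\n    environment:\n      - API_PORT=4100\n' > "$demo/actual.yml"
printf 'services:\n  api-v2:\n    environment:\n      - API_PORT=4100\n' > "$demo/drifted.yml"

same_status=0
structural_diff "$demo/rendered.yml" "$demo/actual.yml" >/dev/null || same_status=$?
drift_status=0
structural_diff "$demo/rendered.yml" "$demo/drifted.yml" >/dev/null || drift_status=$?

echo "values-only diff: $same_status, structural drift: $drift_status"
```

A non-zero status for a tenant would mean the template still needs another iteration before re-render is safe for that install.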
 **Phase 1 (~30 min):** Add `Instance.imageTag` Prisma column + migration. Modify `template-engine.ts:211` to use `instance.imageTag || env.IMAGE_TAG`.
 **Phase 2 (~3-4 hr):** Pre-flight diff endpoint. New agent route `POST /instance/:slug/files/diff` + `RemoteDriver.diffFiles()` + `LocalDriver.diffFiles()` + `previewReleaseUpgrade()` in upgrade.service. Includes `envCoverage` check for registered tenants.
 **Phase 3 (~3-4 hr):** `startReleaseUpgrade()` + `runReleaseUpgrade()` in upgrade.service. Split logic for `isRegistered=true` (skip env render) vs `isRegistered=false` (render env).
 **Phase 4 (~30 min):** CCP routes `/upgrade-release` + `/upgrade-release/preview` + Zod schema.
 **Phase 5 (~2-3 hr):** "Upgrade to Release" UI button + preview modal + env-coverage warning.
 **Phase 6 (~1 hr):** Tag v2.10.3 in changemaker.lite, push images with tag, trigger upgrade-release on marcelle via CCP UI, verify mkdocs untouched + containers on new tag.
 **Total remaining: 11-14 hours.** Recommended split:
 - Session 2: complete Phase 0 (render harness + iterate template + env.hbs sync + static file syncs). ~half day.
 - Session 3: Phases 1-5. ~half day.
 - Session 4: Phase 6 E2E test. ~1 hour.
 ---
 ## Critical files for Approach C
 **Already modified this session:**
 - `changemaker-control-panel/templates/docker-compose.yml.hbs` — overlay from prod compose with minimal Handlebars markup.
 - `changemaker-control-panel/templates/docker-compose.yml.hbs.OLD-style-pre-approach-c` — preserved old template.
 **To be modified in next sessions (per plan):**
 - `changemaker-control-panel/templates/env.hbs` (Phase 0 audit)
 - `changemaker-control-panel/templates/configs/**` (Phase 0 syncs)
 - `changemaker-control-panel/api/prisma/schema.prisma` (Phase 1)
 - `changemaker-control-panel/api/prisma/migrations/<ts>_add_instance_image_tag/` (Phase 1)
 - `changemaker-control-panel/api/src/services/template-engine.ts` line 211 (Phase 1)
 - `changemaker-control-panel/api/src/services/upgrade.service.ts` (Phases 2-3)
 - `changemaker-control-panel/api/src/services/remote-driver.ts` + `local-driver.ts` + `execution-driver.ts` (Phase 2)
 - `changemaker-control-panel/agent/src/routes/files.routes.ts` + `services/file.service.ts` (Phase 2)
 - `changemaker-control-panel/api/src/modules/instances/instances.routes.ts` + `instances.schemas.ts` (Phase 4)
 - `changemaker-control-panel/admin/src/pages/InstanceDetailPage.tsx` (Phase 5)
 ---
 ## Memory key gotchas (write to MEMORY.md next session)
 1. **CCP template vs prod compose: were divergent, now aligned.** As of this session, `templates/docker-compose.yml.hbs` is structurally a near-mirror of `docker-compose.prod.yml`. Going forward, any new service in prod compose must be ported into the template manually (or via a future CI drift check).
 2. **bnkops's ccp-agent is locally built**, not pulled from registry. Has a `build:` directive in compose. The other 6 tenants pull `gitea.bnkops.com/admin/changemaker-ccp-agent:latest`.
 3. **install.sh tenants (`isRegistered=true`)** lack `encryptedSecrets` in CCP DB. Approach C must skip `env.hbs` rendering for them — they keep their tarball-provisioned `.env`. The pre-flight envCoverage check is the safety net.
 4. **n4 SSH lacks marcelle's host key by default** — first `ssh n4 → marcelle` connection needs `StrictHostKeyChecking=accept-new` or interactive accept. Other tenants in the lab have the same pattern.
 5. **`docker save | ssh ... docker load` is the registry-less image distribution path** when n4 doesn't have docker login to gitea.bnkops.com. Worked well for the ccp-agent rollout this session.
 6. **`set -o pipefail` + `grep -q` breaks the pipeline** because grep exits on the first match and closes the pipe, so the writer receives SIGPIPE and the pipeline's exit status becomes non-zero even though the match succeeded. Solution: capture upstream output into a variable, then grep against the variable. (Bug found + fixed in `scripts/image-upgrade.sh` during this session.)
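
Gotcha 6 is easy to reproduce outside Docker. In the sketch below, `list_services` is a hypothetical stand-in for a producer such as `docker compose config --services`; it writes more than a pipe buffer's worth of lines (using GNU `seq`) so the early exit of `grep -q` reliably interrupts it.

```bash
#!/usr/bin/env bash
set -uo pipefail

# Stand-in producer: ~200k lines, far more than one pipe buffer.
list_services() { seq -f 'svc-%g' 1 200000; }

# Fragile: grep -q exits on the first match, the producer is interrupted
# mid-write, and pipefail reports that failure (typically status 141).
piped_status=0
list_services | grep -qx 'svc-1' || piped_status=$?

# Robust: capture the full output first, then search the captured text.
services="$(list_services)"
captured_status=0
grep -qx 'svc-1' <<<"$services" || captured_status=$?

echo "piped=$piped_status captured=$captured_status"
```

The same service name is found in both cases; only the piped form reports a failure, which is what made every service look missing.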
 ---
 ## CCP access (unchanged)
 ```
 URL:       http://n4-bnkops.taile33572.ts.net:5100  (UI)
           http://n4-bnkops.taile33572.ts.net:5000  (API)
 User:      admin@thebunkerops.ca
 Password:  NRTgHdC7Zxxs2P2UmNwnEbn3jTwU8uJN  (seed)
 Role:      SUPER_ADMIN
 ```
 ---
 ## Where to start next session
 Recommended:
 1. **Read this doc + `/home/bunker-admin/.claude/plans/insight-temporal-bachman.md` (Approach C plan)** first.
 2. **Phase 0 completion:** finish the template rewrite. Build a render harness (`api/scripts/render-for-instance.ts`), render against marcelle/linda/pia, iterate until the diff is structurally clean.
 3. Commit Phase 0 as standalone PR with rendered-vs-actual diffs in description.
 4. Move to Phases 1-5 in a second commit/PR.
 5. Phase 6 manual E2E.
 Approach B is in a production-ready state across the fleet. Approach C is the longer-term path for releases that change orchestration.
--- a/scripts/build-release.sh
+++ b/scripts/build-release.sh
@@ -126,7 +126,7 @@ RUNTIME_SCRIPTS=(
  install.sh
  nocodb-init.sh gitea-init.sh mkdocs-entrypoint.sh
  backup.sh restore.sh
-  upgrade.sh upgrade-check.sh upgrade-watcher.sh
+  upgrade.sh upgrade-check.sh upgrade-watcher.sh upgrade-stash-cleanup.sh
  uninstall.sh test-deployment.sh
  validate-env.sh pangolin-teardown.sh ccp-deregister.sh register-with-ccp.sh
  update-env.sh
@@ -178,6 +178,13 @@ if [[ -f "$PROJECT_DIR/scripts/mkdocs-build-trigger.py" ]]; then
  cp "$PROJECT_DIR/scripts/mkdocs-build-trigger.py" "$STAGE_DIR/scripts/"
 fi
 # Shared shell libraries (scripts/lib/) — sourced by upgrade.sh + image-upgrade.sh.
 # Whole directory ships verbatim; safe because nothing executable lives here
 # besides the .sh helpers that the runtime scripts depend on.
 if [[ -d "$PROJECT_DIR/scripts/lib" ]]; then
  cp -a "$PROJECT_DIR/scripts/lib" "$STAGE_DIR/scripts/"
 fi
 # Systemd units
 if [[ -d "$PROJECT_DIR/scripts/systemd" ]]; then
  cp -r "$PROJECT_DIR/scripts/systemd" "$STAGE_DIR/scripts/"
@@ -295,12 +302,23 @@ if [[ "$UPLOAD" == "true" ]]; then
      fi
    fi
    # Gitea 1.23.x only initializes Release.CreatedUnix inside its createTag()
    # path. If the git tag already exists on origin when we POST /releases,
    # createTag() is skipped and CreatedUnix stays 0, which makes /releases/latest
    # silently return an older release. Remove the remote tag first so Gitea
    # creates it via target_commitish below. The tag is preserved locally and
    # gets recreated at the same SHA — no history is lost.
    if git ls-remote --exit-code origin "refs/tags/${TAG}" >/dev/null 2>&1; then
      warn "Removing remote tag ${TAG} so Gitea can recreate it (CreatedUnix init)"
      git push origin ":refs/tags/${TAG}" >/dev/null 2>&1 || true
    fi
    info "Creating Gitea release ${TAG}..."
    RELEASE_RESPONSE=$(curl -sf -X POST \
      "${GITEA_HOST}/api/v1/repos/admin/changemaker.lite/releases" \
      -H "Authorization: token ${GITEA_TOKEN}" \
      -H "Content-Type: application/json" \
-      -d "{\"tag_name\":\"${TAG}\",\"name\":\"Changemaker Lite ${TAG}\",\"body\":\"Release ${TAG} (${COMMIT_SHA})\"}" \
+      -d "{\"tag_name\":\"${TAG}\",\"target_commitish\":\"${COMMIT_SHA}\",\"name\":\"Changemaker Lite ${TAG}\",\"body\":\"Release ${TAG} (${COMMIT_SHA})\"}" \
      2>/dev/null || true)
    RELEASE_ID=$(echo "$RELEASE_RESPONSE" | python3 -c "import sys,json; print(json.load(sys.stdin).get('id',''))" 2>/dev/null || true)
--- a/scripts/image-upgrade.sh
+++ b/scripts/image-upgrade.sh
@@ -0,0 +1,383 @@
 #!/usr/bin/env bash
 # image-upgrade.sh — Approach B: image-only upgrade
 #
 # Pulls latest images from the registry and recreates services WITHOUT touching
 # tracked files in the install tree (no git pull, no tarball extract, no VERSION
 # mutation). Tenant content (mkdocs/, configs/) is implicitly safe because this
 # script never writes outside data/upgrade/ and the docker daemon.
 #
 # Used by CCP "Quick Upgrade" button. Pairs with scripts/upgrade.sh which
 # remains the full upgrade path for orchestration-changing releases.
 #
 # Schema parity: writes data/upgrade/progress.json + result.json with the same
 # fields upgrade.sh writes, so the CCP poll loop is unchanged.
 set -euo pipefail
 PROJECT_DIR="$(cd "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")/.." && pwd)"
 SCRIPT_DIR="$PROJECT_DIR/scripts"
 UPGRADE_DIR="$PROJECT_DIR/data/upgrade"
 LOG_DIR="$PROJECT_DIR/logs"
 LOG_FILE="$LOG_DIR/image-upgrade-$(date +%Y%m%d_%H%M%S).log"
 LOCK_FILE="$PROJECT_DIR/.upgrade.lock"
 PROGRESS_FILE="$UPGRADE_DIR/progress.json"
 RESULT_FILE="$UPGRADE_DIR/result.json"
 START_TIME=$SECONDS
 # --- Detect install mode ---
 if [[ -f "$PROJECT_DIR/VERSION" ]] && [[ ! -d "$PROJECT_DIR/.git" ]]; then
  INSTALL_MODE="release"
 else
  INSTALL_MODE="source"
 fi
 # --- Defaults ---
 API_MODE=false
 DRY_RUN=false
 IMAGE_TAG=""
 usage() {
  cat <<EOF
 Usage: $(basename "$0") [options]
 Image-only upgrade: pulls latest images from the configured registry and
 recreates services without touching the install tree.
 Options:
  --api-mode           Emit data/upgrade/{progress,result}.json (no TTY output)
  --dry-run            Print what would happen; do not pull or recreate
  --image-tag TAG      Override IMAGE_TAG (env var) for this run
  -h, --help           Show this help
 This script never modifies mkdocs/, configs/, scripts/, docker-compose.yml,
 or VERSION. It is the safest upgrade path for orchestration-stable releases.
 EOF
 }
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --api-mode)    API_MODE=true; shift ;;
    --dry-run)     DRY_RUN=true; shift ;;
    --image-tag)   IMAGE_TAG="${2:?--image-tag requires a value}"; shift 2 ;;
    -h|--help)     usage; exit 0 ;;
    *) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
  esac
 done
 # --- Colors ---
 if [[ -t 1 ]] && [[ -z "${NO_COLOR:-}" ]]; then
  RED='\033[0;31m'  GREEN='\033[0;32m'  YELLOW='\033[0;33m'
  CYAN='\033[0;36m' BOLD='\033[1m'      NC='\033[0m'
 else
  RED='' GREEN='' YELLOW='' CYAN='' BOLD='' NC=''
 fi
 info()    { echo -e "${CYAN}[INFO]${NC} $*"; }
 success() { echo -e "${GREEN}[ OK ]${NC} $*"; }
 warn()    { echo -e "${YELLOW}[WARN]${NC} $*"; }
 error()   { echo -e "${RED}[ERR ]${NC} $*" >&2; }
 phase()   { echo ""; echo -e "${BOLD}${CYAN}=== Phase $1: $2 ===${NC}"; }
 # --- Logging: mirror stdout/stderr to LOG_FILE ---
 # logs/ may be root-owned on installs where upgrade.sh has run via ccp-agent.
 # Fall back to /tmp if we can't write, so bunker-admin manual invocations don't
 # crash with "Permission denied" on tee.
 mkdir -p "$UPGRADE_DIR"
 if mkdir -p "$LOG_DIR" 2>/dev/null && touch "$LOG_FILE" 2>/dev/null; then
  :  # primary log location is writable
 else
  LOG_FILE="/tmp/image-upgrade-$(date +%Y%m%d_%H%M%S)-$$.log"
  echo "[INFO] logs/ not writable; using $LOG_FILE" >&2
 fi
 exec > >(tee -a "$LOG_FILE") 2>&1
 # --- Capture previous version for result.json ---
 if [[ "$INSTALL_MODE" == "release" ]]; then
  PRE_VERSION="$(head -1 "$PROJECT_DIR/VERSION" 2>/dev/null || echo "unknown")"
 else
  PRE_VERSION="$(cd "$PROJECT_DIR" && git rev-parse --short HEAD 2>/dev/null || echo "unknown")"
 fi
 write_progress() {
  local phase_num="$1" phase_name="$2" pct="$3" msg="$4"
  [[ "$API_MODE" != "true" ]] && return
  mkdir -p "$UPGRADE_DIR"
  cat > "$PROGRESS_FILE" <<PEOF
 {
  "phase": ${phase_num},
  "phaseName": "${phase_name}",
  "percentage": ${pct},
  "message": "$(echo "$msg" | sed 's/"/\\"/g')",
  "lastUpdate": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
 }
 PEOF
 }
 write_result() {
  [[ "$API_MODE" != "true" ]] && return
  local success_val="$1" msg="$2"
  local warnings_json="${3:-[]}"
  local duration_secs=$((SECONDS - START_TIME))
  local new_version="$PRE_VERSION"
  if [[ "$INSTALL_MODE" == "release" ]]; then
    new_version="$(head -1 "$PROJECT_DIR/VERSION" 2>/dev/null || echo "$PRE_VERSION")"
  else
    new_version="$(cd "$PROJECT_DIR" && git rev-parse --short HEAD 2>/dev/null || echo "$PRE_VERSION")"
  fi
  mkdir -p "$UPGRADE_DIR"
  cat > "$RESULT_FILE" <<REOF
 {
  "success": ${success_val},
  "message": "$(echo "$msg" | sed 's/"/\\"/g')",
  "previousCommit": "${PRE_VERSION}",
  "newCommit": "${new_version}",
  "commitCount": 0,
  "durationSeconds": ${duration_secs},
  "warnings": ${warnings_json},
  "completedAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "mode": "image-only"
 }
 REOF
  rm -f "$PROGRESS_FILE"
 }
 # --- Lock + cleanup ---
 acquire_lock() {
  if [[ -f "$LOCK_FILE" ]]; then
    local pid; pid="$(cat "$LOCK_FILE" 2>/dev/null || echo "")"
    if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
      error "Upgrade already running (pid $pid). Refusing to start."
      write_result "false" "Another upgrade is already running (pid $pid)"
      exit 1
    fi
    warn "Stale lock file found; removing"
    rm -f "$LOCK_FILE"
  fi
  echo $$ > "$LOCK_FILE"
 }
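 # Only remove the lock when this process owns it. The EXIT trap also fires when
 # acquire_lock refuses to start, and an unconditional rm would delete the lock
 # held by the upgrade that is still running.
 release_lock() {
  if [[ "$(cat "$LOCK_FILE" 2>/dev/null || true)" == "$$" ]]; then
    rm -f "$LOCK_FILE" || true
  fi
 }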
 on_failure() {
  local exit_code=$?
  local line_no=${1:-?}
  error "image-upgrade.sh failed at line $line_no (exit $exit_code)"
  write_result "false" "Image upgrade failed at line $line_no (exit $exit_code)"
  release_lock
  exit "$exit_code"
 }
 trap 'on_failure $LINENO' ERR
 trap 'release_lock' EXIT
 # --- Banner ---
 echo ""
 echo -e "${BOLD}${CYAN}================================================${NC}"
 echo -e "${BOLD}  Image-Only Upgrade${NC}"
 echo -e "${BOLD}${CYAN}================================================${NC}"
 echo "Install mode: $INSTALL_MODE"
 echo "Project dir:  $PROJECT_DIR"
 echo "Pre-version:  $PRE_VERSION"
 [[ -n "$IMAGE_TAG" ]] && echo "Image tag:    $IMAGE_TAG"
 [[ "$DRY_RUN" == "true" ]] && echo "DRY RUN: no images will be pulled or services recreated"
 echo ""
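 acquire_lock
 # Run compose from the install root so `docker compose` resolves this
 # install's compose file and .env regardless of the caller's working directory.
 cd "$PROJECT_DIR"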
 # =============================================================================
 # Phase 1: Pre-flight + mkdocs snapshot (defensive)
 # =============================================================================
 phase "1" "Pre-flight"
 write_progress 1 "Pre-flight" 10 "Snapshotting mkdocs (defensive)..."
 # Source mkdocs-snapshot.sh and run it. This is the same snapshot every
 # upgrade path takes — leaves mkdocs-backup-<timestamp>.tar.gz in project root.
 # Image-only upgrades shouldn't damage mkdocs (no filesystem mutation), but
 # the snapshot is cheap insurance and keeps operator habits consistent.
 if [[ -r "$SCRIPT_DIR/lib/mkdocs-snapshot.sh" ]]; then
  if [[ "$DRY_RUN" == "true" ]]; then
    info "[DRY RUN] Would snapshot mkdocs/"
  else
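    # Pass the helper path as $1 so install paths containing spaces are not
    # word-split inside the child shell.
    # shellcheck disable=SC1091
    PROJECT_DIR="$PROJECT_DIR" bash -c '. "$1"; snapshot_mkdocs' _ "$SCRIPT_DIR/lib/mkdocs-snapshot.sh" \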
      || warn "mkdocs snapshot failed (non-fatal; continuing)"
  fi
 else
  warn "scripts/lib/mkdocs-snapshot.sh not found; skipping snapshot"
 fi
 # Sanity-check docker
 if ! docker compose version &>/dev/null; then
  error "docker compose is not available"
  write_result "false" "docker compose not available"
  exit 1
 fi
 success "Pre-flight checks passed"
 # =============================================================================
 # Phase 2: Pull images
 # =============================================================================
 phase "2" "Pull Images"
 write_progress 2 "Pull Images" 30 "Pulling images from registry..."
 PULL_ENV=()
 if [[ -n "$IMAGE_TAG" ]]; then
  PULL_ENV+=("IMAGE_TAG=$IMAGE_TAG")
 fi
 if [[ "$DRY_RUN" == "true" ]]; then
  info "[DRY RUN] Would run: ${PULL_ENV[*]:-} docker compose pull"
 else
  info "Pulling all images (this may take a few minutes)..."
  if (( ${#PULL_ENV[@]} > 0 )); then
    if ! env "${PULL_ENV[@]}" docker compose pull; then
      warn "docker compose pull had errors (continuing — some images may be local)"
    fi
  else
    if ! docker compose pull; then
      warn "docker compose pull had errors (continuing — some images may be local)"
    fi
  fi
 fi
 success "Image pull complete"
 # =============================================================================
 # Phase 3: Recreate core app services (targeted, not broad)
 # =============================================================================
 phase "3" "Recreate Services"
 write_progress 3 "Recreate Services" 60 "Recreating core app services with new images..."
 # Targeted recreate: only the services whose IMAGES are released as part of
 # changemaker.lite (api, admin, media-api, nginx). Broader `up -d` is risky
 # because a single misconfigured mount in any service (e.g. mkdocs-site-server)
 # can cascade and leave dependent containers in "Created" state. Image-only
 # upgrade should only touch the actual code containers, not third-party
 # infrastructure that happens to live in the same compose file.
 #
 # Same Phase 6 pattern as upgrade.sh: drop ccp-agent from COMPOSE_PROFILES
 # during recreate so we don't suicide-restart the agent that spawned us.
 # Restart ccp-agent at the end via detached subshell.
 PROFILES_SAVED="${COMPOSE_PROFILES:-}"
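 # grep -v exits 1 when it filters out every line (COMPOSE_PROFILES was exactly
 # "ccp-agent"); "|| true" keeps that from failing the pipeline under pipefail.
 COMPOSE_PROFILES_WITHOUT_AGENT="$(echo "${PROFILES_SAVED}" \
  | tr ',' '\n' | { grep -vx 'ccp-agent' || true; } | paste -sd, -)"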
 UP_ENV=("COMPOSE_PROFILES=${COMPOSE_PROFILES_WITHOUT_AGENT}")
 if [[ -n "$IMAGE_TAG" ]]; then
  UP_ENV+=("IMAGE_TAG=$IMAGE_TAG")
 fi
 # Core services that ship as v2 release images. nginx last so it doesn't
 # briefly proxy to an old api. media-api may not be enabled on all installs;
 # tolerate it being missing from compose.
 CORE_SERVICES=(api admin media-api nginx)
 EXISTING_SERVICES=()
 # Capture the service list once. Don't pipe `docker compose config` into
 # `grep -q` directly: grep exits on the first match and SIGPIPEs the docker
 # writer, so under `set -o pipefail` the pipeline exits non-zero and every
 # service is reported as missing. Capture-then-check avoids it.
 COMPOSE_SERVICES_LIST="$(docker compose config --services 2>/dev/null || true)"
 for svc in "${CORE_SERVICES[@]}"; do
  if grep -qx -- "$svc" <<<"$COMPOSE_SERVICES_LIST"; then
    EXISTING_SERVICES+=("$svc")
  else
    info "Skipping service '$svc' (not in compose file)"
  fi
 done
 if (( ${#EXISTING_SERVICES[@]} == 0 )); then
  warn "No core app services found in compose; skipping recreate"
 elif [[ "$DRY_RUN" == "true" ]]; then
  info "[DRY RUN] Would run: ${UP_ENV[*]} docker compose up -d ${EXISTING_SERVICES[*]}"
 else
  info "Recreating core services: ${EXISTING_SERVICES[*]}"
  env "${UP_ENV[@]}" docker compose up -d "${EXISTING_SERVICES[@]}"
 fi
 success "Services recreated"
 # Restart Pangolin tunnel connector if running (image may have changed)
 if grep -q 'newt' <<<"$(docker ps --format '{{.Names}}' || true)"; then
  if [[ "$DRY_RUN" == "true" ]]; then
    info "[DRY RUN] Would restart newt"
  else
    info "Restarting Pangolin tunnel connector..."
    docker compose restart newt 2>/dev/null || true
    success "Newt tunnel restarted"
  fi
 fi
 # =============================================================================
 # Phase 4: Verify (light health checks)
 # =============================================================================
 phase "4" "Verification"
 write_progress 4 "Verification" 85 "Running health checks..."
 VERIFY_FAILED=false
 UPGRADE_WARNINGS="[]"
 verify_health() {
  local name="$1" check_cmd="$2" max_wait="${3:-45}"
  local waited=0
  while [[ $waited -lt $max_wait ]]; do
    if eval "$check_cmd" 2>/dev/null; then
      success "$name: healthy (${waited}s)"
      return 0
    fi
    sleep 3
    waited=$((waited + 3))
  done
  warn "$name: not responding after ${max_wait}s"
  VERIFY_FAILED=true
  return 0
 }
 if [[ "$DRY_RUN" != "true" ]]; then
  verify_health "API (port 4000)" \
    "docker compose exec -T api wget -q --spider http://localhost:4000/api/health" 60
  verify_health "Admin (port 3000)" \
    "docker compose exec -T admin wget -q --spider http://localhost:3000/" 90
  if grep -q 'changemaker-media-api' <<<"$(docker ps --format '{{.Names}}' || true)"; then
    verify_health "Media API (port 4100)" \
      "docker compose exec -T media-api wget -q --spider http://127.0.0.1:4100/health" 30
  fi
  if [[ "$VERIFY_FAILED" == "true" ]]; then
    UPGRADE_WARNINGS='["Some health checks failed after image-only upgrade — services may still be starting"]'
  fi
 fi
 # =============================================================================
 # Summary + deferred ccp-agent restart
 # =============================================================================
 ELAPSED_MIN=$(( (SECONDS - START_TIME) / 60 ))
 ELAPSED_SEC=$(( (SECONDS - START_TIME) % 60 ))
 echo ""
 echo -e "${BOLD}${GREEN}================================================${NC}"
 echo -e "${BOLD}  Image-Only Upgrade Complete${NC}"
 echo -e "${BOLD}${GREEN}================================================${NC}"
 printf "  Previous:  %s\n" "$PRE_VERSION"
 printf "  Duration:  %dm %ds\n" "$ELAPSED_MIN" "$ELAPSED_SEC"
 printf "  Log:       %s\n" "$LOG_FILE"
 write_progress 4 "Complete" 100 "Image-only upgrade complete"
 write_result "true" "Image-only upgrade complete (previous: ${PRE_VERSION})" "$UPGRADE_WARNINGS"
 # Deferred ccp-agent restart — see upgrade.sh for full rationale. Same
 # mechanism: nohup'd, disowned subshell that picks up the new image after
 # this script has cleanly exited.
 if echo "${PROFILES_SAVED:-}" | tr ',' '\n' | grep -qx 'ccp-agent'; then
  if [[ "$DRY_RUN" == "true" ]]; then
    info "[DRY RUN] Would schedule deferred ccp-agent restart"
  else
    info "Scheduling deferred ccp-agent restart..."
    nohup bash -c "
      sleep 3
      cd '$PROJECT_DIR'
      COMPOSE_PROFILES='ccp-agent' docker compose --profile ccp-agent up -d ccp-agent
    " >/dev/null 2>&1 < /dev/null &
    disown
    success "ccp-agent restart scheduled (will pick up new image)"
  fi
 fi
 release_lock
 trap - EXIT
 exit 0
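
The capture-then-check pattern described in the Phase 3 comments can be sketched in isolation. This is a minimal illustration rather than the script's own code: `has_service` is a hypothetical helper, and the service list is hard-coded here where the script captures the output of `docker compose config --services`.

```bash
set -euo pipefail

# has_service LIST NAME: succeed if NAME is an exact line in the
# newline-separated LIST. (Hypothetical helper, for illustration only.)
# The list arrives as a string and reaches grep through a here-string, so no
# upstream process is left writing into a pipe that grep -q has already closed.
has_service() {
  local list="$1" name="$2"
  grep -qx -- "$name" <<<"$list"
}

# Stand-in for: "$(docker compose config --services 2>/dev/null || true)"
services='api
admin
nginx'

for svc in api admin media-api nginx; do
  if has_service "$services" "$svc"; then
    echo "present: $svc"
  else
    echo "missing: $svc"
  fi
done
```

Because `-x` matches whole lines only, a partial name such as `ap` is not mistaken for `api`.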
--- a/scripts/lib/mkdocs-snapshot.sh
+++ b/scripts/lib/mkdocs-snapshot.sh
@@ -0,0 +1,81 @@
 #!/usr/bin/env bash
 # =============================================================================
 # mkdocs-snapshot.sh — shared library function
 # =============================================================================
 # Defines snapshot_mkdocs(): writes a tarball of mkdocs/ into the install root
 # as mkdocs-backup-<timestamp>.tar.gz, keeping the last 5 snapshots.
 #
 # Sourced by scripts/upgrade.sh and scripts/image-upgrade.sh (and may be
 # invoked agent-side by changemaker-control-panel during template re-render).
 #
 # Why the install root instead of backups/?
 #   - Discoverable: operators see mkdocs-backup-*.tar.gz with a plain `ls`.
 #   - The agent's /app/instance bind mount maps directly to the install root,
 #     so the agent can restore from this archive without path translation.
 #   - backups/ is owned by root in some installs (DB dumps via container)
 #     and gets rotated on a different schedule than docs snapshots.
 #
 # Restoration one-liner:
 #   tar xzf "$(ls -t mkdocs-backup-*.tar.gz | head -1)" -C . \
 #     && docker compose restart mkdocs mkdocs-site-server
 #
 # Requires: $PROJECT_DIR (absolute path to install root), info() function
 # from the caller (falls back to plain echo if info is not defined).
 # =============================================================================
 # Fallback log function if caller didn't define one (e.g. when sourcing standalone)
 if ! declare -F info >/dev/null 2>&1; then
  info() { echo "[INFO] $*"; }
 fi
 if ! declare -F warn >/dev/null 2>&1; then
  warn() { echo "[WARN] $*" >&2; }
 fi
 # snapshot_mkdocs — take a tarball of mkdocs/ into the install root.
 #
 # Returns 0 if successful (or if mkdocs/ doesn't exist — non-fatal).
 # Returns non-zero only if tar itself fails AND $SNAPSHOT_REQUIRED is true.
 #
 # Env vars:
 #   PROJECT_DIR        (required) Install root containing mkdocs/
 #   SNAPSHOT_KEEP      Number of snapshots to retain (default 5)
 #   SNAPSHOT_REQUIRED  If "true", a tar failure returns non-zero (default false)
 snapshot_mkdocs() {
  if [[ -z "${PROJECT_DIR:-}" ]]; then
    warn "snapshot_mkdocs: PROJECT_DIR not set; skipping"
    return 0
  fi
  if [[ ! -d "${PROJECT_DIR}/mkdocs" ]]; then
    # No mkdocs dir = nothing to snapshot. Common on minimal installs.
    return 0
  fi
  local stamp
  stamp="$(date +%Y%m%d_%H%M%S)"
  local archive="${PROJECT_DIR}/mkdocs-backup-${stamp}.tar.gz"
  local keep="${SNAPSHOT_KEEP:-5}"
  if tar czf "$archive" -C "$PROJECT_DIR" mkdocs 2>/dev/null; then
    local size
    size="$(du -h "$archive" 2>/dev/null | cut -f1)"
    info "Tenant docs snapshot: $(basename "$archive") (${size})"
  else
    warn "snapshot_mkdocs: tar failed for $archive"
    rm -f "$archive" 2>/dev/null
    if [[ "${SNAPSHOT_REQUIRED:-false}" == "true" ]]; then
      return 1
    fi
    return 0
  fi
  # Retention: keep the most recent N snapshots, prune older ones.
  # ls -t lists newest first; tail -n +N+1 selects items after the Nth.
  local prune_from=$((keep + 1))
  # shellcheck disable=SC2012  # ls is intentional for mtime sort
  ls -t "${PROJECT_DIR}"/mkdocs-backup-*.tar.gz 2>/dev/null \
    | tail -n +${prune_from} \
    | xargs -r rm -f
  return 0
 }
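
The retention step above can be tried safely in a scratch directory. The sketch below assumes file names without whitespace and an `xargs` that supports `-r`; `prune_keep_newest` and the fake archive names are hypothetical and exist only for this demonstration.

```bash
# prune_keep_newest DIR KEEP: delete all but the KEEP most recently modified
# mkdocs-backup-*.tar.gz files in DIR. (Hypothetical helper that mirrors the
# ls -t | tail | xargs pipeline used by snapshot_mkdocs.)
prune_keep_newest() {
  local dir="$1" keep="$2"
  # ls -t sorts newest first; tail -n +(KEEP+1) passes through only the
  # entries after the KEEP-th, which are the ones to delete.
  ls -t "$dir"/mkdocs-backup-*.tar.gz 2>/dev/null \
    | tail -n +"$((keep + 1))" \
    | xargs -r rm -f
}

# Demo: seven empty "archives" whose mtimes increase day by day.
demo_dir="$(mktemp -d)"
for day in 01 02 03 04 05 06 07; do
  f="$demo_dir/mkdocs-backup-202601${day}_000000.tar.gz"
  : > "$f"
  touch -t "202601${day}0000" "$f"
done

prune_keep_newest "$demo_dir" 5
ls "$demo_dir"   # the archives for days 03 through 07 remain
```

Sorting by mtime rather than by name means a restored or re-touched archive counts as recent, which matches what `ls -t` does in the library.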
--- a/scripts/upgrade-stash-cleanup.sh
+++ b/scripts/upgrade-stash-cleanup.sh
@@ -0,0 +1,135 @@
 #!/usr/bin/env bash
 # =============================================================================
 # upgrade-stash-cleanup.sh — clean up stale upgrade-* git stashes
 # =============================================================================
 # Older versions of upgrade.sh used `git stash push --include-untracked` to
 # protect tenant content during pulls. When pop conflicts went unresolved,
 # the stashes accumulated in `git stash list` forever — Pride Corner ended up
 # with three from 2026-03-09 alone, each containing displaced tenant
 # customizations that the running site no longer reflected.
 #
 # This script lists every `upgrade-*` stash, shows its scope, and offers to
 # drop them. It does NOT auto-restore content; that's a separate decision per
 # tenant. The intent is to clear the backlog so future `git stash list` is
 # meaningful.
 #
 # Usage:
 #   bash scripts/upgrade-stash-cleanup.sh          # interactive, lists + prompts
 #   bash scripts/upgrade-stash-cleanup.sh --dry    # list only
 #   bash scripts/upgrade-stash-cleanup.sh --yes    # drop all upgrade-* without prompt
 # =============================================================================
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
 cd "$PROJECT_DIR"
 # Colors
 if [[ -t 1 ]] && [[ -z "${NO_COLOR:-}" ]]; then
  RED='\033[0;31m' GREEN='\033[0;32m' YELLOW='\033[0;33m' CYAN='\033[0;36m'
  BOLD='\033[1m' NC='\033[0m'
 else
  RED='' GREEN='' YELLOW='' CYAN='' BOLD='' NC=''
 fi
 info() { echo -e "${CYAN}[INFO]${NC} $*"; }
 ok()   { echo -e "${GREEN}[ OK ]${NC} $*"; }
 warn() { echo -e "${YELLOW}[WARN]${NC} $*"; }
 DRY=false
 YES=false
 for arg in "$@"; do
  case "$arg" in
    --dry|--dry-run) DRY=true ;;
    --yes|-y)        YES=true ;;
    --help|-h)
      # Print the header comment block (every leading '#' line after the shebang).
      awk 'NR > 1 { if (/^#/) { sub(/^# ?/, ""); print } else exit }' "$0"
      exit 0
      ;;
  esac
 done
 if [[ ! -d .git ]]; then
  warn "Not a git repository — this script only applies to source installs."
  exit 0
 fi
 # Collect upgrade-* stash refs
 mapfile -t STASHES < <(git stash list 2>/dev/null | grep -E ': (On|WIP on) [^:]+: upgrade-' || true)
 if [[ ${#STASHES[@]} -eq 0 ]]; then
  ok "No upgrade-* stashes found. Nothing to clean up."
  exit 0
 fi
 echo ""
 echo -e "${BOLD}Found ${#STASHES[@]} upgrade-* stash(es):${NC}"
 echo ""
 for entry in "${STASHES[@]}"; do
  REF="${entry%%:*}"
  LABEL="${entry#*: }"
  FILE_COUNT=$(git stash show "$REF" --name-only 2>/dev/null | wc -l)
  HAS_MKDOCS_YML=$(git stash show "$REF" --name-only 2>/dev/null | grep -c '^mkdocs/mkdocs\.yml$' || true)
  printf "  %-12s  %-50s  files=%-4d  mkdocs.yml=%s\n" \
    "$REF" "$LABEL" "$FILE_COUNT" "$HAS_MKDOCS_YML"
 done
 echo ""
 if [[ "$DRY" == "true" ]]; then
  info "Dry-run: no stashes will be dropped."
  exit 0
 fi
 # Warn loudly if any stash holds mkdocs.yml — operator should manually review
 # before dropping (tenant content might be there).
 MKDOCS_STASHES=$(printf '%s\n' "${STASHES[@]}" \
  | while read -r entry; do
      REF="${entry%%:*}"
      if git stash show "$REF" --name-only 2>/dev/null | grep -q '^mkdocs/mkdocs\.yml$'; then
        echo "$REF"
      fi
    done)
 if [[ -n "$MKDOCS_STASHES" ]]; then
  echo ""
  echo -e "${RED}${BOLD}⚠ WARNING:${NC} the following stashes contain ${BOLD}mkdocs/mkdocs.yml${NC}:"
  echo "$MKDOCS_STASHES" | sed 's/^/    /'
  echo ""
  echo "   These may hold tenant branding (site_name, site_url, custom theme, etc.)"
  echo "   that ISN'T reflected on disk. Before dropping, verify:"
  echo ""
  echo "     git show <stash-ref>:mkdocs/mkdocs.yml | head -10"
  echo "     diff <(git show <stash-ref>:mkdocs/mkdocs.yml) mkdocs/mkdocs.yml"
  echo ""
  echo "   If disk mkdocs.yml already has the tenant content, the stash is safe to drop."
  echo "   If disk is upstream and stash has tenant content, restore first:"
  echo "     git checkout <stash-ref> -- mkdocs/mkdocs.yml"
  echo ""
 fi
 if [[ "$YES" != "true" ]]; then
  echo -en "${BOLD}Drop all ${#STASHES[@]} upgrade-* stashes? [y/N] ${NC}"
  read -r CONFIRM
  case "$CONFIRM" in
    y|Y|yes|YES) ;;
    *) info "Cancelled. No stashes dropped."; exit 0 ;;
  esac
 fi
 # Drop highest index first: `git stash drop` renumbers every entry above the
 # one removed, so descending order keeps the remaining refs valid.
 mapfile -t SORTED_REFS < <(printf '%s\n' "${STASHES[@]}" \
  | sed 's/:.*//' \
  | sort -t'{' -k2 -n -r)
 for REF in "${SORTED_REFS[@]}"; do
  if git stash drop "$REF" >/dev/null 2>&1; then
    ok "Dropped $REF"
  else
    warn "Failed to drop $REF (already gone?)"
  fi
 done
 echo ""
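 ok "Cleanup complete. Remaining stashes:"
 # git stash list exits 0 with no output when empty, so test the output instead.
 REMAINING="$(git stash list 2>/dev/null || true)"
 echo "${REMAINING:-  (none)}"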
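
The drop-order logic is easier to see on its own. The sketch below reuses the script's `sort` flags on a hand-written list of refs; `sort_refs_desc` is a hypothetical wrapper and the refs are made-up sample input, not taken from any tenant.

```bash
# sort_refs_desc: read stash refs (one per line) on stdin and print them with
# the highest index first. (Hypothetical wrapper around the script's sort call.)
sort_refs_desc() {
  # Split on '{' so field 2 is "<index>}"; -n compares its leading digits
  # numerically and -r reverses, so stash@{10} sorts before stash@{2}.
  sort -t'{' -k2 -n -r
}

printf '%s\n' 'stash@{0}' 'stash@{10}' 'stash@{2}' | sort_refs_desc
# Dropping stash@{10} first leaves stash@{2} and stash@{0} with their indices
# unchanged; dropping stash@{0} first would renumber every later entry.
```

The numeric flag matters: a plain lexical sort would place `stash@{2}` ahead of `stash@{10}`.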
--- a/scripts/upgrade.sh
+++ b/scripts/upgrade.sh
@@ -95,6 +95,14 @@ phase() {
  echo ""
 }
 # Pre-upgrade tenant docs snapshot (no-regrets fallback). Sourced regardless
 # of install mode so snapshot_mkdocs is available in Phase 2.
 # shellcheck source=lib/mkdocs-snapshot.sh
 if [[ -f "$SCRIPT_DIR/lib/mkdocs-snapshot.sh" ]]; then
  # shellcheck disable=SC1091
  . "$SCRIPT_DIR/lib/mkdocs-snapshot.sh"
 fi
 # --- API mode: JSON progress/result writing ---
 UPGRADE_DIR="${PROJECT_DIR}/data/upgrade"
 PROGRESS_FILE="${UPGRADE_DIR}/progress.json"
@@ -188,11 +196,22 @@ restore_user_paths() {
 #   "Non empty db! Please move your current db elsewhere than retry."
 # This regenerates config.json from .env vars when missing.
 verify_gancio_config() {
-  local gancio_volume
-  gancio_volume="$(docker volume ls --format '{{.Name}}' | grep 'gancio-data' | head -1 || true)"
-  if [[ -z "$gancio_volume" ]]; then
+  # Note: as of the gancio-config-init sidecar in docker-compose{,.prod}.yml,
+  # config.json is regenerated automatically on every `up`. This function is
+  # kept as belt-and-braces for the upgrade flow specifically (e.g. so the
  # check happens before the compose-up rather than at compose-up time, and
  # so operators see explicit log output during upgrade).
  local matches
  matches="$(docker volume ls --format '{{.Name}}' | grep 'gancio-data' || true)"
  local count
  count=$(printf '%s\n' "$matches" | grep -c '.' || true)
  if [[ "$count" -eq 0 ]]; then
    return  # No gancio volume exists yet; first run will handle it
  fi
  if [[ "$count" -gt 1 ]]; then
    error "Multiple gancio-data volumes found — refusing to guess. Resolve manually:\n$matches"
  fi
  local gancio_volume="$matches"
  # Check if config.json exists and is non-empty
  if docker run --rm -v "${gancio_volume}:/data" alpine test -s /data/config.json 2>/dev/null; then
@@ -698,6 +717,18 @@ fi
 phase "2" "Backup"
 write_progress 2 "Backup" 15 "Creating backup..."
 # Pre-upgrade tenant docs snapshot — the no-regrets fallback. Runs even when
 # --skip-backup is set, because this is for tenant content recovery (not DB
 # state) and is fast enough that skipping it would never be intentional. It
 # lives in the install root (not backups/) so operators discover it via `ls`.
 if declare -F snapshot_mkdocs >/dev/null 2>&1; then
  if [[ "$DRY_RUN" == "true" ]]; then
    info "[DRY RUN] Would snapshot mkdocs/ to ${PROJECT_DIR}/mkdocs-backup-*.tar.gz"
  else
    snapshot_mkdocs || warn "mkdocs snapshot failed (non-fatal; continuing)"
  fi
 fi
 if [[ "$SKIP_BACKUP" == "true" ]]; then
  warn "Backup skipped (--skip-backup --force)"
 else
@@ -1273,13 +1304,24 @@ while true; do
 done
 success "API healthy (${API_WAIT}s)"
-# Start everything else (exclude one-shot init containers)
+# Start everything else (exclude one-shot init containers AND the ccp-agent
 # service that's running this very script). Recreating ccp-agent here would
 # SIGKILL the script process before write_result has a chance to run; we
 # instead schedule a detached restart at the very end of the script.
 #
 # Mechanism: temporarily drop "ccp-agent" from COMPOSE_PROFILES so the broad
 # `up -d` doesn't include it. We re-add it only when scheduling the deferred
 # restart so the new agent comes up under its profile.
 info "Starting remaining services..."
 PROFILES_SAVED="${COMPOSE_PROFILES:-}"
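 # grep -v exits 1 when it filters out every line (COMPOSE_PROFILES was exactly
 # "ccp-agent"); "|| true" keeps that from failing the pipeline under pipefail.
 COMPOSE_PROFILES_WITHOUT_AGENT="$(echo "${PROFILES_SAVED}" \
  | tr ',' '\n' | { grep -vx 'ccp-agent' || true; } | paste -sd, -)"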
 COMPOSE_PROFILES="${COMPOSE_PROFILES_WITHOUT_AGENT}" \
 docker compose up -d \
  --scale listmonk-init=0 \
  --scale gancio-init=0 \
  --scale vaultwarden-init=0
-success "All services started"
+success "All services started (ccp-agent restart deferred to end-of-script)"
 # Restart Pangolin tunnel connector if running (may hold stale state after nginx rebuild)
 if docker ps --format '{{.Names}}' | grep -q 'newt'; then
@@ -1450,6 +1492,27 @@ echo -e "  ${BOLD}Duration:${NC}  $ELAPSED"
 echo -e "  ${BOLD}Log:${NC}       $LOG_FILE"
 echo ""
 # Deferred ccp-agent restart — the LAST thing the script does before exit.
 # This must run AFTER write_result and archive_success_to_history so the new
 # agent comes up to a complete result.json (otherwise CCP polls forever).
 # We launch a detached subshell that:
 #   1. Sleeps briefly so this script has time to exit cleanly first.
 #   2. Restarts ccp-agent under its profile, picking up any new image.
 # `nohup` + `disown` ensure the subshell survives the agent container dying
 # (when ccp-agent is recreated, the parent agent process — which spawned this
 # upgrade.sh — gets SIGKILL'd; the disowned subshell is reparented to PID 1
 # on the host and continues).
 if echo "${PROFILES_SAVED:-}" | tr ',' '\n' | grep -qx 'ccp-agent'; then
  info "Scheduling deferred ccp-agent restart..."
  nohup bash -c "
    sleep 3
    cd '$PROJECT_DIR'
    COMPOSE_PROFILES='ccp-agent' docker compose --profile ccp-agent up -d ccp-agent
  " >/dev/null 2>&1 < /dev/null &
  disown
  success "ccp-agent restart scheduled (will pick up new image)"
 fi
 release_lock
 trap - EXIT
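
The profile-stripping step used before the broad `up -d` can be exercised without Docker. This is a sketch under stated assumptions: `strip_profile` is a hypothetical helper, and the profile names `monitoring` and `media` are invented for the example. It also shows why the `grep -v` stage needs a guard when the script runs under `set -euo pipefail`.

```bash
set -euo pipefail

# strip_profile LIST NAME: print the comma-separated LIST with every exact
# occurrence of NAME removed. (Hypothetical helper, for illustration only.)
strip_profile() {
  local list="$1" name="$2"
  # grep -v exits 1 when it filters out every line (LIST was just NAME);
  # "|| true" keeps that from failing the pipeline under pipefail.
  printf '%s\n' "$list" | tr ',' '\n' | { grep -vx -- "$name" || true; } | paste -sd, -
}

strip_profile 'monitoring,ccp-agent,media' ccp-agent   # prints: monitoring,media
remaining="$(strip_profile 'ccp-agent' ccp-agent)"     # empty result, no abort
echo "remaining profiles: '${remaining}'"
```

Without the guard, the second call would make the command substitution fail and `set -e` would stop the script before any service was recreated.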
Author	SHA1	Message	Date
bunker-admin	f34382ebdd	chore(approach-c): Phase 0 initial template overlay + session handoff This session shipped: - Approach B end-to-end (commit 4a3d9d7): full rollout to all 7 tenants; marcelle E2E validated twice (121s + 100s). - v2.10.2 surgical update applied to 6 remaining tenants. This commit lands the kickoff for Approach C (template re-render path): scripts/templates changes: - docker-compose.yml.hbs.OLD-style-pre-approach-c: preserved old CCP template (Handlebars-heavy, dynamic container names, secrets rendered at template-time). - docker-compose.yml.hbs: REWRITTEN as a near-mirror of canonical docker-compose.prod.yml. Minimal Handlebars overlay: - Header comment lists {{name}}, {{slug}}, {{composeProject}}. - 5 image refs: ${IMAGE_TAG:-latest} -> {{imageTag}}, so CCP can per-instance override once Phase 1 lands the Instance.imageTag column. All other variation flows through env-var substitution from tenant's .env. Container names are now hardcoded (matching prod), feature flags are deferred to COMPOSE_PROFILES gating (matching prod). Why a rewrite: the old CCP template and prod compose used fundamentally different conventions (dynamic vs hardcoded names, render-time vs substitute-time secrets, Handlebars vs profiles gating). Sync-by-addition couldn't reconcile them. The rewrite makes Approach C re-render safe for the install.sh-installed fleet (marcelle, linda, pia and future). docs/SESSION_HANDOFF_2026-05-21.md: full session handoff covering fleet state, Approach B rollout, Approach C plan, and where to start next session. force-added because /docs is gitignored (same precedent as docs/SESSION_HANDOFF_2026-05-20.md from prior session). Phase 0 remaining work (next session): - Audit env.hbs against new compose env-var expectations - Sync static config files (nginx/, configs/prometheus/, etc.) - Build api/scripts/render-for-instance.ts harness - Iterate template until rendered output is per-instance-only diff against marcelle/linda/pia actual compose. 
Then Phases 1-6 per plan in subsequent sessions (~11-14 hours total). Bunker Admin	2026-05-21 19:32:21 -06:00
bunker-admin	4a3d9d7c41	feat(upgrade): Approach B - image-only upgrade mode Add a "Quick Upgrade" path that pulls latest container images and recreates only the core app services (api, admin, media-api, nginx) without touching any tracked files. Tenant content (mkdocs/, configs/, scripts/) is implicitly preserved because the script never writes outside docker. Faster (~2 min vs ~4-5 min for full upgrade) and structurally safer for releases that don't change orchestration/templates. Pieces: - scripts/image-upgrade.sh: new ~350-line script. Phases: pre-flight + mkdocs snapshot, image pull, targeted recreate (broad up -d would cascade on misconfigured infra containers — proven on marcelle), light health checks, deferred ccp-agent restart. Writes the same progress.json + result.json schema as upgrade.sh so the CCP poll loop is unchanged. - agent/src/routes/upgrade.routes.ts: POST /instance/:slug/upgrade/start-image-only. Same lock + staleness guards as the existing /upgrade/start endpoint. - api/src/services/remote-driver.ts: RemoteDriver.startImageUpgrade(). - api/src/services/upgrade.service.ts: startImageUpgrade() entry point; reuses runRemoteUpgrade with mode='image-only' (only the initial agent call differs — result schema and polling are identical). - api/src/modules/instances/instances.routes.ts: POST /:id/upgrade-images + startImageUpgradeSchema. - admin/src/pages/InstanceDetailPage.tsx: secondary "Quick Upgrade" button next to "Upgrade Now" on the Updates tab. Tooltip explains when to use it. Tested locally on marcelle (v2.10.2 idempotent run): 1m 49s, mkdocs.yml md5 unchanged, file count unchanged, only api/admin/media-api/nginx touched. Subtle bug found and fixed: `set -o pipefail` + `grep -q` shorts pipe and SIGPIPEs the writer — captured services list once instead. Bunker Admin	2026-05-21 15:20:35 -06:00
bunker-admin	731e70ee42	docs: session handoff for the upgrade-flow redesign work Captures the full state of the 2026-05-20/21 working session for the next agent or future-self: fleet status, what landed in v2.10.2, remaining Phase B + C work from the plan, surgical-update procedures for the 6 remaining tenants (proven on pia 2026-05-21), bug inventory, and "don't repeat my mistakes" notes. Plan reference: /home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md Force-added because docs/ is gitignored but the handoff needs to be discoverable in-repo (same pattern as COMPETITIVE_ANALYSIS.md). Bunker Admin	2026-05-21 13:42:08 -06:00
bunker-admin	a7d3dd772b	chore(release): ship scripts/lib/ + classify upgrade-stash-cleanup.sh Two release-build fixes paired with the Approach A changes: 1. Add upgrade-stash-cleanup.sh to RUNTIME_SCRIPTS so it ships in the release tarball. Tenants need it to be able to recover from stale upgrade-* git stashes on their own hosts. 2. Copy scripts/lib/ wholesale into the staged release tree. Without this, upgrade.sh's `. scripts/lib/mkdocs-snapshot.sh` source line silently fails on release installs (the file isn't there), and the pre-upgrade tenant-docs snapshot wouldn't fire — defeating the no-regrets fallback. Bunker Admin	2026-05-21 10:36:28 -06:00
bunker-admin	9613c3ec81	fix(upgrade): Phase 1 of upgrade-flow redesign (Approach A) Three coordinated fixes from the upgrade-flow redesign plan (/home/bunker-admin/.claude/plans/okay-so-we-can-enumerated-hejlsberg.md): 1. scripts/lib/mkdocs-snapshot.sh (NEW): pre-upgrade tarball snapshot of the entire mkdocs/ directory into the install root as mkdocs-backup-<timestamp>.tar.gz. Discoverable via `ls`, retained last 5. No-regrets fallback if anything in the upgrade goes sideways. Sourced by upgrade.sh (and later by image-upgrade.sh under Approach B). 2. scripts/upgrade.sh Phase 6 self-destruct fix: previously, the broad `docker compose up -d` recreated the ccp-agent container that was running the script, sending SIGKILL to the bash process before write_result could land result.json. Marcelle's test upgrade hit this tonight. Fix: temporarily remove `ccp-agent` from COMPOSE_PROFILES during Phase 6's broad up -d, then schedule a detached `nohup ... & disown` restart at the very end of the script (after write_result and archive_success_to_history). The deferred subshell sleeps 3s, then recreates ccp-agent under its profile, picking up the new image. 3. scripts/upgrade-stash-cleanup.sh (NEW): one-shot utility to list and drop accumulated `upgrade-` git stashes left over by older upgrade.sh runs whose pop failed silently (Pride Corner has three from 2026-03-09 alone). Warns loudly if any stash holds tenant mkdocs.yml content so operators verify recovery before dropping. The .gitignore now excludes /mkdocs-backup-.tar.gz so the rescue archives don't leak into commits. This is Phase 1 of three: Approach B (image-only upgrade mode) and Approach C (CCP template re-render) follow in subsequent commits. Bunker Admin	2026-05-20 20:43:34 -06:00
bunker-admin	e88ac79ae8	fix(ccp-agent): export COMPOSE_PROJECT_NAME so upgrade.sh sees correct project The agent already passed COMPOSE_PROJECT in env, but Docker Compose actually reads COMPOSE_PROJECT_NAME. When upgrade.sh (running inside the agent container at cwd=/app/instance) shelled out to `docker compose up -d` in Phase 5, compose defaulted the project name to "instance" (cwd basename), collided with the host's existing containers under "changemakerlite", and the upgrade aborted with "Container ... already in use by container ..." errors. Discovered when triggering the first end-to-end CCP "Upgrade Now" on marcelle (v2.9.15 → v2.10.1). Backup/code/rebuild phases all succeeded; migration phase failed instantly. Rollback restored marcelle cleanly. This commit adds COMPOSE_PROJECT_NAME alongside the existing COMPOSE_PROJECT (which the agent's TypeScript still reads for its own slug derivation). Bunker Admin	2026-05-20 15:57:30 -06:00
bunker-admin	1b80e8294c	fix(ccp-agent): whitelist /app/instance for git safe.directory The agent container runs as root but the bind-mounted instance directory is owned by the host user (UID 1000 = `node` in the container). Modern git refuses to operate on such repos without an explicit safe.directory entry, breaking upgrade-check.sh's `git fetch/log` calls on source-installed tenants. Verified empirically on soroush after the previous fix landed. Bunker Admin	2026-05-20 12:14:39 -06:00
bunker-admin	a531f9b9ce	fix(ccp): make agent functional + fix Gitea release timestamp bug Three related fixes uncovered during a marcelle CCP registration test: 1. ccp-agent image was missing bash + curl + jq + python3, so every spawn('bash', ...) in upgrade.routes.ts and backup.routes.ts failed silently with ENOENT. CCP kept reading stale status.json files from disk, masking that no agent had successfully checked for updates in weeks. apk-add the missing tools. 2. ccp-agent's /app/instance mount was :ro, blocking the agent from writing data/upgrade/status.json (and result/progress/backups). Agent already has docker.sock — removing :ro is not a security escalation. Patched both docker-compose.yml and docker-compose.prod.yml. 3. Gitea 1.23.x only initializes Release.CreatedUnix inside its createTag() helper, which is skipped if the tag already exists on origin. The old DEV_WORKFLOW pattern (push tag, then run build-release.sh --upload) was triggering this — releases got created_unix=0 and lost /releases/latest sort order to v2.9.14. build-release.sh now removes the remote tag first and POSTs with target_commitish so Gitea creates the tag and release atomically. After these fixes, CCP's "Check for Updates" path returns truthful data end-to-end (verified on marcelle: v2.9.15 -> v2.10.1, 1 behind). Bunker Admin	2026-05-20 11:59:35 -06:00
bunker-admin	a82e95946b	fix(gancio): pre-start config-init sidecar prevents restart loop Gancio refuses to start when its DB has tables but the data volume has no config.json ("Non empty db! Please move your current db elsewhere than retry"), which produces an infinite restart loop. This hit production tenants bnkops and trbh (>1200 restart cycles each) — proximate cause was a missing config.json in changemakerlite_gancio-data with the DB fully populated. Add gancio-config-init alpine sidecar that runs on every `up`: - no-op when config.json exists - regenerates from .env when missing (1000:1000 ownership) - gancio service now depends on its service_completed_successfully Also harden verify_gancio_config in upgrade.sh to error loudly when multiple gancio-data volumes match (silent head -1 could pick the wrong one after a compose project rename).	2026-05-19 17:02:55 -06:00
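
The "refuse to guess" rule from the gancio hardening commit can be sketched as a small function. This is an illustration, not the code in upgrade.sh: `pick_single_volume`, its return codes, and the second volume name are assumptions made for the example.

```bash
# pick_single_volume LIST: LIST is a newline-separated set of candidate volume
# names. Prints the name and returns 0 when there is exactly one; returns 1
# when there are none and 2 when there are several. (Hypothetical helper; the
# return codes are this sketch's own convention.)
pick_single_volume() {
  local matches="$1" count
  # grep -c '.' counts non-empty lines, so an empty LIST counts as 0.
  count="$(printf '%s\n' "$matches" | grep -c '.' || true)"
  if [ "$count" -eq 0 ]; then
    return 1
  elif [ "$count" -gt 1 ]; then
    echo "multiple candidates, refusing to guess:" >&2
    printf '%s\n' "$matches" >&2
    return 2
  fi
  printf '%s\n' "$matches"
}

pick_single_volume 'changemakerlite_gancio-data'   # prints the name

# Two candidates, as might follow a compose project rename (second name invented).
two_volumes='changemakerlite_gancio-data
instance_gancio-data'
pick_single_volume "$two_volumes" || echo "ambiguous (exit $?)"
```

Returning distinct codes lets a caller tell "nothing to do yet" apart from "operator must resolve this", which a silent `head -1` cannot express.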