From 13513aeca5c1e866a1f1200b2b494d9a0d0589f7 Mon Sep 17 00:00:00 2001 From: bunker-admin Date: Wed, 15 Apr 2026 18:33:13 -0600 Subject: [PATCH] Fix VERSION promotion regression: don't gate on soft health-check warnings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prior commit (ac901c9e, Fix B) gated VERSION.pending promotion behind VERIFY_FAILED=false, but VERIFY_FAILED is a soft warning signal — it also fires when the admin container's 30s verify budget is tight (which was the cry-wolf case Fix 3 addressed in the same commit). Observed on marcelle during v2.9.5 → v2.9.6: the upgrade completed successfully (tarball extracted, containers pulled and running new image), but because the admin healthcheck warned at 30s (still using v2.9.5's upgrade.sh with its 30s budget), VERIFY_FAILED=true pinned VERSION back to v2.9.5 despite everything else having advanced. result.json showed success=true but newCommit=v2.9.5. Hard failures still prevent promotion via on_failure's rm -f of VERSION.pending before the promotion site is reached. Reaching the promotion site means Phase 7 completed without exit-code or trap — that's the correct gate. Bunker Admin --- scripts/upgrade.sh | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/scripts/upgrade.sh b/scripts/upgrade.sh index 2c8908d0..0f2009d7 100755 --- a/scripts/upgrade.sh +++ b/scripts/upgrade.sh @@ -1405,11 +1405,16 @@ PYEOF fi # --- Atomic VERSION promotion (Fix B) --- -# The staged VERSION from Phase 3 lands only now, after full verification. -# On any prior failure, on_failure removes VERSION.pending and the live -# VERSION file remains at the pre-upgrade value — so upgrade-check.sh -# correctly reports "upgrade available" on the next check. -if [[ "$VERIFY_FAILED" != "true" ]] && [[ -f "$UPGRADE_DIR/VERSION.pending" ]]; then +# The staged VERSION from Phase 3 lands now that we've reached the end of +# Phase 7 without on_failure firing. Promote regardless of VERIFY_FAILED — +# that flag is a soft health-check warning (e.g. "admin slow to respond"), +# not an upgrade failure. The tarball is extracted, containers are up, and +# write_result below will record success=true. Gating promotion on +# VERIFY_FAILED previously caused a "stuck at old VERSION" bug where a +# transient admin healthcheck warning pinned the install back. +# Hard failures (SIGTERM, exit !=0) still prevent promotion via on_failure, +# which rm -f's VERSION.pending before it can be promoted. +if [[ -f "$UPGRADE_DIR/VERSION.pending" ]]; then mv "$UPGRADE_DIR/VERSION.pending" "$PROJECT_DIR/VERSION" success "VERSION promoted to $(head -1 "$PROJECT_DIR/VERSION" 2>/dev/null || echo "?")" fi