Skip to content

Release 0.18.4 — apptainer probe + LiveShell hardening + yaml fix#65

Open
phy0x1a79ed wants to merge 7 commits into
hallamlab:releasefrom
phy0x1a79ed:release
Open

Release 0.18.4 — apptainer probe + LiveShell hardening + yaml fix#65
phy0x1a79ed wants to merge 7 commits into
hallamlab:releasefrom
phy0x1a79ed:release

Conversation

@phy0x1a79ed

Copy link
Copy Markdown
Contributor

Summary

Patch release rolling up six commits on dev. No new features — bug fixes and robustness only. The new-feature work (caching plugin, lineage v2, axis-keyed tests, etc.) stays on feat/release-alpha (0.19.0) and is not in this PR.

  • 9d86540 Fix yaml_safe_load doubling newlines (was corrupting wrapped scalars in serialized libraries)
  • 7a8008f LiveShell: inline-marker completion protocol that survives sub-shells
  • deab88e LiveShell: SubShell(entry_cmd) context manager — Approach D rewrite; bumps version.txt to 0.18.4
  • 976bdfb Probe apptainer at deploy: two-axis static check, prefer SIF, build sandbox iff apptainer>=1.4 without setuid (Bug E.4 fir SIGBUS workaround that keeps Bug E.2 WSL2 path)
  • da39ee7 LiveShell: stdin-isolate user commands so ssh/cat/read can't steal markers
  • 8fc09e8 slurm.nf: clarify cleanup=false comment (was inverted)

Test plan

  • pytest tests/test_live_shell.py tests/test_deploy_sandbox_step.py tests/test_deploy_skip_marker.py tests/test_container_tag.py tests/test_dev_sh_tag.py tests/test_build_pip_version_split.py48 passed, 3 ssh-skipped
  • Fast suite (-m "not docker and not nextflow and not slow", excluding e2e_agentic/e2e_virtual/integration) — 270 passed, 4 skipped
  • CLI smoke: metasmith --help, transform libraries, type list, task show <key>, --json task show <key> all clean at 0.18.4
  • Author's recorded sockeye e2e deploy success at deab88e (subsequent commits covered by unit tests above)
  • Live sockeye smoke deploy of this exact build was skipped because the local vpn_ubc bounce container isn't running and starting it needs interactive credentials

Artifacts published

  • Wheel: metasmith-0.18.4+6e7020f-py3-none-any.whl
  • Docker: quay.io/hallamlab/metasmith:0.18.4-6e7020f + :latest
  • Apptainer SIF rebuilt locally from the new docker image

Tag

v0.18.4afe580e is pushed to phy0x1a79ed/Metasmith (no upstream push permission). Please push it to hallamlab/Metasmith after merging:

git fetch phy_fork v0.18.4
git push upstream v0.18.4

phy0x1a79ed and others added 7 commits May 28, 2026 12:23
f.readlines() returns lines that already end in \n, so '\n'.join(...)
inserts a second \n between every line. YAML treats \n\n as a literal
newline (vs \n folding to a space), so any plain scalar that yaml.dump
wrapped past 80 cols parsed with literal \n in the value via this path
but correctly via plain yaml.safe_load.

Net effect downstream: type-description strings diverged between
data_types/*.yml (regular load) and transforms/*/_metadata/types/*.yml
mirrors (this load path), producing different type hashes and breaking
DataInstance->Transform binding for any long-described type. Reported
from spanish-lakes/metagenomics: ref::kraken2_db and ref::metaphlan_db
hashed differently across paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5d2aa2f's fd-5 ctl channel wedged every Agent.Deploy() against an ssh
home: fd 5 doesn't traverse ssh, so the trampoline's `printf >&5` on the
remote bash wrote to a closed fd and no completion frame ever arrived.

Replaces the fd-5 + JSON-frame model with per-Exec inline markers on
stdout AND stderr:

    <user_cmd>
    __rc=$?; printf '\x1eMSM_END_<TOKEN>_<NONCE> %s\x1e\n' "$__rc"; \
             printf '\x1eMSM_ERR_<TOKEN>_<NONCE>\x1e\n' >&2

The marker is just bytes. Whichever shell is parsing bash's stdin when
it reaches the line emits the marker, and ssh / nested bash / docker
exec / `bash -c` forward the bytes back unchanged. This is the single
mechanism that generalises across local + sub-shell cases without
pre-installed state inside the target shell:

  - Exec("ssh host")           — marker comes back from remote
  - Exec(cmd) after entry      — runs on remote (state persists)
  - Exec(cmd) local            — marker on local bash
  - Exec(rsync src ssh://...)  — rsync uses its own pipes, marker
                                 runs on local bash after rsync exits

Collision immunity: \x1e (ASCII RS) brackets the marker, plus a
per-LiveShell 128-bit token and a per-Exec random nonce. The strip-
from-delivery rule only fires when `nonce in self._pending`, so user
output containing the marker shape with an unknown nonce passes through
verbatim — keeping the G2 collision-immunity tests green.

Drops the ctl pipe (`_ctl_r`, `_ctl_reader`, `_ctl_fd_in_child`,
`_on_ctl_line`, `extra_pass_fds`), the `import json`, the
`__msm_done` bash function injection, `_install_trampoline` /
`RemoteExec`, and the `__msm_rc=$?; (exit $__msm_rc)` gymnastics in
ExecAsync. TerminalProcess keeps `extra_pass_fds` as a default `()`
param for back-compat; nothing uses it now.

models/remote.py:
  - `_execute_ssh` probe phase: dropped the `; echo exited` workaround
    (it was always broken — `echo exited` runs whether ssh succeeds or
    fails); use `Exec(f"ssh {remote}")` + `res.exit_code` instead.
  - `_join` phase: interactive ssh entry per batch (`Exec(f"ssh
    {dest_host}")`), file checks on the now-remote shell, then
    `Dispose()` — no `Exec("exit")`. This sidesteps two failure modes
    discovered while testing against ssh xpsz:
      1. `Exec("ssh host 'cmd'")` (non-interactive command-mode ssh)
         wedges: ssh's non-interactive remote command doesn't read
         stdin, so the trailing marker bytes are forwarded to the
         remote one-shot which discards them.
      2. `Exec("exit")` from a sub-shell wedges: the sub-shell
         terminates before the marker emission line is read.

End-to-end verified: `Agent.Deploy(ssh://xpsz/...)` runs to completion
through interactive ssh + multiple Exec / rsync / per-batch existence
check / container-side `deploy_from_container` + relay extraction.
`pytest tests/test_live_shell.py -v` — 18/18 pass (G1–G8 + strict-
ordering invariant + batched-async).

Known residual limitations (in-band approach inherent):
  - `Exec("exec something")` replaces bash; no in-band approach
    survives. Refuse at API level if needed.
  - `Exec("cat")` / `Exec("read x")` — user command consumes the
    marker line from stdin. Caller uses `timeout=`.
  - `Exec("exec 2>/tmp/log")` — stderr marker goes to the file
    forever. Not handled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Approach D rewrite let `Exec("ssh host")` work cleanly because the
marker bytes flow through ssh to the remote shell and back. The mirror
case — leaving the sub-shell via `Exec("exit")` — wedged: bash writes
`exit\n` plus marker emission as one stream, ssh greedily consumes both,
the remote bash exits before reading the marker, and the marker bytes
are lost in transit.

SubShell(entry_cmd) is the surface API for shell-boundary crossing:
- __enter__ runs Exec(entry_cmd); the marker comes back from the
  sub-shell (no special path needed).
- __exit__ runs _pop(): writes `exit` alone, waits for true output
  quiescence via a multi-sample idle detector (N consecutive samples
  where now - _last_byte_time >= quiescence_ms, with a floor), THEN
  writes a fresh marker emission that lands on the now-foreground
  parent shell. If the first marker is lost (sub-shell still draining
  when written), one retry recovers — lost markers are no-op on a dead
  remote.

LiveShell gains _last_byte_time (updated in _make_tee on every non-empty
chunk) and _depth (bookkeeping for nesting). A class attribute
_pop_drop_first_marker is a test seam that forces the retry path; off
by default.

models/remote.py:_join now uses one shared LiveShell across all
dest-host batches, entering via `with check_shell.SubShell(f"ssh {host}")`
per batch instead of creating + disposing a fresh LiveShell per batch.
Local-dest branch goes back to plain Path.exists().

Verified:
- pytest tests/test_live_shell.py: 25/25 green (18 prior + 7 new
  covering nested-bash round-trip, two-level nest, exception-in-body,
  time bounds, chatty quiescence, retry path, timestamp update).
- e2e deploy sockeye (LiveShell + APPTAINER + scratch home): exit 0;
  lib/agent.yml and relay/msm_relay landed at
  /scratch/st-shallam-1/txyliu/metasmith_ws/msm_subshell_test/.
- e2e deploy fir (LiveShell + APPTAINER + home): exit 0; same artifacts
  at /home/phyberos/metasmith_ws/msm_subshell_test/.

Version bumped to 0.18.4 to signal completion.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….4 (Bug E.4 fir)

The SIF→sandbox auto-unpack added for Bug E.2 (commit 96c4634) gated
on a single proxy: `[ -u .../starter-suid ]`. That conflated two
distinct conditions — "host needs squashfuse workaround for SIF" and
"sandbox path is itself safe on this host" — and shipped a sandbox dir
on hosts where the sandbox path is the unsafe one. Empirically (today,
fir login3 + cvmfs apptainer 1.3.5): array tasks routed through the
sandbox SIGBUS 1/8 in the minimal repro; matches the ~30 FAILED of the
original nextflow run on runs/qZdfQy2h. The reporter's parallel
workaround (`mv sandbox sandbox.disabled_for_test_*`) has the SIF
fallback succeeding on the same compute fabric.

Mechanism, verified across four hosts by inspecting /proc/mounts and
pgrep squashfuse while `apptainer exec` is live:

- Sockeye 1.3.1 + setuid → SIF kernel-mount, sandbox kernel-overlay.
  Both safe. Verdict: use-sif (no sandbox built).
- fir 1.3.5 no setuid → SIF squashfuse_ll (works under sbatch),
  sandbox fuse-overlayfs (SIGBUS under array contention).
  Verdict: use-sif (no sandbox built).
- Cosmos 1.4.2 no setuid, WSL2 6.6 → SIF squashfuse_ll (Bug E.2
  wedge under relay fork chain), sandbox kernel-overlay.
  Verdict: use-sandbox.
- xps 1.4.5 no setuid, WSL2 6.6 (the original Bug E.2 host) → same
  as Cosmos. Verdict: use-sandbox.

The decision matrix is two-axis: setuid starter-suid AND apptainer
major.minor. Static signals; no apptainer exec required at probe time.

- Container.MakeSandboxDecisionProbe replaces MakeNeedsSandboxProbe.
  Emits `use-sif` or `use-sandbox`. setuid present → use-sif. Else
  major>=2 || (major==1 && minor>=4) → use-sandbox; else use-sif.
  Empty for non-APPTAINER runtime.
- Agent.Deploy(): probe → capture VERDICT → build sandbox idempotently
  on use-sandbox branch (`[ -d ... ] || apptainer build`); else
  `rm -rf` any stale sandbox dir so a flipped verdict self-heals on
  next deploy. The `assertive=True` `rm -rf` prefix is preserved.
- Container.MakeRunCommand(local=True) ternary is unchanged. Directory
  presence on the target host remains the run-time signal; deploy
  controls the presence.
- libraries.py:1383 cache check `( [ -e sif ] || [ -d sandbox ] )`
  unchanged; on use-sif hosts the OR collapses to the SIF check.
- AGENTS.md updated: new section heading "Apptainer SIF ↔ sandbox
  decision" describing the two-axis probe and what each verdict does.

Tests rewritten:
- tests/models/test_container_sandbox.py TestProbe → TestSandboxDecisionProbe;
  pins setuid axis, version axis, the two verdict literals, and the
  empty-for-docker contract. Existing TestCachePaths / TestBuildCommand
  / TestRunCommandSwitch unchanged.
- tests/test_deploy_sandbox_step.py rewritten to pin the new VERDICT
  capture, the branch on `use-sandbox`, idempotent `[ -d ...]` guard,
  and the else-arm `rm -rf` for the stale-sandbox case. `assertive`
  test preserved.

15/15 sandbox+deploy tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… markers

ExecAsync wraps the user cmd in `{ ...; } </dev/null` by default so children
(one-shot ssh, cat, read, sudo without -n, ...) inherit /dev/null as stdin
and cannot greedily consume the marker emission line off bash's stdin pipe.
This closes "known limitation hallamlab#1" from 7a8008f: Exec("ssh host cmd") and the
compound Exec("ssh host mkdir && rsync src host:dst") used by
main/local_mock/rsync.py wedged because ssh shipped the marker bytes to the
remote one-shot, where they were discarded; AwaitDone then waited on a
marker that would never arrive.

Diagnosed via /tmp/ls_probe.py against sockeye: Exec("ssh sockeye echo
from-remote") returned the expected stdout but rc=None at 30s. Confirmed
in-process: bash subshell parked in pipe_read at fresh prompt, no rsync/ssh
children, Python main thread futex-blocked in AwaitDone, NBR threads idle.

Brace group is the right wrap (not subshell parens) — it preserves env
mutations across Execs (e.g. `module load apptainer/1.3.1` in setup_commands
persists into later Execs on the same shell). `__rc=$?` on the next line
still captures the user cmd's exit code unchanged. body.rstrip() + explicit
'\n' before '}' guarantees the closing brace is its own token even for
multi-line bodies (agents.py:517's f"""...""" pattern).

inherit_stdin=True opt-in for cmds that DO need bash's real stdin — the
sub-shell-entry mechanism where the marker line is supposed to travel
through to the inner shell:
  - SubShell.__enter__ (terminals.py)
  - agents.py:151 — `shell.Exec(f"ssh {host}")` (Deploy's interactive ssh)
  - remote.py:432 — `remote_shell.Exec(f"ssh {remote}")` (rsync _join batch)

Tests:
  - 4 new local-only (mutation gates): stdin-stealing cmd doesn't block
    subsequent Execs; explicit `< file` overrides the wrap; multi-line
    body preserves env mutation across Execs; inherit_stdin=True passes
    bash's stdin through to a child bash.
  - 3 new network-gated (LIVESHELL_REMOTE_HOST=sockeye): ssh one-shot
    rc=0, rc=1, and the exact ssh+rsync compound from rsync.py.
  - conftest registers the `network` marker so pytest doesn't warn.
  - test_pop_quiescence_under_chatty_output updated to pass
    inherit_stdin=True on its bare `Exec("bash")` entry (mirroring what
    SubShell.__enter__ now does internally).

Verified:
  - pytest tests/test_live_shell.py: 29 local pass, 3 network pass against
    sockeye.
  - /tmp/ls_probe.py test B (Exec("ssh sockeye echo from-remote")): rc=0
    in 0.3s (was: rc=None at 30s timeout).
  - dev.sh -td sockeye: exits 0, sockeye-side terminals.py md5 matches local.
  - Agent.Deploy(home=ssh://sockeye/...) end-to-end: msm, lib/agent.yml,
    lib/msm_bootstrap, relay/msm_relay all land on sockeye with expected
    sizes; relay binary md5 matches local target/.../msm_relay.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous comment ("remove work dirs of completed tasks to save
disk/inodes") described the opposite of what the setting actually does.
Reword to state why we keep the dirs (recoverable .command.err/.out for
failed tasks).
…g + yaml fix

Patch release rolling up five dev commits:
- yaml_safe_load no longer doubles newlines on round-trip (corrupts wrapped scalars)
- LiveShell: inline-marker completion protocol survives sub-shells
- LiveShell: SubShell(entry_cmd) context manager + Approach D rewrite,
  bumping version.txt to 0.18.4
- Apptainer SIF↔sandbox decision now probes at deploy: prefer SIF, build
  sandbox iff apptainer>=1.4 without setuid (Bug E.4 fir SIGBUS)
- LiveShell: stdin-isolate user commands so ssh/cat/read can't steal markers
- slurm.nf: clarify cleanup=false comment

No new features — bug fixes and robustness only. release-alpha (0.19.0)
remains separate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@phy0x1a79ed

Copy link
Copy Markdown
Contributor Author

Live smoke deploy against xpsz (WSL2, apptainer 1.4.5, no setuid)

Ran metasmith agent deploy against an actual host that triggers the new probe's use-sandbox branch — i.e. the Bug E.2 WSL2 surface, not the Bug E.4 fir SIGBUS surface. The version on the host is exactly the artifact this PR ships: quay.io/hallamlab/metasmith:0.18.4-6e7020f.

Probe verdict: use-sandbox (apptainer 1.4.5, starter-suid absent → MAJ=1 MIN=4 → matches the new >=1.4 && !setuid branch in MakeSandboxDecisionProbe)

Deploy log highlights:

>>> {if not exists}: apptainer pull .../docker..quay.io_hallamlab_metasmith..0.18.4-6e7020f.sif docker://quay.io/hallamlab/metasmith:0.18.4-6e7020f
>>> {probe host; build sandbox iff apptainer>=1.4 and no setuid}: apptainer build --sandbox ....sandbox ....sif
staged [msm]
staged [lib/agent.yml]
staged [lib/msm_bootstrap]
api call to [deploy_from_container] with {'workspace': '/msm_home', 'architecture': 'x86_64', 'system': 'Linux'}
deploying relay server to [/msm_home/relay/msm_relay]
deployment complete
status: deployed

Post-deploy state on xpsz (/home/tony/metasmith_ws/smoke_0184/):

  • container_images/...0.18.4-6e7020f.sif (754 MB — kept alongside)
  • container_images/...0.18.4-6e7020f.sandbox/ (unpacked rootfs — runtime ternary will pick this)
  • lib/agent.yml pinned to docker://quay.io/hallamlab/metasmith:0.18.4-6e7020f
  • relay/msm_relay (x86_64 static-pie ELF)
  • msm bash wrapper

The whole deploy ran through a single LiveShell ssh session with no marker theft or sub-shell wedges, exercising deab88e + 7a8008f + da39ee7 + 976bdfb together on a real host.

@phy0x1a79ed

Copy link
Copy Markdown
Contributor Author

Live smoke deploy against fir (Compute Canada, apptainer 1.3.5, no setuid)

Deployed against an isolated scratch dir (/scratch/phyberos/metasmith_smoke_0184_1780385934, freshly created, no existing files touched). This is the Bug E.4 surface — the opposite branch from xpsz.

Probe verdict: use-sif (fallback) — apptainer 1.3.5, no starter-suid → MAJ=1 MIN=3 → falls into the <1.4 && !setuid branch, which the new probe correctly routes to SIF (avoiding the fuse-overlayfs SIGBUS race under SLURM array contention).

Deploy log highlights:

>>> {if not exists}: apptainer pull .../docker..quay.io_hallamlab_metasmith..0.18.4-6e7020f.sif docker://quay.io/hallamlab/metasmith:0.18.4-6e7020f
>>> {probe host; build sandbox iff apptainer>=1.4 and no setuid}: apptainer build --sandbox ....sandbox ....sif
staged [msm]
staged [lib/agent.yml]
staged [lib/msm_bootstrap]
deploying relay server to [/msm_home/relay/msm_relay]
deployment complete
status: deployed

Note the absence of INFO: Starting build... / Extracting local image... / Build complete: lines between the sandbox-build label and staged [msm]. Compare to the xpsz log earlier where those INFO lines DO appear — confirming the actual apptainer build --sandbox command was skipped on fir, as intended.

Post-deploy state on fir:

container_images/
  docker..quay.io_hallamlab_metasmith..0.18.4-6e7020f.sif       (754 MB, only artifact)
  # no .sandbox/ directory  ← key: confirms sandbox was not built

On xpsz this directory holds both the SIF and the unpacked sandbox; on fir, SIF only. The runtime ternary if [ -d <sandbox> ]; then echo <sandbox>; else echo <sif>; fi will correctly pick the SIF on fir.

Summary of the two-host probe matrix

Host apptainer setuid Probe verdict Sandbox built? Bug avoided
xpsz 1.4.5 no use-sandbox yes E.2 (WSL2 squashfuse_ll wedge)
fir 1.3.5 no use-sif no E.4 (fuse-overlayfs SIGBUS race)

Both branches of the new probe exercised end-to-end on real hosts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant