Release 0.18.4 — apptainer probe + LiveShell hardening + yaml fix#65
Release 0.18.4 — apptainer probe + LiveShell hardening + yaml fix#65phy0x1a79ed wants to merge 7 commits into
Conversation
f.readlines() returns lines that already end in \n, so '\n'.join(...) inserts a second \n between every line. YAML treats \n\n as a literal newline (vs \n folding to a space), so any plain scalar that yaml.dump wrapped past 80 cols parsed with literal \n in the value via this path but correctly via plain yaml.safe_load. Net effect downstream: type-description strings diverged between data_types/*.yml (regular load) and transforms/*/_metadata/types/*.yml mirrors (this load path), producing different type hashes and breaking DataInstance->Transform binding for any long-described type. Reported from spanish-lakes/metagenomics: ref::kraken2_db and ref::metaphlan_db hashed differently across paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5d2aa2f's fd-5 ctl channel wedged every Agent.Deploy() against an ssh home: fd 5 doesn't traverse ssh, so the trampoline's `printf >&5` on the remote bash wrote to a closed fd and no completion frame ever arrived. Replaces the fd-5 + JSON-frame model with per-Exec inline markers on stdout AND stderr: <user_cmd> __rc=$?; printf '\x1eMSM_END_<TOKEN>_<NONCE> %s\x1e\n' "$__rc"; \ printf '\x1eMSM_ERR_<TOKEN>_<NONCE>\x1e\n' >&2 The marker is just bytes. Whichever shell is parsing bash's stdin when it reaches the line emits the marker, and ssh / nested bash / docker exec / `bash -c` forward the bytes back unchanged. This is the single mechanism that generalises across local + sub-shell cases without pre-installed state inside the target shell: - Exec("ssh host") — marker comes back from remote - Exec(cmd) after entry — runs on remote (state persists) - Exec(cmd) local — marker on local bash - Exec(rsync src ssh://...) — rsync uses its own pipes, marker runs on local bash after rsync exits Collision immunity: \x1e (ASCII RS) brackets the marker, plus a per-LiveShell 128-bit token and a per-Exec random nonce. The strip- from-delivery rule only fires when `nonce in self._pending`, so user output containing the marker shape with an unknown nonce passes through verbatim — keeping the G2 collision-immunity tests green. Drops the ctl pipe (`_ctl_r`, `_ctl_reader`, `_ctl_fd_in_child`, `_on_ctl_line`, `extra_pass_fds`), the `import json`, the `__msm_done` bash function injection, `_install_trampoline` / `RemoteExec`, and the `__msm_rc=$?; (exit $__msm_rc)` gymnastics in ExecAsync. TerminalProcess keeps `extra_pass_fds` as a default `()` param for back-compat; nothing uses it now. models/remote.py: - `_execute_ssh` probe phase: dropped the `; echo exited` workaround (it was always broken — `echo exited` runs whether ssh succeeds or fails); use `Exec(f"ssh {remote}")` + `res.exit_code` instead. - `_join` phase: interactive ssh entry per batch (`Exec(f"ssh {dest_host}")`), file checks on the now-remote shell, then `Dispose()` — no `Exec("exit")`. This sidesteps two failure modes discovered while testing against ssh xpsz: 1. `Exec("ssh host 'cmd'")` (non-interactive command-mode ssh) wedges: ssh's non-interactive remote command doesn't read stdin, so the trailing marker bytes are forwarded to the remote one-shot which discards them. 2. `Exec("exit")` from a sub-shell wedges: the sub-shell terminates before the marker emission line is read. End-to-end verified: `Agent.Deploy(ssh://xpsz/...)` runs to completion through interactive ssh + multiple Exec / rsync / per-batch existence check / container-side `deploy_from_container` + relay extraction. `pytest tests/test_live_shell.py -v` — 18/18 pass (G1–G8 + strict- ordering invariant + batched-async). Known residual limitations (in-band approach inherent): - `Exec("exec something")` replaces bash; no in-band approach survives. Refuse at API level if needed. - `Exec("cat")` / `Exec("read x")` — user command consumes the marker line from stdin. Caller uses `timeout=`. - `Exec("exec 2>/tmp/log")` — stderr marker goes to the file forever. Not handled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Approach D rewrite let `Exec("ssh host")` work cleanly because the
marker bytes flow through ssh to the remote shell and back. The mirror
case — leaving the sub-shell via `Exec("exit")` — wedged: bash writes
`exit\n` plus marker emission as one stream, ssh greedily consumes both,
the remote bash exits before reading the marker, and the marker bytes
are lost in transit.
SubShell(entry_cmd) is the surface API for shell-boundary crossing:
- __enter__ runs Exec(entry_cmd); the marker comes back from the
sub-shell (no special path needed).
- __exit__ runs _pop(): writes `exit` alone, waits for true output
quiescence via a multi-sample idle detector (N consecutive samples
where now - _last_byte_time >= quiescence_ms, with a floor), THEN
writes a fresh marker emission that lands on the now-foreground
parent shell. If the first marker is lost (sub-shell still draining
when written), one retry recovers — lost markers are no-op on a dead
remote.
LiveShell gains _last_byte_time (updated in _make_tee on every non-empty
chunk) and _depth (bookkeeping for nesting). A class attribute
_pop_drop_first_marker is a test seam that forces the retry path; off
by default.
models/remote.py:_join now uses one shared LiveShell across all
dest-host batches, entering via `with check_shell.SubShell(f"ssh {host}")`
per batch instead of creating + disposing a fresh LiveShell per batch.
Local-dest branch goes back to plain Path.exists().
Verified:
- pytest tests/test_live_shell.py: 25/25 green (18 prior + 7 new
covering nested-bash round-trip, two-level nest, exception-in-body,
time bounds, chatty quiescence, retry path, timestamp update).
- e2e deploy sockeye (LiveShell + APPTAINER + scratch home): exit 0;
lib/agent.yml and relay/msm_relay landed at
/scratch/st-shallam-1/txyliu/metasmith_ws/msm_subshell_test/.
- e2e deploy fir (LiveShell + APPTAINER + home): exit 0; same artifacts
at /home/phyberos/metasmith_ws/msm_subshell_test/.
Version bumped to 0.18.4 to signal completion.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
….4 (Bug E.4 fir) The SIF→sandbox auto-unpack added for Bug E.2 (commit 96c4634) gated on a single proxy: `[ -u .../starter-suid ]`. That conflated two distinct conditions — "host needs squashfuse workaround for SIF" and "sandbox path is itself safe on this host" — and shipped a sandbox dir on hosts where the sandbox path is the unsafe one. Empirically (today, fir login3 + cvmfs apptainer 1.3.5): array tasks routed through the sandbox SIGBUS 1/8 in the minimal repro; matches the ~30 FAILED of the original nextflow run on runs/qZdfQy2h. The reporter's parallel workaround (`mv sandbox sandbox.disabled_for_test_*`) has the SIF fallback succeeding on the same compute fabric. Mechanism, verified across four hosts by inspecting /proc/mounts and pgrep squashfuse while `apptainer exec` is live: - Sockeye 1.3.1 + setuid → SIF kernel-mount, sandbox kernel-overlay. Both safe. Verdict: use-sif (no sandbox built). - fir 1.3.5 no setuid → SIF squashfuse_ll (works under sbatch), sandbox fuse-overlayfs (SIGBUS under array contention). Verdict: use-sif (no sandbox built). - Cosmos 1.4.2 no setuid, WSL2 6.6 → SIF squashfuse_ll (Bug E.2 wedge under relay fork chain), sandbox kernel-overlay. Verdict: use-sandbox. - xps 1.4.5 no setuid, WSL2 6.6 (the original Bug E.2 host) → same as Cosmos. Verdict: use-sandbox. The decision matrix is two-axis: setuid starter-suid AND apptainer major.minor. Static signals; no apptainer exec required at probe time. - Container.MakeSandboxDecisionProbe replaces MakeNeedsSandboxProbe. Emits `use-sif` or `use-sandbox`. setuid present → use-sif. Else major>=2 || (major==1 && minor>=4) → use-sandbox; else use-sif. Empty for non-APPTAINER runtime. - Agent.Deploy(): probe → capture VERDICT → build sandbox idempotently on use-sandbox branch (`[ -d ... ] || apptainer build`); else `rm -rf` any stale sandbox dir so a flipped verdict self-heals on next deploy. The `assertive=True` `rm -rf` prefix is preserved. - Container.MakeRunCommand(local=True) ternary is unchanged. Directory presence on the target host remains the run-time signal; deploy controls the presence. - libraries.py:1383 cache check `( [ -e sif ] || [ -d sandbox ] )` unchanged; on use-sif hosts the OR collapses to the SIF check. - AGENTS.md updated: new section heading "Apptainer SIF ↔ sandbox decision" describing the two-axis probe and what each verdict does. Tests rewritten: - tests/models/test_container_sandbox.py TestProbe → TestSandboxDecisionProbe; pins setuid axis, version axis, the two verdict literals, and the empty-for-docker contract. Existing TestCachePaths / TestBuildCommand / TestRunCommandSwitch unchanged. - tests/test_deploy_sandbox_step.py rewritten to pin the new VERDICT capture, the branch on `use-sandbox`, idempotent `[ -d ...]` guard, and the else-arm `rm -rf` for the stale-sandbox case. `assertive` test preserved. 15/15 sandbox+deploy tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… markers
ExecAsync wraps the user cmd in `{ ...; } </dev/null` by default so children
(one-shot ssh, cat, read, sudo without -n, ...) inherit /dev/null as stdin
and cannot greedily consume the marker emission line off bash's stdin pipe.
This closes "known limitation hallamlab#1" from 7a8008f: Exec("ssh host cmd") and the
compound Exec("ssh host mkdir && rsync src host:dst") used by
main/local_mock/rsync.py wedged because ssh shipped the marker bytes to the
remote one-shot, where they were discarded; AwaitDone then waited on a
marker that would never arrive.
Diagnosed via /tmp/ls_probe.py against sockeye: Exec("ssh sockeye echo
from-remote") returned the expected stdout but rc=None at 30s. Confirmed
in-process: bash subshell parked in pipe_read at fresh prompt, no rsync/ssh
children, Python main thread futex-blocked in AwaitDone, NBR threads idle.
Brace group is the right wrap (not subshell parens) — it preserves env
mutations across Execs (e.g. `module load apptainer/1.3.1` in setup_commands
persists into later Execs on the same shell). `__rc=$?` on the next line
still captures the user cmd's exit code unchanged. body.rstrip() + explicit
'\n' before '}' guarantees the closing brace is its own token even for
multi-line bodies (agents.py:517's f"""...""" pattern).
inherit_stdin=True opt-in for cmds that DO need bash's real stdin — the
sub-shell-entry mechanism where the marker line is supposed to travel
through to the inner shell:
- SubShell.__enter__ (terminals.py)
- agents.py:151 — `shell.Exec(f"ssh {host}")` (Deploy's interactive ssh)
- remote.py:432 — `remote_shell.Exec(f"ssh {remote}")` (rsync _join batch)
Tests:
- 4 new local-only (mutation gates): stdin-stealing cmd doesn't block
subsequent Execs; explicit `< file` overrides the wrap; multi-line
body preserves env mutation across Execs; inherit_stdin=True passes
bash's stdin through to a child bash.
- 3 new network-gated (LIVESHELL_REMOTE_HOST=sockeye): ssh one-shot
rc=0, rc=1, and the exact ssh+rsync compound from rsync.py.
- conftest registers the `network` marker so pytest doesn't warn.
- test_pop_quiescence_under_chatty_output updated to pass
inherit_stdin=True on its bare `Exec("bash")` entry (mirroring what
SubShell.__enter__ now does internally).
Verified:
- pytest tests/test_live_shell.py: 29 local pass, 3 network pass against
sockeye.
- /tmp/ls_probe.py test B (Exec("ssh sockeye echo from-remote")): rc=0
in 0.3s (was: rc=None at 30s timeout).
- dev.sh -td sockeye: exits 0, sockeye-side terminals.py md5 matches local.
- Agent.Deploy(home=ssh://sockeye/...) end-to-end: msm, lib/agent.yml,
lib/msm_bootstrap, relay/msm_relay all land on sockeye with expected
sizes; relay binary md5 matches local target/.../msm_relay.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous comment ("remove work dirs of completed tasks to save
disk/inodes") described the opposite of what the setting actually does.
Reword to state why we keep the dirs (recoverable .command.err/.out for
failed tasks).
…g + yaml fix Patch release rolling up five dev commits: - yaml_safe_load no longer doubles newlines on round-trip (corrupts wrapped scalars) - LiveShell: inline-marker completion protocol survives sub-shells - LiveShell: SubShell(entry_cmd) context manager + Approach D rewrite, bumping version.txt to 0.18.4 - Apptainer SIF↔sandbox decision now probes at deploy: prefer SIF, build sandbox iff apptainer>=1.4 without setuid (Bug E.4 fir SIGBUS) - LiveShell: stdin-isolate user commands so ssh/cat/read can't steal markers - slurm.nf: clarify cleanup=false comment No new features — bug fixes and robustness only. release-alpha (0.19.0) remains separate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live smoke deploy against
|
Live smoke deploy against
|
| Host | apptainer | setuid | Probe verdict | Sandbox built? | Bug avoided |
|---|---|---|---|---|---|
| xpsz | 1.4.5 | no | use-sandbox |
yes | E.2 (WSL2 squashfuse_ll wedge) |
| fir | 1.3.5 | no | use-sif |
no | E.4 (fuse-overlayfs SIGBUS race) |
Both branches of the new probe exercised end-to-end on real hosts.
Summary
Patch release rolling up six commits on dev. No new features — bug fixes and robustness only. The new-feature work (caching plugin, lineage v2, axis-keyed tests, etc.) stays on
feat/release-alpha(0.19.0) and is not in this PR.9d86540Fixyaml_safe_loaddoubling newlines (was corrupting wrapped scalars in serialized libraries)7a8008fLiveShell: inline-marker completion protocol that survives sub-shellsdeab88eLiveShell:SubShell(entry_cmd)context manager — Approach D rewrite; bumpsversion.txtto 0.18.4976bdfbProbe apptainer at deploy: two-axis static check, prefer SIF, build sandbox iffapptainer>=1.4without setuid (Bug E.4 fir SIGBUS workaround that keeps Bug E.2 WSL2 path)da39ee7LiveShell: stdin-isolate user commands so ssh/cat/read can't steal markers8fc09e8slurm.nf: clarifycleanup=falsecomment (was inverted)Test plan
pytest tests/test_live_shell.py tests/test_deploy_sandbox_step.py tests/test_deploy_skip_marker.py tests/test_container_tag.py tests/test_dev_sh_tag.py tests/test_build_pip_version_split.py— 48 passed, 3 ssh-skipped-m "not docker and not nextflow and not slow", excluding e2e_agentic/e2e_virtual/integration) — 270 passed, 4 skippedmetasmith --help,transform libraries,type list,task show <key>,--json task show <key>all clean at 0.18.4deab88e(subsequent commits covered by unit tests above)vpn_ubcbounce container isn't running and starting it needs interactive credentialsArtifacts published
metasmith-0.18.4+6e7020f-py3-none-any.whlquay.io/hallamlab/metasmith:0.18.4-6e7020f+:latestTag
v0.18.4→afe580eis pushed tophy0x1a79ed/Metasmith(no upstream push permission). Please push it tohallamlab/Metasmithafter merging: