Skip to content

fix(sandbox): strip " (deleted)" suffix from unlinked /proc/<pid>/exe paths#844

Open
mjamiv wants to merge 2 commits intoNVIDIA:mainfrom
mjamiv:fix/binary-path-strip-deleted-suffix
Open

fix(sandbox): strip " (deleted)" suffix from unlinked /proc/<pid>/exe paths#844
mjamiv wants to merge 2 commits intoNVIDIA:mainfrom
mjamiv:fix/binary-path-strip-deleted-suffix

Conversation

@mjamiv
Copy link
Copy Markdown
Contributor

@mjamiv mjamiv commented Apr 15, 2026

Summary

Strip the kernel's " (deleted)" suffix from /proc/<pid>/exe readlink results in binary_path(). Without this strip, the CONNECT proxy's ancestor integrity check stat()s a literal "...sandbox (deleted)" path and denies every outbound request from affected pods — including pods whose supervisor binary was replaced by a routine docker cp hot-swap.

Related Issue

No tracking issue — filing a PR directly with reproduction in the body. Happy to also open a parent tracking issue for the broader "cluster-deploy-fast hot-swap breaks integrity cache" work if that's preferred.

Changes

  • crates/openshell-sandbox/src/procfs.rs::binary_path now detects the " (deleted)" tail on the readlink target and returns the stripped PathBuf. Doc comment expanded to spell out the failure mode and point at the follow-up work needed for full hot-swap tolerance.
  • New unit test binary_path_strips_deleted_suffix: copies /bin/sleep to a temp path, spawns a child from it, unlinks the temp binary, verifies the raw readlink ends with " (deleted)", then asserts the public API returns the stripped path.

Root cause (production reproduction)

Encountered during a cluster-deploy-fast-style supervisor hot-swap on an internal fleet this morning. The sequence:

  1. Supervisor pod running v0.0.16, PID 1 exec'd from /opt/openshell/bin/openshell-sandbox.
  2. Operator docker cp-replaces that file with a new build (unlinks the v0.0.16 inode, creates a new inode at the same path).
  3. The running pod's /proc/1/exe symlink now resolves to the literal string /opt/openshell/bin/openshell-sandbox (deleted) — this is the kernel's documented behaviour for d_path of an unlinked dentry (see fs/d_path.c :: prepend_path).
  4. Every outbound CONNECT goes through route_inference_requestcollect_ancestor_binariesbinary_pathread_link(/proc/<pid>/exe) → the tainted PathBuf with " (deleted)" suffix.
  5. identity::verify_or_cache calls std::fs::metadata on that PathBuf, which stats a path that doesn't exist, returning ENOENT.
  6. The proxy emits ancestor integrity check failed for /opt/openshell/bin/openshell-sandbox (deleted): Failed to stat /opt/openshell/bin/openshell-sandbox (deleted): No such file or directory (os error 2) and denies every outbound CONNECT destination.

Concrete OCSF events from the incident:

[sandbox] CONNECT action=deny ancestors=/usr/bin/bash -> /opt/openshell/bin/openshell-sandbox (deleted)
  binary=/usr/bin/node dst_host=slack.com dst_port=443 policy=-
  reason=ancestor integrity check failed for /opt/openshell/bin/openshell-sandbox (deleted):
    Failed to stat /opt/openshell/bin/openshell-sandbox (deleted): No such file or directory

[sandbox] CONNECT action=deny ancestors=/usr/bin/node -> /usr/bin/bash -> /opt/openshell/bin/openshell-sandbox (deleted)
  binary=/usr/bin/node dst_host=registry.npmjs.org dst_port=443 policy=-
  reason=ancestor integrity check failed for ...

[sandbox] FORWARD action=deny ancestors=/usr/bin/bash -> /opt/openshell/bin/openshell-sandbox (deleted)
  binary=/usr/bin/node dst_host=169.254.169.254 dst_port=80 method=PUT path=/latest/api/token
  reason=ancestor integrity check failed for ...

Pre-established TCP tunnels (e.g. Slack Socket Mode over WSS) kept working because they don't re-enter the proxy per-message; any NEW REST call, npm install, or IMDS probe failed immediately.

Scope of this PR

This fix alone does not make BinaryIdentityCache tolerant of legitimate binary replacements. With the strip in place, verify_or_cache successfully stats the path and hashes whatever now lives there — if the file has been replaced, the hash differs from the cached hash and verify_or_cache returns Binary integrity violation. That's a more useful failure than Failed to stat ... (deleted) and it gives operators an actionable signal, but it still denies the connection.

Fully unblocking the dev workflow documented in architecture/build-containers.md — where docker cp is explicitly the supported fast-upgrade path for the supervisor binary — needs one of:

  1. Read running-binary content from /proc/<pid>/exe directly via File::open; procfs resolves this to the live in-memory executable even when the original inode has been unlinked. The identity cache would verify the bytes the process is actually running, not whatever happens to sit at the path.
  2. Key the identity cache by exec device + inode (which remains stable across a path-level replacement for the currently-running process) instead of path string.
  3. Add an explicit "reload" hook that clears the cache entry for a replaced binary after the replacement is acknowledged by the operator.

I have a working sketch for option (1) and am happy to send it as a follow-up PR once we agree on the approach. Filing this narrow fix first because:

  • It stands on its own — the " (deleted)" suffix leaking into cache keys, logs, and error messages is a bug regardless of the broader hot-swap question.
  • It turns a cryptic "can't find the binary" error into a clearer "integrity violation" error, which unblocks operator diagnosis during incidents.
  • It has a deterministic unit test.

Testing

  • cargo test -p openshell-sandbox --lib procfs::tests::binary_path_strips_deleted_suffix passes (new test; exercises the real /proc/<pid>/exe readlink path via a temp binary + unlink)
  • cargo test -p openshell-sandbox --lib full suite passes (499 tests, 0 failures, 2 ignored)
  • cargo fmt -p openshell-sandbox -- --check clean
  • Manually verified the new test fails cleanly on master (unreachable! replaced by assert_eq! produces left: "/tmp/...deleted-sleep (deleted)", right: "/tmp/...deleted-sleep") — the test is a true red/green signal for the strip.
  • mise run pre-commitmise not available in this environment; ran cargo fmt --check + cargo test + cargo clippy for the affected crate by hand. No new clippy findings in the touched file.

Checklist

  • Follows Conventional Commitsfix(sandbox): …
  • Commit is signed off (DCO)
  • Architecture docs updated — not applicable. architecture/build-containers.md already documents docker cp as the hot-swap path; this PR restores a small part of that promise but doesn't change the contract.

… paths

When a running binary is unlinked from its filesystem path — the common
case is a `docker cp` hot-swap of `/opt/openshell/bin/openshell-sandbox`
during the `cluster-deploy-fast` dev upgrade workflow — the Linux kernel
appends the literal string ` (deleted)` to the `/proc/<pid>/exe` readlink
target. The tainted `PathBuf` then flows into `collect_ancestor_binaries`
and on into `BinaryIdentityCache::verify_or_cache`, which tries to
`std::fs::metadata` the path. `stat()` fails with `ENOENT` because the
literal suffix isn't a real filesystem path, and the CONNECT proxy denies
every outbound request with:

    ancestor integrity check failed for \
      /opt/openshell/bin/openshell-sandbox (deleted): \
      Failed to stat ...: No such file or directory (os error 2)

Reproduced in production 2026-04-15: a cluster-deploy-fast-style hot-swap
of the supervisor binary caused every pod whose PID 1 held the now-deleted
inode to deny ALL outbound CONNECTs (slack.com, registry.npmjs.org,
169.254.169.254, etc.), breaking Slack REST delivery, npm installs, and
IMDS probes simultaneously. Existing pre-hot-swap TCP tunnels (e.g. Slack
Socket Mode WSS) kept working because they never re-evaluate the proxy.

Strip the suffix in `binary_path()` so downstream callers see the clean,
grep-friendly path. This aligns the cache key and log messages with the
original on-disk location.

Note: stripping the suffix does NOT by itself make the identity cache
tolerant of a legitimate binary replacement — `verify_or_cache` will now
`stat` and hash whatever currently lives at the stripped path, which is
the NEW binary, and surface a clearer `Binary integrity violation` error.
Fully unblocking the cluster-deploy-fast hot-swap workflow needs a
follow-up that either (a) reads running-binary content from
`/proc/<pid>/exe` directly via `File::open` (procfs resolves this to the
live in-memory executable even when the original inode has been unlinked),
or (b) keys the identity cache by exec dev+inode instead of path. Happy to
send that as a separate PR once the approach is decided — filing this
narrow fix first because it stands on its own: it fixes a concrete
misleading error and unblocks the obvious next step.

Added `binary_path_strips_deleted_suffix` test that copies `/bin/sleep`
to a temp path, spawns a child from it, unlinks the temp binary, verifies
the raw readlink contains the ` (deleted)` suffix, then asserts the public
API returns the stripped path.

Signed-off-by: mjamiv <michael.commack@gmail.com>
@mjamiv mjamiv requested a review from a team as a code owner April 15, 2026 07:09
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mjamiv
Copy link
Copy Markdown
Contributor Author

mjamiv commented Apr 15, 2026

I have read the DCO document and I hereby sign the DCO.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 15, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@johntmyers
Copy link
Copy Markdown
Collaborator

recheck

Comment thread crates/openshell-sandbox/src/procfs.rs Outdated

// Strip the " (deleted)" suffix the kernel appends for unlinked binaries.
// See the doc comment above for the full rationale.
if let Some(stripped) = target.to_str().and_then(|s| s.strip_suffix(" (deleted)")) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too aggressive for a trusted identity source. It rewrites real executable names that literally end with (deleted), and it skips deleted paths that are not valid UTF-8. Since identity.rs (line 105) hashes whatever binary_path() returns, this can attribute a process to the wrong binary.

suggest something like:

use std::ffi::OsString;
use std::io::ErrorKind;
use std::os::unix::ffi::{OsStrExt, OsStringExt};

const DELETED_SUFFIX: &[u8] = b" (deleted)";

let target = std::fs::read_link(&link).map_err(|e| {
    miette::miette!(
        "Failed to read /proc/{pid}/exe: {e}. \
         Cannot determine binary identity — denying request. \
         Hint: the proxy may need CAP_SYS_PTRACE or to run as the same user."
    )
})?;

let raw_target_missing =
    matches!(std::fs::metadata(&target), Err(err) if err.kind() == ErrorKind::NotFound);

let bytes = target.as_os_str().as_bytes();
if raw_target_missing && bytes.ends_with(DELETED_SUFFIX) {
    let stripped = bytes[..bytes.len() - DELETED_SUFFIX.len()].to_vec();
    return Ok(PathBuf::from(OsString::from_vec(stripped)));
}

Ok(target)
  • does not rewrite a live executable that could literally be named * (deleted) ... unlikely to exist but is the case here
  • handles UTF8 paths
  • strips exactly one kernel added-suffix

suggest a couple of new tests:

  • A live executable whose basename already ends with (deleted) is returned unchanged.
  • An unlinked executable with a non-UTF-8 filename still strips exactly one suffix.

@johntmyers
Copy link
Copy Markdown
Collaborator

I think the new test is useful, but it stops one layer short of the behavior that actually regressed in production. The failure path was resolve_tcp_peer_identity() -> BinaryIdentityCache::verify_or_cache() inside evaluate_opa_tcp(), not just binary_path() returning a suffixed path. As written, this test can pass even if CONNECT requests still fail in the hot-swap scenario.

Could we add a higher-level Linux-only regression that exercises that chain and asserts the new behavior is Binary integrity violation rather than Failed to stat ... (deleted)? If that is awkward to reach through evaluate_opa_tcp() directly, I’d suggest factoring the identity-resolution block into a small helper and testing that helper instead.

suggested test flow:

  1. Extract the identity-resolution block from evaluate_opa_tcp() into a small private helper.

    The relevant block is the code in crates/openshell-sandbox/src/proxy.rs from the resolve_tcp_peer_identity() call through the ancestor walk / verify_or_cache() calls. Have that helper return the resolved binary path, pid, ancestors, cmdline paths, and hash, or surface the same error that currently becomes a deny reason.

  2. Add a Linux-only regression test in crates/openshell-sandbox/src/proxy.rs that exercises the real identity path.

    Suggested shape:

    • start a TcpListener in the test process
    • copy /bin/bash to a temp path
    • prime BinaryIdentityCache with that temp binary
    • spawn the temp bash as a parent process and have it launch a child that opens a real TCP connection to the listener
    • replace the temp bash on disk with different bytes to simulate docker cp
    • run the new helper, or evaluate_opa_tcp() with a trivial allow-all OPA engine
  3. Assert the contract this PR is trying to improve.

    Since this PR is intentionally narrow, I think the test should assert that:

    • the resolved path no longer contains (deleted)
    • the failure is now Binary integrity violation
    • it is no longer Failed to stat ... (deleted)

    That would make the regression coverage much more complete without claiming this PR fully fixes the broader hot-swap workflow.

…d-to-end

Address review feedback from @johntmyers on the initial version of this PR:

procfs::binary_path
- Only strip the kernel's " (deleted)" suffix when stat() on the raw
  readlink target reports NotFound. A live executable whose basename
  literally ends with " (deleted)" is now returned unchanged instead of
  being silently truncated, which matters because identity.rs hashes
  whatever this function returns as a trust anchor.
- Operate on raw bytes via OsStrExt, so filenames that are not valid
  UTF-8 still get exactly one suffix stripped. The previous
  strip_suffix-on-&str path skipped non-UTF-8 entirely and fell through
  to returning the tainted path.
- Expand the doc comment to describe both guardrails.

procfs tests
- binary_path_preserves_live_deleted_basename: copy /bin/sleep to a
  live file literally named "sleepy (deleted)", spawn it, and assert
  that the returned path still ends with " (deleted)".
- binary_path_strips_suffix_for_non_utf8_filename: exec a binary whose
  basename contains a 0xFF byte, unlink it, and assert that
  binary_path returns the stripped non-UTF-8 path. Writes the bytes
  with OpenOptions + sync_all + explicit drop so the write fd is fully
  released before exec() to avoid ETXTBSY under concurrent tests.

proxy: extract resolve_process_identity helper
- Pull the peer-resolution + TOFU verify + ancestor walk + cmdline
  collection block out of evaluate_opa_tcp into resolve_process_identity.
- Introduce IdentityError which carries the deny reason along with
  whatever partial identity data was resolved before the failure so
  evaluate_opa_tcp can thread that into ConnectDecision unchanged.
- evaluate_opa_tcp now calls the helper and proceeds straight to the
  OPA evaluate step; the surface visible to OPA and OCSF is unchanged.

proxy: end-to-end regression test for the hot-swap contract
- resolve_process_identity_surfaces_binary_integrity_violation_on_hot_swap
  stands up a real TcpListener, copies /bin/bash to a temp path,
  primes BinaryIdentityCache with that binary, spawns bash with a
  /dev/tcp one-liner that opens a real connection to the listener,
  and captures the peer's ephemeral port from accept().
- Simulates `docker cp` correctly: unlink the running binary (which
  persists via the child's exec mapping) and create a fresh file
  with different bytes at the same path. Writing in place is
  rejected by the kernel with ETXTBSY, so the old single-inode
  approach did not actually model the production failure mode.
- Asserts the error returned by resolve_process_identity contains
  "Binary integrity violation" (from BinaryIdentityCache) and does
  NOT contain "Failed to stat" or "(deleted)" — the pre-PR-NVIDIA#844
  failure mode. The binary field on the error is populated and is
  free of the tainted suffix.
- Skips cleanly if /bin/bash is not installed. Child process is
  always reaped before the assertion block so a failure does not
  leak a sleeping process.
@mjamiv
Copy link
Copy Markdown
Contributor Author

mjamiv commented Apr 15, 2026

Thanks John — both points addressed in 9022f0f, pushed to the branch.

binary_path() (re: the line comment on procfs.rs:55)

Adopted your snippet essentially verbatim. The strip is now guarded by a stat()-returns-NotFound check on the raw readlink target and operates on raw bytes via OsStrExt / OsStringExt, so:

  • a live executable whose basename literally ends with " (deleted)" is returned unchanged (trusted identity source stays intact), and
  • non-UTF-8 filenames still get exactly one suffix stripped (the old .to_str().and_then(strip_suffix) path silently fell through and returned the tainted path).

Added the two unit tests you suggested:

  • binary_path_preserves_live_deleted_basename — copies /bin/sleep to sleepy (deleted), spawns it while the file is still linked, and asserts the resolved path still ends with " (deleted)".
  • binary_path_strips_suffix_for_non_utf8_filename — builds a filename containing 0xFF, execs it, unlinks, and asserts the resolved path is the stripped non-UTF-8 path.

Small implementation wrinkle worth flagging: the non-UTF-8 test initially hit ETXTBSY on spawn under concurrent test parallelism when I used std::fs::copy. Switching to OpenOptions::new().write(true).create_new(true).mode(0o755) + write_all + explicit sync_all() + scope-drop before spawn made it deterministic. The raw bytes are emitted via OsString::from_vec so the filename survives round-tripping through Path.

proxy.rs (re: the PR-level comment on the identity chain)

You were right — the original binary_path_strips_deleted_suffix test stopped at the first layer and could have passed while CONNECT was still broken end-to-end. Refactored along the lines you suggested:

  • Extracted the identity-resolution block from evaluate_opa_tcp into a new private helper resolve_process_identity(entrypoint_pid, peer_port, &identity_cache)Result<ResolvedIdentity, IdentityError>. The helper owns the resolve_tcp_peer_identityverify_or_cache(bin_path)collect_ancestor_binariesverify_or_cache(ancestor)collect_cmdline_paths chain. IdentityError carries the deny reason plus whatever partial identity data was resolved before the failure, so evaluate_opa_tcp threads it straight into ConnectDecision with no behavioral change visible to OPA or OCSF.
  • Added resolve_process_identity_surfaces_binary_integrity_violation_on_hot_swap as a Linux-only integration test in proxy.rs. The shape matches your sketch:
    1. TcpListener::bind("127.0.0.1:0") in the test process.
    2. Copy /bin/bash to <tempdir>/hotswap-bash.
    3. BinaryIdentityCache::verify_or_cache(&hotswap_bash) to prime the cache with the v1 hash.
    4. Command::new(&hotswap_bash).args(["-c", "exec 3<>/dev/tcp/127.0.0.1/$PORT; sleep 30 <&3"]) to get a real established peer socket from a process whose /proc/<pid>/exe points at the temp binary.
    5. listener.accept() captures peer_addr.port() — that's what resolve_process_identity passes to parse_proc_net_tcp.
    6. Simulate docker cp: unlink the running binary (which persists via the child's exec mapping) and create a fresh file with different bytes at the same path. I initially wrote it as an in-place fs::write like your sketch, but the kernel rejects that with ETXTBSY because the inode is still being executed — the unlink + create-new pattern is what docker cp actually does, and it's the pattern that reproduces production correctly.
    7. Call resolve_process_identity(test_pid, peer_port, &cache) and assert:
      • err.reason.contains("Binary integrity violation")
      • !err.reason.contains("Failed to stat")
      • !err.reason.contains("(deleted)")
      • err.binary is populated and the path is free of the tainted suffix

Skips cleanly if /bin/bash isn't present (minimal containers). The child process is always killed + reaped before the assertion block so a failure doesn't leak a sleeping process across test runs.

Tests green locally:

cargo test -p openshell-sandbox --lib
...
test result: ok. 502 passed; 0 failed; 2 ignored; 0 measured

cargo fmt --all -- --check clean, cargo clippy -p openshell-sandbox --all-targets clean on the touched files (the remaining 261 warnings are all pre-existing on unrelated code).

Happy to iterate further if you'd rather see the helper wrapped differently (e.g. returning ConnectDecision directly instead of a separate IdentityError), or if you'd like me to split the helper extraction into its own commit for easier review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants