[WIP][DO NOT MERGE] feat(kata): hacky KataAgent backend — run a Backend.AI session in a Kata VM (PoC)#12330
Draft
hhoikoo wants to merge 3 commits into
Live validation on kata-lab-150 (10.100.64.150, Kata 3.31.0) proved all load-bearing host primitives (microVM boot, 127.0.0.1 repl port publish, krunner bind->virtio-fs) and corrected the file-IO design:
- tar over 'nerdctl exec -i' hangs and poisons the container exec channel; bulk cat over exec -i truncates. The rw virtio-fs scratch share round-trips 1 MiB byte-exact in both directions.
- accept_file: revert to inherited DockerKernel host-side scratch write.
- download_file/download_single: read host-side scratch instead of exec-tar.
- nerdctl_exec: document that -i bulk stdin is unsafe on Kata.
Findings + progress files updated outside the repo.
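A minimal sketch of the host-side read described above: because the scratch directory is shared with the guest over virtio-fs, a download can read the file directly on the host instead of streaming it through an exec channel. The function name `read_from_scratch`, its parameters, and the size cap are hypothetical and only illustrate the idea; they are not taken from the PR.

```python
from pathlib import Path


def read_from_scratch(scratch_root: Path, relpath: str, max_bytes: int = 1 << 20) -> bytes:
    """Read a file from the host side of the shared scratch directory.

    Hypothetical helper: names and the 1 MiB default cap are assumptions.
    """
    root = scratch_root.resolve()
    target = (root / relpath).resolve()
    # Refuse paths that escape the scratch share (for example via "..").
    if not target.is_relative_to(root):
        raise PermissionError(f"{relpath!r} is outside the scratch directory")
    # Check the size before reading so an oversized file is never loaded.
    if target.stat().st_size > max_bytes:
        raise ValueError(f"{relpath!r} exceeds {max_bytes} bytes")
    return target.read_bytes()
```

The path check matters because the guest controls the contents of the share; resolving both paths before comparing them also rejects symlinks that point outside the scratch root.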
…-150
Stood up the full Backend.AI dev stack on the live Kata host and created a real
compute session inside a Kata microVM. Four integration bugs surfaced that unit
tests could not catch, each fixed:
1. nerdctl rejects '-i' and '-d' together: do not translate OpenStdin to -i
(kernels run detached and talk over ZMQ, not container stdin).
2. seccomp profile: Docker inlines it as JSON in SecurityOpt, but nerdctl's
--security-opt seccomp= wants a FILE PATH ('file name too long'). Externalize
the inline profile to the kernel config dir.
3. destroy_kernel/clean_kernel: guard empty (not just None) container_id so a
create that fails before producing a cid does not error on nerdctl stop/logs.
4. enumerate_containers: DockerAgent lists the 'moby' namespace, which never
has Kata kernels, so the registry-reconciliation loop evicted live kernels as
'dangling' and leaked the VM. Override to enumerate via nerdctl/containerd.
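Fixes 1 and 2 can be sketched roughly as follows, assuming a Docker-API-style `container_config` dict with `OpenStdin`, `Image`, and `HostConfig.SecurityOpt` keys. The function name `translate_flags` and the `seccomp.json` file name are hypothetical; the PR's actual translation lives in `nerdctl.py` and may differ.

```python
from pathlib import Path
from typing import Any


def translate_flags(container_config: dict[str, Any], config_dir: Path) -> list[str]:
    """Build a partial `nerdctl run` argument list (illustrative only)."""
    args = ["run", "-d", "--runtime", "io.containerd.kata.v2"]
    # Fix 1: OpenStdin is deliberately not mapped to -i, because the kernel
    # runs detached and nerdctl rejects -i together with -d.
    host_config = container_config.get("HostConfig", {})
    for opt in host_config.get("SecurityOpt", []):
        key, sep, value = opt.partition("=")
        # Fix 2: an inline JSON seccomp profile is written to a file and
        # passed by path, since nerdctl expects a file path here.
        if key == "seccomp" and sep and value.lstrip().startswith("{"):
            profile_path = config_dir / "seccomp.json"  # hypothetical name
            profile_path.write_text(value)
            opt = f"seccomp={profile_path}"
        args += ["--security-opt", opt]
    args.append(container_config["Image"])
    return args
```

Options that are not inline seccomp profiles pass through unchanged, so the sketch only alters the one case that triggered the 'file name too long' error.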
Proven: session create -> code exec (guest kernel 6.18.28 != host 6.8.0-124) ->
file IO over virtio-fs scratch -> destroy with full VM reap; survives the 60s
reconciliation loop with 0 dangling evictions. Operational note: the agent runs
as a non-root user, so nerdctl must target the rootful containerd via a
'sudo nerdctl' wrapper (BACKENDAI_NERDCTL_BIN).
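One way the operational note could look in code, as a hedged sketch: the command prefix is taken from `BACKENDAI_NERDCTL_BIN` and falls back to plain `nerdctl`. Splitting the value shell-style so that it can hold `sudo nerdctl` is an assumption made here; the PR may instead expect the variable to name a single wrapper executable. The helper name `nerdctl_command` is hypothetical.

```python
import os
import shlex


def nerdctl_command(*args: str) -> list[str]:
    """Return an argv list for nerdctl, honoring BACKENDAI_NERDCTL_BIN.

    Assumption: the variable may contain several words (e.g. "sudo nerdctl"),
    so it is split with shell-like rules before the arguments are appended.
    """
    prefix = shlex.split(os.environ.get("BACKENDAI_NERDCTL_BIN", "nerdctl"))
    return [*prefix, *args]
```

Returning an argv list rather than a single string keeps the result safe to pass to `subprocess` or `asyncio.create_subprocess_exec` without a shell.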
What this is
A hacky KataAgent backend that runs a real Backend.AI compute session inside a Kata Containers lightweight VM, as a low-friction precursor to the full BEP-1051 design (deliberately no containerd-gRPC client / CoCo / VFIO / Calico). KataAgent(DockerAgent) reuses the entire create_kernel pipeline (scratch/config generation, krunner injection, vfolder mounts, resource specs, image scan, the ZMQ kernel-runner contract). The only functional delta is translating the already-assembled container_config dict into a nerdctl run --runtime io.containerd.kata.v2 invocation instead of two aiodocker calls.

What it changes
- agent/types.py: add AgentBackend.KATA.
- agent/config/unified.py: KATA shares Docker's container-config validation.
- agent/kata/ package: nerdctl.py (pure container_config → nerdctl translation + subprocess shims), agent.py (KataAgent + KataKernelCreationContext), kernel.py (KataKernel), README.md.
- agent/errors/kata.py: NerdctlError, KataVolumeResolutionError.
- tests/agent/kata/: unit tests for the translation.

Enable with [agent] backend = "kata".

Validation (live)
Deployed the full dev stack on a nested-KVM Kata host and ran a real session end-to-end: session create → code exec inside the microVM (guest kernel 6.18.28 != host 6.8.0-124) → file I/O over the rw virtio-fs scratch → destroy with full VM reap; survives the registry-reconciliation loop with 0 dangling evictions. The live run surfaced and fixed 4 integration bugs unit tests could not catch (nerdctl -i + -d; inline-JSON vs file-path seccomp; empty-cid cleanup guard; enumerate_containers scanning the moby namespace -> VM leak).

Why this must NOT be merged
It is a subprocess shim with intentional degradations: a 5 s nerdctl ps liveness poller instead of an events stream, stats stubbed, commit unimplemented, rootful nerdctl via a sudo wrapper, port reconstruction on restart and overhead accounting missing. Production should graduate to a native containerd-gRPC backend. Full degradation list: src/ai/backend/agent/kata/README.md.

Checklist: (if applicable)
- [agent] backend)
- ai.backend.test: not done (PoC)
- tests/agent/kata/test_nerdctl_translate.py
- src/ai/backend/agent/kata/README.md (enable steps + degradations)