
[WIP][DO NOT MERGE] feat(kata): hacky KataAgent backend — run a Backend.AI session in a Kata VM (PoC)#12330

Draft
hhoikoo wants to merge 3 commits into main from feat/kata-agent-mvp-wip

@hhoikoo hhoikoo commented Jun 21, 2026

Member

⚠️ WIP — DO NOT MERGE. This is an exploratory MVP / proof-of-concept branch, pushed for visibility and review only. It is not intended to be merged. It intentionally does not reference or resolve any GitHub/Jira issue — it ties to no issue.

What this is

A hacky KataAgent backend that runs a real Backend.AI compute session inside a Kata Containers lightweight VM, as a low-friction precursor to the full BEP-1051 design (deliberately no containerd-gRPC client / CoCo / VFIO / Calico).

KataAgent(DockerAgent) reuses the entire create_kernel pipeline (scratch/config generation, krunner injection, vfolder mounts, resource specs, image scan, the ZMQ kernel-runner contract). The only functional delta is translating the already-assembled container_config dict into a nerdctl run --runtime io.containerd.kata.v2 invocation instead of two aiodocker calls.
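To make the translation step concrete, here is a minimal sketch of how a Docker-style container_config dict could be turned into a nerdctl run argument list. It is not the code in agent/kata/nerdctl.py: the function name, the subset of keys handled (Image, Cmd, Env, HostConfig.Binds, HostConfig.PortBindings), and the container name argument are assumptions chosen for illustration.

```python
from typing import Any

KATA_RUNTIME = "io.containerd.kata.v2"


def container_config_to_nerdctl_argv(
    config: dict[str, Any],
    name: str,
    nerdctl_bin: str = "nerdctl",
) -> list[str]:
    """Build a `nerdctl run` argv from a Docker-style config (hypothetical helper)."""
    host_config = config.get("HostConfig", {})
    # Kernels run detached; OpenStdin is deliberately not mapped to -i.
    argv = [nerdctl_bin, "run", "-d", "--runtime", KATA_RUNTIME, "--name", name]

    for env in config.get("Env", []):  # entries look like "KEY=value"
        argv += ["-e", env]

    for bind in host_config.get("Binds", []):  # "host:container[:mode]"
        argv += ["-v", bind]

    # Docker shape: {"2000/tcp": [{"HostIp": "127.0.0.1", "HostPort": "30000"}]}
    for container_port, bindings in host_config.get("PortBindings", {}).items():
        for binding in bindings or []:
            publish = f"{binding['HostPort']}:{container_port}"
            if binding.get("HostIp"):
                publish = f"{binding['HostIp']}:{publish}"
            argv += ["-p", publish]

    argv.append(config["Image"])
    argv += config.get("Cmd") or []
    return argv
```

Keeping the translation a pure function (dict in, argv out) is what lets it be unit-tested without a Kata host; the subprocess call stays in a separate thin shim.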

What it changes

  • agent/types.py — add AgentBackend.KATA.
  • agent/config/unified.py — KATA shares Docker's container-config validation.
  • new agent/kata/ package — nerdctl.py (pure container_config→nerdctl translation + subprocess shims), agent.py (KataAgent + KataKernelCreationContext), kernel.py (KataKernel), README.md.
  • agent/errors/kata.py: NerdctlError, KataVolumeResolutionError.
  • tests/agent/kata/ — unit tests for the translation.

Enable with [agent] backend = "kata".

Validation (live)

Deployed the full dev stack on a nested-KVM Kata host and ran a real session end-to-end: session create → code exec inside the microVM (guest kernel 6.18.28 != host 6.8.0-124) → file I/O over the rw virtio-fs scratch → destroy with full VM reap; survives the registry-reconciliation loop with 0 dangling evictions. The live run surfaced and fixed 4 integration bugs that unit tests could not catch: nerdctl rejecting -i combined with -d; the seccomp profile passed as inline JSON where nerdctl expects a file path; a missing cleanup guard for an empty container ID; and enumerate_containers scanning the moby namespace, which caused live kernels to be evicted as dangling and their VMs to leak.
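Two of these fixes, the seccomp profile and the empty-cid guard, can be sketched roughly as follows. This is an illustration under assumptions, not the branch's code: the helper names and the seccomp.json file name are hypothetical, and the sketch only handles the seccomp=<value> form of a SecurityOpt entry.

```python
import json
from pathlib import Path
from typing import Optional


def externalize_seccomp(security_opts: list[str], config_dir: Path) -> list[str]:
    """Rewrite inline `seccomp=<json>` options to `seccomp=<file path>`."""
    result = []
    for opt in security_opts:
        key, sep, value = opt.partition("=")
        if key == "seccomp" and sep and value.lstrip().startswith("{"):
            json.loads(value)  # fail early if the inline profile is malformed
            profile_path = config_dir / "seccomp.json"  # hypothetical file name
            profile_path.write_text(value)
            opt = f"seccomp={profile_path}"
        # Non-JSON values such as "seccomp=unconfined" pass through unchanged.
        result.append(opt)
    return result


def has_container(container_id: Optional[str]) -> bool:
    """True only for a non-empty container ID (rejects both None and '')."""
    return bool(container_id)
```

A cleanup path would then check has_container(cid) before running nerdctl stop or nerdctl logs, so that a create which failed before producing a container ID does not raise a second error during teardown.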

Why this must NOT be merged

It is a subprocess shim with intentional degradations: a 5 s nerdctl ps liveness poller instead of an events stream, stubbed stats, unimplemented commit, rootful nerdctl via a sudo wrapper, and no port reconstruction on restart or overhead accounting. Production should graduate to a native containerd-gRPC backend. Full degradation list: src/ai/backend/agent/kata/README.md.
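For readers unfamiliar with the polling degradation, a liveness poller of this kind might look like the sketch below. All function names are hypothetical and the real poller may differ; the sketch assumes container IDs are compared in full (hence --no-trunc) and that a tracked ID missing from nerdctl ps output means the container has exited.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def list_live_container_ids(nerdctl_bin: str = "nerdctl") -> set[str]:
    """Return the full IDs of running containers via `nerdctl ps -q --no-trunc`."""
    proc = await asyncio.create_subprocess_exec(
        nerdctl_bin, "ps", "-q", "--no-trunc",
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"nerdctl ps exited with {proc.returncode}")
    return set(stdout.decode().split())


def find_exited(tracked: Iterable[str], live: set[str]) -> set[str]:
    """Tracked container IDs that no longer appear in the live set."""
    return set(tracked) - live


async def poll_liveness(
    tracked: set[str],
    on_exit: Callable[[str], Awaitable[None]],
    list_live: Callable[[], Awaitable[set[str]]] = list_live_container_ids,
    interval: float = 5.0,
) -> None:
    """Poll until every tracked container has exited, reporting each exit once."""
    while tracked:
        for cid in find_exited(tracked, await list_live()):
            tracked.discard(cid)
            await on_exit(cid)
        if tracked:
            await asyncio.sleep(interval)
```

The trade-off against an events stream is latency and load: an exit is noticed up to one interval late, and every tick spawns a subprocess regardless of whether anything changed.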

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version — N/A (PoC, not for merge)
  • Mention to the original issue — N/A (ties to no issue, by design)
  • Installer updates (db fixtures / mandatory config) — N/A (no schema change; KATA is an opt-in value of the existing [agent] backend)
  • End-to-end CLI integration tests in ai.backend.test — not done (PoC)
  • API server-client counterparts — N/A
  • Test case(s) demonstrating the flow with a concrete implementation — tests/agent/kata/test_nerdctl_translate.py
  • Documentation — src/ai/backend/agent/kata/README.md (enable steps + degradations)

hhoikoo added 3 commits June 21, 2026 19:08
Live validation on kata-lab-150 (10.100.64.150, Kata 3.31.0) proved all
load-bearing host primitives (microVM boot, 127.0.0.1 repl port publish,
krunner bind->virtio-fs) and corrected the file-IO design:

- tar over 'nerdctl exec -i' hangs and poisons the container exec channel;
  bulk cat over exec -i truncates. The rw virtio-fs scratch share round-trips
  1 MiB byte-exact in both directions.
- accept_file: revert to inherited DockerKernel host-side scratch write.
- download_file/download_single: read host-side scratch instead of exec-tar.
- nerdctl_exec: document that -i bulk stdin is unsafe on Kata.

Findings + progress files updated outside the repo.
…-150

Stood up the full Backend.AI dev stack on the live Kata host and created a real
compute session inside a Kata microVM. Four integration bugs surfaced that unit
tests could not catch, each fixed:

1. nerdctl rejects '-i' and '-d' together: do not translate OpenStdin to -i
   (kernels run detached and talk over ZMQ, not container stdin).
2. seccomp profile: Docker inlines it as JSON in SecurityOpt, but nerdctl's
   --security-opt seccomp= wants a FILE PATH ('file name too long'). Externalize
   the inline profile to the kernel config dir.
3. destroy_kernel/clean_kernel: guard empty (not just None) container_id so a
   create that fails before producing a cid does not error on nerdctl stop/logs.
4. enumerate_containers: DockerAgent lists the 'moby' namespace, which never
   has Kata kernels, so the registry-reconciliation loop evicted live kernels as
   'dangling' and leaked the VM. Override to enumerate via nerdctl/containerd.

Proven: session create -> code exec (guest kernel 6.18.28 != host 6.8.0-124) ->
file IO over virtio-fs scratch -> destroy with full VM reap; survives the 60s
reconciliation loop with 0 dangling evictions. Operational note: the agent runs
as a non-root user, so nerdctl must target the rootful containerd via a
'sudo nerdctl' wrapper (BACKENDAI_NERDCTL_BIN).
github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels on Jun 21, 2026