fix: drop containerd config mounts when NRI is enabled#2514
Conversation
|
Relates to: NVIDIA/nvidia-container-toolkit#1861 |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds NRI-plugin-aware runtime transformation so the toolkit installer avoids mounting/modifying runtime config on read-only hosts (e.g., Talos), and updates unit tests to cover the new behavior.
Changes:
- Add early-return NRI plugin path in
transformForRuntimethat mounts only the NRI socket. - Extract runtime config/socket mounting logic into
transformRuntimeConfigAndSocketMounts. - Extend
TestTransformForRuntimewith per-test ClusterPolicy input and a new “containerd with NRI plugin enabled” case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| controllers/transforms_test.go | Adds a new NRI-enabled containerd test case and allows per-test ClusterPolicy inputs. |
| controllers/object_controls.go | Skips runtime config/socket mounts when NRI plugin is enabled; factors mount logic into a helper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7d78a7c to
3383f08
Compare
|
/ok to test 3383f08 |
|
@tariq1890 requesting your review |
3383f08 to
c3feb2e
Compare
|
/ok to test c3feb2e |
|
Thank you very much @frezbo for your contribution here! |
When NRI is enabled there's no need to setup containerd config file path mounts as the installer is a no-op. This fixes issues on immutable hosts where the containerd paths are non-writable and prevents the NRI pod from coming up since kubelet tries to create mount-points for those. Part of upstream discusssion at: NVIDIA#2385 Signed-off-by: Noel Georgi <git@frezbo.dev>
c3feb2e to
543bf7a
Compare
| // transformRuntimeConfigAndSocketMounts configures the toolkit container with the | ||
| // mounts and environment required for the toolkit installer to update the container | ||
| // runtime configuration (top-level config, drop-in config, and runtime socket). This | ||
| // is not required when the NRI plugin is enabled, since the installer does not modify | ||
| // the runtime configuration in that mode. |
There was a problem hiding this comment.
Let's leave out the parts that are more or less obvious.
| // transformRuntimeConfigAndSocketMounts configures the toolkit container with the | |
| // mounts and environment required for the toolkit installer to update the container | |
| // runtime configuration (top-level config, drop-in config, and runtime socket). This | |
| // is not required when the NRI plugin is enabled, since the installer does not modify | |
| // the runtime configuration in that mode. | |
| // transformRuntimeConfigAndSocketMounts configures the toolkit container with the | |
| // mounts and environment required for the toolkit installer to update the container | |
| // runtime configuration (top-level config, drop-in config, and runtime socket). |
| // hostPath mounts (e.g. /etc/containerd/conf.d with DirectoryOrCreate) break on | ||
| // read-only hosts such as Talos. Only the NRI socket mount (below) is required. | ||
| if config.CDI.IsNRIPluginEnabled() { | ||
| // setup mounts for the runtime NRI socket file |
There was a problem hiding this comment.
Why not move L1389-L1401 to a private the same way as you have done in transformRuntimeConfigAndSocketMounts?
| testCases := []struct { | ||
| description string | ||
| runtime gpuv1.Runtime | ||
| clusterPolicy *gpuv1.ClusterPolicySpec |
There was a problem hiding this comment.
Instead of modifying the test case struct here, you can add a new test case to the TestTransformToolkit method in object_controls_test.go. The test case struct over there already has the ClusterPolicySpec object as a struct field.
When NRI is enabled there's no need to setup containerd config file path mounts as the installer is a no-op. This fixes issues on immutable hosts where the containerd paths are non-writable and prevents the NRI pod from coming up since kubelet tries to create mount-points for those.
Part of upstream discusssion at: #2385