Add GPUClusterConfig CRD and controller for DRA-based stack#2513

Draft
karthikvetrivel wants to merge 4 commits into NVIDIA:main from karthikvetrivel:kv-gpuclusterconfig-crd

Member

@karthikvetrivel karthikvetrivel commented Jun 2, 2026

1. Overview

We introduce a new CRD named GPUClusterConfig and a new controller for reconciling it. Like ClusterPolicy today, it is a singleton, cluster-scoped CRD that configures the operands needed to enable GPUs in Kubernetes. GPUClusterConfig represents the new DRA-based software-enablement stack; it is an evolution of ClusterPolicy.

Change Log

9f08bec:

  • Defined the GPUClusterConfig Go types in api/nvidia/v1alpha1 (cluster-scoped, singleton), with kubebuilder validation and default markers for every operand block. Wired AddToScheme. Generated the CRD manifest and deepcopy code.
  • Tested: make manifests generate produces the CRD YAML and deepcopy code, and applying the CRD with kubectl apply succeeds.

73c9d30:

  • Introduced a new controller built on the existing state.Manager / SyncState() engine (the same
    pattern NVIDIADriver uses), registered in cmd/gpu-operator/main.go.
  • Singleton enforcement (first-wins): a single instance owns reconciliation; any additional
    instance is marked Ignored and skipped. Mirrors how ClusterPolicy handles duplicates.
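
The first-wins rule can be illustrated with a small, self-contained sketch. This is not the controller code from this PR: the `singletonTracker` type, its `Admit` and `Release` methods, and the `Reconcile` decision name are hypothetical, and the sketch assumes that "first" means the first instance the controller sees. Only the `Ignored` outcome comes from the description above.

```go
package main

import "fmt"

// Decision is the outcome of the singleton check for one instance.
type Decision string

const (
	// Reconcile means the instance owns reconciliation (hypothetical name).
	Reconcile Decision = "Reconcile"
	// Ignored means another instance already owns reconciliation.
	Ignored Decision = "Ignored"
)

// singletonTracker remembers the name of the first instance it admitted.
type singletonTracker struct {
	owner string
}

// Admit returns Reconcile for the owning instance and Ignored for any other.
// The first name passed to Admit becomes the owner.
func (t *singletonTracker) Admit(name string) Decision {
	if t.owner == "" {
		t.owner = name
	}
	if t.owner == name {
		return Reconcile
	}
	return Ignored
}

// Release clears ownership when the owning instance is deleted, so that a
// later instance can take over. Calls for non-owners are no-ops.
func (t *singletonTracker) Release(name string) {
	if t.owner == name {
		t.owner = ""
	}
}

func main() {
	tracker := &singletonTracker{}
	for _, name := range []string{"config-a", "config-b", "config-a"} {
		fmt.Printf("%s: %s\n", name, tracker.Admit(name))
	}
}
```

In an actual reconciler, the `Ignored` decision would typically be written to the instance's status before returning without requeueing. That detail is left out here.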

ccc0f7a:

  • Added cmd/dra-driver-validator, the init container binary for the DRA kubelet-plugin DaemonSet. It runs before the gpus and computeDomains containers start, validates that the NVIDIA driver is installed, and writes /run/nvidia/validations/driver-ready with the two env vars the kubelet-plugin containers source on startup (NVIDIA_DRIVER_ROOT, DRIVER_ROOT_CTR_PATH).
  • Tested: end-to-end unit tests with a fake driver, plus a run against a real NVIDIA 595.58.03 driver on an A100 node.

87fa6c0:

  • Added the DRA driver operand to the GPUClusterConfig controller: a new state-dra-driver state, with its manifests wired into the state manager.

@karthikvetrivel karthikvetrivel changed the title Add GPUClusterConfig v1alpha1 CRD types and scaffolding (ST-1) Add GPUClusterConfig CRD and controller for DRA-based stack Jun 2, 2026
@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch 4 times, most recently from 0a080e3 to 5ddc1d5 Compare June 2, 2026 20:09
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch 2 times, most recently from 73c9d30 to 94566b5 Compare June 4, 2026 00:12
@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch from 94566b5 to a4e79f6 Compare June 4, 2026 14:51