
add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver #2518

Open: rahulait wants to merge 1 commit into NVIDIA:main from rahulait:default-nvidiadriver


rahulait commented Jun 4, 2026

Fixes:

#2353
#2355

Summary

This PR adds support for a default NVIDIADriver CR and enables controlled migration from legacy ClusterPolicy driver management to NVIDIADriver-based driver management.

When driver.nvidiaDriverCRD.enabled=true and driver.nvidiaDriverCRD.deployDefaultCR=true, Helm renders a default NVIDIADriver CR. This CR is identified by the label:

nvidia.com/gpu-operator.default-driver: "true"

The CR name is not semantically important. Helm renders it as default, but users can create an arbitrarily named NVIDIADriver and mark it as the fallback driver by applying the default label.
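
As a rough illustration of the two rules above (render only when both flags are set, and identify the default by label rather than by name), here is a minimal Python sketch. It is not the operator's Go code or the Helm template; `should_render_default_cr` and `is_default_driver` are hypothetical helper names, and the values and CRs are modeled as plain dictionaries.

```python
DEFAULT_DRIVER_LABEL = "nvidia.com/gpu-operator.default-driver"


def should_render_default_cr(values: dict) -> bool:
    """Mirror the Helm condition: both flags must be true to render the default CR."""
    crd = values.get("driver", {}).get("nvidiaDriverCRD", {})
    return bool(crd.get("enabled")) and bool(crd.get("deployDefaultCR"))


def is_default_driver(cr: dict) -> bool:
    """A CR is the fallback driver if it carries the default label, whatever its name."""
    labels = cr.get("metadata", {}).get("labels") or {}
    return labels.get(DEFAULT_DRIVER_LABEL) == "true"


# Hypothetical example objects, for illustration only.
helm_values = {"driver": {"nvidiaDriverCRD": {"enabled": True, "deployDefaultCR": True}}}
custom_default = {
    "metadata": {"name": "my-fallback", "labels": {DEFAULT_DRIVER_LABEL: "true"}}
}
unlabeled = {"metadata": {"name": "default"}}
```

With these objects, `my-fallback` counts as the default driver because of its label, while a CR that is merely named `default` but has no label does not.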

Behavior

  • Non-default NVIDIADriver CRs own nodes selected by their spec.nodeSelector.
  • The default-labeled NVIDIADriver acts as fallback for GPU nodes that do not match any non-default driver.
  • If no default-labeled NVIDIADriver exists, fallback ownership is skipped.
  • If more than one NVIDIADriver is labeled as default, reconciliation fails closed.
  • If the default label is removed from a CR, that CR becomes a normal user-defined NVIDIADriver and participates in nodeSelector conflict detection.
  • User-defined driver conflicts do not relabel nodes to the default driver or clear existing owner labels.
  • NVIDIADriver node selectors cannot include the operator-managed owner label nvidia.com/gpu.driver.owner.
  • In NVIDIADriver mode, NVIDIADriver ownership conflicts mark ClusterPolicy as notReady and surface the conflict in its status conditions.
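
The ownership rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the controller's actual Go implementation: `Driver`, `OwnershipError`, and `assign_owners` are hypothetical names, every node passed in is assumed to be a GPU node, selectors are exact-match label maps, and the default driver's own nodeSelector is ignored for fallback purposes. The function computes the full assignment before returning, so a caller that only applies the result on success leaves existing owner labels untouched on any conflict.

```python
from dataclasses import dataclass, field

DEFAULT_DRIVER_LABEL = "nvidia.com/gpu-operator.default-driver"
OWNER_LABEL = "nvidia.com/gpu.driver.owner"


class OwnershipError(Exception):
    """Raised when owners cannot be assigned safely (fail closed)."""


@dataclass
class Driver:
    name: str
    node_selector: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)

    @property
    def is_default(self) -> bool:
        return self.labels.get(DEFAULT_DRIVER_LABEL) == "true"


def matches(selector: dict, node_labels: dict) -> bool:
    """Exact-match label selector; an empty selector matches every node."""
    return all(node_labels.get(k) == v for k, v in selector.items())


def assign_owners(drivers: list, nodes: dict) -> dict:
    """Return {node name: owning driver name}; raise before producing any partial result."""
    for d in drivers:
        if OWNER_LABEL in d.node_selector:
            raise OwnershipError(f"{d.name}: nodeSelector must not include {OWNER_LABEL}")

    defaults = [d for d in drivers if d.is_default]
    if len(defaults) > 1:
        names = sorted(d.name for d in defaults)
        raise OwnershipError(f"multiple default drivers: {names}")

    user_drivers = [d for d in drivers if not d.is_default]
    owners = {}
    for node, node_labels in nodes.items():
        claimants = sorted(d.name for d in user_drivers if matches(d.node_selector, node_labels))
        if len(claimants) > 1:
            raise OwnershipError(f"node {node} is selected by {claimants}")
        if claimants:
            owners[node] = claimants[0]
        elif defaults:
            owners[node] = defaults[0].name
        # No default driver: fallback ownership is skipped for this node.
    return owners
```

The choice to raise instead of returning a partial mapping is what models "fails closed": no node is relabeled unless the whole assignment is conflict-free.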

Migration Support

This PR supports migration from ClusterPolicy driver management to NVIDIADriver driver management by:

  • rendering a default NVIDIADriver from Helm values when CRD mode is enabled
  • preserving legacy ClusterPolicy driver fields when driver.nvidiaDriverCRD.enabled=false
  • routing GPU nodes to NVIDIADriver ownership through the operator-managed node label nvidia.com/gpu.driver.owner
  • cleaning up ClusterPolicy-owned driver daemonsets after NVIDIADriver ownership takes over
  • preserving running driver pods during migration by orphaning old ClusterPolicy driver daemonsets before NVIDIADriver rolls replacement pods

Migration requires the target GPU nodes to be matched by at least one NVIDIADriver CR. If driver.nvidiaDriverCRD.enabled=true but no matching NVIDIADriver exists, the controller has no driver owner to assign to those nodes, so fallback ownership is skipped until a matching CR is created.

Upgrade Example

When upgrading from a release older than v26.7.0, which does not contain this migration support, migration from ClusterPolicy driver management to NVIDIADriver driver management must be performed in two upgrades. This applies only to clusters that currently use ClusterPolicy for driver management and are switching to NVIDIADriver.

Clusters that are already using NVIDIADriver in an older release do not need this two-phase migration flow.

Starting with v26.7.0, the migration support is already present in the running controller. Clusters already running v26.7.0 or newer can switch to NVIDIADriver mode directly by upgrading or updating configuration with driver.nvidiaDriverCRD.enabled=true, as long as the target GPU nodes are matched by NVIDIADriver CRs. The recommended path is to also set driver.nvidiaDriverCRD.deployDefaultCR=true so Helm renders the default fallback CR. Alternatively, users can create their own NVIDIADriver CRs and mark one as the default fallback with the nvidia.com/gpu-operator.default-driver: "true" label. See the Behavior section for details on default and non-default ownership.

Do not jump directly from an old release to the new release with driver.nvidiaDriverCRD.enabled=true. The old controller does not know about the controlled migration flow. If the old controller exits while the new release is already configured for NVIDIADriver mode, it can tear down all existing driver pods immediately, before the new controller has a chance to take ownership.

Use this sequence instead:

  1. Upgrade to the new release with ClusterPolicy driver management still enabled.

    Example Helm settings:

    driver:
      enabled: true
      nvidiaDriverCRD:
        enabled: false
        deployDefaultCR: false

    This gets the new controller logic running while the existing ClusterPolicy-owned driver pods remain in place.

  2. Upgrade again to the same new release, this time enabling NVIDIADriver mode.

    Example Helm settings:

    driver:
      enabled: true
      nvidiaDriverCRD:
        enabled: true
        deployDefaultCR: true

    This renders the default-labeled NVIDIADriver and performs the controlled migration from ClusterPolicy ownership to NVIDIADriver ownership.
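
For readers who script their upgrades, the decision above can be expressed as a small Python sketch. It is illustrative only: `needs_two_phase` and `migration_steps` are hypothetical helpers, the version parser assumes plain `vMAJOR.MINOR.PATCH` strings with no pre-release suffix, and `migration_steps` assumes the cluster currently uses ClusterPolicy driver management and follows the recommended path of deploying the default CR.

```python
MIGRATION_SUPPORT = (26, 7, 0)  # first release containing the migration support


def parse_version(version: str) -> tuple:
    """Parse 'v26.7.0' or '26.7.0' into (26, 7, 0)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def needs_two_phase(running_version: str, uses_clusterpolicy_driver: bool) -> bool:
    """Two upgrades are needed only when switching from ClusterPolicy on an older release."""
    return uses_clusterpolicy_driver and parse_version(running_version) < MIGRATION_SUPPORT


def helm_values(crd_enabled: bool) -> dict:
    """Helm values for one upgrade step, matching the examples in the sequence above."""
    return {
        "driver": {
            "enabled": True,
            "nvidiaDriverCRD": {"enabled": crd_enabled, "deployDefaultCR": crd_enabled},
        }
    }


def migration_steps(running_version: str) -> list:
    """Ordered Helm values for a cluster currently on ClusterPolicy driver management."""
    if needs_two_phase(running_version, uses_clusterpolicy_driver=True):
        # Step 1 gets the new controller running; step 2 enables NVIDIADriver mode.
        return [helm_values(False), helm_values(True)]
    return [helm_values(True)]
```

Each dictionary in the returned list corresponds to the values for one `helm upgrade` invocation, applied in order.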

Issues Found And Fixed

During testing, we found several edge cases:

  • Treating the literal name default as special was too restrictive and could conflict with user-created CRs. The default driver is now identified by label instead.
  • If multiple default-labeled drivers existed, reconciliation could behave ambiguously. The controller now fails closed when multiple defaults are found.
  • On user-driver nodeSelector conflicts, nodes could be relabeled away from their existing owner, causing existing driver pods to disappear. Owner assignment now preflights conflicts and preserves existing owner labels on conflict.
  • User-supplied spec.nodeSelector could include nvidia.com/gpu.driver.owner and override the operator-managed owner selector used by driver daemonsets. The controller and admission validation now reject such selectors, and daemonset nodeSelector construction defensively preserves the operator-managed owner selector.
  • ClusterPolicy reconciliation previously attempted NVIDIADriver owner assignment even when NVIDIADriver mode was disabled. Owner assignment and orphaned-pod migration are now gated on driver.enabled=true and driver.nvidiaDriverCRD.enabled=true.
  • ClusterPolicy status did not consistently move to notReady for early NVIDIADriver ownership failures. It now sets status.state to notReady and records the concrete conflict message in status conditions.
  • Removing or moving the default label is a metadata-only change and does not bump .metadata.generation. The NVIDIADriver controller now reconciles when the default label changes.
  • Conflict messages did not identify the other conflicting drivers. They now include the first conflicting node and the conflicting NVIDIADriver names.
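
To make the metadata-only label fix concrete, here is a minimal Python sketch of an update filter that reconciles either on a generation bump or on a change to the default label. The real controller is written in Go, so this only models the idea; `should_reconcile_on_update` is a hypothetical name and objects are plain dictionaries.

```python
DEFAULT_DRIVER_LABEL = "nvidia.com/gpu-operator.default-driver"


def default_label_value(obj: dict):
    """Return the default-driver label value, or None if the label is absent."""
    labels = obj.get("metadata", {}).get("labels") or {}
    return labels.get(DEFAULT_DRIVER_LABEL)


def should_reconcile_on_update(old: dict, new: dict) -> bool:
    """Reconcile on spec changes (generation bump) or when the default label changes."""
    old_gen = old.get("metadata", {}).get("generation")
    new_gen = new.get("metadata", {}).get("generation")
    if old_gen != new_gen:
        return True
    # Label edits do not bump .metadata.generation, so compare the label explicitly.
    return default_label_value(old) != default_label_value(new)


# Hypothetical before/after objects: the default label is removed, generation unchanged.
before = {"metadata": {"generation": 3, "labels": {DEFAULT_DRIVER_LABEL: "true"}}}
after = {"metadata": {"generation": 3, "labels": {}}}
```

A filter that looked only at the generation would ignore the `before` to `after` transition, which is the gap this fix closes.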

E2E Coverage

The NVIDIADriver e2e flow now covers:

  • Helm renders a default NVIDIADriver only when driver.nvidiaDriverCRD.enabled=true and driver.nvidiaDriverCRD.deployDefaultCR=true.
  • Helm does not render a default CR when nvidiaDriverCRD mode is disabled.
  • Helm does not render a default CR when deployDefaultCR=false.
  • Operator installs in NVIDIADriver CRD mode.
  • Default driver can use an arbitrary name as long as it has the default label.
  • User-defined NVIDIADriver can take ownership of GPU nodes.
  • Driver image/version update works through a user-defined NVIDIADriver.
  • Custom driver pod labels update through a user-defined NVIDIADriver.
  • Removing the default label makes that CR a normal driver and conflict detection applies.
  • Conflicts preserve existing node owner labels.
  • ClusterPolicy reports notReady with the NVIDIADriver conflict message when ownership assignment fails in NVIDIADriver mode.
  • Multiple default-labeled drivers are rejected or fail closed without disrupting existing ownership.
  • The ClusterPolicy e2e workflow now performs an in-place migration to NVIDIADriver mode before teardown. It verifies the legacy driver daemonset is removed, the existing legacy pod is orphaned, the default NVIDIADriver takes ownership, the orphaned pod is deleted by the upgrade flow within 5 minutes, and the replacement driver pod becomes ready.
  • Helm uninstall remains covered by the existing e2e teardown.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Added unit tests and e2e tests, and manually tested on a 3-node cluster with multiple NVIDIADriver CRs. Everything worked as expected, including migration.

rahulait force-pushed the default-nvidiadriver branch 7 times, most recently from cf23d79 to 4adfc71 (June 5, 2026 17:12)
rahulait marked this pull request as ready for review (June 5, 2026 17:14)
rahulait force-pushed the default-nvidiadriver branch from 4adfc71 to 8125f47 (June 5, 2026 19:03)
… to nvidiadriver

changes include:
* added default nvidiadriver which doesn't conflict with other user-specified nvidiadrivers
* added migration support from clusterpolicy to nvidiadriver
* added/updated e2e tests for nvidiadriver workflow

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>