add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver #2518
Open
rahulait wants to merge 1 commit into
… to nvidiadriver

changes include:
* added default nvidiadriver which doesn't conflict with other user-specified nvidiadrivers
* added migration support from clusterpolicy to nvidiadriver
* added/updated e2e tests for nvidiadriver workflow

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
Fixes:
#2353
#2355
Summary
This PR adds support for a default `NVIDIADriver` CR and enables controlled migration from legacy `ClusterPolicy` driver management to `NVIDIADriver`-based driver management.

When `driver.nvidiaDriverCRD.enabled=true` and `driver.nvidiaDriverCRD.deployDefaultCR=true`, Helm renders a default `NVIDIADriver` CR. This CR is identified by the label `nvidia.com/gpu-operator.default-driver: "true"`.

The CR name is not semantically important. Helm renders it as `default`, but users can create an arbitrarily named `NVIDIADriver` and mark it as the fallback driver by applying the default label.

Behavior

- Non-default `NVIDIADriver` CRs own nodes selected by their `spec.nodeSelector`.
- The default `NVIDIADriver` acts as fallback for GPU nodes that do not match any non-default driver.
- If no default `NVIDIADriver` exists, fallback ownership is skipped.
- If more than one `NVIDIADriver` is labeled as default, reconciliation fails closed.
- … `NVIDIADriver` and participates in nodeSelector conflict detection.
- `NVIDIADriver` node selectors cannot include the operator-managed owner label `nvidia.com/gpu.driver.owner`.
- In `NVIDIADriver` mode, `NVIDIADriver` ownership conflicts mark `ClusterPolicy` as `notReady` and surface the conflict in its status conditions.

Migration Support

This PR supports migration from `ClusterPolicy` driver management to `NVIDIADriver` driver management by:

- … `NVIDIADriver` from Helm values when CRD mode is enabled
- … `ClusterPolicy` driver fields when `driver.nvidiaDriverCRD.enabled=false`
- … `NVIDIADriver` ownership through the operator-managed node label `nvidia.com/gpu.driver.owner`
- … `ClusterPolicy`-owned driver daemonsets after `NVIDIADriver` ownership takes over
- … `ClusterPolicy` driver daemonsets before `NVIDIADriver` rolls replacement pods

Migration requires the target GPU nodes to be matched by at least one `NVIDIADriver` CR. If `driver.nvidiaDriverCRD.enabled=true` but no matching `NVIDIADriver` exists, the controller has no driver owner to assign to those nodes, so fallback ownership is skipped until a matching CR is created.

Upgrade Example

When upgrading from a release that does not contain this migration support, migration from `ClusterPolicy` driver management to `NVIDIADriver` driver management must be performed in two upgrades. This applies only to clusters that are currently using `ClusterPolicy` for driver management and are switching to `NVIDIADriver`. It applies to releases older than `v26.7.0`.

Clusters that are already using `NVIDIADriver` in an older release do not need this two-phase migration flow.

Starting with `v26.7.0`, the migration support is already present in the running controller. Clusters already running `v26.7.0` or newer can switch to `NVIDIADriver` mode directly by upgrading or updating configuration with `driver.nvidiaDriverCRD.enabled=true`, as long as the target GPU nodes are matched by `NVIDIADriver` CRs. The recommended path is to also set `driver.nvidiaDriverCRD.deployDefaultCR=true` so Helm renders the default fallback CR. Alternatively, users can create their own `NVIDIADriver` CRs and mark one as the default fallback with the `nvidia.com/gpu-operator.default-driver: "true"` label. See the Behavior section for details on default and non-default ownership.

Do not jump directly from an old release to the new release with `driver.nvidiaDriverCRD.enabled=true`. The old controller does not know about the controlled migration flow. If the old controller exits while the new release is already configured for `NVIDIADriver` mode, it can tear down all existing driver pods immediately, before the new controller has a chance to take ownership.

Use this sequence instead:

1. Upgrade to the new release with `ClusterPolicy` driver management still enabled.

   Example Helm settings: `driver.nvidiaDriverCRD.enabled=false`

   This gets the new controller logic running while the existing `ClusterPolicy`-owned driver pods remain in place.

2. Upgrade again to the same new release, this time enabling `NVIDIADriver` mode.

   Example Helm settings: `driver.nvidiaDriverCRD.enabled=true` and `driver.nvidiaDriverCRD.deployDefaultCR=true`

   This renders the default-labeled `NVIDIADriver` and performs the controlled migration from `ClusterPolicy` ownership to `NVIDIADriver` ownership.

Issues Found And Fixed
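
The upgrade rule from the Upgrade Example section lends itself to a small pre-upgrade check. The following is only a sketch of that rule, not code from this PR: `NeedsTwoPhaseUpgrade`, `parseVersion`, and `olderThan` are hypothetical helpers, the parser assumes plain `vMAJOR.MINOR.PATCH` strings, and `v25.10.0` in `main` is an illustrative version only. The `v26.7.0` threshold is the one stated above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// firstMigrationRelease is the first release this PR names as shipping migration support.
const firstMigrationRelease = "v26.7.0"

// parseVersion turns "v26.7.0" into [26 7 0].
// Hypothetical helper: it only accepts plain vMAJOR.MINOR.PATCH strings.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// olderThan reports whether version a sorts strictly before version b.
func olderThan(a, b [3]int) bool {
	for i := 0; i < 3; i++ {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// NeedsTwoPhaseUpgrade reports whether switching to NVIDIADriver mode requires
// two upgrades: the cluster currently uses ClusterPolicy for driver management
// and the running release is older than firstMigrationRelease.
func NeedsTwoPhaseUpgrade(runningVersion string, usesClusterPolicyDrivers bool) (bool, error) {
	if !usesClusterPolicyDrivers {
		// Clusters already on NVIDIADriver do not need the two-phase flow.
		return false, nil
	}
	running, err := parseVersion(runningVersion)
	if err != nil {
		return false, err
	}
	first, err := parseVersion(firstMigrationRelease)
	if err != nil {
		return false, err
	}
	return olderThan(running, first), nil
}

func main() {
	// "v25.10.0" is an illustrative older version, not taken from the PR.
	for _, v := range []string{"v25.10.0", "v26.7.0"} {
		twoPhase, err := NeedsTwoPhaseUpgrade(v, true)
		fmt.Println(v, "two-phase:", twoPhase, "err:", err)
	}
}
```

A check like this would run before the Helm upgrade; when it returns true, the first upgrade keeps `driver.nvidiaDriverCRD.enabled=false` and only the second one enables `NVIDIADriver` mode.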

During testing, we found several edge cases:

- Treating the name `default` as special was too restrictive and could conflict with user-created CRs. The default driver is now identified by label instead.
- `spec.nodeSelector` could include `nvidia.com/gpu.driver.owner` and override the operator-managed owner selector used by driver daemonsets. The controller, admission validation, and daemonset nodeSelector construction now reject such selectors or defensively preserve the invariant.
- `ClusterPolicy` reconciliation previously attempted `NVIDIADriver` owner assignment even when `NVIDIADriver` mode was disabled. Owner assignment and orphaned-pod migration are now gated on `driver.enabled=true` and `driver.nvidiaDriverCRD.enabled=true`.
- `ClusterPolicy` status did not consistently move to `notReady` for early `NVIDIADriver` ownership failures. It now sets `status.state` to `notReady` and records the concrete conflict message in status conditions.
- Label-only changes do not update `.metadata.generation`. The `NVIDIADriver` controller now reconciles when the default label changes.
- … `NVIDIADriver` names.

E2E Coverage
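
The owner-label invariant described under Issues Found And Fixed is easy to pin down at unit level as well. The sketch below is illustrative and is not the code from this PR: `ValidateNodeSelector` and `DaemonSetNodeSelector` are hypothetical names, and using the CR name as the value of the owner label is an assumption made only for the example.

```go
package main

import "fmt"

// ownerLabel is the operator-managed node label named in this PR.
const ownerLabel = "nvidia.com/gpu.driver.owner"

// ValidateNodeSelector rejects a user-supplied selector that includes the
// operator-managed owner label (hypothetical helper).
func ValidateNodeSelector(selector map[string]string) error {
	if _, ok := selector[ownerLabel]; ok {
		return fmt.Errorf("nodeSelector must not include operator-managed label %q", ownerLabel)
	}
	return nil
}

// DaemonSetNodeSelector builds the selector a driver daemonset would use:
// the user's selector plus the owner label. The owner label is written last,
// so a user-supplied value can never override it, even if validation was skipped.
// Assumption: the owner value is the NVIDIADriver CR name.
func DaemonSetNodeSelector(user map[string]string, owner string) map[string]string {
	out := make(map[string]string, len(user)+1)
	for k, v := range user {
		out[k] = v
	}
	out[ownerLabel] = owner
	return out
}

func main() {
	// "pool" is an illustrative label key, not taken from the PR.
	bad := map[string]string{"pool": "a100", ownerLabel: "someone-else"}
	fmt.Println("validation error:", ValidateNodeSelector(bad))
	fmt.Println("daemonset selector:", DaemonSetNodeSelector(bad, "a100-driver"))
}
```

Rejecting the selector and overwriting the label are deliberately redundant here, mirroring the "reject or defensively preserve" wording: either check alone keeps the daemonset pinned to nodes the operator has assigned.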

The `NVIDIADriver` e2e flow now covers:

- … `NVIDIADriver` only when `driver.nvidiaDriverCRD.enabled=true` and `driver.nvidiaDriverCRD.deployDefaultCR=true`.
- … `deployDefaultCR=false`.
- … `NVIDIADriver` CRD mode.
- … `NVIDIADriver` can take ownership of GPU nodes.
- … `NVIDIADriver`.
- … `NVIDIADriver`.
- `ClusterPolicy` reports `notReady` with the `NVIDIADriver` conflict message when ownership assignment fails in `NVIDIADriver` mode.
- … `NVIDIADriver` mode before teardown. It verifies the legacy driver daemonset is removed, the existing legacy pod is orphaned, the default `NVIDIADriver` takes ownership, the orphaned pod is deleted by the upgrade flow within 5 minutes, and the replacement driver pod becomes ready.

Checklist

- … (`make lint`)
- … (`make validate-generated-assets`)
- … (`make validate-modules`)

Testing

Added unit tests and e2e tests, and also tested manually on a 3-node cluster with multiple `NVIDIADriver` CRs. Everything worked as expected, including migration.
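
To make the multi-CR scenario concrete, the ownership rules from the Behavior section can be sketched roughly as follows. This is a simplified illustration, not the controller code: `Driver`, `Node`, `matches`, and `AssignOwners` are hypothetical stand-ins for the real API types and logic, every node passed in is assumed to be a GPU node, the `pool` label is invented for the example, and the sketch does not model how a default-labeled CR's own node selector is treated.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultDriverLabel is the label this PR uses to mark the fallback driver.
const defaultDriverLabel = "nvidia.com/gpu-operator.default-driver"

// Driver and Node are simplified, hypothetical stand-ins for the real API types.
type Driver struct {
	Name         string
	Labels       map[string]string
	NodeSelector map[string]string
}

type Node struct {
	Name   string
	Labels map[string]string
}

func isDefault(d Driver) bool { return d.Labels[defaultDriverLabel] == "true" }

// matches reports whether every selector entry is present in labels.
// An empty selector matches every node.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// AssignOwners maps each node name to the name of the driver that owns it.
func AssignOwners(nodes []Node, drivers []Driver) (map[string]string, error) {
	var defaults, regular []Driver
	for _, d := range drivers {
		if isDefault(d) {
			defaults = append(defaults, d)
		} else {
			regular = append(regular, d)
		}
	}
	if len(defaults) > 1 {
		// Fail closed: assign nothing rather than guess which default wins.
		return nil, errors.New("more than one NVIDIADriver is labeled as default")
	}
	owners := make(map[string]string)
	for _, n := range nodes {
		var matched []string
		for _, d := range regular {
			if matches(d.NodeSelector, n.Labels) {
				matched = append(matched, d.Name)
			}
		}
		switch {
		case len(matched) > 1:
			return nil, fmt.Errorf("node %s matches multiple drivers: %v", n.Name, matched)
		case len(matched) == 1:
			owners[n.Name] = matched[0]
		case len(defaults) == 1:
			owners[n.Name] = defaults[0].Name // fallback ownership
		}
		// No match and no default: fallback is skipped and the node stays unowned.
	}
	return owners, nil
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{"pool": "a100"}},
		{Name: "node-b", Labels: map[string]string{"pool": "h100"}},
	}
	drivers := []Driver{
		{Name: "a100-driver", NodeSelector: map[string]string{"pool": "a100"}},
		{Name: "fallback", Labels: map[string]string{defaultDriverLabel: "true"}},
	}
	owners, err := AssignOwners(nodes, drivers)
	fmt.Println(owners, err)
}
```

In this sketch `node-a` is claimed by the driver whose selector matches it, and `node-b` falls through to the default-labeled driver. Removing the default leaves `node-b` unowned, and adding a second default makes the whole assignment return an error.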