Prevent driver rollout caused by helm.sh/chart label change #2515
Closed
rajathagasthya wants to merge 1 commit into
On a Helm chart upgrade, the operator copies the chart's helm.sh/chart label (which encodes the chart version) onto operand pod templates. The changed value alters the DaemonSet's controller revision hash, which the upgrade controller treats as a spec change and rolls the driver DaemonSet -- disrupting GPU workloads even on patch upgrades where the driver itself is unchanged.

Stop propagating helm.sh/chart to operand pod templates in both the ClusterPolicy and NVIDIADriver paths, while still applying it to the DaemonSet object metadata for chart traceability. To keep the label on the running pods without recreating them, reconcile it directly onto the live operand and driver pods on each reconcile. Also inject the label into the NVIDIADriver CR so its pods are covered the same way as the ClusterPolicy-managed operands.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
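The label split described above can be sketched roughly as follows. This is a minimal illustration using plain maps, not the operator's actual code: `splitLabels` and `templateHash` are hypothetical helpers, the label values are made up, and the FNV hash only stands in for the DaemonSet controller revision hash to show that a hash over the pod template stays stable once helm.sh/chart is left out of it.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// chartLabelKey is the Helm label that encodes the chart name and version.
const chartLabelKey = "helm.sh/chart"

// splitLabels (hypothetical helper) returns the labels for the DaemonSet
// object metadata and the labels for its pod template. Only the object
// metadata keeps helm.sh/chart.
func splitLabels(common map[string]string) (object, template map[string]string) {
	object = make(map[string]string, len(common))
	template = make(map[string]string, len(common))
	for k, v := range common {
		object[k] = v
		if k != chartLabelKey {
			template[k] = v
		}
	}
	return object, template
}

// templateHash is a stand-in (assumption) for the controller revision
// hash: any change to the hashed labels changes the result.
func templateHash(labels map[string]string) uint32 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort first
	h := fnv.New32a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(labels[k]))
		h.Write([]byte{0})
	}
	return h.Sum32()
}

func main() {
	// Illustrative label sets before and after a patch-level chart upgrade.
	before := map[string]string{"app": "example-driver", chartLabelKey: "example-1.0.0"}
	after := map[string]string{"app": "example-driver", chartLabelKey: "example-1.0.1"}

	_, tmplBefore := splitLabels(before)
	objAfter, tmplAfter := splitLabels(after)

	fmt.Println("template hash unchanged:", templateHash(tmplBefore) == templateHash(tmplAfter))
	fmt.Println("object metadata chart label:", objAfter[chartLabelKey])
}
```

With the chart label kept out of the template, a chart version bump changes only the DaemonSet's own metadata, which is not part of the hashed pod template.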
Closing this draft. The pod-template label-exclusion approach special-cases a single label ( |
Description
On a Helm chart upgrade, the operator copies the chart's helm.sh/chart label (which encodes the chart version) onto operand pod templates. The changed value alters the DaemonSet's controller revision hash, which the upgrade controller treats as a spec change and rolls the driver DaemonSet, disrupting GPU workloads even on patch upgrades where the driver itself is unchanged.

This change stops propagating helm.sh/chart to operand pod templates in both the ClusterPolicy and NVIDIADriver paths, while still applying it to the DaemonSet object metadata for chart traceability. To keep the label on the running pods without recreating them, the operator reconciles it directly onto the live operand and driver pods on each reconcile. The label is also injected into the NVIDIADriver CR so its pods are covered the same way as the ClusterPolicy-managed operands.

Checklist
(make lint)
(make validate-generated-assets)
(make validate-modules)

Testing
Unit tests added/updated for the pod-template exclusion (both paths) and the live-pod label reconcilers (TestApplyCommonDaemonsetMetadata, TestDriverPodTemplateExcludesChartLabel, TestReconcileOperandPodLabels, TestReconcileDriverPodLabels). go build, go vet, and the full controllers and internal/state suites pass. helm template confirms the NVIDIADriver CR renders the helm.sh/chart label. Not yet validated on a live cluster.
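For readers unfamiliar with the live-pod reconcile step, here is a rough sketch of what such a reconciler might look like, reduced to plain Go types. The `pod` struct and `reconcileChartLabel` function are hypothetical and are not taken from this PR. A real controller would issue a metadata-only patch per pod through the Kubernetes API instead of mutating a slice. The sketch only shows the intended behavior: update pods whose label is missing or stale, and do nothing on later passes.

```go
package main

import "fmt"

const chartLabelKey = "helm.sh/chart"

// pod is a minimal stand-in for a live Pod object (name plus labels).
type pod struct {
	name   string
	labels map[string]string
}

// reconcileChartLabel (hypothetical) sets helm.sh/chart to want on every
// pod whose label is missing or stale and returns the names of the pods
// it changed. Pods that already carry the wanted value are skipped, so
// repeated calls are no-ops.
func reconcileChartLabel(pods []pod, want string) []string {
	var updated []string
	for i := range pods {
		if pods[i].labels[chartLabelKey] == want {
			continue // already up to date
		}
		if pods[i].labels == nil {
			pods[i].labels = map[string]string{}
		}
		pods[i].labels[chartLabelKey] = want
		updated = append(updated, pods[i].name)
	}
	return updated
}

func main() {
	// Illustrative pods: one stale, one unlabeled, one current.
	pods := []pod{
		{name: "driver-a", labels: map[string]string{chartLabelKey: "example-1.0.0"}},
		{name: "driver-b"},
		{name: "driver-c", labels: map[string]string{chartLabelKey: "example-1.0.1"}},
	}
	fmt.Println("first pass updated:", reconcileChartLabel(pods, "example-1.0.1"))
	fmt.Println("second pass updated:", reconcileChartLabel(pods, "example-1.0.1"))
}
```

Because labels on a running pod can be changed in place, this kind of update leaves the pod's containers untouched, which is the property the description relies on.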