Prevent driver rollout caused by helm.sh/chart label change #2515

Closed
rajathagasthya wants to merge 1 commit into NVIDIA:main from rajathagasthya:worktree-minor-version-driver-no-upgrade

Conversation

@rajathagasthya
Contributor

Description

On a Helm chart upgrade, the operator copies the chart's helm.sh/chart label (which encodes the chart version) onto operand pod templates. The changed value alters the DaemonSet's controller revision hash. The upgrade controller treats the new hash as a spec change and rolls the driver DaemonSet, disrupting GPU workloads even on patch upgrades where the driver itself is unchanged.
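
To illustrate the mechanism, here is a minimal sketch of how a label-only change in a pod template produces a different hash. It is a toy stand-in, not the Kubernetes revision-hash implementation: the `PodTemplate` type, the `templateHash` function, and the label and image values used with it are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// PodTemplate is a hypothetical, stripped-down stand-in for a pod template.
// It is not the Kubernetes PodTemplateSpec type.
type PodTemplate struct {
	Labels map[string]string
	Image  string
}

// templateHash returns a deterministic FNV-1a hash over the template's image
// and labels. Label keys are sorted so map iteration order cannot affect the
// result. This only approximates the idea of a revision hash (assumption).
func templateHash(t PodTemplate) uint32 {
	keys := make([]string, 0, len(t.Labels))
	for k := range t.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New32a()
	fmt.Fprintf(h, "image=%s\n", t.Image)
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, t.Labels[k])
	}
	return h.Sum32()
}
```

Hashing two templates that share the same image but carry different helm.sh/chart values (for example, the illustrative `gpu-operator-v1.0.0` and `gpu-operator-v1.0.1`) yields different hashes, which is the kind of change that reads as a new revision. With the chart label left out of the template, the hash stays the same across chart versions.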

This change stops propagating helm.sh/chart to operand pod templates in both the ClusterPolicy and NVIDIADriver paths, while still applying it to the DaemonSet object metadata for chart traceability. To keep the label on the running pods without recreating them, the operator applies it directly to the live operand and driver pods on every reconcile. The label is also injected into the NVIDIADriver CR so its pods are covered the same way as the ClusterPolicy-managed operands.
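
The two halves of the approach can be sketched over plain label maps. This is an illustration under stated assumptions, not the code in this PR: `chartLabelKey`, `podTemplateLabels`, and `syncChartLabel` are hypothetical names, and the real reconcilers would patch pod objects through the Kubernetes API rather than mutate maps.

```go
package main

// chartLabelKey is the label this PR stops propagating to pod templates.
const chartLabelKey = "helm.sh/chart"

// podTemplateLabels (hypothetical helper) returns a copy of the common labels
// without the chart label, so a chart version bump cannot change the template.
func podTemplateLabels(common map[string]string) map[string]string {
	out := make(map[string]string, len(common))
	for k, v := range common {
		if k == chartLabelKey {
			continue
		}
		out[k] = v
	}
	return out
}

// syncChartLabel (hypothetical helper) sets the chart label on a live pod's
// labels. It returns the resulting map and whether an update was needed, so a
// caller can skip the API write when the label is already current.
func syncChartLabel(podLabels map[string]string, chartValue string) (map[string]string, bool) {
	if podLabels[chartLabelKey] == chartValue {
		return podLabels, false
	}
	if podLabels == nil {
		podLabels = map[string]string{}
	}
	podLabels[chartLabelKey] = chartValue
	return podLabels, true
}
```

In this sketch the DaemonSet object metadata would still receive the full common label set, the pod template would receive `podTemplateLabels(common)`, and each reconcile would call `syncChartLabel` for every live pod, writing to the API only when it reports a change.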

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Unit tests added/updated for the pod-template exclusion (both paths) and the live-pod label reconcilers (TestApplyCommonDaemonsetMetadata, TestDriverPodTemplateExcludesChartLabel, TestReconcileOperandPodLabels, TestReconcileDriverPodLabels). go build, go vet, and the full controllers and internal/state suites pass. helm template confirms the NVIDIADriver CR renders the helm.sh/chart label. Not yet validated on a live cluster.

On a Helm chart upgrade, the operator copies the chart's helm.sh/chart
label (which encodes the chart version) onto operand pod templates. The
changed value alters the DaemonSet's controller revision hash, which the
upgrade controller treats as a spec change and rolls the driver
DaemonSet -- disrupting GPU workloads even on patch upgrades where the
driver itself is unchanged.

Stop propagating helm.sh/chart to operand pod templates in both the
ClusterPolicy and NVIDIADriver paths, while still applying it to the
DaemonSet object metadata for chart traceability. To keep the label on
the running pods without recreating them, reconcile it directly onto the
live operand and driver pods on each reconcile. Also inject the label
into the NVIDIADriver CR so its pods are covered the same way as the
ClusterPolicy-managed operands.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
@rajathagasthya
Contributor Author

Closing this draft. The pod-template label-exclusion approach special-cases a single label (helm.sh/chart), and we'd rather pursue a more general solution for non-disruptive patch upgrades. Will follow up with a revised design.