Skip to content

Preserve image plan metadata in VMSS model conversion to prevent false model drift#6129

Open
stelucz wants to merge 2 commits intokubernetes-sigs:mainfrom
stelucz:vmss-model-check
Open

Preserve image plan metadata in VMSS model conversion to prevent false model drift#6129
stelucz wants to merge 2 commits intokubernetes-sigs:mainfrom
stelucz:vmss-model-check

Conversation

@stelucz
Copy link
Copy Markdown
Contributor

@stelucz stelucz commented Mar 2, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a VMSS model comparison issue in CAPZ where plan metadata was lost during Azure SDK → CAPZ image conversion.
As a result, CAPZ could detect a model mismatch even when AzureMachinePool spec was unchanged, and repeatedly roll MachinePool instances.

Root Cause

  • LatestModelApplied for AzureMachinePoolMachine depends on comparing:
    • desired image from MachinePoolScope.GetVMImage(...)
    • observed VM/VMSS instance image from Azure APIs.
  • For Compute/Shared Gallery images that include a marketplace-like plan, CAPZ expected plan data in desired model.
  • Before this change, converter logic SDKImageToImage did not carry Plan fields for gallery-based image references.
  • This produced structurally different image objects (desired had plan metadata, observed often did not), leading to persistent LatestModelApplied=false.

Actual State Before Change

  • No user-visible spec mutation on MachinePool/AzureMachinePool.
  • CAPZ logs repeatedly showed model-change predicates firing and ongoing reconciles/deletes:
    • MachinePoolModelHasChanged ... shouldUpdate=true
    • repeated Deleting AzureMachinePoolMachine ...
  • Node/VM churn continued despite stable target configuration.

Change Introduced

  • Added plan-aware conversion path in VMSS converter:
    • SDKImageToImageWithPlan(imageRef, isThirdParty, sdkPlan)
    • Existing SDKImageToImage(imageRef, isThirdParty) preserved as compatibility wrapper.
  • Updated VM/VMSS conversion call sites to pass both:
    • existing isThirdParty signal (sdkPlan != nil)
    • full sdkPlan object when available.
  • Added plan propagation for gallery images:
    • Compute Gallery: map Plan into ComputeGallery.Plan
    • Shared Gallery: map Plan into Publisher/Offer/SKU fields
  • Updated/extended unit tests in converter package to cover the plan-carrying behavior.

Expected Result

  • Observed image model now preserves relevant plan metadata, reducing false differences against desired model.
  • LatestModelApplied is evaluated against a more faithful observed state.
  • Prevents unnecessary VMSS instance replacement loops caused by plan metadata loss.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Preserve image plan metadata in VMSS model conversion to prevent false model drift

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 2, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 2, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @stelucz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 2, 2026
@stelucz
Copy link
Copy Markdown
Contributor Author

stelucz commented Mar 2, 2026

/assign vincepri

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.94%. Comparing base (012443d) to head (86b42fe).
⚠️ Report is 129 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6129      +/-   ##
==========================================
- Coverage   44.41%   43.94%   -0.48%     
==========================================
  Files         280      289       +9     
  Lines       25379    25357      -22     
==========================================
- Hits        11273    11144     -129     
- Misses      13293    13433     +140     
+ Partials      813      780      -33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@stelucz stelucz force-pushed the vmss-model-check branch from bdf24b8 to c304453 Compare March 2, 2026 13:31
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 2, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stelucz
Copy link
Copy Markdown
Contributor Author

stelucz commented Mar 2, 2026

added missing test

Comment thread azure/converters/vmss.go Outdated
@github-project-automation github-project-automation Bot moved this from Todo to Wait-On-Author in CAPZ Planning Mar 2, 2026
Signed-off-by: Lukas Stehlik <stehlik.lukas@gmail.com>
@stelucz stelucz force-pushed the vmss-model-check branch from c304453 to c380ed5 Compare March 3, 2026 13:25
@stelucz stelucz requested a review from jackfrancis March 5, 2026 08:39
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 15, 2026
@stelucz
Copy link
Copy Markdown
Contributor Author

stelucz commented Apr 15, 2026

Hi @jackfrancis may i ask you to have a look at change? Thanks

Copy link
Copy Markdown
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a subtle regression possible here: if the user specifies the image via image.ID (a raw string), then the desired image from GetVMImage has ID set and ComputeGallery is nil. The spec has no plan data.

The whole purpose of the IDImageRefToImage comparison is to handle the case where the user wrote a raw gallery resource ID instead of using the structured ComputeGallery field. In that case, the user has no way to express plan metadata in their spec.

Maybe the fix is to strip the plan from the observed image when comparing via this code path, since the "desired" side can never have one. We could add this to machinepoolmachine.go:

if newImage.ComputeGallery != nil {
    instanceImageCopy := s.instance.Image
    instanceImageCopy.ComputeGallery.Plan = nil
    return reflect.DeepEqual(instanceImageCopy, newImage), nil
}

Comment thread azure/converters/vmss_test.go Outdated
Comment thread azure/converters/vmss_test.go Outdated
Address review feedback:
- Strip ComputeGallery.Plan from observed instance image when comparing
  against desired image built from raw ID, preventing false model drift.
- Remove unused IsThirdParty field from SDKImageToImage test table.
- Collapse test assertion to single SDKImageToImage call.
- Add hasLatestModelApplied regression tests for plan-stripping path.

Signed-off-by: Lukas Stehlik <stehlik.lukas@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Wait-On-Author

Development

Successfully merging this pull request may close these issues.

5 participants