DNM: debug by damdo · Pull Request #5898 · kubernetes-sigs/cluster-api-provider-aws

damdo · 2026-03-10T20:48:51Z

No description provided.

k8s-ci-robot · 2026-03-10T20:48:53Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2026-03-10T20:48:53Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-03-10T20:48:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nrb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

damdo · 2026-03-10T21:27:37Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

damdo · 2026-03-10T21:53:36Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-11T07:01:35Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-11T19:50:56Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-12T10:57:46Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-12T12:12:30Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-13T06:50:38Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-16T18:29:20Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-16T22:55:48Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-19T21:32:39Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-30T19:59:00Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-31T08:16:41Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-31T13:13:22Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-31T15:33:59Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-31T19:45:15Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-03-31T22:43:56Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-04-01T06:46:45Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-04-01T11:10:41Z

/test pull-cluster-api-provider-aws-e2e

- Update cluster-api to v1.12.2
- Update kubernetes dependencies to v0.34.3:
  - k8s.io/api
  - k8s.io/apiextensions-apiserver
  - k8s.io/apimachinery
  - k8s.io/client-go
  - k8s.io/component-base
- Update controller-runtime to v0.22.5
- Update ginkgo to v2.27.2
- Update gomega to v1.38.2
- Regenerate base CRDs

Signed-off-by: Borja Clemente <bclement@redhat.com>

Signed-off-by: Borja Clemente <bclement@redhat.com>

Background diagnostic goroutines (resource dump every 5s, machine dump every 60s) call upstream CAPI framework functions (DumpAllResources, DumpMachines) that use Gomega Expect() assertions internally. When a cluster is being deleted, these assertions can fail with "not found" errors. Since these dumps are purely informational, their failures should not mark the test spec as failed or cause panics.

The previous fix used InterceptGomegaFailure() to catch these failures, but that function is not goroutine-safe: it temporarily replaces the global Gomega fail handler, which can race with assertions in the test's main goroutine. This caused [PANICKED] failures when Eventually() timeouts in the test goroutine hit the intercepted handler, which panics with "stop execution" instead of calling ginkgo.Fail().

Replace InterceptGomegaFailure with a goroutine-aware custom fail handler registered via RegisterFailHandler. The handler uses a sync.Map to track diagnostic goroutine IDs. When a Gomega assertion fails:

- In a diagnostic goroutine: panic with a sentinel value (without ever calling ginkgo.Fail), caught by a per-call recover() which logs a warning.
- In any other goroutine: delegate to ginkgo.Fail normally.

This ensures diagnostic dump failures are silently absorbed without affecting global state or racing with test assertions.

Ref: https://kubernetes.slack.com/archives/CD6U2V71N/p1758795545213209
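The handler described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from this PR: `goroutineID`, `diagnosticFailure`, `newFailHandler`, and `runDiagnostic` are hypothetical names, the goroutine ID is obtained with the common `runtime.Stack` parsing trick (an assumption about how the ID is derived), and a plain function stands in for `ginkgo.Fail`.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"strconv"
	"sync"
)

// goroutineID parses the current goroutine's ID from the first line of its
// stack trace ("goroutine 123 [running]:"). Hypothetical helper.
func goroutineID() uint64 {
	buf := make([]byte, 64)
	buf = buf[:runtime.Stack(buf, false)]
	buf = bytes.TrimPrefix(buf, []byte("goroutine "))
	if i := bytes.IndexByte(buf, ' '); i >= 0 {
		buf = buf[:i]
	}
	id, _ := strconv.ParseUint(string(buf), 10, 64)
	return id
}

// diagnosticFailure is the sentinel panic value for absorbed failures.
type diagnosticFailure struct{ msg string }

// diagnosticGoroutines tracks the IDs of goroutines running diagnostic dumps.
var diagnosticGoroutines sync.Map

// newFailHandler returns a handler with the same shape as a Gomega fail
// handler. realFail stands in for ginkgo.Fail in this sketch.
func newFailHandler(realFail func(string, ...int)) func(string, ...int) {
	return func(message string, callerSkip ...int) {
		if _, ok := diagnosticGoroutines.Load(goroutineID()); ok {
			// Diagnostic goroutine: never reach realFail, just unwind.
			panic(diagnosticFailure{msg: message})
		}
		realFail(message, callerSkip...)
	}
}

// runDiagnostic marks the current goroutine as diagnostic for the duration
// of fn and absorbs sentinel panics, returning the absorbed message.
func runDiagnostic(fn func()) (absorbed string) {
	id := goroutineID()
	diagnosticGoroutines.Store(id, struct{}{})
	defer diagnosticGoroutines.Delete(id)
	defer func() {
		if r := recover(); r != nil {
			f, ok := r.(diagnosticFailure)
			if !ok {
				panic(r) // unrelated panic: propagate unchanged
			}
			absorbed = f.msg // a real suite would log a warning here
		}
	}()
	fn()
	return ""
}

func main() {
	var failed []string
	handler := newFailHandler(func(msg string, _ ...int) { failed = append(failed, msg) })

	var wg sync.WaitGroup
	var absorbed string
	wg.Add(1)
	go func() {
		defer wg.Done()
		absorbed = runDiagnostic(func() { handler("cluster not found") })
	}()
	wg.Wait()

	handler("real test failure")
	fmt.Println(absorbed, failed)
}
```

Because the decision is made per goroutine inside one permanently registered handler, nothing global is swapped while a dump runs, which is the property the commit message relies on.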

…duling

During self-hosted e2e tests, the CAPA controller pod can be evicted from a control plane node during upgrade drain and rescheduled to a worker node. The previous gcr.io/k8s-staging-cluster-api/capa-manager:e2e image reference is a local-only tag that doesn't exist on any registry, so if the kubelet garbage-collects the pre-loaded image, the pod enters ImagePullBackOff permanently.

Fix this by dynamically rewriting the CAPA provider component image replacement to use the ECR Public URL (where the image is already pushed by ensureTestImageUploaded). With imagePullPolicy: IfNotPresent, the kubelet uses the local cache when available and falls back to pulling from ECR Public if needed. The ECR-tagged image is also added to the Kind bootstrap images list for consistency.
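As a rough illustration of the image rewrite, the sketch below swaps the local-only tag for a registry-backed reference in a manifest string. It is not the PR's implementation: `rewriteManagerImage` is a hypothetical helper, the ECR Public URL is a made-up placeholder, and the real change operates on the e2e provider component image replacement rather than on raw manifest text.

```go
package main

import (
	"fmt"
	"strings"
)

// localImage is the local-only tag named in the commit message.
const localImage = "gcr.io/k8s-staging-cluster-api/capa-manager:e2e"

// rewriteManagerImage replaces every occurrence of the local-only image
// with a pullable one. Hypothetical helper for illustration.
func rewriteManagerImage(manifest, pullableImage string) string {
	return strings.ReplaceAll(manifest, localImage, pullableImage)
}

func main() {
	manifest := "containers:\n- name: manager\n  image: " + localImage +
		"\n  imagePullPolicy: IfNotPresent\n"

	// Placeholder URL: the real value would come from the test setup.
	ecrImage := "public.ecr.aws/example-alias/capa-manager:e2e"

	fmt.Print(rewriteManagerImage(manifest, ecrImage))
}
```

With `imagePullPolicy: IfNotPresent` left unchanged, a node that still has the image cached never pulls, and a node that lost it can pull from the registry.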

…it race

Lower the healthy threshold from 5 to 2 and the health check interval from 10s to 5s, reducing the time for the target group to mark the API server as healthy from 50s to 10s. This leaves a comfortable 50s margin within kubeadm's default 60s kubernetesAPICallTimeout, preventing the upload-config/kubeadm phase from hitting a context deadline when the ELB has not yet started forwarding traffic.

Also increase the unhealthy threshold from 3 to 6 to compensate for the shorter interval and avoid flapping during transient hiccups.

Add a new `DNSResolutionCheck` field to AWSLoadBalancerSpec that allows users to control whether the provider verifies DNS resolution of the API server LB before marking the load balancer as ready and setting the control plane endpoint.

By default (when unset), the DNS lookup is performed, because a fully ready and routable load balancer is necessary before proceeding with cluster bootstrapping. This is now a strict prerequisite for CAPI clusters installed and bootstrapped via the CAPI kubeadm bootstrap provider, given the stricter requirement recently introduced by kubeadm (see: kubernetes/kubeadm#3294).

Setting the field to "None" skips the check entirely, which is useful in environments with no kubeadm requirements, slow DNS propagation, custom resolvers, or private hosted zones where the controller node may not be able to resolve the ELB's FQDN despite it being valid.

damdo · 2026-04-02T19:15:45Z

/test pull-cluster-api-provider-aws-e2e

damdo · 2026-04-03T08:36:30Z

Flakes

/test pull-cluster-api-provider-aws-e2e

k8s-ci-robot · 2026-04-03T13:27:57Z

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-e2e-eks	`507499a`	link	false	`/test pull-cluster-api-provider-aws-e2e-eks`
pull-cluster-api-provider-aws-e2e	`0512411`	link	false	`/test pull-cluster-api-provider-aws-e2e`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) labels Mar 10, 2026

k8s-ci-robot requested a review from fiunchinho March 10, 2026 20:48

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Mar 10, 2026

k8s-ci-robot requested a review from serngawy March 10, 2026 20:48

k8s-ci-robot added the needs-priority label Mar 10, 2026

k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Mar 10, 2026

k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Mar 14, 2026

damdo mentioned this pull request Mar 17, 2026

fix: skip IMDS crawl in DataSourceEc2KubernetesLocal init-local phase kubernetes-sigs/image-builder#1951

Merged

damdo force-pushed the capi-1.12-bump-debug branch from ff78ddf to 2db02c8 Compare March 30, 2026 19:58

damdo force-pushed the capi-1.12-bump-debug branch from 2db02c8 to b51760f Compare March 31, 2026 13:13

damdo force-pushed the capi-1.12-bump-debug branch from b51760f to aab2eb4 Compare March 31, 2026 15:33

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 1, 2026

damdo force-pushed the capi-1.12-bump-debug branch from aab2eb4 to 978d143 Compare April 2, 2026 18:34

clebs and others added 11 commits April 2, 2026 21:08

deps: update e2e testing versions and envtest

f625a85

Signed-off-by: Borja Clemente <bclement@redhat.com>

fix(e2e): use correct AMI k8s version

ba0265f

Signed-off-by: Borja Clemente <bclement@redhat.com>

test(e2e): Add v1.12 to CAPI version to contract map

f4989ac

Signed-off-by: Borja Clemente <bclement@redhat.com>

fix: flatcar log collection

8e139e5

fix SpotMarketOptions comparison

2dd81cc

chore: bump calico to v3.31.4 (to support k8s 1.34)

1cbe838

damdo force-pushed the capi-1.12-bump-debug branch from 978d143 to 0512411 Compare April 2, 2026 19:15

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 2, 2026


3 participants