
DNM: test gc fix #5931

Open
faiq wants to merge 18 commits into kubernetes-sigs:main from faiq:faiq/fix-gc-test

Conversation

@faiq
Contributor

@faiq faiq commented Apr 6, 2026

What type of PR is this?
stacked on the capi bump #5857

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:


clebs and others added 18 commits April 3, 2026 11:12
- Update cluster-api to v1.12.2
- Update kubernetes dependencies to v0.34.3:
  - k8s.io/api
  - k8s.io/apiextensions-apiserver
  - k8s.io/apimachinery
  - k8s.io/client-go
  - k8s.io/component-base
- Update controller-runtime to v0.22.5
- Update ginkgo to v2.27.2
- Update gomega to v1.38.2
- Regenerate base CRDs

Signed-off-by: Borja Clemente <bclement@redhat.com>
Signed-off-by: Borja Clemente <borja.clemente@gmail.com>
As we update to kubernetes 1.33+ and introduce the AMI AL2023 templates,
we need to use the NodeadmConfig instead of EKSConfig for bootstrapping.

Signed-off-by: Borja Clemente <borja.clemente@gmail.com>
* test: fix mmp tests

* fix(e2e): remove check to use old secret generation method because templates always use nodeadm now

* fix: remove field cloud init field for machinepool

* fix(e2e): adds al2023 explicitly for AMI type
* feat: convert default eks templates to use nodeadm by default

* test: removes redundant nodeadm test
Background diagnostic goroutines (resource dump every 5s, machine dump
every 60s) call upstream CAPI framework functions (DumpAllResources,
DumpMachines) that use Gomega Expect() assertions internally. When a
cluster is being deleted, these assertions can fail with "not found"
errors. Since these dumps are purely informational, their failures
should not mark the test spec as failed or cause panics.

The previous fix used InterceptGomegaFailure() to catch these failures,
but that function is not goroutine-safe: it temporarily replaces the
global Gomega fail handler, which can race with assertions in the test's
main goroutine. This caused [PANICKED] failures when Eventually()
timeouts in the test goroutine hit the intercepted handler, which panics
with "stop execution" instead of calling ginkgo.Fail().

Replace InterceptGomegaFailure with a goroutine-aware custom fail
handler registered via RegisterFailHandler. The handler uses a sync.Map
to track diagnostic goroutine IDs. When a Gomega assertion fails:
- In a diagnostic goroutine: panic with a sentinel value (without ever
  calling ginkgo.Fail), caught by a per-call recover() which logs a
  warning.
- In any other goroutine: delegate to ginkgo.Fail normally.

This ensures diagnostic dump failures are silently absorbed without
affecting global state or racing with test assertions.

Ref: https://kubernetes.slack.com/archives/CD6U2V71N/p1758795545213209
…duling

During self-hosted e2e tests, the CAPA controller pod can be evicted
from a control plane node during upgrade drain and rescheduled to a
worker node. The previous gcr.io/k8s-staging-cluster-api/capa-manager:e2e
image reference is a local-only tag that doesn't exist on any registry,
so if the kubelet garbage-collects the pre-loaded image, the pod enters
ImagePullBackOff permanently.

Fix this by dynamically rewriting the CAPA provider component image
replacement to use the ECR Public URL (where the image is already
pushed by ensureTestImageUploaded). With imagePullPolicy: IfNotPresent,
the kubelet uses the local cache when available and falls back to
pulling from ECR Public if needed. The ECR-tagged image is also added
to the Kind bootstrap images list for consistency.
…it race

Lower the healthy threshold from 5 to 2 and the health check interval
from 10s to 5s, reducing the time for the target group to mark the API
server as healthy from 50s to 10s. This leaves a comfortable 50s margin
within kubeadm's default 60s kubernetesAPICallTimeout, preventing the
upload-config/kubeadm phase from hitting a context deadline when the ELB
has not yet started forwarding traffic.

Also increase the unhealthy threshold from 3 to 6 to compensate for the
shorter interval and avoid flapping during transient hiccups.
Add a new `DNSResolutionCheck` field to AWSLoadBalancerSpec that allows
users to control whether the provider verifies DNS resolution of the API
server LB before marking the load balancer as ready and setting the
control plane endpoint.

By default (when the field is unset), the DNS lookup is performed. The check now runs by default because a fully ready and routable load balancer is required before proceeding with cluster bootstrapping. This has become a strict prerequisite for CAPI clusters installed and bootstrapped via the CAPI kubeadm bootstrap provider, given recent changes in kubeadm that made this requirement stricter (see:
kubernetes/kubeadm#3294)

Setting the field to "None" skips the check entirely, which is useful in
environments with no kubeadm requirements, slow DNS propagation, custom
resolvers, or private hosted zones where the controller node may not be
able to resolve the ELB's FQDN despite it being valid.
…ration

Introduce PendingInstanceRequeue (1s) to replace DefaultReconcilerRequeue
(30s) for the pending-instance polling loop. This reduces the worst-case
delay between an EC2 instance reaching Running state and the controller
registering it with the API server NLB target group.

AWS NLB target registration has an internal propagation delay of
90–180s before health checks begin and traffic is routed. With the
previous 30s requeue, the total time from instance creation to NLB
readiness left insufficient margin for kubeadm's 60-second
upload-config timeout. Measured CI timelines showed failures missing
the deadline by only 2–24s; polling every 1s instead of 30s recovers
~29s of that margin, shifting most clusters from failure to success.
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Apr 6, 2026
@k8s-ci-robot k8s-ci-robot requested review from nrb and serngawy April 6, 2026 21:40
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dlipovetsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-priority cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 6, 2026
@faiq
Contributor Author

faiq commented Apr 6, 2026

/test pull-cluster-api-provider-aws-e2e

1 similar comment
@faiq
Contributor Author

faiq commented Apr 6, 2026

/test pull-cluster-api-provider-aws-e2e

@faiq
Contributor Author

faiq commented Apr 6, 2026

/retest

@faiq
Contributor Author

faiq commented Apr 7, 2026

/test pull-cluster-api-provider-aws-e2e-blocking

@faiq
Contributor Author

faiq commented Apr 7, 2026

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot
Contributor

@faiq: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-aws-e2e
Commit: fb86551
Required: false
Rerun command: /test pull-cluster-api-provider-aws-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


4 participants