no-jira: test/monitoring: increase load balancer readiness and curl connection timeout#30994
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@tthvo: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughConditionally extend the HTTP readiness timeout to 20 minutes for AWS regions with prefix "eusc-"; add Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (2 inconclusive)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Scheduling required tests: |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/monitor/backenddisruption/disruption_backend_sampler.go`:
- Around line 235-237: Validate the timeout in BackendSampler.WithTimeout and
refuse non-positive values: check if timeout <= 0 and if so do not set b.timeout
(return b unchanged) so callers cannot accidentally disable timeouts; otherwise
set b.timeout = &timeout and return b. This ensures BackendSampler.WithTimeout
enforces a positive time.Duration (use the existing BackendSampler and
WithTimeout identifiers and the b.timeout field).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 304a4f0e-faeb-43e6-99fa-921457c4fef5
📒 Files selected for processing (2)
pkg/monitor/backenddisruption/disruption_backend_sampler.gopkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
✅ Files skipped from review due to trivial changes (1)
- pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
|
Scheduling required tests: |
|
/test e2e-aws-ovn-fips |
|
/retitle no-jira: test/monitoring: increase load balancer readiness and curl connection timeout |
|
A couple of
I think the slow DNS propagation can be delegated to AWS to improve their infrastructure in the near future (this EUSC cloud is net new - Jan 2026). However, I am taking the proactive approach here and this should help better avoid flakes 🤔 |
|
Scheduling required tests: |
|
/test e2e-gcp-csi |
The service-type-load-balancer-availability monitor test was timing out during preparation when waiting for the load balancer to become reachable. DNS SOA records for AWS ELB in EUSC region show a 900s (15-minute) retry interval, which can cause the load balancer hostname resolution to take longer than the previous 10-minute timeout. Increase the timeout from 10 to 20 minutes to accommodate slow DNS propagation in regions with high TTL values.
The service-load-balancer-with-pdb disruption test was experiencing DNS resolution failures when connecting to the load balancer during test execution. AWS ELB DNS zones have an SOA TTL of 900 seconds, which means DNS propagation can take up to 15 minutes for newly created load balancers. This change adds a WithTimeout() method to BackendSampler and sets a 120-second timeout for the service load balancer disruption samplers, allowing sufficient time for DNS retries during the propagation period. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/monitor/backenddisruption/disruption_backend_sampler.go (1)
234-238: Document the one-time effect ofWithTimeout.Because
GetHTTPClient()is initialized viasync.Once, callingWithTimeoutafter the first request won’t change runtime behavior. Please state this explicitly in the method doc to prevent misuse.Proposed doc update
-// WithTimeout sets a custom timeout for HTTP requests including DNS resolution and connection establishment +// WithTimeout sets a custom timeout for HTTP requests including DNS resolution and connection establishment. +// Call this before the first request; once the HTTP client is initialized, later timeout updates are ignored. func (b *BackendSampler) WithTimeout(timeout time.Duration) *BackendSampler { b.timeout = &timeout return b }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/monitor/backenddisruption/disruption_backend_sampler.go` around lines 234 - 238, The docstring for BackendSampler.WithTimeout must state that the timeout is only effective if called before the HTTP client is initialized because GetHTTPClient() uses sync.Once; update the comment on WithTimeout to explicitly mention that calling WithTimeout after the first GetHTTPClient()/request will have no effect at runtime, so users must call it prior to the first request or client initialization; reference BackendSampler.WithTimeout and the GetHTTPClient sync.Once initialization in the description to make this one-time behavior clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/monitor/backenddisruption/disruption_backend_sampler.go`:
- Around line 234-238: The docstring for BackendSampler.WithTimeout must state
that the timeout is only effective if called before the HTTP client is
initialized because GetHTTPClient() uses sync.Once; update the comment on
WithTimeout to explicitly mention that calling WithTimeout after the first
GetHTTPClient()/request will have no effect at runtime, so users must call it
prior to the first request or client initialization; reference
BackendSampler.WithTimeout and the GetHTTPClient sync.Once initialization in the
description to make this one-time behavior clear.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e6cbe906-e387-4795-b4a9-1b99d68c68c1
📒 Files selected for processing (2)
pkg/monitor/backenddisruption/disruption_backend_sampler.gopkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/monitortests/network/disruptionserviceloadbalancer/monitortest.go
|
/approve |
|
Scheduling required tests: |
|
See e2e-aws-eusc-techpreview. Back in a few hours to check 👀 |
|
/verified by e2e and #30994 (comment) The EUSC run passed the test as expeced 👍 |
|
@tthvo: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/lgtm |
|
@neisw: Overrode contexts on behalf of neisw: ci/prow/e2e-vsphere-ovn-upi DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: neisw, tthvo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/tide refresh |
|
@tthvo: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR increases the timeout for load balancer readiness and curl connection to address slow DNS propagation on cloud provider, for example, AWS EU Sovereign Cloud.
Background
AWS Load balancer DNS SOA records for AWS ELB in EUSC region show a 900s (15-minute) retry interval, which can cause the load balancer hostname resolution (if first hit
NXDOMAIN) to take longer than the previous 10-minute timeout. Additionally, it can take a few more minutes for DNS to fully propagate to its nameserver.References
We are seeing a similar issue in CCM hairpin tests: openshift/cluster-cloud-controller-manager-operator#449.
Summary by CodeRabbit
New Features
Improvements