Skip to content

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164

Open
sg00dwin wants to merge 2 commits into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix
Open

OCPBUGS-38676: Add retry for transient API errors to prevent Degraded blips#1164
sg00dwin wants to merge 2 commits into
openshift:mainfrom
sg00dwin:OCPBUGS-38676-degraded-blips-fix

Conversation

@sg00dwin
Copy link
Copy Markdown
Member

@sg00dwin sg00dwin commented May 27, 2026

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
    • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
    • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability by adding retry-on-transient-error handling to many console controllers and operator sync flows for create/update/delete/apply operations, reducing spurious failures and improving reconciliation robustness.
  • Tests

    • Added unit tests covering retry/backoff behavior and distinguishing retryable vs non-retryable errors.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 70ec601c-07b3-49f2-a670-dcba4fd8923a

📥 Commits

Reviewing files that changed from the base of the PR and between 58cf789 and 71b7e78.

📒 Files selected for processing (2)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/operator/sync_v400.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/operator/sync_v400.go

Walkthrough

Adds a transient-error retry utility and applies it across operator sync and controller write operations, replacing single-attempt create/update/apply/delete calls with retries and propagating final retry errors into existing failure-reason flows.

Changes

Transient Error Retry Integration

Layer / File(s) Summary
Retry infrastructure
pkg/console/controllers/util/retry.go, pkg/console/controllers/util/retry_test.go
Defines TransientBackoff, IsRetryableError, and RetryOnTransientError(fn) with V(4) logging; adds unit tests for classification and retry behavior.
Operator helper & sync changes
pkg/console/operator/retry.go, pkg/console/operator/sync_v400.go
Adds an operator-local retryOnTransientError delegating to the controller retry util; wraps ConfigMap/Secret/Deployment apply and status update calls in retry wrappers and removes prior conflict-retry import.
CLI downloads create/update
pkg/console/controllers/clidownloads/controller.go
Wraps CLIDownload create and update flows in retry closures; update path re-gets resource inside retry before setting Spec/metadata and calling Update.
Downloads Deployment apply
pkg/console/controllers/downloadsdeployment/controller.go
Moves resourceapply.ApplyDeployment into a retry closure and captures the applied Deployment and changed flag from the retry wrapper.
OAuth clients & secret
pkg/console/controllers/oauthclients/oauthclients.go, pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
Wraps oauth apply and console-oauth-config Secret apply in retry-on-transient wrappers; adds util import to secret controller.
PodDisruptionBudget apply
pkg/console/controllers/poddisruptionbudget/controller.go
Wraps ApplyPodDisruptionBudget in retry wrapper and uses the wrapped error when recording PDBSync conditions and flushing status.
Service and ServiceAccount applies
pkg/console/controllers/service/controller.go, pkg/console/controllers/serviceaccounts/controller.go
Retries Service apply (including redirect service delete/apply) and ServiceAccount apply using retry wrappers, mapping retry errors into existing failure-reason returns.
StorageVersionMigration & UpgradeNotification
pkg/console/controllers/storageversionmigration/controller.go, pkg/console/controllers/upgradenotification/controller.go
Adds util import and wraps SVM deletion and ConsoleNotification create/delete flows in retry wrappers, returning existing failure reasons when retry ultimately fails.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description is comprehensive and detailed, covering root cause analysis, solution description, testing approach, and includes new utility tests. However, it does not follow the repository's required description template structure with all mandatory sections (Test setup, Test cases, Browser conformance, Reviewers and assignees). While the content is substantive, consider restructuring the description to match the repository template with explicit sections for Test setup, Test cases, Browser conformance checklist, and Reviewers/assignees sections.
Test Structure And Quality ❓ Inconclusive The check specifies "Review Ginkgo test code" but this repo uses standard Go testing (testing.T), not Ginkgo. The PR's new retry_test.go file uses standard Go patterns, not Ginkgo. Clarify if check applies to standard Go tests or only Ginkgo-based tests. Repository uses only standard Go testing—no Ginkgo in go.mod or anywhere in codebase.
✅ Passed checks (12 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding retry logic for transient API errors to prevent Degraded status flips, matching the core objective of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the new retry_test.go file are static, deterministic, and descriptive with no dynamic content (UUIDs, timestamps, pod names, etc.).
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests added. New file retry_test.go contains standard Go unit tests, not Ginkgo patterns. Changes are controller modifications wrapping API calls with retry logic.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR contains no new Ginkgo e2e tests. It adds only Go unit tests (TestIsRetryableError, TestRetryOnTransientError) and modifies controllers with retry logic for transient errors.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces only retry logic for transient API errors. No new scheduling constraints added; pre-existing nodeSelector constraints are from earlier commits.
Ote Binary Stdout Contract ✅ Passed Only logging added is klog.V(4).Infof in retry.go, which writes to stderr by default. All calls are in reconciliation methods, not process-level code like main/init/TestMain.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the PR only includes Go unit tests (retry_test.go) and utility/controller modifications to add retry logic for transient API errors.
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto, or insecure secret comparisons found. All crypto usage relies on standard Go crypto/rand.
Container-Privileges ✅ Passed PR contains only Go code changes to add retry logic. No container privilege flags, privileged mode, hostPID/Network/IPC, SYS_ADMIN caps, or root-related settings in manifests or code.
No-Sensitive-Data-In-Logs ✅ Passed The PR adds a single debug-level (V(4)) log statement that logs Kubernetes API errors, which contain only status codes and error reasons, not sensitive data like passwords, tokens, or PII.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jhadvig and spadgett May 27, 2026 18:31
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sg00dwin
Once this PR has been reviewed and has the lgtm label, please assign jhadvig for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sg00dwin
Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 27, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

The console ClusterOperator intermittently flips Degraded=True during upgrade and CI runs when transient API errors (conflicts, timeouts, connection refused) hit resourceapply.Apply*() calls. The error flows directly to the status handler with no retry, setting Degraded=True for ~1 minute until the next sync resolves it. This causes CI test failures (8.7% failure rate in OCP 5.0).

This PR adds a shared retry utility and wraps all API write operations that can trigger Degraded conditions, absorbing transient errors before they reach status reporting.

Changes

  • New: pkg/console/controllers/util/retry.go — shared RetryOnTransientError() with backoff (3 steps, 500ms base, 2x factor, ~3.5s max). Retries all errors except permanent ones (Forbidden, Invalid, AlreadyExists, etc.).
  • New: pkg/console/controllers/util/retry_test.go — 10 unit tests covering error classification and retry behavior.
  • New: pkg/console/operator/retry.go — thin wrapper for the operator package.
  • Modified: Wrapped 16 API write call sites across 10 files:
  • sync_v400.go — ConfigMap, ServiceCA, TrustedCA, Deployment, ConsoleConfig, PublicConfig, SessionSecret
  • Individual controllers — PDB, Service, CLIDownloads, DownloadsDeployment, ServiceAccounts, OAuthClients, OAuthClientSecret, UpgradeNotification, StorageVersionMigration
  • Not modified: Healthcheck controller (already has its own retry from OCPBUGS-24041)

Testing

  • All existing unit tests pass
  • New retry utility tests cover: retryable vs permanent error classification, retry-then-succeed, retry exhaustion, generic network error retry
  • make test-unit passes clean
  • gofmt / go vet clean

Summary by CodeRabbit

  • Bug Fixes

  • Enhanced reliability by adding automatic retry logic for transient failures across multiple resource operations (create, update, delete, apply) in Kubernetes API interactions, reducing false failures.

  • Tests

  • Added comprehensive test coverage validating transient error retry behavior and backoff policies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/console/controllers/service/controller.go (1)

144-149: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate redirect sync errors instead of shadowing svcErr.

At Line 144, redirectSvcErrReason, svcErr := ... shadows the outer svcErr. Line 148 then returns the old outer value, so redirect failures are not returned from Sync.

Suggested fix
-	if c.serviceName == api.OpenShiftConsoleServiceName {
-		redirectSvcErrReason, svcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
-		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, svcErr))
-	}
-
-	return statusHandler.FlushAndReturn(svcErr)
+	if c.serviceName == api.OpenShiftConsoleServiceName {
+		redirectSvcErrReason, redirectSvcErr := c.SyncRedirectService(ctx, routeConfig, controllerContext)
+		statusHandler.AddConditions(status.HandleProgressingOrDegraded("RedirectServiceSync", redirectSvcErrReason, redirectSvcErr))
+		if redirectSvcErr != nil {
+			return statusHandler.FlushAndReturn(redirectSvcErr)
+		}
+	}
+
+	return statusHandler.FlushAndReturn(svcErr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/controllers/service/controller.go` around lines 144 - 149, The
redirect sync call is shadowing the outer svcErr (redirectSvcErrReason, svcErr
:= c.SyncRedirectService...), causing the function to return the old svcErr;
change the declaration to assign to the existing outer variable (use = instead
of :=) or use a distinct temporary name and then set the outer svcErr = the
returned error; ensure the call to SyncRedirectService (and the variables
redirectSvcErrReason and svcErr) updates the same svcErr that FlushAndReturn
uses and keep statusHandler.AddConditions using the updated svcErr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/console/operator/sync_v400.go`:
- Around line 517-527: The Create retry may succeed server-side then later
return apierrors.IsAlreadyExists, causing SyncServiceCAConfigMap and
SyncTrustedCAConfigMap to surface a create error and flip Degraded; update the
createErr handling after retryOnTransientError (the block that sets actual and
createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.
- Around line 264-270: The retry logic reuses stale object instances causing
Conflict retries to fail; inside retryOnTransientError closures for
SyncConsoleConfig (UpdateStatus) and for SyncServiceCAConfigMap /
SyncTrustedCAConfigMap (ConfigMaps Update) refetch the latest resource from the
API (not reuse the original updated/existing struct) before calling UpdateStatus
or Update so each attempt uses current resourceVersion; additionally, in the
Create wrappers used in those Sync* functions treat AlreadyExists as success by
fetching the existing object and proceeding (instead of returning FailedCreate),
and keep IsRetryableError behavior but ensure Update/UpdateStatus retries
operate on freshly fetched objects rather than the initial stale instance.

---

Outside diff comments:
In `@pkg/console/controllers/service/controller.go`:
- Around line 144-149: The redirect sync call is shadowing the outer svcErr
(redirectSvcErrReason, svcErr := c.SyncRedirectService...), causing the function
to return the old svcErr; change the declaration to assign to the existing outer
variable (use = instead of :=) or use a distinct temporary name and then set the
outer svcErr = the returned error; ensure the call to SyncRedirectService (and
the variables redirectSvcErrReason and svcErr) updates the same svcErr that
FlushAndReturn uses and keep statusHandler.AddConditions using the updated
svcErr.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 12887e8d-b1d9-4882-b7b9-b035820dd2e0

📥 Commits

Reviewing files that changed from the base of the PR and between c54e288 and 1a8cda1.

📒 Files selected for processing (13)
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/operator/retry.go
  • pkg/console/operator/sync_v400.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{yaml,yml,go}

📄 CodeRabbit inference engine (Custom checks)

Topology-Aware Scheduling Compatibility: Deployment manifests, operator code, and controllers should not introduce scheduling constraints that assume a standard HA topology with 3+ control-plane nodes and dedicated workers. Flag required pod anti-affinity without topology checks, topology spread constraints with DoNotSchedule that exceed schedulable node counts, replica counts derived from node count without considering SNO/TNF/TNA, nodeSelector targeting control-plane nodes, broad tolerations that could schedule to arbiter nodes on TNA, code targeting only worker nodes, or PodDisruptionBudgets designed for 3+ nodes.

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

  • Controller struct types (e.g., type *Controller struct)
  • func New*Controller( factory functions
  • factory.New().WithFilteredEventsInformers( pattern
  • .ToController( method calls
  • Sync(ctx context.Context, controllerContext factory.SyncContext) methods
  • operatorConfig.Spec.ManagementState checks
  • status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

  • Main operator sync functions (e.g., sync_v400.go content)
  • Sequential resource syncing with early returns
  • Incremental reconciliation loops
  • Multiple resourceapply.Apply*() calls in sequence
  • Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
  • Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

  • Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll
  • Deprecated patterns: Dial without DialContext
  • Error handling: missing %w in fmt.Errorf
  • Code smells: deep nesting (4+ levels), functions >100 lines
  • Magic values: unexplained numbers/strings
  • Context propagation: context.Background() instead of passed ctx
  • Missing godoc on exported functions

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

  • SQL: parameterized queries only; no string concatenation
  • Command: no shell=True, os.system, or backtick exec with user input
  • LDAP/XPath: escape special characters in filters
  • Path traversal: canonicalize paths, reject ../
  • Deserialization: no pickle/yaml.load()/eval on untrusted data
  • Prototype pollution: no recursive merge of untrusted objects
  • Validate at trust boundaries with allow-lists, not deny-lists
  • Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

  • pkg/console/controllers/downloadsdeployment/controller.go
  • pkg/console/controllers/upgradenotification/controller.go
  • pkg/console/controllers/serviceaccounts/controller.go
  • pkg/console/operator/retry.go
  • pkg/console/controllers/storageversionmigration/controller.go
  • pkg/console/controllers/oauthclients/oauthclients.go
  • pkg/console/controllers/service/controller.go
  • pkg/console/controllers/util/retry.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/controllers/clidownloads/controller.go
  • pkg/console/controllers/poddisruptionbudget/controller.go
  • pkg/console/controllers/util/retry_test.go
  • pkg/console/controllers/oauthclientsecret/oauthclientsecret.go
**/*_test.go

📄 CodeRabbit inference engine (Custom checks)

**/*_test.go: IPv6 and Disconnected Network Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that contain IPv4 assumptions such as hardcoded IPv4 addresses, IPv4-only parsing logic, IPv4-specific service/endpoint creation, or hardcoded IPv4 network policies. Also flag tests requiring external connectivity to public internet services, pulling images from public registries, or DNS resolution of public hostnames.
MicroShift Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that reference OpenShift-specific APIs unavailable on MicroShift (except Route and SecurityContextConstraints), such as Project/ProjectRequest, Build/BuildConfig, DeploymentConfig, ClusterOperator, ClusterVersion, Etcd operator, CSV/OLM, MachineSet/Machine, ClusterAutoscaler, Console, Monitoring components, ImageRegistry operator, Samples operator, OperatorHub, CloudCredential, or Storage/Network operators.
OTE Binary Stdout Contract: OpenShift Tests Extension (OTE) binaries must communicate with openshift-tests via JSON on stdout only. Flag any stdout writes in process-level code (main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers). Common violations include klog writing to stdout (must redirect to stderr via klog.SetOutput(os.Stderr) or klog.LogToStderr(true)), and fmt.Print/Println/Printf to stdout in main or suite setup.
Single Node OpenShift (SNO) Test Compatibility: When new Ginkgo e2e tests are added (It(), Describe(), Context(), When(), etc.), flag tests that make multi-node or HA assumptions such as expecting multiple control-plane nodes, multiple worker nodes, pod anti-affinity across nodes, node-to-node communication patterns, leader election failover across replicas, pod rescheduling to different nodes, node scaling operations, or separate node roles.
Stable and Deterministic Te...

Files:

  • pkg/console/controllers/util/retry_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Review test code for quality and patterns.

Refer to /unit-test-review when test is in pkg//*_test.go:**

  • Table-driven test structure with test cases
  • Use of go-test/deep for struct comparisons
  • Test naming conventions (TestFunctionName)
  • Error handling with wantErr pattern
  • Edge case coverage (nil, empty, boundary values)
  • Proper assertions with helpful error messages
  • Test isolation (no shared mutable state)

Refer to /e2e-test-review when test contains:

  • framework.MustNewClientset(t, nil) or similar e2e framework usage
  • wait.Poll or wait.PollImmediate patterns
  • retry.RetryOnConflict for updates
  • Cleanup via defer functions
  • Console/operator CR manipulations
  • Test assertions on cluster state

Suggest to use /e2e-test-review when:

  • PR adds new feature requiring e2e coverage
  • Test file is empty or skeleton
  • Comments indicate "TODO: add test"

Review for common issues:

  • Missing cleanup (defer statements)
  • Using time.Sleep instead of wait.Poll
  • Missing context timeouts
  • Vague error messages in assertions
  • Tests without table-driven structure when testing multiple cases
  • Ignoring errors with _
  • Tests without assertions

Files:

  • pkg/console/controllers/util/retry_test.go
🔇 Additional comments (12)
pkg/console/controllers/util/retry.go (1)

13-51: LGTM!

pkg/console/controllers/util/retry_test.go (1)

11-165: LGTM!

pkg/console/controllers/clidownloads/controller.go (1)

231-240: LGTM!

Also applies to: 257-266

pkg/console/controllers/downloadsdeployment/controller.go (1)

121-134: LGTM!

pkg/console/controllers/oauthclients/oauthclients.go (1)

229-233: LGTM!

pkg/console/controllers/oauthclientsecret/oauthclientsecret.go (1)

164-167: LGTM!

pkg/console/controllers/poddisruptionbudget/controller.go (1)

100-103: LGTM!

pkg/console/controllers/serviceaccounts/controller.go (1)

155-166: LGTM!

pkg/console/controllers/storageversionmigration/controller.go (1)

17-17: LGTM!

Also applies to: 107-109

pkg/console/controllers/upgradenotification/controller.go (1)

134-147: LGTM!

pkg/console/operator/retry.go (1)

8-10: LGTM!

pkg/console/operator/sync_v400.go (1)

277-285: LGTM!

Also applies to: 323-334, 434-441, 844-849

Comment thread pkg/console/operator/sync_v400.go
Comment on lines +517 to +527
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify create retry call sites.
rg -n "retryOnTransientError\\(func\\(\\) error \\{[\\s\\S]*?Create\\(" pkg/console/operator/sync_v400.go -U

# Verify how AlreadyExists is classified in retry utility.
rg -n "AlreadyExists|IsAlreadyExists|IsRetryableError" pkg/console/controllers/util/retry.go -C3

Repository: openshift/console-operator

Length of output: 11743


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show exact code around the flagged create retry blocks.
sed -n '500,590p' pkg/console/operator/sync_v400.go

# Locate and show retry helper definitions (name + implementation).
rg -n "retryOnTransientError|RetryOnTransientError" pkg/console/operator/sync_v400.go pkg/console/controllers/util/retry.go
sed -n '1,120p' pkg/console/controllers/util/retry.go

Repository: openshift/console-operator

Length of output: 5929


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "retryOnTransientError" pkg/console/operator/sync_v400.go pkg/console -S

Repository: openshift/console-operator

Length of output: 1718


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,160p' pkg/console/operator/retry.go

Repository: openshift/console-operator

Length of output: 276


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "\"FailedCreate\"|FailedCreate" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 233


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "SyncServiceCAConfigMap\\(|SyncTrustedCAConfigMap\\(" pkg/console/operator -S

Repository: openshift/console-operator

Length of output: 756


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '120,220p' pkg/console/operator/sync_v400.go

Repository: openshift/console-operator

Length of output: 4300


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func HandleProgressingOrDegraded\\(" -S pkg/console-operator pkg || true
rg -n "HandleProgressingOrDegraded\\(" -S pkg/console/operator pkg/console/controllers pkg | head

Repository: openshift/console-operator

Length of output: 2161


🏁 Script executed:

sed -n '80,180p' pkg/console/status/status.go

Repository: openshift/console-operator

Length of output: 4251


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "func IsSyncError\\(|IsSyncError\\(" pkg/console -S

Repository: openshift/console-operator

Length of output: 205


🏁 Script executed:

sed -n '1,120p' pkg/console/errors/sync_errors.go

Repository: openshift/console-operator

Length of output: 920


Avoid Degraded blips when ConfigMap Create succeeded but retry hits AlreadyExists

  • SyncServiceCAConfigMap and SyncTrustedCAConfigMap retry ConfigMaps(...).Create() via retryOnTransientError; apierrors.IsAlreadyExists is treated as non-retryable, so a second attempt that returns AlreadyExists leads to returning "FailedCreate".
  • With a transient failure after server-side success, this returns a non-nil error into status.HandleProgressingOrDegraded(...), setting Degraded=True for ServiceCASync/TrustedCASync even though the ConfigMap exists.
Suggested fix pattern
  createErr := retryOnTransientError(func() error {
    var e error
    actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
    return e
  })
+ if apierrors.IsAlreadyExists(createErr) {
+   actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
+   if createErr == nil {
+     return actual, "", nil
+   }
+ }
  if createErr == nil {
    return actual, "", nil
  }

Apply the same AlreadyExists -> Get handling to both the service-ca and trusted-ca Create retries in pkg/console/operator/sync_v400.go.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", err
} else {
return actual, "FailedCreate", err
return actual, "", nil
}
return actual, "FailedCreate", createErr
var actual *corev1.ConfigMap
createErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Create(ctx, required, metav1.CreateOptions{})
return e
})
if apierrors.IsAlreadyExists(createErr) {
actual, createErr = co.configMapClient.ConfigMaps(required.Namespace).Get(ctx, required.Name, metav1.GetOptions{})
if createErr == nil {
return actual, "", nil
}
}
if createErr == nil {
klog.V(4).Infoln("service-ca configmap created")
return actual, "", nil
}
return actual, "FailedCreate", createErr
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/console/operator/sync_v400.go` around lines 517 - 527, The Create retry
may succeed server-side then later return apierrors.IsAlreadyExists, causing
SyncServiceCAConfigMap and SyncTrustedCAConfigMap to surface a create error and
flip Degraded; update the createErr handling after retryOnTransientError (the
block that sets actual and createErr from retryOnTransientError calling
configMapClient.ConfigMaps(...).Create) to detect apierrors.IsAlreadyExists and
in that case call configMapClient.ConfigMaps(...).Get(ctx, required.Name,
metav1.GetOptions{}) to load the existing ConfigMap into actual and clear the
error (return actual, "", nil); apply the same AlreadyExists->Get pattern in
both SyncServiceCAConfigMap and SyncTrustedCAConfigMap so transient post-success
create races do not produce FailedCreate.

Copy link
Copy Markdown
Member Author

@sg00dwin sg00dwin May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] Same reasoning as the reply above — the Create paths here are only reached when the lister returned NotFound. AlreadyExists is already classified as non-retryable in IsRetryableError, so the retry stops immediately on that error with no wasted attempts.

The phantom-success scenario (server-side Create succeeds but client gets a timeout, retry hits AlreadyExists) is a narrow race that self-heals on the next 60-second reconcile. Adding a Get-on-AlreadyExists fallback would add complexity for an edge case that resolves itself within one sync cycle. The original code before this PR had the same behavior — AlreadyExists on Create was always returned as a failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sg00dwin, fair point — since AlreadyExists is already classified as non-retryable, the retry wrapper exits immediately on that error without looping, and the phantom-success race is both narrow and self-correcting within one reconcile cycle. The added complexity of a Get-on-AlreadyExists fallback isn't justified for such an edge case. Understood, withdrawing the suggestion.

@sg00dwin
Copy link
Copy Markdown
Member Author

/test e2e-aws-console

1 similar comment
@sg00dwin
Copy link
Copy Markdown
Member Author

/test e2e-aws-console

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
  blips

  Wrap all API write operations (Apply, Create, Update, Delete) with
  retry logic to absorb transient errors (conflicts, timeouts, connection
  refused) during upgrades before they reach status reporting. Without
  retry, a single transient failure sets Degraded=True for ~1 minute
  until the next sync resolves it, causing CI test failures.

  Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max).
  Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried.

  Assisted by Claude Code (Opus 4.6)
@sg00dwin sg00dwin force-pushed the OCPBUGS-38676-degraded-blips-fix branch from 1a8cda1 to 58cf789 Compare May 29, 2026 13:58
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026
Comment thread pkg/console/operator/sync_v400.go Outdated
@@ -263,14 +262,27 @@ func (co *consoleOperator) SyncConsoleConfig(ctx context.Context, consoleConfig
klog.V(4).Infof("updating console.config.openshift.io with url: %v", consoleURL)
updated := consoleConfig.DeepCopy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be in the closure, otherwise e.g on a 409 every retry will use the stale (conflicting data)

Comment thread pkg/console/operator/sync_v400.go Outdated
var actual *corev1.ConfigMap
updateErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, I think if existing is stale we will fail on 409s

Comment thread pkg/console/operator/sync_v400.go Outdated
var actual *corev1.ConfigMap
updateErr := retryOnTransientError(func() error {
var e error
actual, e = co.configMapClient.ConfigMaps(required.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update can also be stale on retry here

var actualCLIDownloads *v1.ConsoleCLIDownload
updateErr := controllersutil.RetryOnTransientError(func() error {
var e error
actualCLIDownloads, e = consoleClient.Update(ctx, existingCLIDownloadsCopy, metav1.UpdateOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything changes this underneath us we will 409, and the retry doesn't support another get :)

…de the retry closure before calling Update/UpdateStatus. Each retry attempt now uses the current resourceVersion instead of the original stale copy.

  - SyncConsoleConfig: re-fetches consoleConfig before UpdateStatus
  - SyncServiceCAConfigMap: re-fetches configmap and re-applies metadata before Update
  - SyncTrustedCAConfigMap: same pattern
  - ApplyCLIDownloads: re-fetches CLI downloads resource, re-applies spec and metadata before Update

Assisted by: Claude (Opus 4.6)
@sg00dwin
Copy link
Copy Markdown
Member Author

sg00dwin commented Jun 2, 2026

@theobarberbany
Updated all 4 sites to re-fetch the latest object via a live API Get inside the retry closure before calling
Update/UpdateStatus. Each retry attempt now uses the current resourceVersion instead of the original stale copy.

  • SyncConsoleConfig: re-fetches consoleConfig before UpdateStatus
  • SyncServiceCAConfigMap: re-fetches configmap and re-applies metadata before Update
  • SyncTrustedCAConfigMap: same pattern
  • ApplyCLIDownloads: re-fetches CLI downloads resource, re-applies spec and metadata before Updat

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

@sg00dwin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-console 71b7e78 link true /test e2e-aws-console

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants