fix(e2e): Fix flaky Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad #7502

Open
yeya24 wants to merge 1 commit into master from fix-runtime-config-flaky-test

yeya24 (Contributor) commented May 11, 2026

What this PR does

Fixes a flaky integration test Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad that intermittently fails with:

another service with the same name 'distributor' has already been started

Root Cause

The test intentionally starts services (querier, ruler, distributor) with an invalid config (-distributor.shard-by-all-labels=false), expects them to fail, and then retries with a valid config. When StartAndWaitReady fails during WaitReady (the container starts but then crashes), the service remains registered in the scenario's services slice. The subsequent attempt to start a new service with the same name fails because isRegistered() returns true.

This is a race condition that depends on how quickly the container crashes: if it crashes fast enough that Start() itself fails, the service is never registered and the retry works. But if the container starts successfully and then crashes during WaitReady, it stays registered and the retry hits the duplicate-name error.
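The registration behaviour described above can be modelled with a minimal, self-contained sketch. The `scenario` and `service` types, their fields, and `errCrashed` are hypothetical stand-ins written for illustration; they are not the e2e framework's actual code, and the real framework may order its steps differently. The sketch only assumes what the description states: a service is registered once its start phase succeeds, and a later readiness failure does not unregister it.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrashed is a hypothetical error standing in for "the container exited".
var errCrashed = errors.New("container exited before becoming ready")

// service is a hypothetical stand-in for an e2e service.
type service struct {
	name     string
	startErr error // failure in the start phase: the container never came up
	readyErr error // failure in the readiness phase: it came up, then crashed
}

// scenario is a hypothetical, simplified registry of started services.
type scenario struct {
	services []*service
}

// isRegistered reports whether a service with this name is in the registry.
func (s *scenario) isRegistered(name string) bool {
	for _, svc := range s.services {
		if svc.name == name {
			return true
		}
	}
	return false
}

// startAndWaitReady registers the service as soon as the start phase succeeds
// and only then waits for readiness, so a readiness failure leaves the
// service registered.
func (s *scenario) startAndWaitReady(svc *service) error {
	if s.isRegistered(svc.name) {
		return fmt.Errorf("another service with the same name '%s' has already been started", svc.name)
	}
	if svc.startErr != nil {
		return svc.startErr // never registered, so a retry works
	}
	s.services = append(s.services, svc)
	return svc.readyErr // may fail while the service stays registered
}

// stop removes the service from the registry so the name can be reused.
func (s *scenario) stop(svc *service) {
	for i, cur := range s.services {
		if cur == svc {
			s.services = append(s.services[:i], s.services[i+1:]...)
			return
		}
	}
}
```

In this model, a `service` with `readyErr` set blocks a same-named retry until `stop` is called on it, while a `service` with `startErr` set never blocks the retry. That asymmetry is what makes the failure intermittent.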

Fix

  1. Test fix: Call s.Stop() after each expected StartAndWaitReady failure to properly unregister the service before retrying.
  2. Framework fix: Make ConcreteService.Stop() and Kill() tolerant of already-removed containers (started with --rm flag) by treating "No such container" errors as successful operations.
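The framework-side tolerance in item 2 could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `stopContainer`, `runner`, `fakeRunner`, and `errCommandFailed` are hypothetical names, the command runner is injected so the example needs no Docker daemon, and matching on the substring "No such container" in the command output is assumed to be how the condition is detected.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errCommandFailed is a hypothetical error standing in for a non-zero exit.
var errCommandFailed = errors.New("exit status 1")

// runner executes a container CLI command and returns its combined output.
// It is injected so that this sketch does not need a container runtime.
type runner func(args ...string) (string, error)

// fakeRunner is a test double that always returns the given output and error.
func fakeRunner(out string, err error) runner {
	return func(args ...string) (string, error) { return out, err }
}

// isNoSuchContainer reports whether the output indicates that the container
// no longer exists, for example because it was started with --rm and has
// already exited.
func isNoSuchContainer(output string) bool {
	return strings.Contains(output, "No such container")
}

// stopContainer stops the named container and treats an already-removed
// container as a successful stop, since the desired end state is reached.
func stopContainer(name string, run runner) error {
	out, err := run("stop", name)
	if err == nil || isNoSuchContainer(out) {
		return nil
	}
	return fmt.Errorf("stopping container %s: %w (output: %s)", name, err, strings.TrimSpace(out))
}
```

Treating a missing container as success keeps Stop() idempotent, so the test can call it unconditionally after an expected failure without checking whether the container is still around. Any other failure is still returned to the caller.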

How was this tested

  • go build ./integration/... passes
  • go test ./integration/e2e/... -count=1 -short passes

Observed in: https://github.com/cortexproject/cortex/actions/runs/25643063397/job/75267056134

…untimeConfigLoad

When a service fails during WaitReady (container starts but crashes due to
runtime config validation), it remains registered in the scenario's services
slice. The next attempt to start a service with the same name then fails with
"another service with the same name has already been started".

Fix by:
1. Calling s.Stop() after expected StartAndWaitReady failures to unregister
   the service before retrying with a new instance.
2. Making ConcreteService.Stop() and Kill() tolerant of already-removed
   containers (started with --rm flag) by treating "No such container"
   errors as successful stops.

Signed-off-by: Ben Ye <benye@amazon.com>