fix(e2e): Fix flaky Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad #7502
Open
…untimeConfigLoad

When a service fails during WaitReady (the container starts but then crashes due to runtime config validation), it remains registered in the scenario's services slice. The next attempt to start a service with the same name then fails with "another service with the same name has already been started".

Fix by:
1. Calling s.Stop() after expected StartAndWaitReady failures to unregister the service before retrying with a new instance.
2. Making ConcreteService.Stop() and Kill() tolerant of already-removed containers (started with --rm flag) by treating "No such container" errors as successful stops.

Signed-off-by: Ben Ye <benye@amazon.com>
What this PR does
Fixes a flaky integration test, Test_VerifyComponentsAreSuccessfullyStarted_WithRuntimeConfigLoad, that intermittently fails with "another service with the same name has already been started".

Root Cause
The test intentionally starts services (querier, ruler, distributor) with invalid config (-distributor.shard-by-all-labels=false), expecting them to fail, then retries with valid config. When StartAndWaitReady fails during WaitReady (the container starts but then crashes), the service remains registered in the scenario's services slice. The subsequent attempt to start a new service with the same name fails because isRegistered() returns true.

This is a race condition: if the container crashes fast enough that Start() itself fails, the service is never registered and the retry works. But if the container starts successfully and then crashes during WaitReady, it stays registered.

Fix
1. Call s.Stop() after each expected StartAndWaitReady failure to properly unregister the service before retrying.
2. Make ConcreteService.Stop() and Kill() tolerant of already-removed containers (started with the --rm flag) by treating "No such container" errors as successful operations.

How was this tested
go build ./integration/... passes
go test ./integration/e2e/... -count=1 -short passes

Observed in: https://github.com/cortexproject/cortex/actions/runs/25643063397/job/75267056134
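
For reviewers who want the mechanism in one place, here is a minimal, self-contained sketch of the failure mode and of both parts of the fix. It is not the code from this PR or from the e2e framework: the scenario and service types, the crashOnWait and removed fields, the stopContainer helper, and the simulated runtime error string are all hypothetical stand-ins chosen for illustration. Only the names StartAndWaitReady, Stop, and isRegistered, and the two quoted error messages, come from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// service is a hypothetical stand-in for an e2e service backed by a container.
type service struct {
	name        string
	crashOnWait bool // simulates a container that starts, then crashes before becoming ready
	removed     bool // simulates a container started with --rm that is already gone
}

// scenario is a hypothetical stand-in for the test scenario that tracks started services.
type scenario struct {
	services []*service
}

func (s *scenario) isRegistered(name string) bool {
	for _, svc := range s.services {
		if svc.name == name {
			return true
		}
	}
	return false
}

// StartAndWaitReady registers the service as soon as it has started, then waits
// for readiness. A readiness failure leaves the service registered.
func (s *scenario) StartAndWaitReady(svc *service) error {
	if s.isRegistered(svc.name) {
		return errors.New("another service with the same name has already been started")
	}
	s.services = append(s.services, svc)
	if svc.crashOnWait {
		svc.removed = true // the crashed container was cleaned up by --rm
		return fmt.Errorf("service %s did not become ready", svc.name)
	}
	return nil
}

// stopContainer simulates asking the container runtime to stop a container.
// The error text is an assumption modelled on the message quoted in the PR.
func stopContainer(svc *service) error {
	if svc.removed {
		return fmt.Errorf("Error response from daemon: No such container: %s", svc.name)
	}
	svc.removed = true
	return nil
}

// Stop treats "No such container" as a successful stop and then unregisters
// the service so that its name can be reused.
func (s *scenario) Stop(svc *service) error {
	if err := stopContainer(svc); err != nil && !strings.Contains(err.Error(), "No such container") {
		return err
	}
	for i, registered := range s.services {
		if registered == svc {
			s.services = append(s.services[:i], s.services[i+1:]...)
			break
		}
	}
	return nil
}

func main() {
	s := &scenario{}
	bad := &service{name: "querier", crashOnWait: true}

	fmt.Println("first attempt:", s.StartAndWaitReady(bad))
	// Skipping this Stop call would make the retry hit the duplicate-name error.
	fmt.Println("stop:", s.Stop(bad))
	fmt.Println("retry:", s.StartAndWaitReady(&service{name: "querier"}))
}
```

In this model the flakiness corresponds to whether the failing service reaches the append before it crashes. Calling Stop after every expected failure makes the retry independent of that timing, and the tolerant error check keeps Stop from failing when the container has already been removed.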