Add heartbeat_max_test_duration to cap per-test heartbeat duration#397
Open
Add heartbeat_max_test_duration to cap per-test heartbeat duration#397
Conversation
…ost-test reclamation A stuck test would heartbeat forever (since PR #384 removed the countdown), preventing other workers from reclaiming it via reserve_lost. Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat thread stops ticking after N seconds per test. Once ticking stops the ZSET score goes stale and reserve_lost can steal the entry. The started_at timestamp is passed through the tick state from with_heartbeat so elapsed is measured from when the test actually started, not from when the heartbeat thread woke up (which could be up to 1 second late due to the @cond.wait(1) timeout causing a skewed stale threshold). Fixes #395
- test_heartbeat_cap_resets_between_tests: verifies :reset clears capped between consecutive tests so subsequent stuck tests are still reclaimable - test_heartbeat_cap_doesnt_affect_fast_test: verifies cap is a no-op when the test finishes before max_duration fires - test_heartbeat_max_test_duration_defaults: pins the default value logic (timeout*10 when heartbeat enabled, nil when disabled, explicit overrides)
- test_heartbeat_cap_resets_between_tests: verifies that the :reset command clears the capped flag between consecutive tests, so the second test's heartbeat starts fresh - test_heartbeat_cap_doesnt_affect_fast_tests: integration test confirming the cap is a no-op for fast tests (no warnings, correct count) - test_heartbeat_max_test_duration_defaults: unit test for the timeout*10 default and nil-when-disabled behavior
Adds test_heartbeat_cap_resets_between_tests as an integration test: - Worker 0 runs test_alpha (cap fires but test completes normally), then :reset clears the capped flag and worker 0 picks up test_beta - A thief (worker 1) starts only after test_beta is in `running` so it cannot grab it from the queue; it can only steal if test_beta goes stale - With reset working: test_beta heartbeat ticks until cap, finishes before stale → 0 warnings - With broken reset: test_beta has no heartbeat ticks, goes stale → stolen → 1 warning Also adds the consecutive_capped_tests.rb fixture (test_alpha=2s, test_beta=2.5s) sized for the cap=1s/heartbeat=2s parameter window.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
After #384 removed the heartbeat countdown (to fix the poisoning bug), stuck tests
ended up heartbeating forever -- other workers could never reclaim them via
reserve_lost. This PR adds a per-test heartbeat cap so the entry eventually goesstale and can be stolen.
What changed
--heartbeat-max-test-duration SECONDSCLI flag -- once a test has been runningfor this long, the heartbeat thread stops ticking. The ZSET score then goes stale
after
max_missed_heartbeat_secondsand another worker can pick it up.timeout * 10when heartbeat is enabled, so existing setups getreasonable behavior out of the box.
Timing gotcha
There was a subtle bug I hit while building this: the heartbeat thread can miss the
initial
:tickbroadcast (the condition variable fires before the thread callswait), so the first tick can land up to 1 second late. If you measure elapsed from"when the thread first woke up", the stale threshold ends up skewed by that 1 second
-- which can put it right at the moment the stuck test naturally finishes, leaving no
steal window.
The fix is to pass
started_at = Process.clock_gettime(CLOCK_MONOTONIC)through thetick state from
with_heartbeat, so the elapsed calculation is always anchored towhen the test actually started.
closes #395