Add heartbeat_max_test_duration to cap per-test heartbeat duration#397

Open
ianks wants to merge 4 commits into main from heartbeat-max-test-duration

ianks commented Apr 9, 2026

After #384 removed the heartbeat countdown (to fix the poisoning bug), stuck tests
ended up heartbeating forever -- other workers could never reclaim them via
reserve_lost. This PR adds a per-test heartbeat cap so the entry eventually goes
stale and can be stolen.

What changed

  • --heartbeat-max-test-duration SECONDS CLI flag -- once a test has been running
    for this long, the heartbeat thread stops ticking. The ZSET score then goes stale
    after max_missed_heartbeat_seconds and another worker can pick it up.
  • Defaults to timeout * 10 when heartbeat is enabled, so existing setups get
    reasonable behavior out of the box.
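
A minimal sketch of the default rule described above. The method name and keyword arguments are hypothetical and are not taken from the PR's code. It also assumes that an explicit value wins even when heartbeat is disabled, which the description does not spell out.

```ruby
# Hypothetical helper illustrating the default described above.
# Names and signature are assumptions, not the project's actual API.
def default_heartbeat_max_test_duration(timeout:, heartbeat_enabled:, explicit: nil)
  # An explicitly supplied --heartbeat-max-test-duration value always wins
  # (assumed precedence).
  return explicit unless explicit.nil?
  # Without a heartbeat there is nothing to cap.
  return nil unless heartbeat_enabled

  timeout * 10
end

default_heartbeat_max_test_duration(timeout: 30, heartbeat_enabled: true)
default_heartbeat_max_test_duration(timeout: 30, heartbeat_enabled: false)
default_heartbeat_max_test_duration(timeout: 30, heartbeat_enabled: true, explicit: 120)
```

Under these assumptions, a 30-second timeout with heartbeat enabled gives a 300-second cap, and a disabled heartbeat gives no cap at all.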

Timing gotcha

There was a subtle bug I hit while building this: the heartbeat thread can miss the
initial :tick broadcast (the condition variable fires before the thread calls
wait), so the first tick can land up to 1 second late. If elapsed time is measured
from "when the thread first woke up", the stale threshold is skewed by up to
1 second, which can put it right at the moment the stuck test naturally finishes,
leaving no steal window.

The fix is to pass started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
through the tick state from with_heartbeat, so the elapsed calculation is always
anchored to when the test actually started.
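
The anchoring idea can be sketched as follows. `monotonic_now` and `heartbeat_capped?` are hypothetical names for illustration, not the PR's methods. The example uses fixed timestamps so the 1-second skew is easy to see.

```ruby
# Hypothetical sketch: decide whether the heartbeat should stop ticking.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# started_at is captured by the caller when the test begins and handed to the
# heartbeat thread, rather than being sampled when the thread first wakes up.
def heartbeat_capped?(started_at:, max_duration:, now: monotonic_now)
  return false if max_duration.nil? # no cap configured

  (now - started_at) >= max_duration
end

test_started = 100.0 # when the test actually started
thread_woke  = 101.0 # heartbeat thread missed the first broadcast, woke 1s late
now          = 110.0 # 10 seconds into the test

anchored_to_start  = heartbeat_capped?(started_at: test_started, max_duration: 10, now: now)
anchored_to_wakeup = heartbeat_capped?(started_at: thread_woke, max_duration: 10, now: now)
```

With a 10-second cap, anchoring to the real start reports the cap as reached at the 10-second mark. Anchoring to the late wakeup would still report it as not reached, which pushes the stale threshold back by the same second.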

closes #395

ianks added 4 commits April 9, 2026 17:55
…ost-test reclamation

A stuck test would heartbeat forever (since PR #384 removed the countdown),
preventing other workers from reclaiming it via reserve_lost.

Add --heartbeat-max-test-duration N (defaults to timeout*10) so the heartbeat
thread stops ticking after N seconds per test. Once ticking stops the ZSET
score goes stale and reserve_lost can steal the entry.

The started_at timestamp is passed through the tick state from with_heartbeat
so elapsed is measured from when the test actually started, not from when the
heartbeat thread woke up (which could be up to 1 second late because of the
@cond.wait(1) timeout, skewing the stale threshold).

Fixes #395
- test_heartbeat_cap_resets_between_tests: verifies :reset clears capped
  between consecutive tests so subsequent stuck tests are still reclaimable
- test_heartbeat_cap_doesnt_affect_fast_test: verifies cap is a no-op when
  the test finishes before max_duration fires
- test_heartbeat_max_test_duration_defaults: pins the default value logic
  (timeout*10 when heartbeat enabled, nil when disabled, explicit overrides)
- test_heartbeat_cap_resets_between_tests: verifies that the :reset
  command clears the capped flag between consecutive tests, so the
  second test's heartbeat starts fresh
- test_heartbeat_cap_doesnt_affect_fast_tests: integration test
  confirming the cap is a no-op for fast tests (no warnings, correct count)
- test_heartbeat_max_test_duration_defaults: unit test for the
  timeout*10 default and nil-when-disabled behavior
Adds test_heartbeat_cap_resets_between_tests as an integration test:
- Worker 0 runs test_alpha (cap fires but test completes normally), then :reset
  clears the capped flag and worker 0 picks up test_beta
- A thief (worker 1) starts only after test_beta is in `running` so it cannot
  grab it from the queue; it can only steal if test_beta goes stale
- With reset working: test_beta heartbeat ticks until cap, finishes before stale → 0 warnings
- With broken reset: test_beta has no heartbeat ticks, goes stale → stolen → 1 warning

Also adds the consecutive_capped_tests.rb fixture (test_alpha=2s, test_beta=2.5s)
sized for the cap=1s/heartbeat=2s parameter window.
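
The arithmetic behind that parameter window can be sketched roughly as below. It assumes that "heartbeat=2s" means max_missed_heartbeat_seconds = 2, and that an entry becomes stealable once that many seconds have passed since its last heartbeat tick. Both are readings of the description above, not confirmed details. The helper names are hypothetical.

```ruby
# Hypothetical timing model for the fixture. All offsets are seconds since the
# test was reserved.
def stale_at(last_tick_offset:, max_missed:)
  last_tick_offset + max_missed
end

def stolen?(test_duration:, last_tick_offset:, max_missed:)
  # The entry can only be stolen if the test is still running when it goes stale.
  test_duration > stale_at(last_tick_offset: last_tick_offset, max_missed: max_missed)
end

cap        = 1.0 # heartbeat stops ticking after this long
max_missed = 2.0 # assumed meaning of heartbeat=2s

# Reset working: test_beta ticks until the cap, so it goes stale at about 3s.
beta_with_reset = stolen?(test_duration: 2.5, last_tick_offset: cap, max_missed: max_missed)

# Reset broken: test_beta never ticks, so it goes stale at about 2s.
beta_without_reset = stolen?(test_duration: 2.5, last_tick_offset: 0.0, max_missed: max_missed)

# test_alpha (2s) also finishes before its 3s stale point.
alpha = stolen?(test_duration: 2.0, last_tick_offset: cap, max_missed: max_missed)
```

Under this model test_beta finishes at 2.5s. That is before the 3s stale point when reset works, and after the 2s stale point when it does not, which matches the 0-warning and 1-warning outcomes listed above.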
Development

Successfully merging this pull request may close these issues.

Add max_test_duration cap to heartbeat thread
