
Add max_test_duration cap to heartbeat thread #395

@ianks


In #384 we removed the heartbeat countdown to fix the poisoning bug, but that means a stuck test will now heartbeat forever. The old countdown was the right idea; it just counted down from the wrong value (config.timeout, the staleness threshold) instead of max_test_duration.

The heartbeat thread is the natural place to enforce this. If it stops heartbeating after max_test_duration, the entry goes stale in the running ZSET, reserve_lost reclaims it, and the system self-heals through the existing lease mechanism. No new coordination needed.
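A minimal sketch of the idea, not the project's actual code: the loop below heartbeats at a fixed interval and simply stops once max_test_duration has elapsed, leaving the rest to the existing lease mechanism. The names `heartbeat_until_cap`, `send_heartbeat`, `stop`, and `interval` are hypothetical. In a real worker, `send_heartbeat` would presumably refresh the entry's score in the running ZSET.

```python
import threading
import time
from typing import Callable


def heartbeat_until_cap(
    send_heartbeat: Callable[[], None],  # hypothetical: refreshes the lease
    stop: threading.Event,               # set by the worker when the test ends
    interval: float,                     # seconds between heartbeats
    max_test_duration: float,            # hard cap on how long we keep heartbeating
) -> bool:
    """Heartbeat until the test finishes or the cap is reached.

    Returns True if `stop` was set (test finished), False if the cap
    was hit and heartbeating was abandoned.
    """
    deadline = time.monotonic() + max_test_duration
    while time.monotonic() < deadline:
        send_heartbeat()
        # Sleep for one interval, but never past the deadline.
        remaining = deadline - time.monotonic()
        if stop.wait(timeout=max(0.0, min(interval, remaining))):
            return True
    # Cap reached: stop heartbeating so the entry can go stale
    # and be reclaimed by another worker.
    return False
```

The function would run in the heartbeat thread, e.g. `threading.Thread(target=heartbeat_until_cap, args=(beat, stop, 1.0, 300.0), daemon=True)`. The cap is measured from when heartbeating starts, separately from the staleness threshold, which is the distinction the old countdown missed.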

Right now we're relying on the application layer (max_test_duration enforced by the app itself) and supervisor-side timeouts (report_timeout / inactive_workers_timeout) as safety nets, but having the heartbeat thread own this directly would be cleaner.
