fix(tasks): reap ancient queued runs whose updated_at keeps bumping #66391
The stale-queued janitor selects candidates by `updated_at < now-24h`, but the `TasksAgentStaleQueuedRuns` alert measures the oldest queued run by `created_at`. A run that stays QUEUED while its `updated_at` keeps advancing (a stuck workflow writing run state, never progressing to IN_PROGRESS) never trips the `updated_at` rule, so the janitor never reaps it and the alert fires indefinitely. Add a `created_at` backstop to the janitor: a run older than 48h that has also been untouched for over an hour is reaped regardless of `updated_at`. The one-hour grace preserves the existing behaviour of sparing a run that `prepare_for_cloud_handoff` re-queued moments ago and is about to resume, so long-lived interactive runs are unaffected.
Focused janitor-task fix with correct OR-query logic, backward-compatible signature change, single call site updated in the same PR, and good test coverage for both the reap and grace-window paths. No data model, API contract, or security surface touched.
Problem

The `TasksAgentStaleQueuedRuns` alert fires when the oldest queued cloud `TaskRun` (measured by `created_at`) is older than 24h. The janitor that's supposed to clear these (`kill_stale_queued_task_runs`) selects candidates by a different clock: `updated_at < now-24h`.

That divergence leaves a gap: a run that stays QUEUED while something keeps bumping its `updated_at` (a stuck workflow writing run state, never progressing to IN_PROGRESS) never trips the janitor's `updated_at` rule. The janitor never reaps it, so the alert fires indefinitely and is not actionable.

This was caught by a live `prod-us` alert: a single `user_created` cloud run stuck in QUEUED, `created_at` ~24.7h old, that survived the hourly janitor tick because its `updated_at` was materially newer than its `created_at`.

Changes

Add a `created_at` hard-cap backstop to the janitor, alongside the existing `updated_at` rule:

- `get_stale_queued_task_run_ids` gains an optional `created_hard_cap` (and a `hard_cap_min_queued` grace, default 1h). A run is now a candidate if its `updated_at` is older than the primary window, OR if its `created_at` is older than the cap AND it has been untouched for longer than the grace.
- With `created_hard_cap=48h`, a run older than 48h that hasn't been touched in over an hour is reaped regardless of `updated_at`.

The one-hour grace is what preserves the original intent of keying on `updated_at`: a run that `prepare_for_cloud_handoff` re-queued moments ago (and is about to resume) has a fresh `updated_at` and is still spared, so long-lived interactive runs are unaffected. Worst-case cleanup for a genuinely stuck run is now bounded at 48h instead of never.
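
For reviewers who want the candidate rule in one place, here is a minimal sketch of it as a per-run predicate in plain Python. It is not the PR's code: the real `get_stale_queued_task_run_ids` selects IDs with a query, whereas `is_stale_queued` and its `stale_after` parameter are hypothetical names introduced here for illustration. Only `created_hard_cap`, `hard_cap_min_queued`, and the 24h / 48h / 1h values come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale_queued(
    created_at: datetime,
    updated_at: datetime,
    now: datetime,
    stale_after: timedelta = timedelta(hours=24),  # hypothetical name for the primary window
    created_hard_cap: Optional[timedelta] = timedelta(hours=48),
    hard_cap_min_queued: timedelta = timedelta(hours=1),
) -> bool:
    """Illustrative predicate: should a QUEUED run be reaped?"""
    # Primary rule (pre-existing): nothing has touched the run for 24h.
    if updated_at < now - stale_after:
        return True
    # Without a hard cap, only the primary rule applies (old behaviour).
    if created_hard_cap is None:
        return False
    # Backstop: the run is older than the cap AND has sat untouched for
    # longer than the grace, so a freshly re-queued run is still spared.
    return (
        created_at < now - created_hard_cap
        and updated_at < now - hard_cap_min_queued
    )


now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
h = lambda n: now - timedelta(hours=n)

# Stuck run whose updated_at keeps bumping: 50h old, touched 2h ago.
stuck_old_rule = is_stale_queued(h(50), h(2), now, created_hard_cap=None)
stuck_new_rule = is_stale_queued(h(50), h(2), now)

# Long-lived run re-queued five minutes ago: spared by the grace.
requeued = is_stale_queued(h(72), now - timedelta(minutes=5), now)

# Run idle for 25h: still caught by the primary rule.
idle = is_stale_queued(h(30), h(25), now)
```

Under these assumptions the stuck run is skipped when `created_hard_cap=None` and reaped with the 48h cap, while the recently re-queued run stays a non-candidate either way. Expressing the backstop as a conjunction with the grace, rather than a bare `created_at` cutoff, is what keeps the second case safe.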