
fix(tasks): reap ancient queued runs whose updated_at keeps bumping #66391

Merged
tatoalo merged 3 commits into master from fix/tasks-janitor-created-at-hard-cap
Jun 26, 2026
Conversation

@tatoalo

@tatoalo tatoalo commented Jun 26, 2026

Contributor

Problem

The TasksAgentStaleQueuedRuns alert fires when the oldest queued cloud TaskRun (measured by created_at) is older than 24h. The janitor that's supposed to clear these (kill_stale_queued_task_runs) selects candidates by a different clock — updated_at < now-24h.

That divergence leaves a gap: a run that stays QUEUED while something keeps bumping its updated_at (a stuck workflow writing run state, never progressing to IN_PROGRESS) never trips the janitor's updated_at rule. The janitor never reaps it, so the alert fires indefinitely and is not actionable.

This was caught by a live prod-us alert: a single user_created cloud run stuck in QUEUED, created_at ~24.7h old, that survived the hourly janitor tick because its updated_at was materially newer than its created_at.
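The gap between the two clocks can be illustrated with a minimal sketch. The function names below are hypothetical, and the five-minute `updated_at` offset is an assumed value chosen for illustration (the report above only says it was materially newer than `created_at`); only the 24h threshold and the ~24.7h age come from the description.

```python
from datetime import datetime, timedelta, timezone

STALE_WINDOW = timedelta(hours=24)  # threshold used by both the alert and the janitor


def alert_fires(created_at: datetime, now: datetime) -> bool:
    """Alert clock: the run's age is measured from created_at."""
    return now - created_at > STALE_WINDOW


def old_janitor_selects(updated_at: datetime, now: datetime) -> bool:
    """Pre-fix janitor clock: staleness is measured from updated_at."""
    return updated_at < now - STALE_WINDOW


now = datetime(2026, 6, 26, 12, 0, tzinfo=timezone.utc)
created_at = now - timedelta(hours=24.7)   # age reported in the prod-us alert
updated_at = now - timedelta(minutes=5)    # assumed: something keeps bumping it

print(alert_fires(created_at, now))          # True: the alert fires
print(old_janitor_selects(updated_at, now))  # False: the janitor skips the run
```

As long as `updated_at` keeps advancing, the second check stays false while the first stays true, which is the non-actionable state described above.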

Changes

Add a created_at hard-cap backstop to the janitor, alongside the existing updated_at rule:

  • get_stale_queued_task_run_ids gains an optional created_hard_cap (and a hard_cap_min_queued grace, default 1h). A run is now a candidate if either (a) its updated_at is older than the primary window, or (b) its created_at is older than the cap and it has been untouched for longer than the grace.
  • The janitor passes created_hard_cap=48h, so a run older than 48h that hasn't been touched in over an hour is reaped regardless of updated_at.
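The combined rule can be sketched as a per-run predicate. This is an illustration only: the real `get_stale_queued_task_run_ids` presumably runs a database query and returns IDs, whereas `is_stale_queued_candidate` and its `stale_window` parameter are hypothetical names introduced here. The `created_hard_cap` and `hard_cap_min_queued` names and the 24h/48h/1h values are taken from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale_queued_candidate(
    created_at: datetime,
    updated_at: datetime,
    now: datetime,
    stale_window: timedelta = timedelta(hours=24),        # hypothetical parameter name
    created_hard_cap: Optional[timedelta] = None,
    hard_cap_min_queued: timedelta = timedelta(hours=1),
) -> bool:
    # Primary rule (unchanged): the run has not been touched within the window.
    if updated_at < now - stale_window:
        return True
    # Without a cap the function behaves exactly as before.
    if created_hard_cap is None:
        return False
    # Backstop: old by created_at AND untouched for longer than the grace.
    return (
        created_at < now - created_hard_cap
        and updated_at < now - hard_cap_min_queued
    )


now = datetime(2026, 6, 26, 12, 0, tzinfo=timezone.utc)
cap = timedelta(hours=48)

# Stuck run: 50h old, last touched 2h ago.
stuck = is_stale_queued_candidate(
    now - timedelta(hours=50), now - timedelta(hours=2), now, created_hard_cap=cap
)
# Re-queued run: 50h old, touched 5 minutes ago, so still inside the grace.
requeued = is_stale_queued_candidate(
    now - timedelta(hours=50), now - timedelta(minutes=5), now, created_hard_cap=cap
)
# Same stuck run with no cap passed: the old behaviour.
legacy = is_stale_queued_candidate(
    now - timedelta(hours=50), now - timedelta(hours=2), now
)

print(stuck, requeued, legacy)  # True False False
```

Leaving `created_hard_cap` optional is what keeps the signature backward compatible: a caller that does not pass it gets the original `updated_at`-only selection.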

The one-hour grace is what preserves the original intent of keying on updated_at: a run that prepare_for_cloud_handoff re-queued moments ago (and is about to resume) has a fresh updated_at and is still spared, so long-lived interactive runs are unaffected. A genuinely stuck run is now reaped once it is older than 48h and has gone an hour without an update, where previously it could persist indefinitely.

The stale-queued janitor selects candidates by `updated_at < now-24h`, but
the `TasksAgentStaleQueuedRuns` alert measures the oldest queued run by
`created_at`. A run that stays QUEUED while its `updated_at` keeps advancing
(a stuck workflow writing run state, never progressing to IN_PROGRESS) never
trips the `updated_at` rule, so the janitor never reaps it and the alert fires
indefinitely.

Add a `created_at` backstop to the janitor: a run older than 48h that has also
been untouched for over an hour is reaped regardless of `updated_at`. The
one-hour grace preserves the existing behaviour of sparing a run that
`prepare_for_cloud_handoff` re-queued moments ago and is about to resume, so
long-lived interactive runs are unaffected.
@tatoalo tatoalo self-assigned this Jun 26, 2026
@tatoalo tatoalo marked this pull request as ready for review June 26, 2026 16:23
@tatoalo tatoalo added the stamphog Request AI approval (no full review) label Jun 26, 2026
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team June 26, 2026 16:23

@github-actions github-actions Bot left a comment

Contributor

Focused janitor-task fix with correct OR-query logic, backward-compatible signature change, single call site updated in the same PR, and good test coverage for both the reap and grace-window paths. No data model, API contract, or security surface touched.

@tatoalo tatoalo enabled auto-merge (squash) June 26, 2026 16:26
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Contributor

Reviews (1): Last reviewed commit: "fix(tasks): reap ancient queued runs who..."

Comment thread products/tasks/backend/tests/test_facade.py Outdated
@stamphog

stamphog Bot commented Jun 26, 2026


Retaining stamphog approval — delta since last review classified as trivial_paths.

@tests-posthog

tests-posthog Bot commented Jun 26, 2026

Contributor

⏭️ Skipped snapshot commit because branch advanced to ea7475f while workflow was testing ca0d336.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

@stamphog

stamphog Bot commented Jun 26, 2026


Retaining stamphog approval — delta since last review classified as mixed_trivial.

@tests-posthog

tests-posthog Bot commented Jun 26, 2026

Contributor

⏭️ Skipped snapshot commit because branch advanced to 236daa9 while workflow was testing ea7475f.


@tatoalo tatoalo merged commit fbdea52 into master Jun 26, 2026
232 checks passed
@tatoalo tatoalo deleted the fix/tasks-janitor-created-at-hard-cap branch June 26, 2026 17:21
@deployment-status-posthog

deployment-status-posthog Bot commented Jun 26, 2026


Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-26 17:47 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-26 18:13 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-26 18:15 UTC | Run |

