fix(tasks): reap ancient queued runs whose updated_at keeps bumping by tatoalo · Pull Request #66391 · PostHog/posthog

tatoalo · 2026-06-26T16:22:52Z

Problem

The TasksAgentStaleQueuedRuns alert fires when the oldest queued cloud TaskRun (measured by created_at) is older than 24h. The janitor that's supposed to clear these (kill_stale_queued_task_runs) selects candidates by a different clock — updated_at < now-24h.

That divergence leaves a gap: a run that stays QUEUED while something keeps bumping its updated_at (a stuck workflow writing run state, never progressing to IN_PROGRESS) never trips the janitor's updated_at rule. The janitor never reaps it, so the alert fires indefinitely and is not actionable.

This was caught by a live prod-us alert: a single user_created cloud run stuck in QUEUED, created_at ~24.7h old, that survived the hourly janitor tick because its updated_at was materially newer than its created_at.
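To make the gap concrete, here is a minimal sketch of the two clocks applied to a run like the one in the alert. It is not the PostHog code: the helper names `alert_fires` and `old_janitor_reaps` are hypothetical, and the `updated_at` value (2h ago) is an assumption, since the description only says it was materially newer than `created_at`.

```python
from datetime import datetime, timedelta, timezone

ALERT_THRESHOLD = timedelta(hours=24)  # alert: oldest queued run, measured by created_at
JANITOR_WINDOW = timedelta(hours=24)   # janitor (before this PR): updated_at < now - 24h


def alert_fires(created_at: datetime, now: datetime) -> bool:
    """Hypothetical stand-in for the alert condition."""
    return now - created_at > ALERT_THRESHOLD


def old_janitor_reaps(updated_at: datetime, now: datetime) -> bool:
    """Hypothetical stand-in for the pre-fix janitor rule."""
    return updated_at < now - JANITOR_WINDOW


now = datetime(2026, 6, 26, 16, 0, tzinfo=timezone.utc)
created_at = now - timedelta(hours=24.7)  # from the alert described above
updated_at = now - timedelta(hours=2)     # assumed: bumped by the stuck workflow

print(alert_fires(created_at, now))        # alert condition for this run
print(old_janitor_reaps(updated_at, now))  # pre-fix janitor decision for this run
```

Under these assumptions the alert condition holds while the janitor rule does not, which is the state the description calls not actionable.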

Changes

Add a created_at hard-cap backstop to the janitor, alongside the existing updated_at rule:

get_stale_queued_task_run_ids gains an optional created_hard_cap (and a hard_cap_min_queued grace, default 1h). A run is now a candidate if its updated_at is older than the primary window, or if its created_at is older than the cap and it has been untouched for longer than the grace.
The janitor passes created_hard_cap=48h, so a run older than 48h that hasn't been touched in over an hour is reaped regardless of updated_at.

The one-hour grace is what preserves the original intent of keying on updated_at: a run that prepare_for_cloud_handoff re-queued moments ago (and is about to resume) has a fresh updated_at and is still spared, so long-lived interactive runs are unaffected. Worst-case cleanup for a genuinely stuck run is now bounded at 48h instead of never.
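The combined rule can be sketched as a plain predicate. This is an illustration under stated assumptions, not the actual implementation, which operates on a database query: `is_stale_candidate` is a hypothetical name, and only the parameter names `created_hard_cap` and `hard_cap_min_queued` and the 24h / 48h / 1h values come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

PRIMARY_WINDOW = timedelta(hours=24)  # existing updated_at rule


def is_stale_candidate(
    created_at: datetime,
    updated_at: datetime,
    now: datetime,
    created_hard_cap: Optional[timedelta] = None,
    hard_cap_min_queued: timedelta = timedelta(hours=1),
) -> bool:
    """Hypothetical predicate mirroring the selection rule described above."""
    # Primary rule: untouched for longer than the primary window.
    primary = updated_at < now - PRIMARY_WINDOW
    if created_hard_cap is None:
        # No cap passed: behaviour is the same as before the change.
        return primary
    # Backstop: older than the cap AND untouched for longer than the grace.
    backstop = (
        created_at < now - created_hard_cap
        and updated_at < now - hard_cap_min_queued
    )
    return primary or backstop


now = datetime(2026, 6, 26, 16, 0, tzinfo=timezone.utc)
cap = timedelta(hours=48)

# Stuck run: created 50h ago, last touched 2h ago.
stuck = is_stale_candidate(now - timedelta(hours=50), now - timedelta(hours=2), now, cap)
# Re-queued moments ago: created 50h ago, touched 5 minutes ago.
requeued = is_stale_candidate(now - timedelta(hours=50), now - timedelta(minutes=5), now, cap)
# Younger than the cap: created 24.7h ago, touched 2h ago.
young = is_stale_candidate(now - timedelta(hours=24.7), now - timedelta(hours=2), now, cap)
print(stuck, requeued, young)
```

In this sketch the stuck run is selected, the freshly re-queued run is spared by the grace, and a run younger than the cap with a recent `updated_at` is still not selected until it crosses 48h, which matches the stated worst-case bound.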

The stale-queued janitor selects candidates by `updated_at < now-24h`, but the `TasksAgentStaleQueuedRuns` alert measures the oldest queued run by `created_at`. A run that stays QUEUED while its `updated_at` keeps advancing (a stuck workflow writing run state, never progressing to IN_PROGRESS) never trips the `updated_at` rule, so the janitor never reaps it and the alert fires indefinitely. Add a `created_at` backstop to the janitor: a run older than 48h that has also been untouched for over an hour is reaped regardless of `updated_at`. The one-hour grace preserves the existing behaviour of sparing a run that `prepare_for_cloud_handoff` re-queued moments ago and is about to resume, so long-lived interactive runs are unaffected.

github-actions

Focused janitor-task fix with correct OR-query logic, backward-compatible signature change, single call site updated in the same PR, and good test coverage for both the reap and grace-window paths. No data model, API contract, or security surface touched.

greptile-apps · 2026-06-26T16:27:02Z

Reviews (1): Last reviewed commit: "fix(tasks): reap ancient queued runs who..."

stamphog · 2026-06-26T16:29:56Z

Retaining stamphog approval — delta since last review classified as trivial_paths.

tests-posthog · 2026-06-26T16:41:56Z

⏭️ Skipped snapshot commit because branch advanced to ea7475f while workflow was testing ca0d336.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

Merge master into your branch, or
Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

stamphog · 2026-06-26T16:57:50Z

Retaining stamphog approval — delta since last review classified as mixed_trivial.

tests-posthog · 2026-06-26T16:59:02Z

⏭️ Skipped snapshot commit because branch advanced to 236daa9 while workflow was testing ea7475f.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

Merge master into your branch, or
Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

deployment-status-posthog · 2026-06-26T17:47:12Z

Deploy status

Environment	Status	Deployed At	Workflow
dev	✅ Deployed	2026-06-26 17:47 UTC	Run
prod-us	✅ Deployed	2026-06-26 18:13 UTC	Run
prod-eu	✅ Deployed	2026-06-26 18:15 UTC	Run

tatoalo self-assigned this Jun 26, 2026

tatoalo marked this pull request as ready for review June 26, 2026 16:23

tatoalo added the stamphog label (Request AI approval, no full review) Jun 26, 2026

assign-reviewers-posthog Bot requested a review from a team June 26, 2026 16:23

github-actions Bot approved these changes Jun 26, 2026

tatoalo enabled auto-merge (squash) June 26, 2026 16:26

greptile-apps Bot reviewed Jun 26, 2026

Comment thread products/tasks/backend/tests/test_facade.py Outdated

chore(tasks): isolate hard-cap facade assertions into own test

ea7475f

Merge branch 'master' into fix/tasks-janitor-created-at-hard-cap

236daa9

tatoalo merged commit fbdea52 into master Jun 26, 2026
232 checks passed

tatoalo deleted the fix/tasks-janitor-created-at-hard-cap branch June 26, 2026 17:21

tatoalo merged 3 commits into master from fix/tasks-janitor-created-at-hard-cap
