fix: mark jobs as failed when command or cluster is not found in memory #105

Open
Will Graham (wlggraham) wants to merge 1 commit into main from fix/unknown-command-cluster-job-status

@wlggraham
Contributor

Problem

During rolling deployments, a race condition causes async jobs to become permanently orphaned in NEW status with no error message.

Root cause: Heimdall uses in-memory maps (h.Commands, h.Clusters), loaded from YAML at startup, for all job routing and execution; the DB is written to only for record-keeping. When a new cluster or command is added and deployed, an old instance still running during the rollout can pick up jobs for the new cluster/command from active_jobs. It fails to find the cluster or command in its in-memory maps and calls updateAsyncJobStatus without first setting j.Status or j.Error on the job struct.

Since updateAsyncJobStatus persists j.Status and j.Error from the job struct (not the jobError parameter), the DB ends up in this state:

  • status = NEW (the original status, never updated in memory)
  • error = "" (empty, the error message is silently dropped)
  • Row deleted from active_jobs

The result: the job appears stuck in NEW in the UI with no error, is no longer in active_jobs, and can never be picked up by any worker again. The client polling for job status has no way to know anything went wrong.
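The failure mode can be reproduced in isolation with a minimal sketch. Everything here is a hypothetical stand-in, not Heimdall's actual code: the `Job` and `store` types, the `buggyHandle` function, the in-memory maps that play the role of the jobs table and active_jobs, and the error text are all assumptions made for illustration. Only the behavior of `updateAsyncJobStatus` (persisting the struct fields and ignoring the error argument) follows the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the job status type.
type Status string

const (
	StatusNew    Status = "NEW"
	StatusFailed Status = "FAILED"
)

// Job is a simplified, assumed shape of the job struct.
type Job struct {
	ID     string
	Status Status
	Error  string
}

// store fakes the DB: jobs stands in for the jobs table,
// active for the active_jobs table.
type store struct {
	jobs   map[string]Job
	active map[string]bool
}

// updateAsyncJobStatus mimics the behavior described in this PR:
// it persists j.Status and j.Error and removes the active row,
// but never writes jobError itself.
func (s *store) updateAsyncJobStatus(j *Job, jobError error) {
	_ = jobError // intentionally unused, as in the described behavior
	s.jobs[j.ID] = *j
	delete(s.active, j.ID)
}

// buggyHandle models the pre-fix path: the cluster is missing from the
// instance's in-memory map, and the job struct is not updated first.
func buggyHandle(s *store, clusters map[string]bool, j *Job, cluster string) {
	if !clusters[cluster] {
		s.updateAsyncJobStatus(j, errors.New("unknown cluster"))
		return
	}
}

func main() {
	s := &store{
		jobs:   map[string]Job{},
		active: map[string]bool{"j1": true},
	}
	j := &Job{ID: "j1", Status: StatusNew}

	// The old instance only knows about "old-cluster".
	buggyHandle(s, map[string]bool{"old-cluster": true}, j, "new-cluster")

	rec := s.jobs["j1"]
	fmt.Printf("status=%s error=%q active=%v\n", rec.Status, rec.Error, s.active["j1"])
	// prints: status=NEW error="" active=false
}
```

The stored record keeps its original status, carries no error, and is gone from the active set, which matches the orphaned state listed above.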

Fix

Follow the same pattern used in runJob — set j.Status = Failed and j.Error = err.Error() on the struct before calling updateAsyncJobStatus, so the error is correctly persisted to the DB and visible to the client.
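As a rough illustration of that pattern, the sketch below sets the struct fields before persisting. It is not the actual diff: the types, the `failJob` helper, the `StatusFailed` constant name, the in-memory `store`, and the error message are hypothetical, and only the ordering (update the struct, then call `updateAsyncJobStatus`) comes from the description above.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the job status type.
type Status string

const (
	StatusNew    Status = "NEW"
	StatusFailed Status = "FAILED"
)

// Job is a simplified, assumed shape of the job struct.
type Job struct {
	ID     string
	Status Status
	Error  string
}

// store fakes the DB: jobs stands in for the jobs table,
// active for the active_jobs table.
type store struct {
	jobs   map[string]Job
	active map[string]bool
}

// updateAsyncJobStatus persists j.Status and j.Error and removes the
// active row; as described in the PR, jobError itself is not written.
func (s *store) updateAsyncJobStatus(j *Job, jobError error) {
	_ = jobError
	s.jobs[j.ID] = *j
	delete(s.active, j.ID)
}

// failJob (hypothetical helper) records the failure on the struct
// first, so that the persisted row carries the status and message.
func failJob(s *store, j *Job, err error) {
	j.Status = StatusFailed
	j.Error = err.Error()
	s.updateAsyncJobStatus(j, err)
}

// handle models the post-fix path for a cluster missing from the
// instance's in-memory map.
func handle(s *store, clusters map[string]bool, j *Job, cluster string) {
	if !clusters[cluster] {
		failJob(s, j, fmt.Errorf("unknown cluster: %s", cluster))
		return
	}
	// Normal execution would continue here.
}

func main() {
	s := &store{
		jobs:   map[string]Job{},
		active: map[string]bool{"j1": true},
	}
	j := &Job{ID: "j1", Status: StatusNew}

	handle(s, map[string]bool{"old-cluster": true}, j, "new-cluster")

	rec := s.jobs["j1"]
	fmt.Printf("status=%s error=%q active=%v\n", rec.Status, rec.Error, s.active["j1"])
	// prints: status=FAILED error="unknown cluster: new-cluster" active=false
}
```

The job still leaves the active set, but a polling client now sees a terminal FAILED status with a message instead of a job stuck in NEW.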

Test plan

  • Submit an async job targeting a cluster that exists in the DB but not in the running instance's config
  • Verify the job transitions to FAILED with a descriptive error message rather than remaining stuck in NEW

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>