fix: mark jobs as failed when command or cluster is not found in memory by wlggraham · Pull Request #105 · patterninc/heimdall

Will Graham (wlggraham) · 2026-05-26T21:52:47Z

Problem

During rolling deployments, a race condition causes async jobs to become permanently orphaned in NEW status with no error message.

Root cause: Heimdall uses in-memory maps (h.Commands, h.Clusters), loaded from YAML at startup, for all job routing and execution; the DB is written to only for record-keeping. When a new cluster or command is added and deployed, an old instance still running during the rollout can pick up a job for the new cluster or command from active_jobs, fail to find that cluster or command in its in-memory maps, and call updateAsyncJobStatus without first setting j.Status or j.Error on the job struct.

Since updateAsyncJobStatus writes j.Status and j.Error to the DB (not the jobError parameter), the job is written back with:

status = NEW (the original status, never updated in memory)
error = "" (empty, the error message is silently dropped)
Row deleted from active_jobs

The result: the job appears stuck in NEW in the UI with no error, is no longer in active_jobs, and can never be picked up by any worker again. The client polling for job status has no way to know anything went wrong.
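The failure mode described above can be modeled in a few lines of Go. This is a minimal sketch, not Heimdall's actual code: the Job fields, the store type, and dispatchBuggy are hypothetical stand-ins, and updateAsyncJobStatus is reduced to the behavior described in this PR (persist j.Status and j.Error, ignore jobError for the stored row, delete the active_jobs entry).

```go
package main

import "fmt"

// Job is a hypothetical, simplified stand-in for Heimdall's job struct.
type Job struct {
	ID, ClusterID, Status, Error string
}

// row models the persisted job record (assumed shape).
type row struct{ Status, Error string }

// store models the jobs table and the active_jobs queue in memory.
type store struct {
	jobs   map[string]row
	active map[string]bool
}

// updateAsyncJobStatus mimics the behavior described in the PR: it writes
// j.Status and j.Error, does not persist jobError, and removes the job
// from active_jobs.
func (s *store) updateAsyncJobStatus(j *Job, jobError error) {
	_ = jobError // not written to the stored row
	s.jobs[j.ID] = row{Status: j.Status, Error: j.Error}
	delete(s.active, j.ID)
}

// dispatchBuggy models the pre-fix path: on a cluster lookup miss it calls
// updateAsyncJobStatus without touching j.Status or j.Error.
func dispatchBuggy(s *store, clusters map[string]bool, j *Job) {
	if !clusters[j.ClusterID] {
		s.updateAsyncJobStatus(j, fmt.Errorf("unknown cluster: %s", j.ClusterID))
		return
	}
	// Normal execution is elided in this sketch.
}

func main() {
	s := &store{jobs: map[string]row{}, active: map[string]bool{"job-1": true}}
	j := &Job{ID: "job-1", ClusterID: "new-cluster", Status: "NEW"}

	// The old instance only knows about "old-cluster".
	dispatchBuggy(s, map[string]bool{"old-cluster": true}, j)

	r := s.jobs[j.ID]
	fmt.Printf("status=%q error=%q active=%v\n", r.Status, r.Error, s.active[j.ID])
	// prints: status="NEW" error="" active=false
}
```

The stored row keeps status NEW with an empty error while the job has already left active_jobs, which matches the orphaned state seen in the UI.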

Fix

Follow the same pattern used in runJob — set j.Status = Failed and j.Error = err.Error() on the struct before calling updateAsyncJobStatus, so the error is correctly persisted to the DB and visible to the client.

Test plan

Submit an async job targeting a cluster that exists in the DB but not in the running instance's config
Verify the job transitions to FAILED with a descriptive error message rather than remaining stuck in NEW

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: mark jobs as failed when command or cluster is not found in memory

a2815f4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Will Graham (wlggraham) wants to merge 1 commit into main from fix/unknown-command-cluster-job-status
