fix: mark jobs as failed when command or cluster is not found in memory #105
Open
Will Graham (wlggraham) wants to merge 1 commit into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem

During rolling deployments, a race condition causes async jobs to become permanently orphaned in `NEW` status with no error message.

Root cause: Heimdall uses in-memory maps (`h.Commands`, `h.Clusters`) loaded from YAML at startup for all job routing and execution; the DB is only written to for record-keeping. When a new cluster or command is added and deployed, an old instance still running during the rollout will pick up jobs for the new cluster/command from `active_jobs`, fail to find that cluster or command in its in-memory maps, and call `updateAsyncJobStatus` without setting `j.Status` or `j.Error` on the job struct first.

Since `updateAsyncJobStatus` writes `j.Status` and `j.Error` to the DB (not the `jobError` parameter), the job is written back with:

- `status = NEW` (the original status, never updated in memory)
- `error = ""` (empty, the error message is silently dropped)
- its entry removed from `active_jobs`

The result: the job appears stuck in `NEW` in the UI with no error, is no longer in `active_jobs`, and can never be picked up by any worker again. The client polling for job status has no way to know anything went wrong.

Fix

Follow the same pattern used in `runJob`: set `j.Status = Failed` and `j.Error = err.Error()` on the struct before calling `updateAsyncJobStatus`, so the error is correctly persisted to the DB and visible to the client.

Test plan

- Jobs whose command or cluster is not found in memory are marked `FAILED` with a descriptive error message rather than remaining stuck in `NEW`

🤖 Generated with Claude Code
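
For reviewers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the behavior described under Problem and Fix. It is not Heimdall's actual code: the `Job`, `store`, `lookupCluster`, `handleMissingBefore`, and `handleMissingAfter` names are hypothetical stand-ins, and `updateAsyncJobStatus` is reduced to an in-memory imitation that assumes only what the description states (it persists `j.Status` and `j.Error`, ignores the error parameter, and drops the job from `active_jobs`).

```go
package main

import "fmt"

// Status values named after the ones in the description; the real type may differ.
type Status string

const (
	StatusNew    Status = "NEW"
	StatusFailed Status = "FAILED"
)

// Job is a hypothetical stand-in for the job struct.
type Job struct {
	ID        string
	ClusterID string
	Status    Status
	Error     string
}

// row is what the stand-in DB keeps per job.
type row struct {
	Status Status
	Error  string
}

// store imitates the jobs table and the active_jobs table in memory.
type store struct {
	jobs   map[string]row
	active map[string]bool
}

func newStore() *store {
	return &store{jobs: map[string]row{}, active: map[string]bool{}}
}

// updateAsyncJobStatus imitates the described behavior: it persists the
// fields on the struct, does not persist jobErr, and removes the job
// from active_jobs.
func (s *store) updateAsyncJobStatus(j *Job, jobErr error) {
	_ = jobErr // assumption from the description: never written to the DB
	s.jobs[j.ID] = row{Status: j.Status, Error: j.Error}
	delete(s.active, j.ID)
}

// lookupCluster models an old instance whose in-memory map predates the new cluster.
func lookupCluster(known map[string]struct{}, id string) error {
	if _, ok := known[id]; !ok {
		return fmt.Errorf("cluster %q not found in memory", id)
	}
	return nil
}

// handleMissingBefore: the struct is left untouched, so NEW and "" are persisted.
func handleMissingBefore(s *store, j *Job, err error) {
	s.updateAsyncJobStatus(j, err)
}

// handleMissingAfter: set status and error on the struct first, then persist.
func handleMissingAfter(s *store, j *Job, err error) {
	j.Status = StatusFailed
	j.Error = err.Error()
	s.updateAsyncJobStatus(j, err)
}

func main() {
	known := map[string]struct{}{"old-cluster": {}}
	handlers := []struct {
		name   string
		handle func(*store, *Job, error)
	}{
		{"before", handleMissingBefore},
		{"after", handleMissingAfter},
	}
	for _, h := range handlers {
		s := newStore()
		j := &Job{ID: "job-1", ClusterID: "new-cluster", Status: StatusNew}
		s.active[j.ID] = true
		if err := lookupCluster(known, j.ClusterID); err != nil {
			h.handle(s, j, err)
		}
		r := s.jobs[j.ID]
		fmt.Printf("%s: status=%s error=%q active=%v\n", h.name, r.Status, r.Error, s.active[j.ID])
	}
}
```

In both paths the job leaves `active_jobs`; the difference is only in what gets persisted, which is why the unfixed path leaves a row that no worker will revisit and that tells the client nothing.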