Sandbox resume 409: suspended sandbox reports no checkpoint data after prior checkpoint-received events #1343

Description

@rblalock

Summary

A Coder Hub sandbox session failed to resume. The Hub surfaced a generic error, but a direct manual resume via the Agentuity CLI reproduced a platform-side 409 with a more specific message:

Cannot resume sandbox: it is suspended but has no checkpoint data

This looks inconsistent with the sandbox lifecycle history, which shows prior evacuation/checkpoint activity for the same sandbox, including one event with a concrete checkpoint id. The expectation is that this sandbox should be resumable, or at minimum the platform state should be self-consistent.

Affected IDs

  • Hub session: codesess_938b33a6bf93
  • Sandbox: sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c
  • Driver job: job_d9f56dde7973877928a81271

What I checked

Hub side

  • Hub session row still exists and is paused.
  • In-memory runtime is gone, which is expected after paused eviction.
  • Hub sandbox tracker latched this error:
    • Sandbox resume failed while platform status is suspended (HTTP 409 conflict). This looks like a platform checkpoint/resume failure.
  • Replay / DB activity shows the last persisted assistant reply completed cleanly; no new user_prompt, turn_start, task, or tool activity was recorded after that.
  • This means the failed prompt never made it into a live turn; failure happened during wake/resume, before agent execution.

Platform side

Current sandbox status:

  • suspended

Manual resume via CLI:

agentuity cloud sandbox resume sbx_aa2c0c5d2c92b74e8f890ad57ca57f454ad8b52f05461d0f0ad23263e88c \
  --org-id org_2u8RgDTwcZWrZrZ3sZh24T5FCtz --json

Result:

{
  "error": {
    "code": "API_ERROR",
    "message": "Cannot resume sandbox: it is suspended but has no checkpoint data",
    "exitCode": 14,
    "details": {
      "tag": "APIErrorResponse",
      "status": 409,
      "sessionId": "sess_c3233f64db553bfd2dfa6d9cc4fcb095"
    }
  }
}

Sandbox status remained suspended after the manual resume attempt.
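
For triage, the specific platform message can be pulled out of the CLI's JSON error rather than relying on the Hub's generic text. A minimal sketch, assuming the error shape shown above is stable; the function name `parse_resume_error`, the returned keys, and the string match on the message are hypothetical and not part of the Agentuity CLI or API contract.

```python
import json
from typing import Optional


def parse_resume_error(raw: str) -> Optional[dict]:
    """Extract status and message from CLI JSON output; None if there is no error object."""
    payload = json.loads(raw)
    error = payload.get("error")
    if not isinstance(error, dict):
        return None
    details = error.get("details") or {}
    message = error.get("message", "")
    status = details.get("status")
    return {
        "status": status,
        "message": message,
        "session_id": details.get("sessionId"),
        # Assumption: matching on the message text seen in this issue.
        # A structured error code from the platform would be more robust.
        "missing_checkpoint": status == 409 and "no checkpoint data" in message,
    }
```

Feeding the result above into this function yields `status` 409 and `missing_checkpoint` set to true, which is the signal the Hub currently loses.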

Relevant sandbox lifecycle timeline

All times UTC on 2026-04-03.

  • 14:55:42Z sandbox created / started
  • 14:55:44Z tracked driver job created and reported running
  • 15:17:19Z lifecycle:suspended + evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)
  • 15:18:23Z another suspend event emitted with:
    • checkpoint_id=ckpt_fc6d480abf57f5f3
    • checkpoint_bucket=ago-d066e3-checkpoints
    • checkpoint_size=65519891
  • 15:19:21Z lifecycle:reconcile(previous_status=suspended)
  • 15:23:07Z lifecycle:resumed
  • 15:24:03Z another evacuation/suspend sequence:
    • lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)
    • evacuation:state-update(status=suspended, evacuation_phase=checkpoint-received)
    • another lifecycle:suspended(phase=pre-suspend, suspension_reason=evacuation)

Additional suspicious signals

  • The platform still reports the original tracked driver job as running with no completion or replacement job.
  • A direct GET /sandbox/:id succeeds normally.
  • A direct GET /sandbox/checkpoints/:id?orgId=... timed out after 12s with zero bytes returned.
    • I have not confirmed this is related, but it is suspicious given the resume error.

Why this looks wrong

The platform is simultaneously telling us:

  • the sandbox is suspended
  • earlier lifecycle events included checkpoint-received
  • one suspend emitted a concrete checkpoint id
  • a manual resume now fails because the sandbox allegedly has no checkpoint data

That combination should not happen in a healthy checkpoint/resume flow.

Expected behavior

One of these should be true:

  1. the sandbox resumes successfully from its latest checkpoint, or
  2. the sandbox transitions into a terminal/invalid state with a consistent reason and the checkpoint/event surfaces agree about the missing checkpoint, or
  3. the resume endpoint returns a more precise failure mode tied to the actual checkpoint object that is missing/corrupt/unreadable.
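
The three outcomes above can be sketched as a single decision function. This is illustrative only: `CheckpointState`, `resume_outcome`, the `failed` status, and the reason strings are all hypothetical names, not the platform's actual states or API.

```python
from enum import Enum
from typing import Optional


class CheckpointState(Enum):
    READABLE = "readable"
    MISSING = "missing"
    CORRUPT = "corrupt"
    UNREADABLE = "unreadable"


def resume_outcome(checkpoint_id: Optional[str], state: CheckpointState) -> dict:
    """Map the latest checkpoint's condition to one self-consistent resume result."""
    if checkpoint_id is None:
        # No checkpoint recorded at all: treat as missing regardless of the reported state.
        state = CheckpointState.MISSING
    if state is CheckpointState.READABLE:
        # Outcome 1: resume from the latest checkpoint.
        return {"sandbox_status": "running", "resumed_from": checkpoint_id}
    # Outcomes 2 and 3: a terminal state plus a reason tied to the specific checkpoint object.
    return {
        "sandbox_status": "failed",
        "reason": f"checkpoint_{state.value}",
        "checkpoint_id": checkpoint_id,
    }
```

The point of the sketch is that every branch names the checkpoint it acted on, so the sandbox status, the event history, and the resume error cannot disagree about whether a checkpoint exists.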

Notes

Coder Hub currently collapses any resume 409 that leaves the sandbox in the suspended state into a generic message, so the direct CLI/manual resume result above is the most useful raw signal I found.
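
One possible Hub-side improvement is to append the platform's own message to the generic text instead of discarding it. A minimal sketch, not Coder Hub's actual code: `describe_resume_failure` and its parameters are hypothetical, and the generic text is copied from the tracker error quoted earlier.

```python
GENERIC_409 = (
    "Sandbox resume failed while platform status is suspended (HTTP 409 conflict). "
    "This looks like a platform checkpoint/resume failure."
)


def describe_resume_failure(http_status: int, sandbox_status: str,
                            platform_message: str = "") -> str:
    """Build a user-facing error that keeps the platform's message when one is available."""
    if http_status == 409 and sandbox_status == "suspended":
        if platform_message:
            return f"{GENERIC_409} Platform message: {platform_message}"
        return GENERIC_409
    return f"Sandbox resume failed (HTTP {http_status})."
```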
