fix: soft resync — recover a desynced world cache without losing agent state #796

Open
atiweb wants to merge 1 commit into mindcraft-bots:develop from atiweb:fix/soft-resync-preserve-state

atiweb commented Jun 17, 2026
Contributor

Problem

When the world cache desyncs — lag-induced "ghost blocks", where the bot's cached chunks no longer match the server (long-standing upstream issue PrismarineJS/mineflayer#2600) — the pathfinder plans against blocks that aren't really there and the agent can wedge itself in ways no in-game recovery fixes. Today the only escape is a full process restart (cleanKill), which re-downloads chunks but also throws away everything the agent was doing: conversation history, memory, the self-prompt goal, the current task.

The same heavy hammer is used in a far more common place: after !smeltItem succeeds, the code does a full restart just to refresh mineflayer's inventory cache, via cleanKill('Safely restarting to update inventory.').

Fix

Add Agent.softResync(): quit and reconnect the mineflayer bot in place. A fresh connection re-downloads chunks — the same thing a restart does to clear the desync — but the Node process, and therefore all agent state, survives.

!smeltItem now calls softResync('refresh inventory after smelting') instead of restarting. If a resync ever fails it falls back to the existing cleanKill restart, so behavior is never worse than today.
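To make the control flow concrete, here is a minimal sketch of a soft resync with a restart fallback. It is not the code in this PR: the `AgentSketch` class, the injected `connect` and `onFatal` callbacks, and the `settleMs` option are all hypothetical stand-ins (for creating the mineflayer bot, for the `cleanKill` restart path, and for the settle delay, respectively).

```javascript
// Sketch only: class shape, `connect`, `onFatal` and `settleMs` are
// illustrative assumptions, not the implementation in this PR.
class AgentSketch {
  constructor({ connect, onFatal, settleMs = 0 }) {
    this.connect = connect;     // async factory returning a new bot-like object
    this.onFatal = onFatal;     // stand-in for the cleanKill full-restart path
    this.settleMs = settleMs;   // pause between quitting and reconnecting
    this.bot = null;
    this.history = [];          // stand-in for agent state that must survive
    this._reconnecting = false;
  }

  async start() {
    this.bot = await this.connect();
  }

  async softResync(reason) {
    if (this._reconnecting) return false; // a resync is already in progress
    this._reconnecting = true;
    try {
      this.bot.quit(reason);                                   // deliberate disconnect
      await new Promise((resolve) => setTimeout(resolve, this.settleMs));
      this.bot = await this.connect();                         // fresh connection, fresh caches
      return true;
    } catch (err) {
      // Reconnect failed: fall back to the old behavior (full restart).
      this.onFatal(`Soft resync failed (${err.message}); restarting.`);
      return false;
    } finally {
      this._reconnecting = false;
    }
  }
}
```

In this sketch only `this.bot` is replaced; `history` (and anything else stored on the agent object) is untouched, which is the property the PR is after.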

Why the diff is larger than a one-liner

_start originally assumed one connection for the lifetime of the process: the connection handlers, the NPC controller and the update loop were all set up inline, once. Reconnecting in place means generalizing that assumption, and that is most of the diff:

  • _bindConnectionHandlers() — the kicked / end / error / login handlers are extracted out of _start so they can be re-bound to the new bot instead of duplicated.
  • _eventsInitialized guard — the NPC controller and the update loop are process-global, not per-connection, so they must start exactly once; otherwise a resync would spin up a second update loop.
  • _reconnecting flag — a deliberate disconnect must not be treated as a crash by _onDisconnect, and update() must skip its tick while this.bot is being swapped out.
  • isReconnect param on _setupEventHandlers — on a reconnect we skip the greeting / memory reload / task bootstrap so the bot quietly rejoins with the state it already has.
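The four pieces above can be sketched together as follows. This is an illustration under assumptions, not the PR's code: `FakeBot` is a hypothetical stand-in for a mineflayer bot, the counters exist only to make the behavior observable, and the method bodies are guesses based on the descriptions in the list.

```javascript
// Sketch only: FakeBot and the counters are hypothetical; the method
// names mirror the list above but their bodies are assumptions.
class FakeBot {
  constructor() { this.handlers = {}; }
  on(event, fn) { (this.handlers[event] ??= []).push(fn); }
  emit(event, ...args) { (this.handlers[event] ?? []).forEach((fn) => fn(...args)); }
}

class LifecycleSketch {
  constructor() {
    this._eventsInitialized = false;
    this._reconnecting = false;
    this.updateLoops = 0;   // how many update loops have been started
    this.greetings = 0;     // how many times the first-join bootstrap ran
    this.crashes = [];      // disconnects that were treated as crashes
  }

  // Per-connection: must be re-run for every new bot object.
  _bindConnectionHandlers(bot) {
    bot.on('end', (reason) => this._onDisconnect('end', reason));
    bot.on('kicked', (reason) => this._onDisconnect('kicked', reason));
    bot.on('error', (err) => this._onDisconnect('error', String(err)));
  }

  _onDisconnect(kind, reason) {
    if (this._reconnecting) return; // deliberate disconnect, not a crash
    this.crashes.push(`${kind}: ${reason}`);
  }

  _setupEventHandlers(bot, isReconnect = false) {
    this.bot = bot;
    this._bindConnectionHandlers(bot);
    // Process-global pieces start exactly once, however many reconnects happen.
    if (!this._eventsInitialized) {
      this._eventsInitialized = true;
      this.updateLoops += 1;
    }
    // Greeting / memory reload / task bootstrap only on the first join.
    if (!isReconnect) this.greetings += 1;
  }
}
```

The split mirrors the reasoning in the list: anything tied to a bot object is rebound per connection, while anything tied to the process is guarded so it runs once.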

Each piece is as small as I could make it; the size comes from centralizing the connection lifecycle, not from added features.

Validation — please read this part

I want to be upfront about how far this has and hasn't been tested, because it touches the connection lifecycle.

Tested live on a real server (Paper 1.21.x), running the exact code in this PR. I connected the bot through a local TCP proxy so I could inject faults at the network layer without touching the code under test, and ran it autonomously with another real player online:

  • softResync, in place (the smelt trigger). A real !smeltItem completed (Successfully smelted raw_iron, got 1 iron_ingot), then softResync fired: the bot quit and reconnected through the proxy in ~3s and logged Soft resync complete; world cache refreshed, agent state preserved. The process never exited (no restart), and the agent immediately continued its self-prompt loop with its goal and history intact — the exact win over the old full-restart path.
  • Connection refactor under real network drops. I destroyed the bot's TCP connection at the proxy several times — while idle and while mid-action (pathfinding). Each time the refactored handlers detected the drop, the process exited cleanly, the mindserver respawned it, and it reconnected and resumed its goal within ~5s. No hang, no double update loop, no orphaned listeners.
  • The desync case, via fault injection. I can't summon real lag-induced ghost blocks on demand, so I also exercised the resync with a harness that drops block_change / multi_block_change packets to force the world cache out of sync, then triggers softResync() and verifies the cache is correct again afterwards.
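For readers who want to picture the fault-injection idea, here is a toy model of it. This is not the author's harness: the `Map`-based cache, the packet object shape (`name`, `pos`, `block`) and all function names are assumptions made for illustration. Only the two packet names come from the description above.

```javascript
// Sketch only: a toy world cache keyed by "x,y,z" strings. The packet shape
// and helper names are hypothetical, not the harness used for this PR.
const DROPPED = new Set(['block_change', 'multi_block_change']);

// Returns a predicate: true if the packet should be forwarded to the client.
function makeDropFilter(isLagging) {
  return (packet) => !(isLagging() && DROPPED.has(packet.name));
}

// Applies a single-block update to a cache (multi_block_change omitted for brevity).
function applyPacket(cache, packet) {
  if (packet.name === 'block_change') cache.set(packet.pos, packet.block);
}

// Lists positions where the client's cache disagrees with the server.
function diffCaches(server, client) {
  const stale = [];
  for (const [pos, block] of server) {
    if (client.get(pos) !== block) stale.push(pos);
  }
  return stale;
}
```

A check in this style would drop updates while "lagging", confirm that `diffCaches` reports the ghost blocks, rebuild the client cache from the server (the toy equivalent of reconnecting), and confirm the diff is empty again.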

What I have NOT done, and where I'd genuinely value help:

  • Only one server / one latency profile (Paper). Not validated on vanilla / Spigot / Fabric, across mineflayer versions, or with multiple agents.
  • I have not observed softResync recover a genuine organic desync in the wild (only fault-injected). The fixed 2.5s settle before respawn is a guess that may need tuning on slower/faster servers.

I'm opening this so others can run it on different servers and conditions, surface the cases I can't reproduce, and suggest improvements — rather than sitting on it until I've personally covered every environment. Happy to gate it behind a setting until it's proven more broadly, or share the fault-injection harness if it helps review.

…ing agent state

The only recovery from a desynced world cache (lag-induced ghost blocks,
PrismarineJS/mineflayer #2600) was a full process restart, which throws away the
agent's conversation history, memory and self-prompt goal. smeltItem also forced a
full restart just to refresh mineflayer's inventory cache after smelting.

Add Agent.softResync(): quit and reconnect the mineflayer bot in place. A fresh
connection re-downloads chunks -- the same fix a restart applies -- but the Node
process (and all agent state) survives. To reconnect in place safely:

- extract the connection handlers into _bindConnectionHandlers() so they can be
  rebound to the new bot (previously inline in _start, which assumed a single
  connection for the whole process lifetime);
- guard the NPC controller and the update loop behind _eventsInitialized so a
  resync re-attaches per-bot listeners without spinning up a second update loop;
- mark the deliberate disconnect (_reconnecting) so _onDisconnect does not treat
  it as a crash, and skip the update tick while the bot is being swapped out;
- skip the greeting / memory reload / task bootstrap on a reconnect (isReconnect).

Wire smeltItem to softResync instead of cleanKill, giving the feature an organic
trigger. Falls back to a full restart if the reconnect fails.