fix: soft resync — recover a desynced world cache without losing agent state#796
Open
atiweb wants to merge 1 commit into
…ing agent state

The only recovery from a desynced world cache (lag-induced ghost blocks, PrismarineJS/mineflayer #2600) was a full process restart, which throws away the agent's conversation history, memory and self-prompt goal. smeltItem also forced a full restart just to refresh mineflayer's inventory cache after smelting.

Add Agent.softResync(): quit and reconnect the mineflayer bot in place. A fresh connection re-downloads chunks -- the same fix a restart applies -- but the Node process (and all agent state) survives.

To reconnect in place safely:
- extract the connection handlers into _bindConnectionHandlers() so they can be rebound to the new bot (previously inline in _start, which assumed a single connection for the whole process lifetime);
- guard the NPC controller and the update loop behind _eventsInitialized so a resync re-attaches per-bot listeners without spinning up a second update loop;
- mark the deliberate disconnect (_reconnecting) so _onDisconnect does not treat it as a crash, and skip the update tick while the bot is being swapped out;
- skip the greeting / memory reload / task bootstrap on a reconnect (isReconnect).

Wire smeltItem to softResync instead of cleanKill, giving the feature an organic trigger. Falls back to a full restart if the reconnect fails.
Problem
When the world cache desyncs (lag-induced "ghost blocks", where the bot's cached chunks no longer match the server; long-standing upstream issue PrismarineJS/mineflayer#2600), the pathfinder plans against blocks that aren't really there, and the agent can wedge itself in ways no in-game recovery fixes. Today the only escape is a full process restart (`cleanKill`), which re-downloads chunks but also throws away everything the agent was doing: conversation history, memory, the self-prompt goal, the current task.

The same heavy hammer is used in a far more common place: after `!smeltItem` succeeds, the code does a full restart just to refresh mineflayer's inventory cache, via `cleanKill('Safely restarting to update inventory.')`.

Fix
Add `Agent.softResync()`: quit and reconnect the mineflayer bot in place. A fresh connection re-downloads chunks, the same thing a restart does to clear the desync, but the Node process, and therefore all agent state, survives.

`!smeltItem` now calls `softResync('refresh inventory after smelting')` instead of restarting. If a resync ever fails, it falls back to the existing `cleanKill` restart, so behavior is never worse than today.

Why the diff is larger than a one-liner
`_start` originally assumed one connection for the lifetime of the process: the connection handlers, the NPC controller and the update loop were all set up inline, once. Reconnecting in place means generalizing that assumption, and that is most of the diff:

- `_bindConnectionHandlers()`: the `kicked` / `end` / `error` / `login` handlers are extracted out of `_start` so they can be re-bound to the new bot instead of duplicated.
- `_eventsInitialized` guard: the NPC controller and the update loop are process-global, not per-connection, so they must start exactly once; otherwise a resync would spin up a second update loop.
- `_reconnecting` flag: a deliberate disconnect must not be treated as a crash by `_onDisconnect`, and `update()` must skip its tick while `this.bot` is being swapped out.
- `isReconnect` param on `_setupEventHandlers`: on a reconnect we skip the greeting / memory reload / task bootstrap so the bot quietly rejoins with the state it already has.

Each piece is as small as I could make it; the size comes from centralizing the connection lifecycle, not from added features.
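For reviewers who want the shape of the change before reading the diff, here is a minimal, self-contained sketch of how the four pieces above could fit together. It is not the code in this PR: `FakeBot`, `AgentSketch`, the `createBot` factory and the counters are hypothetical stand-ins so the example runs without a server, and only the `spawn` / `end` / `error` events and `quit()` mirror what a real mineflayer bot exposes.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for a mineflayer bot so the sketch runs offline.
class FakeBot extends EventEmitter {
  constructor(ok = true) {
    super();
    // Report the connection result asynchronously, as a real socket would.
    setImmediate(() =>
      ok ? this.emit('spawn') : this.emit('error', new Error('connect failed')));
  }
  quit(reason) {
    setImmediate(() => this.emit('end', reason));
  }
}

class AgentSketch {
  // createBot and cleanKill are injected here only to keep the sketch testable.
  constructor(createBot, cleanKill) {
    this.createBot = createBot;
    this.cleanKill = cleanKill;       // full-restart fallback
    this.history = [];                // stands in for agent state that must survive
    this._eventsInitialized = false;
    this._reconnecting = false;
    this.updateLoops = 0;             // illustrative counters
    this.greetings = 0;
    this.ticks = 0;
    this._attach(createBot(), false);
  }

  _attach(bot, isReconnect) {
    this.bot = bot;
    this._bindConnectionHandlers(bot);
    this._setupEventHandlers(isReconnect);
  }

  // Per-connection handlers: rebound on every new bot.
  _bindConnectionHandlers(bot) {
    bot.on('end', (reason) => this._onDisconnect(reason));
  }

  _onDisconnect(reason) {
    if (this._reconnecting) return;   // deliberate quit, not a crash
    this.cleanKill(`Disconnected: ${reason}`);
  }

  _setupEventHandlers(isReconnect) {
    if (!this._eventsInitialized) {   // process-global pieces start exactly once
      this._eventsInitialized = true;
      this.updateLoops += 1;
    }
    if (!isReconnect) this.greetings += 1; // greeting / memory reload / task bootstrap
  }

  update() {
    if (this._reconnecting) return;   // skip the tick while the bot is being swapped
    this.ticks += 1;
  }

  async softResync(reason) {
    if (this._reconnecting) return false; // a resync is already in progress
    this._reconnecting = true;
    try {
      const old = this.bot;
      const ended = new Promise((resolve) => old.once('end', resolve));
      old.quit(reason);
      await ended;
      const next = this.createBot();
      await new Promise((resolve, reject) => {
        next.once('spawn', resolve);
        next.once('error', reject);
      });
      this._attach(next, true);
      return true;
    } catch (err) {
      this.cleanKill(`Soft resync failed: ${err.message}`); // fall back to a restart
      return false;
    } finally {
      this._reconnecting = false;
    }
  }
}
```

In this sketch a call site such as the smelting command would simply `await agent.softResync('refresh inventory after smelting')`. Waiting for the old bot's `end` before creating the new one is a choice made here for clarity; the PR itself may order these steps differently.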
Validation — please read this part
I want to be upfront about how far this has and hasn't been tested, because it touches the connection lifecycle.
Tested live on a real server (Paper 1.21.x), running the exact code in this PR. I connected the bot through a local TCP proxy so I could inject faults at the network layer without touching the code under test, and ran it autonomously with another real player online:
- `!smeltItem` completed (`Successfully smelted raw_iron, got 1 iron_ingot`), then `softResync` fired: the bot quit and reconnected through the proxy in ~3s and logged `Soft resync complete; world cache refreshed, agent state preserved`. The process never exited (no restart), and the agent immediately continued its self-prompt loop with its goal and history intact, which is the exact win over the old full-restart path.
- … `block_change` / `multi_block_change` packets to force the world cache out of sync, then triggers `softResync()` and verifies the cache is correct again afterwards.

What I have NOT done, and where I'd genuinely value help:
I'm opening this so others can run it on different servers and conditions, surface the cases I can't reproduce, and suggest improvements — rather than sitting on it until I've personally covered every environment. Happy to gate it behind a setting until it's proven more broadly, or share the fault-injection harness if it helps review.