
fix(snapshot): detect and recover validator vote snapshot inconsisten…#112

Open
On1x wants to merge 28 commits into master from snapshot-fix

@On1x On1x commented May 21, 2026

…cies

  • Add sanity check during export to warn if validators exist but validator votes are absent
  • Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
  • Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
  • Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts
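
A minimal sketch of the export check and the import fallback described above. It assumes a snapshot can be modelled as a map from section key to serialized rows. The `snapshot_sections` type and both helper names are hypothetical, and only the `validator_vote` and `witness_vote` keys come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed snapshot: section key -> serialized rows.
using snapshot_sections = std::map<std::string, std::vector<std::string>>;

// Export side: validators without a single vote suggests an incomplete export.
inline bool export_looks_incomplete(std::size_t validator_count,
                                    std::size_t vote_count) {
    return validator_count > 0 && vote_count == 0;
}

// Import side: prefer "validator_vote"; fall back to the legacy "witness_vote"
// key only when the current key is missing or empty.
inline const std::vector<std::string>& select_validator_votes(
        const snapshot_sections& sections, bool& used_legacy_key) {
    static const std::vector<std::string> empty;
    used_legacy_key = false;

    auto current = sections.find("validator_vote");
    if (current != sections.end() && !current->second.empty())
        return current->second;

    auto legacy = sections.find("witness_vote");
    if (legacy != sections.end() && !legacy->second.empty()) {
        used_legacy_key = true;  // caller can log that recovery took place
        return legacy->second;
    }
    return empty;  // nothing to import under either key
}
```

The `used_legacy_key` output lets the caller log a warning when the fallback was taken, so the recovery is not silent.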

On1x added 28 commits May 21, 2026 06:41
…cies

- Add sanity check during export to warn if validators exist but validator votes are absent
- Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
- Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
- Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts
…options

- Deleted all mentions of `LOW_MEMORY_NODE` from build scripts, environment variables, and documentation
- Removed low-memory node build instructions and flags from Linux, macOS, and Windows build guides
- Updated CMake options and environment variables to exclude low-memory settings
- Simplified Docker image CMake flags by removing `LOW_MEMORY_NODE`
- Cleared low-memory related config references in node setup and getting started guides
- Cleaned up example config files by removing deprecated plugins and options related to low-memory builds
- Delete config_debug_mongo.ini to clean up obsolete debug mongo configuration
- Remove config_mongo.ini to eliminate outdated mongo production configuration
- Simplify project configuration by removing unused or legacy mongo ini files
- Changed info-level logs (ilog) to debug-level logs (dlog) when connecting to peers and sending DLT hello messages
- Updated rate-limit notification from ilog to dlog for peer exchange requests
- Ensured logging reflects appropriate verbosity level for peer communication events
- Handle CORS preflight by responding to OPTIONS method with proper headers
- Append Access-Control-Allow-Origin header to all HTTP responses
- Add Access-Control-Allow-Methods, Allow-Headers, and Max-Age headers for OPTIONS responses
- Ensure CORS headers are included on error and success responses
- Prevent CORS issues for cross-origin API calls through the webserver plugin
- Add check to skip logging if disconnect is already in progress for a peer
- Avoid re-entrance in send_message calls during handle_disconnect coroutine
- Prevent excessive log entries when send queue is at max depth and peer disconnects
…ad fiber

- Close socket first to unblock pending I/O and avoid multi-second hangs
- Erase connection after closing to prevent dangling shared_ptr references
- Cancel read fiber only after socket is closed to ensure immediate exit
- Retain reentrancy guard to keep peer state valid during disconnect handling
- Adjust order of operations to fix deadlock when multiple peers disconnect simultaneously
- Introduced Ƶ as the short symbol for VIZ chosen by the community
- Explained common practice of showing balances with 2 decimal places
- Noted that even staked funds (SHARES) are displayed as Ƶ with staking notes
- Clarified symbol usage in wallets, explorers, and applications

docs(webserver): document native CORS support in webserver plugin

- Detailed handling of browser cross-origin requests without reverse proxy
- Specified preflight (OPTIONS) response headers and values
- Confirmed all other responses include Access-Control-Allow-Origin: *
- Mentioned compatibility with production setups using nginx proxy
- Highlighted use cases for browser-based wallets and dApps calling JSON-RPC endpoints directly
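
The header rules listed above can be sketched as a single function. This is an illustration only: `cors_headers_for` is a hypothetical name, and the values used for Allow-Methods, Allow-Headers, and Max-Age are assumptions, since the commit messages name the headers but not their values.

```cpp
#include <cassert>
#include <map>
#include <string>

using header_map = std::map<std::string, std::string>;

// Returns the CORS headers to attach to a response for the given HTTP method.
inline header_map cors_headers_for(const std::string& method) {
    header_map headers;
    // Sent on every response, success or error.
    headers["Access-Control-Allow-Origin"] = "*";
    if (method == "OPTIONS") {
        // Preflight only. The three values below are assumed for illustration.
        headers["Access-Control-Allow-Methods"] = "POST, GET, OPTIONS";
        headers["Access-Control-Allow-Headers"] = "Content-Type";
        headers["Access-Control-Max-Age"] = "86400";
    }
    return headers;
}
```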
After shared-memory corruption triggers attempt_auto_recovery(), the
function sets currently_syncing=true so the validator plugin defers
block production during the wipe / snapshot import / dlt_block_log
replay sequence.  Once the database is rebuilt and P2P is resumed,
the flag was expected to self-clear on the next applied block via
plugin_impl::accept_block(), which stores the caller-supplied
sync_mode flag whenever a block is successfully pushed.

That self-clearing path never runs on the DLT pipeline.  The DLT P2P
delegate (dlt_delegate::accept_block in plugins/p2p/p2p_plugin.cpp)
calls chain.db().push_block() directly and bypasses
plugin_impl::accept_block() entirely, so neither broadcast blocks nor
gap-fill replies ever update currently_syncing.  The only remaining
clearer is transition_to_forward(), but a node that was in FORWARD
mode at the moment of corruption stays in FORWARD throughout
pause/resume — transition_to_forward() is never invoked, so the flag
is permanently stuck at true.

The validator gate at plugins/validator/validator.cpp checks
chain().is_syncing() in DLT mode and returns not_synced, producing
the observed indefinite "Block production deferred: not_synced
(head=#X, catching_up=false)" loop where head keeps advancing via
P2P but no local block is produced.

Fix: explicitly clear currently_syncing immediately after
do_snapshot_load(data_dir, true) returns successfully in
attempt_auto_recovery().  Post-recovery catchup remains correctly
gated by _catchup_after_pause in the P2P layer, which the periodic
task clears once no peer is ahead of our head.
The deferred-snapshot wake-up in on_applied_block previously used
head_block_time() >= pending_snapshot_safe_after_time, which fires on
the very block the local validator just produced.

The applied_block signal is dispatched synchronously from _push_block
inside db.generate_block(), and the validator only calls
p2p().broadcast_block() after generate_block() returns. So firing the
snapshot on the same block let the snapshot read-lock start before the
produced block had been broadcast to peers.

Change the condition to strictly greater than: the deferred snapshot
now waits until a SUBSEQUENT block is applied. That block is built by
another validator on top of ours, proving our block was produced,
applied locally, and propagated through the network. Only then does
the snapshot start reading state.

Cost is ~one block interval of additional delay, and only on slots
where the local validator was the deferral target. The non-producer
path is unchanged: snapshots still fire immediately at the originating
block when is_validator_producing_soon() is false.

Also expanded the surrounding comment block and updated the wake-up
log messages to reflect the new semantics.
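
The changed wake-up condition reduces to one comparison. A sketch, assuming both times are expressed in seconds; `deferred_snapshot_ready` is a hypothetical helper name, not a function from the codebase.

```cpp
#include <cassert>
#include <cstdint>

// True once a block strictly later than the deferral time has been applied.
inline bool deferred_snapshot_ready(std::uint32_t head_block_time,
                                    std::uint32_t pending_snapshot_safe_after_time) {
    // With >= this would fire on the block the local validator just produced,
    // before that block has been broadcast. Strictly greater waits for the
    // next applied block instead.
    return head_block_time > pending_snapshot_safe_after_time;
}
```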
Replace hardcoded b.validator with get_scheduled_validator(i + 2) so each
missed block line shows the validator scheduled for the slot immediately
after the miss, instead of repeating the current block producer for every
line.
…FORWARD oscillation

The static atomic recovery_in_progress flag in attempt_auto_recovery() was
never reset to false after successful recovery, making any subsequent
corruption event permanently unrecoverable ("already in progress, skipping
duplicate attempt").  Reset it after P2P resume so the node can recover from
future corruption events.

Add a consecutive recovery counter (max 3 within 5 minutes) to prevent
infinite recovery loops when the snapshot or block log is itself corrupted.

In request_gap_fill(), remove the SYNC transition and peer request loop from
the "no peer available" fallback path.  When no peer has a higher head,
transition_to_sync() followed by request_blocks_from_peer() immediately
detects all peers as "caught up" and calls transition_to_forward(), producing
rapid SYNC->FORWARD oscillation every 5 seconds.  Instead, just log and let
the periodic task retry when new peers connect.
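
One possible reading of the "max 3 within 5 minutes" guard is a sliding window over recent attempt times. The sketch below assumes that reading; the actual counter may instead be a simple count with a reset timestamp. `recovery_limiter` and its members are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

class recovery_limiter {
public:
    using clock = std::chrono::steady_clock;

    recovery_limiter(std::size_t max_attempts, clock::duration window)
        : max_attempts_(max_attempts), window_(window) {}

    // Returns true and records the attempt if fewer than max_attempts
    // attempts started within the window; otherwise returns false.
    bool try_begin(clock::time_point now) {
        // Forget attempts that are older than the window.
        while (!attempts_.empty() && now - attempts_.front() >= window_)
            attempts_.pop_front();
        if (attempts_.size() >= max_attempts_)
            return false;  // likely a loop: snapshot or block log is bad
        attempts_.push_back(now);
        return true;
    }

private:
    std::size_t max_attempts_;
    clock::duration window_;
    std::deque<clock::time_point> attempts_;
};
```

Passing the current time in as a parameter keeps the limiter deterministic and easy to exercise without sleeping.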
Node crashes silently between DLT block log open and "Done opening
block log" with no error output. Add step-by-step ilog() calls to
every major operation in the critical path so the exact failing
step is visible in the next crash log:

- block_log and dlt_block_log head after open
- Before/after undo_all() with revision values
- Revision mismatch detection with values
- Before reading head block from block_log
- fork_db seeding start in both normal and DLT modes
- Before/after init_hardforks() (second call)
- Before validator schedule integrity check

Also add db.open() success log in chain plugin_startup.
All reads and writes to the currently_syncing atomic flag used relaxed
ordering, which does not guarantee cross-thread visibility on non-x86
architectures.  The recovery thread writes currently_syncing=false after
rebuilding the database, and the validator production thread reads it to
decide whether to produce blocks.  Upgrade to release/acquire ordering to
ensure the store is visible to the reader on all platforms.

store  → memory_order_release (3 sites)
load   → memory_order_acquire (1 site)
exchange → memory_order_acq_rel (1 site)
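
The three orderings can be illustrated with a small wrapper around `std::atomic<bool>`. This is a sketch: `sync_flag` and its method names are hypothetical, and only the flag name and the chosen memory orders come from the commit message. A release store paired with an acquire load also guarantees that writes made before the store (here, the rebuilt database state) are visible to the thread that observes the new flag value.

```cpp
#include <atomic>
#include <cassert>

class sync_flag {
public:
    // Writer side (e.g. the recovery thread): publish the new state.
    void set_syncing(bool value) {
        currently_syncing_.store(value, std::memory_order_release);
    }

    // Reader side (e.g. the production thread): observe the published state.
    bool is_syncing() const {
        return currently_syncing_.load(std::memory_order_acquire);
    }

    // Read-modify-write: returns the previous value.
    bool exchange_syncing(bool value) {
        return currently_syncing_.exchange(value, std::memory_order_acq_rel);
    }

private:
    std::atomic<bool> currently_syncing_{true};
};
```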
undo_all() in database::open() causes a silent SIGSEGV when shared
memory is corrupted after a hard crash. Since segfaults bypass all
C++ exception handlers, the node enters an infinite restart loop in
Docker without ever reaching the recovery path.

Introduce a marker file (state/undo_all_in_progress) that is created
before undo_all() and removed after it completes. If the process
crashes inside undo_all(), the marker survives and triggers
database_revision_exception on the next startup, which activates
the existing snapshot recovery path.

Marker cleanup is added to:
- database::open() — removed after successful undo_all()
- database::open_from_snapshot() — cleaned before snapshot import
- database::wipe() — cleaned during shared memory wipe
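
The marker-file protocol can be sketched with `std::filesystem` (C++17). `crash_marker` and `run_guarded` are hypothetical names, and where the real code throws database_revision_exception this sketch simply returns false.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

class crash_marker {
public:
    explicit crash_marker(fs::path path) : path_(std::move(path)) {}

    // True if a previous run died between create() and clear().
    bool left_behind() const { return fs::exists(path_); }

    void create() const {
        std::ofstream out(path_);  // an empty file is enough
        out.flush();
    }

    void clear() const {
        std::error_code ec;        // ignore "file not found"
        fs::remove(path_, ec);
    }

private:
    fs::path path_;
};

// Runs risky_step between marker creation and removal. Returns false without
// running it when a stale marker shows that the previous attempt crashed.
// In this sketch the marker also stays behind if risky_step throws.
template <typename Fn>
bool run_guarded(const crash_marker& marker, Fn&& risky_step) {
    if (marker.left_behind())
        return false;              // caller should enter the recovery path
    marker.create();
    risky_step();
    marker.clear();
    return true;
}
```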
- Detect resize_in_progress crash marker on startup, throw
  database_revision_exception to trigger recovery path
- Add post-resize validation: verify max_memory() increased and
  dynamic_global_property_object survived the remap in both
  _resize() (immediate) and apply_pending_resize() (deferred)
- Fix bad_alloc -> std::terminate in _push_block: heap-allocate undo
  session and explicitly destroy before exception unwinding
- Clean up resize crash markers in database::wipe()
- Update shared-memory.md docs and RU/ZH-CN translations with
  safety mechanisms, updated startup sequence, and recovery scenarios
validator, webserver, network_broadcast_api, database_api,
account_by_key, and custom_protocol_api all connected to database
signals (applied_block, pre/post_apply_operation) but never stored
or disconnected the connection handles. On shutdown, callbacks into
partially-destroyed plugin state could cause use-after-free crashes.

Each plugin now stores boost::signals2::connection members and calls
.disconnect() in plugin_shutdown() before releasing owned resources.
validator additionally disconnects before stopping the production
io_service to prevent the callback racing the thread join.
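
The store-then-disconnect pattern looks roughly like this. To keep the sketch self-contained, `block_signal` is a minimal hand-written stand-in for a boost::signals2 signal and uses integer handles instead of boost::signals2::connection. `example_plugin` is likewise hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Stand-in for a signal type; not the Boost API.
class block_signal {
public:
    using slot = std::function<void(int)>;

    int connect(slot s) {
        slots_[next_id_] = std::move(s);
        return next_id_++;
    }
    void disconnect(int id) { slots_.erase(id); }
    void emit(int block_num) {
        for (auto& entry : slots_) entry.second(block_num);
    }

private:
    std::map<int, slot> slots_;
    int next_id_ = 1;
};

class example_plugin {
public:
    explicit example_plugin(block_signal& applied_block)
        : applied_block_(applied_block) {}

    void plugin_startup() {
        // Keep the handle so the callback can be removed later.
        applied_block_conn_ = applied_block_.connect(
            [this](int block_num) { last_seen_ = block_num; });
    }

    void plugin_shutdown() {
        // Disconnect first, then release anything the callback touches.
        if (applied_block_conn_ != 0) {
            applied_block_.disconnect(applied_block_conn_);
            applied_block_conn_ = 0;
        }
    }

    int last_seen() const { return last_seen_; }

private:
    block_signal& applied_block_;
    int applied_block_conn_ = 0;
    int last_seen_ = 0;
};
```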
…ount_history

Both account_history and operation_history registered the same option
name in the shared appbase cli options_description, causing boost to
throw "option is ambiguous" on startup. account_history already reads
the value registered by operation_history, so re-registration is
unnecessary.
apply_pending_resize() is called from two threads before their
respective write locks: the P2P thread (push_block) and the validator
thread (generate_block).  Both can see _pending_resize==true
simultaneously, both pass begin_resize_barrier() (which releases its
internal mutex on return), and both call resize() concurrently —
resulting in simultaneous _segment.reset()+open()+add_index() on the
same database object, corrupting the chainbase B-tree indices.

Add _apply_resize_mutex with a double-check pattern so only one thread
performs the resize; the second thread exits early after seeing
_pending_resize already cleared.

This is the root cause of periodic shared memory corruption (~20-30h
intervals) in DLT mode: account_history pruning (history-count-blocks)
creates constant alloc/free cycles in the boost::interprocess heap,
accumulating fragmentation until free_memory() drops below the resize
threshold.  More frequent resizes increase the probability of the race
hitting.  In classic mode without pruning, resizes are rare enough
that the race is practically unreachable.
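
A sketch of the double-check pattern, with a counter standing in for the real resize. The `resizer` class is hypothetical, and its member names only mirror `_pending_resize` and `_apply_resize_mutex` from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class resizer {
public:
    void request_resize() {
        pending_resize_.store(true, std::memory_order_release);
    }

    // Returns true if this call performed the resize.
    bool apply_pending_resize() {
        // First check, without the lock: the common case is "nothing to do".
        if (!pending_resize_.load(std::memory_order_acquire))
            return false;

        std::lock_guard<std::mutex> lock(apply_resize_mutex_);

        // Second check, under the lock: another thread may have finished the
        // resize while this one was waiting for the mutex.
        if (!pending_resize_.load(std::memory_order_acquire))
            return false;

        ++resize_count_;  // placeholder for the actual segment resize
        pending_resize_.store(false, std::memory_order_release);
        return true;
    }

    // Only read this when no thread can be inside apply_pending_resize().
    int resize_count() const { return resize_count_; }

private:
    std::atomic<bool> pending_resize_{false};
    std::mutex apply_resize_mutex_;
    int resize_count_ = 0;
};
```

When two threads both observe the pending flag, the mutex serializes them and the second check makes the loser return early, so the resize body runs once per request.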
After snapshot-based auto-recovery, soft-bans accumulated before the
corruption were kept alive (up to 3600s), including bans on peers that
carry the majority fork.  The node would then gap-fill from the only
available (minority-fork) peer and get stuck.

Add reset_peers_after_recovery(), which calls emergency_peer_reset() on
the P2P thread: it moves all banned peers to the disconnected state and
resets reconnect backoff to zero so majority-fork peers reconnect
immediately.

Call it in attempt_auto_recovery() before resume_block_processing().
…ork switch

When on_dlt_gap_fill_reply detects a dead-fork block from a peer that
is ahead (our fork is losing), the old code immediately called
transition_to_sync() + request_blocks_from_peer() which fell back to
LIB only if gap > FORWARD_FALLBEHIND_THRESHOLD.  For small gaps the
request started from our_head (wrong fork), the peer returned its
version of our_head as a dead-fork block again — infinite 5-second loop.

Fix: on "our fork is losing" detection, set _gap_fill_fork_override_start
= our_lib.  The next request_gap_fill() call uses LIB as the start
instead of our_head.  Blocks between LIB and the divergence point are
ALREADY_KNOWN; blocks after the divergence point land in fork_db as
FORK_DB_ONLY (majority chain).  Once the majority chain accumulates
sufficient height, the normal fork switch fires.  Override is one-shot:
cleared to 0 after use so subsequent gap fills resume from our_head.
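
The one-shot override can be sketched as a value that is consumed on first use. `gap_fill_start` and its methods are hypothetical names. As in the commit message, 0 means "no override".

```cpp
#include <cassert>
#include <cstdint>

class gap_fill_start {
public:
    // Called when "our fork is losing" is detected.
    void set_fork_override(std::uint32_t our_lib) {
        fork_override_start_ = our_lib;
    }

    // Returns the block number the next gap-fill request should start from.
    // The override, if set, is used once and then cleared.
    std::uint32_t next_start(std::uint32_t our_head) {
        if (fork_override_start_ != 0) {
            const std::uint32_t start = fork_override_start_;
            fork_override_start_ = 0;
            return start;
        }
        return our_head;
    }

private:
    std::uint32_t fork_override_start_ = 0;
};
```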
…e_block

get_dynamic_global_properties() returns a const& directly into the
shared memory segment.  Between line 1280 (where the ref was obtained)
and line 1629 (where op_guard is created), _active_operations can be 0.
A concurrent P2P push_block calling apply_pending_resize() would see
_active_operations==0, pass begin_resize_barrier(), call _segment.reset()
and remap the segment — leaving the dangling ref to produce a SIGSEGV on
the next access to dgp.emergency_consensus_active.

Read emergency_consensus_active into a local bool under with_weak_read_lock
(which acquires an op_guard internally) before any other shared memory
access.  Replace all six uses of dgp.emergency_consensus_active in
maybe_validate_block with the local copy.
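
The copy-under-lock idea can be sketched with a plain mutex standing in for with_weak_read_lock and a heap object standing in for the shared memory segment. All types below are hypothetical. The point is that the caller keeps a value, not a reference, so a later remap cannot invalidate it.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

struct dynamic_props {
    bool emergency_consensus_active = false;
};

class segment {
public:
    void set_emergency(bool value) {
        std::lock_guard<std::mutex> guard(mutex_);
        props_->emergency_consensus_active = value;
    }

    // Copies the flag out while holding the lock.
    bool read_emergency_flag() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return props_->emergency_consensus_active;
    }

    // Simulates a remap: the old object is destroyed and replaced, so any
    // reference obtained earlier would dangle.
    void remap() {
        std::lock_guard<std::mutex> guard(mutex_);
        auto fresh = std::make_unique<dynamic_props>(*props_);
        props_ = std::move(fresh);
    }

private:
    mutable std::mutex mutex_;
    std::unique_ptr<dynamic_props> props_ = std::make_unique<dynamic_props>();
};
```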