fix(snapshot): detect and recover validator vote snapshot inconsisten… by On1x · Pull Request #112 · VIZ-Blockchain/viz-cpp-node

On1x · 2026-05-21T02:43:46Z

…cies

Add sanity check during export to warn if validators exist but validator votes are absent
Log warning about possible chainbase type-enum mismatch causing incomplete snapshot
Implement fallback during import to recover validator votes from legacy witness_vote key if validator_vote is empty
Improve snapshot integrity by handling potential silent corruption cases due to type enum shifts
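The export check and import fallback described above can be sketched roughly as follows. This is a minimal illustration, not the node's actual snapshot code: `snapshot_sections`, `votes_look_incomplete`, and `select_vote_rows` are hypothetical names, and only the `validator_vote` and `witness_vote` keys come from the description.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed snapshot: section key -> serialized rows.
using snapshot_sections = std::map<std::string, std::vector<std::string>>;

// Export-side sanity check: validators with zero votes suggests that the
// vote index was not exported (for example after a type-enum shift).
bool votes_look_incomplete(std::size_t validator_count, std::size_t vote_count) {
    return validator_count > 0 && vote_count == 0;
}

// Import-side fallback: prefer "validator_vote"; if it is missing or empty,
// fall back to the legacy "witness_vote" key.
const std::vector<std::string>& select_vote_rows(const snapshot_sections& s) {
    static const std::vector<std::string> empty;
    auto current = s.find("validator_vote");
    if (current != s.end() && !current->second.empty()) return current->second;
    auto legacy = s.find("witness_vote");
    if (legacy != s.end()) return legacy->second;
    return empty;
}
```

A caller would log a warning when `votes_look_incomplete` returns true during export, and import whatever `select_vote_rows` returns.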


…options
- Deleted all mentions of `LOW_MEMORY_NODE` from build scripts, environment variables, and documentation
- Removed low-memory node build instructions and flags from Linux, macOS, and Windows build guides
- Updated CMake options and environment variables to exclude low-memory settings
- Simplified Docker image CMake flags by removing `LOW_MEMORY_NODE`
- Cleared low-memory related config references in node setup and getting started guides
- Cleaned up example config files by removing deprecated plugins and options related to low-memory builds

- Delete config_debug_mongo.ini to clean up obsolete debug mongo configuration
- Remove config_mongo.ini to eliminate outdated mongo production configuration
- Simplify project configuration by removing unused or legacy mongo ini files

- Changed info-level logs (ilog) to debug-level logs (dlog) when connecting to peers and sending DLT hello messages
- Updated rate-limit notification from ilog to dlog for peer exchange requests
- Ensured logging reflects appropriate verbosity level for peer communication events

- Handle CORS preflight by responding to OPTIONS method with proper headers
- Append Access-Control-Allow-Origin header to all HTTP responses
- Add Access-Control-Allow-Methods, Allow-Headers, and Max-Age headers for OPTIONS responses
- Ensure CORS headers are included on error and success responses
- Prevent CORS issues for cross-origin API calls through the webserver plugin
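The header logic in this commit could look roughly like the sketch below. The function name `cors_headers` and the concrete values for Allow-Methods, Allow-Headers, and Max-Age are assumptions made for illustration; the commit text only confirms that `Access-Control-Allow-Origin` is sent on every response and that the other three are added for OPTIONS.

```cpp
#include <map>
#include <string>

using headers = std::map<std::string, std::string>;

// Builds the CORS headers for a response to a request with the given method.
headers cors_headers(const std::string& method) {
    headers h;
    // Sent on every response, success or error.
    h["Access-Control-Allow-Origin"] = "*";
    if (method == "OPTIONS") {
        // Preflight-only headers; the values here are assumed, not from the PR.
        h["Access-Control-Allow-Methods"] = "POST, GET, OPTIONS";
        h["Access-Control-Allow-Headers"] = "Content-Type";
        h["Access-Control-Max-Age"] = "86400";
    }
    return h;
}
```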

- Add check to skip logging if disconnect is already in progress for a peer
- Avoid re-entrance in send_message calls during handle_disconnect coroutine
- Prevent excessive log entries when send queue is at max depth and peer disconnects

…ad fiber
- Close socket first to unblock pending I/O and avoid multi-second hangs
- Erase connection after closing to prevent dangling shared_ptr references
- Cancel read fiber only after socket is closed to ensure immediate exit
- Retain reentrancy guard to keep peer state valid during disconnect handling
- Adjust order of operations to fix deadlock when multiple peers disconnect simultaneously

- Introduced Ƶ as the short symbol for VIZ chosen by the community
- Explained common practice of showing balances with 2 decimal places
- Noted that even staked funds (SHARES) are displayed as Ƶ with staking notes
- Clarified symbol usage in wallets, explorers, and applications

docs(webserver): document native CORS support in webserver plugin
- Detailed handling of browser cross-origin requests without reverse proxy
- Specified preflight (OPTIONS) response headers and values
- Confirmed all other responses include Access-Control-Allow-Origin: *
- Mentioned compatibility with production setups using nginx proxy
- Highlighted use cases for browser-based wallets and dApps calling JSON-RPC endpoints directly

After shared-memory corruption triggers attempt_auto_recovery(), the function sets currently_syncing=true so the validator plugin defers block production during the wipe / snapshot import / dlt_block_log replay sequence. Once the database is rebuilt and P2P is resumed, the flag was expected to self-clear on the next applied block via plugin_impl::accept_block(), which stores the caller-supplied sync_mode flag whenever a block is successfully pushed. That self-clearing path never runs on the DLT pipeline. The DLT P2P delegate (dlt_delegate::accept_block in plugins/p2p/p2p_plugin.cpp) calls chain.db().push_block() directly and bypasses plugin_impl::accept_block() entirely, so neither broadcast blocks nor gap-fill replies ever update currently_syncing. The only remaining clearer is transition_to_forward(), but a node that was in FORWARD mode at the moment of corruption stays in FORWARD throughout pause/resume — transition_to_forward() is never invoked, so the flag is permanently stuck at true. The validator gate at plugins/validator/validator.cpp checks chain().is_syncing() in DLT mode and returns not_synced, producing the observed indefinite "Block production deferred: not_synced (head=#X, catching_up=false)" loop where head keeps advancing via P2P but no local block is produced. Fix: explicitly clear currently_syncing immediately after do_snapshot_load(data_dir, true) returns successfully in attempt_auto_recovery(). Post-recovery catchup remains correctly gated by _catchup_after_pause in the P2P layer, which the periodic task clears once no peer is ahead of our head.

The deferred-snapshot wake-up in on_applied_block previously used head_block_time() >= pending_snapshot_safe_after_time, which fires on the very block the local validator just produced. The applied_block signal is dispatched synchronously from _push_block inside db.generate_block(), and the validator only calls p2p().broadcast_block() after generate_block() returns. So firing the snapshot on the same block let the snapshot read-lock start before the produced block had been broadcast to peers. Change the condition to strictly greater than: the deferred snapshot now waits until a SUBSEQUENT block is applied. That block is built by another validator on top of ours, proving our block was produced, applied locally, and propagated through the network. Only then does the snapshot start reading state. Cost is ~one block interval of additional delay, and only on slots where the local validator was the deferral target. The non-producer path is unchanged: snapshots still fire immediately at the originating block when is_validator_producing_soon() is false. Also expanded the surrounding comment block and updated the wake-up log messages to reflect the new semantics.
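The change from `>=` to `>` can be illustrated with a minimal predicate. Times are modeled here as plain seconds and the function name is hypothetical; the real code compares `head_block_time()` against `pending_snapshot_safe_after_time`.

```cpp
#include <cstdint>

// Returns true when the deferred snapshot may start.
// With a strict comparison, the block whose time equals safe_after_time
// (the one the local validator just produced) does not trigger it;
// only a later block does.
bool should_fire_deferred_snapshot(std::uint32_t head_block_time,
                                   std::uint32_t safe_after_time) {
    return head_block_time > safe_after_time;
}
```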

Replace hardcoded b.validator with get_scheduled_validator(i + 2) so each missed block line shows the validator scheduled for the slot immediately after the miss, instead of repeating the current block producer for every line.

…FORWARD oscillation The static atomic recovery_in_progress flag in attempt_auto_recovery() was never reset to false after successful recovery, making any subsequent corruption event permanently unrecoverable ("already in progress, skipping duplicate attempt"). Reset it after P2P resume so the node can recover from future corruption events. Add a consecutive recovery counter (max 3 within 5 minutes) to prevent infinite recovery loops when the snapshot or block log is itself corrupted. In request_gap_fill(), remove the SYNC transition and peer request loop from the "no peer available" fallback path. When no peer has a higher head, transition_to_sync() followed by request_blocks_from_peer() immediately detects all peers as "caught up" and calls transition_to_forward(), producing rapid SYNC->FORWARD oscillation every 5 seconds. Instead, just log and let the periodic task retry when new peers connect.
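One way to read the "max 3 within 5 minutes" limit is as a sliding window over recent attempt times. The sketch below takes that interpretation; `recovery_limiter` and `try_begin` are hypothetical names, and the actual counter in the PR may be implemented differently.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Allows at most max_attempts recoveries inside any window-long interval.
class recovery_limiter {
public:
    using clock = std::chrono::steady_clock;

    recovery_limiter(std::size_t max_attempts, clock::duration window)
        : max_attempts_(max_attempts), window_(window) {}

    // Records an attempt at `now` and returns true, or returns false
    // if the limit for the current window is already reached.
    bool try_begin(clock::time_point now) {
        // Forget attempts that have aged out of the window.
        while (!attempts_.empty() && now - attempts_.front() >= window_) {
            attempts_.pop_front();
        }
        if (attempts_.size() >= max_attempts_) return false;
        attempts_.push_back(now);
        return true;
    }

private:
    std::size_t max_attempts_;
    clock::duration window_;
    std::deque<clock::time_point> attempts_;
};
```

With `recovery_limiter limiter(3, std::chrono::minutes(5))`, a fourth call to `try_begin` inside five minutes returns false, which is where a recovery loop would stop instead of retrying against a corrupted snapshot or block log.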

Node crashes silently between DLT block log open and "Done opening block log" with no error output. Add step-by-step ilog() calls to every major operation in the critical path so the exact failing step is visible in the next crash log:
- block_log and dlt_block_log head after open
- Before/after undo_all() with revision values
- Revision mismatch detection with values
- Before reading head block from block_log
- fork_db seeding start in both normal and DLT modes
- Before/after init_hardforks() (second call)
- Before validator schedule integrity check

Also add db.open() success log in chain plugin_startup.

All reads and writes to the currently_syncing atomic flag used relaxed ordering, which does not guarantee cross-thread visibility on non-x86 architectures. The recovery thread writes currently_syncing=false after rebuilding the database, and the validator production thread reads it to decide whether to produce blocks. Upgrade to release/acquire ordering to ensure the store is visible to the reader on all platforms.
- store → memory_order_release (3 sites)
- load → memory_order_acquire (1 site)
- exchange → memory_order_acq_rel (1 site)
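The ordering upgrade maps onto `std::atomic` calls as in the sketch below. The wrapper class `sync_flag` is hypothetical and exists only to keep the example self-contained; the store, load, and exchange orderings are the ones named in the commit.

```cpp
#include <atomic>

// Minimal wrapper showing the three access patterns and their orderings.
class sync_flag {
public:
    // Writer side (e.g. the recovery thread): release makes prior writes
    // visible to any thread that later observes this value with acquire.
    void set(bool value) {
        currently_syncing_.store(value, std::memory_order_release);
    }

    // Reader side (e.g. the block production thread).
    bool get() const {
        return currently_syncing_.load(std::memory_order_acquire);
    }

    // Read-modify-write: acts as both an acquire and a release.
    bool exchange(bool value) {
        return currently_syncing_.exchange(value, std::memory_order_acq_rel);
    }

private:
    std::atomic<bool> currently_syncing_{false};
};
```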

undo_all() in database::open() causes a silent SIGSEGV when shared memory is corrupted after a hard crash. Since segfaults bypass all C++ exception handlers, the node enters an infinite restart loop in Docker without ever reaching the recovery path. Introduce a marker file (state/undo_all_in_progress) that is created before undo_all() and removed after it completes. If the process crashes inside undo_all(), the marker survives and triggers database_revision_exception on the next startup, which activates the existing snapshot recovery path. Marker cleanup is added to:
- database::open(): removed after successful undo_all()
- database::open_from_snapshot(): cleaned before snapshot import
- database::wipe(): cleaned during shared memory wipe
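The marker-file technique can be sketched with `std::filesystem` (C++17). `crash_marker` and `run_guarded` are hypothetical names, and `std::runtime_error` stands in for the node's `database_revision_exception`; only the marker file name comes from the commit.

```cpp
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// A file whose presence means "a risky step started and never finished".
class crash_marker {
public:
    explicit crash_marker(fs::path p) : path_(std::move(p)) {}

    bool present() const { return fs::exists(path_); }

    void create() const {
        if (!path_.parent_path().empty()) fs::create_directories(path_.parent_path());
        std::ofstream out(path_);  // an empty file is enough
    }

    void remove() const {
        std::error_code ec;  // ignore "file not found"
        fs::remove(path_, ec);
    }

private:
    fs::path path_;
};

// Runs `risky` between marker creation and removal. If an earlier run
// crashed inside `risky`, the leftover marker is detected up front.
template <typename Fn>
void run_guarded(const crash_marker& marker, Fn&& risky) {
    if (marker.present()) {
        throw std::runtime_error("previous run crashed inside guarded step");
    }
    marker.create();
    risky();          // a SIGSEGV here leaves the marker on disk
    marker.remove();  // reached only when risky() returned normally
}
```

The key property is that the marker is written before the dangerous call, so even a crash that no exception handler can see leaves evidence for the next startup.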

- Detect resize_in_progress crash marker on startup, throw database_revision_exception to trigger recovery path
- Add post-resize validation: verify max_memory() increased and dynamic_global_property_object survived the remap in both _resize() (immediate) and apply_pending_resize() (deferred)
- Fix bad_alloc -> std::terminate in _push_block: heap-allocate undo session and explicitly destroy before exception unwinding
- Clean up resize crash markers in database::wipe()
- Update shared-memory.md docs and RU/ZH-CN translations with safety mechanisms, updated startup sequence, and recovery scenarios

validator, webserver, network_broadcast_api, database_api, account_by_key, and custom_protocol_api all connected to database signals (applied_block, pre/post_apply_operation) but never stored or disconnected the connection handles. On shutdown, callbacks into partially-destroyed plugin state could cause use-after-free crashes. Each plugin now stores boost::signals2::connection members and calls .disconnect() in plugin_shutdown() before releasing owned resources. validator additionally disconnects before stopping the production io_service to prevent the callback racing the thread join.

…ount_history Both account_history and operation_history registered the same option name in the shared appbase cli options_description, causing boost to throw "option is ambiguous" on startup. account_history already reads the value registered by operation_history, so re-registration is unnecessary.

apply_pending_resize() is called from two threads before their respective write locks: the P2P thread (push_block) and the validator thread (generate_block). Both can see _pending_resize==true simultaneously, both pass begin_resize_barrier() (which releases its internal mutex on return), and both call resize() concurrently — resulting in simultaneous _segment.reset()+open()+add_index() on the same database object, corrupting the chainbase B-tree indices. Add _apply_resize_mutex with a double-check pattern so only one thread performs the resize; the second thread exits early after seeing _pending_resize already cleared. This is the root cause of periodic shared memory corruption (~20-30h intervals) in DLT mode: account_history pruning (history-count-blocks) creates constant alloc/free cycles in the boost::interprocess heap, accumulating fragmentation until free_memory() drops below the resize threshold. More frequent resizes increase the probability of the race hitting. In classic mode without pruning, resizes are rare enough that the race is practically unreachable.
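The double-check pattern described here might look like the following sketch. `resize_gate` and its members are hypothetical names standing in for `_pending_resize` and `_apply_resize_mutex`; the real resize work is replaced by a caller-supplied callable.

```cpp
#include <atomic>
#include <mutex>

// Ensures a pending resize is applied by exactly one caller.
class resize_gate {
public:
    void request() { pending_.store(true, std::memory_order_release); }

    // Returns true if this call performed the resize.
    template <typename Fn>
    bool apply_if_pending(Fn&& do_resize) {
        // First check, without the lock: the common case is "nothing to do".
        if (!pending_.load(std::memory_order_acquire)) return false;

        std::lock_guard<std::mutex> lock(apply_mutex_);
        // Second check, under the lock: another thread may have finished
        // the resize while this one was waiting for the mutex.
        if (!pending_.load(std::memory_order_acquire)) return false;

        do_resize();
        pending_.store(false, std::memory_order_release);
        return true;
    }

private:
    std::atomic<bool> pending_{false};
    std::mutex apply_mutex_;
};
```

If two threads both observe the pending flag, the second one blocks on the mutex, then sees the flag cleared and exits early, so the resize body never runs concurrently.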

After snapshot-based auto-recovery, soft-bans accumulated before the corruption were kept alive (up to 3600s), including bans on peers that carry the majority fork. The node would then gap-fill from the only available (minority-fork) peer and get stuck. Add reset_peers_after_recovery() that calls emergency_peer_reset() on the P2P thread: it moves all banned peers back to the disconnected state and resets reconnect backoff to zero so majority-fork peers reconnect immediately. Call it in attempt_auto_recovery() before resume_block_processing().

…ork switch When on_dlt_gap_fill_reply detects a dead-fork block from a peer that is ahead (our fork is losing), the old code immediately called transition_to_sync() + request_blocks_from_peer() which fell back to LIB only if gap > FORWARD_FALLBEHIND_THRESHOLD. For small gaps the request started from our_head (wrong fork), the peer returned its version of our_head as a dead-fork block again — infinite 5-second loop. Fix: on "our fork is losing" detection, set _gap_fill_fork_override_start = our_lib. The next request_gap_fill() call uses LIB as the start instead of our_head. Blocks between LIB and the divergence point are ALREADY_KNOWN; blocks after the divergence point land in fork_db as FORK_DB_ONLY (majority chain). Once the majority chain accumulates sufficient height, the normal fork switch fires. Override is one-shot: cleared to 0 after use so subsequent gap fills resume from our_head.
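The one-shot override can be modeled as a small state holder. The class and method names below are hypothetical; the member mirrors `_gap_fill_fork_override_start`, with 0 meaning "no override" as in the commit text.

```cpp
#include <cstdint>

// Chooses the starting block number for the next gap-fill request.
class gap_fill_start {
public:
    // Called when a dead-fork reply shows that our fork is losing.
    void on_losing_fork(std::uint32_t our_lib) { fork_override_start_ = our_lib; }

    // Returns the override once, then falls back to our_head again.
    std::uint32_t next_start(std::uint32_t our_head) {
        if (fork_override_start_ != 0) {
            const std::uint32_t start = fork_override_start_;
            fork_override_start_ = 0;  // one-shot: cleared after use
            return start;
        }
        return our_head;
    }

private:
    std::uint32_t fork_override_start_ = 0;  // 0 = no override pending
};
```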

…e_block get_dynamic_global_properties() returns a const& directly into the shared memory segment. Between line 1280 (where the ref was obtained) and line 1629 (where op_guard is created), _active_operations can be 0. A concurrent P2P push_block calling apply_pending_resize() would see _active_operations==0, pass begin_resize_barrier(), call _segment.reset() and remap the segment — leaving the dangling ref to produce a SIGSEGV on the next access to dgp.emergency_consensus_active. Read emergency_consensus_active into a local bool under with_weak_read_lock (which acquires an op_guard internally) before any other shared memory access. Replace all six uses of dgp.emergency_consensus_active in maybe_validate_block with the local copy.

On1x added 28 commits May 21, 2026 06:41

docs: remove CORS from nginx examples

3af3eba

update chainbase

be747e0

add json api spec for plugins and json rpc methods

602e3f3

update info about account_history plugin in docs + fix purge

44cd2eb

update docker log rotation recommendations

e162c93

add try-catch sections to limit log spam

21d38d9

On1x wants to merge 28 commits into master from snapshot-fix

On1x commented May 21, 2026
