Skip to content

SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP#892

Open
JacksonYao287 wants to merge 2 commits into
eBay:stable/v7.xfrom
JacksonYao287:add-UT-for-index-crash-recovery
Open

SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP#892
JacksonYao287 wants to merge 2 commits into
eBay:stable/v7.xfrom
JacksonYao287:add-UT-for-index-crash-recovery

Conversation

@JacksonYao287

@JacksonYao287 JacksonYao287 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

During index CP recovery, if a btree node was both created (e.g. via a split) and freed (e.g. via a subsequent merge) within the same CP that crashed before any data or bitmap was flushed, the recovery path would incorrectly call free_blk on the node's blkid. If the repair phase had already reused that blkid for a new live interior node, the free_blk would silently corrupt it.

The first recovery pass processes all journal-recorded nodes and decides which blkids to release via deleted_bufs. For freed nodes the code first checked was_node_committed, then inside the "not committed" branch checked m_created_cp_id:

if (buf->m_node_freed) {
    if (was_node_committed(buf)) {
        deleted_bufs.push_back(buf);        // correct for prev-CP nodes
    } else {
        if (buf->m_created_cp_id != icp_ctx->id()) {
            m_vdev->commit_blk(buf->m_blkid);  // correct
        } else {
            deleted_bufs.push_back(buf);        // BUG
        }
        prune_from_up_buffer(buf);
    }
}

A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id()) and later freed in the same CP (m_node_freed == true) always landed in deleted_bufs, triggering free_blk during recovery cleanup.

The same bug also existed in the was_node_committed == true branch: a same-CP created+freed node whose parent had already been flushed to disk would also be added to deleted_bufs via line 624.

A node created within the crashed CP only had its blkid reserved via reserve_on_disk; the blkid was never committed to the allocator bitmap. After restart the allocator reloads the bitmap from the last fully committed CP and already treats that blkid as free. During the repair phase (repair_links -> alloc_interior_node), the allocator may reassign that exact blkid to a new live interior node. Recovery then calls free_blk(blkid) from deleted_bufs, releasing a node that is still in active use. Any subsequent allocation of that blkid overwrites the repair node's data, silently corrupting the index.

Committed CP:  Parent P(100) -> Leaf L(200) with keys [10,20,30,40]

Crashed CP (journal flushed, no node data or bitmap flush):
  Step 1: insert 25 -> L splits, new node N(300) created
          N: m_created_cp_id=current_cp, blkid 300 reserve_on_disk only
  Step 2: delete 25,30,40 -> N empties, merge frees N
          N: m_node_freed=true
  Crash

Recovery:
  - N: m_node_freed=true AND m_created_cp_id==icp_ctx->id()
  - Old code puts N in deleted_bufs - repair_links calls alloc_interior_node() -> allocator picks blkid 300 (free in the committed bitmap) - Cleanup: free_blk(300) from deleted_bufs -> live repair node released - Next allocation at blkid 300 overwrites repair node -> index corrupted

Check m_created_cp_id == icp_ctx->id() before was_node_committed so that same-CP created+freed nodes are handled as a distinct case and always bypass deleted_bufs:

if (buf->m_node_freed) {
    if (buf->m_created_cp_id == icp_ctx->id()) {
        // blkid never reached the bitmap; repair may already hold it
        if (was_node_committed(buf))
            pending_bufs.push_back(buf->m_up_buffer);
        else
            prune_from_up_buffer(buf);
        continue;                   // never touches deleted_bufs
    }
    if (was_node_committed(buf)) {
        // prev-CP node flushed to disk: schedule blkid release
        deleted_bufs.push_back(buf);
        pending_bufs.push_back(buf->m_up_buffer);
    } else {
        // prev-CP node freed but bitmap not flushed: recommit blkid
        m_vdev->commit_blk(buf->m_blkid);
        prune_from_up_buffer(buf);
    }
}

For same-CP nodes whose parent was already flushed (was_node_committed true), the parent still needs repair to remove the stale child link, so its up_buffer is pushed to pending_bufs. When the parent was not flushed, the in-memory child link is simply pruned. In both cases the blkid is never freed.

Also extracted the repeated three-step prune pattern (remove_down_buffer + prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer lambda to eliminate duplication in the same block.

…created and freed in the same crashed CP

During index CP recovery, if a btree node was both created (e.g. via a
split) and freed (e.g. via a subsequent merge) within the same CP that
crashed before any data or bitmap was flushed, the recovery path would
incorrectly call free_blk on the node's blkid. If the repair phase had
already reused that blkid for a new live interior node, the free_blk
would silently corrupt it.

The first recovery pass processes all journal-recorded nodes and decides
which blkids to release via deleted_bufs. For freed nodes the code first
checked was_node_committed, then inside the "not committed" branch checked
m_created_cp_id:

    if (buf->m_node_freed) {
        if (was_node_committed(buf)) {
            deleted_bufs.push_back(buf);        // correct for prev-CP nodes
        } else {
            if (buf->m_created_cp_id != icp_ctx->id()) {
                m_vdev->commit_blk(buf->m_blkid);  // correct
            } else {
                deleted_bufs.push_back(buf);        // BUG
            }
            prune_from_up_buffer(buf);
        }
    }

A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id())
and later freed in the same CP (m_node_freed == true) always landed in
deleted_bufs, triggering free_blk during recovery cleanup.

The same bug also existed in the was_node_committed == true branch: a
same-CP created+freed node whose parent had already been flushed to disk
would also be added to deleted_bufs via line 624.

A node created within the crashed CP only had its blkid reserved via
reserve_on_disk; the blkid was never committed to the allocator bitmap.
After restart the allocator reloads the bitmap from the last fully
committed CP and already treats that blkid as free. During the repair
phase (repair_links -> alloc_interior_node), the allocator may reassign
that exact blkid to a new live interior node. Recovery then calls
free_blk(blkid) from deleted_bufs, releasing a node that is still in
active use. Any subsequent allocation of that blkid overwrites the
repair node's data, silently corrupting the index.

    Committed CP:  Parent P(100) -> Leaf L(200) with keys [10,20,30,40]

    Crashed CP (journal flushed, no node data or bitmap flush):
      Step 1: insert 25 -> L splits, new node N(300) created
              N: m_created_cp_id=current_cp, blkid 300 reserve_on_disk only
      Step 2: delete 25,30,40 -> N empties, merge frees N
              N: m_node_freed=true
      Crash

    Recovery:
      - N: m_node_freed=true AND m_created_cp_id==icp_ctx->id()
      - Old code puts N in deleted_bufs
      - repair_links calls alloc_interior_node() -> allocator picks blkid 300
        (free in the committed bitmap)
      - Cleanup: free_blk(300) from deleted_bufs -> live repair node released
      - Next allocation at blkid 300 overwrites repair node -> index corrupted

Check m_created_cp_id == icp_ctx->id() before was_node_committed so that
same-CP created+freed nodes are handled as a distinct case and always
bypass deleted_bufs:

    if (buf->m_node_freed) {
        if (buf->m_created_cp_id == icp_ctx->id()) {
            // blkid never reached the bitmap; repair may already hold it
            if (was_node_committed(buf))
                pending_bufs.push_back(buf->m_up_buffer);
            else
                prune_from_up_buffer(buf);
            continue;                   // never touches deleted_bufs
        }
        if (was_node_committed(buf)) {
            // prev-CP node flushed to disk: schedule blkid release
            deleted_bufs.push_back(buf);
            pending_bufs.push_back(buf->m_up_buffer);
        } else {
            // prev-CP node freed but bitmap not flushed: recommit blkid
            m_vdev->commit_blk(buf->m_blkid);
            prune_from_up_buffer(buf);
        }
    }

For same-CP nodes whose parent was already flushed (was_node_committed
true), the parent still needs repair to remove the stale child link, so
its up_buffer is pushed to pending_bufs. When the parent was not flushed,
the in-memory child link is simply pruned. In both cases the blkid is
never freed.

Also extracted the repeated three-step prune pattern (remove_down_buffer
+ prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer
lambda to eliminate duplication in the same block.
@JacksonYao287 JacksonYao287 changed the title SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is … SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP Jun 8, 2026
@codecov-commenter

codecov-commenter commented Jun 8, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 47.05882% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v7.x@ef9ab9d). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/index/wb_cache.cpp 47.05% 0 Missing and 9 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v7.x     #892   +/-   ##
==============================================
  Coverage               ?   48.33%           
==============================================
  Files                  ?      110           
  Lines                  ?    12960           
  Branches               ?     6226           
==============================================
  Hits                   ?     6264           
  Misses                 ?     2552           
  Partials               ?     4144           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JacksonYao287 JacksonYao287 force-pushed the add-UT-for-index-crash-recovery branch from 893744e to 42881a2 Compare June 10, 2026 04:06
@JacksonYao287 JacksonYao287 force-pushed the add-UT-for-index-crash-recovery branch from 42881a2 to 27aed5f Compare June 11, 2026 02:47
Comment on lines +1728 to +1732
auto const recovered_nodes = inspectable_bt()->collect_node_ids();
for (const auto id : m_recovery_freed_node_ids) {
ASSERT_FALSE(recovered_nodes.contains(id))
<< "created_freed_node " << id << " was freed and also recovered as a live node";
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the critical case that this PR want to fix.

a bnode has already be allocated to a btree(recovered_nodes holds all the nodes that belongs to this btree ATM), but this node is finally freed (m_recovery_freed_node_ids holds all the freed nodes , which exists in the deleted_bufs) at the end of recovery.

this case will lead to corruption if this node is allocated to other btree node

Comment on lines +1734 to +1739
// 2 all the created+freed blkids should not be freed during recovery ( should not appear in
// m_recovery_freed_node_ids)
for (auto const id : created_freed_node_ids) {
ASSERT_FALSE(m_recovery_freed_node_ids.contains(id))
<< "created_freed_node " << id << " was freed during recovery";
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we check if a created_freed_node will be freed after recovery.

// is used for reproduce this issue(the current one) , the other is used for consuming blks(receives lots of put
// request and lead to a lot of blk allocation). commit_index_free_queue_until is used to consume free blks , just
// like what the second btree does.
commit_index_free_queue_until(crash_cp_free_queue_blks);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit_index_free_queue_until is used to consume blks.

yes , we can do it by insert a lot of keys , but that will be very slow. so , here I directly consume blks from vdev.


if (buf->m_node_freed) {
LOGTRACEMOD(wbcache, "recovering free buf {}", buf->to_string());
if (buf->m_created_cp_id == icp_ctx->id()) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a node is freed and the node is created in the crash cp (buf->m_created_cp_id == icp_ctx->id()), do not put it into the deleted_bufs.

this is the main fix of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants