SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP by JacksonYao287 · Pull Request #892 · eBay/HomeStore

JacksonYao287 · 2026-06-08T07:22:24Z

During index CP recovery, if a btree node was both created (e.g. via a split) and freed (e.g. via a subsequent merge) within the same CP that crashed before any data or bitmap was flushed, the recovery path would incorrectly call free_blk on the node's blkid. If the repair phase had already reused that blkid for a new live interior node, the free_blk would silently corrupt it.

The first recovery pass processes all journal-recorded nodes and decides which blkids to release via deleted_bufs. For freed nodes the code first checked was_node_committed, then inside the "not committed" branch checked m_created_cp_id:

if (buf->m_node_freed) {
    if (was_node_committed(buf)) {
        deleted_bufs.push_back(buf);        // correct for prev-CP nodes
    } else {
        if (buf->m_created_cp_id != icp_ctx->id()) {
            m_vdev->commit_blk(buf->m_blkid);  // correct
        } else {
            deleted_bufs.push_back(buf);        // BUG
        }
        prune_from_up_buffer(buf);
    }
}

A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id()) and later freed in the same CP (m_node_freed == true) always landed in deleted_bufs, triggering free_blk during recovery cleanup.

The same bug also existed in the was_node_committed == true branch: a same-CP created+freed node whose parent had already been flushed to disk would also be added to deleted_bufs via line 624.

A node created within the crashed CP only had its blkid reserved via reserve_on_disk; the blkid was never committed to the allocator bitmap. After restart the allocator reloads the bitmap from the last fully committed CP and already treats that blkid as free. During the repair phase (repair_links -> alloc_interior_node), the allocator may reassign that exact blkid to a new live interior node. Recovery then calls free_blk(blkid) from deleted_bufs, releasing a node that is still in active use. Any subsequent allocation of that blkid overwrites the repair node's data, silently corrupting the index.

Committed CP:  Parent P(100) -> Leaf L(200) with keys [10,20,30,40]

Crashed CP (journal flushed, no node data or bitmap flush):
  Step 1: insert 25 -> L splits, new node N(300) created
          N: m_created_cp_id=current_cp, blkid 300 reserve_on_disk only
  Step 2: delete 25,30,40 -> N empties, merge frees N
          N: m_node_freed=true
  Crash

Recovery:
  - N: m_node_freed=true AND m_created_cp_id==icp_ctx->id()
  - Old code puts N in deleted_bufs - repair_links calls alloc_interior_node() -> allocator picks blkid 300 (free in the committed bitmap) - Cleanup: free_blk(300) from deleted_bufs -> live repair node released - Next allocation at blkid 300 overwrites repair node -> index corrupted

Check m_created_cp_id == icp_ctx->id() before was_node_committed so that same-CP created+freed nodes are handled as a distinct case and always bypass deleted_bufs:

if (buf->m_node_freed) {
    if (buf->m_created_cp_id == icp_ctx->id()) {
        // blkid never reached the bitmap; repair may already hold it
        if (was_node_committed(buf))
            pending_bufs.push_back(buf->m_up_buffer);
        else
            prune_from_up_buffer(buf);
        continue;                   // never touches deleted_bufs
    }
    if (was_node_committed(buf)) {
        // prev-CP node flushed to disk: schedule blkid release
        deleted_bufs.push_back(buf);
        pending_bufs.push_back(buf->m_up_buffer);
    } else {
        // prev-CP node freed but bitmap not flushed: recommit blkid
        m_vdev->commit_blk(buf->m_blkid);
        prune_from_up_buffer(buf);
    }
}

For same-CP nodes whose parent was already flushed (was_node_committed true), the parent still needs repair to remove the stale child link, so its up_buffer is pushed to pending_bufs. When the parent was not flushed, the in-memory child link is simply pruned. In both cases the blkid is never freed.

Also extracted the repeated three-step prune pattern (remove_down_buffer + prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer lambda to eliminate duplication in the same block.

…created and freed in the same crashed CP During index CP recovery, if a btree node was both created (e.g. via a split) and freed (e.g. via a subsequent merge) within the same CP that crashed before any data or bitmap was flushed, the recovery path would incorrectly call free_blk on the node's blkid. If the repair phase had already reused that blkid for a new live interior node, the free_blk would silently corrupt it. The first recovery pass processes all journal-recorded nodes and decides which blkids to release via deleted_bufs. For freed nodes the code first checked was_node_committed, then inside the "not committed" branch checked m_created_cp_id: if (buf->m_node_freed) { if (was_node_committed(buf)) { deleted_bufs.push_back(buf); // correct for prev-CP nodes } else { if (buf->m_created_cp_id != icp_ctx->id()) { m_vdev->commit_blk(buf->m_blkid); // correct } else { deleted_bufs.push_back(buf); // BUG } prune_from_up_buffer(buf); } } A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id()) and later freed in the same CP (m_node_freed == true) always landed in deleted_bufs, triggering free_blk during recovery cleanup. The same bug also existed in the was_node_committed == true branch: a same-CP created+freed node whose parent had already been flushed to disk would also be added to deleted_bufs via line 624. A node created within the crashed CP only had its blkid reserved via reserve_on_disk; the blkid was never committed to the allocator bitmap. After restart the allocator reloads the bitmap from the last fully committed CP and already treats that blkid as free. During the repair phase (repair_links -> alloc_interior_node), the allocator may reassign that exact blkid to a new live interior node. Recovery then calls free_blk(blkid) from deleted_bufs, releasing a node that is still in active use. Any subsequent allocation of that blkid overwrites the repair node's data, silently corrupting the index. Committed CP: Parent P(100) -> Leaf L(200) with keys [10,20,30,40] Crashed CP (journal flushed, no node data or bitmap flush): Step 1: insert 25 -> L splits, new node N(300) created N: m_created_cp_id=current_cp, blkid 300 reserve_on_disk only Step 2: delete 25,30,40 -> N empties, merge frees N N: m_node_freed=true Crash Recovery: - N: m_node_freed=true AND m_created_cp_id==icp_ctx->id() - Old code puts N in deleted_bufs - repair_links calls alloc_interior_node() -> allocator picks blkid 300 (free in the committed bitmap) - Cleanup: free_blk(300) from deleted_bufs -> live repair node released - Next allocation at blkid 300 overwrites repair node -> index corrupted Check m_created_cp_id == icp_ctx->id() before was_node_committed so that same-CP created+freed nodes are handled as a distinct case and always bypass deleted_bufs: if (buf->m_node_freed) { if (buf->m_created_cp_id == icp_ctx->id()) { // blkid never reached the bitmap; repair may already hold it if (was_node_committed(buf)) pending_bufs.push_back(buf->m_up_buffer); else prune_from_up_buffer(buf); continue; // never touches deleted_bufs } if (was_node_committed(buf)) { // prev-CP node flushed to disk: schedule blkid release deleted_bufs.push_back(buf); pending_bufs.push_back(buf->m_up_buffer); } else { // prev-CP node freed but bitmap not flushed: recommit blkid m_vdev->commit_blk(buf->m_blkid); prune_from_up_buffer(buf); } } For same-CP nodes whose parent was already flushed (was_node_committed true), the parent still needs repair to remove the stale child link, so its up_buffer is pushed to pending_bufs. When the parent was not flushed, the in-memory child link is simply pruned. In both cases the blkid is never freed. Also extracted the repeated three-step prune pattern (remove_down_buffer + prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer lambda to eliminate duplication in the same block.

codecov-commenter · 2026-06-08T08:05:06Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 47.05882% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v7.x@ef9ab9d). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/lib/index/wb_cache.cpp	47.05%	0 Missing and 9 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@              Coverage Diff               @@
##             stable/v7.x     #892   +/-   ##
==============================================
  Coverage               ?   48.33%           
==============================================
  Files                  ?      110           
  Lines                  ?    12960           
  Branches               ?     6226           
==============================================
  Hits                   ?     6264           
  Misses                 ?     2552           
  Partials               ?     4144

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

JacksonYao287 · 2026-06-11T02:59:29Z

+    auto const recovered_nodes = inspectable_bt()->collect_node_ids();
+    for (const auto id : m_recovery_freed_node_ids) {
+        ASSERT_FALSE(recovered_nodes.contains(id))
+            << "created_freed_node " << id << " was freed and also recovered as a live node";
+    }


this is the critical case that this PR want to fix.

a bnode has already be allocated to a btree(recovered_nodes holds all the nodes that belongs to this btree ATM), but this node is finally freed (m_recovery_freed_node_ids holds all the freed nodes , which exists in the deleted_bufs) at the end of recovery.

this case will lead to corruption if this node is allocated to other btree node

JacksonYao287 · 2026-06-11T03:00:54Z

+    // 2 all the created+freed blkids should not be freed during recovery ( should not appear in
+    // m_recovery_freed_node_ids)
+    for (auto const id : created_freed_node_ids) {
+        ASSERT_FALSE(m_recovery_freed_node_ids.contains(id))
+            << "created_freed_node " << id << " was freed during recovery";
+    }


we check if a created_freed_node will be freed after recovery.

JacksonYao287 · 2026-06-11T03:02:44Z

+    // is used for reproduce this issue(the current one) , the other is used for consuming blks(receives lots of put
+    // request and lead to a lot of  blk allocation). commit_index_free_queue_until is used to consume free blks , just
+    // like what the second btree does.
+    commit_index_free_queue_until(crash_cp_free_queue_blks);


commit_index_free_queue_until is used to consume blks.

yes , we can do it by insert a lot of keys , but that will be very slow. so , here I directly consume blks from vdev.

JacksonYao287 · 2026-06-11T03:04:07Z


        if (buf->m_node_freed) {
            LOGTRACEMOD(wbcache, "recovering free buf {}", buf->to_string());
+            if (buf->m_created_cp_id == icp_ctx->id()) {


if a node is freed and the node is created in the crash cp (buf->m_created_cp_id == icp_ctx->id()), do not put it into the deleted_bufs.

this is the main fix of this PR.

JacksonYao287 changed the title ~~SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is …~~ SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP Jun 8, 2026

JacksonYao287 force-pushed the add-UT-for-index-crash-recovery branch from 893744e to 42881a2 Compare June 10, 2026 04:06

JacksonYao287 requested a review from shosseinimotlagh June 10, 2026 06:18

update

27aed5f

JacksonYao287 force-pushed the add-UT-for-index-crash-recovery branch from 42881a2 to 27aed5f Compare June 11, 2026 02:47

JacksonYao287 commented Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP#892

SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP#892
JacksonYao287 wants to merge 2 commits into
eBay:stable/v7.xfrom
JacksonYao287:add-UT-for-index-crash-recovery

JacksonYao287 commented Jun 8, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 8, 2026 •

edited

Loading

Uh oh!

JacksonYao287 Jun 11, 2026

Uh oh!

JacksonYao287 Jun 11, 2026

Uh oh!

JacksonYao287 Jun 11, 2026

Uh oh!

JacksonYao287 Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JacksonYao287 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JacksonYao287 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

JacksonYao287 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

JacksonYao287 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

JacksonYao287 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JacksonYao287 commented Jun 8, 2026 •

edited

Loading

codecov-commenter commented Jun 8, 2026 •

edited

Loading