SDSTOR-21939: index/wb_cache: Fix recovery corruption when a node is created and freed in the same crashed CP#892
Conversation
…created and freed in the same crashed CP
During index CP recovery, if a btree node was both created (e.g. via a
split) and freed (e.g. via a subsequent merge) within the same CP that
crashed before any data or bitmap was flushed, the recovery path would
incorrectly call free_blk on the node's blkid. If the repair phase had
already reused that blkid for a new live interior node, the free_blk
would silently corrupt it.
The first recovery pass processes all journal-recorded nodes and decides
which blkids to release via deleted_bufs. For freed nodes the code first
checked was_node_committed, then inside the "not committed" branch checked
m_created_cp_id:
if (buf->m_node_freed) {
if (was_node_committed(buf)) {
deleted_bufs.push_back(buf); // correct for prev-CP nodes
} else {
if (buf->m_created_cp_id != icp_ctx->id()) {
m_vdev->commit_blk(buf->m_blkid); // correct
} else {
deleted_bufs.push_back(buf); // BUG
}
prune_from_up_buffer(buf);
}
}
A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id())
and later freed in the same CP (m_node_freed == true) always landed in
deleted_bufs, triggering free_blk during recovery cleanup.
The same bug also existed in the was_node_committed == true branch: a
same-CP created+freed node whose parent had already been flushed to disk
would also be added to deleted_bufs via line 624.
A node created within the crashed CP only had its blkid reserved via
reserve_on_disk; the blkid was never committed to the allocator bitmap.
After restart the allocator reloads the bitmap from the last fully
committed CP and already treats that blkid as free. During the repair
phase (repair_links -> alloc_interior_node), the allocator may reassign
that exact blkid to a new live interior node. Recovery then calls
free_blk(blkid) from deleted_bufs, releasing a node that is still in
active use. Any subsequent allocation of that blkid overwrites the
repair node's data, silently corrupting the index.
Committed CP: Parent P(100) -> Leaf L(200) with keys [10,20,30,40]
Crashed CP (journal flushed, no node data or bitmap flush):
Step 1: insert 25 -> L splits, new node N(300) created
N: m_created_cp_id=current_cp, blkid 300 reserve_on_disk only
Step 2: delete 25,30,40 -> N empties, merge frees N
N: m_node_freed=true
Crash
Recovery:
- N: m_node_freed=true AND m_created_cp_id==icp_ctx->id()
- Old code puts N in deleted_bufs
- repair_links calls alloc_interior_node() -> allocator picks blkid 300
(free in the committed bitmap)
- Cleanup: free_blk(300) from deleted_bufs -> live repair node released
- Next allocation at blkid 300 overwrites repair node -> index corrupted
Check m_created_cp_id == icp_ctx->id() before was_node_committed so that
same-CP created+freed nodes are handled as a distinct case and always
bypass deleted_bufs:
if (buf->m_node_freed) {
if (buf->m_created_cp_id == icp_ctx->id()) {
// blkid never reached the bitmap; repair may already hold it
if (was_node_committed(buf))
pending_bufs.push_back(buf->m_up_buffer);
else
prune_from_up_buffer(buf);
continue; // never touches deleted_bufs
}
if (was_node_committed(buf)) {
// prev-CP node flushed to disk: schedule blkid release
deleted_bufs.push_back(buf);
pending_bufs.push_back(buf->m_up_buffer);
} else {
// prev-CP node freed but bitmap not flushed: recommit blkid
m_vdev->commit_blk(buf->m_blkid);
prune_from_up_buffer(buf);
}
}
For same-CP nodes whose parent was already flushed (was_node_committed
true), the parent still needs repair to remove the stale child link, so
its up_buffer is pushed to pending_bufs. When the parent was not flushed,
the in-memory child link is simply pruned. In both cases the blkid is
never freed.
Also extracted the repeated three-step prune pattern (remove_down_buffer
+ prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer
lambda to eliminate duplication in the same block.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## stable/v7.x #892 +/- ##
==============================================
Coverage ? 48.33%
==============================================
Files ? 110
Lines ? 12960
Branches ? 6226
==============================================
Hits ? 6264
Misses ? 2552
Partials ? 4144 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
893744e to
42881a2
Compare
42881a2 to
27aed5f
Compare
| auto const recovered_nodes = inspectable_bt()->collect_node_ids(); | ||
| for (const auto id : m_recovery_freed_node_ids) { | ||
| ASSERT_FALSE(recovered_nodes.contains(id)) | ||
| << "created_freed_node " << id << " was freed and also recovered as a live node"; | ||
| } |
There was a problem hiding this comment.
this is the critical case that this PR want to fix.
a bnode has already be allocated to a btree(recovered_nodes holds all the nodes that belongs to this btree ATM), but this node is finally freed (m_recovery_freed_node_ids holds all the freed nodes , which exists in the deleted_bufs) at the end of recovery.
this case will lead to corruption if this node is allocated to other btree node
| // 2 all the created+freed blkids should not be freed during recovery ( should not appear in | ||
| // m_recovery_freed_node_ids) | ||
| for (auto const id : created_freed_node_ids) { | ||
| ASSERT_FALSE(m_recovery_freed_node_ids.contains(id)) | ||
| << "created_freed_node " << id << " was freed during recovery"; | ||
| } |
There was a problem hiding this comment.
we check if a created_freed_node will be freed after recovery.
| // is used for reproduce this issue(the current one) , the other is used for consuming blks(receives lots of put | ||
| // request and lead to a lot of blk allocation). commit_index_free_queue_until is used to consume free blks , just | ||
| // like what the second btree does. | ||
| commit_index_free_queue_until(crash_cp_free_queue_blks); |
There was a problem hiding this comment.
commit_index_free_queue_until is used to consume blks.
yes , we can do it by insert a lot of keys , but that will be very slow. so , here I directly consume blks from vdev.
|
|
||
| if (buf->m_node_freed) { | ||
| LOGTRACEMOD(wbcache, "recovering free buf {}", buf->to_string()); | ||
| if (buf->m_created_cp_id == icp_ctx->id()) { |
There was a problem hiding this comment.
if a node is freed and the node is created in the crash cp (buf->m_created_cp_id == icp_ctx->id()), do not put it into the deleted_bufs.
this is the main fix of this PR.
During index CP recovery, if a btree node was both created (e.g. via a split) and freed (e.g. via a subsequent merge) within the same CP that crashed before any data or bitmap was flushed, the recovery path would incorrectly call free_blk on the node's blkid. If the repair phase had already reused that blkid for a new live interior node, the free_blk would silently corrupt it.
The first recovery pass processes all journal-recorded nodes and decides which blkids to release via deleted_bufs. For freed nodes the code first checked was_node_committed, then inside the "not committed" branch checked m_created_cp_id:
A node that was created in the crashed CP (m_created_cp_id == icp_ctx->id()) and later freed in the same CP (m_node_freed == true) always landed in deleted_bufs, triggering free_blk during recovery cleanup.
The same bug also existed in the was_node_committed == true branch: a same-CP created+freed node whose parent had already been flushed to disk would also be added to deleted_bufs via line 624.
A node created within the crashed CP only had its blkid reserved via reserve_on_disk; the blkid was never committed to the allocator bitmap. After restart the allocator reloads the bitmap from the last fully committed CP and already treats that blkid as free. During the repair phase (repair_links -> alloc_interior_node), the allocator may reassign that exact blkid to a new live interior node. Recovery then calls free_blk(blkid) from deleted_bufs, releasing a node that is still in active use. Any subsequent allocation of that blkid overwrites the repair node's data, silently corrupting the index.
Check m_created_cp_id == icp_ctx->id() before was_node_committed so that same-CP created+freed nodes are handled as a distinct case and always bypass deleted_bufs:
For same-CP nodes whose parent was already flushed (was_node_committed true), the parent still needs repair to remove the stale child link, so its up_buffer is pushed to pending_bufs. When the parent was not flushed, the in-memory child link is simply pruned. In both cases the blkid is never freed.
Also extracted the repeated three-step prune pattern (remove_down_buffer + prune_up_buffers + clear m_up_buffer) into a prune_from_up_buffer lambda to eliminate duplication in the same block.