Racing between create/seal shard and GC

The SH test found a new race:

T1: The follower receives a create_shard log entry and allocates space (vchunk_id=18, pchunk_id=274) for it.

```
[04/27/26 03:54:15.994] [storage_mgr] [trace] [32] [raft_state_machine.cpp:62:localize_journal_entry_prepare] [traceID=1611437756867722868] [rdev1:bd717553-da8b-48a0-81b7-299ded6800df] Raft Channel: Localizing Raft log_entry: server_id=896390092, term=3, journal_entry=[version=1.1, code=HS_DATA_LINKED, server_id=896390092, dsn=18590, header_size=1170, key_size=0, value_size=8]
[04/27/26 03:54:15.994] [storage_mgr] [debug] [32] [replication_state_machine.cpp:234:get_blk_alloc_hints] tid=1611437756867722868, get_blk_alloc_hint for creating shard, select vchunk_id=18 for pg=1, shardID=281474976710680
...
[04/27/26 03:54:16.189] [storage_mgr] [trace] [18] [raft_repl_dev.cpp:1644:handle_fetch_data_response] [traceID=1611437756867722868] [rdev1:bd717553-da8b-48a0-81b7-299ded6800df] Data Channel: Data fetched from remote: rreq=[dsn=18590 term=3 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=28702 count=1 chunk=274},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]], data_size: 4096, total_size: 4096, local_blkid: [{blk#=28702 count=1 chunk=274},]
[04/27/26 03:54:16.198] [storage_mgr] [debug] [18] [raft_repl_dev.cpp:1638:operator()] [traceID=1611437756867722868] [rdev1:bd717553-da8b-48a0-81b7-299ded6800df] Data Channel: Data Write completed rreq=[dsn=18590 term=3 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=28702 count=1 chunk=274},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | ]], data_write_latency_us=8613, total_write_latency_us=149512, write_num_pieces=1
```

T2: GC runs on chunk 274, moves its data to chunk 282, reclaims 22551 blocks, updates the vchunk_id=18 → pchunk mapping from 274 to 282, and marks pchunk 274 as AVAILABLE.

```
[04/27/26 03:54:16.475] [storage_mgr] [debug] [241] [gc_manager.cpp:1199:process_gc_task] [gc_task_id=70, pg_id=1, shard_id=0xffffffffffffffff] start process gc task for move_from_chunk=274 with priority=1
[04/27/26 03:54:16.475] [storage_mgr] [debug] [241] [gc_manager.cpp:1218:process_gc_task] [gc_task_id=70, pg_id=1, shard_id=0xffffffffffffffff] task for move_from_chunk=274 to move_to_chunk=282 with priority=1 start copying data
...
[04/27/26 03:54:19.890] [storage_mgr] [debug] [241] [gc_manager.cpp:1367:process_after_gc_metablk_persisted] [gc_task_id=70, pg_id=1, shard_id=0xffffffffffffffff] vchunk_id=18 has been update from move_from_chunk=274 to move_to_chunk=282, 22551 blks are reclaimed, final state is updated to AVAILABLE
[04/27/26 03:54:19.890] [storage_mgr] [info] [241] [gc_manager.cpp:1299:process_gc_task] [gc_task_id=70, pg_id=1, shard_id=0xffffffffffffffff] task for move_from_chunk=274 to move_to_chunk=282 with priority=1 is completed!
```
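The window between T1 and T2 can be modelled in a few lines. This is a toy sketch, not HomeObject code: `ToyChunkMap`, `PendingCreate`, `prepare_create`, `gc_remap`, and `is_stale` are hypothetical names, and the model assumes the follower captures the physical chunk once at allocation time while GC later rewrites the vchunk mapping.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy model only: all names here are hypothetical, not HomeObject APIs.
using vchunk_id_t = std::uint16_t;
using pchunk_id_t = std::uint16_t;

struct ToyChunkMap {
    std::unordered_map<vchunk_id_t, pchunk_id_t> v2p;

    // Look up the physical chunk the vchunk currently points at.
    pchunk_id_t resolve(vchunk_id_t v) const { return v2p.at(v); }

    // GC moves the data and repoints the vchunk at the new physical chunk.
    void gc_remap(vchunk_id_t v, pchunk_id_t move_to) { v2p[v] = move_to; }
};

// State captured when the follower pre-allocates space for an uncommitted entry.
struct PendingCreate {
    vchunk_id_t vchunk;
    pchunk_id_t pchunk_at_alloc;
};

// T1: allocation resolves the vchunk once and remembers the physical chunk.
inline PendingCreate prepare_create(const ToyChunkMap& map, vchunk_id_t v) {
    return PendingCreate{v, map.resolve(v)};
}

// True when the mapping changed between allocation and now.
inline bool is_stale(const ToyChunkMap& map, const PendingCreate& p) {
    return map.resolve(p.vchunk) != p.pchunk_at_alloc;
}
```

With the ids from the logs, a map holding 18 → 274 yields `pchunk_at_alloc == 274` from `prepare_create`; after `gc_remap(18, 282)`, `is_stale` returns true for that pending entry, which is the state the follower is in when the commit arrives at T3.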

T3: The create_shard log entry is committed. local_create_shard calls select_specific_chunk with [vchunk_id 18](https://github.com/eBay/HomeObject/blob/dc63b19cfc67faa776edb90db4ecd1c585591616/src/lib/homestore_backend/hs_shard_manager.cpp#L458), but the vchunk_id → pchunk_id mapping has changed, [so the new pchunk 282 is marked IN_USE while the old pchunk 274 remains AVAILABLE](https://github.com/eBay/HomeObject/blob/dc63b19cfc67faa776edb90db4ecd1c585591616/src/lib/homestore_backend/heap_chunk_selector.cpp#L129-L136). The shard is still created with p_chunk_id=274:

`[04/27/26 03:54:20.736] [storage_mgr] [debug] [74] [hs_shard_manager.cpp:475:local_create_shard] [trace_id=1611437756867722868,shardID=0x1000000000018,pg=1,shard=0x18] local_create_shard 281474976710680, vchunk_id=18, p_chunk_id=274, pg_id=1`

T4: Blobs are put on the shard. The put path [uses the pchunk_id stored in the shard as the allocation hint](https://github.com/eBay/HomeObject/blob/dc63b19cfc67faa776edb90db4ecd1c585591616/src/lib/homestore_backend/hs_blob_manager.cpp#L495), so all blobs are written to pchunk 274, which is now marked AVAILABLE.

`[04/27/26 03:54:32.092] [storage_mgr] [trace] [74] [hs_blob_manager.cpp:222:local_add_blob_info] [traceID=5298027153442811482,shardID=0x1000000000018,pg=1,shard=0x18,blob=10132] blob put commit, exist_already=false, status=success, pbas=[{blk#=31778 count=1025 chunk=274},]`

T5: pchunk 274, still marked AVAILABLE, is allocated to other shards, causing data corruption.
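
Steps T3 to T5 can be replayed with a small stand-alone model. Everything below is a hypothetical sketch written for this issue, not the real chunk selector: `ToySelector`, `ToyShard`, `RaceOutcome`, and `replay_t3_to_t5` are made-up names, the second shard id is invented, and the model assumes pchunk 282 is AVAILABLE before the commit and that free chunks are handed out lowest id first.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <unordered_map>

// Toy model only: every type and function here is hypothetical.
enum class ChunkState { AVAILABLE, IN_USE };

struct ToySelector {
    std::unordered_map<std::uint16_t, std::uint16_t> v2p;  // vchunk -> pchunk
    std::map<std::uint16_t, ChunkState> state;             // per-pchunk state

    // Modelled on the T3 description: resolve through the current mapping
    // and mark whatever pchunk it points at as IN_USE.
    std::uint16_t select_specific(std::uint16_t vchunk) {
        const std::uint16_t p = v2p.at(vchunk);
        state[p] = ChunkState::IN_USE;
        return p;
    }

    // Hand out the lowest-numbered AVAILABLE pchunk, if any (assumed policy).
    std::optional<std::uint16_t> select_any_available() {
        for (auto& [p, s] : state) {
            if (s == ChunkState::AVAILABLE) {
                s = ChunkState::IN_USE;
                return p;
            }
        }
        return std::nullopt;
    }
};

struct ToyShard {
    std::uint64_t id;
    std::uint16_t pchunk;  // used as the allocation hint for blob writes
};

struct RaceOutcome {
    std::uint16_t marked_in_use;  // pchunk that select_specific marked at T3
    ToyShard first;               // shard created across the GC run
    ToyShard second;              // later shard that receives a free chunk
};

// Replays T3..T5 with the chunk ids from the logs.
inline RaceOutcome replay_t3_to_t5() {
    ToySelector sel;
    sel.v2p[18] = 282;                       // mapping after GC (T2)
    sel.state[274] = ChunkState::AVAILABLE;  // old chunk released by GC
    sel.state[282] = ChunkState::AVAILABLE;  // assumed state before commit

    // T3: the commit marks the remapped chunk, but the shard keeps the
    // pchunk captured at allocation time (274).
    const std::uint16_t marked = sel.select_specific(18);
    ToyShard first{0x1000000000018ULL, 274};

    // T4: blob writes for `first` use first.pchunk, which is still AVAILABLE.
    // T5: another shard (invented id) asks for a free chunk and gets 274 too.
    ToyShard second{0x1000000000019ULL, sel.select_any_available().value()};
    return RaceOutcome{marked, first, second};
}
```

In this model `replay_t3_to_t5()` reports that 282 was marked IN_USE while both shards end up pointing at pchunk 274, which is the double allocation described in T5.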

More details in https://github.com/eBay/HomeStore/issues/870

Racing between create/seal shard and GC #433
