Fix #581: Add CRC32 checksum verification for pushdata/fetchdata data channels by OmDoshi13 · Pull Request #887 · eBay/HomeStore

OmDoshi13 · 2026-05-27T11:23:42Z

Summary

Fixes #581 — silent data corruption caused by network-layer corruption between leader and followers going undetected until user read/write time.

Root cause: Data left the leader intact but could be corrupted in the network layer. Followers accepted and stored whatever arrived; users only discovered corruption at read/write time.
Fix: Attach a CRC32 checksum to every data packet sent over the PushData and FetchData channels. The receiver recomputes the checksum and drops the packet on mismatch, letting Raft retry rather than writing corrupted bytes to disk.

Key implementation details

CRC32 computed over the full data payload at send time; verified before any write on the follower
FetchData response uses a self-describing magic-byte framing header (FETCH_DATA_RESPONSE_MAGIC) so receivers detect the header without consulting per-node config — safe under hotswap config asymmetry during rolling upgrades
data_checksum_enabled defaults to false; operators enable via hotswap after all nodes are running the new binary to avoid old nodes misinterpreting the framing header as block data
Added data_checksum_mismatch_cnt metric for observability
FlatBuffer verifier added on the receive path to guard against malformed headers
Per-entry lsn/dsn/raft_term attached to ResponseEntry for richer diagnostics

Correctness fixes (commit 3)

Three issues identified during review and fixed before merge:

F1 — Immediate re-fetch on FetchData checksum mismatch: Previously a mismatch silently dropped the entry and waited for the full data_receive_timeout_ms (10s) before Raft retried. Now mismatched rreqs are collected during the response loop and passed to check_and_fetch_remote_data() immediately after, eliminating the stall.
F2 — Unsigned underflow guard in handle_fetch_data_response: A truncated or malformed response could cause total_size -= data_size to wrap to ~4GB and advance raw_data past the buffer boundary in release builds. Added an explicit data_size > total_size guard with early return before each subtraction.
F3 — FlatBuffer verifier on PushData receive path: push_req fields were accessed before the buffer was verified in on_push_data_received. A corrupted or crafted push RPC could cause out-of-bounds reads via the unverified FlatBuffer accessor. Added flatbuffers::Verifier call before any field access, matching the pattern already present on the FetchData response path.

Test plan

Checksum_Enabled_PushData_Path — happy path with checksums on; writes commit correctly across all replicas
Checksum_Enabled_FetchData_Path — drops all push-data on followers forcing fetch path; verifies framing header is parsed correctly and data arrives intact
Existing Follower_Fetch_OnActive_ReplicaGroup still passes (no regression)

All 3 tests passed across all 4 CI build variants (Debug/Sanitize, Debug/Coverage, Release/tcmalloc prerelease, Release/tcmalloc) — 100% tests passed, 0 failed out of 21 total.

…ta channels (eBay#581) During SM long-running tests, followers observed hash mismatches and invalid headers during read-verify. Root cause: PushData and FetchData channels transmitted raw payloads with no integrity check, allowing in-flight corruption to be silently written to disk. Changes: - Add checksum: uint32 to PushDataRequest and ResponseEntry FlatBuffer schemas - Add data_checksum_enabled: bool = true (hotswap) to Consensus config so the overhead can be toggled off at runtime for performance benchmarking - Compute CRC32 over the data payload before sending on both push and fetch paths - Verify CRC32 on receipt; drop and let retry on mismatch rather than writing corrupt data to disk - FetchData response now optionally prepends a FetchDataResponse FlatBuffer header carrying per-entry checksums, gated on data_checksum_enabled to ensure backward compatibility during rolling upgrades

- Switch framing detection to magic-byte prefix (FETCH_DATA_RESPONSE_MAGIC) so receivers self-describe the header without consulting per-node config, making detection safe under hotswap config asymmetry between nodes - Default data_checksum_enabled to false; operators enable via hotswap after all nodes are running the new binary to avoid old nodes misinterpreting the framing header as block data - Add FlatBuffer verifier on the receive path to guard against malformed headers - Fix header memory management: heap-allocate with new/delete[] instead of iobuf - Attach per-entry lsn/dsn/raft_term to ResponseEntry for richer diagnostics - Increment data_checksum_mismatch_cnt metric on every mismatch - Add Checksum_Enabled_PushData_Path and Checksum_Enabled_FetchData_Path tests

F1: On FetchData checksum mismatch, immediately re-fetch via check_and_fetch_remote_data() instead of waiting for the full data_receive_timeout_ms (10s) stall. Mismatched rreqs are collected in the response loop and re-queued after processing the valid entries. F2: Guard against total_size unsigned underflow before each total_size -= data_size subtraction in handle_fetch_data_response. A truncated or malformed response could wrap total_size to ~4GB and advance raw_data past the blob boundary in release builds. F3: Add flatbuffers::Verifier call in on_push_data_received before any PushDataRequest field is accessed. Without this, a corrupted or crafted push RPC could cause out-of-bounds reads via the unverified FlatBuffer accessor, matching the verification already present on the FetchData response path.

xiaoxichen

@OmDoshi13 thanks for taking this and helping with the tech debt, It looks nice in general.

I think your changes can be categorised into 3 parts
A: Corner cases fixes, mainly in your commit 3.
B: Moving FetchDataResponse from raw to Flatbuffer
C: CRC implementation.

I dont have any problem with A and hoping A can be done in a separate PR to allow it be merged sooner.

For B, I saw you added a FETCH_DATA_RESPONSE_MAGIC as well as corresponding logic to support rolling upgrade and hybrid version, that is nice. And yes, It is totally a mistake in the past that skipping flatbuffer instead using raw format...

however, using the magic header is another dead-end, as we have to carry this "compatibility" tax forever, it doesnt help us in next release that removing the magic header and merge back to the native flat-buffer approach.

With this thinking, do you think we can directly using FlatBuffer w/o MAGIC, as flatbuffer is size-prefixed, we can read the first size_bytes, and check if that number makes sense , then proceed to flatbuffer::verify, if that is also passing it is very likely it is new format, then we check CRCs of raw data against what was in flatbuffer.

Basically I am trying to suggest try-and-fallback pattern.Note the raw data should be a homeobject blobs/shards where has a DataHeader with magic uint64_t magic{0x21fdffdba8d68fc6};, , that makes the flatbuffer prefixed size to be 2.6GB, which will very probably fails the basic size check.

    struct DataHeader {
        static constexpr uint8_t data_header_version = 0x02;
        static constexpr uint64_t data_header_magic = 0x21fdffdba8d68fc6; // echo "BlobHeader" | md5sum

        enum class data_type_t : uint32_t { SHARD_INFO = 1, BLOB_INFO = 2 };

        bool valid() const { return ((magic == data_header_magic) && (version <= data_header_version)); }

        uint64_t magic{data_header_magic};
        uint8_t version{data_header_version};
        data_type_t type{data_type_t::BLOB_INFO};
    };

xiaoxichen · 2026-05-28T04:55:08Z

For C:
Lets try to separate the enablement of CRC with the wire-format changes. If in B we moved to Flatbuffer, then checksum would have a default value of zero if not enabled, we probably want to skip the CRC checking if CRC field is zero.

A bit more context on the CRC: once we added length check, we never saw data corruption on wire anymore. Basically it is hard to see bit rot in TCP protocol as TCP protocol already provides its CRC in TCP-Header. Usually the error was caused by early terminating --- sender disconnected before sending full data, receiver received partial data and blindly using it without checking the prefix-size.

With this assumption and observation in production , I think we wont enable the CRC for generic use cases, only use in more durability sensitive environment if needed.

… detection Per xiaoxichen's review feedback on PR eBay#887: B — FetchData wire format: drop the FETCH_DATA_RESPONSE_MAGIC header approach. Senders now always emit a size-prefixed FetchDataResponse FlatBuffer (rolling upgrade safe: old raw-format senders are detected by receivers automatically). Receivers use try-and-fallback: read the 4-byte uoffset_t size prefix; if fb_hdr_size <= total_size and flatbuffers::Verifier passes it is new format, otherwise fall back to old raw format. Old senders' DataHeader magic (0x21fdffdba8d68fc6) reads as a ~3.69 GB uoffset_t and fails the size check, so the two formats are distinguishable without any explicit magic bytes. C — CRC decoupled from wire format: data_checksum_enabled now controls only the send side (whether to compute and fill in the checksum field). Receivers skip CRC verification whenever checksum == 0, regardless of their own config. Updated homestore_config.fbs comment to reflect this. CRC is not enabled by default: TCP already guards against partial-transfer corruption; opt in only for durability-sensitive environments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… detection Per xiaoxichen's review feedback on PR eBay#887: B — FetchData wire format: drop the FETCH_DATA_RESPONSE_MAGIC header approach. Senders now always emit a size-prefixed FetchDataResponse FlatBuffer (rolling upgrade safe: old raw-format senders are detected by receivers automatically). Receivers use try-and-fallback: read the 4-byte uoffset_t size prefix; if fb_hdr_size <= total_size and flatbuffers::Verifier passes it is new format, otherwise fall back to old raw format. Old senders' DataHeader magic (0x21fdffdba8d68fc6) reads as a ~3.69 GB uoffset_t and fails the size check, so the two formats are distinguishable without any explicit magic bytes. C — CRC decoupled from wire format: data_checksum_enabled now controls only the send side (whether to compute and fill in the checksum field). Receivers skip CRC verification whenever checksum == 0, regardless of their own config. Updated homestore_config.fbs comment to reflect this. CRC is not enabled by default: TCP already guards against partial-transfer corruption; opt in only for durability-sensitive environments.

…lback Replace FETCH_DATA_RESPONSE_MAGIC framing with a size-prefix try-and-fallback to detect new vs old FetchDataResponse format on the receiver side. Sender (on_fetch_data_received): emit a size-prefixed FetchDataResponse FlatBuffer only when data_checksum_enabled=true. When disabled the old raw wire format is preserved unchanged, so old receivers are not affected during rolling upgrades. Receiver (handle_fetch_data_response): attempt to parse a FlatBuffer by reading the 4-byte uoffset_t prefix, checking fb_hdr_size <= total_size, and running flatbuffers::Verifier. Falls back to raw format on any failure. Raw block data from HomeObject begins with DataHeader magic (0x21fdffdba8d68fc6) whose first 4 bytes as a uoffset_t read as ~2.8 GB, reliably failing the size check. Additional fixes: - Move GetSizePrefixedPushDataRequest to after the flatbuffers::Verifier guard in on_push_data_received so no field access precedes buffer verification. - data_checksum_enabled now gates send-side only; receivers skip CRC when checksum field is zero (0 = not computed sentinel). - Widen uoffset_t arithmetic to uint64_t in on_push_data_received and handle_fetch_data_response to prevent integer wrap when ReadScalar returns a value near 0xFFFFFFFF. - Update homestore_config.fbs comment to document send-side-only semantics.

OmDoshi13 · 2026-06-02T21:36:11Z

Hey @xiaoxichen, the branch has been updated — 147fcc8a is the latest commit and addresses your review feedback on parts B and C. Please ignore the earlier push events in the timeline.

xiaoxichen

lgtm.
cc @Besroy note this "might" be a breaking change if the detect-and-fallback on receiver side doesn't works as expected.

codecov-commenter · 2026-06-05T11:31:01Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 23.63636% with 84 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v7.x@9c5bf78). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp	24.07%	43 Missing and 39 partials ⚠️
src/lib/replication/repl_dev/raft_repl_dev.h	0.00%	0 Missing and 2 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@              Coverage Diff               @@
##             stable/v7.x     #887   +/-   ##
==============================================
  Coverage               ?   47.90%           
==============================================
  Files                  ?      110           
  Lines                  ?    13056           
  Branches               ?     6289           
==============================================
  Hits                   ?     6254           
  Misses                 ?     2634           
  Partials               ?     4168

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

JacksonYao287 · 2026-06-10T07:00:46Z


+    // Try-and-fallback: attempt to parse a size-prefixed FetchDataResponse FlatBuffer at the start
+    // of the blob.  New senders (data_checksum_enabled=true) prepend this header; old senders emit
+    // raw block data with no prefix.  In HomeObject (the layer above HomeStore), raw block data


we should not make any assumption of the layer above homestore. raft_repl_dev is not only used by homeobject , it`s probably used by other storage applicaion in the future.

if another application built on homestore has a very small first 4 bytes(let`s say 0x10, for example), the following logic will confuse the old version with the new one

who is "another" application as of now? If there is no one, then it is safe. Tomorrow after the PR merged if there is new application on top of homestore they already on new code.

Thanks @JacksonYao287 — and that's a valid concern in principle, but the detection doesn't rely on the DataHeader magic at all. The code uses three independent gates before treating a response as the new format:

BufferHasIdentifier(raw_data + sizeof(uoffset_t), "FDRS") — bytes 4–7 must literally be the ASCII string FDRS. This is emitted only by FinishSizePrefixedBuffer(..., "FDRS") in our on_fetch_data_received, so it cannot accidentally appear in a future application's raw block data.

Parsed size must be ≤ data_fetch_max_size_kb (2 MB).

flatbuffers::Verifier — validates internal FlatBuffer structure.

Any future application's raw block data would need to accidentally have 0x46 0x44 0x52 0x53 at byte offset 4, then pass the size check, then satisfy the Verifier — all three independently. The DataHeader magic example @xiaoxichen gave was illustrating why the size check works for the current application, but the "FDRS" identifier gate is what makes this safe for any future application.

Also fully agree with @xiaoxichen: any new application built on top of HomeStore after this PR merges will already have the new receiver code, so there's no backward compatibility concern there either.

the detection doesn't rely on the DataHeader magic at all
yes, this is what I want. it makes sense to me to involve independent gate handling this case.

my thought is that , as you said, we should have independent gate and not rely on the header magic(defined by the application).

@OmDoshi13 thanks for the change

JacksonYao287 · 2026-06-10T07:09:28Z

        REGISTER_COUNTER(read_err_cnt, "total read error count", "read_err_cnt", {"op", "read"});
        REGISTER_COUNTER(write_err_cnt, "total write error count", "write_err_cnt", {"op", "write"});
        REGISTER_COUNTER(fetch_err_cnt, "total fetch data error count", "fetch_err_cnt", {"op", "fetch"});
+        REGISTER_COUNTER(data_checksum_mismatch_cnt, "CRC32 mismatches on push/fetch data channels",


can we distinguish push_data_checksum_mismatch_cnt and fetch_data_checksum_mismatch_cnt

Done — both push_data_checksum_mismatch_cnt and fetch_data_checksum_mismatch_cnt are now separate counters.

To answer @xiaoxichen's "reason?": push and fetch are distinct data paths with different failure modes. A mismatch on push (leader → follower direct RPC) narrows the problem to in-flight corruption or a sender-side CRC bug. A mismatch on fetch (follower pulling from leader) points to the same class of issue on a different code path. Keeping them separate lets operators immediately identify which channel is affected without having to correlate log timestamps or RPC traces.

- Split data_checksum_mismatch_cnt into push_data_checksum_mismatch_cnt and fetch_data_checksum_mismatch_cnt so operators can distinguish the two channels in metrics (addresses JacksonYao287 h:74 comment) - Add "FDRS" file identifier to FinishSizePrefixed on the fetch sender and switch receiver detection to BufferHasIdentifier — replacing the pure size-prefix heuristic that JacksonYao287 flagged at cpp:1695 - Extract kFetchDataRespIdentifier constant to avoid duplicated magic string and document why the identifier lives in code not the schema - Fix wrong log message in on_fetch_data_received ("PushData" → "FetchData") - Add minimum-size guard before ReadScalar in both on_push_data_received and on_fetch_data_received to prevent OOB reads on pathological RPCs - Add fb_size > incoming_buf.size() guard before Verifier construction in both receive paths so Verifier is never given a claimed size that exceeds the actual allocation - Add FlatBuffer verifier to on_fetch_data_received (leader side) to match the protection already present on the push receive path (F3) - Fix overclaiming comment at handle_fetch_data_response: the FDRS check is not strictly app-agnostic; note the collision risk is astronomically small and ruled out by the Verifier structural check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n edge cases - Null-guard optional FlatBuffer fields (user_header/user_key) in on_push_data_received before dereferencing; a well-formed sender always sets them but the Verifier only checks structural integrity - Use std::unique_ptr<uint8_t[]> for hdr_buf in on_fetch_data_received to prevent a leak on any exception between allocation and the lambda capture - Log at DEBUG when FlatBuffer Verifier fails despite matching "FDRS" identifier to aid diagnosis of format edge cases - Add corrupt_push_data_checksum and corrupt_fetch_data_checksum flip hooks (_PRERELEASE only) to enable fault-injection in the new mismatch tests - Add Checksum_Mismatch_PushData_Path test: enables checksum, injects corrupt_push_data_checksum on followers, verifies recovery via fetch - Add Checksum_Mismatch_FetchData_Path test: drops push-data, fires corrupt_fetch_data_checksum once, verifies immediate re-fetch succeeds - Fix #ifdef _PRERELEASE guard structure: add missing #endif after Follower_Fetch_OnActive_ReplicaGroup so Checksum_Enabled_PushData_Path compiles in all build variants; restore guard before Write_With_Diabled_Leader_Push_Data which was accidentally uncovered - Expand data_checksum_enabled comment to document the UNSAFE rolling upgrade direction (new sender -> old receiver = silent corruption) and add a DEPLOYMENT RULE for hotswap enablement

OmDoshi13 · 2026-06-10T23:05:17Z

@xiaoxichen — A, B, and C are all addressed in the current branch:

A (corner cases): The three fixes (F1 immediate re-fetch on mismatch, F2 underflow guard, F3 PushData Verifier null-guard) are all included.

B (FlatBuffer with try-and-fallback, no magic): The FETCH_DATA_RESPONSE_MAGIC approach has been replaced with the pattern you suggested. The receiver reads bytes 4–7 for the "FDRS" file identifier (set via FinishSizePrefixedBuffer), checks the parsed size against data_fetch_max_size_kb, then runs flatbuffers::Verifier. All three must pass before treating the response as the new format; otherwise it falls through to legacy raw-block processing. Old senders' DataHeader magic (0x21fdffdba8d68fc6) produces a uoffset_t of ~2.64 GB which fails the size check independently.

C (CRC with checksum=0 = disabled): The receiver skips CRC verification whenever the checksum field is zero (the FlatBuffer default). The config flag only controls the send side. The comment on data_checksum_enabled has been expanded to explicitly call out the UNSAFE direction (new sender with flag enabled → old receiver = silent corruption) and adds a DEPLOYMENT RULE: do not enable via hotswap until every node in the cluster has the try-and-fallback receiver.

odoshi13 added 2 commits May 25, 2026 13:00

OmDoshi13 changed the base branch from master to stable/v7.x May 27, 2026 12:21

OmDoshi13 marked this pull request as draft May 27, 2026 12:26

OmDoshi13 marked this pull request as ready for review May 27, 2026 12:28

xiaoxichen reviewed May 28, 2026

View reviewed changes

OmDoshi13 force-pushed the fix/checksum-pushdata-fetchdata-581 branch 2 times, most recently from 330c534 to 9a2ec70 Compare June 2, 2026 21:29

xiaoxichen previously approved these changes Jun 5, 2026

View reviewed changes

xiaoxichen requested a review from JacksonYao287 June 5, 2026 10:45

Bump version to 7.5.10

ef7a447

OmDoshi13 dismissed xiaoxichen’s stale review via ef7a447 June 6, 2026 17:52

JacksonYao287 reviewed Jun 10, 2026

View reviewed changes

odoshi13 and others added 2 commits June 10, 2026 23:07

Conversation

OmDoshi13 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key implementation details

Correctness fixes (commit 3)

Test plan

Uh oh!

xiaoxichen left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xiaoxichen commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

OmDoshi13 commented Jun 2, 2026

Uh oh!

xiaoxichen left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

JacksonYao287 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxichen Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

OmDoshi13 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

JacksonYao287 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

JacksonYao287 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxichen Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

OmDoshi13 Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

OmDoshi13 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

OmDoshi13 commented May 27, 2026 •

edited

Loading

xiaoxichen left a comment •

edited

Loading

xiaoxichen commented May 28, 2026 •

edited

Loading

codecov-commenter commented Jun 5, 2026 •

edited

Loading