docs(design): proposed txn secondary-commit idempotency (Jepsen 26198185540)#796
docs(design): proposed txn secondary-commit idempotency (Jepsen 26198185540)#796bootjp wants to merge 2 commits into
Conversation
…6198185540)
Triggered by the scheduled Jepsen run 26198185540 surfacing a
real :duplicate-elements anomaly on the Redis list-append
workload. The duplicate is a single client-level :append being
applied twice in storage with DIFFERENT commit_ts: the EXEC
body's retry loop (runTransaction's retryRedisWrite) regenerates
startTS / commitTS / list index on each iteration, so content-
addressing — which catches identical-input replay — does not
catch the regenerated-input retry-after-partial-success shape.
Design proposes a server-side per-txn applied marker keyed by an
adapter-allocated UUIDv7, written atomically with the commit
mutations and checked at FSM apply entry. TiKV's
CheckTxnStatus + content-addressed Write CF design is the
reference; this proposal lifts the same correctness guarantee
into the FSM layer because elastickv's RPUSH/LPUSH-derived
storage keys are NOT pure content-addressed (the seq number
depends on a meta read at attempt time).
Doc covers:
- Why the same-commit_ts replay does NOT duplicate
(content-addressed delta keys)
- Why the different-commit_ts replay DOES duplicate
(runTransaction's EXEC-body retry regenerates everything)
- Why applyCommitWithIdempotencyFallback does NOT catch this
(works only on identical-input replay)
- How TiKV's three layers (content-addressed writes +
CheckTxnStatus + LockResolver) compose, and which one
elastickv is missing
- Server-side marker design (key prefix, GC TTL, batch atomicity)
- Optional client-side commit-status check
- Six implementation milestones (M1 proto/UUID, M2 FSM marker,
M3 adapter sentinel handling, M4 GC, M5 optional check,
M6 Jepsen validation)
- Caller audit table covering all commit-request entry points
- Four risks (GC timing, proto overhead, reconstruction
intricacy, LockResolver interaction)
The doc is independent of PR #795 (full Jepsen log artifact);
the fix design works for any partial-commit code path that
might be the actual trigger.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This design document proposes a server-side transaction-applied marker and an optional client-side check to ensure exactly-once application of multi-shard transaction commits, addressing a duplicate-element anomaly found during Jepsen testing. The proposed solution involves using a UUIDv7 as a transaction identifier to detect and handle retries that might otherwise result in duplicate writes. Review feedback highlights the need for the new error sentinel to include the committed timestamp to aid result reconstruction, suggests optimizing the marker garbage collection by leveraging the time-ordered nature of UUIDv7 with DeleteRange, and warns about the operational risks of adding synchronous reads to the secondary commit retry path.
| // Primary already committed. If our secondary's marker is | ||
| // present, this is a true no-op — return success. Otherwise | ||
| // retry the secondary commit (it really did fail). | ||
| if secondary_marker_present(adapter_txn_id) { return Ok } |
There was a problem hiding this comment.
The secondary_marker_present check in milestone M5 introduces a significant operational risk: a synchronous read from the secondary shard before every retry. If the secondary shard is experiencing issues (e.g., leader election or network partition), this check will likely fail or time out, delaying the retry loop and impacting availability. As per repository guidelines for operational risks, the design should detail mitigation strategies; a 'blind' retry might be more robust as it avoids the added latency and dependency in the failure path.
References
- When a design document identifies a significant operational risk, it must also detail potential mitigation strategies to ensure system stability.
| 2. If present and the recorded commit_ts equals the request's commit_ts: | ||
| no-op (return nil). The mutations would be a duplicate. | ||
| 3. If present and the recorded commit_ts **differs**: return | ||
| `ErrTxnAlreadyCommittedAtDifferentTS` (a new sentinel). This catches the |
There was a problem hiding this comment.
To facilitate the result reconstruction described in milestone M3, the ErrTxnAlreadyCommittedAtDifferentTS sentinel should carry the recorded commit_ts in its response payload. Since the adapter regenerates startTS on every retry (as noted in the walkthrough), it will not know the original startTS used by the successful attempt and thus cannot look up the primary's commit record via txnCommitKey(primary_key, original_start_ts). Returning the commit_ts directly from the FSM marker eliminates this lookup dependency and supports smoother rolling upgrades by reducing the need for consistent state lookups across different node versions.
References
- Design documents should address operational risks such as rolling upgrades by detailing strategies that ensure compatibility across different node versions.
|
|
||
| A conservative TTL of `2 × (lockResolverInterval + defaultTxnLockTTLms)` | ||
| = 80 s covers both. Cleanup runs in `kv/compactor.go`'s existing periodic | ||
| sweep — add a new prefix scan that drops markers whose `applied_at` is |
There was a problem hiding this comment.
Since adapter_txn_id is a UUIDv7, it is naturally time-ordered (the most significant 48 bits contain a millisecond timestamp). Instead of a prefix scan that reads every value to check applied_at_wall_time, the compactor can use a more efficient DeleteRange operation. By calculating the UUIDv7 prefix corresponding to the TTL cutoff, you can delete all expired markers in a single atomic operation without the O(N) read overhead during the periodic sweep.
Summary
Design doc for fixing the
:duplicate-elementsanomaly surfaced by scheduled Jepsen run 26198185540.What's in the doc
:append 14 230from worker 3 produced TWO storage entries with value230, immediately visible in the next read. Same-commit_tsreplay is dedup'd by content-addressed list-delta keys; different-commit_tsreplay (regenerated byrunTransaction'sretryRedisWriteon partial-commit) is NOT, because RPUSH/LPUSH derive theirseqfrom a fresh meta read at attempt time.CheckTxnStatusbefore retry, LockResolver — and which one elastickv is missing.CheckTxnStatus) as a follow-up.Why now / why this layer
:duplicate-elementsis now visible to the checker rather than being silently classified as:info.runTransaction's retry loop AND the gRPCForwardredirect, and no client-supplied txn ID exists. Server-side marker is the natural enforcement point.applyCommitWithIdempotencyFallbackonly catches same-commit_tsreplay; the regenerated-input replay shape needs a logical-txn-identity marker.Companion PR
PR #795 ("upload full demo cluster log on Jepsen failure") is open in parallel. The design here is independent of which exact code path produces the partial-commit (that's what PR #795 will let us pinpoint); the fix works for any such path.
Status
Open for review. Implementation milestones (M1–M6) follow once this lands.