Skip to content

docs(design): proposed txn secondary-commit idempotency (Jepsen 26198185540)#796

Open
bootjp wants to merge 2 commits into
mainfrom
docs/txn-idempotency-design
Open

docs(design): proposed txn secondary-commit idempotency (Jepsen 26198185540)#796
bootjp wants to merge 2 commits into
mainfrom
docs/txn-idempotency-design

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 21, 2026

Summary

Design doc for fixing the :duplicate-elements anomaly surfaced by scheduled Jepsen run 26198185540.

What's in the doc

  • Root cause walkthrough: a single :append 14 230 from worker 3 produced TWO storage entries with value 230, immediately visible in the next read. Same-commit_ts replay is dedup'd by content-addressed list-delta keys; different-commit_ts replay (regenerated by runTransaction's retryRedisWrite on partial-commit) is NOT, because RPUSH/LPUSH derive their seq from a fresh meta read at attempt time.
  • TiKV reference: Percolator's three layers — content-addressed writes, CheckTxnStatus before retry, LockResolver — and which one elastickv is missing.
  • Proposed fix: server-side per-txn applied marker keyed by adapter-allocated UUIDv7, written atomically with the commit mutations and checked at FSM apply entry. Optional client-side commit-status check (TiKV-style CheckTxnStatus) as a follow-up.
  • Six implementation milestones with scope estimates.
  • Caller audit table covering every commit-request entry point.
  • Four risks with mitigations.

Why now / why this layer

  • PR fix(redis): NOTLEADER prefix for gRPC-wrapped leader errors (closes Jepsen run 26035515694 crash) #789 (NOTLEADER prefix) unmasked this anomaly — workers no longer crash, so the :duplicate-elements is now visible to the checker rather than being silently classified as :info.
  • A pure client-side fix is not viable: the trigger spans runTransaction's retry loop AND the gRPC Forward redirect, and no client-supplied txn ID exists. Server-side marker is the natural enforcement point.
  • Existing applyCommitWithIdempotencyFallback only catches same-commit_ts replay; the regenerated-input replay shape needs a logical-txn-identity marker.

Companion PR

PR #795 ("upload full demo cluster log on Jepsen failure") is open in parallel. The design here is independent of which exact code path produces the partial-commit (that's what PR #795 will let us pinpoint); the fix works for any such path.

Status

Open for review. Implementation milestones (M1–M6) follow once this lands.

…6198185540)

Triggered by the scheduled Jepsen run 26198185540 surfacing a
real :duplicate-elements anomaly on the Redis list-append
workload. The duplicate is a single client-level :append being
applied twice in storage with DIFFERENT commit_ts: the EXEC
body's retry loop (runTransaction's retryRedisWrite) regenerates
startTS / commitTS / list index on each iteration, so content-
addressing — which catches identical-input replay — does not
catch the regenerated-input retry-after-partial-success shape.

Design proposes a server-side per-txn applied marker keyed by an
adapter-allocated UUIDv7, written atomically with the commit
mutations and checked at FSM apply entry. TiKV's
CheckTxnStatus + content-addressed Write CF design is the
reference; this proposal lifts the same correctness guarantee
into the FSM layer because elastickv's RPUSH/LPUSH-derived
storage keys are NOT pure content-addressed (the seq number
depends on a meta read at attempt time).

Doc covers:
  - Why the same-commit_ts replay does NOT duplicate
    (content-addressed delta keys)
  - Why the different-commit_ts replay DOES duplicate
    (runTransaction's EXEC-body retry regenerates everything)
  - Why applyCommitWithIdempotencyFallback does NOT catch this
    (works only on identical-input replay)
  - How TiKV's three layers (content-addressed writes +
    CheckTxnStatus + LockResolver) compose, and which one
    elastickv is missing
  - Server-side marker design (key prefix, GC TTL, batch atomicity)
  - Optional client-side commit-status check
  - Six implementation milestones (M1 proto/UUID, M2 FSM marker,
    M3 adapter sentinel handling, M4 GC, M5 optional check,
    M6 Jepsen validation)
  - Caller audit table covering all commit-request entry points
  - Four risks (GC timing, proto overhead, reconstruction
    intricacy, LockResolver interaction)

The doc is independent of PR #795 (full Jepsen log artifact);
the fix design works for any partial-commit code path that
might be the actual trigger.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 38 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 20648c60-3471-4b8c-bad3-8fc3852e0fa6

📥 Commits

Reviewing files that changed from the base of the PR and between fe96c88 and 22753ab.

📒 Files selected for processing (1)
  • docs/design/2026_05_21_proposed_txn_secondary_idempotency.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/txn-idempotency-design

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This design document proposes a server-side transaction-applied marker and an optional client-side check to ensure exactly-once application of multi-shard transaction commits, addressing a duplicate-element anomaly found during Jepsen testing. The proposed solution involves using a UUIDv7 as a transaction identifier to detect and handle retries that might otherwise result in duplicate writes. Review feedback highlights the need for the new error sentinel to include the committed timestamp to aid result reconstruction, suggests optimizing the marker garbage collection by leveraging the time-ordered nature of UUIDv7 with DeleteRange, and warns about the operational risks of adding synchronous reads to the secondary commit retry path.

// Primary already committed. If our secondary's marker is
// present, this is a true no-op — return success. Otherwise
// retry the secondary commit (it really did fail).
if secondary_marker_present(adapter_txn_id) { return Ok }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The secondary_marker_present check in milestone M5 introduces a significant operational risk: a synchronous read from the secondary shard before every retry. If the secondary shard is experiencing issues (e.g., leader election or network partition), this check will likely fail or time out, delaying the retry loop and impacting availability. As per repository guidelines for operational risks, the design should detail mitigation strategies; a 'blind' retry might be more robust as it avoids the added latency and dependency in the failure path.

References
  1. When a design document identifies a significant operational risk, it must also detail potential mitigation strategies to ensure system stability.

2. If present and the recorded commit_ts equals the request's commit_ts:
no-op (return nil). The mutations would be a duplicate.
3. If present and the recorded commit_ts **differs**: return
`ErrTxnAlreadyCommittedAtDifferentTS` (a new sentinel). This catches the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To facilitate the result reconstruction described in milestone M3, the ErrTxnAlreadyCommittedAtDifferentTS sentinel should carry the recorded commit_ts in its response payload. Since the adapter regenerates startTS on every retry (as noted in the walkthrough), it will not know the original startTS used by the successful attempt and thus cannot look up the primary's commit record via txnCommitKey(primary_key, original_start_ts). Returning the commit_ts directly from the FSM marker eliminates this lookup dependency and supports smoother rolling upgrades by reducing the need for consistent state lookups across different node versions.

References
  1. Design documents should address operational risks such as rolling upgrades by detailing strategies that ensure compatibility across different node versions.


A conservative TTL of `2 × (lockResolverInterval + defaultTxnLockTTLms)`
= 80 s covers both. Cleanup runs in `kv/compactor.go`'s existing periodic
sweep — add a new prefix scan that drops markers whose `applied_at` is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since adapter_txn_id is a UUIDv7, it is naturally time-ordered (the most significant 48 bits contain a millisecond timestamp). Instead of a prefix scan that reads every value to check applied_at_wall_time, the compactor can use a more efficient DeleteRange operation. By calculating the UUIDv7 prefix corresponding to the TTL cutoff, you can delete all expired markers in a single atomic operation without the O(N) read overhead during the periodic sweep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant