ops(rolling-update): bump raftadmin RPC timeout + retry transfer by bootjp · Pull Request #799 · bootjp/elastickv

bootjp · 2026-05-21T17:29:14Z

Summary

Make scripts/rolling-update.sh survive the post-restart catch-up window during multi-node rolling updates:

Raise default RAFTADMIN_RPC_TIMEOUT_SECONDS from 5 → 15 (single-RPC headroom).
Add LEADERSHIP_TRANSFER_RETRY_ATTEMPTS (default 3) and LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS (default 5). The targeted leadership_transfer_to_server RPC is now retried with backoff before falling back to the generic transfer; the generic fallback is only used after all targeted retries are exhausted.
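
The retry-then-fallback flow described above can be sketched in bash roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `transfer_with_retry`, `targeted_transfer`, and `generic_transfer` are hypothetical names standing in for the real raftadmin calls, the log wording is invented for the example, and the stubs simulate a candidate that catches up on the third attempt.

```shell
LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"

# Demo stubs (hypothetical): the targeted call fails until its third invocation.
calls=0
targeted_transfer() { calls=$(( calls + 1 )); (( calls >= 3 )); }
generic_transfer() { echo "generic transfer invoked" >&2; return 1; }

# Hypothetical helper: retry the targeted transfer with a fixed backoff,
# and use the generic transfer only after every targeted attempt has failed.
transfer_with_retry() {
  local target="$1" attempt
  for (( attempt = 1; attempt <= LEADERSHIP_TRANSFER_RETRY_ATTEMPTS; attempt++ )); do
    if targeted_transfer "$target"; then
      return 0
    fi
    echo "targeted transfer failed (attempt ${attempt}/${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS})" >&2
    # Sleep only between attempts, not after the last one.
    if (( attempt < LEADERSHIP_TRANSFER_RETRY_ATTEMPTS )); then
      sleep "$LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS"
    fi
  done
  echo "falling back to generic leadership transfer" >&2
  generic_transfer
}
```

Calling `transfer_with_retry "n1@192.168.0.210:50051"` with these stubs succeeds on the third targeted attempt and never reaches the generic fallback. Whether the real script sleeps after the final failed attempt is not stated in the PR; the sketch assumes it does not.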

Why

The 2026-05-21 production re-deploy reproduction:

==> [n2@192.168.0.211] start
node is leader; transferring leadership to n1@192.168.0.210:50051
targeted leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
falling back to generic leadership transfer
generic leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted
[bailed out, cluster half-deployed]

n1 had been restarted by the rolling update ~10 s earlier and its log had not yet caught up. Raft refused both the targeted and the generic transfer for the same reason. Manual recovery required RAFTADMIN_RPC_TIMEOUT_SECONDS=30 plus a hand-issued raftadmin call.

Caller audit

leadership_transfer_to_server retry: callers (maybe_transfer_leadership) interpret any return failure as a refusal to restart. The change only delays that decision under transient failure, never widens its scope.
RAFTADMIN_RPC_TIMEOUT_SECONDS: every raftadmin RPC respects this. Raising the default does not change which RPCs succeed — only widens the kill window for a slow RPC.
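
To illustrate what "widens the kill window" means, here is a minimal sketch of a shared deadline wrapper. It assumes the deadline is enforced with coreutils `timeout`; the PR text does not say how the script actually applies `RAFTADMIN_RPC_TIMEOUT_SECONDS`, and `run_with_rpc_timeout` is a hypothetical name.

```shell
RAFTADMIN_RPC_TIMEOUT_SECONDS="${RAFTADMIN_RPC_TIMEOUT_SECONDS:-15}"

# Hypothetical wrapper: run any command under the shared RPC deadline.
# coreutils `timeout` exits with status 124 when it has to kill the command.
run_with_rpc_timeout() {
  timeout "${RAFTADMIN_RPC_TIMEOUT_SECONDS}" "$@"
}
```

A command that finishes inside the deadline keeps its own exit status, so raising the default changes only how long a slow call may run before it is killed.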

Test plan

bash -n scripts/rolling-update.sh — clean
Production re-run exercises retry path (would surface as attempt N/3 log lines if FailedPrecondition recurs)

2026-05-21 production rolling-update reproduction (`d.sh` re-run after main advanced) aborted on the n2 → n1 leadership transfer with:

targeted leadership transfer RPC failed: rpc error: code = FailedPrecondition desc = etcd raft leadership transfer aborted

n1 had just been restarted on the previous iteration (updated 17:16:25) and was still in its post-restart pre-stable state when n2 tried to hand leadership over. Raft refused the transfer because the candidate's log was not yet caught up. The script then fell back to generic transfer, which ALSO returned the same FailedPrecondition, and bailed out, leaving the cluster half-deployed (n1 new, n2-n5 old). Manual recovery required a re-run with `RAFTADMIN_RPC_TIMEOUT_SECONDS=30` plus a manual raftadmin call to nudge leadership.

Two changes:

1. Raise the default `RAFTADMIN_RPC_TIMEOUT_SECONDS` from 5 → 15. 5 seconds gave the transfer RPC no headroom over even a brief raft-internal abort. 15 s is still small enough that a truly stuck call surfaces fast, while comfortably covering the ~10 s catch-up window of a freshly-restarted candidate.
2. Retry the targeted `leadership_transfer_to_server` RPC up to `LEADERSHIP_TRANSFER_RETRY_ATTEMPTS` (default 3) times with `LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS` (default 5) of backoff between attempts. Only after all targeted retries are exhausted does the script fall back to generic transfer.

The retry is on the targeted call, not the generic one, because the generic fallback chooses whatever the engine thinks is a healthy candidate. That decision can race with the same post-restart catch-up window, and the same FailedPrecondition arrives again. Retrying the targeted call gives the originally chosen candidate the few extra seconds it needs to catch up.

env example updated to match the new defaults and document the new tunables. The new env vars are forwarded to the remote ssh sub-process so node-side script invocations see them.

Caller audit:

- `leadership_transfer_to_server` retry: behavior change is "may retry up to N times before failing". Callers (`maybe_transfer_leadership`) treat any return failure as a refusal to restart; the change only delays that decision under transient failure, never widens its scope.
- `RAFTADMIN_RPC_TIMEOUT_SECONDS`: every raftadmin RPC respects this. Raising the default does not change which RPCs succeed; it only widens the window before a slow RPC is killed. Safe.

Test: bash -n scripts/rolling-update.sh -- clean. Production re-run pending (next d.sh invocation will exercise the retry path; logs will show `attempt N/3` lines if it fires).
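
As a rough check on how long the script may now wait before reaching the generic fallback, the three tunables can be combined into a worst-case budget. This sketch assumes each targeted attempt may consume the full RPC timeout and that the backoff sleep happens only between attempts; neither detail is confirmed by the PR text, and `transfer_budget_seconds` is a hypothetical helper.

```shell
# Hypothetical helper: worst-case seconds spent on targeted attempts
# before the generic fallback is tried.
# Arguments: attempts, per-RPC timeout (s), backoff between attempts (s).
transfer_budget_seconds() {
  local attempts="$1" rpc_timeout="$2" backoff="$3"
  echo $(( attempts * rpc_timeout + (attempts - 1) * backoff ))
}

# With the new defaults: 3 * 15 + 2 * 5 = 55 seconds.
transfer_budget_seconds 3 15 5
```

Under these assumptions a single attempt with the old 5 s timeout had a 5 s budget, which is shorter than the ~10 s catch-up window reported above.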

coderabbitai · 2026-05-21T17:29:21Z

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 39 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41c4540f-fc74-49db-a6c8-a194488e9466

📥 Commits

Reviewing files that changed from the base of the PR and between b218dd3 and 12e5fdf.

📒 Files selected for processing (2)

scripts/rolling-update.env.example
scripts/rolling-update.sh

gemini-code-assist

Code Review

This pull request improves the reliability of leadership transfers during rolling updates by increasing the default RPC timeout and introducing a retry mechanism for targeted transfers. The changes include new configuration variables for retry attempts and backoff duration, along with a retry loop in the update script. Feedback was provided to add input validation for these new variables to ensure they are valid non-negative integers and that the retry count is at least one.

gemini-code-assist · 2026-05-21T17:31:43Z

+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"


The new configuration variables LEADERSHIP_TRANSFER_RETRY_ATTEMPTS and LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS are used in arithmetic contexts (( ... )) and sleep commands. They should be validated to ensure they are valid non-negative integers to prevent shell errors or potential arithmetic expansion vulnerabilities if misconfigured. Additionally, LEADERSHIP_TRANSFER_RETRY_ATTEMPTS should be at least 1 to ensure the targeted transfer is attempted at least once before falling back, as indicated in the comments.

Suggested change

-LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
-LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"
+LEADERSHIP_TRANSFER_RETRY_ATTEMPTS="${LEADERSHIP_TRANSFER_RETRY_ATTEMPTS:-3}"
+LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS="${LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS:-5}"
+for _int_var in LEADERSHIP_TRANSFER_RETRY_ATTEMPTS LEADERSHIP_TRANSFER_RETRY_BACKOFF_SECONDS; do
+  if [[ ! "${!_int_var}" =~ ^[0-9]+$ ]]; then
+    echo "rolling-update: ${_int_var} must be a non-negative integer, got '${!_int_var}'" >&2
+    exit 1
+  fi
+done
+if (( LEADERSHIP_TRANSFER_RETRY_ATTEMPTS < 1 )); then
+  echo "rolling-update: LEADERSHIP_TRANSFER_RETRY_ATTEMPTS must be at least 1" >&2
+  exit 1
+fi

bootjp wants to merge 1 commit into main from ops/rolling-update-transfer-retry
