Pause apply workers during slot creation to fix data loss on new node. #416
ibrarahmad wants to merge 1 commit into v5_STABLE
An apply worker committing between ensure_replication_slot_snapshot and adjust_progress_info could advance spock.progress beyond the COPY snapshot boundary, causing the new node to skip those changes permanently.

Add an atomic pause flag and ConditionVariable in SpockContext. Apply workers check the flag in begin_replication_step and after handle_commit, sleeping until copy_replication_sets_data calls resume_apply_workers() after adjust_progress_info completes.

Add spock.pause_timeout GUC (default 10s) as a safety net against a crashed sync worker leaving workers paused indefinitely.

Backport of PR #392 from main, adapted for v5_STABLE architecture.
Up to standards ✅
🟢 Issues

| Metric | Results |
|---|---|
| Complexity | 0 |
| Duplication | 0 |