
fix: retry transient GCS delete_many failures #90

Open
shunjiad wants to merge 1 commit into main from shunjiad/fix-gcs-delete-many-retry

Conversation

@shunjiad (Contributor) commented Jun 29, 2026

Description

  • Retry transient GCS delete_many failures during batch deletes.
  • Preserve retryability for GCS batch delete 408, 429, and 5xx response statuses.
  • Add focused regression tests for delete_many retry behavior.

Changes

  • Added @retry to SingleStorageClient.delete_many.
  • Updated GCS batch delete response handling to raise RetryableError for transient response statuses.
  • Added unit tests covering delete_many retry and GCS batch delete 503 handling.
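
To illustrate what wrapping `delete_many` in a retry decorator achieves, here is a minimal sketch. It is not the library's implementation: the `retry` signature (`attempts`, `delay`), the `FlakyClient` class, and the local `RetryableError` stand-in are assumptions made for this example. The PR applies the project's own `@retry` without arguments, and how that decorator is configured is not shown in this conversation.

```python
import functools
import time


class RetryableError(Exception):
    """Local stand-in for the library's RetryableError (assumption)."""


def retry(attempts: int = 3, delay: float = 0.0):
    """Hypothetical retry decorator: re-invoke the call on RetryableError only."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RetryableError:
                    # Give up once the attempt budget is spent.
                    if attempt == attempts:
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator


class FlakyClient:
    """Hypothetical client whose delete_many fails a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    @retry(attempts=3)
    def delete_many(self, paths: list[str]) -> int:
        self.calls += 1
        if self.calls <= self.failures:
            raise RetryableError("simulated 503 from the storage backend")
        return len(paths)
```

With two simulated failures the third attempt succeeds. With three, the last `RetryableError` propagates to the caller. Any other exception type is not retried in this sketch.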

Closes NGCDP-8087

Checklist

  • Development PR
    • .release_notes/.unreleased.md
      • Notable changes to the client (i.e. not related to tooling, CI/CD, etc.) from this PR have been added.
  • Release PR
    • CI/CD
      • The default branch pipelines are passing in both GitHub and GitLab (the latter for SwiftStack E2E tests).
    • multi-storage-client/pyproject.toml
      • The package version has been bumped.
    • .release_notes/.unreleased.md
      • This file's contents have been moved into a .release_notes/{bumped package version}.md file.

Summary by CodeRabbit

  • Bug Fixes

    • Bulk delete operations now retry automatically when temporary failures occur.
    • Google Cloud Storage batch deletes now treat timeout, throttling, and server errors as retryable, improving reliability.
    • Retryable errors are now passed through consistently instead of being converted into generic failures.
  • Tests

    • Added coverage for retry behavior during bulk deletes.
    • Added coverage for retryable Google Cloud Storage batch delete responses.

@shunjiad self-assigned this Jun 29, 2026
@shunjiad requested a review from a team June 29, 2026 23:09
copy-pr-bot Bot commented Jun 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jun 29, 2026

📝 Walkthrough

SingleStorageClient.delete_many gains a @retry decorator. GoogleStorageProvider._translate_errors now re-raises RetryableError before the generic handler, and _delete_objects raises RetryableError for HTTP 408, 429, and 5xx batch responses. Tests verify both behaviors.
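
The ordering described above (re-raise `RetryableError` before the generic handler) can be sketched as follows. This is an illustration under stated assumptions, not the provider's code: `translate_errors` is written here as a standalone context manager, and both it and the local `RetryableError` class are hypothetical stand-ins for `GoogleStorageProvider._translate_errors` and the library's exception.

```python
from contextlib import contextmanager


class RetryableError(Exception):
    """Local stand-in for the library's RetryableError (assumption)."""


@contextmanager
def translate_errors(operation: str):
    """Hypothetical error translator for a storage operation."""
    try:
        yield
    except RetryableError:
        # Pass retryable errors through unchanged so an outer retry
        # decorator can still recognize them.
        raise
    except Exception as error:
        # Everything else is wrapped in a generic failure.
        raise RuntimeError(f"Failed to {operation}: {error}") from error
```

If the `except RetryableError` clause were missing, or placed after the generic handler, the retryable error would be wrapped in `RuntimeError` and the retry layer would no longer recognize it.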

Changes

Delete Many Retry Support

| Layer / File(s) | Summary |
| --- | --- |
| GCS retryable error handling (src/multistorageclient/providers/gcs.py) | _translate_errors re-raises RetryableError unchanged; _delete_objects raises RetryableError for 408, 429, and 5xx batch response codes instead of falling through to RuntimeError. |
| @retry on delete_many (src/multistorageclient/client/single.py) | Adds @retry decorator to SingleStorageClient.delete_many. |
| Unit tests (tests/test_multistorageclient/unit/test_retry.py) | Adds FakeDeleteManyStorageProvider, test_delete_many_retries_retryable_errors, and test_gcs_batch_delete_service_unavailable_response_is_retryable. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hop, hop, retry!
When GCS says "503,"
We bounce right back up,
No blob left behind, you see—
Delete many, fear not,
The rabbit retries with glee! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.65%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: retrying transient GCS delete_many failures. |
| Description check | ✅ Passed | The description includes the required sections, references the task ID, and covers the checklist items. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
multi-storage-client/src/multistorageclient/providers/gcs.py (1)

527-536: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't let a transient response hide a permanent failure in the same batch.

This loop raises on the first failing response. If a batch comes back as 503 then 403, Line 531 turns the whole operation into a retryable failure and the permanent 403 is never surfaced. That can waste retries and return the wrong final error. Collect all non-2xx/404 responses first, raise RuntimeError if any non-retryable status is present, and only raise RetryableError when every failure in the batch is transient.

Suggested fix
-                    for response in batch._responses:
-                        status_code = response.status_code
-                        if 200 <= status_code < 300 or status_code == 404:
-                            continue
-                        if status_code in {408, 429} or 500 <= status_code < 600:
-                            raise RetryableError(
-                                f"GCS batch delete failed with status_code: {status_code}, response: {response.text}"
-                            )
-                        raise RuntimeError(
-                            f"GCS batch delete failed with status_code: {status_code}, response: {response.text}"
-                        )
+                    retryable_failures: list[str] = []
+                    non_retryable_failures: list[str] = []
+                    for response in batch._responses:
+                        status_code = response.status_code
+                        if 200 <= status_code < 300 or status_code == 404:
+                            continue
+
+                        message = (
+                            f"GCS batch delete failed with status_code: {status_code}, "
+                            f"response: {response.text}"
+                        )
+                        if status_code in {408, 429} or 500 <= status_code < 600:
+                            retryable_failures.append(message)
+                        else:
+                            non_retryable_failures.append(message)
+
+                    if non_retryable_failures:
+                        raise RuntimeError("; ".join(non_retryable_failures))
+                    if retryable_failures:
+                        raise RetryableError("; ".join(retryable_failures))
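
For experimenting with the collect-then-classify behavior outside the provider, here is a self-contained sketch of the same idea. It operates on plain status codes rather than batch response objects; the function name `raise_for_batch_statuses` and the local `RetryableError` stand-in are assumptions made for this example, not names from the PR.

```python
class RetryableError(Exception):
    """Local stand-in for the library's RetryableError (assumption)."""


def raise_for_batch_statuses(status_codes: list[int]) -> None:
    """Hypothetical helper: inspect every status before deciding what to raise."""
    retryable: list[int] = []
    permanent: list[int] = []
    for code in status_codes:
        # Success, or the object is already gone: nothing to report.
        if 200 <= code < 300 or code == 404:
            continue
        if code in {408, 429} or 500 <= code < 600:
            retryable.append(code)
        else:
            permanent.append(code)

    # A permanent failure wins, even if transient failures came first.
    if permanent:
        raise RuntimeError(f"non-retryable batch delete statuses: {permanent}")
    if retryable:
        raise RetryableError(f"transient batch delete statuses: {retryable}")
```

Under this sketch a batch of `[503, 403]` raises `RuntimeError`, `[503, 429]` raises `RetryableError`, and `[200, 404]` returns without raising.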
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multi-storage-client/src/multistorageclient/providers/gcs.py` around lines
527 - 536, The batch delete handling in the GCS provider currently exits on the
first failing response inside the batch loop, which can hide a permanent failure
behind a transient one. Update the response processing in the batch delete logic
to collect all non-2xx/non-404 statuses from batch._responses first, then have
the GCS batch delete path raise RuntimeError if any non-retryable status is
present, and only raise RetryableError when every failure is transient. Keep the
change localized to the batch response handling in the GCS provider’s delete
flow.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0d1d9bfd-3c1a-41c0-b2f8-1b48b4c088f7

📥 Commits

Reviewing files that changed from the base of the PR and between ab9df36 and 6b50798.

📒 Files selected for processing (3)
  • multi-storage-client/src/multistorageclient/client/single.py
  • multi-storage-client/src/multistorageclient/providers/gcs.py
  • multi-storage-client/tests/test_multistorageclient/unit/test_retry.py

NVIDIA deleted a comment from copy-pr-bot Bot on Jun 29, 2026
@shunjiad (Contributor, Author) commented

/ok to test 6b50798
