MusicBrainz outage circuit breaker — cap retry storm during multi-hour 503 windows #190

@dprodger

Description

Background

Surfaced during investigation of #180. During a multi-hour MusicBrainz outage (503 Server Maintenance), every Apple Music match call hits the MB API for the release tracklist, retries 3× with exponential backoff (1s + 4s + 8s = ~13s of waits), and finally gives up — falling back to the local recording_releases presence-check path.

For an interactive admin diagnose this adds ~12s of dead time per click. For the background apple/match_song worker it's amortized across the job, but the cumulative wait across an MB outage day is real.

Concrete server log from a single diagnose during the outage:

2026-05-11 18:07:50  Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:50  MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:50  BACKOFF: tracklist fetch retry 2/3, waiting 4s before retry
2026-05-11 18:07:54  Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:07:54  MusicBrainz service unavailable (503), will retry...
2026-05-11 18:07:54  BACKOFF: tracklist fetch retry 3/3, waiting 8s before retry
2026-05-11 18:08:02  Fetching MusicBrainz release tracklist: 398f5c4f-…
2026-05-11 18:08:02  All retry attempts failed (503)

13 seconds per diagnose call, and the next diagnose pays the same toll fresh.

Proposed fix

Add a process-local circuit breaker in integrations/musicbrainz/client.py (or a thin wrapper). When get_release_tracklist (or any MB request) exhausts its retries on 503/connection-error, stash a (failed_at, retry_after) marker on the class. While the marker is fresh (say 5 minutes), subsequent calls return None immediately instead of running through the retry ladder.

Sketch:

import time
import logging

logger = logging.getLogger(__name__)

class MusicBrainzSearcher:
    _outage_marker = None  # class-level timestamp, shared across instances
    _OUTAGE_TTL_SECONDS = 300

    def _is_in_outage_window(self) -> bool:
        if self._outage_marker is None:
            return False
        return time.time() - self._outage_marker < self._OUTAGE_TTL_SECONDS

    def get_release_tracklist(self, release_id, max_retries=3):
        if self._is_in_outage_window():
            logger.debug("MB circuit breaker open; short-circuiting tracklist fetch")
            return None
        # ... existing retry loop ...
        # on full failure, open the breaker:
        MusicBrainzSearcher._outage_marker = time.time()
        return None

Cost: one retry storm per worker process per ~5-minute window during an outage, instead of one per call.

Scope notes

  • Process-local, not cross-worker. A class-level attribute is fine here — every worker process pays a fresh retry storm on its first call after restart, but that's acceptable. A shared/distributed circuit breaker (Redis, DB) is overengineering for this.
  • Short TTL. 5 minutes balances "don't keep retrying during a multi-hour outage" against "don't stay broken longer than needed when MB comes back." Could be tuned.
  • Respect Retry-After. When MB returns a 429 with a Retry-After header, prefer that over the fixed TTL.
  • Other MB endpoints. get_release_tracklist is the obvious caller, but get_release_details, search_releases, search_artists, etc. all share the same risk. Probably worth installing the breaker at the request layer (e.g. wrapping self.session.get) rather than per-endpoint.
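Installing the breaker at the request layer, per the last scope note, could look like the following. This is a minimal sketch under stated assumptions: the class name `MusicBrainzBreaker`, the `trip`/`is_open` API, the injectable clock, and the `mb_get` wrapper are all illustrative, not existing code in the client.

```python
import time

class MusicBrainzBreaker:
    """Process-local circuit breaker shared by all MB endpoints (sketch)."""

    def __init__(self, ttl_seconds=300, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable for tests
        self._open_until = None    # absolute timestamp; None = breaker closed

    def is_open(self):
        """True while calls should be short-circuited."""
        return self._open_until is not None and self._clock() < self._open_until

    def trip(self, retry_after=None):
        """Open the breaker after a failed request.

        Prefers a server-supplied Retry-After value (seconds, e.g. from a
        429 response) over the fixed TTL, per the scope note above.
        """
        hold = retry_after if retry_after is not None else self._ttl
        self._open_until = self._clock() + hold

def mb_get(breaker, session_get, url, **kwargs):
    """Single choke point every MB endpoint would route through.

    session_get stands in for self.session.get; a real version would
    also trip the breaker on 503 responses, not just connection errors.
    """
    if breaker.is_open():
        return None
    try:
        return session_get(url, **kwargs)
    except ConnectionError:
        breaker.trip()
        return None
```

One module-level `MusicBrainzBreaker` instance shared by `get_release_tracklist`, `get_release_details`, `search_releases`, etc. would give all endpoints the same outage window without per-endpoint bookkeeping.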

Why this is "nice to have," not urgent

Correctness is fine — the gate falls back to the local presence check during MB outages (#180 work). This is purely a latency / log-noise optimization for outage windows.

Acceptance

  • Successive Apple Music diagnose calls during an MB outage do not each pay the full 13s retry ladder.
  • When MB comes back, the breaker closes again within the TTL window so we don't stay degraded longer than needed.
  • A debug log line indicates when the breaker is short-circuiting a call, so operators can tell it from a genuine MB outage.
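The first acceptance bullet can be exercised without hitting MB at all. A stub reproducing just the breaker logic from the sketch above (`StubSearcher` and its `network_calls` counter are illustrative, not real code):

```python
import time

class StubSearcher:
    """Stand-in for MusicBrainzSearcher, keeping only the breaker logic."""
    _outage_marker = None
    _OUTAGE_TTL_SECONDS = 300

    def __init__(self):
        self.network_calls = 0  # each increment stands in for a 13s retry ladder

    def _is_in_outage_window(self):
        if self._outage_marker is None:
            return False
        return time.time() - self._outage_marker < self._OUTAGE_TTL_SECONDS

    def get_release_tracklist(self, release_id):
        if self._is_in_outage_window():
            return None  # breaker open: short-circuit, no network
        self.network_calls += 1
        type(self)._outage_marker = time.time()  # simulate exhausting retries on 503
        return None

searcher = StubSearcher()
searcher.get_release_tracklist("398f5c4f")  # first call pays the ladder, trips breaker
searcher.get_release_tracklist("398f5c4f")  # second call short-circuits
assert searcher.network_calls == 1

# Marker is class-level: a second instance in the same process also skips.
other = StubSearcher()
other.get_release_tracklist("398f5c4f")
assert other.network_calls == 0
```

A real test would additionally fake the clock past the TTL and assert the next call retries, covering the second acceptance bullet.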
