Schedule periodic Spotify duration-mismatch rematch sweep #171

@dprodger

Description

Set up a recurring schedule to run scripts/rematch_spotify_duration_mismatches.py on production so duration-mismatched Spotify links get re-evaluated as new releases land and the matcher's scoring evolves.

Background

scripts/rematch_spotify_duration_mismatches.py (added in 0317b98) walks songs whose Spotify links have a > 60s duration mismatch against the canonical recording duration, and enqueues per-song ('spotify', 'rematch_duration_mismatches') jobs onto the durable research queue. The worker drains them via SpotifyMatcher.
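For readers unfamiliar with the script, the selection step described above can be sketched roughly as follows. This is not the script's actual code: the `LinkedSong` shape, the helper names, the job dict layout, and the use of an absolute difference are all assumptions made for illustration. Only the 60s threshold and the ('spotify', 'rematch_duration_mismatches') job key come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedSong:
    # Hypothetical row shape; the real script reads this from the database.
    song_id: int
    recording_duration_s: float  # canonical recording duration
    spotify_duration_s: float    # duration of the linked Spotify track


def find_mismatches(songs, threshold_seconds=60):
    """Return songs whose Spotify duration is off by more than the threshold.

    Assumes the mismatch is the absolute difference in seconds.
    """
    return [
        s for s in songs
        if abs(s.spotify_duration_s - s.recording_duration_s) > threshold_seconds
    ]


def build_jobs(mismatched):
    """Build one ('spotify', 'rematch_duration_mismatches') job per song."""
    return [
        {
            "source": "spotify",
            "job_type": "rematch_duration_mismatches",
            "song_id": s.song_id,
        }
        for s in mismatched
    ]


songs = [
    LinkedSong(1, 240.0, 245.0),  # 5s apart: within tolerance
    LinkedSong(2, 240.0, 301.0),  # 61s apart: flagged at 60s
    LinkedSong(3, 240.0, 200.0),  # 40s apart: flagged only at 30s
]
jobs = build_jobs(find_mismatches(songs, threshold_seconds=60))
```

With the sample rows, only song 2 is enqueued at the 60s threshold; tightening to 30s would also pick up song 3, which gives a feel for how the threshold affects job volume.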

It's a one-shot today — kicked off via Render shell. Without a recurring trigger, new mismatches accumulate silently between manual runs.

Options

1. Render Cron Job (simpler — recommended starting point)

Add via Render dashboard or render.yaml:

services:
  - type: cron
    name: spotify-rematch-monthly
    runtime: python
    schedule: "0 4 1 * *"   # 04:00 UTC on the 1st of each month
    buildCommand: pip install -r backend/requirements.txt
    startCommand: cd backend && python scripts/rematch_spotify_duration_mismatches.py
    envVars:
      - fromGroup: <existing prod env-group>
  • Pro: native to Render, runs with the same DB creds as the worker, no extra auth.
  • Con: failures only visible in Render logs unless a notification is wired up.

2. Claude scheduled agent (more capable, more setup)

The agent would run on a cron schedule, call an admin HTTP endpoint that triggers the sweep, and then post a summary back somewhere visible. This is useful for the "summarise what got cleaned up" part, but it requires adding the admin endpoint first.

Decisions to make

  • Cadence: monthly is the default in option 1 above, but weekly might be appropriate if mismatch volume is high.
  • Threshold: production runs should default to --threshold-seconds 60 until volume is low enough to tighten to 30s.
  • Notification: do nothing (just rely on /admin/research/?source=spotify&job_type=rematch_duration_mismatches to spot-check), wire a Slack/email notification on cron failure, or build the summary-posting agent (option 2).
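If the "notify on cron failure" route is chosen, one possible shape is a thin wrapper around the script that posts to a Slack incoming webhook when the sweep exits non-zero. This is only a sketch under assumptions: the wrapper, its function names, and the `SLACK_WEBHOOK_URL` environment variable are hypothetical, and it assumes the script signals failure through its exit code.

```python
import json
import subprocess
import sys
import urllib.request

# Command the cron job would run; the flag value follows the threshold bullet above.
SWEEP_CMD = [
    sys.executable,
    "scripts/rematch_spotify_duration_mismatches.py",
    "--threshold-seconds",
    "60",
]


def build_failure_payload(returncode, stderr_tail):
    """Body for a Slack incoming webhook: a JSON object with a 'text' field."""
    return {
        "text": f"Spotify rematch sweep failed (exit {returncode}):\n{stderr_tail}"
    }


def post_to_webhook(url, payload):
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def run_sweep(cmd, notify):
    """Run the sweep; call notify(payload) only when it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Send only the last 500 characters of stderr to keep the message short.
        notify(build_failure_payload(result.returncode, result.stderr[-500:]))
    return result.returncode


# Possible cron entry point (SLACK_WEBHOOK_URL is a hypothetical env var):
#   url = os.environ["SLACK_WEBHOOK_URL"]
#   sys.exit(run_sweep(SWEEP_CMD, lambda p: post_to_webhook(url, p)))
```

Passing `notify` in as a callable keeps the wrapper testable without network access, and returning the exit code lets the cron job still register as failed in Render logs.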

Out of scope

Auto-unlinking stubborn mismatches the matcher can't fix. That's a separate trust call — the existing /admin/duration-mismatches review page covers human-driven cleanup today.

Labels: data-cleanup (projected related to the underlying metadata, scrapers, ingesters, etc.)