Worker RSS plateaus at ~80% during concurrent match_song on heavy songs #193

@dprodger

Summary

The research worker (ApproachNote Worker on Render, 2GB instance) plateaus at ~80% memory (~1.6GB RSS) while processing popular jazz standards, with one prior OOM-kill observed. The DuckDB cap shipped in eef620b brought the peak from ~95% to ~80% — but the plateau is uncomfortably close to the limit on a single heavy song, and one concurrent allocation away from OOM.

This is not a leak. It's the steady-state working set of three handlers chewing on the same song at once.

Evidence

Observation 1 — Ain't Misbehavin' (yesterday, pre-DuckDB cap)

  • ~10-hour climb to 100% memory, then OOM at ~03:32 UTC.
  • 1438 releases linked to the song.
  • All three workers (spotify.match_song, apple.match_song, youtube.match_recording) running concurrently for the song.

Observation 2 — Summertime (today, post-DuckDB cap)

Worker logs around 2026-05-14T19:24:09:

INFO research_worker.loop.spotify.match_song.job4391: claimed target=song/872d7739-…  (Summertime)
INFO research_worker.loop.spotify: Found 3642 releases to process
INFO research_worker.loop.apple.match_song.job4392:   claimed target=song/872d7739-…  (Summertime)
INFO research_worker.loop.apple:   Found 3642 releases to process
INFO research_worker.loop.youtube.match_recording.job4393+: claimed target=recording/…

Memory chart for the same instance: baseline ~5% → climb starting ~19:23 → plateau ~80% by ~19:28, sustained while both match_song jobs iterate.

Where the memory goes

The worker (research_worker/run.py) spawns one thread per registered (source, job_type), so spotify.match_song, apple.match_song, and youtube.match_recording run in parallel inside the same process (run.py:83-91). On a heavy song, three working sets stack:

  • apple.match_song: DuckDB buffer pool over the Apple Music catalog parquets. Approx. size: capped at 512MB (post-eef620b).
  • spotify.match_song: cur.fetchall() of the N-row release×recording join with JSON-aggregated performers, plus parsed per-release Spotify API responses held live for the duration of the loop. Approx. size: scales with releases (Summertime had 3642 rows).
  • youtube.match_recording: per-recording client + matcher state across many concurrent jobs. Approx. size: tens of MB.
  • Python heap: arenas that don't return to the OS after GC. Approx. size: sticky overhead.
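
For context, the thread-per-handler spawn described above looks roughly like this (a minimal sketch with hypothetical loop names; the real registry and spawn logic live in run.py:83-91):

    import threading

    # Hypothetical stand-ins for the real handler loops.
    def spotify_match_song_loop(): ...
    def apple_match_song_loop(): ...
    def youtube_match_recording_loop(): ...

    HANDLERS = {
        ("spotify", "match_song"): spotify_match_song_loop,
        ("apple", "match_song"): apple_match_song_loop,
        ("youtube", "match_recording"): youtube_match_recording_loop,
    }

    def start_handler_threads():
        # One thread per registered (source, job_type). All three share one
        # process, so on a heavy song their working sets stack in one RSS.
        threads = []
        for (source, job_type), loop_fn in HANDLERS.items():
            t = threading.Thread(target=loop_fn, name=f"{source}.{job_type}", daemon=True)
            t.start()
            threads.append(t)
        return threads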

The Spotify get_releases_for_song query is the most stretchy of these — it materialises every row of a releases × recording_releases × recordings × release_streaming_links join with a JSON-aggregated performers subquery per row, then keeps the entire list reachable during the per-release loop in SpotifyMatcher.match_releases.
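
Schematically, the pattern is this (a sketch, not the verbatim code; RELEASES_FOR_SONG_SQL is a stand-in for the actual join):

    RELEASES_FOR_SONG_SQL = "..."  # releases × recording_releases × recordings
                                   # × release_streaming_links, json_agg'd performers

    def get_releases_for_song(cur, song_id):
        cur.execute(RELEASES_FOR_SONG_SQL, (song_id,))
        return cur.fetchall()  # materialises every row client-side at once

    # SpotifyMatcher.match_releases then iterates the full list, so the
    # whole result set (plus parsed per-release API responses) stays
    # reachable until the loop finishes: peak RSS scales with releases.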

What's already shipped

  • eef620b — Cap DuckDB resource usage on the Apple Music catalog connection. Adds PRAGMA memory_limit='512MB' and PRAGMA threads=2 after every duckdb.connect(), with APPLE_DUCKDB_MEMORY_LIMIT / APPLE_DUCKDB_THREADS env overrides for ops tuning without a redeploy. Reduced observed peak from ~95% to ~80%.
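
For reference, the cap amounts to roughly this (a sketch of the behaviour eef620b describes, not the commit itself; the helper name is invented):

    import os
    import duckdb

    def connect_apple_catalog(path):
        con = duckdb.connect(path)
        # Bound the buffer pool and worker threads so Apple Music catalog
        # scans can't eat the whole instance; env overrides let ops retune
        # without a redeploy.
        mem = os.environ.get("APPLE_DUCKDB_MEMORY_LIMIT", "512MB")
        threads = os.environ.get("APPLE_DUCKDB_THREADS", "2")
        con.execute(f"PRAGMA memory_limit='{mem}'")
        con.execute(f"PRAGMA threads={threads}")
        return con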

Proposed fixes (cheapest first)

  1. Tighten the DuckDB cap further via env. Set APPLE_DUCKDB_MEMORY_LIMIT=256MB on the Render worker. Frees ~250MB headroom; Apple queries may spill to temp disk on heavy songs but the cap is already implemented. Zero code change, instantly reversible.

  2. Stream Spotify's release iteration. Replace cur.fetchall() in integrations/spotify/db.py::get_releases_for_song with a server-side cursor that yields batches of ~200 rows, and process per-batch inside SpotifyMatcher.match_releases (see the sketch after this list). Caps Spotify's working set regardless of song popularity. ~30-50 LOC.

  3. Upgrade worker instance to 4GB on Render. No code change. Buys runway but doesn't address the underlying scaling problem; the next high-coverage standard with double the releases will reproduce the same plateau.
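
A sketch of fix 2, assuming a psycopg2-style driver where a named cursor keeps the result set server-side (function and constant names are placeholders):

    RELEASES_FOR_SONG_SQL = "..."  # the existing join from get_releases_for_song

    def iter_release_batches(conn, song_id, batch_size=200):
        # Named cursor => rows stream from the server instead of being
        # materialised client-side by fetchall().
        with conn.cursor(name="releases_for_song") as cur:
            cur.itersize = batch_size
            cur.execute(RELEASES_FOR_SONG_SQL, (song_id,))
            while True:
                batch = cur.fetchmany(batch_size)
                if not batch:
                    break
                yield batch  # at most ~batch_size rows live at a time

    # SpotifyMatcher.match_releases then loops per batch, so the working
    # set no longer scales with song popularity:
    #   for batch in iter_release_batches(conn, song_id):
    #       for release in batch:
    #           match(release)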

A complementary improvement worth tracking separately: a wall-clock watchdog on match_song handlers (abort + reschedule after, say, 30 min) so a stuck job can't pin DuckDB indefinitely the way the Ain't Misbehavin' run did.
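
In sketch form (all names hypothetical; Python threads can't be killed from outside, so the deadline check has to be cooperative, e.g. once per release):

    import time

    MATCH_SONG_DEADLINE_SECS = 30 * 60  # hypothetical knob

    def match_with_deadline(job, releases, process_release):
        started = time.monotonic()
        for release in releases:
            if time.monotonic() - started > MATCH_SONG_DEADLINE_SECS:
                # Abort cooperatively; rescheduling frees the handler frame
                # (and with it the DuckDB connection) instead of pinning it.
                job.reschedule(reason="watchdog: 30min wall clock exceeded")
                return
            process_release(release)
        job.complete()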

Repro

  1. Queue a deep refresh on a song with > ~2000 releases (Summertime, Ain't Misbehavin', Body and Soul, Stardust, etc.).
  2. Watch /admin/research/ (filter source=spotify or apple, job_type=match_song) — both jobs claim within seconds of each other.
  3. Watch Render's memory chart for the worker; expect rapid climb to 70–85% during the in-flight window.

Acceptance criteria

  • Worker RSS stays below 70% (~1.4GB) during concurrent match_song runs on the heaviest songs in the catalog.
  • Two heavy songs queued back-to-back do not OOM the worker.
