Release the GIL during Stretch::process() (~7x speedup on 8 threads) #4
naveensr89 wants to merge 2 commits
The C++ stretch computation runs on raw float buffers and does not touch any Python objects, so the GIL can be released for the duration. Without this change, concurrent calls from a ThreadPoolExecutor serialize on the GIL (microbench: 0.98x on 8 threads). With it, 8 threads scale 7.18x. The release is scoped to just the stretch_/Buffer work — the nb::ndarray input read and the nb::ndarray return-value construction stay under the GIL, since both touch Python-managed memory.
Hey @naveensr89, thanks for the PR! Before merging, could you provide a small test set that explicitly locks down the expected behavior?
Also, could you share a bit more about how you ran the microbench and analyzed the results? I haven’t worked on the library for a while as I’ve been busy with other projects, but I’d be delighted if we could improve it even further.
tests/test_multithread.py covers the three behaviors requested in PR review:
1. single-thread determinism: same input → bit-identical output on repeated calls
2. parallel consistency: N independent Stretch instances in a ThreadPoolExecutor
match serial reference outputs (bit-identical)
3. cross-run stability: same parallel batch repeated twice gives identical results
examples/benchmark_multithread.py is a self-contained reproducible benchmark
(no audio files required) that measures serial vs parallel throughput and prints
a speedup table. Confirmed results on this machine (8 vCPU):
serial 8×: ~32 ms, parallel 8 threads: ~10 ms → 3.1× speedup (patched build)
vs ~1× on unpatched 0.3.1 — GIL release confirmed working.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
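The three behaviors listed in the commit message can be sketched roughly as follows. This is not the contents of tests/test_multithread.py: `FakeStretch`, `make_inputs`, `run_serial` and `run_parallel` are hypothetical names, and a SHA-256 digest stands in for the stretch computation so the sketch runs without python-stretch or NumPy installed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class FakeStretch:
    """Hypothetical stand-in for a Stretch instance (NOT the python-stretch API)."""

    def process(self, audio: bytes) -> bytes:
        # Deterministic function of the input, standing in for the DSP kernel.
        return hashlib.sha256(audio).digest()


def make_inputs(n: int, size: int = 4096) -> list:
    # Fixed, reproducible byte buffers; no audio files or RNG needed.
    return [bytes((i * 31 + j) % 256 for j in range(size)) for i in range(n)]


def run_serial(inputs: list) -> list:
    # One fresh instance per input, processed one after another.
    return [FakeStretch().process(x) for x in inputs]


def run_parallel(inputs: list, workers: int = 8) -> list:
    # One fresh instance per task; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda x: FakeStretch().process(x), inputs))


def test_single_thread_determinism():
    x = make_inputs(1)[0]
    assert FakeStretch().process(x) == FakeStretch().process(x)


def test_parallel_consistency():
    inputs = make_inputs(8)
    assert run_parallel(inputs) == run_serial(inputs)


def test_cross_run_stability():
    inputs = make_inputs(8)
    assert run_parallel(inputs) == run_parallel(inputs)
```

With real audio arrays the equality checks would presumably use np.array_equal rather than `==`, as the PR description does for its determinism check.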
Hi @gregogiudici, thanks for the review! Both additions are now in the latest commit.
Tests: tests/test_multithread.py covers all three cases you asked for (single-thread determinism, parallel consistency, cross-run stability). All 12 tests (existing + new) pass against the PR build.
Benchmark: examples/benchmark_multithread.py is self-contained (no audio files needed, uses np.random input) and prints a serial vs parallel table. You can run it directly:
python examples/benchmark_multithread.py
Results vary by machine. On an 8-vCPU host I get ~3.1× with this patch vs ~1.0× on 0.3.1. The 7.18× from the PR description was measured on a 48-vCPU machine where threads have more […]
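For readers who want to see the shape of a serial vs parallel measurement without the library installed, here is a minimal sketch. It is not examples/benchmark_multithread.py: `work` and `benchmark` are hypothetical names, and hashlib.sha256 is used as a stand-in workload because CPython's hashlib also releases the GIL while hashing large buffers.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def work(buf: bytes) -> bytes:
    # Stand-in for Stretch.process(): C-level work that releases the GIL
    # (hashlib does so for inputs larger than 2047 bytes).
    return hashlib.sha256(buf).digest()


def benchmark(n_jobs: int = 8, size: int = 4_000_000, workers: int = 8):
    """Return (serial seconds, parallel seconds, speedup) for n_jobs buffers."""
    bufs = [bytes([i % 256]) * size for i in range(n_jobs)]

    t0 = time.perf_counter()
    serial = [work(b) for b in bufs]
    t_serial = time.perf_counter() - t0

    # Pool start-up is included in the parallel timing on purpose:
    # it is part of what a caller actually pays.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parallel = list(pool.map(work, bufs))
    t_parallel = time.perf_counter() - t0

    # Results must match regardless of how the work was scheduled.
    assert serial == parallel
    return t_serial, t_parallel, t_serial / max(t_parallel, 1e-9)


if __name__ == "__main__":
    t_s, t_p, speedup = benchmark()
    print(f"{'mode':<10}{'time (ms)':>12}")
    print(f"{'serial':<10}{t_s * 1e3:>12.1f}")
    print(f"{'parallel':<10}{t_p * 1e3:>12.1f}")
    print(f"speedup: {speedup:.2f}x")
```

The speedup printed depends on core count and machine load, so a single run is only indicative; repeating the measurement and taking the best or median time gives steadier numbers.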
What this changes
Wraps the C++ stretch work inside Stretch::process() in nb::gil_scoped_release so the GIL is released for the duration. The Python-object boundary (input nb::ndarray read, output nb::ndarray construction) stays under the GIL, so the change is API-compatible and crash-free.

Why
Stretch::process() is pure C++ on raw float* buffers: no Python objects are touched between the input read and the return-value allocation. Holding the GIL across that work prevents ThreadPoolExecutor-based pipelines from parallelizing it.

Measurements
Microbench, 4 s stereo @ 44.1 kHz, 8 threads each running an independent Stretch() instance (+3 semitones, 1.25× tempo).
Baseline: python-stretch==0.3.1 (current main)
Determinism is unchanged: the same Stretch config produces bit-identical output (np.array_equal(out_a, out_b) == True on repeated calls).
The benchmark is reproducible via examples/benchmark_multithread.py (no audio files required, uses np.random input).

What's safe and what isn't
Safe (covered by this patch):
- stretch_.seek / stretch_.process / stretch_.flush / stretch_.reset
- Buffer<float> wrappers and the std::copy of channel data into the output float*
Kept under the GIL (outside the release scope):
- audio_input.data() / audio_input.shape(): nb::ndarray accessors
- new float[…] for the output buffer (just malloc, GIL-free in principle, but kept above the release scope for clarity)
- return nb::ndarray<…>(outData, …, owner), which constructs a Python object
If users share a single Stretch instance across threads, they're still on their own: internal stretcher state is not protected. The intended pattern is one Stretch per thread (or per call), which works correctly with this patch.

Test plan
- concurrent.futures.ThreadPoolExecutor(max_workers=40) for >1000 stretches
- tests/test_multithread.py: single-thread determinism, parallel consistency, cross-run stability (all pass against the PR build)
- pytest tests/: 12/12 pass
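
The "one Stretch per thread" pattern mentioned above could look roughly like the sketch below, which keeps one instance per worker thread in threading.local storage. `FakeStretch`, `get_stretch`, `stretch_job` and `run_batch` are hypothetical names used for illustration only; the stand-in class just halves each sample and is not the python-stretch API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeStretch:
    """Hypothetical stateful stand-in; assume it is not safe to share across threads."""

    def __init__(self) -> None:
        self.calls = 0  # internal state that concurrent callers would race on

    def process(self, samples: list) -> list:
        self.calls += 1
        return [s * 0.5 for s in samples]


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_stretch() -> FakeStretch:
    # Lazily create one instance per worker thread and reuse it afterwards.
    inst = getattr(_local, "stretch", None)
    if inst is None:
        inst = FakeStretch()
        _local.stretch = inst
    return inst


def stretch_job(samples: list) -> list:
    # Only ever touches the calling thread's own instance.
    return get_stretch().process(samples)


def run_batch(batch: list, workers: int = 4) -> list:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(stretch_job, batch))
```

Creating a fresh instance inside each job (the "per call" variant) is simpler and avoids any state carrying over between inputs; whether a reused instance needs a reset between inputs depends on the library and is not covered here.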