Release the GIL during Stretch::process() (~7x speedup on 8 threads) #4

Open
naveensr89 wants to merge 2 commits into gregogiudici:main from naveensr89:gil-release

@naveensr89 naveensr89 commented May 22, 2026

What this changes

Wraps the C++ stretch work inside Stretch::process() in nb::gil_scoped_release so the GIL is released for the duration. The Python-object boundary (input nb::ndarray read, output nb::ndarray construction) stays under the GIL, so the change is API-compatible and crash-free.

Why

Stretch::process() is pure C++ on raw float* buffers — no Python objects are touched between the input read and the return-value allocation. Holding the GIL across that work prevents ThreadPoolExecutor-based pipelines from parallelizing it.

Measurements

Microbench, 4 s stereo @ 44.1 kHz, 8 threads each running an independent Stretch() instance (+3 semitones, 1.25× tempo):

| Build | Serial 8× | Parallel 8× | Speedup |
|---|---|---|---|
| python-stretch==0.3.1 (current main) | 424 ms | 419 ms | 0.98× |
| this PR | 399 ms | 55 ms | 7.18× |

Determinism is unchanged: same Stretch config produces bit-identical output (np.array_equal(out_a, out_b) == True on repeated calls).

The benchmark is reproducible via examples/benchmark_multithread.py (no audio files required — uses np.random input).
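
For readers without the repository at hand, the serial-versus-parallel comparison can be sketched as below. This is a minimal illustration and not the contents of examples/benchmark_multithread.py: `measure`, `gil_holding_job`, and `gil_releasing_job` are hypothetical names, and the two jobs are stand-ins (a pure-Python loop that keeps the GIL on a standard CPython build, and `time.sleep`, which releases it) rather than real Stretch calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def gil_holding_job(n=200_000):
    # Stand-in for a binding that keeps the GIL: a pure-Python loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def gil_releasing_job(seconds=0.02):
    # Stand-in for a binding that releases the GIL: time.sleep drops the
    # GIL while the thread is blocked.
    time.sleep(seconds)
    return seconds


def measure(job, n_jobs=8, n_threads=8):
    """Time n_jobs calls run serially, then the same calls in a thread pool."""
    start = time.perf_counter()
    for _ in range(n_jobs):
        job()
    serial_s = time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        start = time.perf_counter()
        futures = [pool.submit(job) for _ in range(n_jobs)]
        for future in futures:
            future.result()  # re-raises any exception from the worker
        parallel_s = time.perf_counter() - start

    return {
        "serial_s": serial_s,
        "parallel_s": parallel_s,
        "speedup": serial_s / parallel_s,
    }


if __name__ == "__main__":
    for name, job in [("holds GIL", gil_holding_job),
                      ("releases GIL", gil_releasing_job)]:
        r = measure(job)
        print(f"{name:>12}: serial {r['serial_s'] * 1e3:.0f} ms, "
              f"parallel {r['parallel_s'] * 1e3:.0f} ms, "
              f"speedup {r['speedup']:.2f}x")
```

On a standard (non-free-threaded) CPython build, the GIL-holding job should stay near 1× while the GIL-releasing job scales with the thread count, which is the same contrast the table above reports for the unpatched and patched builds.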

What's safe and what isn't

Safe (covered by this patch):

  • stretch_.seek / stretch_.process / stretch_.flush / stretch_.reset
  • The Buffer<float> wrappers and the std::copy of channel data into the output float*

Kept under the GIL (outside the release scope):

  • audio_input.data() / audio_input.shape() (nb::ndarray accessors)
  • new float[…] for the output buffer (just malloc, GIL-free in principle, but kept above the release scope for clarity)
  • The final return nb::ndarray<…>(outData, …, owner) which constructs a Python object

If users share a single Stretch instance across threads they're still on their own — internal stretcher state is not protected. The intended pattern is one Stretch per thread (or per call), which works correctly with this patch.

Test plan

  • Microbench: 8-thread parallel speedup goes from 0.98× to 7.18×
  • Output determinism: bit-identical across repeated calls
  • No crashes when called from concurrent.futures.ThreadPoolExecutor(max_workers=40) for >1000 stretches
  • tests/test_multithread.py: single-thread determinism, parallel consistency, cross-run stability (all pass against the PR build)
  • Full test suite (pytest tests/): 12/12 pass

The C++ stretch computation runs on raw float buffers and does not touch
any Python objects, so the GIL can be released for the duration. Without
this change, concurrent calls from a ThreadPoolExecutor serialize on the
GIL (microbench: 0.98x on 8 threads). With it, 8 threads scale 7.18x.

The release is scoped to just the stretch_/Buffer work — the nb::ndarray
input read and the nb::ndarray return-value construction stay under the
GIL, since both touch Python-managed memory.

@gregogiudici (Owner) commented

Hey @naveensr89, thanks for the PR!
It's really nice to see someone else wanting to work on this library. The performance improvement looks great!

Before merging, could you provide a small test set that explicitly locks down the expected behavior?
Something like a test_multithread.py with:

  1. Single-thread determinism: same input -> bit identical outputs across repeated calls.
  2. Parallel consistency: multiple independent Stretch instances run in a thread pool -> each output matches a single-thread reference output.
  3. Cross-run stability: repeating the same parallel batch gives identical results across different runs.

Also, could you share a bit more about how you ran the microbench and analyzed the results?
It would be great if I could verify the same results in tests on my own local setup.

I haven’t worked on the library for a while as I’ve been busy with other projects, but I’d be delighted if we could improve it even further.

tests/test_multithread.py covers the three behaviors requested in PR review:
  1. single-thread determinism: same input → bit-identical output on repeated calls
  2. parallel consistency: N independent Stretch instances in a ThreadPoolExecutor
     match serial reference outputs (bit-identical)
  3. cross-run stability: same parallel batch repeated twice gives identical results

examples/benchmark_multithread.py is a self-contained reproducible benchmark
(no audio files required) that measures serial vs parallel throughput and prints
a speedup table. Confirmed results on this machine (8 vCPU):
  serial 8×: ~32 ms, parallel 8 threads: ~10 ms → 3.1× speedup (patched build)
  vs ~1× on unpatched 0.3.1 — GIL release confirmed working.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@naveensr89 (Author) commented

Hi @gregogiudici, thanks for the review! Both additions are now in the latest commit.

Tests — tests/test_multithread.py covers all three cases you asked for:

  1. Single-thread determinism — same input → np.array_equal on repeated calls
  2. Parallel consistency — 8 independent Stretch instances in a ThreadPoolExecutor → each matches its serial reference output (bit-identical)
  3. Cross-run stability — same parallel batch run twice → identical results
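
The shape of those three checks can be sketched as follows. This is an illustration of the test structure only, not the contents of tests/test_multithread.py: `StandInStretch` is a hypothetical deterministic processor, and plain list equality stands in for the `np.array_equal` comparison used against the real binding.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInStretch:
    """Hypothetical deterministic processor; NOT the python-stretch API."""

    def __init__(self, gain):
        self.gain = gain

    def process(self, samples):
        return [self.gain * s for s in samples]


def make_inputs(n_jobs=8, length=1024):
    # Deterministic inputs, one distinct buffer per job.
    return [[((j + 1) * i % 97) / 97.0 for i in range(length)]
            for j in range(n_jobs)]


def process_one(samples):
    # A fresh instance per call, so no state is shared between threads.
    return StandInStretch(1.25).process(samples)


def run_serial(inputs):
    return [process_one(x) for x in inputs]


def run_parallel(inputs, n_threads=8):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order, so results line up with run_serial.
        return list(pool.map(process_one, inputs))


def test_single_thread_determinism():
    x = make_inputs(n_jobs=1)[0]
    assert process_one(x) == process_one(x)


def test_parallel_consistency():
    inputs = make_inputs()
    assert run_parallel(inputs) == run_serial(inputs)


def test_cross_run_stability():
    inputs = make_inputs()
    assert run_parallel(inputs) == run_parallel(inputs)
```

Comparing each parallel result against its own serial reference, rather than only against another parallel run, is what would catch a data race that happens to be repeatable.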

All 12 tests (existing + new) pass against the PR build.

Benchmark — examples/benchmark_multithread.py is self-contained (no audio files needed, uses np.random input) and prints a serial vs parallel table you can run directly:

python examples/benchmark_multithread.py

Results vary by machine. On an 8-vCPU host I get ~3.1× with this patch vs ~1.0× on 0.3.1. The 7.18× from the PR description was measured on a 48-vCPU machine, where threads have more independent cores to spread across. Both results confirm that the GIL is genuinely released.
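
To compare the two machines on equal footing, the reported speedups can be turned into a parallel efficiency (speedup divided by thread count). The helper name `parallel_efficiency` is hypothetical; the inputs are just the figures quoted in this thread, both measured with 8 threads.

```python
def parallel_efficiency(speedup, n_threads):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / n_threads


# Speedups reported in this PR, both with 8 worker threads.
reported = {"8-vCPU host": 3.1, "48-vCPU host": 7.18}

for host, speedup in reported.items():
    eff = parallel_efficiency(speedup, 8)
    print(f"{host}: {speedup:.2f}x on 8 threads -> {eff:.0%} of linear scaling")
```

That works out to roughly 39% on the 8-vCPU host and roughly 90% on the 48-vCPU host.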
