
refactor: SVS shared thread pool with rental model (MOD-9881)#925

Open
meiravgri wants to merge 19 commits into main from meiravg_adjustable_thpool

Conversation

Collaborator

@meiravgri meiravgri commented Mar 31, 2026

Overview

Replaces per-index SVS thread pools with a single shared thread pool that can be physically resized at runtime, and introduces a rental-based execution model for parallel_for. This eliminates over-provisioning (previously K indexes × N threads = K×N OS threads) and enables the SVS pool to actually grow/shrink when the external pool capacity changes — something the old design could not do.

Also adds a new C API VecSim_UpdateThreadPoolSize() that combines write-mode toggling with pool resizing, and deprecates SVSParams::num_threads.

Key Guarantees and Assumptions

The fundamental scheduling invariant

The shared SVS thread pool has the same size as the RediSearch worker thread pool (search-workers). Every SVS multi-threaded operation (update, GC) is dispatched via SVSMultiThreadJob, which submits N reserve jobs to the RediSearch worker pool — one per worker thread. Only the threads that actually check in (are not busy with other work) participate in the SVS operation. This means:

  • The sum of all concurrent rent() requests can never exceed the pool capacity, because each rented SVS thread corresponds to a RediSearch worker that checked in via a reserve job. If a worker is busy, it doesn't check in, so it's never counted in availableThreads and never requested from the SVS pool.
  • setParallelism(n) is always called with n = min(availableThreads, problemSize), where availableThreads ≤ poolSize(). This is why the assertion n <= pool->size() is safe.
  • Shrink while threads are rented is safe by design — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and its OS thread) alive until ~RentedThreads releases it. Resize never blocks on in-flight operations.
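The shrink-while-rented interaction above can be modeled in a few lines. This is a toy sketch of the described ownership scheme (illustrative names like `SharedPoolModel`, not the actual VecSim source): slots are handed out as `shared_ptr` copies, so a resize only drops the pool's references and never destroys a slot a renter still holds.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

struct ThreadSlot {
    std::atomic<bool> occupied{false}; // set on rent, cleared lock-free on release
    // (the real ThreadSlot also wraps an svs::threads::Thread)
};

class SharedPoolModel {
    std::vector<std::shared_ptr<ThreadSlot>> slots_;
public:
    explicit SharedPoolModel(size_t n) { resize(n); }
    size_t size() const { return slots_.size(); }

    // Rent up to `count` currently-free slots; the renter co-owns each slot.
    std::vector<std::shared_ptr<ThreadSlot>> rent(size_t count) {
        std::vector<std::shared_ptr<ThreadSlot>> rented;
        for (auto &slot : slots_) {
            if (rented.size() == count) break;
            bool expected = false;
            if (slot->occupied.compare_exchange_strong(expected, true))
                rented.push_back(slot);
        }
        return rented;
    }

    // Physical shrink/grow: never blocks on in-flight rentals.
    void resize(size_t n) {
        while (slots_.size() > n) slots_.pop_back(); // drops the pool's reference only
        while (slots_.size() < n) slots_.emplace_back(std::make_shared<ThreadSlot>());
    }
};
```

Shrinking to zero while two slots are rented leaves both slots alive; the renter releases them by clearing the flag, exactly the property the bullet above relies on.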

Other invariants

  • Pool is always valid — the shared pool singleton is initialized with size 1 on first access (VecSimSVSThreadPoolImpl::instance()). There is no nullptr state; write-in-place mode uses a pool of size 1 (0 worker threads, calling thread only).
  • parallelism_ >= 1 — the calling thread always participates, so parallelism can never be 0.
  • Rental is safe during resize — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and OS thread) alive until ~RentedThreads.
  • No cross-index state leakage — each index has its own shared_ptr<atomic<size_t>> parallelism_; setParallelism() on one index never affects another.

Relationship with RediSearch

  • RediSearch owns the scheduling — it decides when to dispatch SVS jobs and how many workers are available.
  • VecSim owns the threads — the SVS thread pool is a separate set of OS threads inside VecSim, distinct from RediSearch workers.
  • RediSearch's only responsibility toward the SVS pool is calling VecSim_UpdateThreadPoolSize(N) when the worker count changes, so VecSim can create/destroy the matching number of OS threads.
  • The two pools are always the same size: RediSearch workers = N, SVS threads = N. This 1:1 sizing is what makes the "sum of rents ≤ capacity" guarantee work.
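The counting argument behind that guarantee can be sketched as a toy model (illustrative helper names, not VecSim source): one reserve job is submitted per RediSearch worker, only idle workers check in, and parallelism is the minimum of checked-in workers and problem size, so no rent request can exceed the pool size N.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Count workers that check in via their reserve job.
size_t checkedInWorkers(const std::vector<bool> &workerBusy) {
    size_t available = 0;
    for (bool busy : workerBusy)
        if (!busy) ++available; // a busy worker never checks in
    return available;
}

// Parallelism requested from the SVS pool: always <= pool size,
// because availableThreads <= N by construction.
size_t chooseParallelism(size_t availableThreads, size_t problemSize) {
    return std::min(availableThreads, problemSize);
}
```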

Changes from Old Implementation

| Aspect | Old (per-index pools) | New (shared pool + rental) |
|---|---|---|
| Pool ownership | Each SVS index creates its own VecSimSVSThreadPoolImpl | Singleton VecSimSVSThreadPoolImpl shared across all indexes via shared_ptr |
| Thread lifecycle | Pre-allocated at max capacity; resize() clamped a logical counter | Physical grow/shrink: resize() spawns/destroys OS threads |
| Pool could be null | Yes (write-in-place = no pool) | Never null; write-in-place = pool with size 1 |
| parallel_for threading | Threads used directly from the per-index pool | Threads rented from shared pool under mutex, released lock-free via atomic&lt;bool&gt; |
| Concurrent index operations | Each index had dedicated threads (K×N total) | Indexes rent disjoint subsets from N shared threads |
| Per-index parallelism control | setNumThreads(n) / resize(n) — mutated the pool | setParallelism(n) — sets a per-index atomic&lt;size_t&gt;, pool unchanged |
| Pool size query | capacity() (max pre-allocated) | poolSize() (current shared pool size) |
| Error handling | manage_exception_during_run() threw unconditionally | manage_workers_after_run() only throws if an error occurred |

API Changes

Renamed methods

| Old | New | Scope |
|---|---|---|
| setNumThreads(n) | setParallelism(n) | SVSIndex, SVSIndexBase virtual interface |
| getNumThreads() | getParallelism() | SVSIndex, SVSIndexBase virtual interface |
| getThreadPoolCapacity() | getPoolSize() | SVSIndex, SVSIndexBase virtual interface |

Removed methods

| Method | Reason |
|---|---|
| VecSimSVSThreadPool::capacity() | No pre-allocation; use poolSize() for shared pool size |
| VecSimSVSThreadPool::resize() | Wrappers don't own threads; use setParallelism() for per-index control, VecSimSVSThreadPool::resize(size_t) (static) for global pool |

New C API

```c
// Combines write-mode toggling + physical SVS pool resize.
// new_size == 0 → WriteInPlace mode, pool resized to 1
// new_size >  0 → WriteAsync mode, pool resized to new_size
void VecSim_UpdateThreadPoolSize(size_t new_size);
```
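A hedged sketch of the semantics documented above. `VecSim_UpdateThreadPoolSize` and `VecSim_SetWriteMode` are real API names from this PR, but the enum values and the `resizeSharedPool` stub below are illustrative stand-ins for state the real library keeps elsewhere, not the actual implementation.

```cpp
#include <cstddef>

enum WriteMode { WriteInPlace, WriteAsync };

static WriteMode g_mode = WriteInPlace; // stand-in for VecSim's write-mode state
static size_t g_poolSize = 1;           // stand-in for the shared pool's size

static void setWriteMode(WriteMode m) { g_mode = m; }
static void resizeSharedPool(size_t n) { g_poolSize = n; }

void VecSim_UpdateThreadPoolSize(size_t new_size) {
    if (new_size == 0) {
        setWriteMode(WriteInPlace);
        resizeSharedPool(1); // size 1 = calling thread only, 0 worker threads
    } else {
        setWriteMode(WriteAsync);
        resizeSharedPool(new_size);
    }
}
```

The key design point is that callers (RediSearch) express only the desired worker count; the mode toggle falls out of whether that count is zero.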

New internal APIs

| API | Purpose |
|---|---|
| VecSimSVSThreadPoolImpl::instance() | Static singleton accessor; returns shared_ptr&lt;VecSimSVSThreadPoolImpl&gt; by value |
| VecSimSVSThreadPool(void* log_ctx) | Constructor — uses shared pool singleton internally |
| VecSimSVSThreadPool::resize(size_t) | Static method — resizes the shared pool singleton |
| VecSimSVSThreadPool::poolSize() | Returns shared pool size (for scheduling reserve jobs) |
| VecSimSVSThreadPoolImpl::rent(count, log_ctx) | Rents worker threads with RAII guard (RentedThreads) |

Deprecated

| API | Replacement | Notes |
|---|---|---|
| VecSim_SetWriteMode() | VecSim_UpdateThreadPoolSize() | Kept for internal VecSim usage (called inside VecSim_UpdateThreadPoolSize) and unit tests. Not called by RediSearch — RediSearch will use only VecSim_UpdateThreadPoolSize |
| SVSParams::num_threads | Shared pool singleton | Deprecated and ignored. Setting it to a non-zero value emits a deprecation warning log. Kept for backward compatibility with existing callers/tests. Pool size is controlled globally via VecSim_UpdateThreadPoolSize() |

New types

  • ThreadSlot — wraps svs::threads::Thread + atomic<bool> occupied; non-copyable, non-movable
  • RentedThreads — RAII guard; move-only; releases slots via lock-free atomic stores in destructor
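A sketch of what these two types plausibly look like, inferred from the PR text (assumed shapes, not the actual source): `ThreadSlot` pairs a worker with a lock-free occupancy flag, and `RentedThreads` is a move-only RAII guard whose destructor releases every slot with a plain atomic store, so no mutex is needed on the release path.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct ThreadSlot {
    std::atomic<bool> occupied{false};
    ThreadSlot() = default;
    ThreadSlot(const ThreadSlot &) = delete;            // non-copyable,
    ThreadSlot &operator=(const ThreadSlot &) = delete; // and hence non-movable
    // (the real ThreadSlot also wraps an svs::threads::Thread)
};

class RentedThreads {
    std::vector<std::shared_ptr<ThreadSlot>> slots_;
public:
    explicit RentedThreads(std::vector<std::shared_ptr<ThreadSlot>> s)
        : slots_(std::move(s)) {}
    RentedThreads(RentedThreads &&) = default; // move-only guard
    RentedThreads &operator=(RentedThreads &&) = default;
    RentedThreads(const RentedThreads &) = delete;

    size_t count() const { return slots_.size(); }

    // Lock-free release: flip each slot's flag back to free.
    ~RentedThreads() {
        for (auto &slot : slots_)
            if (slot)
                slot->occupied.store(false, std::memory_order_release);
    }
};
```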

Behavioral Changes

| Area | Old behavior | New behavior |
|---|---|---|
| setParallelism(0) | resize(0) silently clamped to 1 | Asserts (debug) / undefined (release) — reserving 0 threads is a bug |
| setParallelism(n > poolSize) | Grew the per-index pool beyond search-workers | Asserts n <= pool->size() — scheduling bug |
| n > parallelism in parallel_for | Threw ThreadingException if n > size_ | Asserts n <= parallelism_ on the wrapper — bug in SVS partitioning |
| parallelism_ initial value | Implicitly matched pool size at construction | Starts at 1 — safe for immediate write-in-place use |
| Fewer threads rented than requested | N/A (no rental) | Logs a warning via logCallback, asserts in debug, uses rented.count() for graceful degradation in release |
| Exception in parallel_for | manage_exception_during_run() always threw | manage_workers_after_run() only throws if an error actually occurred |
| updateSVSIndexWrapper | Always locked backend even if no vectors to move | Skips lock when labels_to_move is empty |
| SVSParams::num_threads set to non-zero | Used as per-index pool size | Ignored; logs a deprecation warning |

Public info API fields — unchanged

The debug info field names numThreads and lastReservedThreads (and their string keys NUM_THREADS, LAST_RESERVED_NUM_THREADS) are unchanged. Their semantics shift slightly:

  • numThreads → now reports getPoolSize() (shared pool size, not per-index capacity)
  • lastReservedThreads → now reports getParallelism() (per-index parallelism for last operation)

What Does NOT Change

  • ScalableVectorSearch (deps/ScalableVectorSearch/) — no modifications
  • SVSMultiThreadJob pattern — still submits N reserve jobs to the external worker pool
  • HNSW tiered indexes — unaffected
  • SVS Thread state machine — used as-is; lifecycle managed via shared_ptr

Files Changed

| File | Changes |
|---|---|
| src/VecSim/algorithms/svs/svs_utils.h | Core refactor: ThreadSlot, RentedThreads, rental-based VecSimSVSThreadPoolImpl, per-index VecSimSVSThreadPool wrapper |
| src/VecSim/algorithms/svs/svs.h | Singleton accessor, API renames, constructor takes shared pool, num_threads deprecation warning |
| src/VecSim/algorithms/svs/svs_tiered.h | API renames, empty-labels guard in updateSVSIndexWrapper |
| src/VecSim/vec_sim.h / vec_sim.cpp | New VecSim_UpdateThreadPoolSize() C API |
| src/VecSim/vec_sim_common.h | SVSParams::num_threads marked as deprecated in comments |
| tests/unit/test_svs_tiered.cpp | Adapted to shared pool: VecSim_UpdateThreadPoolSize in setup, removed num_threads from params, updated assertions |
| tests/unit/test_svs_fp16.cpp | Same test adaptations |
| tests/unit/test_svs.cpp | numThreads info field expects default; new NumThreadsParamIgnored test |
| tests/benchmark/bm_utils.h / bm_vecsim_svs.h | API renames in benchmark helpers |

Testing

  • All existing SVS tiered tests updated to use the shared pool model
  • testThreadPool rewritten to cover: default parallelism, setParallelism boundary assertions (ASSERT_DEATH), parallel_for with n < parallelism, write-in-place mode, exception handling
  • TestDebugInfoThreadCount / TestDebugInfoThreadCountWriteInPlace updated: lastReservedThreads now starts at 1 (not num_threads)
  • ThreadsReservation test uses tieredIndexMock(num_threads) to properly size the shared pool
  • New: NumThreadsParamIgnored — verifies that setting SVSParams::num_threads emits a deprecation warning, does not affect the shared pool size, and that omitting it produces no warning
  • Tests that previously set num_threads = 1 in SVSParams now use tieredIndexMock(1) to resize the shared pool instead

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

High Risk
High risk because it rewrites SVS threading/concurrency primitives (shared pool, rental/RAII, resize semantics) and changes write-mode/thread configuration behavior, which can impact correctness and performance under load.

Overview
Switches SVS from per-index thread pools to a shared singleton pool that can grow/shrink at runtime, and updates SVS execution to rent worker threads per parallel_for call with RAII release and safer resize behavior.

Renames the SVS thread controls from get/setNumThreads and getThreadPoolCapacity to get/setParallelism and getPoolSize, updates tiered SVS update/GC scheduling accordingly, and skips backend locking in the update job when there are no vectors to move.

Adds VecSim_UpdateThreadPoolSize() to globally resize the shared SVS pool and toggle write mode, marks SVSParams::num_threads as deprecated/ignored (with a warning when set), and updates benchmarks/unit tests (including a new test_svs_threadpool.cpp and a deprecation test) to match the new semantics and debug-info reporting.

Reviewed by Cursor Bugbot for commit 8210172.

…l terminology**

Rename `setNumThreads`/`getNumThreads` → `setParallelism`/`getParallelism` and `getThreadPoolCapacity` → `getPoolSize` across VectorSimilarity. Public info API fields (`numThreads`, `lastReservedThreads`, `NUM_THREADS`, `LAST_RESERVED_NUM_THREADS`) remain unchanged. No behavioral changes.

jit-ci bot commented Mar 31, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit


codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 93.57798% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.90%. Comparing base (f177c40) to head (8210172).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/VecSim/algorithms/svs/svs_utils.h | 91.86% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   96.98%   96.90%   -0.09%     
==========================================
  Files         129      129              
  Lines        7574     7631      +57     
==========================================
+ Hits         7346     7395      +49     
- Misses        228      236       +8     

☔ View full report in Codecov by Sentry.

thpool tests
fix fp16 tests
@meiravgri meiravgri changed the title SVS Shared Thread Pool Refactor (MOD-9881) refactor: SVS shared thread pool with rental model (MOD-9881) Apr 7, 2026
rfsaliev previously approved these changes Apr 8, 2026

@rfsaliev rfsaliev left a comment


LGTM, but some suggestions.

```cpp
// Singleton accessor for the shared SVS thread pool.
// Always valid — initialized with size 1 (write-in-place mode: 0 worker threads,
// only the calling thread participates). Resized on VecSim_UpdateThreadPoolSize() calls.
static std::shared_ptr<VecSimSVSThreadPoolImpl> getSharedThreadPool() {
```


It seems that VecSimSVSThreadPoolImpl is expected to always be used as a singleton.
If this assumption is true, I would suggest using a more generic idiom:

```cpp
class VecSimSVSThreadPoolImpl {
...
private:
    explicit VecSimSVSThreadPoolImpl(size_t num_threads = 1) {...}
public:
    static std::shared_ptr<VecSimSVSThreadPoolImpl> instance() {...}
...
};
```
  1. Such an idiom prevents non-singleton usage.
  2. It moves thread-pool lifetime management out of the SVSIndex class.

Moreover, I would move full ownership to VecSimSVSThreadPool, possibly hiding VecSimSVSThreadPoolImpl in a details namespace:

```cpp
class VecSimSVSThreadPool {
...
public:
    explicit VecSimSVSThreadPool(void *log_ctx = nullptr)
        : pool_(details::VecSimSVSThreadPoolImpl::instance()),
          parallelism_(std::make_shared<std::atomic<size_t>>(1)),
          log_ctx_(log_ctx) {
        assert(pool_ && "Pool must not be null");
    }

    static auto resize(size_t new_size) {
        return details::VecSimSVSThreadPoolImpl::instance()->resize(new_size);
    }
...
};
```

Comment on lines +529 to +530
```cpp
void manage_workers_after_run(const std::string &main_thread_error, RentedThreads &rented,
                              size_t rented_count) {
```


Why are there 3 arguments if rented.count() == rented_count here?

Collaborator Author


no good reason :)

rfsaliev previously approved these changes Apr 9, 2026

@rfsaliev rfsaliev left a comment


LGTM


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit ca4b396.

