feat: Distribute chunks with replication back-off instead of panicking#33
feat: Distribute chunks with replication back-off instead of panicking#33define-null wants to merge 2 commits into
Conversation
Unfortunately, this approach loses the consistency properties. Once the version restriction is removed, the data may get reshuffled across workers. Or at least I don't know a proof that it wouldn't. See the discussion in #25 I think a simpler approach would be to drop the chunk replica if it couldn't fit anywhere, and later just check that each chunk has the minimum required replication factor. We could indeed go layer by layer — first replicas of all chunks first, then second replicas and so on. But now we can't easily switch to that behaviour because it would require an enormous reshuffle |
|
Reshuffling is actually reduced on this branch when using the simulation: master: this PR: |
|
@kalabukdima I have cherry-picked #34 to this branch. Here is the run: |
|
@kalabukdima So if there are concerns regarding the distribution - let's come with the specific use-cases that should be validated. |
…nk placement Existing scheduling algorithm doesn't do backoff of the replication factor which may result to distribute chunks across the replicas, even if the capacity is sufficient for the replication factor >= min-replication
The scheduler estimates a replication factor — how many copies of each chunk to keep — from the total data size and the total worker capacity. That estimate assumes the data spreads evenly across workers, but it can't always: a worker holds at most one copy of a given chunk and must store it whole, so the chunks don't pack perfectly and some capacity is left unused. The estimate is therefore an upper bound that may not be fully reachable, especially when chunks are large compared to a worker's capacity. Previously the scheduler placed every copy at the estimated factor in a single pass and panicked whenever that factor couldn't be reached. The scheduler now treats the estimate as an upper bound and places copies in phases, getting as close to it as the free space allows and backing off where it can't, instead of failing. The phases are: - version-restricted chunks first, up to the minimum replication, since they can only go on the few upgraded workers; - unrestricted chunks, up to the minimum replication; - all chunks together, adding copies toward the estimate as far as space allows. The minimum replication is required: if it can't be met, scheduling returns a capacity error instead of panicking. A copy's worker-version requirement is checked as it is placed, so the old up-front validation — which rejected setups that needed more copies than there were eligible workers, or that packed too much restricted data for the spare capacity — has been removed, and those cases are now handled without failing. The replication reported per weight is now the smallest factor actually achieved across that weight's chunks — a pessimistic figure, since one chunk that fell short pulls it down — rather than the estimate. New property tests check that scheduling never panics, always meets the minimum replication when it succeeds, and never places two copies of a chunk on the same worker, with regressions for the back-off and restricted-minimum cases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
287834a to
6ce584c
Compare
Closes: https://linear.app/sqd-ai/issue/NET-786/scheduling-missing-replication-backoff
What is this PR about?
The existing scheduling algorithm picks a single replication factor from the available capacity and tries to place every copy at that factor in one pass. When that factor can't actually be realized it panics - even when a lower
factor that is still ≥
min_replicationwould have placed every chunk.The factor is estimated from total data size vs. total worker capacity, which assumes the data packs evenly across workers. It doesn't: a worker holds at most one copy of a given chunk and must store it whole, so when chunks are large relative to a worker's capacity, fewer copies fit than the estimate assumes and the target factor becomes unreachable.
This PR makes the scheduler back off instead of failing. It treats the estimated factor as an upper bound and places copies in phases, getting as close to it as the workers can actually hold:
The minimum replication is still mandatory: if it can't be met, scheduling returns a capacity error rather than panicking.
Notable behaviour changes
Structure
The PR is split into two commits:
min_replicationexists.