
feat: Distribute chunks with replication back-off instead of panicking#33

Open
define-null wants to merge 2 commits into master from defnull/net-786-add-proptest

Conversation

@define-null
Contributor

@define-null define-null commented Jun 2, 2026

Closes: https://linear.app/sqd-ai/issue/NET-786/scheduling-missing-replication-backoff

What is this PR about?

The existing scheduling algorithm picks a single replication factor from the available capacity and tries to place every copy at that factor in one pass. When that factor can't actually be realized it panics, even when a lower factor that is still ≥ min_replication would have placed every chunk.

The factor is estimated from total data size vs. total worker capacity, which assumes the data packs evenly across workers. It doesn't: a worker holds at most one copy of a given chunk and must store it whole, so when chunks are large relative to a worker's capacity, fewer copies fit than the estimate assumes and the target factor becomes unreachable.
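To make the gap between the estimate and what fits concrete, here is a small worked sketch. The function names and the exact form of the estimate (total capacity divided by total data size) are assumptions for illustration, not the scheduler's actual code.

```rust
/// Illustrative estimate: total worker capacity divided by total data size.
/// (Assumed formula; the real scheduler may compute it differently.)
fn estimated_factor(chunk_sizes: &[u64], capacities: &[u64]) -> u64 {
    let data: u64 = chunk_sizes.iter().sum();
    let capacity: u64 = capacities.iter().sum();
    if data == 0 { 0 } else { capacity / data }
}

/// Upper bound on the total number of copies that fit when every chunk has
/// the same size: a worker stores whole chunks only, and at most one copy
/// of each chunk.
fn max_copies_equal_chunks(chunk_size: u64, n_chunks: u64, capacities: &[u64]) -> u64 {
    capacities
        .iter()
        .map(|&cap| (cap / chunk_size).min(n_chunks))
        .sum()
}
```

With two chunks of size 6 and three workers of capacity 10, `estimated_factor` gives 30 / 12 = 2, which would need 4 copies in total. Each worker has room for only one whole chunk, so `max_copies_equal_chunks` gives 3, and factor 2 is unreachable even though factor 1 fits.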

This PR makes the scheduler back off instead of failing. It treats the estimated factor as an upper bound and places copies in phases, getting as close to it as the workers can actually hold:

  • version-restricted chunks first, up to the minimum replication (they can only go on the few upgraded workers);
  • unrestricted chunks, up to the minimum replication;
  • all chunks together, topped up toward the estimate as far as space allows.

The minimum replication is still mandatory: if it can't be met, scheduling returns a capacity error rather than panicking.
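The phases above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Chunk`, `Worker`, `CapacityError`, `place_one`, and `schedule` are hypothetical names, and the "pick the eligible worker with the most free space" rule is an assumed heuristic.

```rust
#[derive(Debug, PartialEq)]
struct CapacityError { chunk: usize, placed: usize, required: usize }

struct Chunk { size: u64, restricted: bool }
struct Worker { free: u64, upgraded: bool }

/// Try to add one copy of chunk `c`. Returns false when no eligible worker
/// (enough space, right version, not already holding the chunk) exists.
fn place_one(c: usize, chunks: &[Chunk], workers: &mut [Worker], holders: &mut [Vec<usize>]) -> bool {
    let chunk = &chunks[c];
    let best = workers
        .iter()
        .enumerate()
        .filter(|(w, wk)| {
            wk.free >= chunk.size
                && (!chunk.restricted || wk.upgraded)
                && !holders[c].contains(w)
        })
        .max_by_key(|(_, wk)| wk.free)
        .map(|(w, _)| w);
    match best {
        Some(w) => {
            workers[w].free -= chunk.size;
            holders[c].push(w);
            true
        }
        None => false,
    }
}

/// Returns, for each chunk, the indices of the workers holding a copy.
fn schedule(
    chunks: &[Chunk],
    workers: &mut [Worker],
    min_repl: usize,
    target: usize,
) -> Result<Vec<Vec<usize>>, CapacityError> {
    let mut holders = vec![Vec::new(); chunks.len()];
    // Phases 1 and 2: restricted chunks first, then unrestricted ones,
    // each up to the mandatory minimum. Failing here is an error.
    for restricted in [true, false] {
        for c in 0..chunks.len() {
            if chunks[c].restricted != restricted {
                continue;
            }
            while holders[c].len() < min_repl {
                if !place_one(c, chunks, workers, &mut holders) {
                    return Err(CapacityError { chunk: c, placed: holders[c].len(), required: min_repl });
                }
            }
        }
    }
    // Phase 3: top up all chunks toward the estimate, one extra copy per
    // round, silently skipping copies that don't fit (the back-off).
    for _ in min_repl..target {
        for c in 0..chunks.len() {
            if holders[c].len() < target {
                place_one(c, chunks, workers, &mut holders);
            }
        }
    }
    Ok(holders)
}
```

With two unrestricted chunks of size 6, three workers with 10 free each, a minimum of 1 and a target of 2, this sketch places three copies in total and returns `Ok`. Raising the minimum to 2 makes it return a `CapacityError` instead.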

Notable behaviour changes

  • Scheduling no longer panics on an unplaceable factor; it returns a result and the caller surfaces it as an error.
  • The up-front version-restriction validation (rejecting setups that needed more copies than eligible workers, or too much restricted data for the spare capacity) is removed — those cases now degrade gracefully during placement.
  • The replication reported per weight is the smallest factor actually achieved across that weight's chunks (a pessimistic figure) rather than the estimate.
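As an illustration of the last point, the per-weight figure can be thought of as a minimum over that weight's chunks. The function name and the `(weight, copies)` input shape below are assumptions made for this sketch.

```rust
use std::collections::BTreeMap;

/// `chunks` pairs each chunk's weight with the number of copies actually
/// placed. The reported factor per weight is the smallest count seen.
fn reported_replication(chunks: &[(u32, usize)]) -> BTreeMap<u32, usize> {
    let mut out = BTreeMap::new();
    for &(weight, copies) in chunks {
        out.entry(weight)
            .and_modify(|min: &mut usize| *min = (*min).min(copies))
            .or_insert(copies);
    }
    out
}
```

For example, if one weight-1 chunk reached 5 copies and another only 4, the reported figure for weight 1 would be 4, since a single chunk that fell short pulls the whole weight down.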

Structure

The PR is split into two commits:

  1. Add the failing property-based test that reproduces the bug - the scheduler can't distribute chunks even when a feasible factor ≥ min_replication exists.
  2. Fix it with the phased, best-effort placement described above. New property tests assert that scheduling never panics, always meets the minimum replication when it succeeds, and never places two copies of a chunk on the same worker, with regressions for the back-off and restricted-minimum cases.
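The two placement invariants named above could be checked by a helper along these lines. It is a std-only sketch with a hypothetical name and input shape (one list of worker indices per chunk); the actual tests in this PR are property tests and may be structured differently.

```rust
use std::collections::HashSet;

/// Checks a successful assignment: every chunk meets the minimum
/// replication, and no worker holds two copies of the same chunk.
fn check_invariants(assignment: &[Vec<usize>], min_replication: usize) -> Result<(), String> {
    for (chunk, workers) in assignment.iter().enumerate() {
        if workers.len() < min_replication {
            return Err(format!(
                "chunk {chunk} has {} copies, needs {min_replication}",
                workers.len()
            ));
        }
        let distinct: HashSet<_> = workers.iter().collect();
        if distinct.len() != workers.len() {
            return Err(format!("chunk {chunk} has two copies on one worker"));
        }
    }
    Ok(())
}
```

A property test would generate random chunk sizes and worker capacities, run the scheduler, and call a check like this whenever scheduling returns a successful result.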

@define-null define-null requested a review from kalabukdima June 2, 2026 10:56
@define-null define-null changed the title from "chore: Add proptest with regression that founds a failure to do a chunk distribution" to "feat: Distribute chunks with replication back-off instead of panicking" Jun 2, 2026
@kalabukdima
Contributor

> version-restricted chunks first

Unfortunately, this approach loses the consistency properties. Once the version restriction is removed, the data may get reshuffled across workers. Or at least I don't know a proof that it wouldn't. See the discussion in #25

I think a simpler approach would be to drop the chunk replica if it couldn't fit anywhere, and later just check that each chunk has the minimum required replication factor.

We could indeed go layer by layer: first replicas of all chunks, then second replicas, and so on. But we can't easily switch to that behaviour now because it would require an enormous reshuffle.
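For reference, the simpler "drop the replica, then check the minimum" alternative might look roughly like the sketch below. The function name, the chunk-by-chunk order, and the most-free-space choice are assumptions for illustration, and version restrictions are left out for brevity.

```rust
/// Place up to `target` copies of each chunk in a single pass, dropping any
/// copy that fits nowhere, then verify the minimum afterwards.
/// On failure, returns the index of the first chunk below the minimum.
fn place_and_check(
    sizes: &[u64],
    free: &mut [u64],
    min_repl: usize,
    target: usize,
) -> Result<Vec<Vec<usize>>, usize> {
    let mut holders: Vec<Vec<usize>> = vec![Vec::new(); sizes.len()];
    for (c, &size) in sizes.iter().enumerate() {
        for _ in 0..target {
            // Eligible worker: enough free space and no copy of this chunk yet.
            let slot = (0..free.len())
                .filter(|w| free[*w] >= size && !holders[c].contains(w))
                .max_by_key(|&w| free[w]);
            match slot {
                Some(w) => {
                    free[w] -= size;
                    holders[c].push(w);
                }
                // Nothing fits: drop this replica and any further ones.
                None => break,
            }
        }
    }
    match holders.iter().position(|h| h.len() < min_repl) {
        Some(chunk) => Err(chunk),
        None => Ok(holders),
    }
}
```

Because the pass order stays chunk by chunk rather than layer by layer, a sketch like this keeps the existing placement order and only changes what happens when a copy does not fit.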

@define-null
Contributor Author

Reshuffling is actually reduced on this branch when using the simulation:

wget https://metadata.sqd-datasets.io/assignments/tethys/2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz

master:

time cargo run --release -p reshuffle-sim -- -c ./examples/testnet_config.yaml --steps 10 2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz --report report.html
    Finished `release` profile [optimized] target(s) in 0.15s
     Running `target/release/reshuffle-sim -c ./examples/testnet_config.yaml --steps 10 '2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz' --report report.html`
Loading baseline assignment from "2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz"
Loaded 95092 chunks across 8 datasets, 67 workers
 Step |   New chunks |   Total chunks |                    Replication |       New DL |          Shuffled |     Shuffle DL | Repl DL change |  Total DL(*) |     Free cap |  Used% |      W new |     W lost | W shuffled
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    1 |         1000 |          96092 |                      1:5, 3:15 |    307.4 GiB |        137 chunks |        5.9 GiB |    0 B+ / 0 B  - |    313.3 GiB |    803.7 GiB |   97.3% |         67 |          2 |         60
    2 |         1000 |          97092 |                      1:5, 3:15 |    313.8 GiB |       1156 chunks |       37.9 GiB |    0 B+ / 0 B  - |    351.7 GiB |    489.9 GiB |   98.4% |         67 |         14 |         66
    3 |         1000 |          98092 |                      1:4, 3:14 |    271.7 GiB |       1055 chunks |       36.6 GiB |    0 B+ / 3.6 TiB- |    308.3 GiB |      3.9 TiB |   86.9% |         67 |         67 |         14
    4 |         1000 |          99092 |                      1:4, 3:14 |    286.0 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    286.0 GiB |      3.6 TiB |   87.9% |         67 |          0 |          0
    5 |         1000 |         100092 |                      1:4, 3:14 |    247.6 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    247.6 GiB |      3.3 TiB |   88.7% |         67 |          0 |          0
    6 |         1000 |         101092 |                      1:4, 3:14 |    246.8 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    246.8 GiB |      3.1 TiB |   89.5% |         67 |          0 |          0
    7 |         1000 |         102092 |                      1:4, 3:14 |    270.5 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    270.5 GiB |      2.8 TiB |   90.4% |         67 |          0 |          0
    8 |         1000 |         103092 |                      1:4, 3:14 |    272.4 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    272.4 GiB |      2.6 TiB |   91.3% |         67 |          0 |          0
    9 |         1000 |         104092 |                      1:4, 3:14 |    253.2 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    253.2 GiB |      2.3 TiB |   92.1% |         67 |          0 |          0
   10 |         1000 |         105092 |                      1:4, 3:13 |    245.4 GiB |          0 chunks |            0 B |    0 B+ / 1.2 TiB- |    245.4 GiB |      3.2 TiB |   89.0% |         67 |         67 |          0
Report written to report.html
cargo run --release -p reshuffle-sim -- -c ./examples/testnet_config.yaml  10  221.61s user 3.39s system 1280% cpu 17.572 total

this PR:

target/release/reshuffle-sim -c ./examples/testnet_config.yaml --steps 10 '2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz' --report report.html
Loading baseline assignment from "2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz"
Loaded 95092 chunks across 8 datasets, 67 workers
 Step |   New chunks |   Total chunks |                    Replication |       New DL |          Shuffled |     Shuffle DL | Repl DL change |  Total DL(*) |     Free cap |  Used% |      W new |     W lost | W shuffled
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    1 |         1000 |          96092 |                      1:5, 3:15 |    307.4 GiB |        168 chunks |        9.6 GiB |    0 B+ / 0 B  - |    317.1 GiB |    803.7 GiB |   97.3% |         67 |         35 |         57
    2 |         1000 |          97092 |                      1:5, 3:15 |    313.8 GiB |        593 chunks |       36.5 GiB |    0 B+ / 0 B  - |    350.4 GiB |    489.9 GiB |   98.4% |         67 |         14 |         65
    3 |         1000 |          98092 |                      1:4, 3:14 |    271.7 GiB |          0 chunks |            0 B |    0 B+ / 3.6 TiB- |    271.7 GiB |      3.9 TiB |   86.9% |         67 |         67 |          0
    4 |         1000 |          99092 |                      1:4, 3:14 |    286.0 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    286.0 GiB |      3.6 TiB |   87.9% |         67 |          0 |          0
    5 |         1000 |         100092 |                      1:4, 3:14 |    247.6 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    247.6 GiB |      3.3 TiB |   88.7% |         67 |          0 |          0
    6 |         1000 |         101092 |                      1:4, 3:14 |    246.8 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    246.8 GiB |      3.1 TiB |   89.5% |         67 |          0 |          0
    7 |         1000 |         102092 |                      1:4, 3:14 |    270.5 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    270.5 GiB |      2.8 TiB |   90.4% |         67 |          0 |          0
    8 |         1000 |         103092 |                      1:4, 3:14 |    272.4 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    272.4 GiB |      2.6 TiB |   91.3% |         67 |          0 |          0
    9 |         1000 |         104092 |                      1:4, 3:14 |    253.2 GiB |          0 chunks |            0 B |    0 B+ / 0 B  - |    253.2 GiB |      2.3 TiB |   92.1% |         67 |          0 |          0
   10 |         1000 |         105092 |                      1:4, 3:13 |    245.4 GiB |          0 chunks |            0 B |    0 B+ / 1.2 TiB- |    245.4 GiB |      3.2 TiB |   89.0% |         67 |         67 |          0
Report written to report.html
cargo run --release -p reshuffle-sim -- -c ./examples/testnet_config.yaml  10  149.91s user 2.10s system 473% cpu 32.093 total

@define-null
Contributor Author

define-null commented Jun 2, 2026

@kalabukdima I have cherry-picked #34 to this branch. Here is the run:

cargo run -p reshuffle-sim -- -c examples/testnet_config.yaml \
  --chunks-per-step 1000 --steps 20 \
  --initial-new-fraction 1.0 \
  --restricted-fraction 0.5 \
  --lift-restriction-at-step 11 \
  --report report.html 2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz
Loading baseline assignment from "2026-05-25T09:20:03_EA5D1CA701776B3B4ECCECDEAEFECD8069F15C8F1D5C56D50903090698212136.fb.1.gz"
Loaded 95092 chunks across 8 datasets, 67 workers
Worker version distribution: 10.0.0:67
┌──────┬────────────┬───────────────┬───────────┬───────────┬────────────────┬─────────────────┬───────────┬───────────┬───────────────────┬─────┬──────┬──────────┬──────────┬────────┐
│ Step │    New     │       Total   │     Repl. │  New      │     Shuffled   │        Repl chg │  Total    │  Restr.   │          Free cap │        Workers        │ Upgraded │ Sched. │
├      ┼    chunks  ┼       chunks  ┼           ┼  download ┼     chunks     ┼        +gain/   ┼  download ┼  download ┼          (used %) ┼─────┼──────┼──────────┼ workers  ┼        ┤
│      │    (restr) │       (restr) │           │           │     (download) │        -freed   │           │           │                   │ new │ lost │ shuffled │          │        │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    1 │ 1000 (500) │   96092 (500) │ 1:5, 3:15 │ 338.8 GiB │ 185 (10.7 GiB) │     +0 B / -0 B │ 349.5 GiB │ 170.9 GiB │ 772.3 GiB (97.4%) │  67 │   35 │       62 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    2 │ 1000 (500) │  97092 (1000) │ 1:5, 3:15 │ 299.5 GiB │ 661 (40.7 GiB) │     +0 B / -0 B │ 340.2 GiB │ 147.4 GiB │ 472.9 GiB (98.4%) │  67 │   16 │       66 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    3 │ 1000 (500) │  98092 (1500) │ 1:4, 3:14 │ 288.1 GiB │        0 (0 B) │ +0 B / -3.6 TiB │ 288.1 GiB │ 140.7 GiB │   3.8 TiB (87.0%) │  67 │   67 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    4 │ 1000 (500) │  99092 (2000) │ 1:4, 3:14 │ 274.1 GiB │        0 (0 B) │     +0 B / -0 B │ 274.1 GiB │ 129.7 GiB │   3.6 TiB (87.9%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    5 │ 1000 (500) │ 100092 (2500) │ 1:4, 3:14 │ 256.7 GiB │        0 (0 B) │     +0 B / -0 B │ 256.7 GiB │ 123.6 GiB │   3.3 TiB (88.8%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    6 │ 1000 (500) │ 101092 (3000) │ 1:4, 3:14 │ 269.1 GiB │        0 (0 B) │     +0 B / -0 B │ 269.1 GiB │ 135.2 GiB │   3.0 TiB (89.7%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    7 │ 1000 (500) │ 102092 (3500) │ 1:4, 3:14 │ 261.4 GiB │        0 (0 B) │     +0 B / -0 B │ 261.4 GiB │ 140.4 GiB │   2.8 TiB (90.5%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    8 │ 1000 (500) │ 103092 (4000) │ 1:4, 3:14 │ 246.7 GiB │        0 (0 B) │     +0 B / -0 B │ 246.7 GiB │ 129.8 GiB │   2.5 TiB (91.4%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│    9 │ 1000 (500) │ 104092 (4500) │ 1:4, 3:14 │ 267.6 GiB │        0 (0 B) │     +0 B / -0 B │ 267.6 GiB │ 139.5 GiB │   2.3 TiB (92.3%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   10 │ 1000 (500) │ 105092 (5000) │ 1:4, 3:13 │ 250.0 GiB │        0 (0 B) │ +0 B / -1.2 TiB │ 250.0 GiB │ 120.1 GiB │   3.2 TiB (89.2%) │  67 │   67 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   11 │   1000 (0) │    101092 (0) │ 1:4, 3:14 │ 271.6 GiB │        0 (0 B) │ +1.1 TiB / -0 B │   1.4 TiB │       0 B │   3.1 TiB (89.6%) │  67 │   67 │       67 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   12 │   1000 (0) │    102092 (0) │ 1:4, 3:14 │ 269.2 GiB │        0 (0 B) │     +0 B / -0 B │ 269.2 GiB │       0 B │   2.8 TiB (90.5%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   13 │   1000 (0) │    103092 (0) │ 1:4, 3:14 │ 270.1 GiB │        0 (0 B) │     +0 B / -0 B │ 270.1 GiB │       0 B │   2.5 TiB (91.4%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   14 │   1000 (0) │    104092 (0) │ 1:4, 3:14 │ 272.1 GiB │        0 (0 B) │     +0 B / -0 B │ 272.1 GiB │       0 B │   2.3 TiB (92.3%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   15 │   1000 (0) │    105092 (0) │ 1:4, 3:13 │ 268.0 GiB │        0 (0 B) │ +0 B / -1.2 TiB │ 268.0 GiB │       0 B │   3.2 TiB (89.2%) │  67 │   67 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   16 │   1000 (0) │    106092 (0) │ 1:4, 3:13 │ 259.9 GiB │        0 (0 B) │     +0 B / -0 B │ 259.9 GiB │       0 B │   2.9 TiB (90.1%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   17 │   1000 (0) │    107092 (0) │ 1:4, 3:13 │ 250.8 GiB │        0 (0 B) │     +0 B / -0 B │ 250.8 GiB │       0 B │   2.7 TiB (90.9%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   18 │   1000 (0) │    108092 (0) │ 1:4, 3:13 │ 238.7 GiB │        0 (0 B) │     +0 B / -0 B │ 238.7 GiB │       0 B │   2.4 TiB (91.7%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   19 │   1000 (0) │    109092 (0) │ 1:4, 3:13 │ 270.7 GiB │        0 (0 B) │     +0 B / -0 B │ 270.7 GiB │       0 B │   2.2 TiB (92.6%) │  67 │    0 │        0 │       67 │    yes │
├──────┼────────────┼───────────────┼───────────┼───────────┼────────────────┼─────────────────┼───────────┼───────────┼───────────────────┼─────┼──────┼──────────┼──────────┼────────┤
│   20 │   1000 (0) │    110092 (0) │ 1:4, 3:13 │ 245.1 GiB │        0 (0 B) │     +0 B / -0 B │ 245.1 GiB │       0 B │   1.9 TiB (93.4%) │  67 │    0 │        0 │       67 │    yes │
└──────┴────────────┴───────────────┴───────────┴───────────┴────────────────┴─────────────────┴───────────┴───────────┴───────────────────┴─────┴──────┴──────────┴──────────┴────────┘

Chunks added: 15000 (version-restricted: 0, non-restricted: 15000)
Workers upgraded to 10.0.0: 67 of 67
Report written to report.html

@define-null
Contributor Author

@kalabukdima So if there are concerns regarding the distribution, let's come up with the specific use cases that should be validated.

define-null and others added 2 commits June 3, 2026 13:25
…nk placement

The existing scheduling algorithm doesn't back off the replication factor,
which may result in a failure to distribute chunks across the workers, even
if the capacity is sufficient for a replication factor >= min-replication.

The scheduler estimates a replication factor — how many copies of each
chunk to keep — from the total data size and the total worker capacity.
That estimate assumes the data spreads evenly across workers, but it
can't always: a worker holds at most one copy of a given chunk and must
store it whole, so the chunks don't pack perfectly and some capacity is
left unused. The estimate is therefore an upper bound that may not be
fully reachable, especially when chunks are large compared to a worker's
capacity. Previously the scheduler placed every copy at the estimated
factor in a single pass and panicked whenever that factor couldn't be
reached.

The scheduler now treats the estimate as an upper bound and places copies
in phases, getting as close to it as the free space allows and backing
off where it can't, instead of failing. The phases are:

- version-restricted chunks first, up to the minimum replication, since
  they can only go on the few upgraded workers;
- unrestricted chunks, up to the minimum replication;
- all chunks together, adding copies toward the estimate as far as space
  allows.

The minimum replication is required: if it can't be met, scheduling
returns a capacity error instead of panicking.

A copy's worker-version requirement is checked as it is placed, so the
old up-front validation — which rejected setups that needed more copies
than there were eligible workers, or that packed too much restricted data
for the spare capacity — has been removed, and those cases are now handled
without failing.

The replication reported per weight is now the smallest factor actually
achieved across that weight's chunks — a pessimistic figure, since one
chunk that fell short pulls it down — rather than the estimate.

New property tests check that scheduling never panics, always meets the
minimum replication when it succeeds, and never places two copies of a
chunk on the same worker, with regressions for the back-off and
restricted-minimum cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@define-null define-null force-pushed the defnull/net-786-add-proptest branch from 287834a to 6ce584c Compare June 3, 2026 11:25