Add Normalized DN Cache with sharded S3-FIFO design by droideck · Pull Request #27 · 389ds/389ds.github.io

droideck · 2026-05-12T03:09:19Z

No description provided.

Firstyear

Very interesting proposal, I'm curious about the locking design however. :)

Firstyear · 2026-05-12T06:17:36Z

+no notion of stale entries.
+
+The cache is split into 64 shards. Each shard runs the eviction algorithm
+independently over its own hash table and its own lock, so reads on


Are shards "per thread" or based on sharding on the hash of the input/key value?

Hashed by key. Shard index is hash & (num_shards - 1), so threads compete on locks per-key, not per-thread. It's a single shared cache, not per-thread tables.

BTW, specifically the low bits, not the high bits, because hashbrown's H2 tag (the byte SIMD compares against during probes) is the top 7 bits of the hash. Sharding on high bits would collapse the H2 tags within a shard to a tiny set and explode SIMD probe collisions inside hashbrown. Low bits keep the shard index independent of the tag.

Background on the hashbrown control byte layout: https://deepwiki.com/rust-lang/hashbrown/2-core-architecture
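
As a rough illustration of the point above, here is a minimal sketch of low-bit shard selection next to the 7-bit tag taken from the top of the hash. The function names `shard_index` and `h2_tag` are hypothetical (they are not taken from the PR), and the `>> 57` shift assumes a 64-bit hash on a 64-bit target.

```rust
/// Hypothetical helper: choose a shard from the LOW bits of the hash.
/// `num_shards` must be a power of two for the mask to be valid.
pub fn shard_index(hash: u64, num_shards: usize) -> usize {
    debug_assert!(num_shards.is_power_of_two());
    (hash as usize) & (num_shards - 1)
}

/// Illustration only: the 7-bit tag taken from the TOP bits of a
/// 64-bit hash, which is the part hashbrown compares during probes.
pub fn h2_tag(hash: u64) -> u8 {
    ((hash >> 57) & 0x7f) as u8
}
```

With 64 shards the shard index uses bits 0..=5 and the tag uses bits 57..=63, so the two never overlap and keys in one shard can still carry any of the 128 tag values.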

So the issue then is if you have a very active key, then all your threads will serialise around that single lock as they compete for it. We see this commonly where a customer has an SSSD service account that authenticates "thousands of times per second" so this scheme may impact that kind of scenario.

This is not clear to me. We have a kind of 64-slot (shard) hash table containing keys, correct?
If we want to remove a key from a shard, how many locks are required? One for the key? One for the shard?

@tbordaz In the design of this cache there is no such action as "remove", it just relies on the key aging out.

Right. With the eviction algorithm being per shard, does that mean evicting a key only requires the lock of the key?

Only the lock of the shard that the key is in.

> So the issue then is if you have a very active key, then all your threads will serialise around that single lock as they compete for it. We see this commonly where a customer has an SSSD service account that authenticates "thousands of times per second" so this scheme may impact that kind of scenario.

Yep, agreed.
That is the weak case for this design. If one DN/key is hot enough, sharding does not help much because that key keeps landing on the same shard lock. I added the hot-DN server tests to check that. The result was close though! Across the tested cache sizes and 16/32/64 threads, S3-FIFO stayed within about 3% of concread either way. I would not present that as a win, but as a measured tradeoff, which I called out in the doc.

That's pretty good at least. The hot-DN case was one that worried me especially because of SSSD and how poorly it behaves as a client.

Firstyear · 2026-05-12T06:22:37Z

+take a transactional snapshot and writers copy-on-write a new version on
+commit. That is the design for caches whose values can change during a
+read, and it remains a candidate for the 389 DS entry cache. The NDN
+cache does not need those guarantees, and on the memberOf suite below


Concread specifically reduces some single thread performance for higher multi-thread performance - is the memberof test multi-threaded or single?

The integration perf suite quoted in the doc is single-threaded, yeah, my mistake...

I saw the Rust microbench numbers (which were for different number of threads and with isolation from any DS bottlenecks) and thought that would be enough to make the multi-thread case:

| threads | s3fifo | concread |
|---|---|---|
| 1 | 179.81 µs | 1.9131 ms |
| 16 | 6.0811 ms | 98.733 ms |
| 64 | 24.982 ms | 377.34 ms |

But you're right that it doesn't replace a real workload. I'll re-run the sustained multi-threaded test in the integration suite we have (16 workers, 600s of MOD_ADD/MOD_DEL on member) and share those results later.

Something to consider in that test is the quiesce behaviour too, but I haven't seen the full test to comment on why arc seems so slow there.

Firstyear · 2026-05-12T06:25:09Z

+Alternatives Considered
+-----------------------
+
+Concread's `ARCache` is a transactional, snapshot-consistent ARC: readers


Interestingly the transactional nature is separate from the algorithm used - if S3-FIFO is very 'good' then there is no reason we couldn't use it in concread instead of ARC.

Also within the way this works in 389-ds, we actually only use readers and we async-submit changes for the main cache to include.

This isn't really a concread-vs-S3-FIFO question. It's whether the NDN cache shape we have today needs the transactional power concread provides.

Concread's ARCache gives readers a transactionally consistent snapshot, a haunted set so evicted-then-revived keys can't clobber newer writes, and an mpsc channel from readers to the writer so reader hits influence eviction (as you describe in CACHE.md). IIUC, these properties are correct for caches whose values can change under readers and whose readers need a stable view across a transaction. The entry cache and the index cache fit that description, yep.

IMO, the NDN cache doesn't. The value is a pure function of the key. Two readers asking for the same DN at different transaction ids get the same answer or a cache miss, never a stale-but-different answer. There's no rollback to support and no temporal ordering between readers and writers to protect. The haunted set and the reader-to-writer channel pay for guarantees the NDN cache doesn't need.

So for this proposal the relevant comparison is concread's transactional shell vs no shell, for a cache whose values are immutable functions of the key. For this specific cache, no shell wins on the hot path, I think. Which is why the new design is sharded S3-FIFO directly rather than ARC-inside-concread.

The reader-doesn't-block-writer property you get from async-submit carries over to the design through a different mechanism: shared read lock with a relaxed atomic counter on hit, write lock only on miss-insert. No reader-to-writer mpsc, but readers stay off the write path on the hot case. Plus the cache is hash sharded across per-shard locks, so reads on different shards never contend at the lock level.
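
To make the hit and miss paths described above concrete, here is a minimal sketch of one shard. It is not the code from the PR: the `Shard` type and its methods are hypothetical, `std::sync::RwLock` stands in for `parking_lot::RwLock` to keep the example dependency-free, and eviction is left out entirely.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// Hypothetical shard: one lock, one table, relaxed stats counters.
pub struct Shard {
    map: RwLock<HashMap<String, String>>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl Shard {
    pub fn new() -> Self {
        Shard {
            map: RwLock::new(HashMap::new()),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Lookup path: shared lock only, plus one relaxed counter update.
    pub fn get(&self, dn: &str) -> Option<String> {
        let map = self.map.read().unwrap();
        match map.get(dn) {
            Some(normalized) => {
                self.hits.fetch_add(1, Ordering::Relaxed);
                Some(normalized.clone())
            }
            None => {
                self.misses.fetch_add(1, Ordering::Relaxed);
                None
            }
        }
    }

    /// Miss path: the exclusive lock is taken only to insert.
    pub fn insert(&self, dn: &str, normalized: String) {
        let mut map = self.map.write().unwrap();
        map.entry(dn.to_string()).or_insert(normalized);
    }

    /// Returns (hits, misses); values are approximate under concurrency.
    pub fn stats(&self) -> (u64, u64) {
        (
            self.hits.load(Ordering::Relaxed),
            self.misses.load(Ordering::Relaxed),
        )
    }
}
```

A caller would try `get` first and call `insert` only after normalizing the DN on a miss, so concurrent readers of cached DNs never take the write lock.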

> This isn't really a concread-vs-S3-FIFO question. It's whether the NDN cache shape we have today needs the transactional power concread provides.

Part of the goal historically of using concread in 389 was to show the cache was effective transactionally, and could then be used in other places where it matters - especially with LMDB for example since it shares an identical transaction model.

However, no one else on the team really ever showed interest in that, and the changes to make 389-ds strongly transactional don't seem to have eventuated even with the changes to support LMDB.

So with that in mind, in this case yes, maybe the s3-fifo is a better option. That's fine. The number of cases for the NDN to rely on transactions would be low.

One area where the concread cache could outshine S3-FIFO is a mostly read-only working set. There, most cache lines stay in the shared state, whereas S3-FIFO will invalidate them frequently because of its atomic updates. But this is very subjective.

Something to try in your benchmarking is different cache sizes - extremely small, moderate, and large ndn cache sizes. This also affects the performance.

Good point. I added a cache-size + multithreaded sweep.

For the multithreaded memberOf cascade run, changing the configured cache size did not change the direction of the result. This run is mostly lookup-path and contention, not eviction pressure. S3-FIFO stayed around +12% ops/s vs concread, with lower p95 in the same runs.

So yeah, on the read-mostly point, I agree this is where concread has the strongest argument. The useful thing from the new runs is that the hot-DN shape stays close to concread, while the multi-key memberOf shape stays ahead across the cache sizes. Hits take the shared side of a per-shard parking_lot::RwLock, not an exclusive mutex. It is still not a zero-write read path though: the hit/miss stats are per-shard AtomicU64 counters with relaxed updates (we don't need great precision there), and the per-entry AtomicU8 freq counter only CASes until it saturates (which is great for the hot DNs). I think that is a reasonable trade for NDN.
WDYT?
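
The "only CASes until it saturates" behaviour can be sketched as follows. This is an illustration rather than the PR's code: `bump_freq` is a hypothetical name, and the cap of 3 is an assumption, since the thread does not state the saturation value.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Assumed saturation point; the thread does not state the real cap.
pub const FREQ_MAX: u8 = 3;

/// Increment `freq` until it reaches FREQ_MAX.
/// Returns true if this call wrote to the counter. Once saturated,
/// the function performs a relaxed load and nothing else.
pub fn bump_freq(freq: &AtomicU8) -> bool {
    let mut cur = freq.load(Ordering::Relaxed);
    while cur < FREQ_MAX {
        match freq.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return true,
            // Lost a race or failed spuriously: retry with the fresh value.
            Err(actual) => cur = actual,
        }
    }
    false
}
```

For a hot DN the counter saturates after a few hits, so later hits only read the counter and stop writing to it.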

> This run is mostly lookup-path and contention, not eviction pressure. S3-FIFO stayed around +12% ops/s vs concread, with lower p95 in the same runs.

There are some other things I want to suggest you can try, just for comparison.

set_reader_quiesce to false and spawn a dedicated quiesce thread.

set_look_back_limit to 8 (decreased from 32)

My guess is these two are the biggest factors behind the p95 changes you're seeing, as well as some of the operational performance difference, especially with the quiesce.

> So yeah, on the read-mostly point, I agree this is where concread has the strongest argument. The useful thing from the new runs is that we see that the hot-DN shape stays close to concread, while the multi-key memberOf shape stays ahead across the cache sizes.

A good question is why memberOf is hitting the NDN cache so hard in the first place. I assume we're still using the "old" memberOf algorithm (compared to the one I proposed around 2017 that wasn't adopted - note; yes, it's still the slow algo).

> Hits take the shared side of a per-shard parking_lot::RwLock, not an exclusive mutex.

It will make the write side of the equation much slower though, as you need all readers to drain before a write can proceed. So be mindful of that consequence.

> It is still not a zero-write read path though: the hit/miss stats are per-shard AtomicU64 counters with relaxed updates (we don't need great precision there), and the per-entry AtomicU8 freq counter only CASes until it saturates (which is great for the hot DNs). I think that is a reasonable trade for NDN. WDYT?

Yeah, that's fine. Realistically, if you plan to only do "sampling" based stats and are willing to forgo some precision for performance, you can do a trick. You only need to record some results, not all of them, to get a hit ratio. For 100,000 events, if you sample 1/10th of them (10,000) as a hit or miss, you end up with a result that is statistically likely (at roughly 95% confidence) to be within about 1 percentage point of the true value. This means you can probably forgo a lot of atomic calls by sampling instead of recording everything.

(check https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Sample+Size+Calculator for your own homework)
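
The arithmetic behind the sampling suggestion can be checked with a short sketch. It uses the normal approximation with the worst-case proportion p = 0.5 and ignores the finite-population correction, so it is a conservative estimate rather than what the linked calculator computes. The `should_sample` helper is a hypothetical per-thread "every Nth event" sampler, which is systematic rather than random sampling and is only one possible way to apply the idea.

```rust
use std::cell::Cell;

/// 95% margin of error for a proportion estimated from `n` samples,
/// assuming the worst case p = 0.5 (normal approximation).
pub fn margin_of_error_95(n: u64) -> f64 {
    1.96 * (0.25 / n as f64).sqrt()
}

thread_local! {
    // Per-thread event counter, so sampling needs no shared atomics.
    static TICK: Cell<u32> = Cell::new(0);
}

/// Hypothetical sampler: returns true for every `every`-th call made
/// on the current thread. Only those calls would update hit/miss stats.
pub fn should_sample(every: u32) -> bool {
    TICK.with(|t| {
        let n = t.get().wrapping_add(1);
        t.set(n);
        n % every == 0
    })
}
```

For n = 10,000 the margin works out to 1.96 × 0.005 = 0.0098, which is just under one percentage point.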

Firstyear · 2026-05-12T06:33:15Z

Oh, something else to consider is that the S3 shards are fixed size whereas in ARC they're variable size. S3 uses a ghost set to do promotions, but it seems to be a fixed-size recent ghost list. If anything, S3 really is kind of ARC but without the adaptive weights, where you have a small shard/recent list, and then a longer frequent list.

So there is a risk that if you have a lot of high churn on the recent list, your frequent list will never populate since it will be continually pushing out the ghost/shard lists and never able to actually cause the needed 2 hits (either in S or G) that will trigger a move to M. Whereas ARC will adapt in this case and extend the S set if there are too many hits into G, increasing the likelihood of a promotion. Very subtle edge case but it exists nonetheless.

droideck · 2026-05-12T20:50:05Z

> Oh, something else to consider is that the S3 shards are fixed size whereas in ARC they're variable size. S3 uses a ghost set to do promotions, but it seems to be a fixed-size recent ghost list. If anything, S3 really is kind of ARC but without the adaptive weights, where you have a small shard/recent list, and then a longer frequent list.
>
> So there is a risk that if you have a lot of high churn on the recent list, your frequent list will never populate since it will be continually pushing out the ghost/shard lists and never able to actually cause the needed 2 hits (either in S or G) that will trigger a move to M. Whereas ARC will adapt in this case and extend the S set if there are too many hits into G, increasing the likelihood of a promotion. Very subtle edge case but it exists nonetheless.

Yeah, that's a concern. The paper authors made an explicit static-over-adaptive tradeoff here. The failure mode is bounded and observable, but it's still a tradeoff.

So, to my understanding, that's what we have:

S is sized at around 10% of cache. The window for a second hit before eviction from S is bounded by the number of distinct DNs that land in the same shard. To miss the freq >= 1 threshold, the workload would need to push that many one-hit DNs through the shard faster than the working set cycles.

The ghost queue G is sized to M, not to S. A key evicted from S to G and re-requested while still in G goes straight into M, not back through S (see Algorithm 1 in the paper: if x in G then insert x to head of M). G capacity is much larger than S, so the second-chance window through G is much wider than through S alone.

But the access-bits-cleared rule in the paper's 4.1 section means a G-to-M promotion lands at effective freq=0, and M's eviction will drop it on the first pass if it doesn't earn a hit in M. So a one-hit-via-G entry is still vulnerable. This is the part the static design trades for predictability...
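
The S, M and G flow described above can be sketched as a single-threaded, single-shard model. This is an illustration under stated assumptions, not the PR's implementation: the `S3Fifo` type is hypothetical, the `freq >= 1` promotion threshold and the "G sized to M" rule are taken from this discussion, and the cap of 3 and the freq reset on promotion are assumptions that should be checked against the paper and the real code.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

const FREQ_MAX: u8 = 3; // assumed saturation point
const PROMOTE_AT: u8 = 1; // "freq >= 1" threshold quoted in this thread

/// Single-threaded sketch of the S / M / G queues for one shard.
pub struct S3Fifo {
    capacity: usize,
    small_target: usize,       // roughly 10% of capacity
    small: VecDeque<String>,   // S: probationary FIFO
    main: VecDeque<String>,    // M: main FIFO
    ghost: VecDeque<String>,   // G: evicted keys only, no values
    ghost_set: HashSet<String>,
    live: HashMap<String, u8>, // freq for every key in S or M
}

impl S3Fifo {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity >= 2);
        S3Fifo {
            capacity,
            small_target: (capacity / 10).max(1),
            small: VecDeque::new(),
            main: VecDeque::new(),
            ghost: VecDeque::new(),
            ghost_set: HashSet::new(),
            live: HashMap::new(),
        }
    }

    pub fn len(&self) -> usize {
        self.live.len()
    }

    /// Returns true on a hit; on a miss the key is admitted.
    pub fn access(&mut self, key: &str) -> bool {
        if let Some(freq) = self.live.get_mut(key) {
            *freq = (*freq + 1).min(FREQ_MAX);
            return true;
        }
        while self.live.len() >= self.capacity {
            if self.small.len() >= self.small_target {
                self.evict_small();
            } else {
                self.evict_main();
            }
        }
        if self.ghost_set.remove(key) {
            // Ghost hit: skip S and go straight to M.
            self.ghost.retain(|k| k.as_str() != key);
            self.main.push_front(key.to_string());
        } else {
            self.small.push_front(key.to_string());
        }
        self.live.insert(key.to_string(), 0);
        false
    }

    fn evict_small(&mut self) {
        while let Some(key) = self.small.pop_back() {
            if self.live[&key] >= PROMOTE_AT {
                // Re-referenced while in S: promote to M with freq cleared.
                self.live.insert(key.clone(), 0);
                self.main.push_front(key);
            } else {
                // One-hit key: drop it, but remember the key in G.
                self.live.remove(&key);
                self.ghost_set.insert(key.clone());
                self.ghost.push_front(key);
                let ghost_cap = self.capacity - self.small_target;
                while self.ghost.len() > ghost_cap {
                    if let Some(old) = self.ghost.pop_back() {
                        self.ghost_set.remove(&old);
                    }
                }
                return;
            }
        }
    }

    fn evict_main(&mut self) {
        while let Some(key) = self.main.pop_back() {
            let freq = self.live[&key];
            if freq > 0 {
                // Second chance: reinsert with freq decremented.
                self.live.insert(key.clone(), freq - 1);
                self.main.push_front(key);
            } else {
                self.live.remove(&key);
                return;
            }
        }
    }
}
```

In this model a key admitted through G enters M with freq 0, so it is dropped on M's next pass unless it is hit again first, which is the vulnerability described above.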

On adaptivity: the paper tested an adaptive S3-FIFO variant (S3-FIFO-d, section 6.2) that resizes S and M based on ghost-queue hits. They report S3-FIFO-d beats static S3-FIFO only on the 2% adversarial tail of their dataset, and underperforms on the rest because the adaptive sizing keeps oscillating. I read the authors as choosing predictability over adaptivity.

For the NDN cache specifically, steady-state hit rate is over 99% with a working set that fits inside M with substantial headroom. I'm not waving this away... IIUC, the scan resistance is one of the reasons why you picked ARC in the entry cache, and that feels right. For the NDN cache, IMO, the static-size choice fits the workload, but it's still a choice. If we see this in production, both S's relative size and G's size are levers we can expose. I'd rather not commit to a cn=config knob until we have a workload that needs them.

Firstyear · 2026-05-13T01:59:46Z

> To miss the freq >= 1 threshold, the workload would need to push that many one-hit DNs through the shard faster than the working set cycles.

Yeah, this could happen with a full table scan though, but I guess at that point you don't mind as this provides scanning resistance.

> G capacity is much larger than S, so the second-chance window through G is much wider than through S alone.

Yep, that's exactly how the ghost sets in ARC work (they are inversely sized to their related set, so a small recent set has a large recent ghost set).

> I'd rather not commit to a cn=config knob until we have a workload that needs them.

Agreed, it's best not to add too many knobs.

droideck · 2026-05-29T02:33:37Z

Okay, I’m back from vacation! I did more testing around the hot-DN case and different cache sizes, refined the benchmark section, and updated the doc with the current conclusions. Please review...

Add Normalized DN Cache with sharded S3-FIFO design

d5eaf72

droideck mentioned this pull request May 12, 2026

Replace concread NDN cache with sharded S3-FIFO 389ds/389-ds-base#7489

Open

droideck requested a review from Firstyear May 12, 2026 03:10

Firstyear requested changes May 12, 2026

tbordaz reviewed May 12, 2026

Improve the design thanks to the comments and new test runs

71fa5af

droideck requested review from Firstyear and tbordaz May 29, 2026 02:32
