Add Normalized DN Cache with sharded S3-FIFO design#27

Open
droideck wants to merge 2 commits into
389ds:mainfrom
droideck:ndn-cache-s3fifo

@droideck
Member

No description provided.

@Firstyear left a comment
Contributor

Very interesting proposal, I'm curious about the locking design however. :)

Comment thread docs/389ds/design/normalized-dn-cache-sharded-s3fifo.md Outdated
no notion of stale entries.

The cache is split into 64 shards. Each shard runs the eviction algorithm
independently over its own hash table and its own lock, so reads on
Contributor

Are shards "per thread" or based on sharding on the hash of the input/key value?

Member Author

Hashed by key. Shard index is hash & (num_shards - 1), so threads compete on locks per-key, not per-thread. It's a single shared cache, not per-thread tables.

BTW, specifically the low bits, not the high bits, because hashbrown's H2 tag (the byte SIMD compares against during probes) is the top 7 bits of the hash. Sharding on high bits would collapse the H2 tags within a shard to a tiny set and explode SIMD probe collisions inside hashbrown. Low bits keep the shard index independent of the tag.

Background on the hashbrown control byte layout: https://deepwiki.com/rust-lang/hashbrown/2-core-architecture
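For readers following the thread, here is a minimal sketch of the low-bits shard selection described above. It is not the code from the PR: `NUM_SHARDS`, `hash_key`, `shard_index` and `top7_tag` are hypothetical names, and `DefaultHasher` stands in for whatever hasher the real cache uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Shard count from the design doc. It must be a power of two so that
/// `NUM_SHARDS - 1` is a valid bit mask.
const NUM_SHARDS: usize = 64;

/// Hash a key to 64 bits. `DefaultHasher` is only a stand-in here
/// (assumption: the real cache may use a different hasher).
fn hash_key<K: Hash + ?Sized>(key: &K) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish()
}

/// Shard index taken from the low bits: `hash & (num_shards - 1)`.
fn shard_index(hash: u64) -> usize {
    (hash as usize) & (NUM_SHARDS - 1)
}

/// The top 7 bits of the hash, which is what the comment above says
/// hashbrown uses as its H2 tag. Shown only to make the point that these
/// bits do not overlap with the bits used by `shard_index`.
fn top7_tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}
```

With 64 shards the index uses bits 0 to 5 and the tag uses bits 57 to 63, so every tag value can still occur inside any single shard.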

Contributor

So the issue then is if you have a very active key, then all your threads will serialise around that single lock as they compete for it. We see this commonly where a customer has a sssd service account that authenticates "thousands of times per second" so this scheme may impact that kind of scenario.

Contributor

Not clear to me. We have a kind of 64-slot (shard) hashtable containing keys, correct?
If we want to remove a key from a shard, how many locks are required? One for the key? One for the shard?

Contributor

@tbordaz In the design of this cache there is no such action as "remove", it just relies on the key aging out.

Contributor

Right. The eviction algorithm being per shard, does it mean that an evicted key only requires the lock of the key?

Contributor

Only the lock of the shard that the key is in.

Member Author

> So the issue then is if you have a very active key, then all your threads will serialise around that single lock as they compete for it. We see this commonly where a customer has a sssd service account that authenticates "thousands of times per second" so this scheme may impact that kind of scenario.

Yep, agreed.
That is the weak case for this design. If one DN/key is hot enough, sharding does not help much because that key keeps landing on the same shard lock. I added the hot-DN server tests to check that. The result was close though! Across the tested cache sizes and 16/32/64 threads, S3-FIFO stayed within about 3% of concread either way. I would not present that as a win, but as a measured tradeoff, which I called out in the doc.

Contributor

That's pretty good at least. The hot-DN case was one that worried me especially because of SSSD and how poorly it behaves as a client.

Comment thread docs/389ds/design/normalized-dn-cache-sharded-s3fifo.md
take a transactional snapshot and writers copy-on-write a new version on
commit. That is the design for caches whose values can change during a
read, and it remains a candidate for the 389 DS entry cache. The NDN
cache does not need those guarantees, and on the memberOf suite below
Contributor

Concread specifically reduces some single thread performance for higher multi-thread performance - is the memberof test multi-threaded or single?

Member Author

The integration perf suite quoted in the doc is single-threaded, yeah, my mistake...

I saw the Rust microbench numbers (which were for different numbers of threads and isolated from any DS bottlenecks) and thought that would be enough to make the multi-thread case:

threads   s3fifo       concread
1         179.81 µs    1.9131 ms
16        6.0811 ms    98.733 ms
64        24.982 ms    377.34 ms

But you're right that it doesn't replace a real workload. I'll re-run the sustained multi-threaded test in the integration suite we have (16 workers, 600s of MOD_ADD/MOD_DEL on member) and share those results later.

Contributor

Something to consider in that test is the quiesce behaviour too, but I haven't seen the full test to comment on why arc seems so slow there.

Alternatives Considered
-----------------------

Concread's `ARCache` is a transactional, snapshot-consistent ARC: readers
Contributor

Interestingly the transactional nature is separate from the algorithm used - if S3-FIFO is very 'good' then there is no reason we couldn't use it in concread instead of ARC.

Also within the way this works in 389-ds, we actually only use readers and we async-submit changes for the main cache to include.

Member Author

This isn't really a concread-vs-S3-FIFO question. It's whether the NDN cache shape we have today needs the transactional power concread provides.

Concread's ARCache gives readers a transactionally consistent snapshot, a haunted set so evicted-then-revived keys can't clobber newer writes, and an mpsc channel from readers to the writer so reader hits influence eviction (as you describe in CACHE.md). IIUC, these properties are correct for caches whose values can change under readers and whose readers need a stable view across a transaction. The entry cache and the index cache fit that description, yep.

IMO, the NDN cache doesn't. The value is a pure function of the key. Two readers asking for the same DN at different transaction ids get the same answer or a cache miss, never a stale-but-different answer. There's no rollback to support and no temporal ordering between readers and writers to protect. The haunted set and the reader-to-writer channel pay for guarantees the NDN cache doesn't need.

So for this proposal the relevant comparison is concread's transactional shell vs no shell, for a cache whose values are immutable functions of the key. For this specific cache, no shell wins on the hot path, I think. Which is why the new design is sharded S3-FIFO directly rather than ARC-inside-concread.

The reader-doesn't-block-writer property you get from async-submit carries over to the design through a different mechanism: a shared read lock with a relaxed atomic counter on hit, and a write lock only on miss-insert. There is no reader-to-writer mpsc, but readers stay off the write path in the hot case. Plus the cache is hash-sharded across per-shard locks, so reads on different shards never contend at the lock level.
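To make the hit and miss paths concrete, here is a rough single-shard sketch under stated assumptions. It uses `std::sync::RwLock` instead of `parking_lot::RwLock` to stay dependency-free, leaves out eviction entirely, and `Shard`, `lookup` and `normalize` are hypothetical names (the lowercasing `normalize` is only a placeholder for real DN normalization).

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// One shard: its own table behind its own lock, plus relaxed counters.
struct Shard {
    table: RwLock<HashMap<String, String>>,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl Shard {
    fn new() -> Self {
        Shard {
            table: RwLock::new(HashMap::new()),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Lookup takes only the shared side of the lock. The stats update is
    /// a relaxed atomic add, so readers never wait on each other.
    fn get(&self, dn: &str) -> Option<String> {
        let table = self.table.read().unwrap();
        let found = table.get(dn).cloned();
        let counter = if found.is_some() { &self.hits } else { &self.misses };
        counter.fetch_add(1, Ordering::Relaxed);
        found
    }

    /// Miss-insert path: the only place the exclusive lock is taken.
    /// The value is a pure function of the key, so if two threads race,
    /// keeping whichever entry landed first is harmless.
    fn insert(&self, dn: &str, normalized: String) {
        let mut table = self.table.write().unwrap();
        table.entry(dn.to_string()).or_insert(normalized);
    }

    /// (hits, misses) as seen by a relaxed load.
    fn stats(&self) -> (u64, u64) {
        (self.hits.load(Ordering::Relaxed), self.misses.load(Ordering::Relaxed))
    }
}

/// Placeholder normalizer (assumption: not the real DN normalization).
fn normalize(dn: &str) -> String {
    dn.to_ascii_lowercase()
}

/// Read-through lookup: shared lock on hit, exclusive lock only on miss.
fn lookup(shard: &Shard, dn: &str) -> String {
    if let Some(hit) = shard.get(dn) {
        return hit;
    }
    let normalized = normalize(dn);
    shard.insert(dn, normalized.clone());
    normalized
}
```

Calling `lookup` twice with the same DN should record one miss followed by one hit, and only the first call touches the write lock.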

Contributor

> This isn't really a concread-vs-S3-FIFO question. It's whether the NDN cache shape we have today needs the transactional power concread provides.

Part of the goal historically of using concread in 389 was to show the cache was effective transactionally, and could then be used in other places where it matters - especially with LMDB for example since it shares an identical transaction model.

However, no one else on the team really ever showed interest in that, and the changes to make 389-ds strongly transactional don't seem to have eventuated even with the changes to support LMDB.

So with that in mind, in this case yes, maybe the s3-fifo is a better option. That's fine. The number of cases for the NDN to rely on transactions would be low.

One area the concread cache has a possibility to outshine the s3-fifo is a majority read-only workset. This is because we end up with most cache lines in the shared state (where s3-fifo will be invalidating them frequently due to atomics). But this is very subjective.

Something to try in your benchmarking is different cache sizes - extremely small, moderate, and large ndn cache sizes. This also affects the performance.

Member Author

Good point. I added a cache-size + multithreaded sweep.

For the multithreaded memberOf cascade run, changing the configured cache size did not change the direction of the result. This run is mostly lookup-path and contention, not eviction pressure. S3-FIFO stayed around +12% ops/s vs concread, with lower p95 in the same runs.

So yeah, on the read-mostly point, I agree this is where concread has the strongest argument. The useful thing from the new runs is that the hot-DN shape stays close to concread, while the multi-key memberOf shape stays ahead across the cache sizes. Hits take the shared side of a per-shard parking_lot::RwLock, not an exclusive mutex. It is still not a zero-write read path though: the hit/miss stats are per-shard AtomicU64 counters with relaxed updates (we don't need great precision there), and the per-entry AtomicU8 freq counter only CASes until it saturates (which is great for the hot DNs). I think that is a reasonable trade for NDN. WDYT?
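A small sketch of the "CAS until it saturates" idea, for illustration only. `Freq` and `FREQ_MAX` are hypothetical names, and the cap of 3 is an assumption borrowed from the S3-FIFO paper's 2-bit counter rather than a value confirmed for this PR.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Saturation cap (assumption: 3, as with a 2-bit counter).
const FREQ_MAX: u8 = 3;

/// Per-entry frequency counter.
struct Freq(AtomicU8);

impl Freq {
    fn new() -> Self {
        Freq(AtomicU8::new(0))
    }

    /// Bump the counter unless it is already saturated, and return the
    /// value after the call. Once the counter sits at `FREQ_MAX` the hit
    /// path performs only a relaxed load and no write, so a hot entry's
    /// cache line is no longer dirtied by readers.
    fn bump(&self) -> u8 {
        let mut current = self.0.load(Ordering::Relaxed);
        while current < FREQ_MAX {
            match self.0.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return current + 1,
                // Another thread changed the value (or the weak CAS failed
                // spuriously): retry from what was actually observed.
                Err(observed) => current = observed,
            }
        }
        current
    }
}
```

For a hot DN this means at most `FREQ_MAX` writes to the entry between evictions sweeps that reset it, with every later hit being read-only on that counter.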

Contributor

> This run is mostly lookup-path and contention, not eviction pressure. S3-FIFO stayed around +12% ops/s vs concread, with lower p95 in the same runs.

There are some other things I want to suggest you can try, just for comparison.

  1. set_reader_quiesce to false and spawn a dedicated quiesce thread.
  2. set_look_back_limit to 8 (decreased from 32)

My guess is these two are the biggest factors you're seeing that cause the p95 changes, as well as some of the operational performance difference, especially with the quiesce.

> So yeah, on the read-mostly point, I agree this is where concread has the strongest argument. The useful thing from the new runs is that we see that the hot-DN shape stays close to concread, while the multi-key memberOf shape stays ahead across the cache sizes.

A good question is why memberOf is hitting the NDN cache so hard in the first place. I assume we're still using the "old" memberOf algorithm (compared to the one I proposed around 2017 that wasn't adopted; note: yes, it's still the slow algo).

> Hits take the shared side of a per-shard parking_lot::RwLock, not an exclusive mutex.

It will make the write side of the equation much slower though, as you need all readers to drain before a write can proceed. So be mindful of that consequence.

> It is still not a zero-write read path though: the hit/miss stats are per-shard AtomicU64 counters with relaxed updates (we don't need great precision there), and the per-entry AtomicU8 freq counter only CASes until it saturates (which is great for the hot DNs). I think that is a reasonable trade for NDN. WDYT?

Yeah, that's fine. Realistically, if you plan to only do "sampling" based stats and are willing to forgo some precision for performance, you can do a trick. You only need to record some results, not all of them, to get a hit ratio. For 100,000 events, if you sample 1/10th of them (10,000) as a hit or miss, then you end up with a result that is statistically likely to be within 1% of the true value. This means you can probably forgo a lot of atomic calls by sampling instead of recording everything.

(check https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Sample+Size+Calculator for your own homework)
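As a rough illustration of the sampling trick, the sketch below records only every tenth event per thread and checks the "within 1%" figure with the usual normal-approximation margin of error. All names (`SampledStats`, `SAMPLE_EVERY`, `EVENTS_SEEN`, `margin_of_error`) are hypothetical, and the fixed stride is an assumption made for simplicity: it can alias with a periodic workload, so a randomized stride would be the more careful choice.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

/// Record one event out of every `SAMPLE_EVERY` (assumption: 1 in 10).
const SAMPLE_EVERY: u32 = 10;

thread_local! {
    /// Per-thread event count: deciding whether to sample costs no atomics.
    static EVENTS_SEEN: Cell<u32> = Cell::new(0);
}

struct SampledStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl SampledStats {
    fn new() -> Self {
        SampledStats { hits: AtomicU64::new(0), misses: AtomicU64::new(0) }
    }

    /// Touch the shared counters only on sampled events.
    fn record(&self, hit: bool) {
        let sampled = EVENTS_SEEN.with(|seen| {
            let n = seen.get().wrapping_add(1);
            seen.set(n);
            n % SAMPLE_EVERY == 0
        });
        if sampled {
            let counter = if hit { &self.hits } else { &self.misses };
            counter.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// (sampled hits, sampled misses).
    fn sampled(&self) -> (u64, u64) {
        (self.hits.load(Ordering::Relaxed), self.misses.load(Ordering::Relaxed))
    }
}

/// Margin of error for an estimated proportion `p` from `sample_size`
/// independent samples, at the confidence level implied by `z`
/// (z = 1.96 corresponds to roughly 95%).
fn margin_of_error(sample_size: u64, p: f64, z: f64) -> f64 {
    z * (p * (1.0 - p) / sample_size as f64).sqrt()
}
```

With 10,000 samples and the worst case p = 0.5, `margin_of_error(10_000, 0.5, 1.96)` comes out at about 0.0098, which is consistent with the "within 1%" estimate above.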

@Firstyear
Contributor

Oh, something else to consider is that the S3 shards are fixed size, whereas in ARC they're variable size. S3 uses a ghost set to do promotions, but it seems to be like a fixed set recent ghost list. If anything S3 really is kind of ARC but without the adaptive weights, where you have a small shard/recent list, and then a longer frequent list.

So there is a risk that if you have a lot of high churn on the recent list, your frequent list will never populate since it will be continually pushing out the ghost/shard lists and never able to actually cause the needed 2 hits (either in S or G) that will trigger a move to M. Whereas ARC will adapt in this case and extend the S set if there are too many hits into G, increasing the likelihood of a promotion. Very subtle edge case but it exists nonetheless.

@droideck
Member Author

> Oh, something else to consider is that the S3 shards are fixed size, whereas in ARC they're variable size. S3 uses a ghost set to do promotions, but it seems to be like a fixed set recent ghost list. If anything S3 really is kind of ARC but without the adaptive weights, where you have a small shard/recent list, and then a longer frequent list.
>
> So there is a risk that if you have a lot of high churn on the recent list, your frequent list will never populate since it will be continually pushing out the ghost/shard lists and never able to actually cause the needed 2 hits (either in S or G) that will trigger a move to M. Whereas ARC will adapt in this case and extend the S set if there are too many hits into G, increasing the likelihood of a promotion. Very subtle edge case but it exists nonetheless.

Yeah, that's a concern. The paper authors made an explicit static-over-adaptive tradeoff here. The failure mode is bounded and observable, but it's still a tradeoff.

So, to my understanding, this is what we have:

S is sized at around 10% of cache. The window for a second hit before eviction from S is bounded by the number of distinct DNs that land in the same shard. To miss the freq >= 1 threshold, the workload would need to push that many one-hit DNs through the shard faster than the working set cycles.

The ghost queue G is sized to M, not to S. A key evicted from S to G and re-requested while still in G goes straight into M, not back through S (see Algorithm 1 in the paper: if x in G then insert x to head of M). G capacity is much larger than S, so the second-chance window through G is much wider than through S alone.

But the access-bits-cleared rule in the paper's 4.1 section means a G-to-M promotion lands at effective freq=0, and M's eviction will drop it on the first pass if it doesn't earn a hit in M. So a one-hit-via-G entry is still vulnerable. This is the part the static design trades for predictability...

On adaptivity: the paper tested an adaptive S3-FIFO variant (S3-FIFO-d, section 6.2) that resizes S and M based on ghost-queue hits. They report S3-FIFO-d beats static S3-FIFO only on the 2% adversarial tail of their dataset, and underperforms on the rest because the adaptive sizing keeps oscillating. I read the authors as choosing predictability over adaptivity.

For the NDN cache specifically, steady-state hit rate is over 99% with a working set that fits inside M with substantial headroom. I'm not waving this away... IIUC, the scan resistance is one of the reasons why you picked ARC in the entry cache, and that feels right. For the NDN cache, IMO, the static-size choice fits the workload, but it's still a choice. If we see this in production, both S's relative size and G's size are levers we can expose. I'd rather not commit to a cn=config knob until we have a workload that needs them.
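For anyone who wants to poke at the S/M/G flow discussed above, here is a compact single-threaded sketch. It is a simplification under stated assumptions and not the PR's implementation: no sharding or locking, capacity counted in entries, S fixed at about 10% of capacity, G sized to M's share, the S-to-M threshold set to `freq >= 1` as quoted in this thread, and `S3Fifo`, `access`, `in_main` and friends are hypothetical names.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

const FREQ_MAX: u8 = 3; // saturation cap (assumption)
const PROMOTE_MIN_FREQ: u8 = 1; // S-to-M threshold as quoted in the thread

struct S3Fifo {
    capacity: usize,  // resident entries allowed in S + M
    small_cap: usize, // about 10% of capacity
    ghost_cap: usize, // sized to M's share, not to S
    small: VecDeque<String>,
    main: VecDeque<String>,
    ghost: VecDeque<String>,
    ghost_set: HashSet<String>,
    freq: HashMap<String, u8>, // resident keys only
}

impl S3Fifo {
    fn new(capacity: usize) -> Self {
        assert!(capacity >= 1);
        let small_cap = (capacity / 10).max(1);
        S3Fifo {
            capacity,
            small_cap,
            ghost_cap: (capacity - small_cap).max(1),
            small: VecDeque::new(),
            main: VecDeque::new(),
            ghost: VecDeque::new(),
            ghost_set: HashSet::new(),
            freq: HashMap::new(),
        }
    }

    fn contains(&self, key: &str) -> bool {
        self.freq.contains_key(key)
    }

    fn in_main(&self, key: &str) -> bool {
        self.main.iter().any(|k| k == key)
    }

    /// Returns true on a hit. On a miss the key is inserted: into M if it
    /// is still remembered in G, otherwise into S. Either way freq starts at 0.
    fn access(&mut self, key: &str) -> bool {
        if let Some(f) = self.freq.get_mut(key) {
            *f = (*f + 1).min(FREQ_MAX);
            return true;
        }
        while self.freq.len() >= self.capacity {
            if self.small.len() >= self.small_cap {
                self.evict_small();
            } else {
                self.evict_main();
            }
        }
        if self.ghost_set.remove(key) {
            self.ghost.retain(|k| k != key); // O(n), acceptable for a sketch
            self.main.push_front(key.to_string());
        } else {
            self.small.push_front(key.to_string());
        }
        self.freq.insert(key.to_string(), 0);
        false
    }

    /// Pop from the tail of S: promote entries that earned a hit (with
    /// their freq cleared), evict the first one that did not into G.
    fn evict_small(&mut self) {
        while let Some(key) = self.small.pop_back() {
            if self.freq[&key] >= PROMOTE_MIN_FREQ {
                self.freq.insert(key.clone(), 0);
                self.main.push_front(key);
            } else {
                self.freq.remove(&key);
                while self.ghost.len() >= self.ghost_cap {
                    if let Some(old) = self.ghost.pop_back() {
                        self.ghost_set.remove(&old);
                    }
                }
                self.ghost_set.insert(key.clone());
                self.ghost.push_front(key);
                return;
            }
        }
    }

    /// Pop from the tail of M: entries with freq > 0 are reinserted with
    /// freq - 1, the first entry at freq 0 is dropped.
    fn evict_main(&mut self) {
        while let Some(key) = self.main.pop_back() {
            let f = self.freq[&key];
            if f == 0 {
                self.freq.remove(&key);
                return;
            }
            self.freq.insert(key.clone(), f - 1);
            self.main.push_front(key);
        }
    }
}
```

In this sketch a key that is evicted from S and requested again while still in G lands directly in M at freq 0, which is the "one-hit-via-G entry is still vulnerable" case described above.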

@Firstyear
Contributor

> To miss the freq >= 1 threshold, the workload would need to push that many one-hit DNs through the shard faster than the working set cycles.

Yeah, this could happen with a full table scan though, but I guess at that point you don't mind as this provides scanning resistance.

> G capacity is much larger than S, so the second-chance window through G is much wider than through S alone.

Yep, exactly how the ghost sets in arc work (they are inversely sized to their related set, so a small recent set has a large recent ghost set).

> I'd rather not commit to a cn=config knob until we have a workload that needs them.

Agreed, it's best not to add too many knobs.

@droideck droideck requested review from Firstyear and tbordaz May 29, 2026 02:32
@droideck
Member Author

Okay, I’m back from vacation! I did more testing around the hot-DN case and different cache sizes, refined the benchmark section, and updated the doc with the current conclusions. Please review...
