docs(chat-bot-memory): TurboQuant inspiration + sync to v4 memory state #1
thedeutschmark wants to merge 2 commits
… state

Add an "Inspiration: TurboQuant" section drawing on Google Research's TurboQuant paper (arXiv:2504.19874), read in full rather than from the press summaries, which conflate it with the separate PolarQuant method.

The honest verdict: TurboQuant validates the "compress, don't hoard" thesis, but its machinery buys nothing at this scale (retrieval is a sort over dozens of rows; the reply LLM is cloud-hosted, so there's no local KV cache to compress). Two things are worth taking:

- The QJL primitive (1-bit sign-projection, unbiased angle estimate) as a semantic-dedup signal in the hygiene pass, with no stored embeddings.
- The outlier-channel bit-allocation idea, generalized: spend retention and token budget where the variance/signal is, not uniformly. This reframes confidence-weighted selection as a measurable "survival-weighted retrieval quality" metric.

TurboQuant is also documented as the escape hatch that de-risks the deferred embeddings step (packed bit-blobs + brute-force Hamming, no codebook/server) without reordering the roadmap.

Also sync the note to current code: cross-stream session recaps (v4 stream_sessions) and provenance render-tags ([said]/[reported]/[guess]).
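As a rough illustration of the "packed bit-blobs + brute-force Hamming" idea, here is a minimal sketch of sign random projection. It assumes dense float vectors as input and uses the symmetric variant in which both sides are reduced to sign bits; the function names, the bit count, and the seed are hypothetical and are not taken from the paper or from this repo.

```python
import math
import random


def make_projection(dim, bits, seed=0):
    """Draw `bits` Gaussian random directions in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def sketch(vec, projection):
    """Pack the sign of each random projection into a single Python int."""
    packed = 0
    for i, row in enumerate(projection):
        dot = sum(r * v for r, v in zip(row, vec))
        if dot >= 0.0:
            packed |= 1 << i
    return packed


def hamming(a, b):
    """Number of differing bits between two packed sketches."""
    return bin(a ^ b).count("1")


def estimated_angle(a, b, bits):
    """Each bit differs with probability theta / pi, so the differing
    fraction times pi is an unbiased estimate of the angle theta."""
    return math.pi * hamming(a, b) / bits


def nearest(query_sketch, stored_sketches):
    """Brute-force scan: index of the stored sketch with the smallest
    Hamming distance. No codebook and no index server are involved."""
    return min(range(len(stored_sketches)),
               key=lambda i: hamming(query_sketch, stored_sketches[i]))
```

At dozens of rows the linear scan in `nearest` is the whole retrieval path, which is the scale argument made above.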
…l framing
Independent audit found two inaccuracies introduced/exposed by the
implementation work:
- The note claimed a 1-bit sketch would catch zero-overlap synonyms
('plays drums' / 'is a drummer'). It won't — over bag-of-token
features a sign-projection sketch is ~equivalent to Jaccard. The
prototype was built and removed; reframed as the honest lesson, with
true synonym dedup waiting for embeddings (QJL makes those cheap).
- The 'uniform / recency-rank-alone' framing was stale: retrieval
already blends confidence x recency, and stale-decay now scales with
confidence. Reframed so the genuinely-open frontier is distinctiveness
+ measuring whether the weighting helps.
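The first point can be checked with a small experiment. The sketch below is an assumption-laden toy, not the removed prototype: it tokenizes on whitespace, treats each phrase as a set of tokens, and gives every token its own Gaussian row seeded from the token string. Under those assumptions the sign sketch estimates the cosine between token-indicator vectors, so it tracks token overlap and lands near zero whenever Jaccard is zero.

```python
import math
import random


def tokens(text):
    """Whitespace tokenizer (an assumption; the real tokenizer may differ)."""
    return set(text.lower().split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def sign_sketch(text, bits=1024, seed=0):
    """Sign of a random projection of the bag-of-tokens indicator vector.

    Each (bit, token) pair gets a deterministic Gaussian weight, so two
    phrases share randomness only through the tokens they share.
    """
    toks = tokens(text)
    out = []
    for i in range(bits):
        total = sum(random.Random(f"{seed}:{i}:{t}").gauss(0.0, 1.0)
                    for t in toks)
        out.append(total >= 0.0)
    return out


def estimated_cosine(a, b, bits=1024):
    """cos(pi * differing-fraction): the sign-projection similarity estimate."""
    sa, sb = sign_sketch(a, bits), sign_sketch(b, bits)
    differing = sum(x != y for x, y in zip(sa, sb))
    return math.cos(math.pi * differing / bits)


if __name__ == "__main__":
    for left, right in [("plays drums", "is a drummer"),
                        ("plays drums", "plays the drums")]:
        print(left, "|", right,
              "| jaccard:", round(jaccard(left, right), 2),
              "| sketch cosine:", round(estimated_cosine(left, right), 2))
```

The synonym pair shares no tokens, so both measures report essentially no similarity; only a representation that places 'drums' and 'drummer' near each other, such as embeddings, would change that.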
65f9410 to cb4023a
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65f94109d9
| Users | KV reads/day | Worker req/day | Headroom |
| --- | --- | --- | --- |
| 10 | ~500 | ~2,000 | 99.93% |
| 100 | ~5,000 | ~20,000 | 99.34% |
| 1,000 | ~50,000 | ~200,000 | 93% |
| 10,000 | ~500,000 | ~2M | 34% |
Recalculate 10k-user headroom against stated plan limits
The 10,000-user row reports ~500,000 KV reads/day and ~2M worker req/day while also claiming 34% headroom, but §4.1 states Workers Paid includes 10M/month KV reads and 10M/month worker requests (~333k/day each). At these daily rates, both resources are already well over the included monthly quotas (about 15M KV reads/month and 60M worker requests/month), so this headroom value is directionally wrong and will mislead capacity/cost planning.
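The arithmetic behind this comment can be reproduced in a few lines. This is a sketch using only the figures quoted in the comment; the 30-day month is an assumption (it matches the ~333k/day figure), and the dictionary keys are illustrative names rather than Cloudflare identifiers.

```python
DAYS_PER_MONTH = 30  # assumption: 10M / 30 gives the ~333k/day quoted above

# Included monthly quotas as stated in the comment (per the note's section 4.1).
INCLUDED_PER_MONTH = {"kv_reads": 10_000_000, "worker_requests": 10_000_000}

# Daily rates from the 10,000-user row of the table.
DAILY_AT_10K_USERS = {"kv_reads": 500_000, "worker_requests": 2_000_000}


def monthly_usage(daily, days=DAYS_PER_MONTH):
    """Scale per-day rates to a monthly total."""
    return {name: per_day * days for name, per_day in daily.items()}


def quota_ratio(daily, included, days=DAYS_PER_MONTH):
    """Monthly usage divided by the included monthly quota."""
    usage = monthly_usage(daily, days)
    return {name: usage[name] / included[name] for name in usage}


ratios = quota_ratio(DAILY_AT_10K_USERS, INCLUDED_PER_MONTH)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the included monthly quota")
# kv_reads: 1.5x the included monthly quota
# worker_requests: 6.0x the included monthly quota
```

Any ratio above 1.0 means the included quota is already exhausted, so a positive headroom figure cannot hold for that row under these limits.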
What
Adds an Inspiration: TurboQuant section to the chat-bot-memory note, and brings the note up to date with the current engine.
TurboQuant (arXiv:2504.19874)
Read the full paper, not the press summaries (which incorrectly describe it as "PolarQuant + QJL" — PolarQuant is a separate method TurboQuant beats as a baseline). Actual pipeline: random rotation → per-coordinate Lloyd–Max scalar quantizer → 1-bit QJL residual.
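Honest verdict baked into the section: the machinery buys this app nothing at its scale (retrieval is a sort over dozens of rows; the reply LLM is cloud-hosted, so there is no local KV cache). What's worth taking:
- The QJL primitive (1-bit sign-projection) as a dedup signal in the hygiene pass.
- The outlier-channel bit-allocation idea, generalized: spend retention and token budget where the signal is, not uniformly.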
Drift fixes (the "up to date" part)
- Cross-stream session recaps (v4 stream_sessions / "this stream" vs "last stream").
- Provenance render-tags ([said]/[reported]/[guess]) surfaced to the model.

Roadmap bullets updated to fold in the two concrete ideas. Deliberately scoped out bait-decisiveness work; that's a reply/action concern, not memory.
Note
This documents thinking and sharpens the roadmap; it does not commit to building a quantizer. The forgetmenot code changes (QJL dedup, signal-weighted budget) are a separate effort.