docs(chat-bot-memory): TurboQuant inspiration + sync to v4 memory state#1

Open
thedeutschmark wants to merge 2 commits into main from docs/turboquant-inspiration

@thedeutschmark
Owner

What

Adds an Inspiration: TurboQuant section to the chat-bot-memory note, and brings the note up to date with the current engine.

TurboQuant (arXiv:2504.19874)

Read the full paper, not the press summaries (which incorrectly describe it as "PolarQuant + QJL" — PolarQuant is a separate method TurboQuant beats as a baseline). Actual pipeline: random rotation → per-coordinate Lloyd–Max scalar quantizer → 1-bit QJL residual.
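
A rough pure-Python sketch of the three stages named above, for orientation only. It is not the paper's reference implementation: the helper names (`random_rotation`, `lloyd_max`, `nearest_level`), the 16-dimensional toy vector, the 4-level codebook, and fitting that codebook on Gaussian samples are all assumptions made for illustration.

```python
import math
import random

def random_rotation(dim, rng):
    """Orthonormal rows built by Gram-Schmidt on Gaussian vectors (hypothetical helper)."""
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:
            basis.append([x / norm for x in v])
    return basis

def rotate(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def nearest_level(x, codebook):
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def lloyd_max(samples, levels, iters=25):
    """1-D Lloyd iterations: assign samples to the nearest level, then move levels to bucket means."""
    ordered = sorted(samples)
    n = len(ordered)
    codebook = [ordered[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for s in ordered:
            buckets[nearest_level(s, codebook)].append(s)
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook

rng = random.Random(0)
DIM, LEVELS = 16, 4  # toy sizes, assumed for illustration

# Stage 1: random rotation of a unit vector.
rotation = random_rotation(DIM, rng)
raw = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
scale = math.sqrt(sum(v * v for v in raw))
x = [v / scale for v in raw]
rotated = rotate(rotation, x)

# Stage 2: per-coordinate scalar quantizer. Assumption: rotated coordinates of a
# unit vector are roughly N(0, 1/DIM), so the codebook is fitted on such samples.
sigma = 1.0 / math.sqrt(DIM)
codebook = lloyd_max([rng.gauss(0.0, sigma) for _ in range(2000)], LEVELS)
codes = [nearest_level(v, codebook) for v in rotated]
recon = [codebook[c] for c in codes]

# Stage 3: one sign bit per Gaussian projection of the quantization residual.
residual = [a - b for a, b in zip(rotated, recon)]
residual_norm = math.sqrt(sum(e * e for e in residual))
projection = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(DIM)]
residual_bits = [1 if sum(g * e for g, e in zip(row, residual)) >= 0 else 0
                 for row in projection]
```

Stored per vector in this toy version: `codes` (2 bits per coordinate), `residual_bits` (1 bit per projection), and `residual_norm`. Keeping the norm alongside the bits is a choice of this sketch.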

Honest verdict baked into the section: the machinery buys this app nothing at its scale (retrieval is a sort over dozens of rows; reply LLM is cloud-hosted → no local KV cache). What's worth taking:

  • QJL one-bit sketch as a semantic-dedup signal in the hygiene pass — catches "plays drums" vs "is a drummer" (lexical Jaccard scores these ~0.33 and misses them), no stored embeddings, off the hot path.
  • Outlier-channel bit allocation, generalized (the "+1"): spend retention + token budget where the variance/signal is, not uniformly. Reframes confidence-weighted selection as a measurable survival-weighted retrieval quality metric.
  • TurboQuant documented as the escape hatch that de-risks the deferred embeddings step (packed bit-blobs + brute-force Hamming) without reordering the roadmap.
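
To make the first and third bullets concrete, here is a minimal sketch of a one-bit sign-projection sketch stored as a packed bit-blob and searched by brute-force Hamming distance. It assumes dense input vectors and uses the classic sign-random-projection relation (the fraction of differing bits approximates angle / π). All names (`make_projection`, `sign_sketch`, `nearest`) and sizes are hypothetical, and it is not forgetmenot code.

```python
import math
import random

def make_projection(dim, bits, seed=0):
    """Hypothetical helper: `bits` Gaussian rows of length `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def sign_sketch(vec, projection):
    """Pack one sign bit per projection row into a single int (the 'bit-blob')."""
    blob = 0
    for i, row in enumerate(projection):
        if sum(g * v for g, v in zip(row, vec)) >= 0:
            blob |= 1 << i
    return blob

def hamming(a, b):
    """Number of differing bits between two packed sketches."""
    return bin(a ^ b).count("1")

def estimated_angle(a, b, bits):
    """Differing-bit fraction times pi approximates the angle in radians."""
    return math.pi * hamming(a, b) / bits

def nearest(query_blob, rows, k=1):
    """Brute-force scan over (label, blob) pairs, no index or codebook."""
    return sorted(rows, key=lambda r: hamming(query_blob, r[1]))[:k]

BITS = 2048
proj = make_projection(dim=4, bits=BITS, seed=1)
x = [1.0, 0.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0, 0.0]    # 45 degrees from x
z = [-1.0, 0.0, 0.0, 0.0]   # opposite direction to x

sx, sy, sz = (sign_sketch(v, proj) for v in (x, y, z))
angle_xy = estimated_angle(sx, sy, BITS)       # should land near pi / 4
closest = nearest(sx, [("y", sy), ("z", sz)])[0][0]
```

At dozens of rows a linear scan like `nearest` is cheap, which is the point of the "escape hatch" bullet: the blobs can sit in an ordinary column and need no separate vector server.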

Drift fixes (the "up to date" part)

  • Cross-stream session recaps (v4 stream_sessions / "this stream" vs "last stream").
  • Provenance render-tags ([said]/[reported]/[guess]) surfaced to the model.

Roadmap bullets updated to fold in the two concrete ideas. Deliberately scoped out bait-decisiveness work — that's a reply/action concern, not memory.

Note

This documents thinking and sharpens the roadmap; it does not commit to building a quantizer. The forgetmenot code changes (QJL dedup, signal-weighted budget) are a separate effort.

… state

Add an "Inspiration: TurboQuant" section drawing on Google Research's
TurboQuant paper (arXiv:2504.19874), read in full rather than from the
press summaries — which conflate it with the separate PolarQuant method.

The honest verdict: TurboQuant validates the "compress, don't hoard"
thesis but its machinery buys nothing at this scale (retrieval is a sort
over dozens of rows; the reply LLM is cloud-hosted, so there's no local
KV cache to compress). Two things are worth taking:

- The QJL primitive (1-bit sign-projection, unbiased angle estimate) as
  a semantic-dedup signal in the hygiene pass — no stored embeddings.
- The outlier-channel bit-allocation idea, generalized: spend retention
  and token budget where the variance/signal is, not uniformly. Reframes
  confidence-weighted selection as a measurable "survival-weighted
  retrieval quality" metric.

TurboQuant is also documented as the escape hatch that de-risks the
deferred embeddings step (packed bit-blobs + brute-force Hamming, no
codebook/server) without reordering the roadmap.

Also sync the note to current code: cross-stream session recaps (v4
stream_sessions) and provenance render-tags ([said]/[reported]/[guess]).
…l framing

Independent audit found two inaccuracies introduced/exposed by the
implementation work:
- The note claimed a 1-bit sketch would catch zero-overlap synonyms
  ('plays drums' / 'is a drummer'). It won't — over bag-of-token
  features a sign-projection sketch is ~equivalent to Jaccard. The
  prototype was built and removed; reframed as the honest lesson, with
  true synonym dedup waiting for embeddings (QJL makes those cheap).
- The 'uniform / recency-rank-alone' framing was stale: retrieval
  already blends confidence x recency, and stale-decay now scales with
  confidence. Reframed so the genuinely-open frontier is distinctiveness
  + measuring whether the weighting helps.
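
The first audit finding can be reproduced with a small sketch. Assumptions: whitespace tokenization with no stemming or stop-word removal, one Gaussian weight per token per bit, and hypothetical helper names (`tokens`, `jaccard`, `sketch_cosine`). Under those assumptions the two phrasings share no tokens, so their bag-of-token vectors are orthogonal and the sign sketch has nothing to detect.

```python
import math
import random

def tokens(text):
    """Whitespace tokens, lowercased; no stemming (assumption of this sketch)."""
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

def sketch_cosine(a, b, bits=2048, seed=0):
    """Cosine estimated from sign bits of random projections of token-indicator vectors."""
    vocab = sorted(a | b)
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(bits):
        weights = {tok: rng.gauss(0.0, 1.0) for tok in vocab}
        sign_a = sum(weights[t] for t in a) >= 0
        sign_b = sum(weights[t] for t in b) >= 0
        disagreements += sign_a != sign_b
    return math.cos(math.pi * disagreements / bits)

syn_a, syn_b = tokens("plays drums"), tokens("is a drummer")
lap_a, lap_b = tokens("plays drums"), tokens("plays the drums")

syn_jaccard = jaccard(syn_a, syn_b)        # 0.0: no shared tokens
syn_sketch = sketch_cosine(syn_a, syn_b)   # hovers around 0
lap_jaccard = jaccard(lap_a, lap_b)        # 2/3: two of three tokens shared
lap_sketch = sketch_cosine(lap_a, lap_b)   # clearly positive
```

Both scores rise and fall with token overlap, so the sketch adds no semantic signal on these features. It only becomes useful once the input vectors are embeddings.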
@thedeutschmark force-pushed the docs/turboquant-inspiration branch from 65f9410 to cb4023a on May 23, 2026 17:37
@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65f94109d9

Comment thread scaling-streaming-toolsets/README.md Outdated
| 10 | ~500 | ~2,000 | 99.93% |
| 100 | ~5,000 | ~20,000 | 99.34% |
| 1,000 | ~50,000 | ~200,000 | 93% |
| 10,000 | ~500,000 | ~2M | 34% |
[P1] Recalculate 10k-user headroom against stated plan limits

The 10,000-user row reports ~500,000 KV reads/day and ~2M worker req/day while also claiming 34% headroom, but §4.1 states Workers Paid includes 10M/month KV reads and 10M/month worker requests (~333k/day each). At these daily rates, both resources are already well over the included monthly quotas (about 15M KV reads/month and 60M worker requests/month), so this headroom value is directionally wrong and will mislead capacity/cost planning.
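
The arithmetic behind this comment can be checked with a short sketch. The daily rates and the 10M/month included quotas come from the comment itself. The 30-day month and the definition of headroom as `1 - usage / included` are assumptions, since the README's own formula is not shown here.

```python
INCLUDED_PER_MONTH = 10_000_000   # KV reads and worker requests, per the comment's reading of §4.1
DAYS_PER_MONTH = 30               # assumption

def monthly(per_day):
    return per_day * DAYS_PER_MONTH

def headroom(per_day, included=INCLUDED_PER_MONTH):
    """Assumed definition: fraction of the included quota left unused (negative means over quota)."""
    return 1 - monthly(per_day) / included

kv_monthly = monthly(500_000)       # 15,000,000 KV reads/month
req_monthly = monthly(2_000_000)    # 60,000,000 worker requests/month
kv_headroom = headroom(500_000)     # -0.5, i.e. 50% over the included quota
req_headroom = headroom(2_000_000)  # -5.0, i.e. six times the included quota
```

Under this definition both resources come out negative at 10,000 users, which is inconsistent with the 34% headroom in the table row.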
