Skip to content

feat: findByIds read primitive#62

Merged
techiejd merged 20 commits into
mainfrom
feat/find-embeddings-by-ids
Jun 11, 2026
Merged

feat: findByIds read primitive#62
techiejd merged 20 commits into
mainfrom
feat/find-embeddings-by-ids

Conversation

@techiejd

@techiejd techiejd commented Jun 1, 2026

Copy link
Copy Markdown
Owner

Adds findByIds — public method on VectorizedPayload and on the DbAdapter contract: fetch stored embedding records by primary key. Building block for "more like this."

By default returns each record's text + metadata; pass populateEmbedding: true to also get the raw embedding vector back (the normal search/query API never returns it). Defaults to false so callers don't pay for the heavy vector unless they ask — EmbeddingRecord.embedding is therefore optional.

The id of each record is whatever search() returns as result.id, so a search result round-trips directly.

What's here

  • EmbeddingRecord type (optional embedding?: number[]) + VectorizedPayload.findByIds({ knowledgePool, ids, populateEmbedding? }) + DbAdapter.findByIds(payload, poolName, ids, populateEmbedding?), re-exported for adapters.
  • Public wiring in createVectorizedPayloadObject (empty ids short-circuits to []), covered via the in-memory mock adapter.
  • PostgreSQL: Drizzle inArray direct read; embedding column selected only when populating; non-numeric ids dropped via /^\d+$/.
  • MongoDB: find({ _id: { $in } }) with { projection: { embedding: 0 } } when not populating; non-24-hex ids dropped via the existing HEX24 guard (never throws).
  • Cloudflare: official binding.getByIds(ids) (typed by refactor(cf): adopt @cloudflare/workers-types #61); getByIds always returns values, so the vector is stripped post-fetch when not populating.
  • Docs: adapters/README.md contract + root README.md Local API section.

populateEmbedding semantics

Default falseembedding omitted from each record. Where possible the read skips the vector at the source (pg: column not selected; mongo: { projection: { embedding: 0 } }); CF's getByIds always returns values, so it's stripped post-fetch.

Contract

Misses dropped (result length may be < ids.length); order not guaranteed; empty ids[] without a backend call; unknown/malformed ids treated as misses, not errors.

Tests

All green: pg 66 · cf 95 · mongodb 111 · userland vectorizedPayload 21. Each adapter has a dedicated findByIds spec covering both populateEmbedding: true (full record incl. numeric embedding + reserved + extension fields) and the default omit-embedding case, plus drop-misses and empty-ids.

Stacking

#61 (cf-workers-types) is now merged into main, and this PR is based on main (latest merged in). The CF findByIds relies on #61's typed getByIds.

techiejd added 16 commits May 31, 2026 23:44
The CF adapter's '/// <reference types="@cloudflare/workers-types" />'
pulls in Workers ambient globals that redefine Request/Response (where
.json() returns unknown, not the DOM's any). The root tsconfig included
./adapters/*/src/**, so 'tsc --noEmit' (build:types:all) leaked those
globals into core endpoint/admin code and failed typecheck. Adapters are
already typechecked independently via their own tsconfig.build.json in the
CI build job, so root coverage of adapter sources was redundant.
Depend on Pick<Vectorize, 'query' | 'upsert' | 'deleteByIds' | 'getByIds'>
via a named VectorizeBinding type instead of the full 8-method Vectorize
contract. env.VECTORIZE remains assignable; CloudflareVectorizeBinding is
kept as a deprecated alias for back-compat.
Rename `payload.findEmbeddingsByIds` -> `payload.findByIds` and add an
opt-in `populateEmbedding?: boolean` (default false). `EmbeddingRecord.embedding`
is now optional and only returned when populateEmbedding is true.

Each backend honors the flag at the source where possible: pg skips selecting
the embedding column, mongodb uses { projection: { embedding: 0 } }, and CF
strips values post-fetch (getByIds always returns them). DbAdapter.findByIds
gains the populateEmbedding param; the shared mock and adapters README follow.

Specs split into a populateEmbedding:true case (keeps the full-vector
assertions) plus a default-omits-embedding case.
Root README advertised a `findEmbeddingsByIds` method that doesn't exist —
the shipped public method is `findByIds`. Rename the method reference and
example, document the `populateEmbedding?` param (default false), and fix the
adapters/README EmbeddingRecord interface block to `embedding?: number[]`
(optional, present only when populateEmbedding: true).
@techiejd techiejd changed the title feat: findEmbeddingsByIds read primitive feat: findByIds read primitive Jun 6, 2026
@techiejd techiejd changed the base branch from feat/cf-workers-types to main June 6, 2026 12:05
techiejd added 2 commits June 6, 2026 21:28
findByIds filtered ids through /^\d+$/ and mapped to Number, hardcoding an
integer primary key. The embeddings collection defines no custom id, so under
postgresAdapter({ idType: 'uuid' }) every embedding id is a uuid — the filter
dropped all of them and findByIds returned [] for ids that exist, while
search() round-tripped the same uuids fine.

Pass ids straight to inArray; Postgres casts the text params to the column
type, supporting both integer and uuid PKs. Well-formed but nonexistent ids
are still absent from results; a malformed id now surfaces a backend error
rather than being silently dropped (documented in adapters/README).

Adds a uuid-idType regression spec.
chunkText and embeddingVersion are not required in the embeddings schema, so a
null column was spread through raw as `null`, violating EmbeddingRecord /
VectorSearchResult (`chunkText: string`). CF and MongoDB both coerce via
String(x ?? '') → '', so identical data round-tripped as '' on those adapters
but null on pg; a consumer doing record.chunkText.length crashed only on pg.

Coerce sourceCollection/chunkText/embeddingVersion via String(x ?? '') in both
mapRowsToRecords (findByIds) and mapRowsToResults (search) so pg matches the
declared types and the other adapters. Adds a regression test.
@techiejd

techiejd commented Jun 7, 2026

Copy link
Copy Markdown
Owner Author

Related to #60

findByIds now returns Record<string, EmbeddingRecord | undefined> instead of
Array<EmbeddingRecord>. Order isn't conserved by any backend and a lookup may
miss, so an array forced callers to re-join by id and made misses a silent gap.
Keying by the requested id makes the round-trip O(1) (records[searchHit.id]),
order irrelevant, and a miss an explicit undefined. Every requested id is a key.

Unify the malformed-id contract: unknown AND malformed ids map to undefined,
never throw. pg now filters ids that don't match the PK column type
(getSQLType: numeric for integer/serial, uuid-shaped for uuid) before the IN
query, so a bad id is a miss instead of a cast error that poisoned the batch —
matching mongo (non-24-hex drop) and cf (unknown ids absent from getByIds).

Stop the mock adapter from swallowing real errors: only Payload NotFound is
treated as a miss; everything else rethrows.

Docs + specs updated across all adapters; note that key order is not input
order (integer-like keys sort first) so callers must look up by id.
@techiejd techiejd merged commit ea6e169 into main Jun 11, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant