feat: findByIds read primitive#62
Merged
Merged
Conversation
The CF adapter's '/// <reference types="@cloudflare/workers-types" />' pulls in Workers ambient globals that redefine Request/Response (where .json() returns unknown, not the DOM's any). The root tsconfig included ./adapters/*/src/**, so 'tsc --noEmit' (build:types:all) leaked those globals into core endpoint/admin code and failed typecheck. Adapters are already typechecked independently via their own tsconfig.build.json in the CI build job, so root coverage of adapter sources was redundant.
Depend on Pick<Vectorize, 'query' | 'upsert' | 'deleteByIds' | 'getByIds'> via a named VectorizeBinding type instead of the full 8-method Vectorize contract. env.VECTORIZE remains assignable; CloudflareVectorizeBinding is kept as a deprecated alias for back-compat.
# Conflicts: # adapters/mongodb/src/index.ts
Rename `payload.findEmbeddingsByIds` -> `payload.findByIds` and add an
opt-in `populateEmbedding?: boolean` (default false). `EmbeddingRecord.embedding`
is now optional and only returned when populateEmbedding is true.
Each backend honors the flag at the source where possible: pg skips selecting
the embedding column, mongodb uses { projection: { embedding: 0 } }, and CF
strips values post-fetch (getByIds always returns them). DbAdapter.findByIds
gains the populateEmbedding param; the shared mock and adapters README follow.
Specs split into a populateEmbedding:true case (keeps the full-vector
assertions) plus a default-omits-embedding case.
Root README advertised a `findEmbeddingsByIds` method that doesn't exist — the shipped public method is `findByIds`. Rename the method reference and example, document the `populateEmbedding?` param (default false), and fix the adapters/README EmbeddingRecord interface block to `embedding?: number[]` (optional, present only when populateEmbedding: true).
findByIds filtered ids through /^\d+$/ and mapped to Number, hardcoding an
integer primary key. The embeddings collection defines no custom id, so under
postgresAdapter({ idType: 'uuid' }) every embedding id is a uuid — the filter
dropped all of them and findByIds returned [] for ids that exist, while
search() round-tripped the same uuids fine.
Pass ids straight to inArray; Postgres casts the text params to the column
type, supporting both integer and uuid PKs. Well-formed but nonexistent ids
are still absent from results; a malformed id now surfaces a backend error
rather than being silently dropped (documented in adapters/README).
Adds a uuid-idType regression spec.
chunkText and embeddingVersion are not required in the embeddings schema, so a null column was spread through raw as `null`, violating EmbeddingRecord / VectorSearchResult (`chunkText: string`). CF and MongoDB both coerce via String(x ?? '') → '', so identical data round-tripped as '' on those adapters but null on pg; a consumer doing record.chunkText.length crashed only on pg. Coerce sourceCollection/chunkText/embeddingVersion via String(x ?? '') in both mapRowsToRecords (findByIds) and mapRowsToResults (search) so pg matches the declared types and the other adapters. Adds a regression test.
4e4f1b4 to
1a9063d
Compare
Owner
Author
|
Related to #60 |
findByIds now returns Record<string, EmbeddingRecord | undefined> instead of Array<EmbeddingRecord>. Order isn't conserved by any backend and a lookup may miss, so an array forced callers to re-join by id and made misses a silent gap. Keying by the requested id makes the round-trip O(1) (records[searchHit.id]), order irrelevant, and a miss an explicit undefined. Every requested id is a key. Unify the malformed-id contract: unknown AND malformed ids map to undefined, never throw. pg now filters ids that don't match the PK column type (getSQLType: numeric for integer/serial, uuid-shaped for uuid) before the IN query, so a bad id is a miss instead of a cast error that poisoned the batch — matching mongo (non-24-hex drop) and cf (unknown ids absent from getByIds). Stop the mock adapter from swallowing real errors: only Payload NotFound is treated as a miss; everything else rethrows. Docs + specs updated across all adapters; note that key order is not input order (integer-like keys sort first) so callers must look up by id.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds
findByIds— public method onVectorizedPayloadand on theDbAdaptercontract: fetch stored embedding records by primary key. Building block for "more like this."By default returns each record's text + metadata; pass
populateEmbedding: trueto also get the rawembeddingvector back (the normal search/query API never returns it). Defaults tofalseso callers don't pay for the heavy vector unless they ask —EmbeddingRecord.embeddingis therefore optional.The
idof each record is whateversearch()returns asresult.id, so a search result round-trips directly.What's here
EmbeddingRecordtype (optionalembedding?: number[]) +VectorizedPayload.findByIds({ knowledgePool, ids, populateEmbedding? })+DbAdapter.findByIds(payload, poolName, ids, populateEmbedding?), re-exported for adapters.createVectorizedPayloadObject(emptyidsshort-circuits to[]), covered via the in-memory mock adapter.inArraydirect read; embedding column selected only when populating; non-numeric ids dropped via/^\d+$/.find({ _id: { $in } })with{ projection: { embedding: 0 } }when not populating; non-24-hex ids dropped via the existingHEX24guard (never throws).binding.getByIds(ids)(typed by refactor(cf): adopt @cloudflare/workers-types #61);getByIdsalways returns values, so the vector is stripped post-fetch when not populating.adapters/README.mdcontract + rootREADME.mdLocal API section.populateEmbedding semantics
Default
false→embeddingomitted from each record. Where possible the read skips the vector at the source (pg: column not selected; mongo:{ projection: { embedding: 0 } }); CF'sgetByIdsalways returns values, so it's stripped post-fetch.Contract
Misses dropped (result length may be
< ids.length); order not guaranteed; emptyids→[]without a backend call; unknown/malformed ids treated as misses, not errors.Tests
All green: pg 66 · cf 95 · mongodb 111 · userland
vectorizedPayload21. Each adapter has a dedicatedfindByIdsspec covering bothpopulateEmbedding: true(full record incl. numeric embedding + reserved + extension fields) and the default omit-embedding case, plus drop-misses and empty-ids.Stacking
#61 (cf-workers-types) is now merged into
main, and this PR is based onmain(latest merged in). The CFfindByIdsrelies on #61's typedgetByIds.