91 changes: 91 additions & 0 deletions embeddings/CHUNKING.md
@@ -0,0 +1,91 @@
# Embedding strategy FFI (embeddings lib → Manticore daemon)

Embeddings lib **v8**. The one embedding call, `make_vect_embeddings`, takes an
optional `ChunkSettings*` that selects how a document becomes one or many
vectors. Cardinality is carried as **data** in the return (a per-document
offsets sidecar), so a single method covers both 1-vector and N-vector
strategies — no second method.

## FFI

```cpp
constexpr uint32_t STRATEGY_TRUNCATE  = 0; // 1 vector/doc
constexpr uint32_t STRATEGY_MEAN      = 1; // 1 vector/doc
constexpr uint32_t STRATEGY_FIXED     = 2; // N vectors/doc
constexpr uint32_t STRATEGY_RECURSIVE = 3; // N vectors/doc
constexpr uint32_t STRATEGY_SENTENCE  = 4; // N vectors/doc

struct ChunkSettings {
    uint32_t strategy;       // one of STRATEGY_*
    uint32_t max_tokens;     // chunk size in tokens; 0 = model max
    uint32_t overlap_tokens; // token overlap between chunks; 0 = none
    uint32_t max_chunks;     // cap on chunks/doc; 0 = unlimited; past the cap, the tail merges into the last chunk
};

// settings == nullptr ⇒ truncate (today's behavior)
FloatVecResult make_vect_embeddings(
    const TextModelWrapper*, const StringItem* texts, uintptr_t count,
    const ChunkSettings*, int32_t threads);
```

### Return: flat vectors + per-row offsets (cardinality is data)

```cpp
struct FloatVecResult {
    char *m_szError;
    const FloatVec *m_tEmbedding;   // FLAT: every document's vectors concatenated, `len` total
    uintptr_t len;
    uintptr_t cap;
    const uintptr_t *m_pRowOffsets; // length rows+1; doc i = m_tEmbedding[off[i] .. off[i+1]), end exclusive
    uintptr_t rows;                 // number of input documents
    uintptr_t offsets_cap;
};
```

Document `i` owns the vectors of `m_tEmbedding` at indices `m_pRowOffsets[i]` (inclusive) through `m_pRowOffsets[i+1]` (exclusive).
- `truncate`/`mean` → one vector/doc, so `len == rows == count` and offsets are
`[0, 1, …, rows]` (you may just index `m_tEmbedding[i]`).
- `fixed`/`recursive`/`sentence` → N vectors/doc; `len` = total chunks, and the
offsets group them per document.

Free with `free_vec_result` (it frees the offsets too). Load-time check:
`EmbedLib.version == 8`.
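
The offsets contract above can be sketched as a pair of small helpers. This is an illustration only: `VecRange`, `DocVectors` and `OffsetsAreConsistent` are hypothetical names that are not part of the lib, `FloatVec` is left opaque because its fields are not needed to walk the offsets, and the check that offsets start at 0 and end at `len` is an assumption drawn from the `[0, 1, …, rows]` example.

```cpp
#include <cassert>
#include <cstdint>

struct FloatVec;  // opaque here: only the offsets are inspected

// Mirrors the FloatVecResult layout documented above.
struct FloatVecResult {
    char *m_szError;
    const FloatVec *m_tEmbedding;
    uintptr_t len;
    uintptr_t cap;
    const uintptr_t *m_pRowOffsets;
    uintptr_t rows;
    uintptr_t offsets_cap;
};

// Half-open index range [begin, end) into m_tEmbedding for one document.
struct VecRange {
    uintptr_t begin;
    uintptr_t end;
};

// Hypothetical helper: the slice of the flat array owned by document i.
inline VecRange DocVectors(const FloatVecResult &res, uintptr_t i) {
    assert(i < res.rows);
    return VecRange{res.m_pRowOffsets[i], res.m_pRowOffsets[i + 1]};
}

// Hypothetical sanity check (assumed invariants): offsets start at 0,
// never decrease, and the final offset equals len.
inline bool OffsetsAreConsistent(const FloatVecResult &res) {
    if (res.m_pRowOffsets == nullptr) return false;
    if (res.m_pRowOffsets[0] != 0) return false;
    for (uintptr_t i = 0; i < res.rows; ++i)
        if (res.m_pRowOffsets[i] > res.m_pRowOffsets[i + 1]) return false;
    return res.m_pRowOffsets[res.rows] == res.len;
}
```

A caller would typically test `m_szError` first, then loop `i` from `0` to `rows` and hand each `VecRange` to storage, so the same loop serves both the 1-vector and the N-vector strategies.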

## Strategies

| strategy | val | output | what it does |
|---|---|---|---|
| **truncate** | 0 | 1 vec/doc | embed the leading tokens up to the model's input limit (rest dropped). `max_tokens`/`overlap_tokens`/`max_chunks` are ignored. |
| **mean** | 1 | 1 vec/doc | split (recursive, token-aware) → embed every chunk → **average** into one L2-normalized vector. Whole document, no tail loss. |
| **fixed** | 2 | N vecs/doc | split into fixed `max_tokens`-token windows; one vector per chunk. |
| **recursive** | 3 | N vecs/doc | split on a separator hierarchy (paragraph → line → space) ≤ `max_tokens`; one vector per chunk. |
| **sentence** | 4 | N vecs/doc | UAX-29 sentence segmentation (ES-style), grouped to `max_tokens`; one vector per chunk. |
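
As a rough illustration of the `mean` row, the pooling step might look like the sketch below. It assumes an unweighted average over chunk vectors of equal dimension; whether the lib weights chunks by token count is not stated here, and `MeanPool` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Averages equal-length chunk vectors, then scales the result to unit L2 norm.
// Assumption: every chunk contributes equally, regardless of its token count.
inline std::vector<float> MeanPool(const std::vector<std::vector<float>> &chunks) {
    assert(!chunks.empty());
    const std::size_t dim = chunks.front().size();
    std::vector<float> out(dim, 0.0f);
    for (const auto &chunk : chunks) {
        assert(chunk.size() == dim);
        for (std::size_t d = 0; d < dim; ++d) out[d] += chunk[d];
    }
    float sumSq = 0.0f;
    for (std::size_t d = 0; d < dim; ++d) {
        out[d] /= static_cast<float>(chunks.size());
        sumSq += out[d] * out[d];
    }
    const float norm = std::sqrt(sumSq);
    if (norm > 0.0f)  // leave an all-zero vector untouched
        for (float &v : out) v /= norm;
    return out;
}
```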

- `max_tokens = 0` → the model's own max input length.
- `overlap_tokens` → token overlap between consecutive chunks; applies to the multi-vector strategies and to `mean`.
- `max_chunks` → cap chunks/doc; overflow merges the tail into the last chunk.
- Local models chunk on the model's real subword tokens; remote API models
(OpenAI/Voyage/Jina) chunk by a char/byte heuristic (no local tokenizer).
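
How the three knobs interact for the `fixed` strategy can be sketched over token indices. This is a minimal sketch and not the lib's implementation: `TokenSpan` and `FixedWindows` are hypothetical names, `modelMax` stands in for the model's input limit, and clamping an overlap that is as large as the window is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct TokenSpan {
    uint32_t begin;  // first token index, inclusive
    uint32_t end;    // one past the last token index
};

inline std::vector<TokenSpan> FixedWindows(uint32_t numTokens, uint32_t maxTokens,
                                           uint32_t overlapTokens, uint32_t maxChunks,
                                           uint32_t modelMax) {
    assert(modelMax > 0);
    // max_tokens = 0 means "model max"; larger values are clamped to it.
    const uint32_t window = (maxTokens == 0) ? modelMax : std::min(maxTokens, modelMax);
    // Assumption: an overlap >= window is clamped so every step advances.
    const uint32_t overlap = std::min(overlapTokens, window - 1);
    const uint32_t stride = window - overlap;

    std::vector<TokenSpan> spans;
    for (uint32_t begin = 0; begin < numTokens; begin += stride) {
        const uint32_t end = std::min(begin + window, numTokens);
        spans.push_back(TokenSpan{begin, end});
        if (end == numTokens) break;  // the tail is covered
    }
    // max_chunks: fold everything past the cap into the last kept chunk.
    if (maxChunks != 0 && spans.size() > maxChunks) {
        spans.resize(maxChunks);
        spans.back().end = numTokens;
    }
    return spans;
}
```

With 10 tokens, `max_tokens = 4` and `overlap_tokens = 1`, the sketch yields the spans `[0,4)`, `[3,7)`, `[6,10)`; adding `max_chunks = 2` yields `[0,4)`, `[3,10)`. Note that the merged last span is then longer than `max_tokens`; how such a chunk is embedded is not covered by this sketch.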

## Calling it

```cpp
// non-chunked field & queries: pass nullptr → truncate
pFuncs->make_vect_embeddings( &model, items.data(), items.size(), nullptr, iThreads );

// any strategy, taken from a table option:
ChunkSettings tCfg { STRATEGY_SENTENCE, /*max_tokens*/ 256, /*overlap*/ 0, /*max_chunks*/ 0 };
FloatVecResult tRes = pFuncs->make_vect_embeddings( &model, items.data(), items.size(), &tCfg, iThreads );
// doc i owns vectors tRes.m_tEmbedding[ tRes.m_pRowOffsets[i] .. tRes.m_pRowOffsets[i+1] ]
```

The daemon owns the SQL/DDL surface (e.g. a per-`float_vector`-field option),
parses it into a `ChunkSettings`, and passes it on every embed call. Queries are
short: pass `nullptr` so they are never chunked.
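
On the daemon side, turning an option value into a `ChunkSettings` might start with a lookup like the one below. The option syntax is not defined in this document, so the string names (taken from the strategy table) and the helpers `StrategyFromName` and `IsMultiVector` are purely hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint32_t STRATEGY_TRUNCATE  = 0;
constexpr uint32_t STRATEGY_MEAN      = 1;
constexpr uint32_t STRATEGY_FIXED     = 2;
constexpr uint32_t STRATEGY_RECURSIVE = 3;
constexpr uint32_t STRATEGY_SENTENCE  = 4;

struct ChunkSettings {
    uint32_t strategy;
    uint32_t max_tokens;
    uint32_t overlap_tokens;
    uint32_t max_chunks;
};

// Hypothetical: maps an option value such as "sentence" to a STRATEGY_* constant.
// Returns false and leaves `strategy` untouched for unknown names.
inline bool StrategyFromName(const std::string &name, uint32_t &strategy) {
    static const struct { const char *name; uint32_t value; } kTable[] = {
        {"truncate", STRATEGY_TRUNCATE}, {"mean", STRATEGY_MEAN},
        {"fixed", STRATEGY_FIXED},       {"recursive", STRATEGY_RECURSIVE},
        {"sentence", STRATEGY_SENTENCE},
    };
    for (const auto &entry : kTable) {
        if (name == entry.name) {
            strategy = entry.value;
            return true;
        }
    }
    return false;
}

// True for the strategies documented as returning N vectors per document.
inline bool IsMultiVector(uint32_t strategy) {
    return strategy >= STRATEGY_FIXED && strategy <= STRATEGY_SENTENCE;
}
```

Rejecting unknown names at DDL time, rather than falling back to truncate silently, keeps a typo in the option from changing how a field is embedded.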

## Daemon side (next phase, not this lib)

The lib already returns N vectors/doc for `fixed`/`recursive`/`sentence` via the
offsets sidecar. To *use* them the daemon needs N-vectors-per-row storage +
max-over-chunks search; `truncate`/`mean` (1 vector/doc) work with today's
storage unchanged.
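
"Max-over-chunks" here means scoring a document by its best-matching chunk. The daemon's real storage and search are out of scope for this lib, so the following is only a sketch under stated assumptions: vectors are held as `std::vector<float>`, similarity is a plain dot product (equal to cosine for L2-normalized vectors), and `Dot` and `MaxOverChunks` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

inline float Dot(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t d = 0; d < a.size(); ++d) sum += a[d] * b[d];
    return sum;
}

// chunkVecs is the flat list of all chunk vectors; offsets has rows+1 entries
// and groups them per document, as in FloatVecResult::m_pRowOffsets.
// Returns -infinity for a document that owns no chunks.
inline float MaxOverChunks(const std::vector<float> &query,
                           const std::vector<std::vector<float>> &chunkVecs,
                           const std::vector<uintptr_t> &offsets, std::size_t doc) {
    assert(doc + 1 < offsets.size());
    float best = -std::numeric_limits<float>::infinity();
    for (uintptr_t c = offsets[doc]; c < offsets[doc + 1]; ++c)
        best = std::max(best, Dot(query, chunkVecs[c]));
    return best;
}
```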
1 change: 1 addition & 0 deletions embeddings/Cargo.lock


1 change: 1 addition & 0 deletions embeddings/Cargo.toml
@@ -7,6 +7,7 @@ edition = "2021"
# For local dev with ../../candle, add a [patch] section to use path deps.
[dependencies]
tokenizers = "0.15.2"
unicode-segmentation = "1"
hf-hub = { git = "https://github.com/huggingface/hf-hub.git", rev = "ac22200ea0b5af4d8c362f699be0340647b19060", default-features = false,features = ["ureq"] }
anyhow = "1.0.81"
serde_json = "1.0.114"
2 changes: 1 addition & 1 deletion embeddings/INSTRUCTIONS.md
@@ -53,7 +53,7 @@ src/

1. **`src/model/text_model_wrapper.rs`** - The heart of FFI
- `load_model()` - Creates model instances
- `make_vect_embeddings()` - Generates embeddings (FIXED: no empty vectors on error)
- `make_vect_embeddings()` - Generates embeddings; takes an optional `ChunkSettings*` selecting the strategy (truncate / mean / fixed / recursive / sentence). Returns a flat `FloatVecResult` grouped per document by `m_pRowOffsets`. **Contract + strategy reference: [`CHUNKING.md`](CHUNKING.md).**
- `free_model_result()` / `free_vec_result()` - Memory cleanup

2. **`manticoresearch_text_embeddings.h`** - Auto-generated C header
41 changes: 41 additions & 0 deletions embeddings/manticoresearch_text_embeddings.h
@@ -9,6 +9,17 @@
#include <ostream>
#include <new>

/// Strategy, mirrored as a `u32` across the FFI in [`ChunkSettings`].
constexpr static const uint32_t STRATEGY_TRUNCATE = 0;

constexpr static const uint32_t STRATEGY_MEAN = 1;

constexpr static const uint32_t STRATEGY_FIXED = 2;

constexpr static const uint32_t STRATEGY_RECURSIVE = 3;

constexpr static const uint32_t STRATEGY_SENTENCE = 4;

struct TextModelResult {
void *m_pModel;
char *m_szError;
@@ -33,11 +44,25 @@ struct FloatVec {
uintptr_t cap;
};

/// Embedding result for one `make_vect_embeddings` call.
///
/// `m_tEmbedding` is a FLAT array of `len` vectors — every input document's
/// vectors concatenated. `m_pRowOffsets` (length `rows + 1`) groups them per
/// input document, Arrow-style: document `i` owns
/// `m_tEmbedding[m_pRowOffsets[i] .. m_pRowOffsets[i + 1]]`. For the v1
/// strategies (truncate / mean) every document yields exactly one vector, so
/// `len == rows` and the offsets are `[0, 1, ..., rows]`. The sidecar lets a
/// future multi-vector strategy return N vectors per document through this same
/// struct — no second method, cardinality carried as data.
///
struct FloatVecResult {
  char *m_szError;
  const FloatVec *m_tEmbedding;
  uintptr_t len;
  uintptr_t cap;
  const uintptr_t *m_pRowOffsets;
  uintptr_t rows;
  uintptr_t offsets_cap;
};

using TextModelWrapper = void*;
@@ -47,9 +72,25 @@ struct StringItem {
uintptr_t len;
};

/// Chunking parameters. `#[repr(C)]` — passed straight across the FFI by the
/// daemon, which owns the DDL surface and validates against the model.
struct ChunkSettings {
  /// One of the `STRATEGY_*` constants.
  uint32_t strategy;
  /// Target chunk size in tokens. `0` ⇒ use the model's max. Always clamped to
  /// the model's real input limit.
  uint32_t max_tokens;
  /// Token overlap between consecutive chunks. `0` ⇒ none.
  uint32_t overlap_tokens;
  /// Hard cap on chunks per document. `0` ⇒ unlimited. Overflow merges the
  /// tail into the last chunk (matches OpenSearch's `max_chunk_limit`).
  uint32_t max_chunks;
};

using MakeVectEmbeddingsFn = FloatVecResult(*)(const TextModelWrapper*,
const StringItem*,
uintptr_t,
const ChunkSettings*,
int32_t);

using FreeVecResultFn = void(*)(FloatVecResult);