
feat(rerank): add per-model token limits for cross-encoder rerank #666

Open
vlsi wants to merge 1 commit into michaelfeil:main from vlsi:feat/rerank-token-limits

Conversation

vlsi commented Jun 17, 2026


Why

A cross-encoder scores one <s>query</s></s>document</s> sequence per candidate, so an oversized query or document inflates every pair and can exhaust memory or stall the backend when a client sends no limits. The /rerank API had no way to bound the scored sequence length, and a single global default cannot fit every reranker.

What

Add three token budgets that bound the scored sequence per pair:

  • max_query_tokens: head-truncate the query.
  • max_tokens_per_doc: head-truncate each document (Cohere v2 compatible).
  • max_pair_tokens: cap the joined pair, trimming the longest side first.
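The three budgets above can be sketched on plain token-id lists. This is a minimal illustration, not the PR's implementation: `apply_budgets` is a hypothetical helper, it ignores the special tokens that a real cross-encoder tokenizer adds around the pair, and breaking ties by trimming the document is an assumption.

```python
from typing import List, Optional, Tuple


def apply_budgets(
    query_ids: List[int],
    doc_ids: List[int],
    max_query_tokens: Optional[int] = None,
    max_tokens_per_doc: Optional[int] = None,
    max_pair_tokens: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Apply the three budgets to already-tokenised inputs (None means no limit)."""
    # Head-truncate each side independently: keep only the first N tokens.
    if max_query_tokens is not None:
        query_ids = query_ids[:max_query_tokens]
    if max_tokens_per_doc is not None:
        doc_ids = doc_ids[:max_tokens_per_doc]

    # Cap the joined pair by dropping one trailing token at a time
    # from whichever side is currently longer (ties trim the document).
    if max_pair_tokens is not None:
        query_ids, doc_ids = list(query_ids), list(doc_ids)
        while len(query_ids) + len(doc_ids) > max_pair_tokens and (query_ids or doc_ids):
            longer = query_ids if len(query_ids) > len(doc_ids) else doc_ids
            longer.pop()
    return query_ids, doc_ids
```

With a 10-token query, a 100-token document, and budgets of 8 / 50 / 20, the per-side caps leave 8 and 50 tokens, and the pair cap then trims the document down to 12 so the pair totals 20.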

Configured per model at startup, since a sensible ceiling depends on the model:

  • EngineArgs.max_query_tokens / max_tokens_per_doc / max_pair_tokens
  • env vars INFINITY_MAX_QUERY_TOKENS / INFINITY_MAX_TOKENS_PER_DOC / INFINITY_MAX_PAIR_TOKENS (per model, ;-separated)
  • v2 CLI flags --max-query-tokens / --max-tokens-per-doc / --max-pair-tokens
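As a rough illustration of the per-model, `;`-separated env var form, the sketch below splits one value into a limit per model. `parse_per_model_limits` is a hypothetical helper, and treating an empty entry as "no limit" is an assumption; the PR text does not show the actual parsing rules.

```python
import os
from typing import List, Optional


def parse_per_model_limits(raw: str) -> List[Optional[int]]:
    """Split a ';'-separated value into one optional limit per model."""
    limits: List[Optional[int]] = []
    for part in raw.split(";"):
        part = part.strip()
        # Assumption: an empty entry means "no limit" for that model.
        limits.append(int(part) if part else None)
    return limits


# Reads the variable named in the PR; an unset variable yields [None].
pair_limits = parse_per_model_limits(os.environ.get("INFINITY_MAX_PAIR_TOKENS", ""))
```

Under these assumptions, a value such as `512;;1024` would describe three models where only the second one is unlimited.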

Each defaults to None (no limit). The /rerank request still accepts the same three fields (defaulting to null), but the engine clamps each requested budget to the model's startup ceiling: a client may lower a limit to trade quality for speed, but cannot raise it above the configured maximum. The startup config owns stability; the client owns precision.
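The clamping rule can be sketched as "take the minimum of whichever values are present". `clamp_to_ceiling` below is a hypothetical stand-in written from the description in this PR, not a copy of the engine's helper.

```python
from typing import Optional


def clamp_to_ceiling(requested: Optional[int], ceiling: Optional[int]) -> Optional[int]:
    """Return the effective budget: a request may lower the ceiling but never exceed it."""
    present = [value for value in (requested, ceiling) if value is not None]
    # No request and no ceiling means no limit at all.
    return min(present) if present else None
```

With a startup ceiling of 512, a request for 4096 resolves to 512, a request for 128 stays at 128, and a request that omits the field inherits 512.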

The values thread RerankInput -> engine.rerank -> AsyncEngineArray -> batch_handler.rerank, where they are bundled into a RerankLimits tuple on each ReRankSingle and reach both cross-encoder encode_pre paths (torch and optimum) via to_input(). Truncation runs in the single preprocessing thread using the model's own tokenizer (not the token-counting copy used on another thread) to avoid a data race; per-item pair tokenisation plus tokenizer.pad keeps each item's limits correct inside a mixed batch. Two-element (query, document) tuples still work and apply no truncation.

The checked-in OpenAPI spec (docs/assets/openapi.json) is regenerated so the Swagger UI lists the three fields with their null defaults.

How to verify

  • cd libs/infinity_emb && pytest tests/unit_test/test_args.py -k "clamp or reject" (ceiling/clamp logic, no model download)
  • pytest tests/unit_test/transformer/crossencoder/test_torch_crossencoder.py::test_crossencoder_rerank_limits
  • Start with a ceiling, e.g. infinity_emb v2 --model-id <reranker> --max-pair-tokens 512, then send a /rerank request with max_pair_tokens: 4096 and confirm it is clamped to 512.
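For the last step, a request could be built with the standard library as sketched below. The base URL and port, the model name, and the `model` / `query` / `documents` field names are assumptions about the deployment and request schema; only the three `max_*` fields come from this PR, which also does not specify how the clamp shows up in the response.

```python
import json
import urllib.request


def build_rerank_request(base_url: str, model: str, max_pair_tokens: int) -> urllib.request.Request:
    """Build (but do not send) a POST /rerank request that asks for an oversized pair budget."""
    payload = {
        "model": model,  # assumed field name
        "query": "what is a cross-encoder?",  # assumed field name
        "documents": ["A cross-encoder scores a query and a document jointly."],  # assumed field name
        "max_pair_tokens": max_pair_tokens,  # field added by this PR
    }
    return urllib.request.Request(
        f"{base_url}/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_rerank_request("http://localhost:7997", "<reranker>", 4096)
# With a server running, send it and inspect the response:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```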

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces configurable token budgets and limits for reranking (via RerankLimits) across the engine, batch handler, and cross-encoder backends to prevent performance issues on oversized inputs. The feedback highlights three key robustness improvements: optimizing the tokenizer in truncate_texts_to_tokens with a maximum cap to prevent potential OOMs, adding a fallback for max_length in the PyTorch cross-encoder to avoid ValueError exceptions, and using getattr with a fallback for max_position_embeddings in the Optimum cross-encoder to prevent potential AttributeError exceptions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py Outdated
Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py
Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/optimum.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf8a67b0fa


Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py Outdated
Comment thread libs/infinity_emb/infinity_emb/primitives.py Outdated
@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

Adds per-model token-budget ceilings for the cross-encoder reranker. Three startup settings (max_query_tokens, max_tokens_per_doc, max_pair_tokens) configure a server-side maximum; the same three fields on the /rerank request body let a client lower but never exceed those ceilings. Truncation happens inside encode_pre using a dedicated helper (truncate_texts_to_tokens) that pre-truncates queries and documents independently before pair encoding.

  • RerankLimits NamedTuple threads the resolved budgets from the HTTP layer all the way to the tokenizer, and its values are included in str_repr so the disk cache distinguishes capped from uncapped results for the same pair.
  • Both the torch and ONNX/optimum encode_pre paths were rewritten to accept 3-tuples (query, doc, RerankLimits) while retaining 2-tuple backward compatibility; per-item pair tokenisation followed by tokenizer.pad keeps the limits correct inside a mixed batch.
  • New env vars (INFINITY_MAX_QUERY_TOKENS / INFINITY_MAX_TOKENS_PER_DOC / INFINITY_MAX_PAIR_TOKENS) and v2 CLI flags expose the new options.
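The cache-key behaviour in the first bullet can be sketched as follows. The `RerankLimits` field names mirror the three settings, but the exact fields, the `cache_key` helper, and the suffix format are assumptions for illustration; the real `str_repr` format is not shown on this page.

```python
from typing import NamedTuple, Optional


class RerankLimits(NamedTuple):
    """Resolved budgets for one pair; None means no limit (field names assumed)."""

    max_query_tokens: Optional[int] = None
    max_tokens_per_doc: Optional[int] = None
    max_pair_tokens: Optional[int] = None


def cache_key(query: str, document: str, limits: RerankLimits = RerankLimits()) -> str:
    """Build a cache key that differs between capped and uncapped requests."""
    base = query + document
    # Keep the no-limit key identical to plain concatenation.
    if all(value is None for value in limits):
        return base
    # Hypothetical suffix format: one slot per limit, empty when unset.
    suffix = "|".join("" if value is None else str(value) for value in limits)
    return f"{base}|limits:{suffix}"
```

Because the suffix is appended to a plain concatenation, a query or document that happens to end with the same suffix text could in principle collide with a capped key, which is the kind of ambiguity the review notes for `primitives.py`.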

Confidence Score: 5/5

Safe to merge; the clamping logic, truncation helper, and cache key changes are all correct for the single-model and default-limits cases that make up the common deployment.

The core token-budget mechanism — ceiling clamping, truncate_texts_to_tokens, per-item pair encoding, RerankLimits threading — is implemented correctly. Both encoder backends guard max_pair_tokens with min(limit, model_max). The two findings are edge cases: the CLI type annotation mismatch only matters when a user needs to express a null limit for one specific model in a multi-model CLI invocation (env vars remain the correct path for that), and the cache key ambiguity requires a deliberately crafted query string to trigger.

cli.py (type annotation for the three new list options) and primitives.py (str_repr suffix format) are worth a second look before a 1.0 release, but neither blocks the feature from working correctly in normal deployments.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `libs/infinity_emb/infinity_emb/cli.py` | New CLI options declared as list[int] but default value from the env manager is [None], creating a type mismatch; CLI users also cannot express "no limit" for a specific model in a multi-model flag sequence. |
| `libs/infinity_emb/infinity_emb/primitives.py` | Adds RerankLimits NamedTuple and threads it through ReRankSingle; str_repr correctly varies the cache key when limits are non-default, but the key format (plain concatenation) inherits the existing ambiguity of query+document concatenation. |
| `libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py` | New truncate_texts_to_tokens helper; batch-level tokenisation bounded by the largest cap keeps the OOM concern in check, but the decode/re-encode round-trip means texts at exactly the cap boundary may differ slightly from their originals. |
| `libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py` | Rewrites encode_pre to handle 3-tuples; adds per-item pair tokenisation + tokenizer.pad; correctly caps pair length to min(limit, model_max) and falls back gracefully for legacy 2-tuples. |
| `libs/infinity_emb/infinity_emb/transformer/crossencoder/optimum.py` | Mirrors the torch path for the ONNX backend; per-item encoding + tokenizer.pad keeps dtype and shape consistent; model_max fallback chain is identical to torch. |
| `libs/infinity_emb/infinity_emb/engine.py` | Adds _clamp_to_ceiling helper with correct min-of-present-values semantics; ceiling clamping applied consistently in AsyncEmbeddingEngine.rerank and forwarded through AsyncEngineArray. |
| `libs/infinity_emb/infinity_emb/args.py` | Adds three new Optional[int] fields with `__post_init__` validation; correctly threads them through the zip_longest multi-model expansion in EngineArgs.create. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant Server as infinity_server (FastAPI)
    participant Engine as AsyncEmbeddingEngine
    participant Array as AsyncEngineArray
    participant BH as BatchHandler
    participant Pre as encode_pre (preprocessing thread)

    Client->>Server: "POST /rerank {query, docs, max_query_tokens?, max_tokens_per_doc?, max_pair_tokens?}"
    Server->>Engine: rerank(query, docs, max_query_tokens, max_tokens_per_doc, max_pair_tokens)
    Note over Engine: _clamp_to_ceiling(requested, EngineArgs.ceiling) for each limit
    Engine->>Array: rerank(...clamped limits...)
    Array->>BH: rerank(...clamped limits...)
    Note over BH: Build RerankLimits(clamped values), Create ReRankSingle per doc
    BH->>Pre: encode_pre(list[tuple[str,str,RerankLimits]])
    Note over Pre: truncate_texts_to_tokens for queries and docs independently, then per-item pair tokenization + tokenizer.pad
    Pre-->>BH: padded batch tensors
    BH-->>Engine: scores, usage
    Engine-->>Client: RerankResponse

Reviews (3): Last reviewed commit: "feat(rerank): add per-model token limits..." | Re-trigger Greptile

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py
Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py
vlsi force-pushed the feat/rerank-token-limits branch from cf8a67b to b4a1209 on June 17, 2026 21:17
vlsi changed the title from "feat(rerank): add per-pair token limits for cross-encoder rerank" to "feat(rerank): add per-model token limits for cross-encoder rerank" on Jun 17, 2026
A cross-encoder scores one `<s>query</s></s>document</s>` sequence per
candidate, so an oversized query or document inflates every pair and can
exhaust memory or stall the backend when a client sends no limits. Add
three token budgets that bound the scored sequence length per pair:

- max_query_tokens: head-truncate the query.
- max_tokens_per_doc: head-truncate each document (Cohere v2 compatible).
- max_pair_tokens: cap the joined pair, trimming the longest side first.

A sensible ceiling depends on the model, so the limits are configured per
model at startup -- via EngineArgs, the INFINITY_MAX_* env vars (per model,
`;`-separated), and the v2 CLI flags -- rather than as one global default.
Each defaults to None (no limit). The /rerank request still accepts the
same three fields (Cohere v2 keeps max_tokens_per_doc), defaulting to null;
the engine clamps each requested budget to the model's startup ceiling, so
a client may lower a limit to trade quality for speed but cannot raise it
above the configured maximum.

The values thread RerankInput -> engine.rerank -> AsyncEngineArray ->
batch_handler.rerank, where they are bundled into a RerankLimits tuple on
each ReRankSingle and reach both cross-encoder encode_pre paths (torch and
optimum) via to_input(). Both paths cap the joined pair at the model's
positional limit (min(max_pair_tokens, max_position_embeddings or
model_max_length)) so a budget above the model's positions never builds a
sequence it cannot process. Per-axis truncation tokenises only up to the
largest cap + 1 to bound work on oversized inputs. Truncation runs in the
single preprocessing thread using the model's own tokenizer (not the
token-counting copy used on another thread) to avoid a data race; per-item
pair tokenisation plus tokenizer.pad keeps each item's limits correct
inside a mixed batch. Two-element (query, document) tuples still work and
apply no truncation.

RerankLimits are part of ReRankSingle.str_repr so vector_disk_cache keys by
them and does not serve a capped result for a later uncapped request. The
checked-in OpenAPI spec is regenerated to list the three fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vlsi force-pushed the feat/rerank-token-limits branch from b4a1209 to 9bec690 on June 17, 2026 21:27
@vlsi

vlsi commented Jun 17, 2026

Author

Thanks for the review. Two notes first: the limits are now configured per model at startup (EngineArgs / INFINITY_MAX_* / v2 flags) and each defaults to None rather than 256/512/768 — a request can only lower a limit, never raise it above the model's configured ceiling. The findings still applied to the threading code, so I addressed them:

  • Clamp the torch pair length to the model max (P1, greptile + codex + gemini). Fixed. torch.encode_pre now uses min(max_pair_tokens, max_position_embeddings or model_max_length), matching the optimum path; an over-large or unset budget can no longer build a sequence the model cannot process in encode_core. Added a regression assertion.
  • max_position_embeddings via getattr (gemini). Fixed in both paths: getattr(config, "max_position_embeddings", None) or tokenizer.model_max_length.
  • truncation=False can OOM on huge inputs (gemini, greptile). Fixed. truncate_texts_to_tokens now tokenises with truncation=True, max_length=max(caps)+1, which bounds the work while still detecting overflow.
  • Cache keys must include the limits (codex P2). Fixed. ReRankSingle.str_repr now includes the limits when set, so vector_disk_cache no longer serves a capped result for a later uncapped request; the no-limit path is byte-for-byte unchanged.
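The first two fixes combine into one small rule, sketched below with stand-in objects. `effective_pair_length` is a hypothetical helper; the `getattr(...) or tokenizer.model_max_length` fallback and the `min(...)` clamp follow the expressions quoted above, while treating an unset budget as "use the model maximum" is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Optional


def effective_pair_length(config: Any, tokenizer: Any, max_pair_tokens: Optional[int]) -> int:
    """Resolve the pair length actually used for tokenisation."""
    # Prefer the model's positional limit; fall back to the tokenizer's own maximum.
    model_max = getattr(config, "max_position_embeddings", None) or tokenizer.model_max_length
    # Assumption: an unset budget falls back to the model maximum.
    if max_pair_tokens is None:
        return model_max
    return min(max_pair_tokens, model_max)


# Stand-ins for a model config and tokenizer, for illustration only.
config = SimpleNamespace(max_position_embeddings=512)
tokenizer = SimpleNamespace(model_max_length=8192)
length = effective_pair_length(config, tokenizer, 4096)  # resolves to 512
```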

Force-pushed as a single squashed commit.
