feat(rerank): add per-model token limits for cross-encoder rerank by vlsi · Pull Request #666 · michaelfeil/infinity

vlsi · 2026-06-17T20:48:54Z

Why

A cross-encoder scores one <s>query</s></s>document</s> sequence per candidate, so an oversized query or document inflates every pair and can exhaust memory or stall the backend when a client sends no limits. The /rerank API had no way to bound the scored sequence length, and a single global default cannot fit every reranker.

What

Add three token budgets that bound the scored sequence per pair:

- max_query_tokens: head-truncate the query.
- max_tokens_per_doc: head-truncate each document (Cohere v2 compatible).
- max_pair_tokens: cap the joined pair, trimming the longest side first.
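
A minimal sketch of how the three budgets could interact, assuming the inputs are already tokenised into lists. The function name `truncate_pair` is hypothetical, and the sketch ignores the special tokens that a real tokenizer adds around the pair, so it illustrates the trimming order rather than the PR's actual implementation.

```python
from typing import List, Optional, Tuple


def truncate_pair(
    query: List[str],
    doc: List[str],
    max_query_tokens: Optional[int] = None,
    max_tokens_per_doc: Optional[int] = None,
    max_pair_tokens: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Apply the three budgets to tokenised inputs (None means no limit)."""
    # Head truncation keeps the first N tokens of each side.
    if max_query_tokens is not None:
        query = query[:max_query_tokens]
    if max_tokens_per_doc is not None:
        doc = doc[:max_tokens_per_doc]
    # Pair cap: drop tokens from the end of whichever side is currently longer
    # (assumption: ties trim the document).
    if max_pair_tokens is not None:
        query, doc = list(query), list(doc)
        while len(query) + len(doc) > max_pair_tokens and (query or doc):
            longer = query if len(query) > len(doc) else doc
            longer.pop()
    return query, doc


q, d = truncate_pair(["q"] * 4, ["d"] * 10,
                     max_query_tokens=3, max_tokens_per_doc=8, max_pair_tokens=6)
print(len(q), len(d))
```

With these inputs the per-side caps leave 3 query tokens and 8 document tokens, and the pair cap then trims only the document until the two sides sum to 6.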

Configured per model at startup, since a sensible ceiling depends on the model:

- EngineArgs.max_query_tokens / max_tokens_per_doc / max_pair_tokens
- env vars INFINITY_MAX_QUERY_TOKENS / INFINITY_MAX_TOKENS_PER_DOC / INFINITY_MAX_PAIR_TOKENS (per model, ;-separated)
- v2 CLI flags --max-query-tokens / --max-tokens-per-doc / --max-pair-tokens

Each defaults to None (no limit). The /rerank request still accepts the same three fields (defaulting to null), but the engine clamps each requested budget to the model's startup ceiling: a client may lower a limit to trade quality for speed, but cannot raise it above the configured maximum. The startup config owns stability; the client owns precision.
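
The clamping rule described above can be sketched as follows. The name `clamp_to_ceiling` and its signature are assumptions made for illustration (the PR's helper is `_clamp_to_ceiling`, whose exact signature is not shown here); the behaviour is the "smaller of the values that are present" rule, with None meaning no limit.

```python
from typing import Optional


def clamp_to_ceiling(requested: Optional[int], ceiling: Optional[int]) -> Optional[int]:
    """Return the smaller of the values that are present; None means no limit."""
    present = [v for v in (requested, ceiling) if v is not None]
    return min(present) if present else None


# A client may lower a limit but cannot raise it above the startup ceiling.
print(clamp_to_ceiling(256, 512))    # client lowers the limit
print(clamp_to_ceiling(4096, 512))   # client asks for more than the ceiling
print(clamp_to_ceiling(None, 512))   # client sends no limit
print(clamp_to_ceiling(256, None))   # no ceiling configured
```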

The values thread RerankInput → engine.rerank → AsyncEngineArray → batch_handler.rerank, where they are bundled into a RerankLimits tuple on each ReRankSingle and reach both cross-encoder encode_pre paths (torch and optimum) via to_input(). Truncation runs in the single preprocessing thread using the model's own tokenizer (not the token-counting copy used on another thread) to avoid a data race; per-item pair tokenisation plus tokenizer.pad keeps each item's limits correct inside a mixed batch. Two-element (query, document) tuples still work and apply no truncation.
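
As a rough illustration of the tuple shapes mentioned above, the sketch below models RerankLimits as a NamedTuple and normalises both input forms. The field names mirror the three budgets but are an assumption, and `unpack_item` is a hypothetical helper rather than a function from the PR.

```python
from typing import NamedTuple, Optional, Tuple, Union


class RerankLimits(NamedTuple):
    # Assumed field names; None means "no limit" for that budget.
    max_query_tokens: Optional[int] = None
    max_tokens_per_doc: Optional[int] = None
    max_pair_tokens: Optional[int] = None


NO_LIMITS = RerankLimits()
Item = Union[Tuple[str, str], Tuple[str, str, RerankLimits]]


def unpack_item(item: Item) -> Tuple[str, str, RerankLimits]:
    """Accept legacy (query, document) tuples and 3-tuples that carry limits."""
    if len(item) == 2:
        query, doc = item
        return query, doc, NO_LIMITS  # legacy form: no truncation
    query, doc, limits = item
    return query, doc, limits


batch = [
    ("q1", "doc a"),
    ("q2", "doc b", RerankLimits(max_pair_tokens=512)),
]
for item in batch:
    print(unpack_item(item))
```

Carrying the limits on each item is what lets a single batch mix requests with different budgets.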

The checked-in OpenAPI spec (docs/assets/openapi.json) is regenerated so the Swagger UI lists the three fields with their null defaults.

How to verify

cd libs/infinity_emb && pytest tests/unit_test/test_args.py -k "clamp or reject" (ceiling/clamp logic, no model download)
pytest tests/unit_test/transformer/crossencoder/test_torch_crossencoder.py::test_crossencoder_rerank_limits
Start with a ceiling, e.g. infinity_emb v2 --model-id <reranker> --max-pair-tokens 512, then send a /rerank request with max_pair_tokens: 4096 and confirm it is clamped to 512.
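
For the last step, a request body could be built along these lines. This is a sketch under assumptions: the host and port, the `documents` field name, and the helper `build_rerank_request` are illustrative and not taken from the PR; only `max_pair_tokens` and the /rerank path come from the description above. The request is built but not sent, so the snippet runs without a server.

```python
import json
from urllib import request


def build_rerank_request(base_url, query, documents, max_pair_tokens=None):
    """Build (but do not send) a POST /rerank request with a pair budget."""
    body = {
        "query": query,
        "documents": documents,  # assumed field name
        "max_pair_tokens": max_pair_tokens,
    }
    return request.Request(
        url=f"{base_url}/rerank",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_rerank_request(
    "http://localhost:7997",  # assumed address of a locally started server
    "what is a panda?",
    ["The giant panda is a bear.", "Paris is in France."],
    max_pair_tokens=4096,  # deliberately above the 512 ceiling
)
print(req.full_url, req.data.decode("utf-8"))
# request.urlopen(req) would send it once the server is running.
```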

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces configurable token budgets and limits for reranking (via RerankLimits) across the engine, batch handler, and cross-encoder backends to prevent performance issues on oversized inputs. The feedback highlights three key robustness improvements: optimizing the tokenizer in truncate_texts_to_tokens with a maximum cap to prevent potential OOMs, adding a fallback for max_length in the PyTorch cross-encoder to avoid ValueError exceptions, and using getattr with a fallback for max_position_embeddings in the Optimum cross-encoder to prevent potential AttributeError exceptions.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf8a67b0fa


greptile-apps · 2026-06-17T20:56:46Z

Greptile Summary

Adds per-model token-budget ceilings for the cross-encoder reranker. Three startup settings (max_query_tokens, max_tokens_per_doc, max_pair_tokens) configure a server-side maximum; the same three fields on the /rerank request body let a client lower but never exceed those ceilings. Truncation happens inside encode_pre using a dedicated helper (truncate_texts_to_tokens) that pre-truncates queries and documents independently before pair encoding.

RerankLimits NamedTuple threads the resolved budgets from the HTTP layer all the way to the tokenizer, and its values are included in str_repr so the disk cache distinguishes capped from uncapped results for the same pair.
Both the torch and ONNX/optimum encode_pre paths were rewritten to accept 3-tuples (query, doc, RerankLimits) while retaining 2-tuple backward compatibility; per-item pair tokenisation followed by tokenizer.pad keeps the limits correct inside a mixed batch.
New env vars (INFINITY_MAX_QUERY_TOKENS / INFINITY_MAX_TOKENS_PER_DOC / INFINITY_MAX_PAIR_TOKENS) and v2 CLI flags expose the new options.
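
To illustrate why the limits belong in the cache key, here is a small sketch. The `cache_key` function and its `|limits=` suffix are hypothetical; the source only says that the key is built by plain concatenation and that the limits are included when set, leaving the no-limit key unchanged.

```python
from typing import NamedTuple, Optional


class RerankLimits(NamedTuple):
    # Assumed field names; None means "no limit".
    max_query_tokens: Optional[int] = None
    max_tokens_per_doc: Optional[int] = None
    max_pair_tokens: Optional[int] = None


def cache_key(query: str, document: str, limits: RerankLimits = RerankLimits()) -> str:
    """Build a key that differs between capped and uncapped requests."""
    base = query + document
    if all(v is None for v in limits):
        return base  # the no-limit path keeps the original key
    # Hypothetical suffix format; empty slots stand for unset budgets.
    suffix = ",".join("" if v is None else str(v) for v in limits)
    return base + "|limits=" + suffix


print(cache_key("q", "d"))
print(cache_key("q", "d", RerankLimits(max_pair_tokens=512)))
```

As the review notes, a suffix appended by plain concatenation can in principle collide with a crafted query string, so a production key might prefer a structured or hashed form.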

Confidence Score: 5/5

Safe to merge; the clamping logic, truncation helper, and cache key changes are all correct for the single-model and default-limits cases that make up the common deployment.

The core token-budget mechanism — ceiling clamping, truncate_texts_to_tokens, per-item pair encoding, RerankLimits threading — is implemented correctly. Both encoder backends guard max_pair_tokens with min(limit, model_max). The two findings are edge cases: the CLI type annotation mismatch only matters when a user needs to express a null limit for one specific model in a multi-model CLI invocation (env vars remain the correct path for that), and the cache key ambiguity requires a deliberately crafted query string to trigger.

cli.py (type annotation for the three new list options) and primitives.py (str_repr suffix format) are worth a second look before a 1.0 release, but neither blocks the feature from working correctly in normal deployments.

Important Files Changed

| Filename | Overview |
| --- | --- |
| libs/infinity_emb/infinity_emb/cli.py | New CLI options declared as `list[int]` but default value from the env manager is `[None]`, creating a type mismatch; CLI users also cannot express "no limit" for a specific model in a multi-model flag sequence. |
| libs/infinity_emb/infinity_emb/primitives.py | Adds `RerankLimits` NamedTuple and threads it through `ReRankSingle`; `str_repr` correctly varies the cache key when limits are non-default, but the key format (plain concatenation) inherits the existing ambiguity of query+document concatenation. |
| libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py | New `truncate_texts_to_tokens` helper; batch-level tokenisation bounded by the largest cap keeps the OOM concern in check, but the decode/re-encode round-trip means texts at exactly the cap boundary may differ slightly from their originals. |
| libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py | Rewrites `encode_pre` to handle 3-tuples; adds per-item pair tokenisation + `tokenizer.pad`; correctly caps pair length to `min(limit, model_max)` and falls back gracefully for legacy 2-tuples. |
| libs/infinity_emb/infinity_emb/transformer/crossencoder/optimum.py | Mirrors the torch path for the ONNX backend; per-item encoding + `tokenizer.pad` keeps dtype and shape consistent; `model_max` fallback chain is identical to torch. |
| libs/infinity_emb/infinity_emb/engine.py | Adds `_clamp_to_ceiling` helper with correct min-of-present-values semantics; ceiling clamping applied consistently in `AsyncEmbeddingEngine.rerank` and forwarded through `AsyncEngineArray`. |
| libs/infinity_emb/infinity_emb/args.py | Adds three new `Optional[int]` fields with `__post_init__` validation; correctly threads them through the `zip_longest` multi-model expansion in `EngineArgs.create`. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant Server as infinity_server (FastAPI)
    participant Engine as AsyncEmbeddingEngine
    participant Array as AsyncEngineArray
    participant BH as BatchHandler
    participant Pre as encode_pre (preprocessing thread)

    Client->>Server: "POST /rerank {query, docs, max_query_tokens?, max_tokens_per_doc?, max_pair_tokens?}"
    Server->>Engine: rerank(query, docs, max_query_tokens, max_tokens_per_doc, max_pair_tokens)
    Note over Engine: _clamp_to_ceiling(requested, EngineArgs.ceiling) for each limit
    Engine->>Array: rerank(...clamped limits...)
    Array->>BH: rerank(...clamped limits...)
    Note over BH: Build RerankLimits(clamped values), Create ReRankSingle per doc
    BH->>Pre: encode_pre(list[tuple[str,str,RerankLimits]])
    Note over Pre: truncate_texts_to_tokens for queries and docs independently, then per-item pair tokenization + tokenizer.pad
    Pre-->>BH: padded batch tensors
    BH-->>Engine: scores, usage
    Engine-->>Client: RerankResponse

Reviews (3): Last reviewed commit: "feat(rerank): add per-model token limits..."

A cross-encoder scores one `<s>query</s></s>document</s>` sequence per candidate, so an oversized query or document inflates every pair and can exhaust memory or stall the backend when a client sends no limits.

Add three token budgets that bound the scored sequence length per pair:

- max_query_tokens: head-truncate the query.
- max_tokens_per_doc: head-truncate each document (Cohere v2 compatible).
- max_pair_tokens: cap the joined pair, trimming the longest side first.

A sensible ceiling depends on the model, so the limits are configured per model at startup -- via EngineArgs, the INFINITY_MAX_* env vars (per model, `;`-separated), and the v2 CLI flags -- rather than as one global default. Each defaults to None (no limit).

The /rerank request still accepts the same three fields (Cohere v2 keeps max_tokens_per_doc), defaulting to null; the engine clamps each requested budget to the model's startup ceiling, so a client may lower a limit to trade quality for speed but cannot raise it above the configured maximum.

The values thread RerankInput -> engine.rerank -> AsyncEngineArray -> batch_handler.rerank, where they are bundled into a RerankLimits tuple on each ReRankSingle and reach both cross-encoder encode_pre paths (torch and optimum) via to_input(). Both paths cap the joined pair at the model's positional limit (min(max_pair_tokens, max_position_embeddings or model_max_length)) so a budget above the model's positions never builds a sequence it cannot process. Per-axis truncation tokenises only up to the largest cap + 1 to bound work on oversized inputs. Truncation runs in the single preprocessing thread using the model's own tokenizer (not the token-counting copy used on another thread) to avoid a data race; per-item pair tokenisation plus tokenizer.pad keeps each item's limits correct inside a mixed batch. Two-element (query, document) tuples still work and apply no truncation.

RerankLimits are part of ReRankSingle.str_repr so vector_disk_cache keys by them and does not serve a capped result for a later uncapped request. The checked-in OpenAPI spec is regenerated to list the three fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
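
The "largest cap + 1" idea can be sketched with a stand-in tokenizer. Everything here is illustrative: `toy_tokenize` is a whitespace splitter standing in for the model tokenizer, and `truncate_texts` is a hypothetical simplification of truncate_texts_to_tokens, not its real code. Tokenising at most max(caps) + 1 tokens is enough to tell whether a text overflows its own cap without processing the whole input.

```python
from typing import List, Optional


def toy_tokenize(text: str, max_length: Optional[int] = None) -> List[str]:
    """Whitespace tokenizer stand-in; stops after max_length tokens when given."""
    if max_length is None:
        return text.split()
    # str.split(None, n) performs at most n splits, so a long tail is left unsplit.
    return text.split(None, max_length)[:max_length]


def truncate_texts(texts: List[str], caps: List[Optional[int]]) -> List[str]:
    """Head-truncate each text to its own cap, tokenising a bounded prefix only."""
    present = [c for c in caps if c is not None]
    if not present:
        return list(texts)
    budget = max(present) + 1  # one extra token is enough to detect overflow
    out = []
    for text, cap in zip(texts, caps):
        if cap is None:
            out.append(text)
            continue
        tokens = toy_tokenize(text, max_length=budget)
        # Seeing more than `cap` tokens means the text overflows its cap.
        out.append(" ".join(tokens[:cap]) if len(tokens) > cap else text)
    return out


print(truncate_texts(["a b c d e", "a b", "a b c d e"], [2, 3, None]))
```

Texts that fit are returned untouched, which avoids a decode and re-encode round-trip for the common case.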

vlsi · 2026-06-17T21:27:39Z

Thanks for the review. Two notes first: the limits are now configured per model at startup (EngineArgs / INFINITY_MAX_* / v2 flags) and each defaults to None rather than 256/512/768 — a request can only lower a limit, never raise it above the model's configured ceiling. The findings still applied to the threading code, so I addressed them:

Clamp the torch pair length to the model max (P1, greptile + codex + gemini). Fixed. torch.encode_pre now uses min(max_pair_tokens, max_position_embeddings or model_max_length), matching the optimum path; an over-large or unset budget can no longer build a sequence the model cannot process in encode_core. Added a regression assertion.
max_position_embeddings via getattr (gemini). Fixed in both paths: getattr(config, "max_position_embeddings", None) or tokenizer.model_max_length.
truncation=False can OOM on huge inputs (gemini, greptile). Fixed. truncate_texts_to_tokens now tokenises with truncation=True, max_length=max(caps)+1, which bounds the work while still detecting overflow.
Cache keys must include the limits (codex P2). Fixed. ReRankSingle.str_repr now includes the limits when set, so vector_disk_cache no longer serves a capped result for a later uncapped request; the no-limit path is byte-for-byte unchanged.
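
The pair-length cap from the first two fixes can be sketched like this. `effective_pair_cap` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a model config and tokenizer; only the `getattr(...) or tokenizer.model_max_length` fallback and the `min(...)` are taken from the notes above.

```python
from types import SimpleNamespace
from typing import Optional


def effective_pair_cap(config, tokenizer, max_pair_tokens: Optional[int]) -> int:
    """Cap the joined pair at the model's positional limit."""
    # Fall back to the tokenizer's limit when the config lacks the attribute.
    model_max = getattr(config, "max_position_embeddings", None) or tokenizer.model_max_length
    if max_pair_tokens is None:
        return model_max  # an unset budget still respects the model limit
    return min(max_pair_tokens, model_max)


config = SimpleNamespace(max_position_embeddings=512)
tokenizer = SimpleNamespace(model_max_length=1024)
print(effective_pair_cap(config, tokenizer, 4096))               # over-large budget
print(effective_pair_cap(config, tokenizer, None))               # unset budget
print(effective_pair_cap(SimpleNamespace(), tokenizer, 4096))    # config without the attribute
```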

Force-pushed as a single squashed commit.

gemini-code-assist Bot reviewed Jun 17, 2026

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py Outdated

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/optimum.py Outdated

chatgpt-codex-connector Bot reviewed Jun 17, 2026

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py Outdated

Comment thread libs/infinity_emb/infinity_emb/primitives.py Outdated

greptile-apps Bot reviewed Jun 17, 2026

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/torch.py

Comment thread libs/infinity_emb/infinity_emb/transformer/crossencoder/__init__.py

vlsi force-pushed the feat/rerank-token-limits branch from cf8a67b to b4a1209 Compare June 17, 2026 21:17

vlsi changed the title ~~feat(rerank): add per-pair token limits for cross-encoder rerank~~ feat(rerank): add per-model token limits for cross-encoder rerank Jun 17, 2026

vlsi force-pushed the feat/rerank-token-limits branch from b4a1209 to 9bec690 Compare June 17, 2026 21:27

vlsi wants to merge 1 commit into michaelfeil:main from vlsi:feat/rerank-token-limits

1 participant
