feat(rerank): add per-model token limits for cross-encoder rerank #666
vlsi wants to merge 1 commit into
Conversation
Code Review
This pull request introduces configurable token budgets and limits for reranking (via RerankLimits) across the engine, batch handler, and cross-encoder backends to prevent performance issues on oversized inputs. The feedback highlights three key robustness improvements: optimizing the tokenizer in truncate_texts_to_tokens with a maximum cap to prevent potential OOMs, adding a fallback for max_length in the PyTorch cross-encoder to avoid ValueError exceptions, and using getattr with a fallback for max_position_embeddings in the Optimum cross-encoder to prevent potential AttributeError exceptions.
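The `getattr` fallback suggested above could look roughly like the following sketch. `resolve_pair_cap` is a hypothetical helper, not code from this PR, and the `SimpleNamespace` objects only stand in for a model config and tokenizer. The sketch also assumes that the model's positional limit applies even when no pair budget is set.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_pair_cap(config, tokenizer, max_pair_tokens: Optional[int]) -> Optional[int]:
    """Hypothetical helper: cap the pair budget at the model's positional limit."""
    # getattr with a default avoids AttributeError on configs lacking the field.
    model_max = getattr(config, "max_position_embeddings", None)
    if model_max is None:
        # Fall back to the tokenizer's own limit, if it exposes one.
        model_max = getattr(tokenizer, "model_max_length", None)
    if model_max is None:
        # Nothing is known about the model; keep the requested budget as-is.
        return max_pair_tokens
    if max_pair_tokens is None:
        # Assumption: with no budget configured, the model limit still applies.
        return model_max
    return min(max_pair_tokens, model_max)


# Stand-ins for a real config and tokenizer (illustrative values only).
config = SimpleNamespace(max_position_embeddings=512)
tokenizer = SimpleNamespace(model_max_length=1024)
print(resolve_pair_cap(config, tokenizer, 4096))
```

A budget above the model's positions is reduced to the positional limit, so the encoder never builds a sequence it cannot process.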
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cf8a67b0fa
Greptile Summary
Adds per-model token-budget ceilings for the cross-encoder reranker. Three startup settings (
Confidence Score: 5/5
Safe to merge; the clamping logic, truncation helper, and cache key changes are all correct for the single-model and default-limits cases that make up the common deployment.
The core token-budget mechanism (ceiling clamping, truncate_texts_to_tokens, per-item pair encoding, RerankLimits threading) is implemented correctly. Both encoder backends guard max_pair_tokens with min(limit, model_max). The two findings are edge cases: the CLI type annotation mismatch only matters when a user needs to express a null limit for one specific model in a multi-model CLI invocation (env vars remain the correct path for that), and the cache key ambiguity requires a deliberately crafted query string to trigger.
cli.py (type annotation for the three new list options) and primitives.py (str_repr suffix format) are worth a second look before a 1.0 release, but neither blocks the feature from working correctly in normal deployments.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Client
participant Server as infinity_server (FastAPI)
participant Engine as AsyncEmbeddingEngine
participant Array as AsyncEngineArray
participant BH as BatchHandler
participant Pre as encode_pre (preprocessing thread)
Client->>Server: "POST /rerank {query, docs, max_query_tokens?, max_tokens_per_doc?, max_pair_tokens?}"
Server->>Engine: rerank(query, docs, max_query_tokens, max_tokens_per_doc, max_pair_tokens)
Note over Engine: _clamp_to_ceiling(requested, EngineArgs.ceiling) for each limit
Engine->>Array: rerank(...clamped limits...)
Array->>BH: rerank(...clamped limits...)
Note over BH: Build RerankLimits(clamped values), Create ReRankSingle per doc
BH->>Pre: encode_pre(list[tuple[str,str,RerankLimits]])
Note over Pre: truncate_texts_to_tokens for queries and docs independently, then per-item pair tokenization + tokenizer.pad
Pre-->>BH: padded batch tensors
BH-->>Engine: scores, usage
Engine-->>Client: RerankResponse
Reviews (3): Last reviewed commit: "feat(rerank): add per-model token limits..."
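As a rough illustration of the `encode_pre(list[tuple[str,str,RerankLimits]])` step in the diagram, the sketch below shows one way an item could carry its own limits while plain two-element tuples keep working. The field layout of `RerankLimits` is assumed from the three request fields, and `split_item` is a hypothetical helper rather than the PR's code.

```python
from typing import NamedTuple, Optional, Tuple, Union


class RerankLimits(NamedTuple):
    """Assumed shape of the limits tuple; fields mirror the request fields."""

    max_query_tokens: Optional[int] = None
    max_tokens_per_doc: Optional[int] = None
    max_pair_tokens: Optional[int] = None


PairItem = Union[Tuple[str, str], Tuple[str, str, RerankLimits]]


def split_item(item: PairItem) -> Tuple[str, str, RerankLimits]:
    """Hypothetical helper: accept both tuple shapes used for rerank inputs."""
    if len(item) == 2:
        # Two-element tuples carry no limits, so no truncation is applied.
        query, doc = item
        return query, doc, RerankLimits()
    query, doc, limits = item
    return query, doc, limits


# Each item in a mixed batch keeps its own limits.
batch = [("q1", "doc a"), ("q2", "doc b", RerankLimits(max_pair_tokens=512))]
for query, doc, limits in map(split_item, batch):
    print(query, limits.max_pair_tokens)
```

Carrying the limits on each item is what lets a single batch mix requests with different budgets.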
cf8a67b to b4a1209
A cross-encoder scores one `<s>query</s></s>document</s>` sequence per candidate, so an oversized query or document inflates every pair and can exhaust memory or stall the backend when a client sends no limits.

Add three token budgets that bound the scored sequence length per pair:
- max_query_tokens: head-truncate the query.
- max_tokens_per_doc: head-truncate each document (Cohere v2 compatible).
- max_pair_tokens: cap the joined pair, trimming the longest side first.

A sensible ceiling depends on the model, so the limits are configured per model at startup -- via EngineArgs, the INFINITY_MAX_* env vars (per model, `;`-separated), and the v2 CLI flags -- rather than as one global default. Each defaults to None (no limit).

The /rerank request still accepts the same three fields (Cohere v2 keeps max_tokens_per_doc), defaulting to null; the engine clamps each requested budget to the model's startup ceiling, so a client may lower a limit to trade quality for speed but cannot raise it above the configured maximum.

The values thread RerankInput -> engine.rerank -> AsyncEngineArray -> batch_handler.rerank, where they are bundled into a RerankLimits tuple on each ReRankSingle and reach both cross-encoder encode_pre paths (torch and optimum) via to_input(). Both paths cap the joined pair at the model's positional limit (min(max_pair_tokens, max_position_embeddings or model_max_length)) so a budget above the model's positions never builds a sequence it cannot process. Per-axis truncation tokenises only up to the largest cap + 1 to bound work on oversized inputs.

Truncation runs in the single preprocessing thread using the model's own tokenizer (not the token-counting copy used on another thread) to avoid a data race; per-item pair tokenisation plus tokenizer.pad keeps each item's limits correct inside a mixed batch. Two-element (query, document) tuples still work and apply no truncation.

RerankLimits are part of ReRankSingle.str_repr so vector_disk_cache keys by them and does not serve a capped result for a later uncapped request. The checked-in OpenAPI spec is regenerated to list the three fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
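The pair budget described in the commit message ("trimming the longest side first", capped at the model's positional limit) can be sketched on plain token-id lists. This is a minimal illustration under stated assumptions: `trim_pair` is a hypothetical name, the four special tokens match the `<s>query</s></s>document</s>` layout, and on a tie the document is trimmed first, which the source does not specify.

```python
from typing import List, Optional, Tuple


def trim_pair(
    query_ids: List[int],
    doc_ids: List[int],
    max_pair_tokens: Optional[int],
    model_max: int,
    num_special: int = 4,
) -> Tuple[List[int], List[int]]:
    """Hypothetical sketch: head-truncate a pair so it fits the budget."""
    # Never exceed the model's positional limit, whatever the budget says.
    budget = model_max if max_pair_tokens is None else min(max_pair_tokens, model_max)
    # Reserve room for the special tokens that frame the joined pair.
    room = max(budget - num_special, 0)
    q_len, d_len = len(query_ids), len(doc_ids)
    while q_len + d_len > room:
        # Drop one token from whichever side is currently longer
        # (assumption: the document loses the tie).
        if d_len >= q_len:
            d_len -= 1
        else:
            q_len -= 1
    # Head truncation keeps the leading tokens of each side.
    return query_ids[:q_len], doc_ids[:d_len]


query, doc = trim_pair(list(range(10)), list(range(100)), max_pair_tokens=24, model_max=512)
print(len(query), len(doc))
```

With a budget of 24 tokens and 4 reserved for special tokens, the 100-token document shrinks until both sides fit in the remaining 20 positions, while the short query is left untouched.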
b4a1209 to 9bec690
Thanks for the review. Two notes first: the limits are now configured per model at startup (
Force-pushed as a single squashed commit.
Why
A cross-encoder scores one `<s>query</s></s>document</s>` sequence per candidate, so an oversized query or document inflates every pair and can exhaust memory or stall the backend when a client sends no limits. The `/rerank` API had no way to bound the scored sequence length, and a single global default cannot fit every reranker.
What
Add three token budgets that bound the scored sequence per pair:
- `max_query_tokens`: head-truncate the query.
- `max_tokens_per_doc`: head-truncate each document (Cohere v2 compatible).
- `max_pair_tokens`: cap the joined pair, trimming the longest side first.
Configured per model at startup, since a sensible ceiling depends on the model:
- `EngineArgs.max_query_tokens` / `max_tokens_per_doc` / `max_pair_tokens`
- `INFINITY_MAX_QUERY_TOKENS` / `INFINITY_MAX_TOKENS_PER_DOC` / `INFINITY_MAX_PAIR_TOKENS` (per model, `;`-separated)
- `v2` CLI flags `--max-query-tokens` / `--max-tokens-per-doc` / `--max-pair-tokens`
Each defaults to `None` (no limit). The `/rerank` request still accepts the same three fields (defaulting to `null`), but the engine clamps each requested budget to the model's startup ceiling: a client may lower a limit to trade quality for speed, but cannot raise it above the configured maximum. The startup config owns stability; the client owns precision.
The values thread `RerankInput` → `engine.rerank` → `AsyncEngineArray` → `batch_handler.rerank`, where they are bundled into a `RerankLimits` tuple on each `ReRankSingle` and reach both cross-encoder `encode_pre` paths (torch and optimum) via `to_input()`. Truncation runs in the single preprocessing thread using the model's own tokenizer (not the token-counting copy used on another thread) to avoid a data race; per-item pair tokenisation plus `tokenizer.pad` keeps each item's limits correct inside a mixed batch. Two-element `(query, document)` tuples still work and apply no truncation.
The checked-in OpenAPI spec (`docs/assets/openapi.json`) is regenerated so the Swagger UI lists the three fields with their `null` defaults.
How to verify
- `cd libs/infinity_emb && pytest tests/unit_test/test_args.py -k "clamp or reject"` (ceiling/clamp logic, no model download)
- `pytest tests/unit_test/transformer/crossencoder/test_torch_crossencoder.py::test_crossencoder_rerank_limits`
- `infinity_emb v2 --model-id <reranker> --max-pair-tokens 512`, then send a `/rerank` request with `max_pair_tokens: 4096` and confirm it is clamped to 512.
🤖 Generated with Claude Code
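The clamping rule in the description (a client may lower a limit but not raise it above the startup ceiling) reduces to a small function. The sketch below is an assumption about the behaviour, not the PR's `_clamp_to_ceiling` implementation; in particular, it assumes that a missing request value inherits the ceiling and that a missing ceiling leaves the request unchanged.

```python
from typing import Optional


def clamp_to_ceiling(requested: Optional[int], ceiling: Optional[int]) -> Optional[int]:
    """Sketch of the clamp: None means "no limit" on either side."""
    if ceiling is None:
        # No startup ceiling configured: the client's value (or None) stands.
        return requested
    if requested is None:
        # Assumption: a request without a limit inherits the startup ceiling.
        return ceiling
    # The client can lower the limit but never raise it above the ceiling.
    return min(requested, ceiling)


# Mirrors the manual check above: ceiling 512, request 4096.
print(clamp_to_ceiling(4096, 512))  # 512
print(clamp_to_ceiling(128, 512))   # 128
```

Keeping the rule in one pure function makes it testable without downloading a model, in the same spirit as the unit test listed in the verification steps.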