Improve Evaluation Reliability: Retry Logic, Voting Mechanism, and Token Optimization #9

@devcomfort

Background

During the evaluation of 23 judge models (PR #8), several reliability issues were identified that warrant future investigation.


1. High Rate of Invalid Responses

Analysis of None/invalid judgments revealed the following breakdown:

| Failure Type | Proportion | Cause |
|---|---|---|
| Empty Response | ~80% | API failures, rate limiting, network errors |
| Truncated Response | ~15% | Token limits, timeouts |
| Format Non-compliance | ~5% | Model not following [[A>B]] output format |

Example: minimax-m2 had only 50% valid responses; 260 of its 311 failures were completely empty responses.
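
The three failure types above could be told apart with a small classifier. This is a minimal sketch, not the project's actual code: the `classify` function, the `Failure` enum, and the verdict label set are hypothetical, and it assumes the API reports an OpenAI-style `finish_reason` of `"length"` when the token limit cuts a response off.

```python
import re
from enum import Enum
from typing import Optional

# Assumed label set; only [[A>B]] is named in this issue.
VERDICT_RE = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")


class Failure(Enum):
    EMPTY = "empty"          # nothing came back (API failure, rate limit, network)
    TRUNCATED = "truncated"  # response was cut off before a verdict appeared
    FORMAT = "format"        # complete response, but no [[...]] verdict


def classify(text: Optional[str], finish_reason: Optional[str] = None) -> Optional[Failure]:
    """Return None for a valid response, otherwise the failure type."""
    if text is None or not text.strip():
        return Failure.EMPTY
    if VERDICT_RE.search(text):
        return None  # a verdict is present, so the judgment is usable
    if finish_reason == "length":  # assumption: OpenAI-style finish reason
        return Failure.TRUNCATED
    return Failure.FORMAT
```

Logging the returned `Failure` per request would make it possible to recompute the breakdown above for each model.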


2. Proposed Solutions

2.1 Retry Logic Enhancement

  • Differentiate failure types: Empty responses (API-level failures) should trigger immediate retry, while truncated responses may need increased max_tokens.
  • Exponential backoff: Implement smarter retry strategies for rate-limited requests.
  • Consider local inference: Integrate ktransformers or transformers for models that can run locally, eliminating API reliability issues.
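
A rough sketch of how the first two bullets might fit together. Everything here is an assumption for illustration: `judge_with_retry`, the `RateLimited` exception, and the `call` signature (takes `max_tokens`, returns `(text, finish_reason)`) are hypothetical, and the defaults are placeholders rather than tuned values.

```python
import random
import time
from typing import Callable, Optional, Tuple


class RateLimited(Exception):
    """Hypothetical error the API wrapper raises when it is rate limited."""


def judge_with_retry(
    call: Callable[[int], Tuple[str, Optional[str]]],
    max_tokens: int = 2048,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    token_ceiling: int = 8192,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry one judgment, choosing the strategy from the failure type."""
    for attempt in range(max_attempts):
        try:
            text, finish_reason = call(max_tokens)
        except RateLimited:
            # Rate limited: exponential backoff with jitter before retrying.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
            continue
        if not text or not text.strip():
            continue  # empty response: retry immediately
        if finish_reason == "length":  # assumption: OpenAI-style finish reason
            # Truncated: retry with a larger output budget, up to a ceiling.
            max_tokens = min(max_tokens * 2, token_ceiling)
            continue
        return text
    return None  # all attempts failed; the caller records an invalid judgment
```

Passing `sleep` as a parameter keeps the function testable without real delays.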

2.2 Voting / Ensemble Mechanism

  • Multiple submissions: Allow each judgment to be submitted N times and use majority voting.
  • Confidence scoring: Weight judgments by model confidence or response completeness.
  • Cross-validation: Compare results across multiple judge models for consensus.
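
Majority voting over N submissions could look like the sketch below. The `majority_vote` helper and its tie-handling policy (a tie yields no verdict) are assumptions for illustration; verdicts are represented as strings such as `"A>B"`, with `None` for invalid responses.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(verdicts: Iterable[Optional[str]], min_valid: int = 1) -> Optional[str]:
    """Return the most common valid verdict, or None if there is no clear winner."""
    valid = [v for v in verdicts if v is not None]  # drop invalid judgments
    if not valid or len(valid) < min_valid:
        return None
    ranked = Counter(valid).most_common()
    # Assumed policy: a tie between the top two verdicts counts as no verdict.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

The same function would also work for cross-model consensus by passing one verdict per judge model instead of one per repeated submission.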

2.3 Token Limit Optimization

  • Set appropriate max_tokens: Current responses sometimes include excessive reasoning. Setting a reasonable cap (e.g., 2048-4096 tokens) could:
    • Reduce latency and API costs
    • Force models to be more concise
    • Reduce truncation, provided the prompt also tells the model to finish within the limit (a cap alone cuts long responses off)
  • Structured output: Consider using JSON mode or structured outputs to ensure the [[A>B]] verdict is always included.
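
If JSON mode were adopted, verdict extraction might try the structured form first and fall back to the bracketed format. This is only a sketch: the `extract_verdict` helper, the `"verdict"` key, and the allowed label set are hypothetical and not taken from the existing code.

```python
import json
import re
from typing import Optional

ALLOWED = {"A>>B", "A>B", "A=B", "B>A", "B>>A"}  # assumed label set
BRACKET_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def extract_verdict(text: str) -> Optional[str]:
    """Read the verdict from a JSON object if possible, else from [[...]] markers."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        verdict = payload.get("verdict")  # hypothetical key name
        if isinstance(verdict, str) and verdict in ALLOWED:
            return verdict
    # Fallback: take the last bracketed verdict, since reasoning usually comes first.
    for candidate in reversed(BRACKET_RE.findall(text)):
        if candidate in ALLOWED:
            return candidate
    return None
```

Keeping the bracket fallback means models without JSON-mode support can still be evaluated with the same parser.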

3. Recommendations

  1. Implement retry with failure classification: Retry empty responses immediately; increase token limit for truncated responses.
  2. Add max_tokens configuration: Allow users to specify output token limits per model.
  3. Explore local inference options: For reproducibility and cost savings, support local model execution via transformers or ktransformers.
  4. Voting mechanism: Implement optional N-way voting for higher-stakes evaluations.
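
One possible shape for the per-model settings these recommendations imply is sketched below. The `JudgeConfig` fields, the `config_for` helper, and all numeric values are illustrative assumptions, not existing options.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JudgeConfig:
    max_tokens: int = 4096   # output token cap for the judge response
    max_attempts: int = 3    # retry budget per judgment
    n_votes: int = 1         # 1 disables voting; N > 1 enables N-way voting
    local: bool = False      # run via local inference instead of a remote API


DEFAULT = JudgeConfig()

# Placeholder override for a model with a low Valid/Total ratio.
OVERRIDES: Dict[str, JudgeConfig] = {
    "minimax-m2": JudgeConfig(max_attempts=5, n_votes=3),
}


def config_for(model: str) -> JudgeConfig:
    """Return the model-specific config, falling back to the defaults."""
    return OVERRIDES.get(model, DEFAULT)
```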

⚠️ These improvements are expected to raise the Valid/Total ratio for models where it is currently low.
