Improve Evaluation Reliability: Retry Logic, Voting Mechanism, and Token Optimization

## Background

During the evaluation of 23 judge models (PR #8), several reliability issues were identified that warrant future investigation.

---

## 1. High Rate of Invalid Responses

Analysis of None/invalid judgments revealed the following breakdown:

| Failure Type | Proportion | Cause |
|--------------|------------|-------|
| Empty Response | ~80% | API failures, rate limiting, network errors |
| Truncated Response | ~15% | Token limits, timeouts |
| Format Non-compliance | ~5% | Model not following `[[A>B]]` output format |

**Example**: `minimax-m2` had only 50% valid responses; 260 of its 311 failures (about 84%) were completely empty responses.
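The three failure types in the table could be told apart with a small classifier like the sketch below. It is not the project's code: the `classify` function and the `Failure` enum are hypothetical, the `finish_reason == "length"` check assumes an OpenAI-style chat API, and every verdict label other than `[[A>B]]` is an assumption about the judge prompt.

```python
import re
from enum import Enum
from typing import Optional

# Assumed verdict labels; only [[A>B]] is named in this issue.
VERDICT_RE = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")


class Failure(Enum):
    EMPTY = "empty"          # nothing came back (API failure, rate limit, network error)
    TRUNCATED = "truncated"  # output was cut off before a verdict appeared
    FORMAT = "format"        # output finished but contains no recognizable verdict


def classify(text: Optional[str], finish_reason: Optional[str] = None) -> Optional[Failure]:
    """Return None for a valid judgment, otherwise the failure type."""
    if text is None or not text.strip():
        return Failure.EMPTY
    if VERDICT_RE.search(text):
        return None
    # Assumption: the API reports "length" when max_tokens was hit.
    if finish_reason == "length":
        return Failure.TRUNCATED
    return Failure.FORMAT
```

Logging the returned `Failure` value per request would make it possible to recompute the breakdown above for each judge model.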

---

## 2. Proposed Solutions

### 2.1 Retry Logic Enhancement
- **Differentiate failure types**: Empty responses (API-level failures) should trigger immediate retry, while truncated responses may need increased `max_tokens`.
- **Exponential backoff**: For rate-limited requests, wait progressively longer between retries (e.g., doubling the delay on each attempt) instead of retrying at a fixed interval.
- **Consider local inference**: Integrate `ktransformers` or `transformers` for models that can run locally, eliminating API reliability issues.
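A minimal sketch of the first two bullets, under stated assumptions: `judge_with_retry` and its parameters are hypothetical, `call` stands in for whatever function sends the request and returns `(text, finish_reason)`, and `"length"` as the truncation signal assumes an OpenAI-style API. Because an empty response may itself be a symptom of rate limiting, this sketch waits with exponential backoff before retrying empties rather than retrying instantly; truncated responses are retried at once with a doubled `max_tokens`.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[Optional[str], Optional[str]]  # (text, finish_reason)


def judge_with_retry(
    call: Callable[[int], Response],
    max_attempts: int = 4,
    max_tokens: int = 2048,
    token_ceiling: int = 8192,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Retry a judge call, handling empty and truncated responses differently."""
    for attempt in range(max_attempts):
        try:
            text, finish_reason = call(max_tokens)
        except Exception:
            # Sketch only: treat any API or network error as an empty response.
            text, finish_reason = None, None

        if text and text.strip():
            if finish_reason != "length":
                return text
            # Truncated: raise the token budget and retry without waiting.
            max_tokens = min(max_tokens * 2, token_ceiling)
            continue

        # Empty: back off exponentially, with jitter, before the next attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    return None
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real delays.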

### 2.2 Voting / Ensemble Mechanism
- **Multiple submissions**: Allow each judgment to be submitted N times and use majority voting.
- **Confidence scoring**: Weight judgments by model confidence or response completeness.
- **Cross-validation**: Compare results across multiple judge models for consensus.
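The majority-voting idea could look roughly like the sketch below. The `majority_vote` helper is hypothetical, verdicts are assumed to be plain strings such as `"A>B"`, and invalid samples are assumed to be represented as `None`. Treating a tie as "no verdict" is one possible policy, not a settled design.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(verdicts: Iterable[Optional[str]], min_valid: int = 1) -> Optional[str]:
    """Return the most common valid verdict, or None on a tie or too few valid votes."""
    valid = [v for v in verdicts if v is not None]
    if len(valid) < min_valid:
        return None
    ranked = Counter(valid).most_common()
    # A tie between the top two verdicts yields no decision.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With an odd N and two possible outcomes a tie can only arise when some samples are invalid, which is one reason to pair voting with the retry logic rather than use it as a replacement.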

### 2.3 Token Limit Optimization
- **Set appropriate `max_tokens`**: Current responses sometimes include excessive reasoning. Setting a reasonable cap (e.g., 2048-4096 tokens) could:
  - Reduce latency and API costs
  - Force models to be more concise
  - Reduce truncation, provided the cap still leaves room for the model to finish its reasoning and emit the verdict
- **Structured output**: Consider using JSON mode or structured outputs to ensure the `[[A>B]]` verdict is always included.
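If structured output were adopted, verdict extraction might look like the following sketch. The JSON shape (`{"reasoning": ..., "verdict": ...}`), the `extract_verdict` helper, and the set of allowed labels are all assumptions for illustration; only the `[[A>B]]` bracket format comes from this issue. The function falls back to the bracket format so that models without JSON mode can still be parsed.

```python
import json
import re
from typing import Optional

# Assumed label set; only "A>B" is named in this issue.
ALLOWED = {"A>>B", "A>B", "A=B", "B>A", "B>>A"}
BRACKET_RE = re.compile(r"\[\[(.+?)\]\]")


def extract_verdict(raw: Optional[str]) -> Optional[str]:
    """Read the verdict from a JSON object, else from the last [[...]] marker."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = None

    if isinstance(data, dict):
        verdict = data.get("verdict")  # hypothetical field name
        return verdict if isinstance(verdict, str) and verdict in ALLOWED else None

    # Fallback: free-text response; take the last recognized bracketed verdict.
    matches = [m for m in BRACKET_RE.findall(raw or "") if m in ALLOWED]
    return matches[-1] if matches else None
```

Taking the last bracketed match rather than the first is a design choice: a model that quotes the format while reasoning usually states its actual verdict at the end.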

---

## 3. Recommendations

1. **Implement retry with failure classification**: Retry empty responses immediately; increase token limit for truncated responses.
2. **Add `max_tokens` configuration**: Allow users to specify output token limits per model.
3. **Explore local inference options**: For reproducibility and cost savings, support local model execution via `transformers` or `ktransformers`.
4. **Voting mechanism**: Implement optional N-way voting for higher-stakes evaluations.

---

> ⚠️ These improvements should most benefit models that currently have low Valid/Total ratios.
