TraceLens is a friendly evaluation and regression-testing framework for AI agents. It turns agent runs into inspectable traces, graded outcomes, baseline comparisons, and CI-ready reliability signals.
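迹镜 (TraceLens) is an evaluation and regression-detection framework for AI agents. It turns each agent run into an observable trace, a gradable result, a comparable baseline, and a reliability signal that can be used in CI.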
Agents are non-deterministic. Unit tests are not enough. TraceLens helps teams capture agent traces, grade outcomes, compare against baselines, and block regressions in CI.
Use it when you need to answer questions like:
- Did this agent produce the right outcome, not just run without crashing?
- Is a flaky success still a real capability after 3-5 attempts?
- Did a prompt, model, tool, or infra change regress a baseline?
- Can CI block unsafe or lower-quality agent behavior before it ships?
TraceLens is published on PyPI for normal use:
# Recommended: uv
uv pip install tracelens
# Or: plain pip
pip install tracelens

For the repository examples and local development tools:
git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"

python examples/hello_world.py
tracelens report --results examples/reports/hello_world_report.json --format markdown

Expected first output:
tracelens hello-world
--------------------
trials run : 9
pass rate : 100%
report json: examples/reports/hello_world_report.json
sample md : examples/reports/hello_world_report.md
The checked-in sample report is here:
examples/reports/hello_world_report.md.
It shows the concrete pieces a real eval needs: tasks, trials, pass@k,
pass^k, graders, baseline comparison, regression result, and CI summary.
TraceLens provides a unified evaluation methodology for AI agent projects. It supports both subjective evaluations (LLM-as-judge for quality assessment) and objective evaluations (deterministic metrics like schema validity, tool-use constraints, latency, budget, or domain-specific scores).
src/tracelens/
├── core/                    # Abstract interfaces
│   ├── task.py              # Task, TaskLoader, EvalSet
│   ├── trial.py             # Trial, TrialBatch execution model
│   ├── grader.py            # Grader ABCs (CodeGrader, LLMGrader, CompositeGrader)
│   ├── transcript.py        # Agent execution logging
│   ├── decision_spec.py     # Reproducibility fingerprinting
│   └── outcome.py           # Grading results
├── execution/               # Trial runner
│   ├── runner.py            # EvaluationRunner - parallel/concurrent execution
│   ├── agent_adapter.py     # AgentAdapter ABC, SimpleAdapter
│   └── registry.py          # Plugin loading via dotted import paths
├── statistics/              # Non-determinism handling
│   ├── pass_at_k.py         # Capability ceiling (pass@k)
│   ├── consistency.py       # Reliability (pass^k)
│   └── inference.py         # Bootstrap CI, significance testing
├── baselines/               # Regression detection
│   ├── manager.py           # Baseline storage, promotion semantics
│   └── comparison.py        # RegressionDetector, severity levels
├── reporting/               # Output
│   └── generator.py         # ReportGenerator (markdown, CI summary, HTML)
└── cli/                     # Command-line interface
    └── main.py              # tracelens run / tracelens report
Planned modules: human_eval/ (sample selection, LLM-human reconciliation) is designed but not yet implemented.
A Task defines a single evaluation test case:
from tracelens import Task
task = Task(
    name="Portfolio website decomposition",
    input_data={
        "goal": "Build a personal portfolio website",
        "user_context": {"experience": "beginner", "hours_per_week": 15},
    },
    category="programming",
    tags=["web", "beginner"],
)

Graders evaluate agent outputs. There are two main types:
CodeGrader - For deterministic metrics:
from tracelens import CodeGrader

class SharpeGrader(CodeGrader):
    def compute_metrics(self, transcript, task):
        returns = transcript.final_output["returns"]
        # calculate_sharpe_ratio is a user-supplied helper, not part of TraceLens
        return {"sharpe_ratio": calculate_sharpe_ratio(returns)}

    def determine_pass(self, metrics, task):
        passed = metrics["sharpe_ratio"] >= 1.0
        # Normalize and clamp to [0, 1]: a Sharpe ratio of 2.0 or more maps to 1.0
        score = max(0.0, min(metrics["sharpe_ratio"] / 2.0, 1.0))
        return passed, score

LLMGrader - For subjective quality (planning, summarisation, helpfulness):
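import json

from tracelens import LLMGrader

class SpecificityGrader(LLMGrader):
    def build_grading_prompt(self, transcript, task):
        return f"""Evaluate specificity of this decomposition:
{transcript.final_output}
Score 1-10 on: concrete actions, quantifiable targets, named resources
Reply with JSON: {{"score": <1-10>, "feedback": "<text>"}}
"""

    def parse_llm_response(self, response, task):
        # Assumes the judge returns the JSON object requested in the prompt
        data = json.loads(response)
        score = float(data["score"]) / 10.0  # Normalize 1-10 to 0-1
        passed = score >= 0.7  # Illustrative pass threshold; tune per project
        metrics = {"specificity": score}
        feedback = data.get("feedback", "")
        return passed, score, metrics, feedback

A Trial represents a single execution of a Task: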
from tracelens import Trial, TrialStatus
trial = Trial(
    task_id=task.task_id,
    run_index=0,
    total_runs=5,  # For pass@k
    status=TrialStatus.COMPLETED,
    transcript=transcript,
    outcomes=[outcome1, outcome2],
)

pass@k - Probability of at least one success in k attempts:
- Use for capability evaluation (can the agent solve this at all?)
- Higher k = higher pass@k (more chances to succeed)
pass^k - Probability of all k attempts succeeding:
- Use for reliability evaluation (is the agent consistent?)
- Higher k = lower pass^k (harder to pass every time)
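
To make the two definitions concrete, here is a minimal standalone sketch. It is not TraceLens's implementation: the function names `pass_at_k_estimate` and `pass_pow_k_estimate` are hypothetical, the pass@k formula is the commonly used combinatorial estimator over n recorded trials with c successes, and pass^k is approximated by raising the observed success rate to the k-th power (which assumes attempts are independent).

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from n trials with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures were observed, so any k-subset contains a success.
        return 1.0
    # 1 minus the probability that a random k-subset of the n trials is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed), assuming independent attempts at rate c/n."""
    if not (0 <= c <= n and n > 0 and k >= 1):
        raise ValueError("require 0 <= c <= n, n > 0, k >= 1")
    return (c / n) ** k
```

With 7 successes in 10 trials, the capability estimate rises with k while the reliability estimate falls, which is why the two metrics answer different questions.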
from tracelens.statistics import pass_at_k, pass_to_k
# Capability: can it succeed at least once in 5 tries?
capability = pass_at_k(n=10, c=7, k=5) # 0.99+
# Reliability: will it succeed every time?
reliability = pass_to_k(results=[True, True, False, True, True], k=3)  # 0.33

DecisionSpec captures all parameters affecting agent behavior for reproducibility. The fingerprint is a SHA-256 hash of the entire configuration.
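
As an illustration of why a configuration hash helps with reproducibility tracking, the sketch below fingerprints a plain dictionary. It is an assumption-laden example, not the DecisionSpec implementation: `config_fingerprint` is a hypothetical helper, and serializing to canonical JSON with sorted keys before hashing is one possible way to make the hash independent of key order.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest of a JSON-serializable configuration."""
    # Sorted keys and fixed separators give the same bytes for equal configs.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config_a = {"model_id": "gpt-4-turbo", "temperature": 0.7, "seed": 42}
config_b = {"seed": 42, "temperature": 0.7, "model_id": "gpt-4-turbo"}  # same values, new order
config_c = {"model_id": "gpt-4-turbo", "temperature": 0.2, "seed": 42}  # temperature changed

print(config_fingerprint(config_a)[:16])
```

Two runs with identical settings share a fingerprint, while any change to a parameter (here, the temperature) produces a different one, so results can be grouped or compared by fingerprint.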
from tracelens import Transcript
from tracelens.core.decision_spec import DecisionSpec, ModelConfig, AgentSpec

# Capture agent configuration
decision_spec = DecisionSpec(
    model=ModelConfig(
        model_id="gpt-4-turbo",
        temperature=0.7,
        max_tokens=4096,
    ),
    agent=AgentSpec(
        agent_id="goal-decomposer-v2",
        version="1.2.3",
        git_commit="abc123",
    ),
    global_seed=42,
)

# Get fingerprint for reproducibility tracking
print(f"Fingerprint: {decision_spec.fingerprint[:16]}...")

# Attach to transcript for full reproducibility
transcript = Transcript(
    task_id="task-1",
    final_output={"result": "..."},
    decision_spec=decision_spec,
)

Graders can have two roles in composite evaluation:
- MUST_PASS: Safety/constraint graders. Any failure = trial fails.
- SCORE_CONTRIBUTOR: Quality graders. Contribute to weighted average.
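
The interaction between the two roles can be sketched in a few lines of plain Python. This is a hedged illustration rather than the CompositeGrader implementation: `GraderResult`, `combine`, and the `pass_threshold` parameter are hypothetical names, and the rule that a composite also needs a minimum weighted score to pass is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    must_pass: bool  # True for MUST_PASS graders, False for SCORE_CONTRIBUTOR
    passed: bool
    score: float     # Normalized to [0, 1]
    weight: float


def combine(results: list[GraderResult], pass_threshold: float = 0.5) -> tuple[bool, float]:
    """Weighted-average score, gated by every must-pass grader."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    score = sum(r.score * r.weight for r in results) / total_weight
    # Any failed must-pass grader fails the trial regardless of the score.
    gate_ok = all(r.passed for r in results if r.must_pass)
    return gate_ok and score >= pass_threshold, score


failed_safety = [
    GraderResult(must_pass=True, passed=False, score=0.0, weight=0.2),
    GraderResult(must_pass=False, passed=True, score=0.9, weight=0.8),
]
passed_safety = [
    GraderResult(must_pass=True, passed=True, score=1.0, weight=0.2),
    GraderResult(must_pass=False, passed=True, score=0.9, weight=0.8),
]
```

In the first case the weighted score is still high, but the trial fails because the safety gate failed; in the second, the same quality score passes.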
import asyncio

from tracelens import CompositeGrader, GraderRole, GraderConfig

# Safety grader - must pass or entire trial fails
# (FormatValidationGrader stands in for your own grader class)
safety_config = GraderConfig(role=GraderRole.MUST_PASS)
safety_grader = FormatValidationGrader("format", config=safety_config)

# Quality grader - contributes to score average
quality_config = GraderConfig(role=GraderRole.SCORE_CONTRIBUTOR)
quality_grader = SpecificityGrader("specificity", config=quality_config)

# Composite: safety failure = trial failure, quality affects score
composite = CompositeGrader(
    grader_id="combined",
    graders=[
        (safety_grader, 0.2),   # Weight still affects score
        (quality_grader, 0.8),  # Higher weight for quality
    ],
)
# grade() is a coroutine: run it in an event loop, or await it inside async code
outcome = asyncio.run(composite.grade(transcript, task))
# outcome.passed = False if safety_grader fails, regardless of quality score

import sys

from tracelens.baselines import BaselineManager, RegressionDetector
# Assumption: RegressionSeverity is exported from the same package
from tracelens.baselines import RegressionSeverity

manager = BaselineManager("baselines/baselines.json")
baseline = manager.get_baseline("btc_backtest")
detector = RegressionDetector(significance_level=0.05)
# current_results holds the results of the run under test
report = detector.compare(baseline, current_results)
if report.should_block_ci(threshold=RegressionSeverity.MODERATE):
    sys.exit(1)  # Block the PR

Baselines can be protected or auto-promoted based on their type:
- CANARY: Protected baselines that never auto-update. Manual promotion only.
- CAPABILITY: Track improvements over time. Auto-promote when criteria met.
- EXPERIMENTAL: For testing. No restrictions.
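
The promotion rules above can be summarized as a small decision function. The sketch below is a simplified illustration under stated assumptions, not the BaselineManager logic: `BaselineKind`, `Policy`, and `should_auto_promote` are hypothetical names, the metric is assumed to be higher-is-better, and the statistical confidence requirement is left out.

```python
from dataclasses import dataclass
from enum import Enum


class BaselineKind(Enum):
    CANARY = "canary"
    CAPABILITY = "capability"
    EXPERIMENTAL = "experimental"


@dataclass(frozen=True)
class Policy:
    min_improvement_relative: float = 0.05  # 5% relative improvement required
    min_samples: int = 10


def should_auto_promote(kind: BaselineKind, old: float, new: float,
                        sample_count: int, policy: Policy) -> bool:
    """Decide whether a new metric value may replace the stored baseline."""
    if kind is BaselineKind.CANARY:
        return False  # Protected: manual promotion only
    if kind is BaselineKind.EXPERIMENTAL:
        return True   # No restrictions
    if sample_count < policy.min_samples:
        return False  # Not enough evidence yet
    if old <= 0:
        return new > old  # Relative change is undefined; fall back to a plain comparison
    return (new - old) / old >= policy.min_improvement_relative
```

For example, moving a capability baseline from 0.75 to 0.82 with 15 samples is a relative improvement of about 9.3%, which clears a 5% bar, whereas the same change on a canary baseline is never promoted automatically.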
from tracelens.baselines import BaselineManager, BaselineType, PromotionPolicy
manager = BaselineManager("baselines/baselines.json")
# Create a canary baseline (protected, manual promotion only)
canary = manager.create_canary_baseline(
    task_id="critical_safety_check",
    metrics={"safety_score": 0.95},
)

# Create capability baseline with auto-promotion policy
policy = PromotionPolicy(
    allow_auto_promotion=True,
    min_improvement_relative=0.05,  # 5% improvement required
    min_samples=10,
    required_confidence=0.95,
)
capability = manager.create_capability_baseline(
    task_id="quality_benchmark",
    metrics={"quality_score": 0.75},
    policy=policy,
)

# Try auto-promotion (returns True if promoted)
promoted = manager.try_promote(
    task_id="quality_benchmark",
    new_metrics={"quality_score": 0.82},
    sample_count=15,
)

Baseline comparisons can also be made statistically, with bootstrap confidence intervals, an effect size, and an optional p-value.
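
For intuition about the quantities such a comparison reports, the standalone sketch below computes a percentile bootstrap interval for the difference in means and Cohen's d using only the standard library. It is an illustration under assumptions, not the `tracelens.statistics.inference` code: `cohens_d` and `bootstrap_diff_ci` are hypothetical helpers, the pooled standard deviation weights both groups equally, and the resample count and seed are arbitrary.

```python
import random
from statistics import mean, variance


def cohens_d(baseline: list[float], current: list[float]) -> float:
    """Standardized mean difference using an equally weighted pooled SD."""
    pooled_sd = ((variance(baseline) + variance(current)) / 2) ** 0.5
    return (mean(current) - mean(baseline)) / pooled_sd


def bootstrap_diff_ci(baseline: list[float], current: list[float],
                      confidence: float = 0.95, n_resamples: int = 2000,
                      seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(current) - mean(baseline)."""
    rng = random.Random(seed)  # Seeded so the interval is repeatable
    diffs = sorted(
        mean(rng.choices(current, k=len(current)))
        - mean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


baseline_values = [0.72, 0.75, 0.71, 0.74, 0.73]
current_values = [0.78, 0.81, 0.79, 0.82, 0.80]
difference = mean(current_values) - mean(baseline_values)
effect = cohens_d(baseline_values, current_values)
ci_lower, ci_upper = bootstrap_diff_ci(baseline_values, current_values)
```

An interval that lies entirely above zero suggests the improvement is unlikely to be resampling noise, though five samples per group is a small basis for any firm conclusion.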
from tracelens.statistics.inference import (
    compare_metrics,
    compare_to_baseline_summary,
    estimate_metric,
)

# Compare current run against baseline with bootstrap CI
baseline_values = [0.72, 0.75, 0.71, 0.74, 0.73]
current_values = [0.78, 0.81, 0.79, 0.82, 0.80]
result = compare_metrics(
    baseline_values,
    current_values,
    confidence=0.95,
    compute_p_value=True,
)
print(f"Baseline: {result.baseline.mean:.3f} ± {result.baseline.std:.3f}")
print(f"Current: {result.current.mean:.3f} ± {result.current.std:.3f}")
print(f"Difference: {result.difference:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
print(f"Effect size (Cohen's d): {result.effect_size:.2f}")
print(f"Significant improvement: {result.significant_improvement}")

# Get summary for CI reporting
summary = compare_to_baseline_summary(
    baseline_values,
    current_values,
    metric_name="quality_score",
)
# Returns a one-line summary in this format (interval, d, and p shown here are illustrative):
# "quality_score: 0.800 vs baseline 0.730 (Δ=+0.070, 95% CI [0.045, 0.095], d=1.23, p<0.05)"

- name: Run Evaluation
  run: |
    tracelens run \
      --eval-set eval/suite.json \
      --graders quality,personalization \
      --num-runs 5 \
      --baseline-check \
      --fail-on-regression moderate
- name: Comment on PR
  run: tracelens report --format github-pr

Configure regression thresholds in baselines/thresholds.py:
THRESHOLDS = {
    "sharpe_ratio": {
        "direction": "higher_is_better",
        "absolute_threshold": -0.2,  # Block if drops by 0.2
        "relative_threshold": 0.10,  # Block if drops by 10%
    },
    "max_drawdown": {
        "direction": "closer_to_zero_is_better",
        "absolute_threshold": -0.05,
    },
}

The human_eval/ module is planned but not yet implemented. The recommended workflow is a weekly process to calibrate LLM graders:
- Sample Selection: Select 20 diverse samples from recent eval runs
- Human Rating: Rate on 1-10 scale per dimension
- Correlation Analysis: Compare LLM vs human scores
- Grader Tuning: Adjust prompts if correlation < 0.7
See docs/accuracy.md for calibration best practices.
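
The correlation step of this workflow can be done by hand while the module is unimplemented. The following is a minimal sketch, assuming paired LLM and human scores for the same samples and Pearson correlation as the agreement measure; `pearson` and `needs_tuning` are hypothetical helpers, and the 0.7 cutoff mirrors the threshold mentioned in the workflow.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def needs_tuning(llm_scores: list[float], human_scores: list[float],
                 min_correlation: float = 0.7) -> bool:
    """Flag the grader prompt for adjustment when agreement falls below the cutoff."""
    return pearson(llm_scores, human_scores) < min_correlation
```

Calling `needs_tuning(llm_scores, human_scores)` on the weekly sample of 20 ratings gives a simple yes/no signal; a rank-based measure such as Spearman correlation may be a better fit if the 1-10 ratings are treated as ordinal.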
Install from PyPI:
# Using uv (recommended)
uv pip install tracelens
# With LLM support
uv pip install "tracelens[llm]"
# Or add to pyproject.toml
# dependencies = [
# "tracelens>=0.1.0",
# ]

# Clone and install
git clone https://github.com/ssf0409/tracelens.git
cd tracelens
uv pip install -e ".[dev]"
# Run tests
uv run pytest tests/ -v
# Run with Docker
docker compose run --rm test

import asyncio
from tracelens import (
    Task, EvalSet, SimpleAdapter, CodeGrader,
    EvaluationRunner, RunnerConfig, Transcript,
)
from tracelens.reporting.generator import ReportGenerator
# 1. Define tasks
tasks = [
    Task(name="Add 2+3", input_data={"a": 2, "b": 3}, metadata={"expected": 5}),
    Task(name="Add 10+20", input_data={"a": 10, "b": 20}, metadata={"expected": 30}),
]
eval_set = EvalSet(name="Math Suite", tasks=tasks)
# 2. Write a simple agent
async def math_agent(input_data: dict) -> dict:
    return {"answer": input_data["a"] + input_data["b"]}

adapter = SimpleAdapter(math_agent)

# 3. Write a grader
class MathGrader(CodeGrader):
    def compute_metrics(self, transcript: Transcript, task: Task) -> dict[str, float]:
        expected = task.metadata["expected"]
        actual = transcript.final_output["answer"]
        return {"correct": float(actual == expected)}

    def determine_pass(self, metrics: dict[str, float], task: Task) -> tuple[bool, float]:
        return metrics["correct"] == 1.0, metrics["correct"]
# 4. Run evaluation
runner = EvaluationRunner(adapter, [MathGrader("math")], RunnerConfig(num_runs=3))
batch = asyncio.run(runner.run(eval_set))
# 5. Generate report
gen = ReportGenerator()
report = gen.build_report(batch)
print(gen.render_markdown(report))

Five-minute version: examples/hello_world.py. Sample report: examples/reports/hello_world_report.md. Walkthrough: docs/getting-started.md.
- Getting Started — Run your first eval in five minutes; the example ladder.
- Quickstart — Build a custom grader, JSON task loader, and CLI workflow.
- Supported Scenarios — Which agent-evaluation problems TraceLens is designed for.
- User Guide — Comprehensive framework guide.
- Evaluation Levels — Function, task, and system-level evaluation architecture.
- Accuracy Best Practices — LLM-judge calibration and grader drift.
- CI/CD Integration — GitHub Actions with regression gating.
- Contributor Testing — Local, wheel-smoke, downstream, and release-safety environments.
- Examples — Four working scripts: hello_world.py → contract_eval.py → http_agent_eval.py → noise_aware_regression.py.
- Releasing — Maintainer guide for tag-driven PyPI releases.
TraceLens is MIT licensed and open to contributions. Start with CONTRIBUTING.md, run the local verification gate, and open a focused PR:
uv run --frozen pytest -q
uv run --frozen ruff check src/ tests/ examples/ benchmarks/high-stakes-autonomous
uv run --frozen --extra dev mypy src/tracelens/

Security issues should be reported privately using SECURITY.md.
From Anthropic's evaluation guide:
- Grade outcomes, not execution paths - Focus on what the agent produced
- Handle non-determinism with pass@k and pass^k - Different metrics for capability vs reliability
- Start with 20-50 real failure cases - Build from actual issues
- Read transcripts regularly - Catch false signals and grader bugs
- Calibrate with human evaluation - LLM graders drift without calibration