Merged
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -55,7 +55,7 @@ Tessl-authenticated checks:

- [ ] `bash scripts/check_publish_dry_run.sh .`
- [ ] `tessl plugin publish --dry-run --bump patch .`
-- [ ] `tessl skill review --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed
+- [ ] `tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed
- [ ] Targeted main/reference `scripts/run_eval_suite.sh <main|reference> <scenario-name>`, if skill behavior or those evals changed
- [ ] Targeted regression `scripts/run_eval_suite.sh regression <scenario-name>`, if regression evals changed
- [ ] Every substantively changed eval scenario was rerun targeted and reached 100% with context, or the PR explains the Tessl blocker and remaining work
@@ -65,7 +65,7 @@ Tessl-authenticated checks:
- [ ] `scripts/classify_eval_result.py <run-json> --scenario-dir <scenario-dir>`, if a scenario was added or moved between suites
- [ ] Full/main `scripts/run_eval_suite.sh main`, if benchmark claims changed or targeted with-context results are clean

-`bash scripts/check_publish_dry_run.sh .`, `tessl skill review`, and hosted Tessl evals require
+`bash scripts/check_publish_dry_run.sh .`, `tessl review run`, and hosted Tessl evals require
Tessl authentication. Hosted evals also require a linked Tessl project. If you can't run one of
them, leave it unchecked and explain why in the details.

2 changes: 1 addition & 1 deletion .github/workflows/skill-review.yml
@@ -44,4 +44,4 @@ jobs:

- name: Review skill
if: ${{ env.TESSL_TOKEN_AVAILABLE == 'true' }}
-run: tessl skill review --threshold 100 skills/java-streams/SKILL.md
+run: tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ __pycache__/
*.py[cod]
.tessl/cache/
.tessl/tmp/
+.tessl/eval-evidence/
.codex/
.mcp.json
.vscode/
18 changes: 10 additions & 8 deletions CONTRIBUTING.md
@@ -78,7 +78,7 @@ tessl plugin lint .
If you change the skill text or reference files, also run:

```bash
-tessl skill review --threshold 100 skills/java-streams/SKILL.md
+tessl review run --threshold 100 skills/java-streams/SKILL.md
```

If you have Tessl access, run the publish dry-run:
@@ -88,23 +88,25 @@ bash scripts/check_publish_dry_run.sh .
tessl plugin publish --dry-run --bump patch .
```

-Hosted evals require Tessl authentication and a linked Tessl project. Use Sonnet 4.6 for this
-repository's main eval checks. Prefer `scripts/run_eval_suite.sh`; it runs from a temporary plugin
-copy so with-context variants can see the skill bundle. Start with targeted runs when possible to
-conserve Tessl daily rate-limit budget:
+Hosted evals require Tessl authentication and a linked Tessl project. Prefer
+`scripts/run_eval_suite.sh`; it runs from a temporary plugin copy so with-context variants can see
+the skill bundle. By default, it uses Tessl's default solver so contributors without model-selection
+entitlement can still run evals. If your Tessl plan allows explicit model selection, Sonnet 4.6 or a
+better frontier model is recommended for a more representative real-world check. Start with targeted
+runs when possible to conserve Tessl daily rate-limit budget:

```bash
scripts/run_eval_suite.sh main
```

Run hosted eval variants by suite purpose:

-- `evals/`: run both `without-context` and `with-context`; these runs support public lift
+- `evals/`: run both baseline control and `with-context`; these runs support public lift
reporting. Use `scripts/run_eval_suite.sh main`.
-- `evals-reference/`: run both `without-context` and `with-context`; these runs decide whether a
+- `evals-reference/`: run both baseline control and `with-context`; these runs decide whether a
scenario has meaningful lift or should move suites. Use `scripts/run_eval_suite.sh reference`.
- `evals-regression/`: run `with-context` only by default; these runs are safety checks, not lift
-discovery. Run regression `without-context` only when deliberately checking whether a scenario
+discovery. Run regression baseline control only when deliberately checking whether a scenario
should move back to `evals-reference/`. Use `scripts/run_eval_suite.sh regression`.
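
Reviewer sketch (not part of the diff): the suite rules in the list above can be read as a small lookup table. This is a minimal illustration only. The names `SUITES`, `variants_for`, and the `"baseline-control"` label are hypothetical and do not come from the repository; the directories and the `scripts/run_eval_suite.sh <suite>` commands are taken from the list.

```python
# Hypothetical summary of the suite rules described in CONTRIBUTING.md.
# Keys are the suite names accepted by scripts/run_eval_suite.sh.
SUITES = {
    "main": {"directory": "evals/", "purpose": "public lift reporting"},
    "reference": {"directory": "evals-reference/", "purpose": "lift discovery"},
    "regression": {"directory": "evals-regression/", "purpose": "safety checks"},
}


def variants_for(suite, check_move_to_reference=False):
    """Return the eval variants to run for a suite.

    "baseline-control" is an assumed label for the run without the skill
    bundle; "with-context" is the term used in the source text.
    """
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    if suite == "regression" and not check_move_to_reference:
        # Regression runs are with-context only by default.
        return ["with-context"]
    return ["baseline-control", "with-context"]


if __name__ == "__main__":
    for name, info in SUITES.items():
        print(f"{info['directory']}: {variants_for(name)} "
              f"via scripts/run_eval_suite.sh {name}")
```

The `check_move_to_reference` flag models the one documented exception: a regression baseline run is only worthwhile when deciding whether a scenario should move back to `evals-reference/`.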

## Commit Messages
Expand Down
26 changes: 26 additions & 0 deletions docs/agents/eval-risk-probes.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# suite:scenario entries that historically exposed scored with-context failures, ordered by
# observed failure count per hosted solution cost from the paginated Tessl/local run-history
# corpus. Missing-score/incomplete runs are excluded from this ranking. These are scheduling
# probes only. They do not change final evidence requirements.
reference:26-uppercase-side-effect-review
reference:08-primary-contact-review
main:02-delivery-appointments-mapconcurrent
regression:04-primary-address-review
regression:22-java8-version-scan
main:01-offer-availability-mapconcurrent
main:03-payment-screening-gatherer-review
regression:16-java11-report-review
regression:19-null-collector-review
reference:15-session-roster-indexes
reference:28-overdue-shipment-notices
reference:05-parallel-cpu-review
regression:09-order-collector-report
regression:23-collector-order-scan
regression:01-permission-and-orders
regression:13-training-and-packets
regression:14-parallel-mutation-review
regression:17-java8-optional-prefix-review
regression:24-mutable-batch-modernization
regression:25-hard-stop-scan-audit
main:04-invoice-bounds-and-temperature-windows
reference:27-uppercase-names-implementation
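
Reviewer sketch (not part of the diff): the file above uses one `suite:scenario` entry per line with `#` comment lines. A reader could turn it into an ordered list of targeted runs roughly as follows. The function names `parse_probes` and `group_by_suite` are hypothetical, and how the repository's own scripts consume this file is not shown in the diff, so treat this as an assumption about the format rather than the project's implementation.

```python
def parse_probes(text):
    """Parse `suite:scenario` lines, skipping blank lines and # comments.

    Order is preserved because the file is ranked by observed failure count
    per hosted solution cost.
    """
    probes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        suite, sep, scenario = line.partition(":")
        if not sep or not suite or not scenario:
            raise ValueError(f"malformed probe entry: {raw!r}")
        probes.append((suite, scenario))
    return probes


def group_by_suite(probes):
    """Group scenarios per suite while keeping their file order."""
    grouped = {}
    for suite, scenario in probes:
        grouped.setdefault(suite, []).append(scenario)
    return grouped


# A short excerpt of the entries shown in the diff.
SAMPLE = """\
# scheduling probes only
reference:26-uppercase-side-effect-review
reference:08-primary-contact-review
main:02-delivery-appointments-mapconcurrent
regression:04-primary-address-review
"""

if __name__ == "__main__":
    for suite, scenario in parse_probes(SAMPLE):
        # Command shape taken from the PR template's targeted-run checklist.
        print(f"scripts/run_eval_suite.sh {suite} {scenario}")
```

Keeping the parsed entries as an ordered list, rather than a set, matters here because the header comment says the ordering itself carries the ranking.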