martinfrancois · Jun 28, 2026
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -55,7 +55,7 @@ Tessl-authenticated checks:
 
 - [ ] `bash scripts/check_publish_dry_run.sh .`
 - [ ] `tessl plugin publish --dry-run --bump patch .`
-- [ ] `tessl skill review --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed
+- [ ] `tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed
 - [ ] Targeted main/reference `scripts/run_eval_suite.sh <main|reference> <scenario-name>`, if skill behavior or those evals changed
 - [ ] Targeted regression `scripts/run_eval_suite.sh regression <scenario-name>`, if regression evals changed
 - [ ] Every substantively changed eval scenario was rerun targeted and reached 100% with context, or the PR explains the Tessl blocker and remaining work
@@ -65,7 +65,7 @@ Tessl-authenticated checks:
 - [ ] `scripts/classify_eval_result.py <run-json> --scenario-dir <scenario-dir>`, if a scenario was added or moved between suites
 - [ ] Full/main `scripts/run_eval_suite.sh main`, if benchmark claims changed or targeted with-context results are clean
 
-`bash scripts/check_publish_dry_run.sh .`, `tessl skill review`, and hosted Tessl evals require
+`bash scripts/check_publish_dry_run.sh .`, `tessl review run`, and hosted Tessl evals require
 Tessl authentication. Hosted evals also require a linked Tessl project. If you can't run one of
 them, leave it unchecked and explain why in the details.
 

diff --git a/.github/workflows/skill-review.yml b/.github/workflows/skill-review.yml
@@ -44,4 +44,4 @@ jobs:
 
       - name: Review skill
         if: ${{ env.TESSL_TOKEN_AVAILABLE == 'true' }}
-        run: tessl skill review --threshold 100 skills/java-streams/SKILL.md
+        run: tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md
diff --git a/.gitignore b/.gitignore
@@ -4,6 +4,7 @@ __pycache__/
 *.py[cod]
 .tessl/cache/
 .tessl/tmp/
+.tessl/eval-evidence/
 .codex/
 .mcp.json
 .vscode/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -78,7 +78,7 @@ tessl plugin lint .
 If you change the skill text or reference files, also run:
 
 ```bash
-tessl skill review --threshold 100 skills/java-streams/SKILL.md
+tessl review run --threshold 100 skills/java-streams/SKILL.md
 ```
 
 If you have Tessl access, run the publish dry-run:
@@ -88,23 +88,25 @@ bash scripts/check_publish_dry_run.sh .
 tessl plugin publish --dry-run --bump patch .
 ```
 
-Hosted evals require Tessl authentication and a linked Tessl project. Use Sonnet 4.6 for this
-repository's main eval checks. Prefer `scripts/run_eval_suite.sh`; it runs from a temporary plugin
-copy so with-context variants can see the skill bundle. Start with targeted runs when possible to
-conserve Tessl daily rate-limit budget:
+Hosted evals require Tessl authentication and a linked Tessl project. Prefer
+`scripts/run_eval_suite.sh`; it runs from a temporary plugin copy so with-context variants can see
+the skill bundle. By default, it uses Tessl's default solver so contributors without model-selection
+entitlement can still run evals. If your Tessl plan allows explicit model selection, Sonnet 4.6 or a
+better frontier model is recommended for a more representative real-world check. Start with targeted
+runs when possible to conserve Tessl daily rate-limit budget:
 
 ```bash
 scripts/run_eval_suite.sh main
 ```
 
 Run hosted eval variants by suite purpose:
 
-- `evals/`: run both `without-context` and `with-context`; these runs support public lift
+- `evals/`: run both baseline control and `with-context`; these runs support public lift
   reporting. Use `scripts/run_eval_suite.sh main`.
-- `evals-reference/`: run both `without-context` and `with-context`; these runs decide whether a
+- `evals-reference/`: run both baseline control and `with-context`; these runs decide whether a
   scenario has meaningful lift or should move suites. Use `scripts/run_eval_suite.sh reference`.
 - `evals-regression/`: run `with-context` only by default; these runs are safety checks, not lift
-  discovery. Run regression `without-context` only when deliberately checking whether a scenario
+  discovery. Run regression baseline control only when deliberately checking whether a scenario
   should move back to `evals-reference/`. Use `scripts/run_eval_suite.sh regression`.
 
 ## Commit Messages

diff --git a/docs/agents/eval-risk-probes.txt b/docs/agents/eval-risk-probes.txt
@@ -0,0 +1,26 @@
+# suite:scenario entries that historically exposed scored with-context failures, ordered by
+# observed failure count per hosted solution cost from the paginated Tessl/local run-history
+# corpus. Missing-score/incomplete runs are excluded from this ranking. These are scheduling
+# probes only. They do not change final evidence requirements.
+reference:26-uppercase-side-effect-review
+reference:08-primary-contact-review
+main:02-delivery-appointments-mapconcurrent
+regression:04-primary-address-review
+regression:22-java8-version-scan
+main:01-offer-availability-mapconcurrent
+main:03-payment-screening-gatherer-review
+regression:16-java11-report-review
+regression:19-null-collector-review
+reference:15-session-roster-indexes
+reference:28-overdue-shipment-notices
+reference:05-parallel-cpu-review
+regression:09-order-collector-report
+regression:23-collector-order-scan
+regression:01-permission-and-orders
+regression:13-training-and-packets
+regression:14-parallel-mutation-review
+regression:17-java8-optional-prefix-review
+regression:24-mutable-batch-modernization
+regression:25-hard-stop-scan-audit
+main:04-invoice-bounds-and-temperature-windows
+reference:27-uppercase-names-implementation
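
Reviewer note: the diff does not show how `docs/agents/eval-risk-probes.txt` is consumed, so the following is only a minimal sketch of one way to read it. It assumes the format visible above (one `suite:scenario` entry per line, `#` for comments, file order significant) and maps each entry onto the targeted `scripts/run_eval_suite.sh <suite> <scenario-name>` form from the PR template. The function names `parse_probes` and `targeted_commands` are hypothetical and are not part of this repository.

```python
# Hypothetical helper: not part of this PR. Assumes the probe file format
# shown in the diff: "#" comment lines and one "suite:scenario" entry per line.

KNOWN_SUITES = {"main", "reference", "regression"}


def parse_probes(text):
    """Return (suite, scenario) pairs in file order, skipping comments and blanks."""
    probes = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        suite, separator, scenario = line.partition(":")
        if not separator or suite not in KNOWN_SUITES or not scenario:
            raise ValueError(f"line {line_number}: expected suite:scenario, got {line!r}")
        probes.append((suite, scenario))
    return probes


def targeted_commands(probes):
    """Build one targeted run command per probe, preserving the ranking order."""
    return [f"scripts/run_eval_suite.sh {suite} {scenario}" for suite, scenario in probes]


SAMPLE = """\
# scheduling probes only
reference:26-uppercase-side-effect-review
main:02-delivery-appointments-mapconcurrent

regression:04-primary-address-review
"""

if __name__ == "__main__":
    for command in targeted_commands(parse_probes(SAMPLE)):
        print(command)
```

Keeping the entries in file order matters here because the header comment says the list is ranked; a consumer that sorted or grouped by suite would lose that ranking.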