feat(skill): discourage multi-line stream lambdas by martinfrancois · Pull Request #37 · martinfrancois/java-streams-skill

martinfrancois · 2026-06-23T10:41:04Z

Summary

Problem: $java-streams did not explicitly reject multi-line stream lambdas, including block lambdas, body-on-next-line lambdas, and wrapped nested stream callbacks.
What changed: added runtime lambda-readability guidance, broadened the hard-stop scan, added Java 8 fallback guidance, added one new reference eval, and added multi-line-lambda criteria to existing evals where generated with-context artifacts used wrapped lambdas.
Lift-sensitive scope: skill/runtime guidance, eval criteria, and eval wrapper behavior changed in separate commits so each can be reverted independently if hosted evidence regresses.
What did not change: no plugin version bump, release metadata update, or public benchmark claim update is included.

Change Type

Choose all that apply.

Linked Issue

Fixes feat: Improve readability by avoiding multi-line lambdas #36
Related: N/A

User-Visible Behavior

Users of $java-streams now get explicit guidance to keep stream lambdas as short glue and extract branching, loops, temporary variables, formatting, merge rules, and nested stream chains into named helpers. The hard-stop scan now also flags block lambdas, arrows whose body starts on the next line, and nested stream lambda bodies that continue on later lines.
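As a rough illustration of that guidance (not code taken from the skill or its evals), the sketch below contrasts a block lambda that carries branching and formatting with the same pipeline after extraction into named helpers. The `Shipment` type, its fields, the seven-day threshold, and the helper names `isOverdue` and `toNotice` are all hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ShipmentNotices {

    // Hypothetical domain type, used only for this illustration.
    static final class Shipment {
        final String id;
        final int daysLate;
        final boolean priority;

        Shipment(String id, int daysLate, boolean priority) {
            this.id = id;
            this.daysLate = daysLate;
            this.priority = priority;
        }
    }

    // Discouraged shape: a block lambda holding a temporary variable, branching, and formatting.
    static List<String> noticesInline(List<Shipment> shipments) {
        return shipments.stream()
                .filter(s -> s.daysLate > 0)
                .map(s -> {
                    String level = s.priority || s.daysLate > 7 ? "URGENT" : "LATE";
                    return level + " " + s.id + " (" + s.daysLate + "d)";
                })
                .collect(Collectors.toList());
    }

    // Preferred shape: the pipeline is short glue and the logic lives in named helpers.
    static List<String> notices(List<Shipment> shipments) {
        return shipments.stream()
                .filter(ShipmentNotices::isOverdue)
                .map(ShipmentNotices::toNotice)
                .collect(Collectors.toList());
    }

    static boolean isOverdue(Shipment s) {
        return s.daysLate > 0;
    }

    static String toNotice(Shipment s) {
        String level = s.priority || s.daysLate > 7 ? "URGENT" : "LATE";
        return level + " " + s.id + " (" + s.daysLate + "d)";
    }

    public static void main(String[] args) {
        List<Shipment> shipments = List.of(
                new Shipment("S1", 0, false),
                new Shipment("S2", 3, false),
                new Shipment("S3", 10, false));
        notices(shipments).forEach(System.out::println);
    }
}
```

Both methods return the same list; the second keeps every lambda or method reference on one line, so the helpers can be named, tested, and reviewed on their own.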

The default hosted eval wrapper no longer pins Sonnet. It uses Tessl's current default solver unless an explicit --agent is supplied. The docs still recommend Sonnet 4.6 or a better frontier model when the account has model-selection entitlement and the goal is a more representative real-world check.

Bug Fix Details

For bug fixes or regressions, explain why the issue happened and what now prevents it from coming
back. For other changes, write N/A.

Root cause: the skill named many stream antipatterns but did not treat multi-line lambda readability as a hard-stop review target, and the initial audit only checked `-> {` block lambdas.
Test, eval, or guardrail added: added evals-reference/28-overdue-shipment-notices, added multi-line-lambda criteria to 05-parallel-cpu-review, 15-session-roster-indexes, and 22-java8-version-scan, broadened the hard-stop scan, and reran targeted hosted evals for those scenarios.
If no test or eval was added, why not: N/A.

Validation

List the commands, manual checks, or hosted checks you ran. Include relevant failures that were fixed
during the PR.

Checks most contributors can run:

python3 scripts/validate_skill.py skills/java-streams
python3 scripts/validate_eval_criteria.py evals evals-reference evals-regression
python3 -m py_compile scripts/*.py
bash -n scripts/*.sh
tessl plugin lint .
Manual rendered-doc or example review, if docs or examples changed

Tessl-authenticated checks:

bash scripts/check_publish_dry_run.sh ., tessl skill review, and hosted Tessl evals require
Tessl authentication. Hosted evals also require a linked Tessl project. If you can't run one of
them, leave it unchecked and explain why in the details.

Details:

Final local validation on this branch:
- python3 scripts/validate_skill.py skills/java-streams
- python3 scripts/validate_eval_criteria.py evals evals-reference evals-regression
- python3 scripts/validate_json_files.py
- python3 scripts/validate_openai_agent_yaml.py
- python3 -m py_compile scripts/*.py
- bash -n scripts/*.sh
- shellcheck scripts/run_eval_suite.sh
- git diff --check
- tessl plugin lint .
- tessl skill review --threshold 100 skills/java-streams/SKILL.md

Eval wrapper negative checks:
- scripts/run_eval_suite.sh main --skip-baseline -> rejected with exit 2
- scripts/run_eval_suite.sh main --skip-baseline=true -> rejected with exit 2
- scripts/run_eval_suite.sh main -- --skip-baseline -> rejected with exit 2
- scripts/run_eval_suite.sh main -- --skip-baseline=true -> rejected with exit 2
- scripts/run_eval_suite.sh reference 28-overdue-shipment-notices -- --variant with-context -> rejected with exit 2

Publish dry-run:
- bash scripts/check_publish_dry_run.sh .
- tessl plugin publish --dry-run --bump patch .
- tessl plugin publish --dry-run . fails because martinfrancois/java-streams@1.1.4 already exists; patch-bump dry-run confirms 1.1.5 is available.

Codex review fix loop:
- First review found scripts/run_eval_suite.sh allowed --skip-baseline passthrough for main/reference runs. Fixed by rejecting both --variant and --skip-baseline because the wrapper owns suite variant policy.
- Follow-up review found --skip-baseline=true could still bypass the guard. Fixed by rejecting --skip-baseline=* in both argument checks.
- Final review found Java 8 fallback guidance for Collectors.flatMapping could drop empty downstream groups when pre-flattening before groupingBy. Fixed by qualifying pre-flattening and adding a loop fallback that preserves empty groups.
- Final Codex review on pushed HEAD e9b0a15: no discrete correctness issues found.
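The empty-group issue from the final review can be reproduced with a small sketch. This is not the skill's wording or its example code: the `Order` type and the method names are hypothetical. It assumes a Java 9+ compiler for the `Collectors.flatMapping` variant; the other two variants use only Java 8 stream and collection APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FlatMappingFallback {

    // Hypothetical type, used only for this illustration.
    static final class Order {
        final String customer;
        final List<String> tags;

        Order(String customer, List<String> tags) {
            this.customer = customer;
            this.tags = tags;
        }
    }

    // Java 9+: grouping happens first, so a customer with no tags still gets an (empty) entry.
    static Map<String, List<String>> withFlatMapping(List<Order> orders) {
        return orders.stream().collect(Collectors.groupingBy(
                o -> o.customer,
                Collectors.flatMapping(o -> o.tags.stream(), Collectors.toList())));
    }

    // Pre-flattening before groupingBy: a customer whose orders have no tags emits no pairs
    // and therefore disappears from the result.
    static Map<String, List<String>> preFlattened(List<Order> orders) {
        return orders.stream()
                .flatMap(FlatMappingFallback::tagPairs)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    static Stream<Map.Entry<String, String>> tagPairs(Order o) {
        return o.tags.stream().map(t -> new AbstractMap.SimpleEntry<>(o.customer, t));
    }

    // Loop fallback: the key is created before any tags are added, so empty groups survive.
    static Map<String, List<String>> loopFallback(List<Order> orders) {
        Map<String, List<String>> result = new HashMap<>();
        for (Order o : orders) {
            result.computeIfAbsent(o.customer, k -> new ArrayList<>()).addAll(o.tags);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("ann", List.of("a", "b")),
                new Order("bob", List.of()));
        System.out.println(withFlatMapping(orders).containsKey("bob"));
        System.out.println(preFlattened(orders).containsKey("bob"));
        System.out.println(loopFallback(orders).containsKey("bob"));
    }
}
```

Pre-flattening remains reasonable when dropping empty groups is the intended behavior; the loop is the safer choice when the result must match what `flatMapping` would produce.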

Hosted eval evidence:
- A Sonnet 4.6 targeted run was attempted and failed with 403 Missing required entitlement: modelSelection. The default-solver commands were updated so free accounts without model selection can still run evals.
- `tessl eval run --list-agents` currently reports `claude:deepseek-v4-flash (default)`. The docs intentionally tell contributors to check `--list-agents` instead of hard-coding a current default model.
- Online resources checked: Tessl's changelog documents plan-entitlement-gated model/agent selection (<https://docs.tessl.io/changelog>), and Tessl's default-eval-model blog explains why ordinary skill-development evals should not always pin Sonnet 4.6 (<https://tessl.io/blog/why-were-changing-our-default-eval-model/>).

Targeted hosted results for changed or corrected scenarios:
- evals-reference/05-parallel-cpu-review: run 019ef470-07fd-7676-858b-945ae61c8e87, baseline 39/60, with context 60/60.
- evals-reference/15-session-roster-indexes: run 019ef472-c76e-737b-be6c-3a439dd4aa95, baseline 59/100, with context 100/100.
- evals-regression/22-java8-version-scan: run 019ef46c-35a7-70a9-85a0-231e6f9bb60f, with context 100/100.
- evals-reference/26-uppercase-side-effect-review: run 019ef484-4a21-7383-9ca7-287c8b4e70b7, baseline 74/100, with context 100/100.
- evals/02-delivery-appointments-mapconcurrent after the service-stub preservation fix: run 019ef492-789e-734d-afb4-fad02ec5a452, baseline 90/400, with context 400/400.
- evals-reference/28-overdue-shipment-notices classifier evidence from run 019ef3f7-7c37-74ac-9ab2-191dea36f333: baseline 80/100, with context 100/100. The classifier recommended the reference suite because the 20.0 pp delta is below the 30.0 pp main-suite floor.
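The delta arithmetic behind the scenario 28 recommendation can be sketched as follows. This is an assumption about the shape of the rule, not the repository's classifier: the class and method names are hypothetical, whether a delta exactly at the floor counts as main is a guess, and the real classifier may weigh other evidence besides the delta.

```java
final class SuiteDelta {

    // Main-suite floor in percentage points, as quoted in this PR.
    static final double MAIN_SUITE_FLOOR_PP = 30.0;

    // Lift from baseline to with-context, expressed in percentage points of the maximum score.
    static double deltaPp(int baseline, int withContext, int maxScore) {
        return 100.0 * (withContext - baseline) / maxScore;
    }

    // Assumption: a delta at or above the floor qualifies for main; anything lower goes to reference.
    static String recommendedSuite(int baseline, int withContext, int maxScore) {
        return deltaPp(baseline, withContext, maxScore) >= MAIN_SUITE_FLOOR_PP ? "main" : "reference";
    }

    public static void main(String[] args) {
        // Scenario 28 from this PR: baseline 80/100, with context 100/100.
        System.out.println(deltaPp(80, 100, 100));
        System.out.println(recommendedSuite(80, 100, 100));
    }
}
```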

Broader hosted evidence gathered before final hosted 500s:
- Full main run 019ef486-5d27-7068-a787-67dfa842f863 reached with-context 500/500 for 01, 300/300 for 03, and 200/200 for 04. Scenario 02 scored 394/400 due to a service-stub API addition; targeted run 019ef492-789e-734d-afb4-fad02ec5a452 after the fix scored 400/400.
- Full reference run 019ef47f-484a-729d-9040-01627e57c48d reached with-context 100/100 for 08, 27, and 28; 60/60 for 05; 100/100 for 15; and 98/100 for 26. Targeted run 019ef484-4a21-7383-9ca7-287c8b4e70b7 after the performance-guidance fix scored 100/100 for 26.
- Regression targeted run 019ef46c-35a7-70a9-85a0-231e6f9bb60f scored 100/100 for the Java 8 version-scan scenario changed by this PR. Earlier full regression run 019ef3f7-8e1b-72c6-be14-a4082fa90b48 had scenario 04 at 63/65 before later fixes; final broad regression creation is currently blocked by Tessl 500, so this PR does not claim final broad regression release-readiness.

Final hosted broad-suite attempts on pushed branch:
- scripts/run_eval_suite.sh main -- --yes --label "issue-36 final full main after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh reference -- --yes --label "issue-36 final full reference after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh regression -- --yes --label "issue-36 final full regression after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh regression 04-primary-address-review -- --yes --label "issue-36 final targeted regression 04 availability check" -> 500 Internal Server Error, no run ID created.

Generated-artifact audit for issue #36:
- Initial audit was too narrow: it scanned only `-> {` block lambdas.
- Follow-up broader audit found wrapped/body-next-line lambda usage in with-context artifacts for evals-reference/05-parallel-cpu-review, evals-reference/15-session-roster-indexes, and evals-regression/22-java8-version-scan.
- Added criteria to those exact scenarios and reran them targeted to 100% with context.
- Final targeted artifact scan for those scenarios found 7 lambda occurrences in generated Java/fenced code and 0 multiline-lambda hits. Two raw prose hits in the Java 8 review were false positives that described `filter(x -> { ... })` as a bad pattern, not replacement code.
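A line-based scan of the kind described above might look like the sketch below. It is not the script used for this audit: the class name and patterns are hypothetical, and it covers only two of the forms (a line ending in `-> {` and a line ending in a bare `->`). Wrapped nested stream callbacks would need a parser or a paren-depth heuristic. Like the raw prose hits mentioned above, it will also flag matches inside comments or strings, and it does not distinguish switch-case arrows from lambdas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class MultilineLambdaScan {

    // A line ending in "-> {": a block lambda whose body continues on later lines.
    private static final Pattern BLOCK_OPEN = Pattern.compile("->\\s*\\{\\s*$");

    // A line ending in a bare "->": the lambda body starts on the next line.
    private static final Pattern BODY_ON_NEXT_LINE = Pattern.compile("->\\s*$");

    // Returns the 1-based numbers of lines that match either pattern.
    static List<Integer> flaggedLines(String source) {
        List<Integer> hits = new ArrayList<>();
        String[] lines = source.split("\\R");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (BLOCK_OPEN.matcher(line).find() || BODY_ON_NEXT_LINE.matcher(line).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "values.stream()",
                "    .map(x -> {",
                "        return x + 1;",
                "    })",
                "    .filter(x ->",
                "        x > 2)",
                "    .map(x -> x * 2);");
        System.out.println(flaggedLines(sample));
    }
}
```

Hits from a scan like this still need a manual pass to separate replacement code from prose that merely quotes a bad pattern.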

Human Verification

Describe what you tried manually and what result you saw. If the change cannot be tried manually,
explain why.

Reviewed the issue requirements against the final diff. The PR now covers the originally missed wrapped/body-next-line lambda forms, not only block lambdas.

Reviewed the skill text, hard-stop guidance, Java 8 fallback guidance, and stream examples for Java baseline compatibility, ordering, null handling, mutability, parallelism, and behavior preservation. The flatMapping fallback now explicitly preserves empty downstream groups unless omission is intended.

Reviewed the new eval to avoid answer-key leakage: it asks for a natural implementation task and scores behavior plus maintainability rather than prescribing exact helper names or an exact stream chain.

Manually audited generated hosted artifacts from targeted with-context runs for the lambda issue. The final generated code/fenced-code scan for affected scenarios contains no multiline lambda replacement code.

Review Checklist

AI Assistance (if used)

AI-assisted PR
I confirm I understand and reviewed the change

AI prompts / session logs (optional)

AI-assisted implementation with Codex. Codex inspected the issue, repository guidance, and contribution policy; implemented the skill, eval, docs, and script updates; ran local validation and hosted evals where Tessl allowed them; audited generated eval artifacts for multiline lambda usage; and ran the Codex review fix loop until the final review reported no discrete correctness issues.

Discourage multiline stream lambdas, preserve requested Java artifacts and nested helper types, and keep implementation prompts on direct stream result production when the Java baseline allows it. Strengthen findFirst/findAny exception wording, bounded mapConcurrent guidance, Java 8 fallbacks, null collector handling, and parallel-stream review advice. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Add the overdue shipment reference scenario, update multiline-lambda and Java baseline criteria, and mark the uppercase performance review as an explicit skill invocation. Keep suite numbering and agent behavior notes aligned with the expanded reference and regression coverage. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Update scripts/run_eval_suite.sh for the current Tessl eval CLI, let suite policy choose baseline versus with-context runs, pass --skill java-streams and --force for final readiness evidence, and avoid pinning Sonnet in default commands. Document that Tessl's default solver is used by default because explicit model selection is entitlement-gated; Sonnet 4.6 or better remains recommended when modelSelection is available for representative checks. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Add an internal pre-submit gate that runs quality first, tracks scenario-level evidence by skill and scenario fingerprints, schedules impact and historical-risk probes, and broadens to balanced remaining batches only after targeted evidence is clean. Document the final hard requirement for runtime skill changes, including 100% quality and 100% with-context evidence across main, reference, and regression for the final skill bundle state. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Record how future agents should handle Tessl or other CLI deprecation warnings: update runtime scripts to the replacement path, keep compatibility fallbacks when needed, and refresh agent-facing docs in the same change. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Switch the Skill Review workflow from the deprecated tessl skill review command to tessl review run with an explicit martinfrancois workspace so CI is non-interactive and uses the review path validated locally. Update the pull request checklist to match the workflow command. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

Require runtime-reference overlap to be classified with explicit evidence types instead of allowing free-form rationale text to bypass validation. Reclassify focused main and reference scenarios that intentionally overlap runtime guidance, and keep explicit invocation metadata aligned with prompts. Co-Authored-By: marvinbuff <marvinbuff@hotmail.com> Co-Authored-By: PReimers <preimers@pm.me>

martinfrancois force-pushed the codex/multiline-lambda-guidance branch from 99859bd to 91694e9 June 23, 2026 12:04

martinfrancois marked this pull request as ready for review June 23, 2026 12:07