feat(skill): discourage multi-line stream lambdas#37

Merged
martinfrancois merged 7 commits into main from codex/multiline-lambda-guidance
Jun 28, 2026

Conversation

@martinfrancois martinfrancois commented Jun 23, 2026

Owner

Summary

  • Problem: $java-streams did not explicitly reject multi-line stream lambdas, including block lambdas, body-on-next-line lambdas, and wrapped nested stream callbacks.
  • What changed: added runtime lambda-readability guidance, broadened the hard-stop scan, added Java 8 fallback guidance, added one new reference eval, and added multi-line-lambda criteria to existing evals where generated with-context artifacts used wrapped lambdas.
  • Lift-sensitive scope: skill/runtime guidance, eval criteria, and eval wrapper behavior changed in separate commits so each can be reverted independently if hosted evidence regresses.
  • What did not change: no plugin version bump, release metadata update, or public benchmark claim update is included.

Change Type

Choose all that apply.

  • Skill behavior
  • Evals or scoring
  • Documentation
  • CI, release, or dependency automation
  • Repository metadata or contribution process
  • Other maintenance

Linked Issue

User-Visible Behavior

Users of $java-streams now get explicit guidance to keep stream lambdas as short glue and extract branching, loops, temporary variables, formatting, merge rules, and nested stream chains into named helpers. The hard-stop scan now also flags block lambdas, arrows whose body starts on the next line, and nested stream lambda bodies that continue on later lines.
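
As a minimal sketch of the guidance above, the following contrasts a stream pipeline that uses block lambdas with one that keeps each lambda as one-line glue. The `Shipment` type and the helper names `isOverdue` and `formatNotice` are hypothetical and only illustrate the shape; they are not taken from the skill or its evals. The code stays within Java 8 APIs.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical domain type and helpers, for illustration only.
final class ShipmentNotices {

    static final class Shipment {
        final String id;
        final LocalDate dueDate;
        final boolean delivered;

        Shipment(String id, LocalDate dueDate, boolean delivered) {
            this.id = id;
            this.dueDate = dueDate;
            this.delivered = delivered;
        }
    }

    // Discouraged shape: block lambdas carry branching, a temporary variable, and formatting.
    static List<String> noticesInline(List<Shipment> shipments, LocalDate today) {
        return shipments.stream()
                .filter(s -> {
                    if (s.delivered) {
                        return false;
                    }
                    return s.dueDate.isBefore(today);
                })
                .map(s -> {
                    long daysLate = ChronoUnit.DAYS.between(s.dueDate, today);
                    return s.id + " overdue by " + daysLate + " day(s)";
                })
                .collect(Collectors.toList());
    }

    // Preferred shape: each lambda is one line and the logic lives in named helpers.
    static List<String> notices(List<Shipment> shipments, LocalDate today) {
        return shipments.stream()
                .filter(s -> isOverdue(s, today))
                .map(s -> formatNotice(s, today))
                .collect(Collectors.toList());
    }

    static boolean isOverdue(Shipment shipment, LocalDate today) {
        return !shipment.delivered && shipment.dueDate.isBefore(today);
    }

    static String formatNotice(Shipment shipment, LocalDate today) {
        long daysLate = ChronoUnit.DAYS.between(shipment.dueDate, today);
        return shipment.id + " overdue by " + daysLate + " day(s)";
    }

    public static void main(String[] args) {
        List<Shipment> shipments = Arrays.asList(
                new Shipment("A1", LocalDate.of(2026, 6, 1), false),
                new Shipment("B2", LocalDate.of(2026, 6, 1), true));
        System.out.println(notices(shipments, LocalDate.of(2026, 6, 4)));
    }
}
```

Both methods return the same list; the second only moves the branching and formatting out of the pipeline so the chain reads as a sequence of named steps.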

The default hosted eval wrapper no longer pins Sonnet. It uses Tessl's current default solver unless an explicit --agent is supplied. The docs still recommend Sonnet 4.6 or a better frontier model when the account has model-selection entitlement and the goal is a more representative real-world check.

Bug Fix Details

For bug fixes or regressions, explain why the issue happened and what now prevents it from coming
back. For other changes, write N/A.

  • Root cause: the skill named many stream antipatterns but did not treat multi-line lambda readability as a hard-stop review target, and the initial audit only checked -> { block lambdas.
  • Test, eval, or guardrail added: added evals-reference/28-overdue-shipment-notices, added multi-line-lambda criteria to 05-parallel-cpu-review, 15-session-roster-indexes, and 22-java8-version-scan, broadened the hard-stop scan, and reran targeted hosted evals for those scenarios.
  • If no test or eval was added, why not: N/A.

Validation

List the commands, manual checks, or hosted checks you ran. Include relevant failures that were fixed
during the PR.

Checks most contributors can run:

  • python3 scripts/validate_skill.py skills/java-streams
  • python3 scripts/validate_eval_criteria.py evals evals-reference evals-regression
  • python3 -m py_compile scripts/*.py
  • bash -n scripts/*.sh
  • tessl plugin lint .
  • Manual rendered-doc or example review, if docs or examples changed

Tessl-authenticated checks:

  • bash scripts/check_publish_dry_run.sh .
  • tessl plugin publish --dry-run --bump patch .
  • tessl skill review --threshold 100 skills/java-streams/SKILL.md, if skill text or references changed
  • Targeted main/reference scripts/run_eval_suite.sh <main|reference> <scenario-name>, if skill behavior or those evals changed
  • Targeted regression scripts/run_eval_suite.sh regression <scenario-name>, if regression evals changed
  • Every substantively changed eval scenario was rerun targeted and reached 100% with context, or the PR explains the Tessl blocker and remaining work
  • Runtime skill/reference changes only: full scripts/run_eval_suite.sh reference was run after the final runtime-context change, or the PR links the blocker issue
  • Runtime skill/reference changes only: full scripts/run_eval_suite.sh regression was run after the final runtime-context change, or the PR links the blocker issue
  • Pure eval suite moves did not change task wording, scoring criteria, or capability text beyond suite-placement metadata/numbering notes
  • scripts/classify_eval_result.py <run-json> --scenario-dir <scenario-dir>, if a scenario was added or moved between suites
  • Full/main scripts/run_eval_suite.sh main, if benchmark claims changed or targeted with-context results are clean

bash scripts/check_publish_dry_run.sh ., tessl skill review, and hosted Tessl evals require
Tessl authentication. Hosted evals also require a linked Tessl project. If you can't run one of
them, leave it unchecked and explain why in the details.

Details:

Final local validation on this branch:
- python3 scripts/validate_skill.py skills/java-streams
- python3 scripts/validate_eval_criteria.py evals evals-reference evals-regression
- python3 scripts/validate_json_files.py
- python3 scripts/validate_openai_agent_yaml.py
- python3 -m py_compile scripts/*.py
- bash -n scripts/*.sh
- shellcheck scripts/run_eval_suite.sh
- git diff --check
- tessl plugin lint .
- tessl skill review --threshold 100 skills/java-streams/SKILL.md

Eval wrapper negative checks:
- scripts/run_eval_suite.sh main --skip-baseline -> rejected with exit 2
- scripts/run_eval_suite.sh main --skip-baseline=true -> rejected with exit 2
- scripts/run_eval_suite.sh main -- --skip-baseline -> rejected with exit 2
- scripts/run_eval_suite.sh main -- --skip-baseline=true -> rejected with exit 2
- scripts/run_eval_suite.sh reference 28-overdue-shipment-notices -- --variant with-context -> rejected with exit 2

Publish dry-run:
- bash scripts/check_publish_dry_run.sh .
- tessl plugin publish --dry-run --bump patch .
- tessl plugin publish --dry-run . fails because martinfrancois/java-streams@1.1.4 already exists; patch-bump dry-run confirms 1.1.5 is available.

Codex review fix loop:
- First review found scripts/run_eval_suite.sh allowed --skip-baseline passthrough for main/reference runs. Fixed by rejecting both --variant and --skip-baseline because the wrapper owns suite variant policy.
- Follow-up review found --skip-baseline=true could still bypass the guard. Fixed by rejecting --skip-baseline=* in both argument checks.
- Final review found Java 8 fallback guidance for Collectors.flatMapping could drop empty downstream groups when pre-flattening before groupingBy. Fixed by qualifying pre-flattening and adding a loop fallback that preserves empty groups.
- Final Codex review on pushed HEAD e9b0a15: no discrete correctness issues found.
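
The empty-group issue from the final review can be illustrated with a small Java 8 sketch. On Java 9 and later, `Collectors.flatMapping` used as a `groupingBy` downstream keeps a key whose elements flatten to nothing. The `Order` type and method names below are hypothetical and are not the wording or code of the skill's fallback guidance; they only show why pre-flattening before `groupingBy` can lose such keys and how a plain loop keeps them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example types; compiles against Java 8 APIs only.
final class FlatMappingFallback {

    static final class Order {
        final String customer;
        final List<String> items;

        Order(String customer, List<String> items) {
            this.customer = customer;
            this.items = items;
        }
    }

    // Pre-flatten, then group: a customer whose orders have no items never
    // reaches groupingBy, so that key is missing from the result.
    static Map<String, List<String>> preFlatten(List<Order> orders) {
        return orders.stream()
                .flatMap(FlatMappingFallback::customerItemPairs)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        LinkedHashMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    static Stream<Map.Entry<String, String>> customerItemPairs(Order order) {
        return order.items.stream().map(item -> pair(order.customer, item));
    }

    static Map.Entry<String, String> pair(String customer, String item) {
        return new AbstractMap.SimpleEntry<>(customer, item);
    }

    // Loop fallback: every customer gets an entry, even when all item lists are empty.
    static Map<String, List<String>> loopFallback(List<Order> orders) {
        Map<String, List<String>> itemsByCustomer = new LinkedHashMap<>();
        for (Order order : orders) {
            itemsByCustomer
                    .computeIfAbsent(order.customer, key -> new ArrayList<>())
                    .addAll(order.items);
        }
        return itemsByCustomer;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order("ann", Arrays.asList("pen", "ink")),
                new Order("bob", Collections.<String>emptyList()),
                new Order("ann", Arrays.asList("cap")));
        System.out.println(preFlatten(orders));   // {ann=[pen, ink, cap]}
        System.out.println(loopFallback(orders)); // {ann=[pen, ink, cap], bob=[]}
    }
}
```

Pre-flattening is still fine when dropping empty groups is the intended behavior; the loop matters only when callers rely on every key being present.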

Hosted eval evidence:
- A Sonnet 4.6 targeted run was attempted and failed with 403 Missing required entitlement: modelSelection. The default-solver commands were updated so free accounts without model selection can still run evals.
- `tessl eval run --list-agents` currently reports `claude:deepseek-v4-flash (default)`. The docs intentionally tell contributors to check `--list-agents` instead of hard-coding a current default model.
- Online resources checked: Tessl's changelog documents plan-entitlement-gated model/agent selection (<https://docs.tessl.io/changelog>), and Tessl's default-eval-model blog explains why ordinary skill-development evals should not always pin Sonnet 4.6 (<https://tessl.io/blog/why-were-changing-our-default-eval-model/>).

Targeted hosted results for changed or corrected scenarios:
- evals-reference/05-parallel-cpu-review: run 019ef470-07fd-7676-858b-945ae61c8e87, baseline 39/60, with context 60/60.
- evals-reference/15-session-roster-indexes: run 019ef472-c76e-737b-be6c-3a439dd4aa95, baseline 59/100, with context 100/100.
- evals-regression/22-java8-version-scan: run 019ef46c-35a7-70a9-85a0-231e6f9bb60f, with context 100/100.
- evals-reference/26-uppercase-side-effect-review: run 019ef484-4a21-7383-9ca7-287c8b4e70b7, baseline 74/100, with context 100/100.
- evals/02-delivery-appointments-mapconcurrent after the service-stub preservation fix: run 019ef492-789e-734d-afb4-fad02ec5a452, baseline 90/400, with context 400/400.
- evals-reference/28-overdue-shipment-notices classifier evidence from run 019ef3f7-7c37-74ac-9ab2-191dea36f333: baseline 80/100, with context 100/100, recommended suite reference because the 20.0 pp delta is below the 30.0 pp main-suite floor.
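
The percentage-point arithmetic behind the classifier note for scenario 28 can be sketched as follows. Only the floor comparison described above is modeled; the class and method names are hypothetical, the treatment of a delta exactly equal to the floor is an assumption, and the real `scripts/classify_eval_result.py` may apply further rules that are not shown on this page.

```java
// Hypothetical sketch of the delta check only; not the repository's classifier.
final class SuiteDelta {

    // Main-suite floor in percentage points, as quoted in the classifier evidence.
    static final double MAIN_SUITE_FLOOR_PP = 30.0;

    // Difference between the with-context and baseline scores, in percentage points.
    static double deltaPp(int baselineScore, int baselineMax, int contextScore, int contextMax) {
        double baselinePercent = 100.0 * baselineScore / baselineMax;
        double contextPercent = 100.0 * contextScore / contextMax;
        return contextPercent - baselinePercent;
    }

    // Assumption: a delta equal to the floor counts as meeting it.
    static boolean meetsMainFloor(double deltaPp) {
        return deltaPp >= MAIN_SUITE_FLOOR_PP;
    }

    public static void main(String[] args) {
        double delta = deltaPp(80, 100, 100, 100);
        System.out.println(delta + " pp, meets main floor: " + meetsMainFloor(delta));
    }
}
```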

Broader hosted evidence gathered before the final hosted attempts returned 500 errors:
- Full main run 019ef486-5d27-7068-a787-67dfa842f863 reached with-context 500/500 for 01, 300/300 for 03, and 200/200 for 04. Scenario 02 scored 394/400 due to a service-stub API addition; targeted run 019ef492-789e-734d-afb4-fad02ec5a452 after the fix scored 400/400.
- Full reference run 019ef47f-484a-729d-9040-01627e57c48d reached with-context 100/100 for 08, 27, and 28; 60/60 for 05; 100/100 for 15; and 98/100 for 26. Targeted run 019ef484-4a21-7383-9ca7-287c8b4e70b7 after the performance-guidance fix scored 100/100 for 26.
- Regression targeted run 019ef46c-35a7-70a9-85a0-231e6f9bb60f scored 100/100 for the Java 8 version-scan scenario changed by this PR. Earlier full regression run 019ef3f7-8e1b-72c6-be14-a4082fa90b48 had scenario 04 at 63/65 before later fixes. Creating the final broad regression run is currently blocked by Tessl returning 500, so this PR does not claim final broad regression release-readiness.

Final hosted broad-suite attempts on pushed branch:
- scripts/run_eval_suite.sh main -- --yes --label "issue-36 final full main after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh reference -- --yes --label "issue-36 final full reference after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh regression -- --yes --label "issue-36 final full regression after review fix" -> 500 Internal Server Error, no run ID created.
- scripts/run_eval_suite.sh regression 04-primary-address-review -- --yes --label "issue-36 final targeted regression 04 availability check" -> 500 Internal Server Error, no run ID created.

Generated-artifact audit for issue #36:
- Initial audit was too narrow: it scanned only `-> {` block lambdas.
- Follow-up broader audit found wrapped/body-next-line lambda usage in with-context artifacts for evals-reference/05-parallel-cpu-review, evals-reference/15-session-roster-indexes, and evals-regression/22-java8-version-scan.
- Added criteria to those exact scenarios and reran them targeted to 100% with context.
- Final targeted artifact scan for those scenarios found 7 lambda occurrences in generated Java/fenced code and 0 multiline-lambda hits. Two raw prose hits in the Java 8 review were false positives that described `filter(x -> { ... })` as a bad pattern, not replacement code.
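
The difference between the narrow and the broader audit can be sketched with a line-based heuristic. This is a hypothetical illustration, not the repository's hard-stop scan: it looks only at the last `->` on each line, ignores string literals and comments, and would also match switch-expression arrows, so it can produce false positives like the prose hits described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical heuristic; the real scan used for this PR is not shown on this page.
final class MultilineLambdaScan {

    enum Hit { BLOCK_LAMBDA, BODY_ON_NEXT_LINE, BODY_CONTINUES }

    // Classifies the lambda that starts at the last "->" on the line, or returns null.
    static Hit classify(String line) {
        int arrow = line.lastIndexOf("->");
        if (arrow < 0) {
            return null;
        }
        String body = line.substring(arrow + 2).trim();
        if (body.isEmpty()) {
            return Hit.BODY_ON_NEXT_LINE;
        }
        if (body.startsWith("{")) {
            return Hit.BLOCK_LAMBDA;
        }
        return endsOnSameLine(body) ? null : Hit.BODY_CONTINUES;
    }

    // The body ends when the enclosing call closes, or at a top-level ',' or ';'.
    static boolean endsOnSameLine(String body) {
        int depth = 0;
        for (char c : body.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return true;
                }
            } else if ((c == ',' || c == ';') && depth == 0) {
                return true;
            }
        }
        return false;
    }

    // Returns "lineNumber:HIT" entries with 1-based line numbers.
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Hit hit = classify(lines.get(i));
            if (hit != null) {
                hits.add((i + 1) + ":" + hit);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(scan(Arrays.asList(
                ".filter(s -> isOverdue(s, today))",
                ".map(s -> {",
                ".map(s ->",
                ".flatMap(o -> o.items().stream()")));
    }
}
```

A scan that only searched for `-> {` would report the second line and miss the third and fourth, which is the gap the follow-up audit closed.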

Human Verification

Describe what you tried manually and what result you saw. If the change cannot be tried manually,
explain why.

Reviewed the issue requirements against the final diff. The PR now covers the originally missed wrapped/body-next-line lambda forms, not only block lambdas.

Reviewed the skill text, hard-stop guidance, Java 8 fallback guidance, and stream examples for Java baseline compatibility, ordering, null handling, mutability, parallelism, and behavior preservation. The flatMapping fallback now explicitly preserves empty downstream groups unless omission is intended.

Reviewed the new eval to avoid answer-key leakage: it asks for a natural implementation task and scores behavior plus maintainability rather than prescribing exact helper names or an exact stream chain.

Manually audited generated hosted artifacts from targeted with-context runs for the lambda issue. The final generated code/fenced-code scan for affected scenarios contains no multiline lambda replacement code.

Review Checklist

  • The change is scoped to the sections, skill files, evals, or workflows described above.
  • Validation that applies to this change is checked above, or any unavailable check is explained.
  • If Java stream guidance changed, Java baseline compatibility plus ordering, null handling, and parallelism were considered.
  • If evals or benchmark claims changed, the eval scenarios remain fair and do not leak answer keys, run IDs, or fixed score claims into runtime references.
  • If runtime skill text or references changed, hosted checks were widened from targeted affected scenarios to main/reference/regression as described in docs/agents/workflow.md, or any Tessl blocker is documented.
  • If a runtime skill/reference change was released, the final report includes the published main eval run plus post-change reference and regression run IDs, or a blocker issue for missing broad suites.
  • Main and reference evals were run with both variants when hosted evals were needed; regression evals were run with context only unless reclassification back to reference was being checked.
  • New or moved eval scenarios follow the classifier recommendation, or the PR explains the maintainer-approved override.
  • Every retained eval scenario has a 100% with-context result, or any below-100 result is documented as blocking follow-up rather than classified/reportable coverage.
  • PR title or squash title uses Conventional Commits.
  • Redaction checked: no tokens, private links, private eval artifacts, local host paths, or proprietary Java source.

AI Assistance (if used)

  • AI-assisted PR
  • I confirm I understand and reviewed the change
AI prompts / session logs (optional)
AI-assisted implementation with Codex. Codex inspected the issue, repository guidance, and contribution policy; implemented the skill, eval, docs, and script updates; ran local validation and hosted evals where Tessl allowed them; audited generated eval artifacts for multiline lambda usage; and ran the Codex review fix loop until the final review reported no discrete correctness issues.

@martinfrancois martinfrancois force-pushed the codex/multiline-lambda-guidance branch from 99859bd to 91694e9 Compare June 23, 2026 12:04
@martinfrancois martinfrancois marked this pull request as ready for review June 23, 2026 12:07
Comment thread evals-reference/28-overdue-shipment-notices/criteria.json
Comment thread skills/java-streams/references/hard-stops.md Outdated
Comment thread skills/java-streams/references/java-stream-api.md
Comment thread skills/java-streams/references/stream-examples.md Outdated
Comment thread skills/java-streams/references/stream-examples.md Outdated
Comment thread skills/java-streams/references/stream-examples.md Outdated
Comment thread skills/java-streams/references/stream-examples.md Outdated
Comment thread skills/java-streams/references/stream-examples.md Outdated
@martinfrancois martinfrancois force-pushed the codex/multiline-lambda-guidance branch 20 times, most recently from b9521d8 to b9d5ff6 Compare June 26, 2026 10:19
@martinfrancois martinfrancois force-pushed the codex/multiline-lambda-guidance branch 20 times, most recently from ce9a552 to e90b092 Compare June 26, 2026 15:19
martinfrancois and others added 5 commits June 28, 2026 13:06
Discourage multiline stream lambdas, preserve requested Java artifacts and nested helper types, and keep implementation prompts on direct stream result production when the Java baseline allows it.

Strengthen findFirst/findAny exception wording, bounded mapConcurrent guidance, Java 8 fallbacks, null collector handling, and parallel-stream review advice.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>

Add the overdue shipment reference scenario, update multiline-lambda and Java baseline criteria, and mark the uppercase performance review as an explicit skill invocation.

Keep suite numbering and agent behavior notes aligned with the expanded reference and regression coverage.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>

Update scripts/run_eval_suite.sh for the current Tessl eval CLI, let suite policy choose baseline versus with-context runs, pass --skill java-streams and --force for final readiness evidence, and avoid pinning Sonnet in default commands.

Document that Tessl's default solver is used by default because explicit model selection is entitlement-gated; Sonnet 4.6 or better remains recommended when modelSelection is available for representative checks.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>

Add an internal pre-submit gate that runs quality first, tracks scenario-level evidence by skill and scenario fingerprints, schedules impact and historical-risk probes, and broadens to balanced remaining batches only after targeted evidence is clean.

Document the final hard requirement for runtime skill changes, including 100% quality and 100% with-context evidence across main, reference, and regression for the final skill bundle state.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>

Record how future agents should handle Tessl or other CLI deprecation warnings: update runtime scripts to the replacement path, keep compatibility fallbacks when needed, and refresh agent-facing docs in the same change.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>
@martinfrancois martinfrancois force-pushed the codex/multiline-lambda-guidance branch from 87d33e2 to 5577ced Compare June 28, 2026 11:08
martinfrancois and others added 2 commits June 28, 2026 13:36
Switch the Skill Review workflow from the deprecated tessl skill review command to tessl review run with an explicit martinfrancois workspace so CI is non-interactive and uses the review path validated locally.

Update the pull request checklist to match the workflow command.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>

Require runtime-reference overlap to be classified with explicit evidence types instead of allowing free-form rationale text to bypass validation.

Reclassify focused main and reference scenarios that intentionally overlap runtime guidance, and keep explicit invocation metadata aligned with prompts.

Co-Authored-By: marvinbuff <marvinbuff@hotmail.com>

Co-Authored-By: PReimers <preimers@pm.me>
@martinfrancois martinfrancois merged commit bac5406 into main Jun 28, 2026
8 checks passed
@martinfrancois martinfrancois deleted the codex/multiline-lambda-guidance branch June 28, 2026 13:02
Development

Successfully merging this pull request may close these issues.

feat: Improve readability by avoiding multi-line lambdas
