docs: add multi-turn conversation evaluation support proposal (SUP-0001) #40
lbfsc wants to merge 1 commit into
…posals
- Introduced CONTRIBUTING.md to outline the proposal process and requirements.
- Created README.md to provide an overview of enhancement proposals and link to the contributing guide.
- Implemented init-proposal.sh script to bootstrap new proposals with metadata and a template.
- Added proposal-template.md.template for consistent proposal structure and content.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0bbaeeb3e
        exit 0
        ;;
      -s|--status)
        STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
Validate required values for option flags
The option parser reads $2 directly for --status, --author, and --output without first checking that a value exists, so invoking the script with a dangling flag (for example init-proposal.sh --status) crashes under set -u with an unbound variable error instead of returning a user-facing usage error. This makes the bootstrap tool brittle for normal CLI misuse and is easy to fix by guarding [[ $# -ge 2 ]] (or equivalent) before dereferencing $2 in each flag branch.
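One way the suggested guard could look is sketched below. This is a minimal illustration, not the actual contents of init-proposal.sh: the `require_value` and `parse_args` functions are hypothetical, only the long forms `--author` and `--output` are assumed (the review does not show their short forms), and the sketch returns status 2 where the real script would presumably print usage and exit.

```shell
set -u

# Hypothetical helper: report a usage error when a flag has no value after it.
# $1 = flag name, $2 = number of arguments remaining (including the flag itself)
require_value() {
  if [ "$2" -lt 2 ]; then
    printf 'error: %s requires a value\n' "$1" >&2
    return 2
  fi
}

# Hypothetical parser covering only the three flags named in the review.
parse_args() {
  STATUS="" AUTHOR="" OUTPUT=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -s|--status)
        # Check that a value exists before touching $2.
        require_value "$1" "$#" || return 2
        STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
        shift 2
        ;;
      --author)
        require_value "$1" "$#" || return 2
        AUTHOR=$2
        shift 2
        ;;
      --output)
        require_value "$1" "$#" || return 2
        OUTPUT=$2
        shift 2
        ;;
      *)
        printf 'error: unknown argument: %s\n' "$1" >&2
        return 2
        ;;
    esac
  done
}
```

With this shape, `parse_args --status` reports a missing value and returns 2 instead of tripping `set -u`. The guard does not catch a flag being swallowed as a value (for example `--status --author`); that would need an extra check.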
    @@ -0,0 +1,132 @@
    ---
    title: {{title}}
Quote title in YAML front matter template
The template writes title as an unquoted YAML scalar, so common titles containing : (for example My: Proposal) produce invalid front matter (mapping values are not allowed when parsed). This makes the generated proposal file syntactically broken for valid human titles; quoting and escaping the title field is needed to keep generated markdown parseable.
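A possible fix on the script side is to escape the title before it is substituted into the template and emit it as a double-quoted YAML scalar. The `yaml_quote` helper below is hypothetical and only a sketch: it assumes single-line titles, and how init-proposal.sh actually fills in `{{title}}` is not shown in this review.

```shell
# Hypothetical helper: print a single-line string as a double-quoted YAML scalar.
yaml_quote() {
  local escaped
  # Escape backslashes first, then double quotes, so the added
  # backslashes are not escaped a second time.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '"%s"' "$escaped"
}

# Example: a title containing ": " stays valid once quoted.
printf 'title: %s\n' "$(yaml_quote 'My: Proposal')"
```

The example prints `title: "My: Proposal"`, which parses as a plain string rather than a nested mapping.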
Feedback: Add Scenario-Driven Multi-Turn Evaluation
The current proposal is valuable because it makes multi-turn evaluation real: each turn is executed separately, the Agent session is resumed between turns, and the evaluator can inspect intermediate responses through […]
However, I think there is an important user-experience risk worth calling out: asking users to author static multi-turn cases can become difficult and fragile for many real-world Skills.
Concern: Static […]
| Mode | Best For | User Responsibility |
|---|---|---|
| `input.turns` | Deterministic scripted multi-turn tests, protocol checks, guardrail verification, exact regressions | Author each turn and per-turn assertions |
| `input.conversation` with `chatter_agent` | Adaptive business workflows, clarification flows, goal-oriented task completion | Describe the objective, available knowledge, and judge criteria |
In other words, input.turns is a low-level deterministic mechanism, while chatter_agent is a higher-level scenario mechanism.
Additional Design Considerations
To make chatter_agent mode reliable enough for evaluation, the proposal may need to define several constraints:
- `max_turns` should be required to control cost and prevent infinite conversations.
- The ChatterAgent should have a clear stop condition, such as task completed, task impossible, or max turns reached.
- The transcript should mark which messages came from the ChatterAgent and which came from the Agent-With-Skill.
- The Judge should receive the full transcript exactly like static multi-turn mode.
- The final report should include the generated user turns so failures remain debuggable.
- The ChatterAgent should be instructed not to evaluate the Agent directly, but only to play the simulated user role.
This approach would reduce the burden of authoring multi-turn cases while still preserving the current proposal's core architecture: real turn-by-turn execution, session continuity, transcript collection, and reuse of existing judge types.
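The constraints above can be sketched as a bounded loop. Everything here is a hypothetical stand-in rather than part of the proposal: `chatter_next` and `agent_reply` are stubs for the simulated user and the Agent-With-Skill, the `DONE` and `IMPOSSIBLE` sentinels are an assumed way for the ChatterAgent to signal a stop, and the transcript format is illustrative only.

```shell
# Hypothetical upper bound on user turns (the max_turns constraint).
MAX_TURNS=3

# Stub simulated user: prints the next user message, or a stop sentinel.
chatter_next() {
  case "$1" in
    1) echo "I want to reset my password" ;;
    2) echo "My account email is user@example.com" ;;
    *) echo "DONE" ;;
  esac
}

# Stub Agent-With-Skill: returns a canned reply to the given message.
agent_reply() {
  echo "ack: $1"
}

run_conversation() {
  turn=1
  stop_reason="max_turns_reached"
  while [ "$turn" -le "$MAX_TURNS" ]; do
    msg=$(chatter_next "$turn")
    # Explicit stop conditions: task completed or task impossible.
    case "$msg" in
      DONE)       stop_reason="task_completed"; break ;;
      IMPOSSIBLE) stop_reason="task_impossible"; break ;;
    esac
    # Label each transcript line with its speaker so generated
    # user turns stay visible in the final report.
    printf '[turn %s] chatter_agent: %s\n' "$turn" "$msg"
    printf '[turn %s] agent: %s\n' "$turn" "$(agent_reply "$msg")"
    turn=$((turn + 1))
  done
  printf 'stop_reason: %s\n' "$stop_reason"
}

run_conversation
```

The loop never evaluates the Agent; it only produces a labeled transcript and a stop reason, leaving judgment to the existing Judge as the feedback suggests.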
    - turn_response_contains: # Assert turn 2 response contains rejection keywords
        turn: 2
        contains_any: ["need to complete first", "cannot skip", "execute in order"]
    failure:
Why do we need failure criteria?
What happens if a response meets neither success nor failure criteria?
Summary
This PR introduces the proposal process and infrastructure for skill-up Enhancement Proposals (SUPs), along with the first proposal SUP-0001: Multi-Turn Conversation Evaluation Support.
Currently, skill-up's evaluator concatenates all turns into a single instruction and sends it to the Agent Engine in one shot — there is no actual turn-by-turn interaction, intermediate assertions, or conditional branching. This PR proposes and designs full multi-turn conversation evaluation capabilities.
Changes
- `proposals/CONTRIBUTING.md`: Documents the proposal process, lifecycle statuses, and how to create and review proposals.
- `proposals/README.md`: Index of all skill-up Enhancement Proposals; currently lists SUP-0001.
- `proposals/init-proposal.sh`: Shell script to bootstrap new proposals from the template (sequential numbering, YAML front matter, slug generation).
- `proposals/proposal-template.md.template`: Reusable template with standard sections (Summary, Motivation, Goals, Design Details, Test Plan, etc.).
- `proposals/0001-multi-turn-conversation-eval.md`: SUP-0001 (status: `provisional`), the full design for multi-turn conversation evaluation, covering:
  - `SessionResumer` interface
  - `post_condition` checks (`must_contain_any` / `must_contain_all` / `must_not_contain`) with `skip_remaining` and `fail` strategies
  - `capture` rules (regex + JSONPath) with `{{variable}}` template substitution across turns
  - `turn_response_contains`, `turn_response_not_contains`, `tool_called_in_turn`, `tool_not_called_in_turn`
  - `claude_code` (`--resume`) and `codex` (`codex exec resume`); `qodercli` falls back to single-shot mode
  - `input.prompt` single-turn cases are unaffected
Related issues
Test Plan
Unit tests, integration tests, and E2E tests are described in detail in the proposal document (`proposals/0001-multi-turn-conversation-eval.md#test-plan`). Implementation will follow in subsequent PRs across the phases defined in the proposal.