docs: add multi-turn conversation evaluation support proposal (SUP-0001) #40

Open
lbfsc wants to merge 1 commit into main from docs/add-multi-turn-conversation-eval-support-proposal

lbfsc (Collaborator) commented May 21, 2026

Summary

This PR introduces the proposal process and infrastructure for skill-up Enhancement Proposals (SUPs), along with the first proposal SUP-0001: Multi-Turn Conversation Evaluation Support.

Currently, skill-up's evaluator concatenates all turns into a single instruction and sends it to the Agent Engine in one shot — there is no actual turn-by-turn interaction, intermediate assertions, or conditional branching. This PR proposes and designs full multi-turn conversation evaluation capabilities.

Changes

  • proposals/CONTRIBUTING.md: Documents the proposal process, lifecycle statuses, and how to create and review proposals.
  • proposals/README.md: Index of all skill-up Enhancement Proposals; currently lists SUP-0001.
  • proposals/init-proposal.sh: Shell script to bootstrap new proposals from the template (sequential numbering, YAML front matter, slug generation).
  • proposals/proposal-template.md.template: Reusable template with standard sections (Summary, Motivation, Goals, Design Details, Test Plan, etc.).
  • proposals/0001-multi-turn-conversation-eval.md: SUP-0001 (status: provisional), the full design for multi-turn conversation evaluation, covering:
      • Turn-by-turn evaluator execution engine with session resume via SessionResumer interface
      • post_condition checks (must_contain_any / must_contain_all / must_not_contain) with skip_remaining and fail strategies
      • capture rules (regex + JSONPath) with {{variable}} template substitution across turns
      • Per-turn Judge assertions: turn_response_contains, turn_response_not_contains, tool_called_in_turn, tool_not_called_in_turn
      • Agent implementations: claude_code (--resume) and codex (codex exec resume); qodercli falls back to single-shot mode
      • Full backward compatibility: existing input.prompt single-turn cases are unaffected

Related issues

Test Plan

Unit tests, integration tests, and E2E tests are described in detail in the proposal document (proposals/0001-multi-turn-conversation-eval.md#test-plan). Implementation will follow in subsequent PRs across the phases defined in the proposal.

…posals

- Introduced CONTRIBUTING.md to outline the proposal process and requirements.
- Created README.md to provide an overview of enhancement proposals and link to the contributing guide.
- Implemented init-proposal.sh script to bootstrap new proposals with metadata and a template.
- Added proposal-template.md.template for consistent proposal structure and content.
lbfsc requested a review from hittyt as a code owner May 21, 2026 06:40
lbfsc requested review from jwx0925, lijunfeng722 and zpzjzj May 21, 2026 06:40
lbfsc self-assigned this May 21, 2026
lbfsc added the documentation (Improvements or additions to documentation) label May 21, 2026
lbfsc moved this from Backlog to In review in skill-up project May 21, 2026

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0bbaeeb3e

    exit 0
    ;;
  -s|--status)
    STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')

P2: Validate required values for option flags

The option parser reads $2 directly for --status, --author, and --output without first checking that a value exists, so invoking the script with a dangling flag (for example init-proposal.sh --status) crashes under set -u with an unbound variable error instead of returning a user-facing usage error. This makes the bootstrap tool brittle for normal CLI misuse and is easy to fix by guarding [[ $# -ge 2 ]] (or equivalent) before dereferencing $2 in each flag branch.

@@ -0,0 +1,132 @@
---
title: {{title}}

P2: Quote title in YAML front matter template

The template writes title as an unquoted YAML scalar, so common titles containing : (for example My: Proposal) produce invalid front matter: a YAML parser rejects the line with a "mapping values are not allowed" error. The generated proposal file is then syntactically broken even for ordinary, valid titles; the title field needs quoting and escaping to keep the generated front matter parseable.
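
The escaping this comment asks for can be sketched as follows. The actual bootstrap tool is a shell script, so this Python version only illustrates the rule, and the function names are hypothetical. It relies on the fact that a JSON string literal is also a valid YAML double-quoted scalar, which lets json.dumps take care of embedded quotes, backslashes, and control characters.

```python
import json


def yaml_quote(value: str) -> str:
    """Return value as a double-quoted scalar that is safe in YAML front matter."""
    # ensure_ascii=False keeps non-ASCII titles readable instead of \uXXXX escapes.
    return json.dumps(value, ensure_ascii=False)


def render_front_matter_line(template_line: str, title: str) -> str:
    """Fill the {{title}} placeholder with a quoted, escaped title."""
    return template_line.replace("{{title}}", yaml_quote(title))


print(render_front_matter_line("title: {{title}}", "My: Proposal"))
# prints: title: "My: Proposal"
```

With the quoted form, a colon, a leading #, or embedded quotes in the title no longer change how the line parses.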

lijunfeng722 (Collaborator) commented

Feedback: Add Scenario-Driven Multi-Turn Evaluation

The current proposal is valuable because it makes multi-turn evaluation real: each turn is executed separately, the Agent session is resumed between turns, and the evaluator can inspect intermediate responses through post_condition, capture, and per-turn judge assertions.

However, I think there is an important user-experience risk worth calling out: asking users to author static multi-turn cases can become difficult and fragile for many real-world Skills.

Concern: Static turns Are Hard to Maintain

Multi-turn conversation is not just a fixed sequence of user messages. In many cases, the next user message depends on the Agent's previous response. Because LLM outputs are probabilistic, the expected response at each turn may vary across different Agents, different models, and even repeated runs with the same Agent and model.

Even if the framework provides loose assertion patterns such as contains_any, must_contain_any, regex capture, or template variables, users may still spend a lot of time tuning case YAML instead of describing the behavior they actually want to evaluate.

For example, suppose we have a Membership Management Skill that can create and update member profiles. A member may have fields such as name, gender, age, phone number, membership level, and tags.

The user's actual evaluation goal might be:

The Agent should create a new member named Alice. If required information is missing, it should ask for clarification. After receiving the missing information, it should create the member and then support a follow-up update to the member's membership level.

With the current static turns approach, the case author may try to write:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  turns:
    - role: user
      content: "Create a member named Alice"
      post_condition:
        must_contain_any:
          - "age"
          - "gender"
          - "phone"
          - "missing information"
          - "please provide"
        on_fail: fail

    - role: user
      content: "Alice is 28, female, phone number is 13800000000"
      post_condition:
        must_contain_any:
          - "created"
          - "member ID"
          - "success"
        on_fail: fail

    - role: user
      content: "Upgrade her to Gold member"

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
    - tool_called:
        name: update_member_level

But different Agents may reasonably behave in different ways:

  • One Agent may ask only for age and phone number.
  • Another Agent may ask for gender, age, phone number, and initial membership level.
  • Another Agent may create a draft member first and ask for confirmation.
  • Another Agent may create the member directly with defaults and then ask whether more profile fields should be completed.
  • Another Agent may call a tool successfully but summarize the result with wording that does not match the configured keywords.

The case author then needs to encode many acceptable branches using loose matching, optional post-conditions, capture rules, and retries. This can quickly turn multi-turn case writing into prompt/output chasing.

This does not mean input.turns is wrong. It is still very useful for deterministic protocol tests, such as phase gating, double confirmation, exact regression cases, and cases where each user turn must be precisely controlled. But for more open-ended business workflows, a higher-level mode may be easier for users to author and maintain.

Proposal: Add a ChatterAgent-Driven Conversation Mode

As a complementary mode to static input.turns, we could introduce a scenario-driven multi-turn evaluation mode powered by another Agent, tentatively called ChatterAgent.

In this mode, the case author does not need to predefine every user turn. Instead, each case describes:

  • the objective that the simulated user wants to complete;
  • the information available to the simulated user;
  • optional behavior constraints for how the simulated user should interact;
  • the maximum number of turns allowed.

Then the ChatterAgent talks with the Agent-With-Skill until the task appears complete, impossible, or the turn limit is reached. After that, the framework reuses the existing judge.type logic to evaluate the resulting transcript, tool calls, files, workspace state, or final answer.

Conceptually:

Case objective and chatter context
  -> ChatterAgent simulates the user
      -> talks with Agent-With-Skill for up to max_turns
          -> full transcript
              -> existing judge(type: rule_based / agent_judge / etc.)

The important point is that this should not introduce a separate evaluation framework. The ChatterAgent is responsible for generating adaptive user turns. The existing Judge remains responsible for deciding whether the Skill succeeded.
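
A rough Python sketch of that loop is shown here, only to make the division of labor concrete. Everything in it is an assumption for illustration: the Message type, the two callable signatures, the speaker labels, and the stop-reason strings are hypothetical, and a real ChatterAgent would call a model rather than a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Message:
    speaker: str  # "chatter_agent" or "agent_with_skill"
    text: str


# Hypothetical hook: given the transcript so far, return the next simulated
# user message, or None when the task looks complete or impossible.
NextUserTurn = Callable[[list[Message]], Optional[str]]

# Hypothetical hook: send one user message into the (resumed) Agent session
# and return the Agent's reply.
SendToAgent = Callable[[str], str]


def run_conversation(
    next_user_turn: NextUserTurn,
    send_to_agent: SendToAgent,
    max_turns: int,
) -> tuple[list[Message], str]:
    """Drive an adaptive conversation and return (transcript, stop_reason).

    The loop only generates turns; deciding pass or fail is left to the
    existing judge, which would receive the returned transcript.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    transcript: list[Message] = []
    for _ in range(max_turns):
        user_text = next_user_turn(transcript)
        if user_text is None:
            return transcript, "stopped_by_chatter"
        transcript.append(Message("chatter_agent", user_text))
        transcript.append(Message("agent_with_skill", send_to_agent(user_text)))
    return transcript, "max_turns_reached"


# Usage with scripted stand-ins for both agents.
def scripted_chatter(transcript: list[Message]) -> Optional[str]:
    script = ["Create a member named Alice", "Alice is 28, phone 13800000000"]
    turn = len(transcript) // 2  # two messages are recorded per turn
    return script[turn] if turn < len(script) else None


def echo_agent(message: str) -> str:
    return f"ack: {message}"


transcript, reason = run_conversation(scripted_chatter, echo_agent, max_turns=6)
```

Keeping the speaker on every message means the judge and the final report can tell generated user turns apart from Agent replies without any extra bookkeeping.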

Possible Generic Schema

The schema should avoid fields that are too specific to a membership example, such as user_facts at the top level. A more generic design could place all simulated-user configuration under chatter, with knowledge as an arbitrary structured YAML object.

Also, task and chatter_instructions may overlap. A clearer split is:

  • objective: what the simulated user is trying to accomplish;
  • knowledge: information the simulated user knows and may provide during the conversation;
  • behavior: optional interaction constraints or style.

Example:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  conversation:
    mode: chatter_agent
    max_turns: 6

    chatter:
      role: "business user"
      objective: >
        Create a new member named Alice, then upgrade her membership level to Gold.
        If the Agent asks for missing information, provide the required information.

      knowledge:
        member:
          name: Alice
          gender: female
          age: 28
          phone: "13800000000"
          initial_level: Silver
          target_level: Gold

      behavior:
        - "Act like a normal user, not an evaluator."
        - "Do not reveal all information at once unless the Agent asks for it."
        - "Do not invent information outside the provided knowledge."
        - "Stop once the task appears completed."

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
        args:
          name: Alice
          age: 28
          gender: female
    - tool_called:
        name: update_member_level
        args:
          level: Gold

This keeps the case closer to the user's intent:

Can the Skill complete this task through conversation?

instead of forcing the user to answer:

Can I predict every intermediate Agent response and manually maintain the exact next user turn?

Suggested Positioning

The two modes can be complementary:

| Mode | Best For | User Responsibility |
| --- | --- | --- |
| input.turns | Deterministic scripted multi-turn tests, protocol checks, guardrail verification, exact regressions | Author each turn and per-turn assertions |
| input.conversation with chatter_agent | Adaptive business workflows, clarification flows, goal-oriented task completion | Describe the objective, available knowledge, and judge criteria |

In other words, input.turns is a low-level deterministic mechanism, while chatter_agent is a higher-level scenario mechanism.
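
One way the evaluator could pick between these modes is to dispatch on which key is present under input. The Python sketch below assumes the three keys are mutually exclusive, which neither the proposal nor this comment states; the function name and the returned labels are hypothetical.

```python
def select_input_mode(input_cfg: dict) -> str:
    """Pick the execution mode from the keys present under `input`.

    Assumption: exactly one of prompt / turns / conversation may be set.
    """
    known = ("prompt", "turns", "conversation")
    present = [key for key in known if key in input_cfg]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {known}, found {present or 'none'}"
        )
    key = present[0]
    if key == "prompt":
        return "single_shot"  # existing single-turn cases stay unchanged
    if key == "turns":
        return "static_turns"  # scripted, deterministic turns
    mode = input_cfg["conversation"].get("mode")
    if mode != "chatter_agent":
        raise ValueError(f"unsupported conversation mode: {mode!r}")
    return "chatter_agent"  # adaptive, scenario-driven turns
```

Dispatching on the key keeps existing input.prompt cases on their current path and leaves room for further conversation modes later.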

Additional Design Considerations

To make chatter_agent mode reliable enough for evaluation, the proposal may need to define several constraints:

  • max_turns should be required to control cost and prevent infinite conversations.
  • The ChatterAgent should have a clear stop condition, such as task completed, task impossible, or max turns reached.
  • The transcript should mark which messages came from the ChatterAgent and which came from the Agent-With-Skill.
  • The Judge should receive the full transcript exactly like static multi-turn mode.
  • The final report should include the generated user turns so failures remain debuggable.
  • The ChatterAgent should be instructed not to evaluate the Agent directly, but only to play the simulated user role.
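
The first constraint in the list above could be enforced when a case is loaded, before any Agent is started. The Python sketch below checks an already-parsed input.conversation mapping shaped like the example schema earlier in this comment. The function name and error messages are hypothetical, and treating chatter.objective as required is an assumption beyond what the list states.

```python
def validate_conversation(conv: dict) -> list[str]:
    """Return a list of problems found in an `input.conversation` block.

    An empty list means the block passed these (assumed) checks.
    """
    errors: list[str] = []

    if conv.get("mode") != "chatter_agent":
        errors.append("mode must be 'chatter_agent'")

    max_turns = conv.get("max_turns")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if (
        isinstance(max_turns, bool)
        or not isinstance(max_turns, int)
        or max_turns < 1
    ):
        errors.append("max_turns is required and must be a positive integer")

    chatter = conv.get("chatter")
    if not isinstance(chatter, dict):
        errors.append("chatter is required and must be a mapping")
    elif not str(chatter.get("objective") or "").strip():
        errors.append("chatter.objective is required")

    return errors
```

Rejecting a missing max_turns at load time, rather than falling back to a default, makes the cost bound of each case explicit in the YAML.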

This approach would reduce the burden of authoring multi-turn cases while still preserving the current proposal's core architecture: real turn-by-turn execution, session continuity, transcript collection, and reuse of existing judge types.

    - turn_response_contains: # Assert turn 2 response contains rejection keywords
        turn: 2
        contains_any: ["need to complete first", "cannot skip", "execute in order"]
  failure:

Why do we need failure criteria?
What happens if a response meets neither success nor failure criteria?

Labels

documentation Improvements or additions to documentation

Projects

Status: In review

3 participants