docs: add multi-turn conversation evaluation support proposal (SUP-0001) by lbfsc · Pull Request #40 · alibaba/skill-up

lbfsc · 2026-05-21T06:40:17Z

Summary

This PR introduces the proposal process and infrastructure for skill-up Enhancement Proposals (SUPs), along with the first proposal SUP-0001: Multi-Turn Conversation Evaluation Support.

Currently, skill-up's evaluator concatenates all turns into a single instruction and sends it to the Agent Engine in one shot, so there is no actual turn-by-turn interaction, no intermediate assertions, and no conditional branching. This PR proposes and designs full multi-turn conversation evaluation capabilities.

Changes

- proposals/CONTRIBUTING.md: Documents the proposal process, lifecycle statuses, and how to create and review proposals.
- proposals/README.md: Index of all skill-up Enhancement Proposals; currently lists SUP-0001.
- proposals/init-proposal.sh: Shell script to bootstrap new proposals from the template (sequential numbering, YAML front matter, slug generation).
- proposals/proposal-template.md.template: Reusable template with standard sections (Summary, Motivation, Goals, Design Details, Test Plan, etc.).
- proposals/0001-multi-turn-conversation-eval.md: SUP-0001 (status: provisional), the full design for multi-turn conversation evaluation, covering:
  - Turn-by-turn evaluator execution engine with session resume via SessionResumer interface
  - post_condition checks (must_contain_any / must_contain_all / must_not_contain) with skip_remaining and fail strategies
  - capture rules (regex + JSONPath) with {{variable}} template substitution across turns
  - Per-turn Judge assertions: turn_response_contains, turn_response_not_contains, tool_called_in_turn, tool_not_called_in_turn
  - Agent implementations: claude_code (--resume) and codex (codex exec resume); qodercli falls back to single-shot mode
  - Full backward compatibility: existing input.prompt single-turn cases are unaffected
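
As a rough illustration of how capture rules and {{variable}} substitution could work together across turns, here is a minimal Python sketch. It is not taken from the proposal's implementation: the helper names `capture_regex` and `render_template`, and the convention of using the first capture group, are assumptions made for this example. JSONPath capture is left out.

```python
import re
from typing import Dict, Optional

# Hypothetical helpers; names and behavior are illustrative assumptions,
# not the API defined in SUP-0001.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def capture_regex(response: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern` in `response`, or None.

    The pattern is expected to contain at least one capture group.
    """
    match = re.search(pattern, response)
    return match.group(1) if match else None


def render_template(content: str, variables: Dict[str, str]) -> str:
    """Replace {{name}} placeholders; fail loudly on an unknown variable."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no captured value for {{{{{name}}}}}")
        return variables[name]

    return _PLACEHOLDER.sub(substitute, content)


# A value captured from the turn 1 response is reused in the turn 2 prompt.
variables: Dict[str, str] = {}
member_id = capture_regex("Created member with ID M-1042.", r"ID (M-\d+)")
if member_id is not None:
    variables["member_id"] = member_id
next_prompt = render_template("Upgrade member {{member_id}} to Gold", variables)
print(next_prompt)  # Upgrade member M-1042 to Gold
```

Raising on an unknown variable, rather than leaving the placeholder in the prompt, is one possible choice; it surfaces a failed capture at the turn where it matters instead of sending a literal `{{member_id}}` to the Agent.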

Related issues

Test Plan

Unit tests, integration tests, and E2E tests are described in detail in the proposal document (proposals/0001-multi-turn-conversation-eval.md#test-plan). Implementation will follow in subsequent PRs across the phases defined in the proposal.

…posals
- Introduced CONTRIBUTING.md to outline the proposal process and requirements.
- Created README.md to provide an overview of enhancement proposals and link to the contributing guide.
- Implemented init-proposal.sh script to bootstrap new proposals with metadata and a template.
- Added proposal-template.md.template for consistent proposal structure and content.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0bbaeeb3e

chatgpt-codex-connector · 2026-05-21T06:47:06Z

+            exit 0
+            ;;
+        -s|--status)
+            STATUS=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')


Validate required values for option flags

The option parser reads $2 directly for --status, --author, and --output without first checking that a value exists, so invoking the script with a dangling flag (for example init-proposal.sh --status) crashes under set -u with an unbound variable error instead of returning a user-facing usage error. This makes the bootstrap tool brittle for normal CLI misuse and is easy to fix by guarding [[ $# -ge 2 ]] (or equivalent) before dereferencing $2 in each flag branch.

chatgpt-codex-connector · 2026-05-21T06:47:06Z

@@ -0,0 +1,132 @@
+---
+title: {{title}}


Quote title in YAML front matter template

The template writes title as an unquoted YAML scalar, so common titles containing a colon followed by a space (for example My: Proposal) produce invalid front matter: the parser rejects the line with a "mapping values are not allowed" error. The generated proposal file is then syntactically broken even though the title is perfectly valid for a human reader. Quoting and escaping the title field is needed to keep the generated markdown parseable.
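
One way to produce a safely quoted title, sketched in Python for illustration. The bootstrap script itself is shell, so this shows the escaping rule rather than a drop-in patch. It relies on JSON string syntax being accepted as a YAML double-quoted scalar; `yaml_quote` and `render_front_matter` are hypothetical names.

```python
import json


def yaml_quote(value: str) -> str:
    """Render `value` as a YAML double-quoted scalar.

    json.dumps escapes double quotes, backslashes, and control characters.
    ensure_ascii=False keeps non-ASCII titles readable instead of emitting
    \\uXXXX escapes.
    """
    return json.dumps(value, ensure_ascii=False)


def render_front_matter(title: str) -> str:
    """Build a minimal front matter block with a quoted title."""
    return f"---\ntitle: {yaml_quote(title)}\n---\n"


print(render_front_matter("My: Proposal"))
```

With this approach the title line becomes `title: "My: Proposal"`, which a YAML parser reads as a single string.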

lijunfeng722 · 2026-05-21T08:58:28Z

Feedback: Add Scenario-Driven Multi-Turn Evaluation

The current proposal is valuable because it makes multi-turn evaluation real: each turn is executed separately, the Agent session is resumed between turns, and the evaluator can inspect intermediate responses through post_condition, capture, and per-turn judge assertions.

However, I think there is an important user-experience risk worth calling out: asking users to author static multi-turn cases can become difficult and fragile for many real-world Skills.

Concern: Static `turns` Are Hard to Maintain

Multi-turn conversation is not just a fixed sequence of user messages. In many cases, the next user message depends on the Agent's previous response. Because LLM outputs are probabilistic, the expected response at each turn may vary across different Agents, different models, and even repeated runs with the same Agent and model.

Even if the framework provides loose assertion patterns such as contains_any, must_contain_any, regex capture, or template variables, users may still spend a lot of time tuning case YAML instead of describing the behavior they actually want to evaluate.

For example, suppose we have a Membership Management Skill that can create and update member profiles. A member may have fields such as name, gender, age, phone number, membership level, and tags.

The user's actual evaluation goal might be:

The Agent should create a new member named Alice. If required information is missing, it should ask for clarification. After receiving the missing information, it should create the member and then support a follow-up update to the member's membership level.

With the current static turns approach, the case author may try to write:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  turns:
    - role: user
      content: "Create a member named Alice"
      post_condition:
        must_contain_any:
          - "age"
          - "gender"
          - "phone"
          - "missing information"
          - "please provide"
        on_fail: fail

    - role: user
      content: "Alice is 28, female, phone number is 13800000000"
      post_condition:
        must_contain_any:
          - "created"
          - "member ID"
          - "success"
        on_fail: fail

    - role: user
      content: "Upgrade her to Gold member"

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
    - tool_called:
        name: update_member_level

But different Agents may reasonably behave in different ways:

- One Agent may ask only for age and phone number.
- Another Agent may ask for gender, age, phone number, and initial membership level.
- Another Agent may create a draft member first and ask for confirmation.
- Another Agent may create the member directly with defaults and then ask whether more profile fields should be completed.
- Another Agent may call a tool successfully but summarize the result with wording that does not match the configured keywords.

The case author then needs to encode many acceptable branches using loose matching, optional post-conditions, capture rules, and retries. This can quickly turn multi-turn case writing into prompt/output chasing.
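
To make the brittleness concrete, here is a minimal sketch of keyword-based post_condition matching. The semantics shown (case-insensitive substring matching) and the `check_post_condition` name are assumptions for illustration; the proposal may define them differently.

```python
from typing import Dict, List


def check_post_condition(response: str, condition: Dict[str, List[str]]) -> bool:
    """Return True when `response` satisfies every configured keyword rule."""
    text = response.lower()
    any_of = condition.get("must_contain_any", [])
    all_of = condition.get("must_contain_all", [])
    none_of = condition.get("must_not_contain", [])
    # At least one of the "any" keywords must appear, if any are configured.
    if any_of and not any(keyword.lower() in text for keyword in any_of):
        return False
    # Every "all" keyword must appear.
    if not all(keyword.lower() in text for keyword in all_of):
        return False
    # No forbidden keyword may appear.
    if any(keyword.lower() in text for keyword in none_of):
        return False
    return True


condition = {"must_contain_any": ["created", "member ID", "success"]}

# Two reasonable ways an Agent might report the same successful tool call.
print(check_post_condition("Member created, member ID is 42.", condition))     # True
print(check_post_condition("Alice has been added to the system.", condition))  # False
```

The second response describes a successful outcome but fails the check, so the case author has to keep widening the keyword list for each new phrasing.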

This does not mean input.turns is wrong. It is still very useful for deterministic protocol tests, such as phase gating, double confirmation, exact regression cases, and cases where each user turn must be precisely controlled. But for more open-ended business workflows, a higher-level mode may be easier for users to author and maintain.

Proposal: Add a ChatterAgent-Driven Conversation Mode

As a complementary mode to static input.turns, we could introduce a scenario-driven multi-turn evaluation mode powered by another Agent, tentatively called ChatterAgent.

In this mode, the case author does not need to predefine every user turn. Instead, each case describes:

- the objective that the simulated user wants to complete;
- the information available to the simulated user;
- optional behavior constraints for how the simulated user should interact;
- the maximum number of turns allowed.

Then the ChatterAgent talks with the Agent-With-Skill until the task appears complete, impossible, or the turn limit is reached. After that, the framework reuses the existing judge.type logic to evaluate the resulting transcript, tool calls, files, workspace state, or final answer.

Conceptually:

Case objective and chatter context
  -> ChatterAgent simulates the user
      -> talks with Agent-With-Skill for up to max_turns
          -> full transcript
              -> existing judge(type: rule_based / agent_judge / etc.)

The important point is that this should not introduce a separate evaluation framework. The ChatterAgent is responsible for generating adaptive user turns. The existing Judge remains responsible for deciding whether the Skill succeeded.
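
The flow above could be sketched as a small driver loop. Everything here is hypothetical: `run_conversation`, the `chatter` and `agent` callables, and the `DONE` sentinel are illustrative stand-ins, not interfaces from skill-up or SUP-0001. The stubs replace real model calls so the sketch runs on its own.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
DONE = None  # hypothetical sentinel: the simulated user has nothing more to say


def run_conversation(
    chatter: Callable[[List[Message]], Optional[str]],
    agent: Callable[[str], str],
    max_turns: int,
) -> List[Message]:
    """Alternate simulated-user and Agent messages; return the transcript."""
    transcript: List[Message] = []
    for _ in range(max_turns):
        user_message = chatter(transcript)
        if user_message is DONE:  # task appears complete or impossible
            break
        transcript.append({"source": "chatter_agent", "content": user_message})
        reply = agent(user_message)  # a real engine would resume the session here
        transcript.append({"source": "agent_with_skill", "content": reply})
    return transcript


def stub_chatter(transcript: List[Message]) -> Optional[str]:
    """Stub simulated user: reveals details only after being asked."""
    if not transcript:
        return "Create a member named Alice"
    last_reply = transcript[-1]["content"]
    if "age" in last_reply:
        return "Alice is 28, phone 13800000000"
    return DONE


def stub_agent(message: str) -> str:
    """Stub Agent: asks for missing details, then confirms creation."""
    if "28" in message:
        return "Member Alice created."
    return "What is her age and phone number?"


transcript = run_conversation(stub_chatter, stub_agent, max_turns=6)
for entry in transcript:
    print(f'{entry["source"]}: {entry["content"]}')
```

The loop only produces a transcript. Deciding whether the Skill succeeded stays with the existing Judge, which would receive this transcript just as it would for static turns.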

Possible Generic Schema

The schema should avoid fields that are too specific to a membership example, such as user_facts at the top level. A more generic design could place all simulated-user configuration under chatter, with knowledge as an arbitrary structured YAML object.

Also, task and chatter_instructions may overlap. A clearer split is:

- objective: what the simulated user is trying to accomplish;
- knowledge: information the simulated user knows and may provide during the conversation;
- behavior: optional interaction constraints or style.

Example:

id: membership-create-and-upgrade
title: Create a member after clarification, then upgrade membership level

input:
  conversation:
    mode: chatter_agent
    max_turns: 6

    chatter:
      role: "business user"
      objective: >
        Create a new member named Alice, then upgrade her membership level to Gold.
        If the Agent asks for missing information, provide the required information.

      knowledge:
        member:
          name: Alice
          gender: female
          age: 28
          phone: "13800000000"
          initial_level: Silver
          target_level: Gold

      behavior:
        - "Act like a normal user, not an evaluator."
        - "Do not reveal all information at once unless the Agent asks for it."
        - "Do not invent information outside the provided knowledge."
        - "Stop once the task appears completed."

judge:
  type: rule_based
  success:
    - tool_called:
        name: create_member
        args:
          name: Alice
          age: 28
          gender: female
    - tool_called:
        name: update_member_level
        args:
          level: Gold

This keeps the case closer to the user's intent:

Can the Skill complete this task through conversation?

instead of forcing the user to answer:

Can I predict every intermediate Agent response and manually maintain the exact next user turn?

Suggested Positioning

The two modes can be complementary:

| Mode | Best For | User Responsibility |
| --- | --- | --- |
| `input.turns` | Deterministic scripted multi-turn tests, protocol checks, guardrail verification, exact regressions | Author each turn and per-turn assertions |
| `input.conversation` with `chatter_agent` | Adaptive business workflows, clarification flows, goal-oriented task completion | Describe the objective, available knowledge, and judge criteria |

In other words, input.turns is a low-level deterministic mechanism, while chatter_agent is a higher-level scenario mechanism.

Additional Design Considerations

To make chatter_agent mode reliable enough for evaluation, the proposal may need to define several constraints:

- max_turns should be required to control cost and prevent infinite conversations.
- The ChatterAgent should have a clear stop condition, such as task completed, task impossible, or max turns reached.
- The transcript should mark which messages came from the ChatterAgent and which came from the Agent-With-Skill.
- The Judge should receive the full transcript exactly like static multi-turn mode.
- The final report should include the generated user turns so failures remain debuggable.
- The ChatterAgent should be instructed not to evaluate the Agent directly, but only to play the simulated user role.
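
A sketch of how the max_turns requirement could be enforced when a case is loaded. The field names follow the example schema in this comment (`mode`, `max_turns`, `chatter.objective`); the `validate_conversation` function, its error messages, and the choice to treat `chatter.objective` as required are assumptions made for illustration.

```python
from typing import Any, Dict, List


def validate_conversation(conversation: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config is usable."""
    problems: List[str] = []
    if conversation.get("mode") != "chatter_agent":
        problems.append("mode must be 'chatter_agent'")
    max_turns = conversation.get("max_turns")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        not isinstance(max_turns, int)
        or isinstance(max_turns, bool)
        or max_turns < 1
    ):
        problems.append("max_turns is required and must be a positive integer")
    chatter = conversation.get("chatter") or {}
    if not str(chatter.get("objective", "")).strip():
        problems.append("chatter.objective is required")
    return problems


# A config that forgot max_turns is rejected before any Agent call is made.
config = {"mode": "chatter_agent", "chatter": {"objective": "Create Alice"}}
print(validate_conversation(config))
```

Failing at load time keeps an unbounded conversation from ever starting, which matters because each extra turn costs a model call on both sides.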

This approach would reduce the burden of authoring multi-turn cases while still preserving the current proposal's core architecture: real turn-by-turn execution, session continuity, transcript collection, and reuse of existing judge types.

hittyt · 2026-05-21T09:06:46Z

+    - turn_response_contains:      # Assert turn 2 response contains rejection keywords
+        turn: 2
+        contains_any: ["need to complete first", "cannot skip", "execute in order"]
+  failure:


Why do we need failure criteria?
What happens if a response meets neither success nor failure criteria?

lbfsc requested a review from hittyt as a code owner May 21, 2026 06:40

lbfsc requested review from jwx0925, lijunfeng722 and zpzjzj May 21, 2026 06:40

lbfsc self-assigned this May 21, 2026

lbfsc added the documentation Improvements or additions to documentation label May 21, 2026

lbfsc added this to skill-up project May 21, 2026

github-project-automation Bot moved this to Backlog in skill-up project May 21, 2026

lbfsc moved this from Backlog to In review in skill-up project May 21, 2026

chatgpt-codex-connector Bot reviewed May 21, 2026

hittyt reviewed May 21, 2026

lbfsc wants to merge 1 commit into main from docs/add-multi-turn-conversation-eval-support-proposal