Fix eval set for GPT 53 upgrade #507

Open
harsheetjain wants to merge 1 commit into microsoft:main from harsheetjain:u/harsheetjain/fix-evalset-for-gpt53

@harsheetjain harsheetjain commented May 8, 2026

Why prompt changes are needed between model upgrades

Eval prompts are not model-agnostic. Each model generation interprets intent,
phrasing, and ambiguity differently, so a query that resolves cleanly on one model
can become ambiguous on the next.

In this case, GPT-5.3 was reading the yes/no framing ("Can I…?") as a
permissions question rather than as the intended two-part request: the
information itself and the process for updating it. The reworded query states
both sub-intents explicitly.
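
As a rough illustration of the framing problem described above, the sketch below flags eval queries that open with a yes/no auxiliary ("Can I", "Is it", "Do I") so they can be reviewed before a model upgrade. This is not part of this PR or of the eval tooling: the function names, the opener list, and the sample queries are all hypothetical, and a leading-word heuristic like this will miss some ambiguous queries and flag some harmless ones.

```python
import re

# Hypothetical opener list: auxiliaries that usually start a yes/no question.
YES_NO_OPENERS = (
    "can", "could", "may", "am", "is", "are",
    "do", "does", "did", "will", "would", "should",
)

# Match one of the openers as a whole word at the start of the query.
_OPENER_RE = re.compile(
    r"^\s*(?:" + "|".join(YES_NO_OPENERS) + r")\b",
    re.IGNORECASE,
)


def is_yes_no_framed(query: str) -> bool:
    """Return True if the query starts with a yes/no auxiliary."""
    return bool(_OPENER_RE.match(query))


def flag_yes_no_queries(queries):
    """Return the queries that may be read as yes/no (e.g. permissions) questions."""
    return [q for q in queries if is_yes_no_framed(q)]


if __name__ == "__main__":
    # Invented sample queries, not taken from the actual eval set.
    samples = [
        "Can I change my home address?",
        "What home address is on file for me, and how do I update it?",
    ]
    for q in flag_yes_no_queries(samples):
        print("review framing:", q)
```

Flagged queries are only candidates for rewording; whether a given one needs to change still depends on how the target model scores it.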

🔑 When we upgrade the underlying model, eval sets must be re-tuned so the
scorecard measures agent quality, not phrasing artifacts the new model handles
differently. Otherwise regressions reflect prompt drift, not real behavior changes.


Test plan

  • Re-run ESS eval suite against GPT-5.3
  • Confirm the reworded query scores ≥ threshold on CompareMeaning
  • Spot-check no regression on prior model baseline
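
The last two checks in the plan can be sketched as a small gate over per-query scores. This is a minimal sketch under stated assumptions, not the ESS suite's actual interface: it assumes CompareMeaning scores are available as plain floats keyed by a query ID, and `check_scores`, `ScoreCheck`, `threshold`, and `tolerance` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreCheck:
    """Result of comparing candidate scores to a threshold and a baseline."""
    below_threshold: list = field(default_factory=list)
    regressed: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.below_threshold and not self.regressed


def check_scores(candidate, baseline, threshold, tolerance=0.0):
    """Compare per-query scores (dicts of query ID -> float).

    A query is 'below threshold' if its candidate score is under `threshold`.
    A query is 'regressed' if it also appears in `baseline` and its candidate
    score is more than `tolerance` lower than the baseline score.
    """
    below = sorted(qid for qid, s in candidate.items() if s < threshold)
    regressed = sorted(
        qid
        for qid, s in candidate.items()
        if qid in baseline and s < baseline[qid] - tolerance
    )
    return ScoreCheck(below_threshold=below, regressed=regressed)


if __name__ == "__main__":
    # Invented numbers for illustration only.
    candidate = {"q1": 0.9, "q2": 0.6}
    baseline = {"q1": 0.85, "q2": 0.8}
    result = check_scores(candidate, baseline, threshold=0.7, tolerance=0.05)
    print(result.passed, result.below_threshold, result.regressed)
```

For the spot-check on the prior model, `candidate` would hold the reworded query's scores on that model and `baseline` the original query's scores on the same model; the tolerance is there because scores from a model-graded metric typically vary somewhat between runs.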

@harsheetjain harsheetjain requested a review from a team as a code owner May 8, 2026 06:11
@harsheetjain
Author

@microsoft-github-policy-service agree company="Microsoft"

Contributor

@nkemms nkemms left a comment

👍

@harsheetjain harsheetjain reopened this May 8, 2026