Introducing Code Review category by haoranpb · Pull Request #610 · microsoft/BC-Bench

haoranpb · 2026-04-12T08:56:09Z

A POC showing how the Code Review category can be implemented.

Try it out locally:

# See the placeholder task in the dataset
uv run bcbench dataset view synthetic__performance-001 --category code-review

# See the default copilot review live
uv run bcbench -v evaluate copilot synthetic__performance-001 --category code-review --repo-path C:\depot\BCApps

Getting Started

Code Review team to implement:

- Replace the dataset; see codereview.jsonl.
- Implement the scoring and evaluation metrics calculation. We currently have a placeholder metric: count of comments.
- Add the AL Code Review skill.

~~Leaderboard related logic is intentionally left unimplemented for now.~~

haoranpb

Solid progress.

One thing we should discuss: do we want this synthetic dataset, or do we want to invest in real-world production PRs?

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

haoranpb · 2026-06-01T07:04:25Z

+            "Generated": str(len(self.generated_comments)),
+            "Matched": str(self.matched_comment_count),
+            "Expected": str(len(self.expected_comments)),


Probably not needed for the single-run display. The results will be available in Kusto for analysis; this is just a quick summary.

Actually, maybe vice versa: we probably don't care about metrics like F1 score per task; it's the aggregate that really matters.
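To make the per-task versus aggregate distinction concrete, here is a minimal sketch of a micro-averaged F1, assuming each task reports the same three counts shown in the diff (Generated, Matched, Expected). The `TaskCounts` class and `micro_f1` function are hypothetical names for illustration and are not taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCounts:
    """Hypothetical per-task counts mirroring the Generated/Matched/Expected columns."""

    generated: int
    matched: int
    expected: int


def micro_f1(tasks: list[TaskCounts]) -> tuple[float, float, float]:
    """Pool the counts across all tasks, then compute precision, recall, and F1 once."""
    generated = sum(task.generated for task in tasks)
    matched = sum(task.matched for task in tasks)
    expected = sum(task.expected for task in tasks)
    precision = matched / generated if generated else 0.0
    recall = matched / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Two illustrative tasks: pooled counts are generated=5, matched=3, expected=8.
tasks = [
    TaskCounts(generated=4, matched=2, expected=3),
    TaskCounts(generated=1, matched=1, expected=5),
]
precision, recall, f1 = micro_f1(tasks)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Pooling counts before dividing means tasks with many expected comments weigh more than small ones, which is one reasonable reading of "the aggregate"; averaging per-task F1 scores instead would weigh every task equally.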

haoranpb · 2026-06-01T07:04:53Z

+def _severity_mae(matched_pairs: list[tuple[ReviewComment, ReviewComment]]) -> float:
+    if not matched_pairs:
+        return 0.0
+    total_error: int = sum(abs(expected.severity.level - generated.severity.level) for expected, generated in matched_pairs)
+    return total_error / len(matched_pairs)
+


Add a docstring for this function, and use the full name instead of the shorthand.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

kmhansen · 2026-06-02T06:55:59Z

+    generated_comments: list[ReviewComment],
+    line_tolerance: int,
+) -> list[tuple[ReviewComment, ReviewComment]]:
+    """Greedily pair each expected comment with the nearest unused generated comment in the same file."""


nit: better to do bipartite matching
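To illustrate why bipartite matching can beat the greedy pairing, here is a rough sketch using augmenting paths (Kuhn's algorithm) from the standard library only. It assumes a comment is matchable when it is in the same file and within `line_tolerance` lines; the `file` and `line` field names and the function name `match_comments_bipartite` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """Stand-in for the PR's ReviewComment, reduced to file and line."""

    file: str
    line: int


def match_comments_bipartite(
    expected_comments: list[ReviewComment],
    generated_comments: list[ReviewComment],
    line_tolerance: int,
) -> list[tuple[ReviewComment, ReviewComment]]:
    """Pair expected and generated comments so that the number of pairs is maximal."""
    # An edge exists when both comments are in the same file and within line_tolerance lines.
    candidates = [
        [
            index
            for index, generated in enumerate(generated_comments)
            if generated.file == expected.file and abs(generated.line - expected.line) <= line_tolerance
        ]
        for expected in expected_comments
    ]
    owner: dict[int, int] = {}  # generated index -> expected index currently holding it

    def try_assign(expected_index: int, visited: set[int]) -> bool:
        for generated_index in candidates[expected_index]:
            if generated_index in visited:
                continue
            visited.add(generated_index)
            # Take a free generated comment, or re-route its current owner to another one.
            if generated_index not in owner or try_assign(owner[generated_index], visited):
                owner[generated_index] = expected_index
                return True
        return False

    for expected_index in range(len(expected_comments)):
        try_assign(expected_index, set())

    ordered = sorted(owner.items(), key=lambda item: item[1])
    return [(expected_comments[e], generated_comments[g]) for g, e in ordered]


# Greedy nearest-first pairs line 10 with line 11 and then leaves line 12 unmatched.
expected = [ReviewComment("a.al", 10), ReviewComment("a.al", 12)]
generated = [ReviewComment("a.al", 11), ReviewComment("a.al", 8)]
pairs = match_comments_bipartite(expected, generated, line_tolerance=2)
print([(e.line, g.line) for e, g in pairs])
```

This sketch maximizes only the number of pairs. If the total line distance should also be minimized, a weighted assignment solver such as `scipy.optimize.linear_sum_assignment` would be the more fitting tool.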

kmhansen · 2026-06-02T06:56:47Z

+
+        if best_index is not None:
+            used_generated.add(best_index)
+            matched.append((expected, generated_comments[best_index]))


Hmm. So no consideration of whether comments match semantically?

Yeah, I had this in initially, but I'm not sure it is something we want in v1; leaving it out avoids going down the rabbit hole of judging the judge.

I can add a simple LLM judge that does semantic comparison.

Groenbech96

I would consider using an LLM as a judge to compare the review comments. Probably instruct it that the generated comment has to have the same intent as the expected comment.
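A minimal sketch of what such a judge could look like, assuming the model is reached through a plain callable that takes a prompt and returns text. Everything here is hypothetical: the prompt wording, the `same_intent` function, the `body` field, and the `ask_model` parameter are illustrations, not part of the PR or of any specific LLM client.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """Stand-in for the PR's ReviewComment, reduced to the comment text."""

    body: str


# Hypothetical prompt; the real instruction wording would need tuning.
JUDGE_PROMPT = (
    "You compare two code review comments written for the same code location.\n"
    "Answer YES if they express the same intent, otherwise answer NO.\n\n"
    "Expected comment:\n{expected}\n\n"
    "Generated comment:\n{generated}\n"
)


def same_intent(expected: ReviewComment, generated: ReviewComment, ask_model: Callable[[str], str]) -> bool:
    """Ask the judge model whether two comments share the same intent."""
    prompt = JUDGE_PROMPT.format(expected=expected.body, generated=generated.body)
    answer = ask_model(prompt)
    # Treat anything that does not clearly start with YES as a non-match.
    return answer.strip().upper().startswith("YES")


# Dry run with a stub in place of a real model call.
expected = ReviewComment("Avoid calling FindSet inside the loop; it queries the database per iteration.")
generated = ReviewComment("This loop hits the database on every iteration; move the query out.")
print(same_intent(expected, generated, ask_model=lambda prompt: "Yes, same intent."))
```

Keeping the model behind a callable lets the position-based matching stay deterministic in tests, with the semantic check applied only to pairs that already match by file and line.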

haoranpb added 9 commits April 8, 2026 13:57

few more udpates for new categories

db388bd

Refactor evaluation and dataset operations for improved workspace setup

57c004e

enable skipping container setup in action

8e2f216

fix missing implementation for MockEvaluationPipeline

69a8db8

Refactor evaluation result classes to be more generic

7549d92

Merge branch 'main' into fix/more-ready-for-categories

f32dd00

Improve readabilty of GitHub Action summary

a4089b9

fix failing tests

99af6b2

Code Review POC

e1b0b93

haoranpb assigned darjoo and WaelAbuSeada Apr 12, 2026

haoranpb and others added 12 commits April 13, 2026 08:15

Merge branch 'main' into category/code-review

1a68d78

fix merge conflict resolution mistake

3ec10a0

Merge branch 'main' into category/code-review

4e52832

Make container parameters optional in evaluate and run commands

a9f59d9

Merge branch 'category/code-review' of https://github.com/microsoft/BC-Bench into category/code-review

065e1aa

Enhance code review functionality by adding expected review comments display and refactoring comment parsing logic

4ad4bd9

better hanlding container for not required categories

92951c4

Merge branch 'main' into category/code-review

7902610

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/code-review

dad9289

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/code-review

f1c4894

prefer copilot.exe executable

aa48a29

Normalize code-review dataset and preserve eval outputs

a244503

github-code-quality Bot found potential problems May 16, 2026

Comment thread src/bcbench/evaluate/codereview.py Fixed

haoranpb commented May 18, 2026

WaelAbuSeada added 5 commits May 19, 2026 18:31

Fix code-review branch setup and workflow wiring

9f6c353

Require review.json and add log-based recovery fallback

1a58e44

Harden code-review prompt for Windows copilot.cmd parsing

d0e8076

Expand code-review detailed table metrics

7411246

Update config and container setup action

c7131a4

github-code-quality Bot found potential problems May 29, 2026

Comment thread src/bcbench/types.py Fixed

haoranpb added 7 commits May 29, 2026 11:57

make run step OS indenpendent

0c58e8c

fix score mismatch

b076b98

extract github action related commands

df11718

test should not test runner name

541f6e4

make code review patches proper git diff

c9193e5

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/code-review

4408974

refactor to seperate the logics

859ec99

github-code-quality Bot found potential problems Jun 1, 2026

Comment thread src/bcbench/results/codereview.py Fixed

haoranpb added 7 commits June 1, 2026 08:38

make more steps OS independent

0feba63

skip leaderboard update and stricter field for codereview resutl

db12ed4

simplify import/export

820b767

move CodeReviewResultSummary into codereview result file

7848f4b

strongly type CodeReviewResultSummary and reuse metrics util

64f37c0

saperate leaderboard from summary and make it generic

d216e42

fix failing tests

49a5cef

github-code-quality Bot found potential problems Jun 1, 2026

Comment thread src/bcbench/results/codereview.py Fixed

haoranpb and others added 3 commits June 1, 2026 15:03

Potential fix for pull request finding 'Module imports itself'

35c5045

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/code-review

eaa1a2c

add CodeReview to mock tests

e00e939

haoranpb commented Jun 1, 2026

haoranpb requested a review from Copilot June 1, 2026 14:19

Copilot started reviewing on behalf of haoranpb June 1, 2026 14:20

Copilot AI reviewed Jun 1, 2026

kmhansen reviewed Jun 2, 2026

Groenbech96 reviewed Jun 3, 2026

haoranpb and others added 3 commits June 3, 2026 14:10

Merge branch 'main' of https://github.com/microsoft/BC-Bench into category/code-review

13b568c

Remove skills and instructions from category branch

d34742b

Restore al-test-generation assets on category branch

9eab86e

WaelAbuSeada Jun 4, 2026

6 participants
