Introducing Code Review category #610
…C-Bench into category/code-review
…display and refactoring comment parsing logic
…egory/code-review
haoranpb
left a comment
Solid progress.
One thing we should discuss: do we want this synthetic dataset, or do we want to invest in real-world production PRs?
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
"Generated": str(len(self.generated_comments)),
"Matched": str(self.matched_comment_count),
"Expected": str(len(self.expected_comments)),
Probably not needed for the single-run display. The results will be available in Kusto for analysis; this is just a quick summary.
Actually, maybe vice versa: we probably don't care about metrics like F1 score per task; it's the aggregate that really matters.
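For illustration, one way to compute the aggregate is to micro-average: sum the generated, matched, and expected counts over all tasks first, then derive precision, recall, and F1 once. This is only a sketch; `TaskCounts` and `aggregate_f1` are hypothetical names that mirror the Generated/Matched/Expected fields above, not code from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCounts:
    """Hypothetical per-task counts mirroring the Generated/Matched/Expected fields."""

    generated: int
    matched: int
    expected: int


def aggregate_f1(tasks: list[TaskCounts]) -> tuple[float, float, float]:
    """Return micro-averaged (precision, recall, f1) over all tasks."""
    generated = sum(task.generated for task in tasks)
    matched = sum(task.matched for task in tasks)
    expected = sum(task.expected for task in tasks)

    # Guard against empty runs so the metrics stay defined.
    precision = matched / generated if generated else 0.0
    recall = matched / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Micro-averaging weights every comment equally, so a task with many expected comments counts for more than a task with one. If each task should count equally, a per-task average would be the alternative.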
def _severity_mae(matched_pairs: list[tuple[ReviewComment, ReviewComment]]) -> float:
    if not matched_pairs:
        return 0.0
    total_error: int = sum(abs(expected.severity.level - generated.severity.level) for expected, generated in matched_pairs)
    return total_error / len(matched_pairs)
Please add a docstring for this function, and use the full name instead of the shorthand.
    generated_comments: list[ReviewComment],
    line_tolerance: int,
) -> list[tuple[ReviewComment, ReviewComment]]:
    """Greedily pair each expected comment with the nearest unused generated comment in the same file."""
nit: better to do bipartite matching
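To make the nit concrete, here is a rough sketch of maximum-cardinality bipartite matching using augmenting paths (Kuhn's algorithm). It assumes a `ReviewComment` with `file` and `line` attributes, which is a hypothetical stand-in for the real class, and `match_comments_bipartite` is an illustrative name. Note that it maximizes the number of matched pairs; it does not minimize total line distance, which would need a weighted assignment instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """Hypothetical minimal stand-in; the real class in the PR has more fields."""

    file: str
    line: int


def match_comments_bipartite(
    expected_comments: list[ReviewComment],
    generated_comments: list[ReviewComment],
    line_tolerance: int,
) -> list[tuple[ReviewComment, ReviewComment]]:
    """Pair expected and generated comments so that the number of pairs is maximal."""
    # candidates[i]: generated indices in the same file and within tolerance, nearest first.
    candidates = [
        sorted(
            (
                j
                for j, generated in enumerate(generated_comments)
                if generated.file == expected.file
                and abs(generated.line - expected.line) <= line_tolerance
            ),
            key=lambda j: abs(generated_comments[j].line - expected.line),
        )
        for expected in expected_comments
    ]
    owner: dict[int, int] = {}  # generated index -> expected index

    def try_assign(i: int, visited: set[int]) -> bool:
        """Find an augmenting path starting at expected comment i."""
        for j in candidates[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current owner can move elsewhere.
            if j not in owner or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(len(expected_comments)):
        try_assign(i, set())

    pairs = sorted((i, j) for j, i in owner.items())
    return [(expected_comments[i], generated_comments[j]) for i, j in pairs]
```

A case where this differs from the greedy approach: expected comments at lines 10 and 8, generated comments at lines 9 and 12, tolerance 2. Greedy gives line 9 to the first expected comment and leaves the second unmatched, while the augmenting-path version reassigns and matches both.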
if best_index is not None:
    used_generated.add(best_index)
    matched.append((expected, generated_comments[best_index]))
Hmm. So no consideration of whether the comments match semantically?
Yeah, I had this in initially, but I'm not sure it's something we want in v1; I'd like to avoid going down the rabbit hole of judging the judge.
I can add a simple LLM judge that does the semantic comparison.
Groenbech96
left a comment
I would consider using an LLM as a judge to compare the review comments. Probably instruct it that the generated comment has to have the same intent as the expected comment.
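A minimal sketch of what such a judge could look like, with the model call left as a pluggable function so that no particular LLM client is assumed. Everything here (`JUDGE_PROMPT`, `build_judge_prompt`, `same_intent`, `ask_model`) is hypothetical and not part of this PR.

```python
from typing import Callable

# Hypothetical prompt: asks only about intent, not wording, and forces a YES/NO answer.
JUDGE_PROMPT = (
    "You are comparing two code review comments on the same code.\n"
    "Answer YES if they raise the same issue with the same intent, otherwise NO.\n\n"
    "Expected comment:\n{expected}\n\n"
    "Generated comment:\n{generated}\n\n"
    "Answer with YES or NO only."
)


def build_judge_prompt(expected_body: str, generated_body: str) -> str:
    """Fill the prompt template with the two comment bodies."""
    return JUDGE_PROMPT.format(expected=expected_body, generated=generated_body)


def same_intent(
    expected_body: str,
    generated_body: str,
    ask_model: Callable[[str], str],
) -> bool:
    """Return True when the judge model answers YES.

    ask_model is any function that sends a prompt to a model and returns its
    text reply; which client backs it is deliberately left open.
    """
    answer = ask_model(build_judge_prompt(expected_body, generated_body))
    return answer.strip().upper().startswith("YES")
```

Keeping the judge behind a plain callable also makes it easy to stub in tests, which helps a little with the "judging the judge" concern: the matching logic can be verified without any model in the loop.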
A POC showing how the Code Review category can be implemented.
Try it out locally:
Getting Started
Code Review team to implement
codereview.jsonl
Leaderboard-related logic is intentionally left unimplemented for now.