Pilot lab: risky diff review where confidence is not enough
A seeded review-quality lab that focuses on hidden regressions, weak assumptions, and whether the agent can challenge a plausible-looking diff.
The better review product makes risk discovery easy to follow and does not bury the engineer under low-signal comments.
Model quality shows in whether the system can reason about side effects and hidden assumptions when the diff is ambiguous.
The workflow succeeds only when the review would materially improve a real pull request conversation.
The input is a medium-size product diff that looks reasonable at first glance but contains behavior risks and weak assumptions.
Review the proposed change, identify hidden regressions or unsupported assumptions, and separate high-confidence concerns from speculation.
Why this lab matters
Code review is one of the easiest areas for confident-sounding output to hide shallow reasoning.
A useful review report should answer:
- What is the concrete risk?
- Why does it matter in behavior terms?
- How confident is the reviewer in that conclusion?
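One way to make these three answers hard to skip is to give every finding a fixed shape. The sketch below is illustrative only: the `Finding` class, its field names, the 0.7 threshold, and the sample findings are all assumptions, not part of the lab's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One review concern. All field names are hypothetical."""
    risk: str              # What is the concrete risk?
    behavior_impact: str   # Why does it matter in behavior terms?
    confidence: float      # How confident is the reviewer? 0.0 to 1.0


def split_by_confidence(findings, threshold=0.7):
    """Separate high-confidence concerns from speculation.

    The 0.7 default is an arbitrary illustrative value.
    """
    confident = [f for f in findings if f.confidence >= threshold]
    speculative = [f for f in findings if f.confidence < threshold]
    return confident, speculative


# Made-up findings, purely for illustration.
findings = [
    Finding("Retry loop now ignores the timeout argument",
            "Callers may block indefinitely on a dead upstream", 0.9),
    Finding("Cache key may omit the tenant id",
            "Possible cross-tenant reads if keys collide", 0.4),
]
confident, speculative = split_by_confidence(findings)
```

Keeping confidence as an explicit field means a report can list speculative concerns separately instead of mixing them with the ones the reviewer is prepared to defend.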
What not to reward
Long review output is not automatically good review output.
If the agent produces a long list of cosmetic comments but misses the behavior-changing regression, the report should call that a failure even if the prose sounds polished.
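This rule can be sketched as a grading check in which finding the seeded regression gates the verdict and comment count plays no part. The function name, the `seeded_regression_id` parameter, and the assumption that each comment carries a set of issue ids are hypothetical; the lab does not specify a grader.

```python
def grade_review(comments, seeded_regression_id):
    """Return "fail" unless some comment identifies the seeded regression.

    `comments` is a list of dicts with a hypothetical "issue_ids" key
    listing the issues each comment addresses. The length and polish of
    the review are deliberately not inputs to the verdict.
    """
    found = any(seeded_regression_id in c.get("issue_ids", ())
                for c in comments)
    return "pass" if found else "fail"


# A long, purely cosmetic review still fails.
cosmetic = [{"text": f"Nit: rename variable {i}", "issue_ids": set()}
            for i in range(25)]

# A single comment that names the regression passes this gate.
focused = [{"text": "Timeout is no longer enforced on retry",
            "issue_ids": {"REG-1"}}]
```

This covers only the gating rule. A fuller grader would presumably also weigh signal-to-noise and how well confidence is calibrated, which this sketch does not attempt.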