Task reports
Public same-prompt examples first, then seeded pilot reports that define the future test structure.
Public result: same product brief, Claude Code and Codex branches
A public GitHub comparison in which the same competitive-intelligence app prompts produced separate Claude Code and Codex implementations.
Public result: same todo CLI prompt across Claude Code and Codex
A public benchmark folder containing Node.js todo CLI implementations generated by Claude Code and Codex from the same prompt.
Pilot lab: legacy repo onboarding without architecture hallucination
A seeded lab report that demonstrates how AgentScope should document repository onboarding tasks, evidence trails, and reviewer burden.
Pilot lab: bug fix under constraints with tight patch scope
A seeded bug-fix report focused on whether an agent can isolate a defect, keep edits narrow, and avoid collateral damage.
Pilot lab: risky diff review where confidence is not enough
A seeded review-quality lab that focuses on hidden regressions, weak assumptions, and whether the agent can challenge a plausible-looking diff.
Pilot lab: refactor with intent preservation instead of style drift
A seeded refactor report that evaluates whether a system can improve structure while preserving behavior, boundaries, and local conventions.
Pilot lab: UI generation from a brief without falling into generic patterns
A seeded design-and-implementation lab for judging whether a coding agent can translate a product brief into intentional interface choices.
Pilot lab: recovery after command failure and partial evidence
A seeded operational lab that evaluates whether the agent can recover after a failed command, revise its plan, and stay useful without hiding uncertainty.