USW Leaderboard
Synthetic placeholder data
Every agent (a harness paired with a model) is scored against targets set by a working scientist. Scores are normalized so that a number means the same thing across all 18 tasks and 10 domains. The ranking is driven by whichever of the three metrics you select.
- Step Achievement Ratio: mean percentage of the scientist's per-step target that the agent reached, averaged over steps.
- Task Completion Score: percentage of the scientist's target that the agent achieved on the task as a whole.
- Workflow Score: fraction of steps whose score meets or exceeds the target (each step counts as 1 if met, 0 otherwise), averaged.
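The three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual scoring code: the `Step` class, its field names, and the example numbers are hypothetical, higher scores are assumed to be better, targets are assumed positive, and the Workflow Score is scaled to a percentage to match the other two metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    score: float   # agent's result on this step (hypothetical field name)
    target: float  # scientist's target for this step, assumed > 0


def step_achievement_ratio(steps):
    """Mean % of the per-step target reached, averaged over steps."""
    return mean(100.0 * s.score / s.target for s in steps)


def task_completion_score(task_score, task_target):
    """% of the task-level target achieved on the task as a whole."""
    return 100.0 * task_score / task_target


def workflow_score(steps):
    """% of steps whose score meets or exceeds target (1 or 0 per step)."""
    return 100.0 * mean(1.0 if s.score >= s.target else 0.0 for s in steps)


# Made-up example: three steps of one task.
steps = [Step(0.9, 0.8), Step(0.5, 1.0), Step(2.0, 2.0)]
sar = step_achievement_ratio(steps)                           # can exceed 100 per step
tcs = task_completion_score(task_score=4.5, task_target=5.0)
wfs = workflow_score(steps)                                   # two of three steps meet target
print(round(sar, 1), round(tcs, 1), round(wfs, 1))
```

Under these assumptions, an agent that overshoots a step target pushes the Step Achievement Ratio above 100, while the Workflow Score only records whether each target was met.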
Evaluation setting
Four configurations of increasing guidance — scores differ by setting.
Overall ranking
7 agents · ranked by Task Completion Score.
| # | Agent Harness | Model | Tasks | Step Achievement Ratio | Task Completion Score | Workflow Score |
|---|---|---|---|---|---|---|
| 1 | Claude Code | Claude Opus 4.8 Anthropic | 18 | 102.8 | 98.2 | 66.7 |
| 2 | OpenHands | Claude Opus 4.8 All Hands AI | 18 | 98.3 | 92.5 | 41.7 |
| 3 | Codex | GPT-5.5 OpenAI | 18 | 96.4 | 91.1 | 44.4 |
| 4 | Gemini CLI | Gemini 3.1 Pro Google | 18 | 95.4 | 89.6 | 38.9 |
| 5 | OpenHands | GPT-5.5 All Hands AI | 18 | 93.4 | 88.1 | 41.7 |
| 6 | OpenHands | Gemini 3.1 Pro All Hands AI | 18 | 92.5 | 86.4 | 36.1 |
| 7 | OpenHands | Qwen3.5-397B-A17B Alibaba | 18 | 87.5 | 81.3 | 22.2 |
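
Ranking by a different metric amounts to re-sorting the same rows on another column. The sketch below uses the synthetic placeholder values from the table and assumes the three numeric columns are, in order, Step Achievement Ratio, Task Completion Score, and Workflow Score; the `ROWS` list, the `METRICS` mapping, and the `rank_by` helper are hypothetical names introduced for illustration.

```python
# (harness, model, step achievement ratio, task completion score, workflow score)
# Values copied from the table above; the column order is an assumption.
ROWS = [
    ("Claude Code", "Claude Opus 4.8", 102.8, 98.2, 66.7),
    ("OpenHands", "Claude Opus 4.8", 98.3, 92.5, 41.7),
    ("Codex", "GPT-5.5", 96.4, 91.1, 44.4),
    ("Gemini CLI", "Gemini 3.1 Pro", 95.4, 89.6, 38.9),
    ("OpenHands", "GPT-5.5", 93.4, 88.1, 41.7),
    ("OpenHands", "Gemini 3.1 Pro", 92.5, 86.4, 36.1),
    ("OpenHands", "Qwen3.5-397B-A17B", 87.5, 81.3, 22.2),
]

# Index of each metric inside a row tuple.
METRICS = {"step_achievement": 2, "task_completion": 3, "workflow": 4}


def rank_by(metric):
    """Return rows sorted by the chosen metric, highest first.

    sorted() is stable, so rows that tie keep their original relative order.
    """
    col = METRICS[metric]
    return sorted(ROWS, key=lambda row: row[col], reverse=True)


by_workflow = rank_by("workflow")
for rank, (harness, model, *_scores) in enumerate(by_workflow, start=1):
    print(rank, harness, model)
```

With these values, sorting by Workflow Score moves Codex (44.4) ahead of OpenHands with Claude Opus 4.8 (41.7), whereas the other two metrics produce the same order as the table.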