USW Leaderboard

Synthetic placeholder data

Every agent (a harness paired with a model) is scored against targets set by a working scientist. Scores are normalized so that a number means the same thing across all 18 tasks and 10 domains. The ranking follows whichever of the three metrics you select.

M1 · Step Achievement Ratio
Mean % of the scientist's per-step target the agent reached, averaged over steps.

M2 · Task Completion Score
% of the scientist's target the agent achieved on the task as a whole.

M3 · Workflow Score
Fraction of steps whose score meets or exceeds target (1 / 0), averaged.
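
The three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every score is higher-is-better, every target is positive, and M2 uses a single task-level score and target. The page does not say how lower-is-better quantities are normalized, whether values are capped, or how per-task results are aggregated across tasks. The names `Step`, `step_achievement_ratio`, `task_completion_score`, and `workflow_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    score: float   # agent's result on this step (assumed higher-is-better)
    target: float  # scientist's target for this step (assumed > 0)


def step_achievement_ratio(steps):
    """M1: mean percentage of the per-step target the agent reached."""
    return mean(100.0 * s.score / s.target for s in steps)


def task_completion_score(task_score, task_target):
    """M2: percentage of the task-level target the agent achieved."""
    return 100.0 * task_score / task_target


def workflow_score(steps):
    """M3: share of steps that meet or exceed target (1 or 0 per step, averaged)."""
    return mean(1.0 if s.score >= s.target else 0.0 for s in steps)


# Made-up numbers for one task with four steps.
steps = [Step(1.5, 2.0), Step(2.5, 2.0), Step(5.0, 5.0), Step(4.0, 4.0)]
m1 = step_achievement_ratio(steps)        # per-step percentages: 75, 125, 100, 100
m2 = task_completion_score(47.0, 50.0)    # task-level percentage
m3 = workflow_score(steps)                # three of four steps meet target
```

In this sketch an agent can average 100 on M1 while still missing a step, which is why M3 counts met steps separately. M3 is returned here as a fraction between 0 and 1; multiply by 100 if a percentage is wanted.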

Evaluation setting

Four configurations of increasing guidance — scores differ by setting.

Overall ranking

7 agents · ranked by Task Completion Score.

≥ 100 = target met
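| # | Agent harness | Model | Organization | Tasks | M1 | M2 | M3 |
|---|---|---|---|---|---|---|---|
| 1 | Claude Code | Claude Opus 4.8 | Anthropic | 18 | 102.8 | 98.2 | 66.7 |
| 2 | OpenHands | Claude Opus 4.8 | All Hands AI | 18 | 98.3 | 92.5 | 41.7 |
| 3 | Codex | GPT-5.5 | OpenAI | 18 | 96.4 | 91.1 | 44.4 |
| 4 | Gemini CLI | Gemini 3.1 Pro | Google | 18 | 95.4 | 89.6 | 38.9 |
| 5 | OpenHands | GPT-5.5 | All Hands AI | 18 | 93.4 | 88.1 | 41.7 |
| 6 | OpenHands | Gemini 3.1 Pro | All Hands AI | 18 | 92.5 | 86.4 | 36.1 |
| 7 | OpenHands | Qwen3.5-397B-A17B | Alibaba | 18 | 87.5 | 81.3 | 22.2 |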
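
Re-ranking by a different metric can be sketched as below, using the placeholder values from the leaderboard. The sketch assumes the three numbers in each row are M1, M2, and M3 in that order, since the metric column headers did not survive in the source. It also assumes ties keep their existing order, which the page does not specify. `METRIC_INDEX` and `rank` are hypothetical names.

```python
# (harness, model, M1, M2, M3) -- column order of the metrics is an assumption.
rows = [
    ("Claude Code", "Claude Opus 4.8", 102.8, 98.2, 66.7),
    ("OpenHands", "Claude Opus 4.8", 98.3, 92.5, 41.7),
    ("Codex", "GPT-5.5", 96.4, 91.1, 44.4),
    ("Gemini CLI", "Gemini 3.1 Pro", 95.4, 89.6, 38.9),
    ("OpenHands", "GPT-5.5", 93.4, 88.1, 41.7),
    ("OpenHands", "Gemini 3.1 Pro", 92.5, 86.4, 36.1),
    ("OpenHands", "Qwen3.5-397B-A17B", 87.5, 81.3, 22.2),
]

METRIC_INDEX = {"M1": 2, "M2": 3, "M3": 4}


def rank(rows, metric="M2"):
    """Return rows sorted by the chosen metric, highest first.

    sorted() is stable, so rows with equal values keep their input order.
    """
    i = METRIC_INDEX[metric]
    return sorted(rows, key=lambda r: r[i], reverse=True)


by_m2 = rank(rows, "M2")  # matches the order shown on the page
by_m3 = rank(rows, "M3")  # Codex moves ahead of OpenHands + Claude Opus 4.8

for position, (harness, model, *_) in enumerate(by_m3, start=1):
    print(position, harness, model)
```

Under these assumptions the M1 and M2 orderings coincide, while M3 reorders the middle of the table.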