Leaderboard

USW Leaderboard

Synthetic placeholder data

Every agent (a harness paired with a model) is scored against the targets a working scientist set — normalized so a number means the same thing across all 18 tasks and 10 domains. Ranking is driven by whichever of the three metrics you select.

Step Achievement Ratio: Mean % of the scientist's per-step target the agent reached, averaged over steps.
Task Completion Score: % of the scientist's target the agent achieved on the task as a whole.
Workflow Score: Fraction of steps whose score meets or exceeds target (1 / 0), averaged.

Evaluation setting

Four configurations of increasing guidance — scores differ by setting.

Rank by

Overall ranking

7 agents · ranked by Task Completion Score. Click a metric header to re-sort.

≥ 100 = target met

#	Agent Harness	Model	Tasks
	Claude Code	Claude Opus 4.8 Anthropic	18	102.8	98.2	66.7
	OpenHands	Claude Opus 4.8 All Hands AI	18	98.3	92.5	41.7
	Codex	GPT-5.5 OpenAI	18	96.4	91.1	44.4
4	Gemini CLI	Gemini 3.1 Pro Google	18	95.4	89.6	38.9
5	OpenHands	GPT-5.5 All Hands AI	18	93.4	88.1	41.7
6	OpenHands	Gemini 3.1 Pro All Hands AI	18	92.5	86.4	36.1
7	OpenHands	Qwen3.5-397B-A17B Alibaba	18	87.5	81.3	22.2