USW
Research preview · 10 science domains · atomized eval

The benchmark for agents that run the experiment — not just propose it.

USW asks a different question — not how much science an agent knows, but whether it can drive genuine discovery through a real experimental loop, the way a working scientist does. Every task is drawn from a Nature-family paper and broken into atomic, individually verifiable steps.

Flagship task · Verified
Protein Engineering · Expert

Evolve an HACS enzyme with record activity toward formaldehyde

End goal

Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.

1. Sampling process · 3m
2. Fitness function prediction · 2m
3. In silico screening · 2m
4. Experimental validation · 3m
5. Repeat: next active-learning round · 2m
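
To make the screening step of this loop concrete, here is a minimal sketch of an upper-confidence-bound (UCB) acquisition rule, the generic technique that the task's `ucb-acquisition` tool tag appears to refer to. It is not the task's actual implementation: the `Candidate` type, the `ucb_select` function, the `beta` parameter, and all numbers are hypothetical and only illustrate how predicted fitness and uncertainty can be traded off when choosing variants for the next round.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    variant_id: str  # hypothetical identifier for an enzyme variant
    mean: float      # predicted fitness from a surrogate model (assumed)
    std: float       # predictive uncertainty for that variant (assumed)


def ucb_select(candidates: list[Candidate], k: int, beta: float = 1.0) -> list[str]:
    """Return the ids of the k candidates with the highest mean + beta * std."""
    ranked = sorted(candidates, key=lambda c: c.mean + beta * c.std, reverse=True)
    return [c.variant_id for c in ranked[:k]]


# Made-up predictions, for illustration only.
pool = [
    Candidate("v1", 1.0, 0.125),
    Candidate("v2", 0.75, 0.5),
    Candidate("v3", 0.875, 0.0),
    Candidate("v4", 0.25, 0.25),
]

print(ucb_select(pool, k=2, beta=1.0))  # ['v2', 'v1']
print(ucb_select(pool, k=2, beta=0.0))  # ['v1', 'v3']
```

With `beta=1.0` the uncertain variant `v2` is picked first even though its predicted mean is lower. With `beta=0.0` the rule reduces to picking the highest predicted means.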

The mission

Most benchmarks score what an agent knows. USW measures what it can do — closing the gap between benchmark success and real-world impact.

Pass it, and an agent can collaborate with scientists in practice — or autonomously run research the way a working scientist does.

Why the experimental workflow

Proposing research is one thing. Carrying it out is another.

A scientific workflow runs as an evolutionary loop: literature insight, hypothesis, computational discovery (the loop popularized by systems like Google's Co-Scientist). Outside computer science, that discovery stage means operating real tools and long-running simulations against open-ended goals with no single correct answer.

Real tools & simulations

Scientists drive a wide variety of instruments and solvers — many GUI-based, many slow.

Long, complex procedures

Reactions and pipelines unfold over hours to weeks across many dependent steps.

Frontier & open-ended

Many tasks are open problems with no published answer yet — better or worse outcomes, no single right one.

Atomized verification

Each step is checked against a target the scientist set, so progress is legible.

Anatomy of a task

An end goal, and a workflow of atomic steps

Each task pairs one open-ended discovery goal with the human scientist's own procedure — every step carrying the metrics and target scores they require.

STEP 1

Sampling process

jackhmmer · mmseqs2 · clean
3 target metrics
STEP 2

Fitness function prediction

2 target metrics
STEP 3

In silico screening

esm3 · ucb-acquisition
2 target metrics
STEP 4

Experimental validation

hplc
3 target metrics
STEP 5

Repeat — next active-learning round

2 target metrics
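
One way to picture this anatomy is as a small data structure: one end goal plus an ordered list of steps, each with its tools and its number of target metrics. The sketch below is only an illustration. The `Task` and `Step` classes and their field names are hypothetical, not USW's task schema, and the tool tags are taken as individual tool names. The page does not list metric names or target values, so only the counts are recorded.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    n_target_metrics: int                            # count shown on the task card
    tools: list[str] = field(default_factory=list)   # tool tags, if any are listed


@dataclass
class Task:
    end_goal: str
    steps: list[Step]

    def total_target_metrics(self) -> int:
        """Sum of target-metric counts over all steps."""
        return sum(step.n_target_metrics for step in self.steps)


hacs_task = Task(
    end_goal=(
        "Find an HACS enzyme that has higher activity toward formaldehyde "
        "than all previously discovered and engineered variants."
    ),
    steps=[
        Step("Sampling process", 3, ["jackhmmer", "mmseqs2", "clean"]),
        Step("Fitness function prediction", 2),
        Step("In silico screening", 2, ["esm3", "ucb-acquisition"]),
        Step("Experimental validation", 3, ["hplc"]),
        Step("Repeat: next active-learning round", 2),
    ],
)

print(len(hacs_task.steps))              # 5
print(hacs_task.total_target_metrics())  # 12
```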

Evaluation protocol

Four settings, from fully autonomous to fully guided

The same task is run under increasingly scaffolded conditions to isolate where agents succeed — and where error compounds across a workflow.

01

No Workflow

Autonomous · unguided

The agent gets only the problem and the tool list, and decides the entire procedure itself.

02

Workflow-Guided

Scientist's protocol given

The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.

03

Stepwise · Self-produced

Agent's own carry-over

Each step consumes the agent's own previous output, so errors compound across the workflow.

04

Stepwise · Human Outcome

Scientist's carry-over

Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
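
The difference between the two stepwise settings comes down to which output is handed to the next step. The sketch below shows that data flow under simplifying assumptions: an agent is modeled as a plain function from (step name, prior output) to a new output, and outputs are strings. `run_stepwise`, `echo_agent`, and the `carry_over` flag are hypothetical names for illustration, not part of any USW harness.

```python
from typing import Callable, Optional

# (step name, prior output or None) -> new output; a deliberately simplified agent model.
Agent = Callable[[str, Optional[str]], str]


def run_stepwise(agent: Agent, step_names: list[str],
                 human_outputs: list[str], carry_over: str) -> list[str]:
    """Run each step, feeding it either the agent's or the scientist's prior output.

    carry_over="self"  -> Stepwise, Self-produced (errors can compound)
    carry_over="human" -> Stepwise, Human Outcome (isolates per-step skill)
    """
    if carry_over not in ("self", "human"):
        raise ValueError(f"unknown carry_over mode: {carry_over!r}")
    outputs: list[str] = []
    for i, name in enumerate(step_names):
        if i == 0:
            prior = None  # the first step has no predecessor
        elif carry_over == "self":
            prior = outputs[i - 1]
        else:
            prior = human_outputs[i - 1]
        outputs.append(agent(name, prior))
    return outputs


def echo_agent(step: str, prior: Optional[str]) -> str:
    """Toy agent that records what it was given, to make the data flow visible."""
    return f"{step}<-{prior}"


names = ["sample", "predict", "screen"]
truth = ["H1", "H2", "H3"]  # stand-ins for the scientist's ground-truth outputs

print(run_stepwise(echo_agent, names, truth, "self"))
# ['sample<-None', 'predict<-sample<-None', 'screen<-predict<-sample<-None']
print(run_stepwise(echo_agent, names, truth, "human"))
# ['sample<-None', 'predict<-H1', 'screen<-H2']
```

In the self-produced run every output carries the full history of earlier outputs, which is how an early mistake can propagate. In the human-outcome run each step depends only on the scientist's result for the step before it.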

Three metrics

Scored against the targets a scientist set

Every metric is normalized to the human scientist's required score — so a number means the same thing across wildly different domains.

SAR

Step Achievement Ratio

Per step, the percentage of the scientist's target score the agent reached — averaged over steps.

TCS

Task Completion Score

For the whole task, the percentage of the scientist's target the agent achieved.

WFS

Workflow Score

The fraction of steps whose achieved score meets or exceeds the target (1 / 0), averaged.
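
As a rough sketch of how these three metrics could be computed from the definitions above, assume one higher-is-better score per step, a positive target, and a cap at 100% of the target. The cap, the single-score-per-step simplification, and every name in the code are assumptions, not USW's published scoring code.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    achieved: float  # agent's score on this step (assumed higher-is-better)
    target: float    # scientist's required score for this step (assumed > 0)


def achievement(achieved: float, target: float) -> float:
    """Percentage of the target reached, capped at 100 (the cap is an assumption)."""
    return min(achieved / target, 1.0) * 100.0


def sar(steps: list[StepResult]) -> float:
    """Step Achievement Ratio: mean per-step percentage of target reached."""
    return sum(achievement(s.achieved, s.target) for s in steps) / len(steps)


def tcs(final_achieved: float, final_target: float) -> float:
    """Task Completion Score: percentage of the end-goal target achieved."""
    return achievement(final_achieved, final_target)


def wfs(steps: list[StepResult]) -> float:
    """Workflow Score: fraction of steps that meet or exceed their target."""
    return sum(1 for s in steps if s.achieved >= s.target) / len(steps)


# Made-up scores against a target of 1.0 per step.
steps = [StepResult(0.5, 1.0), StepResult(1.0, 1.0),
         StepResult(0.25, 1.0), StepResult(2.0, 1.0)]

print(sar(steps))      # 68.75
print(wfs(steps))      # 0.5
print(tcs(0.75, 1.0))  # 75.0
```

The example shows why SAR and WFS can diverge: partial progress on a step raises SAR but counts as zero toward WFS.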


Positioning

How USW compares

Closer to how science is actually done — department-driven, Nature-level tasks, mixed environments, and verification at every step.

Dimension | Terminal-Science | LifeSciBench | USW (Ours)
Task collection | Community-driven | Expert & peer-review | Department-driven
Tasks / domains | 3 (100+) / 5 | 750 / 1 (Biology) | 50+ / 10
Environment | Terminal | Terminal / VM
GUI-based simulation | Supported
Realistic workflow | Yes | From real papers
Evaluation | Threshold (discrete) | Rubric-based | Threshold: discrete + continuous
Workflow verification | Partial (category-wise) | Instance & step-wise
Task quality bar | Low / unclear | High, less promising | Comparable to Nature-level
Science skills & simulations | Partial (no full access) | Full access

Bring your lab's workflow to the benchmark

Submit the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect. After expert review it becomes a University of Scientific Workflow task.