Research preview · 10 science domains · atomized eval

The benchmark for agents that run the experiment — not just propose it.

University of Scientific Workflow (USW) asks a different question: not how much science an agent knows, but whether it can drive genuine discovery through a real experimental loop, the way a working scientist does. Every task is drawn from a Nature-family paper and broken into atomic, individually verifiable steps.

17+

Flagship task · Verified

Protein Engineering · Expert

Evolve an HACS enzyme with record activity toward formaldehyde

End goal

Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.

1. Sampling process (3 metrics)
2. Fitness function prediction (2 metrics)
3. In silico screening (2 metrics)
4. Experimental validation (3 metrics)
5. Repeat: next active-learning round (2 metrics)

The mission

Most benchmarks score what an agent knows. USW measures what it can do — closing the gap between benchmark success and real-world impact.

Pass it, and an agent can collaborate with scientists in practice — or autonomously run research the way a working scientist does.

Why the experimental workflow

Proposing research is one thing. Carrying it out is another.

A scientific workflow runs as an evolutionary loop of literature insight, hypothesis, and computational discovery (the loop popularized by systems like Google's AI co-scientist). Outside computer science, that discovery stage means operating real tools and long-running simulations against open-ended goals with no single correct answer.

Real tools & simulations

Scientists drive a wide variety of instruments and solvers — many GUI-based, many slow.

Long, complex procedures

Reactions and pipelines unfold over hours to weeks across many dependent steps.

Frontier & open-ended

Many tasks are open problems with no published answer yet: outcomes can be better or worse, but there is no single right one.

Atomized verification

Each step is checked against a target the scientist set, so progress is legible.

Anatomy of a task

An end goal, and a workflow of atomic steps

Each task pairs one open-ended discovery goal with the human scientist's own procedure — every step carrying the metrics and target scores they require.

STEP 1

Sampling process

jackhmmer · mmseqs2 · clean

3 target metrics

STEP 2

Fitness function prediction

2 target metrics

STEP 3

In silico screening

esm3 · ucb-acquisition

2 target metrics

STEP 4

Experimental validation

hplc

3 target metrics

STEP 5

Repeat: next active-learning round

2 target metrics
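
As a rough illustration of the structure described above, a task can be modeled as one end goal plus an ordered list of steps, each with its tools and a count of target metrics. This is a minimal sketch, not the benchmark's actual schema: the class and field names (`Step`, `Task`, `n_target_metrics`) are hypothetical, and only the step names, tool tags, and metric counts are taken from the flagship task card.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Step:
    """One atomic, individually verifiable step (hypothetical schema)."""
    index: int
    name: str
    tools: Tuple[str, ...] = ()
    n_target_metrics: int = 0


@dataclass(frozen=True)
class Task:
    """An open-ended end goal paired with the scientist's own procedure."""
    domain: str
    end_goal: str
    steps: Tuple[Step, ...]

    def total_target_metrics(self) -> int:
        # Sum of the per-step metric counts shown on the task card.
        return sum(step.n_target_metrics for step in self.steps)


# Values below mirror the flagship task card; the layout is an assumption.
flagship = Task(
    domain="Protein Engineering",
    end_goal=(
        "Find an HACS enzyme that has higher activity toward formaldehyde "
        "than all previously discovered and engineered variants."
    ),
    steps=(
        Step(1, "Sampling process", ("jackhmmer", "mmseqs2", "clean"), 3),
        Step(2, "Fitness function prediction", (), 2),
        Step(3, "In silico screening", ("esm3", "ucb-acquisition"), 2),
        Step(4, "Experimental validation", ("hplc",), 3),
        Step(5, "Repeat: next active-learning round", (), 2),
    ),
)

print(len(flagship.steps), flagship.total_target_metrics())
```

Keeping the steps in an ordered, immutable tuple reflects that each step depends on the one before it.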

Evaluation protocol

Four settings, from fully autonomous to fully guided

The same task is run under increasingly scaffolded conditions to isolate where agents succeed — and where error compounds across a workflow.

No Workflow

Autonomous · unguided

The agent gets only the problem and the tool list, and decides the entire procedure itself.

Workflow-Guided

Scientist's protocol given

The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.

Stepwise · Self-produced

Agent's own carry-over

Each step consumes the agent's own previous output, so errors compound across the workflow.

Stepwise · Human Outcome

Scientist's carry-over

Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
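
The difference between the two stepwise settings can be sketched with a toy example. The code below is an illustrative assumption, not the benchmark's harness: `run_stepwise`, `agent_step`, and the numeric "reference outputs" are hypothetical. The ideal step doubles its input, while the toy agent doubles it and overshoots by 1, so the effect of the carry-over choice is easy to see.

```python
from typing import Callable, List, Sequence


def run_stepwise(
    step_fns: Sequence[Callable[[float], float]],
    reference_outputs: Sequence[float],
    initial_input: float,
    human_carry_over: bool,
) -> List[float]:
    """Run steps in order and return the agent's output for each step.

    human_carry_over=False: the next step consumes the agent's own output
    (Stepwise · Self-produced).
    human_carry_over=True: the next step consumes the scientist's reference
    output for the prior step (Stepwise · Human Outcome).
    """
    outputs: List[float] = []
    current = initial_input
    for step_fn, reference in zip(step_fns, reference_outputs):
        result = step_fn(current)
        outputs.append(result)
        current = reference if human_carry_over else result
    return outputs


def agent_step(x: float) -> float:
    # Toy agent: the ideal step returns 2 * x, the agent is off by +1.
    return 2 * x + 1


reference = [2.0, 4.0, 8.0]  # ideal outputs when starting from 1.0
steps = [agent_step] * 3

self_produced = run_stepwise(steps, reference, 1.0, human_carry_over=False)
human_outcome = run_stepwise(steps, reference, 1.0, human_carry_over=True)

self_errors = [abs(o - r) for o, r in zip(self_produced, reference)]
human_errors = [abs(o - r) for o, r in zip(human_outcome, reference)]

print(self_errors)   # [1.0, 3.0, 7.0]  errors compound
print(human_errors)  # [1.0, 1.0, 1.0]  per-step skill is isolated
```

In this toy setup the same per-step mistake grows from 1 to 7 under self-produced carry-over but stays at 1 when each step restarts from the reference output, which is the contrast the two settings are designed to expose.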

Three metrics

Scored against the targets a scientist set

Every metric is normalized to the human scientist's required score — so a number means the same thing across wildly different domains.

SAR

Step Achievement Ratio

Per step, the percentage of the scientist's target score the agent reached — averaged over steps.

TCS

Task Completion Score

For the whole task, the percentage of the scientist's target the agent achieved.

WFS

Workflow Score

Each step scores 1 if its achieved score meets or exceeds the target and 0 otherwise; WFS is the average of these values over all steps.
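
The three metrics above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the official scoring code: it assumes every metric is higher-is-better with a positive target, that a step with several metrics averages their ratios for SAR and must meet all of its targets to count for WFS, and that ratios are not capped at 100%. The names `MetricResult`, `step_achievement_ratio`, `task_completion_score`, and `workflow_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class MetricResult:
    achieved: float  # score the agent reached
    target: float    # score the scientist required (assumed > 0)

    def ratio(self) -> float:
        return self.achieved / self.target

    def met(self) -> bool:
        return self.achieved >= self.target


def step_achievement_ratio(steps: Sequence[Sequence[MetricResult]]) -> float:
    """SAR: percentage of the target reached per step, averaged over steps."""
    per_step = [mean(m.ratio() for m in step) for step in steps]
    return 100.0 * mean(per_step)


def task_completion_score(final: MetricResult) -> float:
    """TCS: percentage of the whole-task target the agent achieved."""
    return 100.0 * final.ratio()


def workflow_score(steps: Sequence[Sequence[MetricResult]]) -> float:
    """WFS: 1 for each step meeting its targets, 0 otherwise, averaged."""
    return mean(1.0 if all(m.met() for m in step) else 0.0 for step in steps)


# Made-up numbers for illustration only.
steps = [
    [MetricResult(0.9, 1.0), MetricResult(1.0, 1.0)],  # misses one target
    [MetricResult(2.0, 2.0)],                          # meets its target
]
final = MetricResult(3.0, 4.0)

sar = step_achievement_ratio(steps)
tcs = task_completion_score(final)
wfs = workflow_score(steps)
print(round(sar, 1), tcs, wfs)
```

Because each score is a ratio to the scientist's own target, the same functions apply unchanged whether the underlying metric is an enzyme activity, an energy, or a classification accuracy.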

Coverage

Ten domains, growing department by department

University of Scientific Workflow is collected by going to the scientists themselves. Sparse domains are where your contribution counts most.

Protein Engineering

Directed evolution, enzyme design & fitness optimization.

Genomics

Sequence assembly, variant calling & regulatory inference.

Structural Biology

Folding, cryo-EM reconstruction & complex prediction.

Computational Chemistry

Reaction modeling, DFT & molecular dynamics.

Materials Science

Crystal discovery, property prediction & synthesis routes.

Drug Discovery

Virtual screening, ADMET & lead optimization.

Neuroscience

Connectomics, spike inference & neural decoding.

Systems & Synthetic Biology

Pathway design, flux balance & circuit engineering.

Climate & Earth Science

Downscaling, extreme-event detection & carbon modeling.

Astrophysics

Transient detection, spectral fitting & N-body simulation.

Positioning

How USW compares

Closer to how science is actually done — department-driven, Nature-level tasks, mixed environments, and verification at every step.

| Dimension | Terminal-Science | LifeSciBench | USW (Ours) |
| --- | --- | --- | --- |
| Task collection | Community-driven | Expert & peer-review | Department-driven |
| Tasks / domains | 3 (100+) / 5 | 750 / 1 (Biology) | 50+ / 10 |
| Environment | Terminal | - | Terminal / VM |
| GUI-based simulation | - | | Supported |
| Realistic workflow | - | Yes | From real papers |
| Evaluation | Threshold (discrete) | Rubric-based | Threshold: discrete + continuous |
| Workflow verification | - | Partial (category-wise) | Instance & step-wise |
| Task quality bar | Low / unclear | High, less promising | Comparable to Nature-level |
| Science skills & simulations | Partial (no full access) | - | Full access |

Bring your lab's workflow to the benchmark

Submit the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect. After expert review it becomes a University of Scientific Workflow task.
