The benchmark for agents that run the experiment — not just propose it.
The University of Scientific Workflow (USW) asks a different question: not how much science an agent knows, but whether it can drive genuine discovery through a real experimental loop, the way a working scientist does. Every task is drawn from a Nature-family paper and broken into atomic, individually verifiable steps.
Evolve an HACS enzyme with record activity toward formaldehyde
End goal
Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.
The mission
Most benchmarks score what an agent knows. USW measures what it can do — closing the gap between benchmark success and real-world impact.
Pass it, and an agent can collaborate with scientists in practice — or autonomously run research the way a working scientist does.
Why the experimental workflow
Proposing research is one thing. Carrying it out is another.
A scientific workflow runs as an evolutionary loop of literature insight, hypothesis, and computational discovery (the loop popularized by systems like Google's Co-Scientist). Outside computer science, the discovery stage means operating real tools and long-running simulations against open-ended goals with no single correct answer.
Real tools & simulations
Scientists drive a wide variety of instruments and solvers — many GUI-based, many slow.
Long, complex procedures
Reactions and pipelines unfold over hours to weeks across many dependent steps.
Frontier & open-ended
Many tasks are open problems with no published answer yet: outcomes can be better or worse, but none is the single right one.
Atomized verification
Each step is checked against a target the scientist set, so progress is legible.
Anatomy of a task
An end goal, and a workflow of atomic steps
Each task pairs one open-ended discovery goal with the human scientist's own procedure — every step carrying the metrics and target scores they require.
Sampling process
Fitness function prediction
In silico screening
Experimental validation
Repeat — next active-learning round
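As a rough illustration of this anatomy, the sketch below models a task as one end goal plus an ordered list of steps, each carrying a metric and a target. The step names and end goal come from the example above; the `Step` and `Task` classes, the `rounds` helper, the metric name, and the target values are hypothetical placeholders, not USW's actual schema or scores.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    metric: str    # hypothetical metric name, for illustration only
    target: float  # hypothetical target score set by the scientist


@dataclass(frozen=True)
class Task:
    end_goal: str
    steps: Tuple[Step, ...]


# Placeholder metric and targets: the real values are defined per task by the scientist.
hacs_task = Task(
    end_goal=(
        "Find an HACS enzyme that has higher activity toward formaldehyde "
        "than all previously discovered and engineered variants."
    ),
    steps=(
        Step("Sampling process", "placeholder_metric", 1.0),
        Step("Fitness function prediction", "placeholder_metric", 1.0),
        Step("In silico screening", "placeholder_metric", 1.0),
        Step("Experimental validation", "placeholder_metric", 1.0),
    ),
)


def rounds(task: Task, n_rounds: int) -> Iterator[Tuple[int, Step]]:
    """Yield (round_index, step) pairs, repeating the workflow once per active-learning round."""
    for round_index in range(n_rounds):
        for step in task.steps:
            yield round_index, step


if __name__ == "__main__":
    for round_index, step in rounds(hacs_task, n_rounds=2):
        print(round_index, step.name)
```

Keeping the steps in an ordered, immutable tuple reflects that each step depends on the one before it, and that the same sequence repeats in the next active-learning round.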
Evaluation protocol
Four settings, from fully autonomous to fully guided
The same task is run under increasingly scaffolded conditions to isolate where agents succeed — and where error compounds across a workflow.
No Workflow
Autonomous · unguided
The agent gets only the problem and the tool list, and decides the entire procedure itself.
Workflow-Guided
Scientist's protocol given
The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.
Stepwise · Self-produced
Agent's own carry-over
Each step consumes the agent's own previous output, so errors compound across the workflow.
Stepwise · Human Outcome
Scientist's carry-over
Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
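The difference between the two stepwise settings comes down to what each step receives as input. The sketch below is a minimal illustration under simplifying assumptions: every step is a plain function of a single input, and the scientist's ground-truth outputs are available as a list. The names `Setting`, `run_stepwise`, and `human_outputs` are hypothetical and are not part of any published USW harness.

```python
from enum import Enum
from typing import Any, Callable, List, Sequence


class Setting(Enum):
    NO_WORKFLOW = "no_workflow"
    WORKFLOW_GUIDED = "workflow_guided"
    STEPWISE_SELF = "stepwise_self_produced"
    STEPWISE_HUMAN = "stepwise_human_outcome"


def run_stepwise(
    setting: Setting,
    steps: Sequence[Callable[[Any], Any]],
    human_outputs: Sequence[Any],
    initial_input: Any,
) -> List[Any]:
    """Run steps in order; human_outputs[i] is the scientist's ground-truth output of step i."""
    if setting not in (Setting.STEPWISE_SELF, Setting.STEPWISE_HUMAN):
        raise ValueError("only the two stepwise settings are modeled in this sketch")
    outputs: List[Any] = []
    current = initial_input
    for i, step in enumerate(steps):
        if i > 0:
            if setting is Setting.STEPWISE_HUMAN:
                current = human_outputs[i - 1]  # scientist's carry-over
            else:
                current = outputs[i - 1]        # agent's own carry-over
        outputs.append(step(current))
    return outputs


if __name__ == "__main__":
    # Toy steps: the first one is "wrong" relative to the ground truth of 10.
    toy_steps = [lambda x: x + 1, lambda x: x * 2]
    truth = [10, 20]
    print(run_stepwise(Setting.STEPWISE_SELF, toy_steps, truth, 0))
    print(run_stepwise(Setting.STEPWISE_HUMAN, toy_steps, truth, 0))
```

In the toy run, the first step's error propagates into the second step under the self-produced setting, while under the human-outcome setting the second step starts from the ground truth and is judged on its own skill.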
Three metrics
Scored against the targets a scientist set
Every metric is normalized to the human scientist's required score — so a number means the same thing across wildly different domains.
Step Achievement Ratio
Per step, the percentage of the scientist's target score the agent reached — averaged over steps.
Task Completion Score
For the whole task, the percentage of the scientist's target the agent achieved.
Workflow Score
Each step scores 1 if its achieved score meets or exceeds the target and 0 otherwise; the Workflow Score is the average over steps, i.e. the fraction of steps that hit their target.
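The three definitions above can be written down directly. The sketch below assumes that higher scores are better, that every target is positive, and that ratios are not capped at 100%; none of these details is stated on this page, so treat them as assumptions. The function names are illustrative.

```python
from statistics import fmean
from typing import Sequence


def _validate(achieved: Sequence[float], targets: Sequence[float]) -> None:
    if not targets or len(achieved) != len(targets):
        raise ValueError("achieved and targets must be non-empty and the same length")
    if any(t <= 0 for t in targets):
        raise ValueError("this sketch assumes positive, higher-is-better targets")


def step_achievement_ratio(achieved: Sequence[float], targets: Sequence[float]) -> float:
    """Mean over steps of achieved/target, as a percentage (assumption: not capped at 100)."""
    _validate(achieved, targets)
    return 100.0 * fmean(a / t for a, t in zip(achieved, targets))


def workflow_score(achieved: Sequence[float], targets: Sequence[float]) -> float:
    """Fraction of steps whose achieved score meets or exceeds the target."""
    _validate(achieved, targets)
    return fmean(1.0 if a >= t else 0.0 for a, t in zip(achieved, targets))


def task_completion_score(final_achieved: float, final_target: float) -> float:
    """Percentage of the scientist's whole-task target that the agent achieved."""
    if final_target <= 0:
        raise ValueError("this sketch assumes a positive, higher-is-better target")
    return 100.0 * final_achieved / final_target


if __name__ == "__main__":
    achieved = [0.5, 1.0, 3.0]  # made-up per-step scores
    targets = [1.0, 1.0, 2.0]   # made-up per-step targets
    print(step_achievement_ratio(achieved, targets))
    print(workflow_score(achieved, targets))
    print(task_completion_score(3.0, 4.0))
```

Without a cap, a step that overshoots its target can offset one that falls short in the Step Achievement Ratio, whereas the Workflow Score counts only whether each step cleared its bar.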
Coverage
Ten domains, growing department by department
University of Scientific Workflow tasks are collected by going to the scientists themselves. Sparsely covered domains are where your contribution counts most.
Protein Engineering
Directed evolution, enzyme design & fitness optimization.
Genomics
Sequence assembly, variant calling & regulatory inference.
Structural Biology
Folding, cryo-EM reconstruction & complex prediction.
Computational Chemistry
Reaction modeling, DFT & molecular dynamics.
Materials Science
Crystal discovery, property prediction & synthesis routes.
Drug Discovery
Virtual screening, ADMET & lead optimization.
Neuroscience
Connectomics, spike inference & neural decoding.
Systems & Synthetic Biology
Pathway design, flux balance & circuit engineering.
Climate & Earth Science
Downscaling, extreme-event detection & carbon modeling.
Astrophysics
Transient detection, spectral fitting & N-body simulation.
Positioning
How USW compares
Closer to how science is actually done — department-driven, Nature-level tasks, mixed environments, and verification at every step.
Bring your lab's workflow to the benchmark
Submit the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect. After expert review it becomes a University of Scientific Workflow task.