Methodology

Not how much science an agent knows — whether it can drive discovery.

The University of Scientific Workflow (USW) does not ask how much an agent knows. It asks whether an agent can, like a working scientist, drive genuine, progressive discovery through a real experimental loop by following a workflow taken from the actual procedure behind a Nature-family paper. The aim is to close the gap between benchmark success and real-world impact, so that passing USW means an agent can collaborate with scientists, or conduct research autonomously, in practice.

Research preview · Multi-round review · atomized eval

Key point 01

Multi-round review, by the people who do the science

Every task is built through a carefully staged review system — proposed by a Ph.D. student, judged by their in-lab advisor, then revised for agent execution, with the loop run on OpenReview.

Ph.D. student

Propose

A Ph.D. student proposes a practical, meaningful task following the task-construction guideline.

In-lab advisor

Evaluate

The student's advisor — a domain expert in the same lab — judges whether the direction is promising enough to anchor a Nature-family paper.

Lead student

Revise for agents

The lead student lightly revises the proposal — dataset paths, the main workspace path — so an agent can execute it in a computer environment.

OpenReview

Multi-round review

The loop runs on OpenReview in participant-restricted mode, mirroring agents4science.stanford.edu.

Evaluation model

Four settings of increasing guidance, three normalized metrics

Evaluation runs inside an evolutionary loop — literature insight → hypothesis generation → computational discovery, repeated. The same task is replayed under increasingly scaffolded settings to isolate where agents succeed and where error compounds across the workflow.

No Workflow

Autonomous · unguided

(Problem, Tool list) → Final outcome

The agent gets only the problem and the tool list, and decides the entire procedure itself.

Workflow-Guided

Scientist's protocol given

(Problem, Tools, Scientist's workflow) → Step + final outcomes

The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.

Stepwise · Self-produced

Agent's own carry-over

Step N: (Step N−1 agent output, Sub-problem, Tools) → Step N outcome

Each step consumes the agent's own previous output, so errors compound across the workflow.

Stepwise · Human Outcome

Scientist's carry-over

Step N: (Step N−1 scientist output, Sub-problem, Tools) → Step N outcome

Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
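The two stepwise settings differ only in what each step receives as its carry-over input. The following is a minimal sketch of that difference, not the benchmark's actual harness: the `Agent` callable, `run_stepwise`, and the empty input assumed for the first step are all hypothetical names and choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical agent interface: (previous output, sub-problem, tools) -> step outcome.
Agent = Callable[[str, str, Sequence[str]], str]


def run_stepwise(
    agent: Agent,
    sub_problems: Sequence[str],
    tools: Sequence[str],
    scientist_outputs: Optional[Sequence[str]] = None,
    initial_input: str = "",  # assumption: the first step starts from an empty carry-over
) -> List[str]:
    """Run the steps in order and return the agent's outcome for every step.

    scientist_outputs is None  -> Stepwise, Self-produced:
        step N consumes the agent's own step N-1 output, so errors can compound.
    scientist_outputs is given -> Stepwise, Human Outcome:
        step N consumes the scientist's step N-1 output, isolating per-step skill.
    """
    outcomes: List[str] = []
    carry = initial_input
    for n, sub_problem in enumerate(sub_problems):
        outcome = agent(carry, sub_problem, tools)
        outcomes.append(outcome)
        # Decide what the next step will see as its input.
        carry = outcome if scientist_outputs is None else scientist_outputs[n]
    return outcomes
```

Comparing the per-step outcomes of the two modes on the same task shows how much of a late-step failure comes from inherited errors rather than from the step itself.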

SAR

Step Achievement Ratio

Per step, the percentage of the scientist's target score the agent reached — averaged over steps.

TCS

Task Completion Score

For the whole task, the percentage of the scientist's target the agent achieved.

WFS

Workflow Score

Each step scores 1 if its achieved score meets or exceeds the target and 0 otherwise; WFS is the average over steps, i.e. the fraction of steps that hit the target.
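The three metrics can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the official scoring code: the `StepResult` type and function names are hypothetical, targets are assumed to be positive, and the sketch does not cap ratios at 100% because the text does not say whether overshooting the target is clipped.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class StepResult:
    achieved: float  # score the agent reached on this step
    target: float    # scientist's target score for this step (assumed > 0)


def step_achievement_ratio(steps: Sequence[StepResult]) -> float:
    """SAR: per-step percentage of the target reached, averaged over steps."""
    return mean(100.0 * s.achieved / s.target for s in steps)


def workflow_score(steps: Sequence[StepResult]) -> float:
    """WFS: average of 1/0 indicators for steps that meet or exceed the target."""
    return mean(1.0 if s.achieved >= s.target else 0.0 for s in steps)


def task_completion_score(final_achieved: float, final_target: float) -> float:
    """TCS: percentage of the scientist's whole-task target the agent achieved."""
    return 100.0 * final_achieved / final_target
```

For three steps scoring 4 of 5, 3 of 3, and 1 of 2, the per-step ratios are 80%, 100%, and 50%, so SAR is about 76.7, while WFS is about 0.33 because only the second step meets its target.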

Task coverage

Known, frontier, and rubric — a verifiability taxonomy

Task scope varies by field. USW spans problems whose outcome is already known and measurable, open frontier problems that remain verifiable, and frontier problems judged by a scientist's rubric — so even non-verifiable discovery stays legible.

Known · C1

Known · verifiable

Outcome is known and quantitatively verifiable; the workflow is measurable step by step.

Frontier · C2

Frontier · verifiable

An open problem with no established answer, yet the result remains quantitatively verifiable.

Real example: spin-glass D_U

Rubric · C3

Frontier · rubric

An open, non-verifiable problem judged by a scientist's rubric for a meaningful, impactful discovery.
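The taxonomy reduces to two questions: is the outcome already known, and is the result quantitatively verifiable? A minimal sketch under that reading follows; the `Category` enum and `categorize` function are hypothetical names, and treating "known but not verifiable" as out of scope is an assumption, since the text describes only three categories.

```python
from enum import Enum


class Category(Enum):
    C1 = "Known, verifiable"
    C2 = "Frontier, verifiable"
    C3 = "Frontier, rubric"


def categorize(outcome_known: bool, quantitatively_verifiable: bool) -> Category:
    """Map a task's two properties onto the three categories described above."""
    if quantitatively_verifiable:
        return Category.C1 if outcome_known else Category.C2
    if outcome_known:
        # Assumption: this combination is not part of the taxonomy.
        raise ValueError("known but non-verifiable tasks fall outside C1-C3")
    # Open and non-verifiable: judged by a scientist's rubric.
    return Category.C3
```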

Positioning

Where USW sits — a tiered view

Science benchmarks can be read as a ladder of capability, from isolated knowledge recall up to driving a real, end-to-end experimental workflow. USW sits at the top.

T1 · Knowledge recall
Tests what a model knows: scientific facts and reasoning in isolation.
e.g. Science QA / MCQA

T2 · Tool & code use
Tests whether the agent can operate tools and write working code for a bounded, well-defined task.
e.g. Terminal-Science

T3 · Partial workflow
Follows part of a research procedure, verified category-wise rather than step-wise.
e.g. LifeSciBench

T4 · End-to-end scientific workflow
Drives a real experimental loop, step by atomic step, toward an open-ended discovery goal.
e.g. USW (this benchmark)

This tier framing is a research-preview sketch — the proposal leaves the ladder under-specified.

How USW compares

Against other scientific-workflow benchmarks

Closer to how science is actually done — department-driven tasks, mixed environments, GUI-based simulation, and verification at every step.

| Dimension | Terminal-Science | LifeSciBench | USW (Ours) |
| --- | --- | --- | --- |
| Task collection | Community-driven | Expert & peer-review | Department-driven |
| Tasks / domains | 3 (100+) / 5 | 750 / 1 (Biology) | 50+ / 10 |
| Environment | Terminal | - | Terminal / VM |
| GUI-based simulation | - | | Supported |
| Realistic workflow | - | Yes | From real papers |
| Evaluation | Threshold (discrete) | Rubric-based | Threshold: discrete + continuous |
| Workflow verification | - | Partial (category-wise) | Instance & step-wise |
| Task quality bar | Low / unclear | High, less promising | Comparable to Nature-level |
| Science skills & simulations | Partial (no full access) | - | Full access |

Get involved

USW is built with working scientists, not just for them. We close the gap between what AI experts imagine and what scientists actually need by going to the departments themselves.

Submit a task

Bring the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect.

Join the study

Take part in our scientist survey and interviews on which tasks matter most for AI agents to solve.

Co-authorship

Qualifying participants receive ~$10 compensation, or — depending on response quality — may be offered co-authorship.

Submit a task · Contact us on GitHub

Survey and interview details are shared with participants directly.