Methodology
Not how much science an agent knows — whether it can drive discovery.
The University of Scientific Workflow (USW) does not ask how much an agent knows. It asks whether an agent can, like a working scientist, drive genuine, progressive discovery through a real experimental loop by following a workflow taken from the actual procedure behind a Nature-family paper. The aim is to close the gap between benchmark success and real-world impact, so that passing USW means an agent can collaborate with scientists, or autonomously conduct research, in practice.
Key point 01
Multi-round review, by the people who do the science
Every task is built through a carefully staged review system — proposed by a Ph.D. student, judged by their in-lab advisor, then revised for agent execution, with the loop run on OpenReview.
Ph.D. student
Propose
A Ph.D. student proposes a practical, meaningful task following the task-construction guideline.
In-lab advisor
Evaluate
The student's advisor — a domain expert in the same lab — judges whether the direction is promising enough to anchor a Nature-family paper.
Lead student
Revise for agents
The lead student lightly revises the proposal — dataset paths, the main workspace path — so an agent can execute it in a computer environment.
OpenReview
Multi-round review
The loop runs on OpenReview in participant-restricted mode, mirroring agents4science.stanford.edu.
Evaluation model
Four settings of increasing guidance, three normalized metrics
Evaluation runs inside an evolutionary loop — literature insight → hypothesis generation → computational discovery, repeated. The same task is replayed under increasingly scaffolded settings to isolate where agents succeed and where error compounds across the workflow.
No Workflow
Autonomous · unguided
(Problem, Tool list) → Final outcome
The agent gets only the problem and the tool list, and decides the entire procedure itself.
Workflow-Guided
Scientist's protocol given
(Problem, Tools, Scientist's workflow) → Step + final outcomes
The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.
Stepwise · Self-produced
Agent's own carry-over
Step N: (Step N−1 agent output, Sub-problem, Tools) → Step N outcome
Each step consumes the agent's own previous output, so errors compound across the workflow.
Stepwise · Human Outcome
Scientist's carry-over
Step N: (Step N−1 scientist output, Sub-problem, Tools) → Step N outcome
Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
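The difference between the two stepwise settings can be sketched in a few lines of Python. This is an illustrative sketch, not USW's actual harness: the `Step` record, the `Agent` callable signature, and the `carry_over` flag are hypothetical names, and passing `None` as the prior output for the first step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One workflow step (hypothetical record, for illustration only)."""
    sub_problem: str
    scientist_output: str  # the scientist's ground-truth output for this step


# Hypothetical agent interface: (prior output, sub-problem, tools) -> step output.
Agent = Callable[[Optional[str], str, Sequence[str]], str]


def run_stepwise(
    agent: Agent,
    steps: Sequence[Step],
    tools: Sequence[str],
    carry_over: str,
) -> List[str]:
    """Run the agent step by step and return its output for every step.

    carry_over="self":  step N receives the agent's own step N-1 output,
                        so an early mistake propagates to later steps.
    carry_over="human": step N receives the scientist's step N-1 output,
                        so each step is attempted from a correct starting point.
    """
    if carry_over not in ("self", "human"):
        raise ValueError("carry_over must be 'self' or 'human'")

    outputs: List[str] = []
    prior: Optional[str] = None  # assumption: the first step has no prior output
    for step in steps:
        output = agent(prior, step.sub_problem, tools)
        outputs.append(output)
        # Choose what the next step will see as its input.
        prior = output if carry_over == "self" else step.scientist_output
    return outputs
```

Running the same agent under both values of `carry_over` and comparing per-step scores is one way to separate per-step skill from error that accumulates along the workflow.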
SAR
Step Achievement Ratio
Per step, the percentage of the scientist's target score the agent reached — averaged over steps.
TCS
Task Completion Score
For the whole task, the percentage of the scientist's target the agent achieved.
WFS
Workflow Score
The fraction of steps that hit their target: each step scores 1 if its achieved score meets or exceeds the scientist's target and 0 otherwise, and the scores are averaged over steps.
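The three metrics can be read as simple ratios. The sketch below is one possible reading of the definitions above, not USW's scoring code: it assumes higher scores are better, that every target is positive, and that percentages are capped at 100 (the `cap` flag, the `StepResult` record, and all function names are hypothetical).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StepResult:
    """Scores for one step (hypothetical record, for illustration only)."""
    achieved: float  # score the agent reached on this step
    target: float    # scientist's target score for this step (assumed > 0)


def _percent_of_target(achieved: float, target: float, cap: bool = True) -> float:
    """Percentage of the target reached; capping at 100 is an assumption."""
    if target <= 0:
        raise ValueError("this sketch assumes a positive target")
    pct = 100.0 * achieved / target
    return min(pct, 100.0) if cap else pct


def step_achievement_ratio(steps: Sequence[StepResult], cap: bool = True) -> float:
    """SAR: per-step percentage of the target reached, averaged over steps."""
    if not steps:
        raise ValueError("at least one step is required")
    return sum(_percent_of_target(s.achieved, s.target, cap) for s in steps) / len(steps)


def task_completion_score(achieved: float, target: float, cap: bool = True) -> float:
    """TCS: percentage of the scientist's whole-task target the agent achieved."""
    return _percent_of_target(achieved, target, cap)


def workflow_score(steps: Sequence[StepResult]) -> float:
    """WFS: fraction of steps whose achieved score meets or exceeds the target."""
    if not steps:
        raise ValueError("at least one step is required")
    return sum(1 for s in steps if s.achieved >= s.target) / len(steps)
```

For example, an agent that reaches 8 of 10, 5 of 5, and 3 of 6 on three steps gets per-step percentages of 80, 100, and 50, so SAR is about 76.7, while WFS is 1/3 because only the second step meets its target.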
Task coverage
Known, frontier, and rubric — a verifiability taxonomy
Task scope varies by field. USW spans problems whose outcome is already known and measurable, open frontier problems that remain verifiable, and frontier problems judged by a scientist's rubric — so even non-verifiable discovery stays legible.
Known · verifiable
Outcome is known and quantitatively verifiable; the workflow is measurable step by step.
Frontier · verifiable
An open problem with no established answer, yet the result remains quantitatively verifiable.
Real example: spin-glass D U
Frontier · rubric
An open, non-verifiable problem judged by a scientist's rubric for a meaningful, impactful discovery.
Positioning
Where USW sits — a tiered view
Science benchmarks can be read as a ladder of capability, from isolated knowledge recall up to driving a real, end-to-end experimental workflow. USW sits at the top.
- T1
Knowledge recall
Tests what a model knows — scientific facts and reasoning in isolation.
e.g. Science QA / MCQA
- T2
Tool & code use
Tests whether the agent can operate tools and write working code for a bounded, well-defined task.
e.g. Terminal-Science
- T3
Partial workflow
Follows part of a research procedure, verified category-wise rather than step-wise.
e.g. LifeSciBench
- T4
End-to-end scientific workflow
USW
Drives a real experimental loop, step by atomic step, toward an open-ended discovery goal.
e.g. USW — this benchmark
This tier framing is a research-preview sketch — the proposal leaves the ladder under-specified.
How USW compares
Against other scientific-workflow benchmarks
Closer to how science is actually done — department-driven tasks, mixed environments, GUI-based simulation, and verification at every step.
Get involved
USW is built with working scientists, not just for them. We close the gap between what AI experts imagine and what scientists actually need by going to the departments themselves.
Submit a task
Bring the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect.
Join the study
Take part in our scientist survey and interviews on which tasks matter most for AI agents to solve.
Co-authorship
Qualifying participants receive ~$10 compensation, or — depending on response quality — may be offered co-authorship.
Survey and interview details are shared with participants directly.