Protein Engineering · Directed Evolution · Expert · Verified · Frontier · verifiable

Evolve an HACS enzyme with record activity toward formaldehyde

Run a full active-learning directed-evolution loop — homolog mining, fitness modeling, ESM3 design, and HPLC validation — to find an HACS variant more active toward formaldehyde than any known or engineered enzyme.

Nhat (first author)

University of Scientific Workflow founding contributor

Active-learning-guided discovery of high-activity formaldehyde-converting enzymes

Nature (in submission) · 2025

Multi-round · days–weeks per loop

registered 2025-11-02

active-learning · enzyme · directed-evolution · ESM3 · HPLC · open-ended

End goal

Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.

Overview

This task reproduces a real, multi-round protein-engineering campaign. Starting from a seed HACS sequence, the agent must mine distant homologs, reduce and filter them into a synthesizable library, train a fitness model on experimental activity, generate new variants with an inverse-folding model, prioritize them with an acquisition function, and validate the top picks against an HPLC oracle. The loop repeats: each round of ground-truth measurements retrains the model and sharpens the next batch. Success is open-ended — there is no single correct sequence — but every step is individually verifiable against the targets the scientist set.

Tools allowed

HMMER / Jackhmmer · HPC
MMseqs2 · Terminal
CLEAN · Terminal
EnzymeCAGE · HPC
ESM3 · HPC
UCB Acquisition · Terminal
HPLC Activity Screen · VM (GUI)

Constraints

Software

HMMER / Jackhmmer, MMseqs2, CLEAN, EnzymeCAGE, ESM3, HPLC analysis pipeline

Hardware

GPU for ESM3 inference & fitness modeling
Wet-lab HPLC instrument (activity oracle)

Datasets

JGI / NCBI sequence databases
Public protein databases mined for distant HACS homologs in the sampling step.
Experimental activity dataset
Round-by-round HPLC activity measurements that train and update the fitness model (the ground-truth oracle).

Workflow

5-step protocol

Each step is verified against the scientist's targets.

1
Sampling process
Step 1 / 5
Build a diverse, reaction-relevant candidate library from public sequence space, then reduce it to a set affordable to synthesize and screen.
Protocol
1. Run HMMER/Jackhmmer against the JGI/NCBI databases to find distant homologs.
2. Cluster the hits with MMseqs2 to reduce the number of candidates.
3. Filter by EC number with CLEAN.
4. Filter by specific reaction with EnzymeCAGE.
5. Produce the final candidate set for synthesis and HPLC screening.
Targets
Distant-homolog recall: ≥ 85% of known actives
EC-filter precision: ≥ 90%
Reaction-filter enrichment: ≥ 3×
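The three targets above can be checked with simple set arithmetic. The sketch below is one plausible reading, not the scientist's scoring code: the function names, the sequence IDs, and the exact metric definitions (recall over known actives, precision over filter-passing sequences, enrichment as a ratio of relevant fractions after versus before filtering) are all assumptions.

```python
# Hypothetical metric definitions; the source does not specify how the
# targets are computed, so treat these as one reasonable interpretation.

def recall(selected, known_actives):
    """Fraction of known active sequences that are present in `selected`."""
    known = set(known_actives)
    return len(known & set(selected)) / len(known)


def precision(selected, correct):
    """Fraction of `selected` sequences that are in the `correct` set."""
    sel = set(selected)
    return len(sel & set(correct)) / len(sel)


def enrichment(before, after, relevant):
    """Relevant fraction after filtering divided by the fraction before."""
    rel = set(relevant)
    frac_before = len(rel & set(before)) / len(set(before))
    frac_after = len(rel & set(after)) / len(set(after))
    return frac_after / frac_before


# Toy data with made-up sequence IDs.
pool = [f"s{i}" for i in range(10)]          # homolog hits before filtering
actives = ["s0", "s1", "s2", "s3"]           # known actives in the pool
filtered = ["s0", "s1", "s2", "s5"]          # library after the filters

print("recall:", recall(filtered, actives))
print("precision:", precision(filtered, actives))
print("enrichment:", enrichment(pool, filtered, actives))
```

In this toy case the library keeps three of four known actives, so it would miss the 85% recall target even though the filters enrich for relevant sequences.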
Expected output
A curated, non-redundant candidate library (FASTA) annotated by EC and reaction compatibility.
output carries into step 2
2
Fitness function prediction
Step 2 / 5
Train a sequence-to-activity model on the experimental data collected so far and quantify how much it improved over the previous round.
Protocol
1. Train the model on the experimental data.
2. Calculate the accuracy score.
3. Evaluate the improvement over the last round.
Targets
Model accuracy: ≥ 0.7 (Spearman ρ)
Round-over-round gain: ≥ 0.05 (Δρ)
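Both targets are stated in terms of Spearman ρ between predicted and measured activity. The sketch below computes ρ in plain Python (Pearson correlation of tie-averaged ranks) and compares two rounds. The helper names and the example numbers are illustrative assumptions; in practice a library routine such as SciPy's `spearmanr` would likely be used instead.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(predicted, measured):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


def round_report(rho_prev, rho_curr, min_rho=0.7, min_gain=0.05):
    """Compare this round's held-out rho with the previous round's."""
    gain = rho_curr - rho_prev
    return {
        "rho": rho_curr,
        "gain": gain,
        "meets_accuracy": rho_curr >= min_rho,
        "meets_gain": gain >= min_gain,
    }


# Made-up held-out predictions and HPLC measurements (U/mg).
predicted = [0.1, 0.4, 0.35, 0.8]
measured = [1.0, 2.0, 3.0, 4.0]
rho = spearman_rho(predicted, measured)
print(round_report(rho_prev=0.62, rho_curr=rho))
```

The default thresholds mirror the targets listed above; the previous-round value of 0.62 is purely illustrative.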
Expected output
A fitness model with held-out accuracy and a round-over-round improvement report.
output carries into step 3
3
In silico screening
Step 3 / 5
Generate new candidate variants, score them with the fitness model under an exploration-aware acquisition function, and assemble the next synthesis batch.
Protocol
1. Generate a candidate distribution with ESM3 inverse folding.
2. Calculate the UCB score for each candidate.
3. Choose the top candidates for synthesis, including degenerate-codon and saturation variants plus the original Jackhmmer library.
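As an illustration of the UCB scoring step, the sketch below uses the common form UCB = μ + β·σ, where μ and σ are the mean and standard deviation of predicted activity. The source does not say how uncertainty is estimated; this example assumes an ensemble of fitness models, and the variant IDs, the `beta` value, and the function names are all hypothetical.

```python
from statistics import mean, pstdev


def ucb_scores(ensemble_predictions, beta=1.0):
    """Map variant ID -> mean + beta * std of its ensemble predictions.

    `ensemble_predictions` maps each variant ID to a list of predicted
    activities, one per ensemble member (an assumed uncertainty source).
    """
    return {
        vid: mean(preds) + beta * pstdev(preds)
        for vid, preds in ensemble_predictions.items()
    }


def top_k(scores, k):
    """IDs of the k highest-scoring variants, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up predictions (U/mg) from a three-member ensemble.
preds = {
    "v1": [3.0, 3.0, 3.0],  # confident, moderate activity
    "v2": [2.0, 3.0, 4.0],  # same mean, higher uncertainty
    "v3": [1.0, 1.0, 1.0],  # confident, low activity
}
scores = ucb_scores(preds, beta=1.0)
print(top_k(scores, k=2))
```

With `beta > 0`, `v2` outranks `v1` despite an identical mean, which is the exploration bonus at work; larger `beta` pushes the batch further toward uncertain candidates.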
Targets
Predicted activity (top-32 mean): ≥ 4 U/mg
Batch diversity: ≥ 60% unique clusters
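A batch can be checked against these two targets before it is sent for synthesis. The sketch below assumes that "unique clusters" means the number of distinct sequence clusters divided by the batch size, and that each candidate already carries a cluster label (for example from the MMseqs2 step); both are interpretations rather than stated definitions.

```python
def top_n_mean(predicted_activities, n=32):
    """Mean predicted activity of the n best candidates in the batch."""
    top = sorted(predicted_activities, reverse=True)[:n]
    return sum(top) / len(top)


def unique_cluster_fraction(cluster_ids):
    """Distinct clusters divided by batch size (assumed diversity metric)."""
    return len(set(cluster_ids)) / len(cluster_ids)


def batch_meets_targets(predicted, cluster_ids, min_activity=4.0,
                        min_diversity=0.60, n=32):
    """True if the batch satisfies both the activity and diversity targets."""
    return (top_n_mean(predicted, n) >= min_activity
            and unique_cluster_fraction(cluster_ids) >= min_diversity)


# Made-up five-member batch: predicted U/mg and cluster labels.
predicted = [5.0, 4.5, 4.0, 3.5, 3.0]
clusters = ["c1", "c1", "c2", "c3", "c4"]
print(top_n_mean(predicted), unique_cluster_fraction(clusters))
print(batch_meets_targets(predicted, clusters))
```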
Expected output
A ranked synthesis batch (plate map) combining generated, saturation, and library variants.
output carries into step 4
4
Experimental validation
Step 4 / 5
Measure true activity for the synthesized batch with HPLC, then fold the new ground truth back into the model to begin the next round.
Protocol
1. Screen the synthesized variants by HPLC.
2. Update the model with the new ground-truth data.
Targets
Best-variant activity: ≥ 6.5 U/mg
Fold-improvement over SOTA: ≥ 1.5×
Hit rate (> wild type): ≥ 25%
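Once HPLC activities are in hand, the three validation targets reduce to a few lines of arithmetic. In the sketch below the activity values, the wild-type and state-of-the-art (SOTA) reference numbers, and the function name are made up for illustration; real reference values would come from the campaign's own measurements.

```python
def validation_summary(activities, wild_type, sota):
    """Summarize a screened batch.

    activities: dict of variant ID -> measured activity (U/mg)
    wild_type:  measured wild-type activity (U/mg)
    sota:       best previously reported activity (U/mg)
    """
    best_id = max(activities, key=activities.get)
    best = activities[best_id]
    hits = sum(1 for a in activities.values() if a > wild_type)
    return {
        "best_variant": best_id,
        "best_activity": best,
        "fold_over_sota": best / sota,
        "hit_rate": hits / len(activities),
    }


# Hypothetical batch and reference values.
batch = {"v1": 7.0, "v2": 2.0, "v3": 1.0, "v4": 0.5}
summary = validation_summary(batch, wild_type=1.5, sota=4.0)
print(summary)
```

The same `batch` dictionary can be appended to the training set before the fitness model is retrained for the next round.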
Expected output
Measured activities for the batch and an updated, retrained fitness model.
output carries into step 5
5
Repeat — next active-learning round
Step 5 / 5
Close the loop: with the new HPLC ground truth folded in, return to fitness modeling (Step 2) and run another round until activity gains plateau.
Protocol
1. Re-enter Step 2 (fitness prediction) with the augmented dataset.
2. Re-screen and re-synthesize (Steps 3–4) for the next batch.
3. Stop when round-over-round activity gains plateau.
Targets
Rounds to record activity: ≤ 4 rounds
Cumulative activity gain: ≥ 2× over round 1
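The protocol does not define "plateau", so a stopping rule has to be chosen. One simple option, sketched below, is to stop when the relative gain in best-variant activity stays under a threshold for a fixed number of consecutive rounds. The 5% threshold, the patience of two rounds, and the activity trace are assumptions for illustration only.

```python
def has_plateaued(best_per_round, min_rel_gain=0.05, patience=2):
    """True if each of the last `patience` rounds gained < min_rel_gain."""
    if len(best_per_round) <= patience:
        return False  # not enough rounds to judge
    recent = best_per_round[-(patience + 1):]
    gains = [(b - a) / a for a, b in zip(recent, recent[1:])]
    return all(g < min_rel_gain for g in gains)


def cumulative_gain(best_per_round):
    """Best activity in the latest round relative to round 1."""
    return best_per_round[-1] / best_per_round[0]


# Made-up convergence trace: best measured activity (U/mg) per round.
trace = [2.0, 4.0, 6.5, 6.6, 6.7]
for n_rounds in range(1, len(trace) + 1):
    seen = trace[:n_rounds]
    print(n_rounds, seen[-1], has_plateaued(seen))
print("cumulative gain:", cumulative_gain(trace))
```

In this trace the rule fires after round 5, once two consecutive rounds have each added less than 5%.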
Expected output
A new round's best-variant activity and a convergence trace across rounds.
