Evolve an HACS enzyme with record activity toward formaldehyde
Run a full active-learning directed-evolution loop — homolog mining, fitness modeling, ESM3 design, and HPLC validation — to find an HACS variant more active toward formaldehyde than any known or engineered enzyme.
End goal
Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.
Overview
This task reproduces a real, multi-round protein-engineering campaign. Starting from a seed HACS sequence, the agent must mine distant homologs, reduce and filter them into a synthesizable library, train a fitness model on experimental activity, generate new variants with an inverse-folding model, prioritize them with an acquisition function, and validate the top picks against an HPLC oracle. The loop repeats: each round of ground-truth measurements retrains the model and sharpens the next batch. Success is open-ended — there is no single correct sequence — but every step is individually verifiable against the targets the scientist set.
Tools allowed
7 Constraints
Software
Hardware
Datasets
- JGI / NCBI sequence databases
Public protein databases mined for distant HACS homologs in the sampling step.
- Experimental activity dataset
Round-by-round HPLC activity measurements that train and update the fitness model (the ground-truth oracle).
Workflow
5-step protocol
Each step is verified against the scientist's targets.
Step 1 / 5: Sampling process
Build a diverse, reaction-relevant candidate library from public sequence space, then reduce it to a set affordable to synthesize and screen.
Protocol
- a. Apply HMMER/Jackhmmer to the JGI/NCBI databases to find distant homologs.
- b. Cluster with MMseqs2 to reduce redundancy and the total number of sequences in the distribution.
- c. Filter by EC number with CLEAN.
- d. Filter by specific reactions with EnzymeCAGE.
- e. Produce the final distribution for synthesis and HPLC screening.
Targets
- Distant-homolog recall: ≥ 85% of known actives
- EC-filter precision: ≥ 90%
- Reaction-filter enrichment: ≥ 3×
Expected output
A curated, non-redundant candidate library (FASTA) annotated by EC and reaction compatibility.
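The three sampling targets can be checked with simple set arithmetic. The sketch below is illustrative only: the sequence IDs are made up, and the definition of enrichment (compatible fraction after the reaction filter divided by the compatible fraction in the pool that entered it) is an assumption, since the page does not define it.

```python
def recall(recovered: set, known_actives: set) -> float:
    """Fraction of known actives that are present in the recovered set."""
    return len(recovered & known_actives) / len(known_actives)


def precision(kept: set, truly_correct: set) -> float:
    """Fraction of kept sequences whose annotation is actually correct."""
    return len(kept & truly_correct) / len(kept)


def enrichment(kept: set, compatible: set, pool: set) -> float:
    """Compatible fraction after filtering divided by the fraction before (assumed definition)."""
    before = len(pool & compatible) / len(pool)
    after = len(kept & compatible) / len(kept)
    return after / before


# Toy data: hypothetical IDs, not real sequences.
mined_hits = {f"seq{i}" for i in range(20)}
known_actives = {"seq0", "seq1", "seq2", "seq3"}

ec_kept = {f"seq{i}" for i in range(10)}                # passed the EC filter
ec_correct = {f"seq{i}" for i in range(9)} | {"seq15"}  # truly in the target EC class

compatible = {"seq0", "seq1", "seq2"}                   # truly reaction-compatible
reaction_kept = {"seq0", "seq1", "seq2"}                # passed the reaction filter

homolog_recall = recall(mined_hits, known_actives)
ec_precision = precision(ec_kept, ec_correct)
reaction_enrichment = enrichment(reaction_kept, compatible, ec_kept)

targets_met = (
    homolog_recall >= 0.85
    and ec_precision >= 0.90
    and reaction_enrichment >= 3.0
)
print(homolog_recall, ec_precision, round(reaction_enrichment, 2), targets_met)
```

In this toy case all three thresholds are met; with real data the reference sets (known actives, verified EC labels, verified reaction compatibility) would have to come from curated annotations.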
Output carries into step 2.
Step 2 / 5: Fitness function prediction
Train a sequence-to-activity model on the experimental data collected so far and quantify how much it improved over the previous round.
Protocol
- a. Train the model with the experimental data.
- b. Calculate the accuracy score.
- c. Evaluate improvement from the last round.
Targets
- Model accuracy: ≥ 0.7 Spearman ρ
- Round-over-round gain: ≥ 0.05 Δρ
Expected output
A fitness model with held-out accuracy and a round-over-round improvement report.
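Both targets in this step reduce to a rank correlation between predicted and measured activity on held-out variants. A minimal, dependency-free sketch follows; the prediction and measurement values are invented for illustration, and in practice a library routine such as `scipy.stats.spearmanr` would do the same job.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, measured):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Hypothetical held-out activities (U/mg) and two rounds of model predictions.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_previous_round = [2.0, 1.0, 3.0, 5.0, 4.0]
pred_current_round = [1.1, 2.3, 2.9, 4.2, 4.8]

rho_prev = spearman(pred_previous_round, measured)
rho_curr = spearman(pred_current_round, measured)
delta_rho = rho_curr - rho_prev

targets_met = rho_curr >= 0.7 and delta_rho >= 0.05
print(round(rho_prev, 3), round(rho_curr, 3), round(delta_rho, 3), targets_met)
```

Comparing rounds is only meaningful if both models are scored on the same held-out variants, which is what the sketch assumes.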
Output carries into step 3.
Step 3 / 5: In silico screening
Generate new candidate variants, score them with the fitness model under an exploration-aware acquisition function, and assemble the next synthesis batch.
Protocol
- a. Generate a candidate distribution with ESM3 inverse folding.
- b. Calculate the UCB score for each candidate.
- c. Choose the top candidates for synthesis, including degenerate-codon and saturation variants plus the original Jackhmmer library.
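The upper-confidence-bound (UCB) score in this protocol is commonly written as predicted mean plus a multiple of predictive uncertainty, UCB = μ + β·σ. The sketch below assumes the uncertainty comes from the spread of an ensemble of fitness models; the ensemble, the variant names, the predictions, and the value of `beta` are all hypothetical, since the page does not specify how the score is computed.

```python
from statistics import mean, pstdev


def ucb_score(ensemble_predictions, beta=1.0):
    """Mean prediction plus beta times the ensemble standard deviation."""
    return mean(ensemble_predictions) + beta * pstdev(ensemble_predictions)


def select_top(candidates, k, beta=1.0):
    """Return the k candidate names with the highest UCB score."""
    scores = {name: ucb_score(preds, beta) for name, preds in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical per-model activity predictions (U/mg) for four variants.
candidates = {
    "var_A": [4.2, 4.2, 4.2],  # confident, moderate
    "var_B": [3.0, 4.0, 5.0],  # uncertain, same ballpark
    "var_C": [5.0, 5.0, 5.0],  # confident, high
    "var_D": [2.0, 2.0, 2.0],  # confident, low
}

greedy_pick = select_top(candidates, k=2, beta=0.0)    # pure exploitation
explore_pick = select_top(candidates, k=2, beta=1.0)   # rewards uncertainty
print(greedy_pick, explore_pick)
```

With `beta=0` the ranking is purely by predicted mean; raising `beta` promotes the uncertain `var_B` over the safer `var_A`, which is the exploration behavior the acquisition function is meant to provide.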
Targets
- Predicted activity (top-32 mean): ≥ 4 U/mg
- Batch diversity: ≥ 60% unique clusters
Expected output
A ranked synthesis batch (plate map) combining generated, saturation, and library variants.
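A quick check of the two batch targets might look like the following. It assumes that "unique clusters" means the number of distinct sequence clusters divided by the batch size, which is one plausible reading rather than a definition given here; the predicted activities and cluster labels are invented.

```python
def top_k_mean(predicted, k=32):
    """Mean predicted activity of the k highest-scoring candidates."""
    top = sorted(predicted, reverse=True)[:k]
    return sum(top) / len(top)


def batch_diversity(cluster_ids):
    """Distinct clusters divided by batch size (assumed definition)."""
    return len(set(cluster_ids)) / len(cluster_ids)


# Hypothetical five-member batch: predicted activity (U/mg) and cluster label.
predicted_activity = [5.0, 3.0, 4.0, 1.0, 6.0]
clusters = ["c1", "c1", "c2", "c3", "c4"]

mean_top3 = top_k_mean(predicted_activity, k=3)
diversity = batch_diversity(clusters)
print(mean_top3, diversity)
```

For a real plate the same functions would be called with `k=32` and the cluster assignments produced during library reduction.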
Output carries into step 4.
Step 4 / 5: Experimental validation
Measure true activity for the synthesized batch with HPLC, then fold the new ground truth back into the model to begin the next round.
Protocol
- a. HPLC screening of the synthesized variants.
- b. Update the model with the extra data (ground truth).
Targets
- Best-variant activity: ≥ 6.5 U/mg
- Fold-improvement over SOTA: ≥ 1.5×
- Hit rate (> wild type): ≥ 25%
Expected output
Measured activities for the batch and an updated, retrained fitness model.
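Once HPLC activities are in hand, the three validation targets follow directly from the measurements. In this sketch the batch activities, the wild-type activity, and the state-of-the-art (SOTA) reference are placeholder numbers chosen for illustration, not values from the campaign.

```python
def validation_summary(activities, wild_type, sota):
    """Best activity, fold-improvement over SOTA, and fraction of variants beating wild type."""
    best = max(activities)
    hits = sum(1 for a in activities if a > wild_type)
    return {
        "best": best,
        "fold_over_sota": best / sota,
        "hit_rate": hits / len(activities),
    }


# Placeholder measurements in U/mg.
batch_activities = [7.2, 3.1, 0.4, 5.0, 1.0, 0.9, 2.5, 0.2]
summary = validation_summary(batch_activities, wild_type=2.0, sota=4.5)

targets_met = (
    summary["best"] >= 6.5
    and summary["fold_over_sota"] >= 1.5
    and summary["hit_rate"] >= 0.25
)
print(summary, targets_met)
```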
Output carries into step 5.
Step 5 / 5: Repeat (next active-learning round)
Close the loop: with the new HPLC ground truth folded in, return to fitness modeling (Step 2) and run another round until activity gains plateau.
Protocol
- a. Re-enter Step 2 (fitness prediction) with the augmented dataset.
- b. Re-screen and re-synthesize (Steps 3–4) for the next batch.
- c. Stop when round-over-round activity gains plateau.
Targets
- Rounds to record activity: ≤ 4 rounds
- Cumulative activity gain: ≥ 2× over round 1
Expected output
A new round's best-variant activity and a convergence trace across rounds.
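The stopping rule is stated only qualitatively ("gains plateau"), so any concrete criterion is a design choice. One simple option, sketched here, is to stop at the first round whose relative gain in best-variant activity falls below a threshold; the 5% threshold and the activity trace are assumptions made for illustration.

```python
def relative_gains(best_by_round):
    """Relative gain of each round's best activity over the previous round."""
    return [
        (curr - prev) / prev
        for prev, curr in zip(best_by_round, best_by_round[1:])
    ]


def first_plateau_round(best_by_round, min_rel_gain=0.05):
    """1-based round where the gain first drops below min_rel_gain, or None if it never does."""
    for round_number, gain in enumerate(relative_gains(best_by_round), start=2):
        if gain < min_rel_gain:
            return round_number
    return None


# Hypothetical convergence trace: best-variant activity (U/mg) per round.
trace = [3.0, 4.8, 6.4, 6.5]

stop_round = first_plateau_round(trace)
cumulative_gain = trace[-1] / trace[0]
print(stop_round, round(cumulative_gain, 2))
```

A stricter variant could require the gain to stay below the threshold for two consecutive rounds before stopping, which guards against a single noisy batch ending the campaign early.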