Protein Engineering · Directed Evolution · Expert · Verified · Frontier · verifiable

Evolve an HACS enzyme with record activity toward formaldehyde

Run a full active-learning directed-evolution loop — homolog mining, fitness modeling, ESM3 design, and HPLC validation — to find an HACS variant more active toward formaldehyde than any known or engineered enzyme.

Nhat (first author)
University of Scientific Workflow founding contributor
Active-learning-guided discovery of high-activity formaldehyde-converting enzymes
Nature (in submission) · 2025
Multi-round · days–weeks per loop
registered 2025-11-02
active-learning · enzyme · directed-evolution · ESM3 · HPLC · open-ended

End goal

Find an HACS enzyme that has higher activity toward formaldehyde than all previously discovered and engineered variants.

Overview

This task reproduces a real, multi-round protein-engineering campaign. Starting from a seed HACS sequence, the agent must mine distant homologs, reduce and filter them into a synthesizable library, train a fitness model on experimental activity, generate new variants with an inverse-folding model, prioritize them with an acquisition function, and validate the top picks against an HPLC oracle. The loop repeats: each round of ground-truth measurements retrains the model and sharpens the next batch. Success is open-ended — there is no single correct sequence — but every step is individually verifiable against the targets the scientist set.
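The loop described above can be sketched as a small driver function. This is an illustrative skeleton only: `run_campaign` and its callable parameters (`train`, `propose`, `select`, `measure`) are hypothetical names, and the toy functions below stand in for the real fitness model, ESM3 generation, acquisition step, and HPLC oracle.

```python
from typing import Callable, Dict, List

Model = Callable[[str], float]


def run_campaign(
    library: List[str],
    train: Callable[[Dict[str, float]], Model],
    propose: Callable[[Dict[str, float]], List[str]],
    select: Callable[[List[str], Model], List[str]],
    measure: Callable[[str], float],
    max_rounds: int = 4,
) -> List[float]:
    """Return the best measured activity after the initial screen and each round."""
    data = {seq: measure(seq) for seq in library}  # initial screen of the mined library
    trace = [max(data.values())]
    for _ in range(max_rounds):
        model = train(data)                                       # fitness modeling
        candidates = [c for c in propose(data) if c not in data]  # new variants only
        if not candidates:
            break
        for seq in select(candidates, model):                     # acquisition
            data[seq] = measure(seq)                              # oracle measurement
        trace.append(max(data.values()))
    return trace


# Toy stand-ins (assumptions): "activity" is simply the number of A residues.
def toy_measure(seq: str) -> float:
    return float(seq.count("A"))


def toy_train(data: Dict[str, float]) -> Model:
    return toy_measure  # a perfect toy model, for illustration only


def toy_propose(data: Dict[str, float]) -> List[str]:
    best = max(data, key=data.get)
    return [best[:i] + "A" + best[i + 1:] for i in range(len(best))]


def toy_select(candidates: List[str], model: Model) -> List[str]:
    return sorted(set(candidates), key=lambda s: (-model(s), s))[:2]


trace = run_campaign(["GGGG", "AGGG"], toy_train, toy_propose, toy_select,
                     toy_measure, max_rounds=3)
print(trace)
```

Passing the four stages in as callables keeps the driver independent of any particular model or instrument, so each stage can be verified separately, as the task requires.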

Tools allowed

7 tools

  • HMMER / Jackhmmer · HPC
  • MMseqs2 · Terminal
  • CLEAN · Terminal
  • EnzymeCAGE · HPC
  • ESM3 · HPC
  • UCB Acquisition · Terminal
  • HPLC Activity Screen · VM (GUI)

Constraints

Software

HMMER / Jackhmmer · MMseqs2 · CLEAN · EnzymeCAGE · ESM3 · HPLC analysis pipeline

Hardware

GPU for ESM3 inference & fitness modeling · Wet-lab HPLC instrument (activity oracle)

Datasets

  • JGI / NCBI sequence databases

    Public protein databases mined for distant HACS homologs in the sampling step.

  • Experimental activity dataset

    Round-by-round HPLC activity measurements that train and update the fitness model (the ground-truth oracle).

Workflow

5-step protocol

Each step is verified against the scientist's targets.

  1. Sampling process

    Step 1 / 5

    Build a diverse, reaction-relevant candidate library from public sequence space, then reduce it to a set affordable to synthesize and screen.

    Protocol

    a. Apply HMMER/Jackhmmer to the JGI/NCBI databases to find distant homologs.
    b. Cluster the hits with MMseqs2 to remove redundancy and reduce the number of candidates.
    c. Filter by EC number with CLEAN.
    d. Filter by specific reaction with EnzymeCAGE.
    e. Produce the final candidate distribution for synthesis and HPLC screening.
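A minimal sketch of how the first two protocol steps might be scripted. The file names, thresholds, and helper names (`jackhmmer_cmd`, `mmseqs_cluster_cmd`) are assumptions, not values from the campaign; only the command-line flags shown are standard HMMER and MMseqs2 options. The CLEAN and EnzymeCAGE steps are left out because the source does not specify how they are invoked.

```python
import shlex
from typing import List


def jackhmmer_cmd(seed_fasta: str, db_fasta: str, out_prefix: str,
                  iterations: int = 5, evalue: float = 1e-5, cpus: int = 8) -> List[str]:
    """Build an iterative homolog-search command (protocol step a)."""
    return [
        "jackhmmer",
        "-N", str(iterations),            # maximum number of search iterations
        "-E", str(evalue),                # per-sequence E-value reporting threshold
        "--cpu", str(cpus),
        "--tblout", f"{out_prefix}.tbl",  # tabular per-sequence hit table
        "-A", f"{out_prefix}.sto",        # alignment of hits (Stockholm format)
        seed_fasta, db_fasta,
    ]


def mmseqs_cluster_cmd(in_fasta: str, out_prefix: str, tmp_dir: str = "tmp",
                       min_seq_id: float = 0.5, coverage: float = 0.8) -> List[str]:
    """Build a clustering command that reduces redundancy (protocol step b)."""
    return [
        "mmseqs", "easy-cluster", in_fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity within a cluster
        "-c", str(coverage),              # minimum alignment coverage
    ]


# Hypothetical file names; the hits must be exported to FASTA before clustering.
search = jackhmmer_cmd("hacs_seed.fasta", "ncbi_nr.fasta", "round0_hits")
cluster = mmseqs_cluster_cmd("round0_hits.fasta", "round0_clusters")
print(shlex.join(search))
print(shlex.join(cluster))
```

The commands are built as argument lists so they can be passed to `subprocess.run(cmd, check=True)` on the HPC or terminal environment without shell quoting issues. `mmseqs easy-cluster` writes the cluster representatives to `<out_prefix>_rep_seq.fasta`, which would feed the EC and reaction filters.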

    Targets

    Distant-homolog recall: 85% of known actives
    EC-filter precision: 90%
    Reaction-filter enrichment: 3×

    Expected output

    A curated, non-redundant candidate library (FASTA) annotated by EC and reaction compatibility.
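The three Step 1 targets can be checked with simple set arithmetic. The definitions below are one plausible reading of the target names, not definitions given by the source, and the sequence IDs are toy data chosen so that each metric lands exactly on its target.

```python
from typing import Set


def recall(retrieved: Set[str], known_actives: Set[str]) -> float:
    """Fraction of known active enzymes recovered by the homolog search."""
    return len(retrieved & known_actives) / len(known_actives)


def precision(kept: Set[str], correct: Set[str]) -> float:
    """Fraction of sequences passing the EC filter that carry the right EC label."""
    return len(kept & correct) / len(kept)


def enrichment(before: Set[str], after: Set[str], compatible: Set[str]) -> float:
    """Ratio of reaction-compatible fractions after vs. before the reaction filter."""
    rate_before = len(before & compatible) / len(before)
    rate_after = len(after & compatible) / len(after)
    return rate_after / rate_before


# Toy IDs (assumptions): 20 known actives, 100 mined sequences.
known_actives = {f"act{i}" for i in range(20)}
mined = {f"act{i}" for i in range(17)} | {f"hom{i}" for i in range(83)}
ec_kept = {f"act{i}" for i in range(9)} | {"hom0"}
compatible = {f"act{i}" for i in range(17)} | {"hom0", "hom1", "hom2"}
reaction_kept = {f"act{i}" for i in range(15)} | {f"hom{i}" for i in range(3, 13)}

print(f"recall     = {recall(mined, known_actives):.2f}")
print(f"precision  = {precision(ec_kept, known_actives):.2f}")
print(f"enrichment = {enrichment(mined, reaction_kept, compatible):.1f}x")
```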

    output carries into step 2
  2. Fitness function prediction

    Step 2 / 5

    Train a sequence-to-activity model on the experimental data collected so far and quantify how much it improved over the previous round.

    Protocol

    a. Train the model on the experimental data.
    b. Calculate the accuracy score (Spearman ρ on held-out data).
    c. Evaluate the improvement over the previous round.

    Targets

    Model accuracy: 0.7 Spearman ρ
    Round-over-round gain: 0.05 Δρ

    Expected output

    A fitness model with held-out accuracy and a round-over-round improvement report.
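For reference, the accuracy score and the round-over-round gain can be computed as follows. This is a dependency-free sketch: Spearman ρ is the Pearson correlation of the rank-transformed values, with tied values given their average rank. The predicted and measured activities are toy numbers, assumed to come from a held-out set.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    return pearson(average_ranks(predicted), average_ranks(measured))


# Toy held-out set: measured activities and two rounds of model predictions.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_previous = [0.3, 0.1, 0.2, 0.9, 0.5]
pred_current = [0.2, 0.1, 0.5, 0.4, 0.9]

rho_previous = spearman_rho(pred_previous, measured)
rho_current = spearman_rho(pred_current, measured)
delta_rho = rho_current - rho_previous
print(f"rho = {rho_current:.2f}, delta rho = {delta_rho:+.2f}")
```

Comparing both rounds on the same held-out variants keeps Δρ from being confounded by a change in the evaluation set.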

    output carries into step 3
  3. In silico screening

    Step 3 / 5

    Generate new candidate variants, score them with the fitness model under an exploration-aware acquisition function, and assemble the next synthesis batch.

    Protocol

    a. Generate a candidate distribution with ESM3 inverse folding.
    b. Calculate the UCB score for each candidate.
    c. Choose the top candidates for synthesis, including degenerate-codon and saturation variants plus the original Jackhmmer library.
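An upper-confidence-bound (UCB) score adds an exploration bonus to the predicted activity: score = μ + β·σ. The sketch below assumes the uncertainty σ comes from the spread of an ensemble of fitness models, which the source does not specify; the weight `beta` and the variant names are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def ucb_scores(ensemble_preds: Dict[str, List[float]], beta: float = 1.0) -> Dict[str, float]:
    """UCB = mean prediction + beta * standard deviation across ensemble members."""
    return {seq: mean(p) + beta * pstdev(p) for seq, p in ensemble_preds.items()}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Highest-scoring candidates first; ties are broken by name for reproducibility."""
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


# Toy predictions (U/mg) from a three-member ensemble.
preds = {
    "variant_A": [4.0, 4.0, 4.0],  # confident prediction
    "variant_B": [2.0, 4.0, 6.0],  # same mean, high uncertainty
    "variant_C": [3.0, 3.0, 3.0],
}

explore = top_k(ucb_scores(preds, beta=1.0), k=2)
exploit = top_k(ucb_scores(preds, beta=0.0), k=2)
print(explore, exploit)
```

With `beta=1.0` the uncertain `variant_B` outranks `variant_A`; with `beta=0.0` the score reduces to the mean prediction and the two tie. A larger `beta` therefore spends more of the synthesis budget on exploration.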

    Targets

    Predicted activity (top-32 mean): 4 U/mg
    Batch diversity: 60% unique clusters

    Expected output

    A ranked synthesis batch (plate map) combining generated, saturation, and library variants.
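The two Step 3 targets can be checked on an assembled batch as sketched below. The reading of "unique clusters" as the number of distinct clusters divided by batch size is an assumption, as are the cluster labels and activities, which are toy values.

```python
from typing import Dict, List


def top_n_mean(predicted: Dict[str, float], n: int = 32) -> float:
    """Mean predicted activity of the n best-scoring candidates."""
    best = sorted(predicted.values(), reverse=True)[:n]
    return sum(best) / len(best)


def unique_cluster_fraction(batch: List[str], cluster_of: Dict[str, str]) -> float:
    """Distinct clusters represented in the batch, divided by batch size."""
    return len({cluster_of[seq] for seq in batch}) / len(batch)


# Toy batch of five variants drawn from three sequence clusters.
predicted = {"v1": 5.0, "v2": 4.0, "v3": 3.0, "v4": 1.0, "v5": 0.5}
cluster_of = {"v1": "c1", "v2": "c1", "v3": "c2", "v4": "c3", "v5": "c3"}
batch = list(predicted)

print(f"top-3 mean = {top_n_mean(predicted, n=3):.1f} U/mg")
print(f"diversity  = {unique_cluster_fraction(batch, cluster_of):.0%}")
```

The cluster assignments could be reused from the MMseqs2 output of Step 1, so that diversity is measured on the same clustering that built the library.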

    output carries into step 4
  4. Experimental validation

    Step 4 / 5

    Measure true activity for the synthesized batch with HPLC, then fold the new ground truth back into the model to begin the next round.

    Protocol

    a. HPLC screening of the synthesized variants.
    b. Update the model with the extra data (ground truth).

    Targets

    Best-variant activity: 6.5 U/mg
    Fold-improvement over SOTA: 1.5×
    Hit rate (> wild type): 25%

    Expected output

    Measured activities for the batch and an updated, retrained fitness model.
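Once the HPLC activities are in hand, the three Step 4 targets reduce to a few lines of arithmetic. The function name `validation_report` and all numbers below are illustrative assumptions, not measurements from the campaign.

```python
from typing import Dict


def validation_report(measured: Dict[str, float], wild_type: float, sota: float) -> Dict[str, float]:
    """Summarize one screened batch against the wild type and the best prior variant."""
    best = max(measured.values())
    hits = sum(1 for activity in measured.values() if activity > wild_type)
    return {
        "best_activity": best,               # U/mg
        "fold_over_sota": best / sota,       # best variant vs. best previously known
        "hit_rate": hits / len(measured),    # fraction of variants beating wild type
    }


# Toy batch (U/mg); wild-type and state-of-the-art activities are assumed values.
batch_activity = {"v1": 6.6, "v2": 3.0, "v3": 1.0, "v4": 0.5}
report = validation_report(batch_activity, wild_type=2.0, sota=4.4)
print(report)

# Fold the new ground truth into the training set for the next round.
training_data: Dict[str, float] = {"seed": 2.0}
training_data.update(batch_activity)
```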

    output carries into step 5
  5. Repeat: next active-learning round

    Step 5 / 5

    Close the loop: with the new HPLC ground truth folded in, return to fitness modeling (Step 2) and run another round until activity gains plateau.

    Protocol

    a. Re-enter Step 2 (fitness prediction) with the augmented dataset.
    b. Re-screen and re-synthesize (Steps 3–4) for the next batch.
    c. Stop when round-over-round activity gains plateau.

    Targets

    Rounds to record activity: 4 rounds
    Cumulative activity gain: 2× over round 1

    Expected output

    A new round's best-variant activity and a convergence trace across rounds.
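The source does not define "plateau" numerically, so the stopping rule below is one possible choice: stop once the best activity has improved by less than a relative threshold for a set number of consecutive rounds. The threshold `min_rel_gain`, the `patience` parameter, and the trace values are all assumptions.

```python
from typing import List


def has_plateaued(trace: List[float], min_rel_gain: float = 0.05, patience: int = 2) -> bool:
    """True when each of the last `patience` rounds gained less than min_rel_gain.

    `trace` holds the best measured activity per round and is assumed positive.
    """
    if len(trace) <= patience:
        return False
    recent = zip(trace[-patience - 1:-1], trace[-patience:])
    return all((after - before) / before < min_rel_gain for before, after in recent)


def cumulative_gain(trace: List[float]) -> float:
    """Best activity in the latest round relative to round 1."""
    return trace[-1] / trace[0]


# Toy convergence trace: best-variant activity (U/mg) for rounds 1 to 5.
trace = [3.0, 4.5, 6.0, 6.1, 6.15]

for round_number in range(1, len(trace) + 1):
    seen = trace[:round_number]
    print(round_number, seen[-1], has_plateaued(seen))

print(f"cumulative gain = {cumulative_gain(trace):.2f}x")
```

Requiring several consecutive small gains guards against stopping after a single unlucky batch.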