The energy landscape
BAGEL formalizes protein design as the sampling of an energy function over sequence space. Rather than following the traditional inverse-folding pipeline (backbone design → sequence painting → folding), BAGEL lets you define what you want directly as a set of energy terms and then searches for sequences that minimize the total energy.
A System defines the full energy landscape as the sum of its state energies:
$$E_{\text{total}} = \sum_{s} E_s$$
Each State contributes its own energy, computed as a weighted sum of individual energy terms:
$$E_s = \sum_{i} w_{s,i} \, E_{s,i}(C_s)$$
where $w_{s,i}$ is the weight of energy term $i$ in state $s$, $E_{s,i}$ is the unweighted energy value, and $C_s$ is the set of chains in that state. This formulation lets you encode arbitrary design objectives (structural constraints, confidence metrics, embedding similarities, geometric properties) as energy terms and combine them freely.
System, State, Chain, Residue
BAGEL organizes proteins into a four-level hierarchy:
- System: the top-level container holding all states. The total energy is the sum of all state energies.
- State: a specific context for evaluating a set of chains against a set of energy terms. For example, a binder–target complex is one state; the same binder with a different target is another state.
- Chain: a single polypeptide sequence $(a_1, \dots, a_L)$ with each $a_j \in \mathcal{A}$, where $\mathcal{A}$ is the amino acid alphabet. Chains can be shared across states: when a shared chain is mutated, all states containing it update automatically.
- Residue: a single amino acid position. Each residue is either mutable (allowed to change during optimization) or immutable (fixed).
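The hierarchy and the weighted energy sum can be sketched in a few lines of plain Python. This is a minimal illustration, not BAGEL's API: the class names mirror the concepts above, but `WeightedTerm`, `count_alanines`, and all fields and methods are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Residue:
    name: str             # one-letter amino acid code
    mutable: bool = True  # immutable residues stay fixed during optimization


@dataclass
class Chain:
    residues: List[Residue]

    @property
    def sequence(self) -> str:
        return "".join(r.name for r in self.residues)


# Hypothetical signature: an energy term maps the chains of a state to a float.
EnergyFn = Callable[[List[Chain]], float]


@dataclass
class WeightedTerm:
    weight: float
    fn: EnergyFn


@dataclass
class State:
    chains: List[Chain]
    terms: List[WeightedTerm] = field(default_factory=list)

    def energy(self) -> float:
        # Weighted sum of the unweighted term values for this state.
        return sum(t.weight * t.fn(self.chains) for t in self.terms)


@dataclass
class System:
    states: List[State]

    def total_energy(self) -> float:
        # Total energy is the sum of all state energies.
        return sum(s.energy() for s in self.states)


def count_alanines(chains: List[Chain]) -> float:
    """Toy energy term (illustration only): number of alanines in the state."""
    return float(sum(c.sequence.count("A") for c in chains))


binder = Chain([Residue(a) for a in "MKAA"])
target_1 = Chain([Residue(a, mutable=False) for a in "GA"])
target_2 = Chain([Residue(a, mutable=False) for a in "GG"])

# The same binder object is shared by both states.
system = System([
    State([binder, target_1], [WeightedTerm(1.0, count_alanines)]),
    State([binder, target_2], [WeightedTerm(0.5, count_alanines)]),
])

before = system.total_energy()
binder.residues[0].name = "A"  # mutate the shared chain once
after = system.total_energy()  # both states see the change
```

Because both states hold a reference to the same `binder` object, a single mutation changes the energy of every state that contains it, which is the sharing behavior described for chains.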
Oracles
An oracle is any algorithm that maps a set of chains to a useful representation. Oracles provide the predictions that energy terms use to evaluate designs. BAGEL defines three categories of oracles:
Folding oracles predict 3D structures and associated confidence metrics:
$$\mathcal{O}_{\text{fold}} : C \mapsto (X, Q)$$
where $C$ is the input set of chains, $X$ are atomic coordinates, and $Q$ encodes confidence metrics (pLDDT, pTM, PAE). Currently implemented: ESMFold.
Embedding oracles produce learned high-dimensional features capturing biochemical or evolutionary context:
$$\mathcal{O}_{\text{embed}} : C \mapsto (h_1, \dots, h_L)$$
where $h_j$ is a per-residue embedding vector. Currently implemented: ESM2.
Property oracles compute scalar properties from sequences:
$$\mathcal{O}_{\text{prop}} : C \mapsto p \in \mathbb{R}$$
For example, counting polar residues or predicting unfolding temperature.
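As a rough illustration of the simplest category, a property oracle can be written as a plain function from sequences to a scalar, with an energy term consuming its output. The function names, the choice of which residues count as polar, and the target-fraction energy are all assumptions made for this sketch and are not taken from BAGEL.

```python
from typing import Sequence

# Assumption: one common convention for polar uncharged residues.
# Other conventions also include Y, C, or H.
POLAR_RESIDUES = frozenset("STNQ")


def polar_count_oracle(chains: Sequence[str]) -> float:
    """Hypothetical property oracle: number of polar residues across all chains."""
    return float(sum(1 for seq in chains for aa in seq if aa in POLAR_RESIDUES))


def polar_fraction_energy(chains: Sequence[str], target_fraction: float) -> float:
    """Hypothetical energy term: distance of the polar fraction from a target."""
    total_length = sum(len(seq) for seq in chains)
    fraction = polar_count_oracle(chains) / total_length
    return abs(fraction - target_fraction)


chains = ["MSTNQ", "GAS"]
n_polar = polar_count_oracle(chains)
energy = polar_fraction_energy(chains, target_fraction=0.5)
```

Folding and embedding oracles follow the same pattern, except that they return structured outputs (coordinates with confidence metrics, or per-residue vectors) and are typically backed by a neural network rather than a closed-form rule.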
BAGEL uses boileroom to run oracle inference. boileroom provides a unified Python API for protein prediction models, with support for serverless GPU execution via Modal or local GPU execution. See the boileroom documentation for details on available models and setup.
Energy terms
Energy terms encode specific design objectives. Each term is associated with an oracle and operates on one or more residue groups, which are subsets of residues within a chain. BAGEL adopts an N-body formalism:
- Zero-body terms apply globally to all residues (e.g., PTMEnergy, OverallPLDDTEnergy)
- One-body terms act on a single residue group (e.g., PLDDTEnergy, HydrophobicEnergy)
- Two-body terms operate on a pair of residue groups (e.g., PAEEnergy, SeparationEnergy)
| Category | Terms |
|---|---|
| Folding metrics | PTMEnergy, PLDDTEnergy, OverallPLDDTEnergy, PAEEnergy |
| Structural | SurfaceAreaEnergy, SeparationEnergy, GlobularEnergy, TemplateMatchEnergy, SecondaryStructureEnergy, RingSymmetryEnergy |
| Sequence | HydrophobicEnergy, ChemicalPotentialEnergy |
| Embedding | EmbeddingsSimilarityEnergy |
| Composite | LISEnergy, FlexEvoBindEnergy |
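To make the N-body formalism concrete, the sketch below writes one toy term of each arity over per-residue data. It is not how BAGEL implements its terms: the function names are hypothetical, residue groups are represented as lists of indices, one coordinate per residue is assumed, and negating the confidence so that lower energy means higher confidence is a sign convention chosen for the example.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def overall_confidence_energy(plddt: Sequence[float]) -> float:
    """Zero-body: uses every residue, so it takes no residue group."""
    return -sum(plddt) / len(plddt)


def group_confidence_energy(plddt: Sequence[float], group: List[int]) -> float:
    """One-body: mean confidence restricted to a single residue group."""
    return -sum(plddt[i] for i in group) / len(group)


def centroid(coords: Sequence[Coord], group: List[int]) -> Coord:
    n = len(group)
    return tuple(sum(coords[i][k] for i in group) / n for k in range(3))


def separation_energy(
    coords: Sequence[Coord], group_a: List[int], group_b: List[int]
) -> float:
    """Two-body: distance between the centroids of two residue groups."""
    return math.dist(centroid(coords, group_a), centroid(coords, group_b))


# Toy data: four residues, one coordinate and one confidence value each.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 0.0), (2.0, 4.0, 0.0)]
plddt = [0.5, 0.75, 0.25, 1.0]

e_zero = overall_confidence_energy(plddt)
e_one = group_confidence_energy(plddt, [1, 3])
e_two = separation_energy(coords, [0, 1], [2, 3])
```

The arity only describes how many residue groups a term takes as input; all three kinds reduce to a single scalar that enters the weighted sum for a state.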
Mutation protocols
Mutation protocols define how BAGEL explores sequence space by proposing changes to chain sequences.
Canonical: fixed-length point mutations. At each step:
- Pick a chain with probability proportional to its number of mutable residues
- Draw $n$ residues uniformly from that chain, where $n$ is the number of mutations per step
- Resample their amino acid identity from a categorical distribution (by default, cysteines are excluded to avoid disulfide bond complications)
When the sequence length is allowed to vary, use ChemicalPotentialEnergy to control the target sequence length.
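The three canonical steps can be sketched as a standalone function. This is an illustration under stated assumptions rather than BAGEL's implementation: sequences are plain strings keyed by a chain name, mutable positions are given as index lists, the categorical distribution is taken to be uniform, and `propose_canonical_mutation` and its parameters are hypothetical names.

```python
import random
from typing import Dict, List, Optional

# 20 standard amino acids with cysteine excluded, as in the default described above.
ALPHABET = "ADEFGHIKLMNPQRSTVWY"


def propose_canonical_mutation(
    chains: Dict[str, str],
    mutable: Dict[str, List[int]],
    n_mutations: int = 1,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return a copy of `chains` with up to `n_mutations` positions resampled."""
    rng = rng or random.Random()

    # Step 1: pick a chain with probability proportional to its mutable residues.
    names = [name for name in chains if mutable[name]]
    weights = [len(mutable[name]) for name in names]
    chain_name = rng.choices(names, weights=weights, k=1)[0]

    # Step 2: draw positions uniformly, without replacement, from that chain.
    k = min(n_mutations, len(mutable[chain_name]))
    positions = rng.sample(mutable[chain_name], k=k)

    # Step 3: resample each identity (assumption: uniform over ALPHABET,
    # so a position may occasionally be resampled to the same residue).
    seq = list(chains[chain_name])
    for pos in positions:
        seq[pos] = rng.choice(ALPHABET)

    proposal = dict(chains)
    proposal[chain_name] = "".join(seq)
    return proposal


chains = {"binder": "MKTAYIAK", "target": "GSHMC"}
mutable = {"binder": [0, 1, 2, 3], "target": []}
proposal = propose_canonical_mutation(chains, mutable, n_mutations=2, rng=random.Random(0))
```

Chains with no mutable residues get zero weight and are never selected, so a fixed target is left untouched and sequence lengths never change.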
The Monte Carlo loop
BAGEL explores the energy landscape using Markov Chain Monte Carlo (MCMC). At each step $t$, a mutation is proposed and accepted or rejected according to the Metropolis criterion:
$$P_{\text{accept}} = \min\left(1, \exp\left(-\frac{\Delta E}{T_t}\right)\right)$$
where $\Delta E = E_{\text{proposed}} - E_{\text{current}}$ and $T_t$ is the temperature at step $t$.
BAGEL provides three minimizer strategies:
- MonteCarloMinimizer: general MC with a user-defined temperature schedule (constant, custom array, etc.). At constant temperature, this acts as a sampler for generating diverse candidates around an energy basin.
- SimulatedAnnealing: linearly decreasing temperature from $T_{\text{initial}}$ to $T_{\text{final}}$. Used for optimization, gradually narrowing the search.
- SimulatedTempering: cyclical schedule alternating between $T_{\text{low}}$ and $T_{\text{high}}$ across multiple cycles. The high-temperature phases help escape local minima, while low-temperature phases refine the current best solution.
Choosing a minimizer:
- For optimization (finding the best single design), use SimulatedAnnealing or SimulatedTempering.
- For sampling (generating diverse candidates), use MonteCarloMinimizer at constant temperature.
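A generic Metropolis loop with the two non-constant temperature schedules might look like the sketch below. It assumes strictly positive temperatures and uses a toy energy (the number of non-alanine residues) and a toy point mutation. None of the function names or parameters come from BAGEL, and the tempering schedule is assumed to switch abruptly between the two temperatures, starting each cycle hot.

```python
import math
import random
from typing import Callable, List, Tuple

ALPHABET = "ADEFGHIKLMNPQRSTVWY"  # cysteine excluded


def linear_annealing(t_initial: float, t_final: float, n_steps: int) -> List[float]:
    """Temperature decreasing linearly from t_initial to t_final."""
    if n_steps == 1:
        return [t_initial]
    return [t_initial + (t_final - t_initial) * i / (n_steps - 1) for i in range(n_steps)]


def tempering(t_low: float, t_high: float, n_cycles: int,
              steps_low: int, steps_high: int) -> List[float]:
    """Cyclical schedule (assumption: hot phase first, abrupt switches)."""
    return ([t_high] * steps_high + [t_low] * steps_low) * n_cycles


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Accept downhill moves always, uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def run_mc(initial: str, energy: Callable[[str], float],
           propose: Callable[[str, random.Random], str],
           temperatures: List[float], rng: random.Random) -> Tuple[str, float]:
    """Run one MC trajectory and return the lowest-energy sequence seen."""
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    for temperature in temperatures:
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(seq: str) -> float:
    return float(sum(aa != "A" for aa in seq))  # toy objective: poly-alanine


def point_mutation(seq: str, rng: random.Random) -> str:
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


rng = random.Random(0)
start = "MKTVLYGW"
schedule = linear_annealing(1.0, 0.01, 500)
best_seq, best_energy = run_mc(start, toy_energy, point_mutation, schedule, rng)
```

Passing a constant list such as `[0.5] * 500` instead of `schedule` turns the same loop into a fixed-temperature sampler, and passing the output of `tempering` gives the cyclical variant.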
