ESM2 generates per-residue embeddings from protein sequences using Meta’s ESM-2 protein language model. These embeddings capture structural and functional information and can be used for downstream tasks like similarity comparison, clustering, or as features for other models.

Quick example

from boileroom import ESM2

model = ESM2(backend="modal")
result = model.embed("MKTVRQERLKSIVRI")

print(result.embeddings.shape)  # (1, seq_len, embedding_dim)
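The intro mentions similarity comparison as a downstream use. A minimal sketch of comparing two sequences by mean-pooling their per-residue embeddings into fixed-length vectors and taking cosine similarity — the random arrays below are stand-ins for real `result.embeddings` outputs, and the helper names are illustrative, not part of the boileroom API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for result.embeddings from two .embed() calls:
# shape (1, seq_len, embedding_dim), here with the default 1280-dim model.
emb_a = rng.normal(size=(1, 15, 1280))
emb_b = rng.normal(size=(1, 22, 1280))

def mean_pool(embeddings: np.ndarray) -> np.ndarray:
    """Collapse (1, seq_len, dim) per-residue embeddings to a (dim,) vector."""
    return embeddings.mean(axis=1)[0]

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

score = cosine_similarity(mean_pool(emb_a), mean_pool(emb_b))
print(f"similarity: {score:.3f}")  # value in [-1, 1]
```

Mean pooling is one simple choice; any reduction over the `seq_len` axis yields a fixed-length vector suitable for clustering or nearest-neighbor search.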

Methods

.embed()

Generate embeddings for one or more protein sequences.
result = model.embed(sequences, options=None)
- `sequences` (`str | Sequence[str]`, required): A single amino acid sequence string or a list of sequences. Use `":"` to separate chains in a multimer (e.g., `"CHAIN_A:CHAIN_B"`).
- `options` (`dict | None`, default `None`): Per-call configuration overrides. Only dynamic config keys can be set here; static keys raise `ValueError`. See Configuration.
Returns: ESM2Output (see Output below)

Output

The ESM2Output dataclass returned by .embed().

Always included

- `metadata` (`PredictionMetadata`): Prediction metadata with timing information. See PredictionMetadata.
- `embeddings` (`np.ndarray`): Per-residue embeddings from the final transformer layer. Shape: `(batch, seq_len, embedding_dim)`.
- `chain_index` (`np.ndarray`): Per-residue chain assignment. Shape: `(batch, seq_len)`.
- `residue_index` (`np.ndarray`): Per-residue residue numbering. Shape: `(batch, seq_len)`.

Optional

- `hidden_states` (`np.ndarray | None`): Embeddings from all intermediate transformer layers. Shape: `(batch, num_layers, seq_len, embedding_dim)`. Only included when `include_fields` contains `"hidden_states"` or `"*"`.

Model sizes

ESM-2 is available in 6 sizes. Set the model via config={"model_name": "..."}.
| Model name | Parameters | Layers | Hidden dim | Attention heads |
| --- | --- | --- | --- | --- |
| esm2_t6_8M_UR50D | 8M | 6 | 320 | 20 |
| esm2_t12_35M_UR50D | 35M | 12 | 480 | 20 |
| esm2_t30_150M_UR50D | 150M | 30 | 640 | 20 |
| esm2_t33_650M_UR50D | 650M | 33 | 1280 | 20 |
| esm2_t36_3B_UR50D | 3B | 36 | 2560 | 40 |
| esm2_t48_15B_UR50D | 15B | 48 | 5120 | 40 |
The default model is esm2_t33_650M_UR50D (650M parameters).
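For reference in code, the table above can be captured as a lookup of embedding dimensions per variant. The values come straight from the table; the dict name is illustrative, not part of the boileroom API:

```python
# Embedding dimension (hidden dim) per ESM-2 variant, from the table above.
ESM2_EMBEDDING_DIM = {
    "esm2_t6_8M_UR50D": 320,
    "esm2_t12_35M_UR50D": 480,
    "esm2_t30_150M_UR50D": 640,
    "esm2_t33_650M_UR50D": 1280,
    "esm2_t36_3B_UR50D": 2560,
    "esm2_t48_15B_UR50D": 5120,
}

# Last axis of result.embeddings for the default model:
print(ESM2_EMBEDDING_DIM["esm2_t33_650M_UR50D"])  # 1280
```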

Configuration

These keys can be set via config={} at initialization or options={} per call (unless marked static).
| Key | Type | Default | Static | Description |
| --- | --- | --- | --- | --- |
| device | str | "cuda:0" | Yes | GPU device identifier |
| model_name | str | "esm2_t33_650M_UR50D" | Yes | ESM-2 model variant to use (see table above) |
| glycine_linker | str | "" | No | Linker string inserted between chains for multimer tokenization |
| position_ids_skip | int | 512 | No | Position ID offset between chains in multimer mode |
| include_fields | list[str] \| None | None | No | Which output fields to return. Use ["hidden_states"] to include all intermediate layer representations; use ["*"] for everything |

Multimer embedding

Separate chains with ":" in the sequence string:
result = model.embed("MKTVRQERLKSIVRI:LERSKEPVSGAQLAEE")

print(result.chain_index.shape)    # per-residue chain assignment
print(result.residue_index.shape)  # per-residue residue numbering
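`chain_index` can be used to recover per-chain embeddings from the concatenated multimer output. A sketch with synthetic arrays standing in for a real result, assuming the convention that chains are numbered 0, 1, … per residue:

```python
import numpy as np

# Stand-ins for a two-chain multimer result: 15 + 16 residues, 1280-dim model.
embeddings = np.zeros((1, 31, 1280))
chain_index = np.concatenate(
    [np.zeros(15, dtype=int), np.ones(16, dtype=int)]
)[None, :]  # shape (1, 31)

# Boolean masks select each chain's residues along the per-residue axis.
per_chain = {
    chain: embeddings[0][chain_index[0] == chain]
    for chain in np.unique(chain_index)
}

print(per_chain[0].shape)  # (15, 1280)
print(per_chain[1].shape)  # (16, 1280)
```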

Hidden states

To retrieve embeddings from all intermediate transformer layers, include "hidden_states" in include_fields:
model = ESM2(backend="modal")
result = model.embed(
    "MKTVRQERLKSIVRI",
    options={"include_fields": ["hidden_states"]}
)

# hidden_states shape: (batch_size, num_layers, seq_len, embedding_dim)
print(result.hidden_states.shape)
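With `hidden_states` in hand, a single layer can be selected by indexing the `num_layers` axis. A sketch with a synthetic array standing in for `result.hidden_states`; the sizes here are illustrative, not tied to a specific ESM-2 variant:

```python
import numpy as np

# Stand-in for result.hidden_states: (batch, num_layers, seq_len, embedding_dim).
hidden_states = np.zeros((1, 34, 15, 1280))

# Select one layer along axis 1; -1 is the last layer in the stack.
layer_12 = hidden_states[:, 12]      # (batch, seq_len, embedding_dim)
last_layer = hidden_states[:, -1]

print(layer_12.shape)    # (1, 15, 1280)
print(last_layer.shape)  # (1, 15, 1280)
```

Intermediate layers are often useful because different depths capture different levels of structural abstraction; which layer works best is task-dependent.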