.fold() or .embed() the same way, and get the same output types back.
## Backend selection
Set the backend when creating a model:

## Modal backend
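The names below are illustrative, not from the library's documentation: a minimal sketch of what choosing a backend at model creation could look like, assuming a hypothetical `load_model` entry point and the `backend` keyword described in this section.

```python
# Hypothetical constructor; the real entry point may differ.
def load_model(name: str, backend: str = "modal"):
    """Sketch: create a model wrapper bound to an execution backend."""
    class ModelWrapper:
        def __init__(self, name, backend):
            self.name = name
            self.backend = backend
    return ModelWrapper(name, backend)

# backend="modal" is the default, per the setup steps below.
model = load_model("example-model")
print(model.backend)  # modal

# Run in a local Apptainer container instead.
local_model = load_model("example-model", backend="apptainer")
print(local_model.backend)  # apptainer
```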
Modal provides serverless GPU execution. Your code runs locally, but model inference runs on Modal's cloud GPUs. Containers scale to zero when idle and spin up on demand.

### Setup
- Install Modal: `pip install modal`
- Authenticate: `modal token new`
- Use the model with `backend="modal"` (this is the default)
### GPU options
Modal automatically selects a GPU based on the model's requirements. The following GPU types are available on Modal:

| GPU | VRAM |
|---|---|
| T4 | 16 GB |
| L4 | 24 GB |
| A10G | 24 GB |
| A100-40GB | 40 GB |
| A100-80GB | 80 GB |
| L40S | 48 GB |
| H100 | 80 GB |
### Serverless scaling
- Containers cold-start when first called (may take 30-60 seconds)
- Subsequent calls reuse warm containers
- Containers automatically scale down after a period of inactivity
## Apptainer backend
Apptainer (formerly Singularity) runs models in a container on your local machine or HPC cluster. This is useful when you need local execution, cannot use cloud services, or are running on institutional compute.

### Setup
- Install Apptainer on your system
- Use `backend="apptainer"` or `backend="apptainer:tag"` to specify a Docker image tag
### Tag syntax
By default, the Apptainer backend uses the `latest` Docker image tag. You can specify a different tag:
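As a hedged illustration of the `backend` string format, the parsing helper below is invented for this sketch (it is not part of the library); it only mirrors the documented syntax, where the tag defaults to `latest` when omitted.

```python
def parse_backend(spec: str) -> tuple[str, str]:
    """Split a backend spec like "apptainer:tag" into (backend, image tag).

    Illustrative only: the tag falls back to "latest" when omitted,
    matching the documented default.
    """
    backend, _, tag = spec.partition(":")
    return backend, tag or "latest"

print(parse_backend("apptainer"))       # ('apptainer', 'latest')
print(parse_backend("apptainer:v1.2"))  # ('apptainer', 'v1.2')
```

In practice, a tagged spec would be passed directly when creating the model, e.g. `backend="apptainer:v1.2"`.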
### Local execution
The Apptainer backend runs the model on your local GPU. Ensure you have a CUDA-compatible GPU with sufficient VRAM for the model.

## Context manager
Both backends support context manager usage to ensure proper cleanup. Without a context manager, cleanup happens when the `ModelWrapper` instance is garbage collected or when the Python process exits.
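A sketch of the context-manager pattern, using a stub class that only imitates the documented cleanup behavior (the real `ModelWrapper` is assumed here, not reproduced):

```python
class ModelWrapper:
    """Stub imitating the documented behavior: resources are released
    by the context manager, or as a fallback at garbage collection."""

    def __init__(self, name: str, backend: str = "modal"):
        self.name = name
        self.backend = backend
        self.closed = False

    def close(self):
        # Release containers / GPU resources here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions

    def __del__(self):
        self.close()  # fallback cleanup at garbage collection


# The `with` block guarantees cleanup even if inference raises.
with ModelWrapper("example-model", backend="apptainer") as model:
    pass  # model.fold(...) / model.embed(...) calls would go here
print(model.closed)  # True
```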
## Device selection
The `device` parameter works the same way across backends:
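To illustrate (again with a hypothetical `load_model`; the accepted `device` values are an assumption based on typical PyTorch-style device strings, not taken from the library's documentation):

```python
# Hypothetical constructor; shown only to illustrate that `device`
# is passed identically regardless of backend.
def load_model(name: str, backend: str = "modal", device: str = "cuda"):
    class ModelWrapper:
        def __init__(self, name, backend, device):
            self.name, self.backend, self.device = name, backend, device
    return ModelWrapper(name, backend, device)

# Same keyword on the serverless backend...
remote = load_model("example-model", backend="modal", device="cuda")
# ...and on the local one.
local = load_model("example-model", backend="apptainer", device="cuda:0")
print(remote.device, local.device)
```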
