SmolVLA is easy to use for fine-tuning or integration into robotics workflows.
- Install uv:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Make sure you have a working GPU and CUDA drivers (tested: Ubuntu 24.04, RTX 5080, CUDA 12.8).
- Compile and install FFmpeg 7.1.1 from source (guide)
**Note: Tested System Configuration**
- OS: Ubuntu 24.04.3 LTS (x86_64)
- Kernel: 6.16.0-061600-generic
- GPU: NVIDIA GeForce RTX 5080
- NVIDIA Driver: 580.65.06
- CUDA: 12.8
- Python: 3.10
- uv: 0.8.4
- FFmpeg: 7.1.1
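Before going further, a quick pre-flight check against the configuration above can be scripted. The helper below is an illustrative sketch (not part of the repo): it reports the GPU name and driver version via `nvidia-smi`, or a hint if the driver is missing.

```python
import shutil
import subprocess

def gpu_driver_info() -> str:
    """Return the GPU name and driver version via nvidia-smi, or a hint if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found: install the NVIDIA driver first"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    # Fall back to a hint rather than crashing if nvidia-smi reports nothing.
    return out.stdout.strip() or "nvidia-smi found, but it reported no GPUs"

print(gpu_driver_info())
```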
Clone the repo and install SmolVLA dependencies:
```shell
git clone --recurse-submodules https://github.com/weijieyong/smolvla-ws.git
# If you already cloned without submodules:
# git submodule update --init --recursive
cd smolvla-ws/lerobot
uv venv --python 3.10
uv pip install -e ".[smolvla]"
```

Use nightly builds for torch/torchcodec for compatibility:
```shell
uv pip uninstall torch torchvision torchcodec
uv pip install --pre torch torchvision torchcodec --index-url https://download.pytorch.org/whl/nightly/cu128
```

Login to Hugging Face and Weights & Biases:

```shell
hf auth login
wandb login
```

Disable tokenizers parallelism to silence fork-related warnings:

```shell
export TOKENIZERS_PARALLELISM=false
```

Run fine-tuning on a base model with a HuggingFace dataset:
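Before launching training, you can sanity-check the freshly installed stack. This is an illustrative helper (not part of LeRobot) that reports installed versions without crashing on a machine where some of the packages are missing:

```python
def stack_report() -> str:
    """Report torch / torchvision / torchcodec availability and CUDA status."""
    lines = []
    for mod in ("torch", "torchvision", "torchcodec"):
        try:
            m = __import__(mod)
            lines.append(f"{mod} {getattr(m, '__version__', '?')}")
        except ImportError:
            lines.append(f"{mod}: not installed")
    try:
        import torch
        lines.append(f"CUDA available: {torch.cuda.is_available()}")
    except ImportError:
        pass  # no torch, nothing more to report
    return "\n".join(lines)

print(stack_report())
```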
Adjust the `batch_size` value based on your GPU's VRAM to prevent OOM errors.
```shell
# from the lerobot dir
uv run src/lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=aractingi/il_gym0 \
  --batch_size=48 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true \
  --policy.push_to_hub=false
```

More about the training script here.
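The batch-size advice above amounts to the usual OOM strategy: halve `batch_size` until a training step fits in VRAM. A minimal sketch, where `try_step` is a hypothetical callable that raises `RuntimeError` on CUDA OOM (as PyTorch does):

```python
def find_fitting_batch_size(try_step, start: int = 48, minimum: int = 1) -> int:
    """Halve the batch size until try_step succeeds; raise if nothing fits."""
    bs = start
    while bs >= minimum:
        try:
            try_step(bs)  # run one training step at this batch size
            return bs
        except RuntimeError:
            bs //= 2  # assume OOM: halve and retry
    raise RuntimeError("even the minimum batch size does not fit")

# Example with a fake 16-sample VRAM limit: 48 -> 24 -> 12.
def fake_step(batch_size: int) -> None:
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory")

print(find_fitting_batch_size(fake_step))  # 12
```

In practice you would rerun the training command with the halved `--batch_size` rather than loop inside the script.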
- Visualizing a dataset with Rerun (the example below uses `lerobot/aloha_static_coffee_new`):
```shell
# from the lerobot dir
uv run src/lerobot/scripts/visualize_dataset.py \
  --repo-id lerobot/aloha_static_coffee_new \
  --episode-index 0
```