Welcome to the OpenDWM project! OpenDWM is an open-source initiative focused on autonomous driving video generation. Our mission is to provide a high-quality, controllable tool for generating autonomous driving videos with the latest technology. We aim to build a codebase that is both user-friendly and highly reusable, and we hope to continuously improve the project through the collective wisdom of the community.
The driving world models generate multi-view images or videos of autonomous driving scenes based on text and road environment layout conditions. Whether it is the environment, weather, vehicle type, or driving path, you can adjust them according to your needs.
The highlights are as follows:
- Transparent and reproducible training. We provide complete training code and configurations, allowing everyone to reproduce experiments, fine-tune on their own data, and customize development features as needed.
- Significant improvement in environmental diversity. By training on multiple datasets, the model's generalization ability is greatly enhanced. Take layout-conditioned generation as an example: scenes such as a snowy city street or a lakeside highway with distant snow mountains are impossible for generative models trained on a single dataset.
- Greatly improved generation quality. Support for popular model architectures (SD 2.1, 3.5) makes it easier to leverage the advanced pre-trained generation capabilities within the community. Various training techniques, including multitasking and self-supervision, allow the model to use the information in autonomous driving video data more effectively.
- Convenient evaluation. Evaluation follows the popular torchmetrics framework, which is easy to configure, develop, and integrate into the pipeline. Public configurations (such as FID and FVD on the nuScenes validation set) are provided for alignment with other research works.
Furthermore, our code modules are designed with high reusability in mind, for easy application in other projects.
Currently, the project has implemented the following papers:
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen<sup>1,2</sup>, Zehuan Wu<sup>2</sup>, Yichen Liu<sup>2</sup>, Yuxin Guo<sup>2</sup>, Jingcheng Ni<sup>2</sup>, Haifeng Xia<sup>1</sup>, Siyu Xia<sup>1</sup>
<sup>1</sup>Southeast University, <sup>2</sup>SenseTime Research
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
SenseTime Research
- [2025/5/6] Release the CTSD 3.5 with CogVideoX VAE for faster generation.
- [2025/4/23] Update the LiDAR VQVAE (including KITTI-360), LiDAR generation models, and release the DFoT on CTSD 3.5 model.
- [2025/3/17] Experimental release of the Interactive Generation with Carla
- [2025/3/7] Release the LiDAR Generation
- [2025/3/4] Release the CTSD 3.5 with layout condition
- [2025/2/7] Release the UniMLVG
Hardware requirements:
- Training and testing multi-view image or short video generation (<= 6 frames per iteration) requires 32 GB of GPU memory (e.g. V100)
- Training and testing multi-view long video generation (6 ~ 40 frames per iteration) requires 80 GB of GPU memory (e.g. A100, H100)
Software requirements:
- git (>= 2.25)
- python (>= 3.9)
Install PyTorch >= 2.5:

```
python -m pip install torch==2.5.1 torchvision==0.20.1
```
Clone the repository, then install the dependencies.

```
git clone https://github.com/SenseTime-FVG/OpenDWM.git
cd OpenDWM
git submodule update --init --recursive
python -m pip install -r requirements.txt
```
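As an optional sanity check (assuming the steps above completed without errors), you can verify that the pinned PyTorch build is importable and sees your GPUs:

```python
# Optional environment check: confirms torch/torchvision import correctly
# and that CUDA devices are visible for training and inference.
import torch
import torchvision

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```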
Our cross-view temporal SD (CTSD) pipeline supports loading the pretrained SD 2.1, 3.0, 3.5, or the checkpoints we trained on the autonomous driving datasets.
| Base model | Text conditioned driving generation | Text and layout (box, map) conditioned driving generation |
|---|---|---|
| SD 2.1 | Config, Download | Config, Download |
| SD 3.0 | UniMLVG Config, Download | |
| SD 3.5 | Config, Download | Config, Download |
| DFoT on SD 3.5 | Config, Download | |
| SD 3.5 with CogVideoX VAE | Config, Download | |
The FVD evaluation results for all downloadable models can be found at the bottom of the corresponding configuration files.
You can download our pre-trained tokenizers and generation models from the following links.
| Model Architecture | Dataset | Configs | Checkpoint Download |
|---|---|---|---|
| VQVAE | nuscene, waymo, argoverse | Config | checkpoint, blank code |
| | nuscene, waymo, argoverse, kitti360 | Config | checkpoint, blank code |
| VAE | nuscene, waymo, argoverse, kitti360 | Config | checkpoint |
| MaskGIT | nuscene | Config | ckpt_with_vqvae_nwa ckpt_with_vqvae_nwak |
| | kitti360 | Config | checkpoint |
| Temporal MaskGIT | nuscene | Config | checkpoint |
| | kitti360 | Config | checkpoint |
| Temporal DiT | nuscene | Config | checkpoint |
| | kitti360 | Config | checkpoint |
Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, edit the path and prompts in the JSON config, then run this command.

```
PYTHONPATH=src python examples/ctsd_generation_example.py -c examples/ctsd_35_6views_image_generation.json -o output/ctsd_35_6views_image_generation
```
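If you are unsure which fields of the example config hold the model paths and prompts, a quick way to find them is to load the JSON and print its top-level structure (a generic sketch; the exact key names depend on the config file):

```python
# Print the top-level keys of the example config so you can locate the
# model path and prompt fields to edit (key names vary between configs).
import json

with open("examples/ctsd_35_6views_image_generation.json") as f:
    config = json.load(f)

for key, value in config.items():
    print(key, "->", type(value).__name__)
```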
- Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, and edit the path in the JSON config.
- Download the layout resource package (nuscenes_scene-0627_package.zip, or carla_town04_package) and unzip it to `{RESOURCE_PATH}`. Then edit the meta path as `{RESOURCE_PATH}/data.json` in the JSON config (a small unpacking sketch follows the command below).
- Run this command to generate the video.

```
PYTHONPATH=src python src/dwm/preview.py -c examples/ctsd_35_6views_video_generation_with_layout.json -o output/ctsd_35_6views_video_generation_with_layout
```
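The unpacking step in the list above can be done with any archive tool; here is a small Python sketch, assuming the archive unpacks `data.json` at its top level (the `RESOURCE_PATH` value is a placeholder and must match the path in your JSON config):

```python
# Unpack the layout resource package so that {RESOURCE_PATH}/data.json exists,
# matching the meta path referenced in the JSON config.
import zipfile

RESOURCE_PATH = "/data/layout_resources"  # placeholder: choose your own location

with zipfile.ZipFile("nuscenes_scene-0627_package.zip") as package:
    package.extractall(RESOURCE_PATH)

print("Meta file:", f"{RESOURCE_PATH}/data.json")
```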
- Download the LiDAR VQVAE and LiDAR MaskGIT generation model checkpoints.
- Prepare the dataset (nuscenes_scene-0627_lidar_package.zip).
- Modify the values of `json_file`, `vq_point_cloud_ckpt_path`, `vq_blank_code_path` and `model_ckpt_path` to the paths of your dataset and checkpoints in the JSON file `examples/lidar_maskgit_preview.json` or `examples/lidar_maskgit_temporal_preview.json` (see the sketch after this block).
- For single-frame LiDAR generation, run the following command to visualize the LiDAR of the validation set and save the generated point cloud as a `.bin` file.

```
PYTHONPATH=src python src/dwm/preview.py -c examples/lidar_maskgit_preview.json -o output/single_frame_maskgit
```

- For LiDAR sequence generation, enable the `enable_autoregressive_inference` flag in the config file to support autoregressive generation. If you would like to use ground-truth data as reference frames, set `use_ground_truth_as_reference` to `true`; alternatively, set it to `false` to generate from the layout condition only. After setting up the config file, run the following command.

```
PYTHONPATH=src python3 -m torch.distributed.run --nnodes 1 --nproc-per-node 2 --node-rank 0 --master-addr 127.0.0.1 --master-port 29000 src/dwm/preview.py -c examples/lidar_maskgit_temporal_preview.json -o output/temporal_maskgit
```
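For the "modify the values" step above, here is a minimal sketch that edits the preview config programmatically. The four key names are the ones listed above; every path value is a placeholder for your local files, and the keys are assumed to sit at the top level of the config (adjust if they are nested). The LiDAR diffusion config in the next block can be edited the same way with its own keys.

```python
# Fill in the dataset and checkpoint paths of the LiDAR MaskGIT preview config.
# Path values below are placeholders; replace them with your local files.
import json

config_path = "examples/lidar_maskgit_preview.json"  # or lidar_maskgit_temporal_preview.json

with open(config_path) as f:
    config = json.load(f)

# Keys as listed in the step above; assumed to be top-level entries.
config["json_file"] = "/data/nuscenes_scene-0627_lidar_package/data.json"
config["vq_point_cloud_ckpt_path"] = "/checkpoints/lidar_vqvae.pth"
config["vq_blank_code_path"] = "/checkpoints/blank_code.json"
config["model_ckpt_path"] = "/checkpoints/lidar_maskgit.pth"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```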
- Download the LiDAR VAE and LiDAR Diffusion generation model checkpoints.
- Prepare the dataset (nuscenes_scene-0627_lidar_package.zip).
- Modify the values of `json_file`, `autoencoder_ckpt_path`, and `diffusion_model_ckpt_path` to the paths of your dataset and checkpoints in the JSON file `examples/lidar_diffusion_temporal_preview.json`.
- Run the following command to generate LiDAR data autoregressively according to the reference frame.

```
PYTHONPATH=src python3 -m torch.distributed.run --nnodes 1 --nproc-per-node 2 --node-rank 0 --master-addr 127.0.0.1 --master-port 29000 src/dwm/preview.py -c examples/lidar_diffusion_temporal_preview.json -o output/temporal_diffusion
```

Preparation:
- Download the base models.
- Download and process datasets.
- Edit the configuration file (mainly the path of the model and dataset under the user environment).
Once the config file is updated with the correct model and data information, launch training by:
```
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Or distributed training by:
```
OMP_NUM_THREADS=1 TOKENIZERS_PARALLELISM=false PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python -m torch.distributed.run --nnodes $WORLD_SIZE --nproc-per-node 8 --node-rank $RANK --master-addr $MASTER_ADDR --master-port $MASTER_PORT src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Then you can check the preview under `output/{YOUR_WORKSPACE}/preview`, and get the checkpoint files from `output/{YOUR_WORKSPACE}/checkpoints`.
Some training tasks require multiple stages (for the configurations named `train_warmup.json` and `train.json`). Fill the path of the checkpoint saved by the previous stage into the configuration of the following stage, then launch the training of that stage, as sketched below.
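A minimal sketch of that hand-off, assuming the second-stage config exposes a checkpoint-path field; the key name `pretrained_model_ckpt` below is hypothetical, so check the actual field in your `train.json` before using it:

```python
# Hypothetical sketch: wire the warmup checkpoint into the next stage's config.
# "pretrained_model_ckpt" is an assumed key name -- look up the real field name
# in your train.json; all paths are placeholders.
import json

warmup_ckpt = "output/my_warmup_run/checkpoints/latest.pth"  # saved by the warmup stage
stage2_config = "configs/your_task/train.json"               # placeholder path

with open(stage2_config) as f:
    config = json.load(f)

config["pretrained_model_ckpt"] = warmup_ckpt  # assumed key name
with open(stage2_config, "w") as f:
    json.dump(config, f, indent=2)
```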
We have integrated FID and FVD metric evaluation into the pipeline; this involves filling in the validation set (source, sampling interval) and the evaluation parameters (for example, the number of frames of each video segment measured by FVD) in the configuration file.
Evaluation is then invoked as follows.
```
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/evaluate.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Or run distributed evaluation with `torch.distributed.run`, similar to distributed training.
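For reference, the metrics module builds on the standard torchmetrics interface; a minimal FID sketch with dummy tensors (not the project's exact wiring, just the underlying torchmetrics API) looks like this:

```python
# Minimal torchmetrics FID usage: accumulate real and generated images, then
# compute the score. Default settings expect uint8 NCHW images in [0, 255];
# replace the random tensors with real data. Requires torchmetrics[image]
# (which pulls in torch-fidelity) to be installed.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

real = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
```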
- `configs` The config files for data and pipeline with different arguments.
- `examples` The inference code and configurations.
- `externals` The dependency projects.
- `src/dwm` The shared components of this project.
  - `datasets` implements `torch.utils.data.Dataset` for our training pipeline by reading multi-view, LiDAR and temporal data, with optional text, 3D box, HD map, pose, and camera parameters as conditions.
  - `fs` provides flexible access methods following `fsspec` to data stored in ZIP blobs or in S3-compatible storage services.
  - `metrics` implements `torchmetrics`-compatible classes for quantitative evaluation.
  - `models` implements generation models and their building blocks.
  - `pipelines` implements the training logic for different models.
  - `tools` provides dataset and file processing scripts for faster initialization and reading.
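As an illustration of the fsspec-style access that `fs` follows (plain fsspec shown here, not the project's own `dwm.fs` classes), reading a member file straight out of a ZIP blob looks like this:

```python
# Open a file stored inside a ZIP archive without unpacking it, using fsspec's
# zip filesystem; an S3-compatible store works the same way via the "s3"
# protocol (with s3fs installed). Paths below are placeholders.
import fsspec

zip_fs = fsspec.filesystem("zip", fo="/data/nuscenes_scene-0627_package.zip")
print(zip_fs.ls("/"))  # list archive members

with zip_fs.open("data.json", "r") as f:
    print(f.read()[:200])
```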
Introduction to the file system and the dataset.
If you find our OpenDWM useful in your research or refer to the provided baseline results, please star ⭐ this repository and consider citing our repo or papers 📝:
```
@misc{opendwm,
  title={OpenDWM: Open Driving World Models},
  year={2025},
  note={https://github.com/SenseTime-FVG/OpenDWM}
}

@article{chen2024unimlvg,
  title={UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving},
  author={Chen, Rui and Wu, Zehuan and Liu, Yichen and Guo, Yuxin and Ni, Jingcheng and Xia, Haifeng and Xia, Siyu},
  journal={arXiv preprint arXiv:2412.04842},
  year={2024}
}

@article{ni2025maskgwm,
  title={MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction},
  author={Ni, Jingcheng and Guo, Yuxin and Liu, Yichen and Chen, Rui and Lu, Lewei and Wu, Zehuan},
  journal={arXiv preprint arXiv:2502.11663},
  year={2025}
}
```