Welcome to the OpenDWM project! OpenDWM is an open-source initiative focused on autonomous driving video generation. Our mission is to provide a high-quality, controllable tool for generating autonomous driving videos with the latest technology. We aim to build a codebase that is both user-friendly and highly reusable, and we hope to continuously improve the project through the collective wisdom of the community.
The driving world models generate multi-view images or videos of autonomous driving scenes based on text and road environment layout conditions. Whether it is the environment, weather, vehicle type, or driving path, you can adjust them according to your needs.
The highlights are as follows:
- Transparent and reproducible training. We provide complete training code and configurations, allowing everyone to reproduce experiments, fine-tune on their own data, and customize development features as needed.
- Significant improvement in environmental diversity. By training on multiple datasets, the model's generalization ability is greatly enhanced. Take layout-conditioned generation as an example: scenes such as a snowy city street or a lakeside highway with distant snow mountains are impossible for generative models trained on a single dataset.
- Greatly improved generation quality. Support for popular model architectures (SD 2.1, 3.5) makes it easier to leverage the advanced pre-trained generation capabilities within the community. Various training techniques, including multitasking and self-supervision, allow the model to use the information in autonomous driving video data more effectively.
- Convenient evaluation. Evaluation follows the popular torchmetrics framework, which is easy to configure, develop, and integrate into the pipeline. Public configurations (such as FID and FVD on the nuScenes validation set) are provided for alignment with other research works.
Furthermore, our code modules are designed with high reusability in mind, for easy application in other projects.
Currently, the project has implemented the following papers:
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen<sup>1,2</sup>, Zehuan Wu<sup>2</sup>, Yichen Liu<sup>2</sup>, Yuxin Guo<sup>2</sup>, Jingcheng Ni<sup>2</sup>, Haifeng Xia<sup>1</sup>, Siyu Xia<sup>1</sup>
<sup>1</sup>Southeast University, <sup>2</sup>SenseTime Research
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
SenseTime Research
- [2025/5/6] Release the CTSD 3.5 with CogVideoX VAE for faster generation.
- [2025/4/23] Update the LiDAR VQVAE (including KITTI-360), LiDAR generation models, and release the DFoT on CTSD 3.5 model.
- [2025/3/17] Experimental release of the Interactive Generation with Carla
- [2025/3/7] Release the LiDAR Generation
- [2025/3/4] Release the CTSD 3.5 with layout condition
- [2025/2/7] Release the UniMLVG
Hardware requirements:
- Training and testing multi-view image or short video generation (<= 6 frames per iteration) requires 32 GB of GPU memory (e.g. V100)
- Training and testing multi-view long video generation (6 ~ 40 frames per iteration) requires 80 GB of GPU memory (e.g. A100, H100)
Software requirements:
- git (>= 2.25)
- python (>= 3.9)
Install PyTorch >= 2.5:

```
python -m pip install torch==2.5.1 torchvision==0.20.1
```
Clone the repository, then install the dependencies.

```
git clone https://github.com/SenseTime-FVG/OpenDWM.git
cd OpenDWM
git submodule update --init --recursive
python -m pip install -r requirements.txt
```
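As an optional sanity check (assuming the steps above completed without errors), you can verify that the pinned PyTorch build is importable and sees your GPUs:

```python
# Optional environment check: confirms torch/torchvision import correctly
# and that CUDA devices are visible for training and inference.
import torch
import torchvision

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```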
Our cross-view temporal SD (CTSD) pipeline supports loading the pretrained SD 2.1, 3.0, 3.5, or the checkpoints we trained on the autonomous driving datasets.
| Base model | Text conditioned driving generation | Text and layout (box, map) conditioned driving generation |
|---|---|---|
| SD 2.1 | Config, Download | Config, Download |
| SD 3.0 | UniMLVG Config, Download | |
| SD 3.5 | Config, Download | Config, Download |
| DFoT on SD 3.5 | Config, Download | |
| SD 3.5 with CogVideoX VAE | Config, Download | |
The FVD evaluation results for all downloadable models can be found at the bottom of the corresponding configuration files.
You can download our pre-trained tokenizers and generation models from the following links.
| Model Architecture | Dataset | Configs | Checkpoint Download |
|---|---|---|---|
| VQVAE | nuscene, waymo, argoverse | Config | checkpoint, blank code |
| | nuscene, waymo, argoverse, kitti360 | Config | checkpoint, blank code |
| VAE | nuscene, waymo, argoverse, kitti360 | Config | checkpoint |
| MaskGIT | nuscene | Config | ckpt_with_vqvae_nwa ckpt_with_vqvae_nwak |
| | kitti360 | Config | checkpoint |
| Temporal MaskGIT | nuscene | Config | checkpoint |
| | kitti360 | Config | checkpoint |
| Temporal DiT | nuscene | Config | checkpoint |
| | kitti360 | Config | checkpoint |
Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, edit the path and prompts in the JSON config, then run this command.

```
PYTHONPATH=src python examples/ctsd_generation_example.py -c examples/ctsd_35_6views_image_generation.json -o output/ctsd_35_6views_image_generation
```
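If you are unsure which fields of the example config hold the model paths and prompts, a quick way to find them is to load the JSON and print its top-level structure (a generic sketch; the exact key names depend on the config file):

```python
# Print the top-level keys of the example config so you can locate the
# model path and prompt fields to edit (key names vary between configs).
import json

with open("examples/ctsd_35_6views_image_generation.json") as f:
    config = json.load(f)

for key, value in config.items():
    print(key, "->", type(value).__name__)
```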
- Download the base model (for VAE, text encoders, scheduler config) and the driving generation model checkpoint, and edit the path in the JSON config.
- Download the layout resource package (nuscenes_scene-0627_package.zip, or carla_town04_package) and unzip it to `{RESOURCE_PATH}`. Then edit the meta path as `{RESOURCE_PATH}/data.json` in the JSON config (a small unpacking sketch follows the command below).
- Run this command to generate the video.

```
PYTHONPATH=src python src/dwm/preview.py -c examples/ctsd_35_6views_video_generation_with_layout.json -o output/ctsd_35_6views_video_generation_with_layout
```
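The unpacking step in the list above can be done with any archive tool; here is a small Python sketch, assuming the archive unpacks `data.json` at its top level (the `RESOURCE_PATH` value is a placeholder and must match the path in your JSON config):

```python
# Unpack the layout resource package so that {RESOURCE_PATH}/data.json exists,
# matching the meta path referenced in the JSON config.
import zipfile

RESOURCE_PATH = "/data/layout_resources"  # placeholder: choose your own location

with zipfile.ZipFile("nuscenes_scene-0627_package.zip") as package:
    package.extractall(RESOURCE_PATH)

print("Meta file:", f"{RESOURCE_PATH}/data.json")
```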
- Download the LiDAR VQVAE and LiDAR MaskGIT generation model checkpoints.
- Prepare the dataset (nuscenes_scene-0627_lidar_package.zip).
- Modify the values of `json_file`, `vq_point_cloud_ckpt_path`, `vq_blank_code_path` and `model_ckpt_path` to the paths of your dataset and checkpoints in the JSON file `examples/lidar_maskgit_preview.json` or `examples/lidar_maskgit_temporal_preview.json` (see the sketch after this block).
- For single-frame LiDAR generation, run the following command to visualize the LiDAR of the validation set and save the generated point cloud as a `.bin` file.

```
PYTHONPATH=src python src/dwm/preview.py -c examples/lidar_maskgit_preview.json -o output/single_frame_maskgit
```

- For LiDAR sequence generation, enable the `enable_autoregressive_inference` flag in the config file to support autoregressive generation. If you would like to use ground-truth data as reference frames, set `use_ground_truth_as_reference` to `true`; alternatively, set it to `false` to generate from the layout condition only. After setting up the config file, run the following command.

```
PYTHONPATH=src python3 -m torch.distributed.run --nnodes 1 --nproc-per-node 2 --node-rank 0 --master-addr 127.0.0.1 --master-port 29000 src/dwm/preview.py -c examples/lidar_maskgit_temporal_preview.json -o output/temporal_maskgit
```
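For the "modify the values" step above, here is a minimal sketch that edits the preview config programmatically. The four key names are the ones listed above; every path value is a placeholder for your local files, and the keys are assumed to sit at the top level of the config (adjust if they are nested). The LiDAR diffusion config in the next block can be edited the same way with its own keys.

```python
# Fill in the dataset and checkpoint paths of the LiDAR MaskGIT preview config.
# Path values below are placeholders; replace them with your local files.
import json

config_path = "examples/lidar_maskgit_preview.json"  # or lidar_maskgit_temporal_preview.json

with open(config_path) as f:
    config = json.load(f)

# Keys as listed in the step above; assumed to be top-level entries.
config["json_file"] = "/data/nuscenes_scene-0627_lidar_package/data.json"
config["vq_point_cloud_ckpt_path"] = "/checkpoints/lidar_vqvae.pth"
config["vq_blank_code_path"] = "/checkpoints/blank_code.json"
config["model_ckpt_path"] = "/checkpoints/lidar_maskgit.pth"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```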
- Download the LiDAR VAE and LiDAR Diffusion generation model checkpoints.
- Prepare the dataset (nuscenes_scene-0627_lidar_package.zip).
- Modify the values of `json_file`, `autoencoder_ckpt_path`, and `diffusion_model_ckpt_path` to the paths of your dataset and checkpoints in the JSON file `examples/lidar_diffusion_temporal_preview.json`.
- Run the following command to generate LiDAR data autoregressively according to the reference frame.

```
PYTHONPATH=src python3 -m torch.distributed.run --nnodes 1 --nproc-per-node 2 --node-rank 0 --master-addr 127.0.0.1 --master-port 29000 src/dwm/preview.py -c examples/lidar_diffusion_temporal_preview.json -o output/temporal_diffusion
```

Preparation:
- Download the base models.
- Download and process datasets.
- Edit the configuration file (mainly the path of the model and dataset under the user environment).
Once the config file is updated with the correct model and data information, launch training by:
```
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Or distributed training by:
```
OMP_NUM_THREADS=1 TOKENIZERS_PARALLELISM=false PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python -m torch.distributed.run --nnodes $WORLD_SIZE --nproc-per-node 8 --node-rank $RANK --master-addr $MASTER_ADDR --master-port $MASTER_PORT src/dwm/train.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Then you can check the preview under `output/{YOUR_WORKSPACE}/preview`, and get the checkpoint files from `output/{YOUR_WORKSPACE}/checkpoints`.
Some training tasks require multiple stages (for the configurations named `train_warmup.json` and `train.json`). Fill the path of the checkpoint saved by the previous stage into the configuration of the following stage, then launch the training of that stage, as sketched below.
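A minimal sketch of that hand-off, assuming the second-stage config exposes a checkpoint-path field; the key name `pretrained_model_ckpt` below is hypothetical, so check the actual field in your `train.json` before using it:

```python
# Hypothetical sketch: wire the warmup checkpoint into the next stage's config.
# "pretrained_model_ckpt" is an assumed key name -- look up the real field name
# in your train.json; all paths are placeholders.
import json

warmup_ckpt = "output/my_warmup_run/checkpoints/latest.pth"  # saved by the warmup stage
stage2_config = "configs/your_task/train.json"               # placeholder path

with open(stage2_config) as f:
    config = json.load(f)

config["pretrained_model_ckpt"] = warmup_ckpt  # assumed key name
with open(stage2_config, "w") as f:
    json.dump(config, f, indent=2)
```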
We have integrated FID and FVD metric evaluation into the pipeline; this involves filling in the validation set (source, sampling interval) and the evaluation parameters (for example, the number of frames of each video segment measured by FVD) in the configuration file.
Evaluation is then invoked as follows.
```
PYTHONPATH=src:externals/waymo-open-dataset/src:externals/TATS/tats/fvd python src/dwm/evaluate.py -c {YOUR_CONFIG} -o output/{YOUR_WORKSPACE}
```
Or run distributed evaluation with `torch.distributed.run`, similar to distributed training.
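For reference, the metrics module builds on the standard torchmetrics interface; a minimal FID sketch with dummy tensors (not the project's exact wiring, just the underlying torchmetrics API) looks like this:

```python
# Minimal torchmetrics FID usage: accumulate real and generated images, then
# compute the score. Default settings expect uint8 NCHW images in [0, 255];
# replace the random tensors with real data. Requires torchmetrics[image]
# (which pulls in torch-fidelity) to be installed.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

real = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
```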
- `configs` The config files for data and pipeline with different arguments.
- `examples` The inference code and configurations.
- `externals` The dependency projects.
- `src/dwm` The shared components of this project.
  - `datasets` implements `torch.utils.data.Dataset` for our training pipeline by reading multi-view, LiDAR and temporal data, with optional text, 3D box, HD map, pose, and camera parameters as conditions.
  - `fs` provides flexible access methods following `fsspec` to data stored in ZIP blobs or in S3-compatible storage services.
  - `metrics` implements `torchmetrics`-compatible classes for quantitative evaluation.
  - `models` implements generation models and their building blocks.
  - `pipelines` implements the training logic for different models.
  - `tools` provides dataset and file processing scripts for faster initialization and reading.
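As an illustration of the fsspec-style access that `fs` follows (plain fsspec shown here, not the project's own `dwm.fs` classes), reading a member file straight out of a ZIP blob looks like this:

```python
# Open a file stored inside a ZIP archive without unpacking it, using fsspec's
# zip filesystem; an S3-compatible store works the same way via the "s3"
# protocol (with s3fs installed). Paths below are placeholders.
import fsspec

zip_fs = fsspec.filesystem("zip", fo="/data/nuscenes_scene-0627_package.zip")
print(zip_fs.ls("/"))  # list archive members

with zip_fs.open("data.json", "r") as f:
    print(f.read()[:200])
```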
Introduction to the file system and the dataset.
If you find our OpenDWM useful in your research or refer to the provided baseline results, please star ⭐ this repository and consider citing our repo or papers 📝:
```
@misc{opendwm,
  title={OpenDWM: Open Driving World Models},
  year={2025},
  note={https://github.com/SenseTime-FVG/OpenDWM}
}

@article{chen2024unimlvg,
  title={UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving},
  author={Chen, Rui and Wu, Zehuan and Liu, Yichen and Guo, Yuxin and Ni, Jingcheng and Xia, Haifeng and Xia, Siyu},
  journal={arXiv preprint arXiv:2412.04842},
  year={2024}
}

@article{ni2025maskgwm,
  title={MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction},
  author={Ni, Jingcheng and Guo, Yuxin and Liu, Yichen and Chen, Rui and Lu, Lewei and Wu, Zehuan},
  journal={arXiv preprint arXiv:2502.11663},
  year={2025}
}
```