Although video depth estimation models should be trained mainly on synthetic RGB-D video datasets, I decided to add two synthetic RGB-D image datasets because of their unique features.
| | Dataset | Venue | Resolution | Unique features |
|---|---|---|---|---|
| 1 | SynthHuman 📌 Human faces 😍 | | 384×512 | The dataset contains 98,040 samples featuring the face, 99,976 samples featuring the full body and 99,992 samples featuring the upper body. DAViD trained on this dataset alone achieved better depth estimation results than Depth Anything V2 Large, Depth Pro and even Sapiens-2B on the Goliath-Face test set. See the results in Table 2. |
| 2 | MegaSynth | | 512×512 | Huge size: 700K scenes. Fine-tuning Depth Anything V2 ViT-B on MegaSynth yields a remarkable improvement in depth estimation results when evaluated on Hypersim. See the results in Table 6. |
The following list contains only synthetic RGB-D datasets in which at least some of the images can be combined into a video sequence of at least 32 frames. This minimum number of frames was chosen on the basis of the ablation studies in Table 5 of the Video Depth Anything paper.
Most datasets contain ready-to-use video sequences of appropriately numbered images in individual folders. In the case of the PLT-D3 dataset, however, images from at least two folders have to be combined to make a longer video sequence, and in the case of the ClaraVid dataset, images have to be arranged in the correct order to make a 32-frame video sequence, for example in the order given in Appendix 4: Notes for "Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models".
Researchers, if you are going to use the following list to select datasets for training your models, check their quality very carefully and choose only the best ones. I have visually checked just a few of them, and have marked on the list two datasets to check particularly carefully and two datasets that, in my opinion, are not suitable for training video depth estimation models. The reasons for these markings are given in the same Appendix 4.
When selecting the best datasets, comparisons of their quality can be very helpful, such as those in Table 9, Table 6 and another Table 6 for depth estimation models, and TABLE V plus TABLE IV for stereo matching models, although a similar technique can also be used for depth estimation models.
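For datasets that already ship as folders of appropriately numbered frames, assembling clips that meet the 32-frame minimum is mechanical. Below is a minimal Python sketch, assuming zero-padded file names so that lexicographic order equals temporal order; the folder path and function name are illustrative, not taken from any dataset's tooling:

```python
from pathlib import Path

def clips_of_32(frame_dir, clip_len=32, pattern="*.png"):
    """Split a folder of zero-padded, numbered frames into
    consecutive, non-overlapping clips of clip_len frames.
    Any trailing remainder shorter than clip_len is dropped."""
    frames = sorted(Path(frame_dir).glob(pattern))
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

# Hypothetical usage on one ready-to-use sequence folder:
# clips = clips_of_32("<your-data-path>/scene_0001/rgb")
```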
| | Dataset | Venue | Resolution | GC | C3R | Mo2 | DP | ST2 | UD2 | VDA | D2U | POM | RD | BoT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | OmniWorld-Game 📌 18,515K frames 😍 | | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 2 | ClaraVid | | 4032×3024 | - | - | - | - | - | - | - | - | - | - | - |
| 3 | Spring | | 1920×1080 | T | T | E | E | T | - | - | T | - | - | - |
| 4 | HorizonGS | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 5 | PLT-D3 | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 6 | MVS-Synth | | 1920×1080 | T | T | T | T | T | - | - | - | - | - | - |
| 7 | SYNTHIA-SF | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 8 | SynDrone Check before use! | | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 9 | Mid-Air | | 1024×1024 | T | - | T | - | - | - | - | - | - | - | - |
| 10 | MatrixCity | | 1000×1000 | T | - | T | - | - | T | - | - | - | - | - |
| 11 | StereoCarla | | 1600×900 | - | - | - | - | - | - | - | - | - | - | - |
| 12 | SAIL-VOS 3D | | 1280×800 | - | - | - | T | - | - | - | - | - | - | - |
| 13 | SHIFT | | 1280×800 | - | - | - | - | - | - | - | - | - | - | - |
| 14 | SYNTHIA-Seqs 🚫 Do not use! 🚫 | | 1280×760 | T | - | T | - | - | - | - | - | - | - | - |
| 15 | BEDLAM | | 1280×720 | - | T | - | T | T | T | - | - | - | - | - |
| 16 | Dynamic Replica | | 1280×720 | T | T | - | T | T | T | - | - | T | - | - |
| 17 | Infinigen SV | | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 18 | Infinigen | | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 19 | DigiDogs 🚫 Do not use! 🚫 | | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 20 | Aria Synthetic Environments Check before use! | - | 704×704 | - | - | - | - | - | - | - | - | - | - | - |
| 21 | TartanGround | | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 22 | TartanAir V2 | - | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 23 | BlinkVision | | 960×540 | - | - | - | - | - | - | - | T | - | - | - |
| 24 | PointOdyssey | | 960×540 | - | T | - | - | T | T | T | T | T | E | - |
| 25 | DyDToF | | 960×540 | - | - | - | - | - | - | - | - | - | E | - |
| 26 | IRS | | 960×540 | T | T | T | T | - | - | T | - | - | - | - |
| 27 | Scene Flow | | 960×540 | E | - | - | - | - | - | - | - | - | - | - |
| 28 | THUD++ | | 730×530 | - | - | - | - | - | - | - | - | - | - | - |
| 29 | 3D Ken Burns | | 512×512 | T | T | T | T | - | - | - | - | - | - | - |
| 30 | SynPhoRest | - | 848×480 | - | - | - | - | - | - | - | - | - | - | - |
| 31 | C3I-SynFace | | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 32 | TartanAir | | 640×480 | T | T | T | T | T | T | T | T | T | T | - |
| 33 | ParallelDomain-4D | | 640×480 | - | - | - | - | - | - | - | - | T | - | - |
| 34 | EDEN | | 640×480 | - | T | T | T | - | T | - | - | - | - | - |
| 35 | GTA-SfM | | 640×480 | T | - | T | - | - | - | - | - | - | - | - |
| 36 | InteriorNet | | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 37 | SYNTHIA-AL | | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 38 | MPI Sintel | | 1024×436 | E | E | E | E | E | E | E | E | E | - | E |
| 39 | Virtual KITTI 2 | | 1242×375 | T | T | - | T | T | - | T | - | - | - | - |
| 40 | TartanAir Shibuya | | 640×360 | - | - | - | - | - | - | - | - | - | - | E |
| | Total: T (training) | | | 11 | 10 | 9 | 9 | 7 | 6 | 4 | 4 | 4 | 1 | 0 |
| | Total: E (testing) | | | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
- ScanNet (170 frames): TAE<=2.2
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052
- NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)
- NYU-Depth V2: AbsRel<=0.051 (metric depth)
- iBims-1: F-score>=0.303
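The thresholds above use standard depth estimation metrics. As a reference, here is a minimal NumPy sketch of AbsRel and δ1 following their common definitions (TAE and F-score are defined in the respective benchmark papers; the function names are illustrative):

```python
import numpy as np

def abs_rel(pred, gt):
    """AbsRel: mean of |pred - gt| / gt over pixels with valid depth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta1(pred, gt, thresh=1.25):
    """δ1: fraction of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))
```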
- Appendix 1: Selection of rankings for this repository (to do)
- Appendix 2: Selection of metrics for the rankings (to do)
- Appendix 3: Rules for qualifying models for the rankings (to do)
- Appendix 4: Notes for "Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models"
- Appendix 5: List of all research papers from the above rankings
- Appendix 6: List of other research papers
| RK | Model<br>Links: Venue Repository | LPIPS ↓ {Input fr.}<br>Table 1 M2SVid |
|---|---|---|
| 1 | M2SVid | 0.180 {MF} |
| 2 | SVG | 0.217 {MF} |
| 3 | StereoCrafter | 0.242 {MF} |
📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4
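Note 1 above means that, before metrics are computed, each predicted sequence is aligned to the ground truth with a single scale and shift. A minimal least-squares sketch of such an alignment, under the assumption that it is solved in closed form over all valid pixels of the sequence (the function name is illustrative, not from any of the papers):

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Fit one scale s and shift t per sequence, minimizing
    ||s * pred + t - gt||^2 over all valid pixels, then apply them."""
    valid = gt > 0
    x, y = pred[valid], gt[valid]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t
```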
Appendix 4: Notes for "Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models"
📝 Note 1: Example of arranging images in the correct order to make a 32-frame video sequence for the ClaraVid dataset:
```
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00360.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00320.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00280.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00240.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00200.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00160.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00120.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00080.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00040.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00000.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00001.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00002.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00003.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00004.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00005.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00006.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00007.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00008.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00009.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00010.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00011.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00012.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00013.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00014.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00015.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00016.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00017.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00018.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00019.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00059.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00099.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00139.jpg
```
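The order above is not arbitrary: the indices run from 00360 down to 00000 in steps of 40, then 00001 through 00019, then 00059, 00099 and 00139. A minimal sketch that reproduces exactly this 32-frame listing (`<your-data-path>` stays a placeholder):

```python
# Reproduce the 32-frame ClaraVid example order given above.
indices = (list(range(360, -1, -40))   # 00360, 00320, ..., 00000
           + list(range(1, 20))        # 00001 ... 00019
           + [59, 99, 139])            # continue in steps of 40
assert len(indices) == 32

folder = "<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h"
frames = [f"{folder}/{i:05d}.jpg" for i in indices]
```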
📝 Note 2: Do not use the SYNTHIA-Seqs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. This is particularly evident in the example of tree leaves:
```
<your-data-path>/SYNTHIA-SEQS-01-SPRING/Depth/Stereo_Left/Omni_F/000071.png
<your-data-path>/SYNTHIA-SEQS-01-SPRING/RGB/Stereo_Left/Omni_F/000071.png
```
📝 Note 3: Do not use the DigiDogs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. See the objects behind the campfire, the shifting position of the vegetation on the left, and the clear banding in the depth map:

```
<your-data-path>/DigiDogs2024_full/09_22_2022/00054/images/img_00012.tiff
```
📝 Note 4: Check the SynDrone dataset carefully before using it to train HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:

```
<your-data-path>/Town01_Opt_120_depth/Town01_Opt_120/ClearNoon/height20m/depth/00031.png
```
📝 Note 5: Check the Aria Synthetic Environments dataset carefully before using it to train HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:

```
<your-data-path>/75/depth/depth0000109.png
```
📝 Note: This list includes the research papers of models that dropped out of the "Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel" ranking as a result of a change in the entry threshold for this ranking in August 2025 and are at the same time ineligible for the other rankings.