We’re looking forward to new models based on DINOv3. For now, the rankings include: Align3R, BetterDepth, ChronoDepth, CUT3R, Depth Any Video, Depth Anything, Depth Pro, DepthCrafter, Geo4D, GRIN, L4P, M2SVid, MASt3R, Metric3D, Metric-Solver, MoGe, MonST3R, NVDS, RollingDepth, SpatialTrackerV2, StereoCrafter, SVG, Uni4D, UniDepth, UniK3D, VGGT, Video Depth Anything and π³.

Video Depth Estimation Rankings
and 2D to 3D Video Conversion Rankings

Awesome Synthetic RGB-D Image Datasets for Training HD Video Depth Estimation Models

Although video depth estimation models should be trained mainly on synthetic RGB-D video datasets, I decided to add two synthetic RGB-D image datasets because of their unique features.

| # | Dataset | Venue | Resolution | Unique features |
|---|---------|-------|------------|-----------------|
| 1 | SynthHuman 📌 Human faces 😍 | ICCV | 384×512 | The dataset contains 98,040 samples featuring the face, 99,976 featuring the full body and 99,992 featuring the upper body. DAViD, trained on this dataset alone, achieved better depth estimation results than Depth Anything V2 Large, Depth Pro and even Sapiens-2B on the Goliath-Face test set. See the results in Table 2. |
| 2 | MegaSynth | CVPR | 512×512 | Huge size: 700K scenes, and the remarkable improvement in depth estimation results of the Depth Anything V2 ViT-B model fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6. |

Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models

The following list contains only synthetic RGB-D datasets in which at least some of the images can be composited into a video sequence of at least 32 frames. The minimum number of frames was chosen on the basis of the ablation studies shown in Table 5 by the Video Depth Anything researchers.

Most datasets contain ready-to-use video sequences of appropriately numbered images in individual folders. There are two exceptions: in the case of the PLT-D3 dataset, images from at least two folders have to be combined to make a longer video sequence, and in the case of the ClaraVid dataset, images have to be arranged in the correct order to make a 32-frame video sequence, for example in the order given in Appendix 4: Notes for "Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models".

Researchers, if you are going to use the following list to select datasets to train your models, check their quality very carefully and choose the best ones. I have visually checked only a few of them, and I have marked 2 datasets on the list to check particularly carefully and 2 datasets that, in my opinion, are not suitable for training video depth estimation models. The reasons for these markings are given in the same Appendix 4.

When selecting the best datasets, comparisons of their quality can be very helpful, such as those in Table 9, Table 6 and another Table 6 for depth estimation models, and TABLE V plus TABLE IV for stereo matching models, although a similar technique can also be used for depth estimation models.

| # | Dataset | Venue | Resolution | GC | C3R | Mo2 | DP | ST2 | UD2 | VDA | D2 | UPO | MRD | BoT |
|---|---------|-------|------------|----|-----|-----|----|-----|-----|-----|----|-----|-----|-----|
| 1 | OmniWorld-Game 📌 18,515K frames 😍 | arXiv | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 2 | ClaraVid | ICCV | 4032×3024 | - | - | - | - | - | - | - | - | - | - | - |
| 3 | Spring | CVPR | 1920×1080 | T | T | E | E | T | - | - | T | - | - | - |
| 4 | HorizonGS | CVPR | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 5 | PLT-D3 | HD | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 6 | MVS-Synth | CVPR | 1920×1080 | T | T | T | T | T | - | - | - | - | - | - |
| 7 | SYNTHIA-SF | BMVC | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 8 | SynDrone — Check before use! | ICCVW | 1920×1080 | - | - | - | - | - | - | - | - | - | - | - |
| 9 | Mid-Air | CVPRW | 1024×1024 | T | - | T | - | - | - | - | - | - | - | - |
| 10 | MatrixCity | ICCV | 1000×1000 | T | - | T | - | - | T | - | - | - | - | - |
| 11 | StereoCarla | arXiv | 1600×900 | - | - | - | - | - | - | - | - | - | - | - |
| 12 | SAIL-VOS 3D | CVPR | 1280×800 | - | - | - | T | - | - | - | - | - | - | - |
| 13 | SHIFT | CVPR | 1280×800 | - | - | - | - | - | - | - | - | - | - | - |
| 14 | SYNTHIA-Seqs 🚫 Do not use! 🚫 | CVPR | 1280×760 | T | - | T | - | - | - | - | - | - | - | - |
| 15 | BEDLAM | CVPR | 1280×720 | - | T | - | T | T | T | - | - | - | - | - |
| 16 | Dynamic Replica | CVPR | 1280×720 | T | T | - | T | T | T | - | - | T | - | - |
| 17 | Infinigen SV | arXiv | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 18 | Infinigen | CVPR | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 19 | DigiDogs 🚫 Do not use! 🚫 | WACVW | 1280×720 | - | - | - | - | - | - | - | - | - | - | - |
| 20 | Aria Synthetic Environments — Check before use! | - | 704×704 | - | - | - | - | - | - | - | - | - | - | - |
| 21 | TartanGround | IROS | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 22 | TartanAir V2 | - | 640×640 | - | - | - | - | - | - | - | - | - | - | - |
| 23 | BlinkVision | ECCV | 960×540 | - | - | - | - | - | - | - | T | - | - | - |
| 24 | PointOdyssey | ICCV | 960×540 | - | T | - | - | T | T | T | T | T | E | - |
| 25 | DyDToF | CVPR | 960×540 | - | - | - | - | - | - | - | - | - | E | - |
| 26 | IRS | ICME | 960×540 | T | T | T | T | - | - | T | - | - | - | - |
| 27 | Scene Flow | CVPR | 960×540 | E | - | - | - | - | - | - | - | - | - | - |
| 28 | THUD++ | arXiv | 730×530 | - | - | - | - | - | - | - | - | - | - | - |
| 29 | 3D Ken Burns | TOG | 512×512 | T | T | T | T | - | - | - | - | - | - | - |
| 30 | SynPhoRest | - | 848×480 | - | - | - | - | - | - | - | - | - | - | - |
| 31 | C3I-SynFace | DIB | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 32 | TartanAir | IROS | 640×480 | T | T | T | T | T | T | T | T | T | T | - |
| 33 | ParallelDomain-4D | ECCV | 640×480 | - | - | - | - | - | - | - | - | T | - | - |
| 34 | EDEN | WACV | 640×480 | - | T | T | T | - | T | - | - | - | - | - |
| 35 | GTA-SfM | RAL | 640×480 | T | - | T | - | - | - | - | - | - | - | - |
| 36 | InteriorNet | BMVC | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 37 | SYNTHIA-AL | ICCVW | 640×480 | - | - | - | - | - | - | - | - | - | - | - |
| 38 | MPI Sintel | ECCV | 1024×436 | E | E | E | E | E | E | E | E | E | - | E |
| 39 | Virtual KITTI 2 | arXiv | 1242×375 | T | T | - | T | T | - | T | - | - | - | - |
| 40 | TartanAir Shibuya | ICRA | 640×360 | - | - | - | - | - | - | - | - | - | - | E |
| | Total: T (training) | | | 11 | 10 | 9 | 9 | 7 | 6 | 4 | 4 | 4 | 1 | 0 |
| | Total: E (testing) | | | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |

List of Rankings

2D to 3D Video Conversion Rankings

  1. Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS<=0.242

Video Depth Estimation Rankings

  1. ScanNet (170 frames): TAE<=2.2
  2. Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979
  3. Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052

Other Monocular Depth Estimation Rankings

  1. NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)
  2. NYU-Depth V2: AbsRel<=0.051 (metric depth)
  3. iBims-1: F-score>=0.303

Appendices


Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS<=0.242

| RK | Model | Venue | Repository | LPIPS ↓ {Input fr.} (source: M2SVid, arXiv, Table 1) |
|----|-------|-------|------------|------------------------------------------------------|
| 1 | M2SVid | arXiv | - | 0.180 {MF} |
| 2 | SVG | ICLR | GitHub | 0.217 {MF} |
| 3 | StereoCrafter | arXiv | GitHub | 0.242 {MF} |

Back to Top Back to the List of Rankings

ScanNet (170 frames): TAE<=2.2

| RK | Model | Venue | Repository | TAE ↓ {Input fr.} (source: VDA, CVPR) |
|----|-------|-------|------------|----------------------------------------|
| 1 | VDA-L | CVPR | GitHub | 0.570 {MF} |
| 2 | DepthCrafter | CVPR | GitHub | 0.639 {MF} |
| 3 | Depth Any Video | ICLR | GitHub | 0.967 {MF} |
| 4 | ChronoDepth | CVPR | GitHub | 1.022 {MF} |
| 5 | Depth Anything V2 Large | NeurIPS | GitHub | 1.140 {1} |
| 6 | NVDS | ICCV | GitHub | 2.176 {4} |

Back to Top Back to the List of Rankings

Bonn RGB-D Dynamic (5 video clips with 110 frames each): δ1>=0.979

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4
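Both Bonn RGB-D rankings score predictions after a per-sequence scale & shift alignment, as noted above. A minimal numpy sketch of that standard protocol (the function names are mine, not taken from any of the papers):

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares scale & shift: find s, t minimizing ||s*pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, gt.ravel(), rcond=None)[0]
    return s * pred + t

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt):
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < 1.25))

# Toy sequence: the prediction is off by an affine transform only,
# so per-sequence alignment recovers it exactly.
gt = np.linspace(1.0, 5.0, 100)
pred = 0.5 * gt - 0.2
aligned = align_scale_shift(pred, gt)
```

In the per-sequence setting, the alignment is solved once per video clip and the metric is then averaged over clips.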

| RK | Model | Venue | Repository | δ1 ↑ {Input fr.} (ST2, ICCV, Table 2) | δ1 ↑ {Input fr.} (Uni4D, CVPR, Table 2) | δ1 ↑ {Input fr.} (VDA, CVPR, Table S1) |
|----|-------|-------|------------|----------------------------------------|------------------------------------------|-----------------------------------------|
| 1 | SpatialTrackerV2 | ICCV | GitHub | 0.988 {MF} | - | - |
| 2 | Depth Pro | ICLR | GitHub | - | 0.986 {1} | - |
| 3-4 | Metric3D v2 | TPAMI | GitHub | - | 0.985 {1} | - |
| 3-4 | UniDepth | CVPR | GitHub | - | 0.985 {1} | - |
| 5 | Uni4D | CVPR | GitHub | - | 0.983 {MF} | - |
| 6 | VDA-L | CVPR | GitHub | 0.982 {MF} | - | 0.972 {MF} |
| 7 | Depth Any Video | ICLR | GitHub | - | - | 0.981 {MF} |
| 8 | DepthCrafter | CVPR | GitHub | 0.979 {MF} | 0.976 {MF} | 0.979 {MF} |

Back to Top Back to the List of Rankings

Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.052

📝 Note 1: Alignment: per-sequence scale & shift
📝 Note 2: See Figure 4

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (ST2, ICCV, Table 2) | AbsRel ↓ {Input fr.} (Uni4D, CVPR, Table 2) | AbsRel ↓ {Input fr.} (π3, arXiv, Table 5) | AbsRel ↓ {Input fr.} (VDA, CVPR, Table S1) |
|----|-------|-------|------------|--------------------------------------------|----------------------------------------------|--------------------------------------------|---------------------------------------------|
| 1 | SpatialTrackerV2 | ICCV | GitHub | 0.028 {MF} | - | - | - |
| 2 | MegaSaM | CVPR | GitHub | 0.037 {MF} | - | - | - |
| 3 | Uni4D | CVPR | GitHub | - | 0.038 {MF} | - | - |
| 4 | UniDepth | CVPR | GitHub | - | 0.040 {1} | - | - |
| 5 | π3 | arXiv | GitHub | - | - | 0.043 {MF} | - |
| 6 | Metric3D v2 | TPAMI | GitHub | - | 0.044 {1} | - | - |
| 7-8 | Depth Pro | ICLR | GitHub | - | 0.049 {1} | - | - |
| 7-8 | VDA-L | CVPR | GitHub | 0.049 {MF} | - | - | 0.053 {MF} |
| 9 | Depth Any Video | ICLR | GitHub | - | - | - | 0.051 {MF} |
| 10 | VGGT | CVPR | GitHub | 0.056 {MF} | - | 0.052 {MF} | - |

Back to Top Back to the List of Rankings

NYU-Depth V2: AbsRel<=0.0421 (affine-invariant disparity)

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (MoGe-2, arXiv, Table B.4) | AbsRel ↓ {Input fr.} (MoGe, CVPR, Table A2) | AbsRel ↓ {Input fr.} (BD, NeurIPS) | AbsRel ↓ {Input fr.} (M3D v2, arXiv) | AbsRel ↓ {Input fr.} (DA, CVPR) | AbsRel ↓ {Input fr.} (DA V2, NeurIPS) |
|----|-------|-------|------------|---|---|---|---|---|---|
| 1 | MoGe-2 | arXiv | GitHub | 0.0335 {1} | - | - | - | - | - |
| 2-3 | MoGe | CVPR | GitHub | 0.0338 {1} | 0.0338 {1} | - | - | - | - |
| 2-3 | UniDepthV2 | arXiv | GitHub | 0.0338 {1} | - | - | - | - | - |
| 4 | UniDepth | CVPR | GitHub | 0.0378 {1} | 0.0378 {1} | - | - | - | - |
| 5 | Depth Anything V2 Large | NeurIPS | GitHub | 0.0414 {1} | 0.0414 {1} | - | - | - | 0.045 {1} |
| 6-8 | BetterDepth | NeurIPS | - | - | - | 0.042 {1} | - | - | - |
| 6-8 | Depth Anything Large | CVPR | GitHub | 0.0420 {1} | 0.0420 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} | 0.043 {1} |
| 6-8 | Metric3D v2 ViT-Large | TPAMI | GitHub | 0.134 {1} | 0.134 {1} | - | 0.042 {1} | - | - |
| 9 | Depth Pro | ICLR | GitHub | 0.0421 {1} | - | - | - | - | - |

Back to Top Back to the List of Rankings

NYU-Depth V2: AbsRel<=0.051 (metric depth)

| RK | Model | Venue | Repository | AbsRel ↓ {Input fr.} (UniK3D, CVPR, Table 16) | AbsRel ↓ {Input fr.} (UD2, arXiv) | AbsRel ↓ {Input fr.} (M3D v2, arXiv) | AbsRel ↓ {Input fr.} (MS, arXiv, Table 2) | AbsRel ↓ {Input fr.} (GRIN, arXiv) |
|----|-------|-------|------------|---|---|---|---|---|
| 1 | UniK3D | CVPR | GitHub | 0.0443 {1} | - | - | - | - |
| 2 | UniDepthV2 | arXiv | GitHub | - | 0.0468 {1} | - | - | - |
| 3 | Metric3D v2 ViT-L FT | TPAMI | GitHub | 0.0470 {1} | 0.0470 {1} | 0.047 {1} | - | - |
| 4 | Metric-Solver | arXiv | GitHub | - | - | - | 0.049 {1} | - |
| 5 | GRIN_FT_NI | arXiv | - | - | - | - | - | 0.051 {1} |

Back to Top Back to the List of Rankings

iBims-1: F-score>=0.303

| RK | Model | Venue | Repository | F-score ↑ {Input fr.} (UD2, arXiv, TABLE I) | F-score ↑ {Input fr.} (UniK3D, CVPR, Table 20) |
|----|-------|-------|------------|----------------------------------------------|-------------------------------------------------|
| 1 | UniDepthV2-Large | arXiv | GitHub | 0.709 {1} | - |
| 2 | UniK3D-Large | CVPR | GitHub | - | 0.698 {1} |
| 3 | Depth Pro | ICLR | GitHub | 0.628 {1} | 0.628 {1} |
| 4 | MASt3R | ECCV | GitHub | 0.557 {2} | 0.557 {2} |
| 5 | UniDepth | CVPR | GitHub | 0.303 {1} | 0.303 {1} |
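The F-score in this ranking is computed on point clouds unprojected from depth: precision is the fraction of predicted points within a distance threshold of the ground truth, recall is the reverse, and the two are combined harmonically. A brute-force numpy sketch of that generic formulation (the function name and the threshold `tau` are illustrative assumptions, not the benchmark's exact protocol):

```python
import numpy as np

def f_score(pred_pts, gt_pts, tau=0.05):
    """Harmonic mean of point-cloud precision and recall at distance threshold tau."""
    # Pairwise Euclidean distances (fine for small clouds; use a KD-tree at scale).
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = float((d.min(axis=1) < tau).mean())  # pred points near some GT point
    recall = float((d.min(axis=0) < tau).mean())     # GT points near some pred point
    return 2 * precision * recall / max(precision + recall, 1e-12)

# Toy cloud of 50 random 3D points.
pts = np.random.default_rng(0).random((50, 3))
```

A perfect prediction scores 1.0, and a cloud far from the ground truth scores 0.0.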

Back to Top Back to the List of Rankings

Appendix 4: Notes for "Awesome Synthetic RGB-D Video Datasets for Training and Testing HD Video Depth Estimation Models"

📝 Note 1: Example of arranging images in the correct order to make a 32-frame video sequence for the ClaraVid dataset:

```
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00360.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00320.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00280.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00240.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00200.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00160.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00120.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00080.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00040.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00000.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00001.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00002.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00003.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00004.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00005.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00006.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00007.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00008.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00009.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00010.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00011.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00012.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00013.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00014.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00015.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00016.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00017.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00018.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00019.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00059.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00099.jpg
<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h/00139.jpg
```
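The ordering above follows a simple pattern: frame indices descend from 00360 to 00040 in steps of 40, run through 00000–00019, then jump to 00059, 00099 and 00139. A small Python sketch that reproduces the 32 paths (the base path is a placeholder, as above):

```python
import os

def claravid_32frame_paths(base="<your-data-path>/008_urban_dense_1/left_rgb/45deg_low_h"):
    """Build the 32-frame ClaraVid ordering from Appendix 4, Note 1."""
    indices = list(range(360, 0, -40)) + list(range(0, 20)) + [59, 99, 139]
    return [os.path.join(base, f"{i:05d}.jpg") for i in indices]

paths = claravid_32frame_paths()
```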

📝 Note 2: Do not use the SYNTHIA-Seqs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. This is particularly evident in the example of tree leaves:
`<your-data-path>/SYNTHIA-SEQS-01-SPRING/Depth/Stereo_Left/Omni_F/000071.png`
`<your-data-path>/SYNTHIA-SEQS-01-SPRING/RGB/Stereo_Left/Omni_F/000071.png`
📝 Note 3: Do not use the DigiDogs dataset for training HD video depth estimation models! The depth maps in this dataset do not match the corresponding RGB images. See the objects behind the campfire, the shifting position of the vegetation on the left and the clear banding on the depth map:
`<your-data-path>/DigiDogs2024_full/09_22_2022/00054/images/img_00012.tiff`
📝 Note 4: Check the SynDrone dataset carefully before using it to train HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:
`<your-data-path>/Town01_Opt_120_depth/Town01_Opt_120/ClearNoon/height20m/depth/00031.png`
📝 Note 5: Check the Aria Synthetic Environments dataset carefully before using it to train HD video depth estimation models! The depth maps in this dataset have large white areas of unknown depth, which should not happen with a synthetic dataset. Example depth map:
`<your-data-path>/75/depth/depth0000109.png`
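One quick way to screen a dataset for the unknown-depth problem described in Notes 4 and 5 is to measure the fraction of saturated ("white") pixels in each depth map. A minimal numpy sketch (the function name is mine; loading the PNG into an array, e.g. with imageio or OpenCV, is left out):

```python
import numpy as np

def invalid_depth_fraction(depth):
    """Fraction of pixels at the saturated value of the dtype, i.e. unknown depth."""
    if np.issubdtype(depth.dtype, np.integer):
        invalid = np.iinfo(depth.dtype).max  # e.g. 65535 for 16-bit depth PNGs
    else:
        invalid = np.inf
    return float((depth == invalid).mean())

# Toy 16-bit depth map: one valid pixel, the rest saturated.
toy = np.full((4, 4), 65535, dtype=np.uint16)
toy[0, 0] = 1234
```

Depth maps with a large fraction of invalid pixels can then be flagged or dropped before training.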

Back to Top Back to the List of Rankings

Appendix 5: List of all research papers from the above rankings

| Method | Abbr. | Paper | Venue (Alt link) | Official repository |
|--------|-------|-------|------------------|---------------------|
| BetterDepth | BD | BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation | NeurIPS | - |
| ChronoDepth | - | Learning Temporally Consistent Video Depth from Video Diffusion Priors | CVPR | GitHub |
| Depth Any Video | DAV | Depth Any Video with Scalable Synthetic Data | ICLR | GitHub |
| Depth Anything | DA | Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data | CVPR | GitHub |
| Depth Anything V2 | DA V2 | Depth Anything V2 | NeurIPS | GitHub |
| Depth Pro | DP | Depth Pro: Sharp Monocular Metric Depth in Less Than a Second | ICLR | GitHub |
| DepthCrafter | DC | DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos | CVPR | GitHub |
| GRIN | - | GRIN: Zero-Shot Metric Depth with Pixel-Level Diffusion | arXiv | - |
| M2SVid | - | M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion | arXiv | - |
| MASt3R | - | Grounding Image Matching in 3D with MASt3R | ECCV | GitHub |
| MegaSaM | - | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | CVPR | GitHub |
| Metric3D v2 | M3D v2 | Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation | TPAMI (arXiv) | GitHub |
| Metric-Solver | MS | Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image | arXiv | GitHub |
| MoGe | - | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision | CVPR | GitHub |
| MoGe-2 | Mo2 | MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details | arXiv | GitHub |
| NVDS | - | Neural Video Depth Stabilizer | ICCV | GitHub |
| SpatialTrackerV2 | ST2 | SpatialTrackerV2: 3D Point Tracking Made Easy | ICCV | GitHub |
| StereoCrafter | - | StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos | arXiv | GitHub |
| SVG | - | SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix | ICLR | GitHub |
| Uni4D | - | Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video | CVPR | GitHub |
| UniDepth | - | UniDepth: Universal Monocular Metric Depth Estimation | CVPR | GitHub |
| UniDepthV2 | UD2 | UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler | arXiv | GitHub |
| UniK3D | - | UniK3D: Universal Camera Monocular 3D Estimation | CVPR | GitHub |
| VGGT | - | VGGT: Visual Geometry Grounded Transformer | CVPR | GitHub |
| Video Depth Anything | VDA | Video Depth Anything: Consistent Depth Estimation for Super-Long Videos | CVPR | GitHub |
| π3 | - | π3: Scalable Permutation-Equivariant Visual Geometry Learning | arXiv | GitHub |

Back to Top Back to the List of Rankings

Appendix 6: List of other research papers

📝 Note: This list includes the research papers of models that dropped out of the "Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel" ranking as a result of a change in the entry threshold for this ranking in August 2025, and that are also ineligible for the other rankings.

| Method | Abbr. | Paper | Venue (Alt link) | Official repository |
|--------|-------|-------|------------------|---------------------|
| Align3R | - | Align3R: Aligned Monocular Depth Estimation for Dynamic Videos | CVPR | GitHub |
| CUT3R | C3R | Continuous 3D Perception Model with Persistent State | CVPR | GitHub |
| Geo4D | - | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction | arXiv | GitHub |
| L4P | - | L4P: Low-Level 4D Vision Perception Unified | arXiv | GitHub |
| MonST3R | - | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion | ICLR | GitHub |
| RollingDepth | RD | Video Depth without Video Models | CVPR | GitHub |

Back to Top Back to the List of Rankings
