The paper proposes two approaches for distance estimation: one is based on DORN with a better discretization strategy, and the other is based on breaking down the distance range into two large bins, one for nearby objects and the other for faraway ones.
It is a [CenterNet](centernet.md)-based approach, very similar to [SMOKE](smoke.md) and [KM3D-Net](km3d_net.md).
Overall this paper is a very solid contribution to monocular 3D object detection. Nothing fancy, but concrete experiments and small design tweaks.
A quick summary of [CenterNet](centernet.md) monocular 3D object detection.
- [CenterNet](centernet.md) predicts 2D bbox center and uses it as 3D bbox center.
- [SMOKE](smoke.md) predicts projected 3D bbox center.
- [KM3D-Net](km3d_net.md) and [Center3D](center3d.md) predict 2D bbox center and offset from projected 3D bbox center.
#### Key ideas
- The 2D center and the projected 3D center are different.
- The gap decreases for faraway objects that appear in the central area of the image plane.
- The gap becomes significant for objects that are close to the camera or on the image boundary.
- LID (linear increasing discretization)
- The SID (spacing-increasing discretization) approach used by [DORN](dorn.md) gives unnecessarily dense bins in the nearby range.
- The length of the bins increases linearly in LID (and log-wise in SID).
- [DORN](dorn.md) counts the number of bins with probability > 0.5 as the ordinal label and uses the median value of that bin as the estimated depth in meters.
- LID also uses regression to predict the residual value. --> This is very important to ensure good depth estimation, as shown in the ablation study (see the first sketch after this list).
- DepJoint: piecewise depth prediction (a decoding sketch follows this list).
- Breaking the distance into two bins (either overlapping or back-to-back bins)
- Eigen's exponential transformation of distance: $\Phi(d) = e^{-d}$.
- This has very good accuracy in the close range, but not in the far range.
- Augment the prediction for faraway objects by also predicting $d' = d_{max} - d$. Then during inference, use the weighted combination of the two predictions.
- The bin breakdown is controlled by two hyperparameters. The bins can be overlapping or back-to-back.
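To make the discretization concrete, here is a minimal NumPy sketch of LID bin edges (bin widths growing linearly), DORN-style SID edges for comparison, and the ordinal decoding described above (count the bins with probability > 0.5, take the median of that bin, add the regressed residual). The exact edge formula, the residual parameterization, and the function names are my assumptions for illustration, not the paper's exact equations.

```python
import numpy as np

def lid_edges(d_min, d_max, num_bins):
    # LID: bin widths grow linearly as delta, 2*delta, ..., num_bins*delta,
    # so the k-th edge is d_min + delta * k*(k+1)/2 (assumed parameterization).
    delta = 2.0 * (d_max - d_min) / (num_bins * (num_bins + 1))
    k = np.arange(num_bins + 1)
    return d_min + delta * k * (k + 1) / 2.0

def sid_edges(d_min, d_max, num_bins):
    # DORN's SID: log-spaced edges, i.e. bin widths grow exponentially.
    k = np.arange(num_bins + 1)
    return np.exp(np.log(d_min) + np.log(d_max / d_min) * k / num_bins)

def decode_ordinal_depth(bin_probs, edges, residual=0.0):
    # Ordinal label = number of bins predicted with probability > 0.5;
    # depth = median (midpoint) of that bin, plus the regressed residual.
    k = int((bin_probs > 0.5).sum())
    k = max(1, min(k, len(edges) - 1))
    median = 0.5 * (edges[k - 1] + edges[k])
    return median + residual

edges = lid_edges(d_min=1.0, d_max=80.0, num_bins=80)
print(np.diff(edges)[:3])  # bin widths increase linearly
print(decode_ordinal_depth(np.linspace(1, 0, 80), edges, residual=0.2))
```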
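And a sketch of the DepJoint decoding under the same caveat: the two heads predict Eigen's transform of $d$ and of $d' = d_{max} - d$ respectively, while the value of `D_MAX`, the clipping, and the confidence-weighted fusion shown here are assumptions about how the two predictions are combined.

```python
import numpy as np

D_MAX = 90.0  # assumed maximum distance of the far bin (a tuned hyperparameter)

def decode_depjoint(phi_near, phi_far, w_near, w_far):
    # Near head predicts Eigen's transform phi_near ~ exp(-d): accurate up close.
    # Far head predicts phi_far ~ exp(-(D_MAX - d)): accurate for faraway objects.
    d_near = -np.log(np.clip(phi_near, 1e-6, None))
    d_far = D_MAX + np.log(np.clip(phi_far, 1e-6, None))
    # Weighted combination of the two predictions (weights assumed to be the
    # per-bin confidences, normalized to sum to one).
    w_sum = max(w_near + w_far, 1e-6)
    return (w_near * d_near + w_far * d_far) / w_sum

# For a faraway object the near head saturates (exp(-d) ~ 0), so the fusion
# should put almost all weight on the far head:
print(decode_depjoint(phi_near=1e-6, phi_far=np.exp(-(D_MAX - 70.0)),
                      w_near=0.02, w_far=0.98))   # ~ 68.9 m
```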
#### Technical details
- RA (reference area) solves the issue of lack of supervision for attribute prediction. Not only does the GT center point contribute to the attribute prediction losses, but a dilated support region is used to predict all the attributes. --> This is inspired by the support region in [SS3D](ss3d.md).
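A minimal sketch of how such a reference area could be built as a supervision mask on the output feature map. The square dilation, the radius, and the name `reference_area_mask` are assumptions for illustration; the paper defines the region relative to the GT box.

```python
import numpy as np

def reference_area_mask(feat_h, feat_w, centers, radius=2):
    # Instead of supervising the attribute heads only at the single GT center
    # pixel, mark a dilated square region around each center so that all
    # pixels inside it contribute to the attribute regression losses.
    mask = np.zeros((feat_h, feat_w), dtype=bool)
    for cx, cy in centers:
        x0, x1 = max(cx - radius, 0), min(cx + radius + 1, feat_w)
        y0, y1 = max(cy - radius, 0), min(cy + radius + 1, feat_h)
        mask[y0:y1, x0:x1] = True
    return mask

# e.g. average the attribute loss over mask.sum() positions instead of one pixel
mask = reference_area_mask(96, 320, centers=[(45, 60), (200, 30)])
```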
#### Notes
- Questions and notes on how to improve/revise the current work
`paper_notes/centernet.md` (6 additions & 0 deletions)
[FCOS](fcos.md) regresses distances to the four edges, while [CenterNet](centernet.md) only regresses width and height. The [FCOS](fcos.md) formulation is more general as it can handle amodal bbox cases (the object center may not be the center of the bbox).
A quick summary of [CenterNet](centernet.md) monocular 3D object detection.
- [CenterNet](centernet.md) predicts 2D bbox center and uses it as 3D bbox center.
- [SMOKE](smoke.md) predicts projected 3D bbox center.
- [KM3D-Net](km3d_net.md) and [Center3D](center3d.md) predict 2D bbox center and offset from projected 3D bbox center.
#### Key ideas
- Other properties, such as object size, dimension, 3D extent, orientation, and pose, are regressed directly from image features at the center location.
`paper_notes/smoke.md` (8 additions & 1 deletion)

tl;dr: Mono3D based on [CenterNet](centernet.md) and [monoDIS](monodis.md).
#### Overall impression
The paper is a solid engineering paper as an extension to [CenterNet](centernet.md), similar to [MonoPair](monopair.md). It does not have a lot of new tricks. It is similar to the popular solutions to the [Kaggle mono3D competition](https://www.kaggle.com/c/pku-autonomous-driving).
A quick summary of [CenterNet](centernet.md) monocular 3D object detection.
- [CenterNet](centernet.md) predicts 2D bbox center and uses it as 3D bbox center.
- [SMOKE](smoke.md) predicts projected 3D bbox center.
- [KM3D-Net](km3d_net.md) and [Center3D](center3d.md) predict 2D bbox center and offset from projected 3D bbox center.
#### Key ideas
- SMOKE eliminates 2D object detection altogether. Instead of predicting the 2D bbox center and the 3D/2D center offset, SMOKE predicts the 3D center directly. --> This may cause issues for heavily truncated cars, as the 3D center may not be inside the image.
- Rather than regressing the 7 DoF variables with separate loss functions, SMOKE transforms the variables into the 8-corner representation of 3D boxes and regresses them with **a unified loss function**. This is a nice way to implicitly weigh the loss terms. (cf. [To learn or not to learn](to_learn_or_not.md), which regresses an essential matrix.)
- The **disentangling loss** from [monoDIS](monodis.md) groups the 8 parameters into 3 groups. For each group, use the prediction from that group and the GT from the other groups to lift to 3D and calculate the overall loss. The final loss is an unweighted average of the losses from the different groups (see the sketch after this list).
- Classification
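To make the disentangling loss above concrete, here is a minimal PyTorch sketch. The grouping into location/dimension/orientation and the `lift_to_corners` helper are assumptions for illustration; the point is that each group is evaluated with the GT values of the other groups and the per-group losses are averaged without extra weights.

```python
import torch
import torch.nn.functional as F

def disentangled_corner_loss(pred, gt, lift_to_corners):
    # pred, gt: dicts with 'loc', 'dim', 'rot' tensors for one object.
    # lift_to_corners: maps (loc, dim, rot) to the (8, 3) box corners.
    gt_corners = lift_to_corners(gt['loc'], gt['dim'], gt['rot'])
    losses = []
    for group in ('loc', 'dim', 'rot'):
        # Use the prediction for this group and GT for the other two groups.
        mixed = {k: (pred[k] if k == group else gt[k]) for k in ('loc', 'dim', 'rot')}
        corners = lift_to_corners(mixed['loc'], mixed['dim'], mixed['rot'])
        losses.append(F.l1_loss(corners, gt_corners))
    # Unweighted average of the per-group corner losses.
    return sum(losses) / len(losses)
```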
#### Notes
- [Code on github](https://github.com/lzccccc/SMOKE)
- Need to implement the 2D center prediction and the offset between the 2D and 3D centers to recover heavily truncated 3D bboxes. This method can be extended to other scenarios where the predicted location goes out of an ROI. See [KM3D-Net](km3d_net.md) and [Center3D](center3d.md).
`paper_notes/ss3d.md` (1 addition & 1 deletion)
- The 26 numbers can also be trained to fit 3D IoU, but the 26 numbers need to be fitted to a valid 3D bbox online. This requires some complex manipulation of gradient.
#### Technical details
- All pixels in the support region (central 20% of the bbox) are responsible for detecting the bounding box sizes. Thus NMS is needed to find the local optimum. The 26 numbers (from 26 channels, most likely) associated with the local optimum point are used to predict the 3D box. --> This is also used in [Center3D](center3d.md).
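A sketch of the readout this implies, assuming a CenterNet-style 3x3 max-pool NMS over the support-score map and a 26-channel parameter map; the threshold, tensor shapes, and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def extract_peak_params(score_map, param_map, thresh=0.3, topk=50):
    # score_map: (1, H, W) support-region scores; param_map: (26, H, W).
    # Keep only local maxima of the score map (3x3 max-pool NMS), then
    # gather the 26 box parameters at each surviving peak.
    pooled = F.max_pool2d(score_map.unsqueeze(0), 3, stride=1, padding=1)[0]
    peaks = (score_map == pooled) & (score_map > thresh)
    ys, xs = torch.nonzero(peaks[0], as_tuple=True)
    order = score_map[0, ys, xs].argsort(descending=True)[:topk]
    return param_map[:, ys[order], xs[order]].T   # (num_peaks, 26)

# usage with dummy network outputs
params = extract_peak_params(torch.rand(1, 96, 320), torch.rand(26, 96, 320))
```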
#### Notes
- Questions and notes on how to improve/revise the current work