
Commit 36e2156
committed: Update notes
1 parent afa1c08 commit 36e2156

File tree: 7 files changed, +13 -7 lines


paper_notes/mono3d++.md

Lines changed: 1 addition & 1 deletion

@@ -21,4 +21,4 @@ The paper seems to use 3D depth off the shelf but it was not described in detail
#### Notes
- Where does the label come from?
- The wireframe model is fragile and cannot model under-represented cases.
--
+- von Mises distribution: circular Gaussian distribution
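The "circular Gaussian" note can be made concrete. A minimal sketch assuming SciPy is available; `mu` and `kappa` are illustrative values, not taken from the paper:

```python
import numpy as np
from scipy.stats import vonmises

# von Mises: a "circular Gaussian" over angles theta in [-pi, pi).
# kappa plays the role of 1/variance: large kappa -> tightly peaked at mu.
mu, kappa = 0.5, 4.0
rv = vonmises(kappa, loc=mu)

# The density is periodic: pdf at theta equals pdf at theta - 2*pi.
theta = 1.0
assert np.isclose(rv.pdf(theta), rv.pdf(theta - 2 * np.pi))

# As kappa -> 0 it approaches the uniform distribution on the circle.
flat = vonmises(1e-8, loc=mu)
print(flat.pdf(0.0))  # ~ 1 / (2*pi) ≈ 0.159
```

This periodicity is why it is a natural likelihood for orientation regression, where angles wrap around.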

paper_notes/mono_3dod_2d3d_constraints.md

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
_October 2019_

-tl;dr: Summary of the main idea.
+tl;dr: Summary of the main idea. The review is published at [towardsdatascience](https://towardsdatascience.com/geometric-reasoning-based-cuboid-generation-in-monocular-3d-object-detection-5ee2996270d1?source=friends_link&sk=ebead4b51a3f75476d308997dd88dd75).

### Deep 3D Box
from [Deep3DBox](https://arxiv.org/pdf/1612.00496.pdf) and its [Supplementary material](https://cs.gmu.edu/~amousavi/papers/3D-Deepbox-Supplementary.pdf), and [review in Chinese](https://blog.csdn.net/qq_29462849/article/details/91314777)
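Deep3DBox's core geometric assumption is that the 2D box tightly bounds the projected 3D cuboid. A toy forward-direction sketch of that constraint (intrinsics, dimensions, and pose are made-up values; the paper actually inverts this relation to solve for translation given orientation and dimensions):

```python
import numpy as np

# Tight-fit assumption: the 2D box is exactly the bounding box of the
# projected 3D cuboid corners. Toy axis-aligned example, made-up numbers.
K = np.array([[720.0,   0.0, 640.0],
              [  0.0, 720.0, 360.0],
              [  0.0,   0.0,   1.0]])

# 8 corners of a box with (width, height, length) = (2, 1.5, 4) m
# centered at camera-frame coordinates (x, y, z) = (2, 1, 20).
w, h, l = 2.0, 1.5, 4.0
center = np.array([2.0, 1.0, 20.0])
signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                               for sy in (-1, 1)
                               for sz in (-1, 1)])
corners = center + 0.5 * signs * np.array([w, h, l])

uv = (K @ corners.T).T
uv = uv[:, :2] / uv[:, 2:3]  # perspective divide

# The tight 2D box is the min/max over the projected corners.
u_min, v_min = uv.min(axis=0)
u_max, v_max = uv.max(axis=0)
print(u_min < u_max and v_min < v_max)  # True
```

With a rotated (non-axis-aligned) box, each 2D box edge touches at least one projected 3D corner, which yields the constraints Deep3DBox solves for translation.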

paper_notes/monogrnet.md

Lines changed: 3 additions & 1 deletion

@@ -5,7 +5,9 @@ _August 2019_
tl;dr: Use the same network to estimate instance depth, 2D and 3D bbox.

#### Overall impression
-The authors raise a critical issue with current depth estimation: the evaluation metrics are not focused on instance-level depth estimation. That means all SOTA methods are sub-optimal in terms of estimating instance-level depth. This point echoes mine that the pixel-wise depth map needs finetuning for 3D object detection, as opposed to freezing it as done in [pseudo-lidar end2end](pseudo_lidar_e2e.md).
+The authors raise a critical issue with current depth estimation: the evaluation metrics are not focused on instance-level depth estimation. That means all SOTA methods are sub-optimal in terms of estimating instance-level depth. This point echoes mine that the pixel-wise depth map needs finetuning for 3D object detection, as opposed to freezing it as done in [pseudo-lidar end2end](pseudo_lidar_e2e.md). This point is further elaborated in [ForeSeE](foresee_mono3dod.md), which separates FG and BG depth estimation.
+
+It also mentions that pixel-wise depth is too expensive for mono 3DOD, and instance depth should be enough. --> similar to [TLNet](tlnet.md).

The depth estimation is instance-based sparse depth info. --> There should be a way to marry depth estimation and 3D object detection together. Or is this sparse depth info already enough? This is the first tunable depth estimation in almost all mono 3DOD research. Most just use pretrained depth estimation off the shelf.
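As a rough illustration of why instance-level depth can suffice for 3DOD: under a pinhole model, an object's depth follows from its apparent size alone. A sketch with made-up numbers, not the paper's actual method:

```python
# Pinhole-camera sketch: instance depth from apparent object height,
# z = f * H_real / h_pixels. All values below are illustrative.
def instance_depth(f_pixels: float, real_height_m: float, bbox_height_px: float) -> float:
    """Depth (m) of an object from its 2D bbox height and known real height."""
    return f_pixels * real_height_m / bbox_height_px

# A ~1.5 m tall car whose 2D box is 75 px tall, with focal length 720 px:
z = instance_depth(720.0, 1.5, 75.0)
print(z)  # 14.4 (meters)
```

One such scalar per detection is far cheaper than a dense per-pixel depth map, which is the paper's point.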

paper_notes/monogrnet_russian.md

Lines changed: 3 additions & 3 deletions

@@ -5,9 +5,9 @@ _October 2019_
tl;dr: Regress keypoints in 2D images and use a 3D CAD model to infer depth.

#### Overall impression
-The training is based on 3D CAD models with minimal keypoint annotation. This is valuable as it saves much annotation effort on 2D images, which is inefficient and inaccurate. It only assumes that the intrinsics are given.
+The training is based on 3D CAD models with minimal keypoint annotation. This is valuable as it saves much annotation effort on 2D images, which is inefficient and inaccurate. It also seems to use the semi-automatic way to annotate 2D keypoints as in [deep MANTA](deep_manta.md).

-It is related to [deepMANTA](deep_manta.md) in that it relies on keypoint regression for monocular 3DOD. It is inspired by [MonoGRNet](monogrnet.md) in that it uses simple geometric reasoning for mono 3DOD. The idea of using keypoints to estimate depth can also be found in [GS3D](gs3d.md).
+It is related to [deepMANTA](deep_manta.md) in that it relies on keypoint regression for monocular 3DOD. The idea of using keypoints to estimate depth can also be found in [GS3D](gs3d.md). It is not actually that related to MonoGRNet.

It follows the mono 3DOD tradition of regressing local yaw and dimension offset from image patches and inferring depth from these results.

@@ -20,7 +20,7 @@ It follows the Mono3DOD tradition that regresses local yaw and dimension offset
- CAD and dimension model: CAD cls + dimension offset.
- Scale according to keypoint distance. This idea is valid as the dimensions or keypoint distances do not vary much.
- Regress the local orientation, then convert to global orientation.
-- Reprojection consistency loss: this would need accurate **extrinsic** information. We could get this from localization, but this is better left to sensor fusion to fuse the info. --> But sensor fusion is better for sanity-checking the consistency of different methods. Or DL could use nominal extrinsic information to check the consistency.
+- Reprojection consistency loss to make the results from multiple heads consistent (e.g., 2D-3D tight constraints, 2D keypoint locations).

#### Technical details
- Dense depth estimation may be redundant in the context of 3D object detection. [MonoGRNet](monogrnet.md) only regresses instance-level depth. This paper focuses on salient features (keypoints).
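The local-to-global orientation conversion mentioned above is the standard ray-angle relation; a minimal sketch (function and variable names are mine, not the paper's):

```python
import math

# Global (ego-frame) yaw = local (allocentric/observation) yaw + ray angle.
# The ray angle is the bearing of the object center from the camera,
# atan2(x, z) for an object at lateral offset x and depth z.
def global_yaw(local_yaw: float, x: float, z: float) -> float:
    return local_yaw + math.atan2(x, z)

# An object straight ahead (x = 0): local and global yaw coincide.
print(global_yaw(0.3, 0.0, 10.0))  # 0.3
```

Networks regress the local yaw because it is what the image patch actually encodes: the patch looks identical for any object with the same local yaw, regardless of where it sits in the image.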

paper_notes/review_descriptors.md

Lines changed: 3 additions & 0 deletions

@@ -28,4 +28,7 @@ How to use multiple features per image? Keypoint matching or bag-of-visual-words
![](https://docs.opencv.org/3.4/fast_speedtest.jpg)
- [openCV example](https://docs.opencv.org/3.1.0/df/d0c/tutorial_py_fast.html#gsc.tab=0)

+#### Harris
+- Quite fast (although slower than FAST), more accurate
+
#### SIFT

paper_notes/review_mono_3dod.md

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@ _October 2019_
- Update: 10/24/2019, initial creation of table
- Update: 10/28/2019, added centerNet (from UT Austin) and mono 3d tracking (from DeepDrive)
+- Update: 11/25/2019, blog post published to [towardsdatascience](https://towardsdatascience.com/monocular-3d-object-detection-in-autonomous-driving-2476a3c7f57e?source=friends_link&sk=160d236be1881b6ee1b431a943666fdb) and Google [spreadsheet](https://docs.google.com/spreadsheets/d/1X_ViM-W4QbHPbJ2dHouRgkRAyzEnBS6J_9VxPEXvDM4/edit#gid=0).

| name | Time | venue | title | tl;dr | predecessor | backbone | 3d size | 3d shape | keypoint | 3d orientation | distance | 2D to 3D tight optim | required input | drawbacks | tricks and contributions | insights |
|---------------------------------------------|------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------|------------------------------------------------------------------------|-----------------------------------------------------------------|----------------------------------------------------|------------------------------|---------------------------------------------------------------------------------|------------------------------------------------------------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

paper_notes/tlnet.md

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ _October 2019_
tl;dr: Place 3D anchors inside the frustum subtended by the 2D object detection as the mono baseline. The stereo branches reweight feature maps based on their coherence score.

#### Overall impression
-Pixel-level depth maps are too expensive for 3DOD. Object-level depth should be good enough.
+Pixel-level depth maps are too expensive for 3DOD. Object-level depth should be good enough. --> this is similar to [MonoGRNet](monogrnet.md).

The paper provides a solid mono baseline. --> this can perhaps be improved by using some heuristics such as vehicle size to overcome the dense sampling of 3D anchors.