# [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961)

_June 2024_

tl;dr: MCTS guided by neural networks beats top human players in the game of Go.

#### Overall impression
Value iteration and policy iteration are systematic, iterative methods for solving MDPs. Yet even the improved policy iteration has to perform a time-consuming update of the value of EVERY state. A standard 19x19 Go board has roughly [2e170 possible states](https://senseis.xmp.net/?NumberOfPossibleGoGames). A state space this vast is intractable for vanilla value iteration or policy iteration.
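To make the per-iteration cost concrete, here is a minimal value-iteration sketch for a hypothetical tabular MDP (a toy example of my own, not Go). The point is the full sweep: every state is touched on every iteration, which is hopeless at ~2e170 states.

```python
# Value iteration over a toy tabular MDP (hypothetical example, not Go).
# P[s][a] is a list of (prob, next_state, reward) transitions.
def value_iteration(P, gamma=0.99, tol=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:  # full sweep: every state is updated on every iteration
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta, V[s] = max(delta, abs(best - V[s])), best
        if delta < tol:
            return V
```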

AlphaGo and its successors use a [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) algorithm to find moves, guided by a value network and a policy network trained on human and computer play. Both networks take in the current state of the board: the value network produces a single scalar value V(s) for the board position, while the policy network produces a probability distribution over all possible moves given that position. Neural networks are used to reduce the effective depth and breadth of the search tree: positions are evaluated with the value network, and actions are sampled with the policy network.
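A rough sketch of these two interfaces, under my own assumptions (PyTorch, 19x19 input planes, placeholder layer sizes; not the paper's actual architecture):

```python
import torch
import torch.nn as nn

BOARD, IN_PLANES = 19, 48  # channel count and layer sizes here are placeholders

class PolicyNet(nn.Module):
    """p(a|s): probability distribution over the 361 board points."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, planes):  # planes: (B, IN_PLANES, 19, 19)
        return torch.softmax(self.conv(planes).flatten(1), dim=-1)  # (B, 361)

class ValueNet(nn.Module):
    """V(s): scalar evaluation of the position, in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * BOARD * BOARD, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, planes):
        return self.head(self.conv(planes))  # (B, 1)
```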

Every leaf node (an unexplored board position) in the MCTS is evaluated in two very different ways: first by the value network, and second by the outcome of a random rollout played out with the fast rollout policy. Notably, a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts that use the RL policy network, while using 15,000 times less computation. This is very similar to a fast-slow system design, intuition vs reasoning, [system 1 vs system 2](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow) in the sense of Nobel laureate Daniel Kahneman (a similar design appears in more recent work such as [DriveVLM](https://arxiv.org/abs/2402.12289)).
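Concretely, the paper blends the two evaluations of a leaf position $s_L$ with a mixing parameter $\lambda$, where $v_\theta$ is the value network and $z_L$ is the fast-rollout outcome (the paper reports best results with $\lambda = 0.5$):

$$V(s_L) = (1-\lambda)\, v_\theta(s_L) + \lambda\, z_L$$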

#### Key ideas
- MCTS: policy estimation that focuses on decision-making from the current state. It is a four-step process of selection-expansion-simulation-backpropagation (a code sketch follows this list).
  - **Selection**: Follow the most promising path based on previous simulations until you reach a leaf node (a position that hasn’t been fully explored yet).
  - **Expansion**: Add one or more child nodes to represent the possible next moves.
  - **Simulation**: From the new node, play out a random game until the end (this is called a “rollout”).
  - **Backpropagation**: Go back through the nodes on the path you took and update their values based on the result of the game. If you won, increase the value; if you lost, decrease it.
- MCTS guided by a value network and a policy network.
  - The value network reduces the search depth by summarizing the values of sub-trees, so the search can get good estimates without going deep; the policy network prunes the search space. Together they balance depth and breadth.
  - MCTS uses both the value network output and the reward from rollouts.
  - The policy network reduces the breadth of the search tree by identifying sensible moves, so non-sensible moves are not explored.
  - The value network V evaluates the winning rate from a state (the board position).
    - Trained with state-outcome pairs; much more self-play data was generated to reduce overfitting.
  - The policy network outputs a distribution over actions.
  - The value network is more like instinct (a learned heuristic); the value signal also provides the policy gradient used to update the policy network. Compare Tesla's learned collision network and heuristic network for hybrid A-star.
- Connections with autonomous driving
  - World model
    - AlphaGo shows how to extract a very good policy given a good world model (simulation).
    - Autonomous driving still needs a very good simulation to be able to leverage the AlphaGo algorithm. --> Is this a dead end, compared with FSD v12?
  - Tesla AI Day 2021 and 2022 were heavily influenced by AlphaGo. FSD v12 is a totally different ballgame, though.
  - Go is a limited and noiseless domain; driving is not.
- Policy networks (update rules in the blocks below)
  - p_σ: trained with SL on expert human moves, the most e2e-like component.
  - p_π: also trained with SL but much shallower, used for fast rollouts inside MCTS.
  - p_ρ: trained with RL, initialized from p_σ, by playing against earlier versions of itself.
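For reference, the update rules for the three learned components as given in the paper: SGD on expert moves for the SL policy $p_\sigma$, REINFORCE on the game outcome $z_t$ for the RL policy $p_\rho$, and regression of the value network $v_\theta$ toward the outcome $z$:

$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}, \qquad \Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t, \qquad \Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\bigl(z - v_\theta(s)\bigr)$$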
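A compressed sketch of the four MCTS steps from the first bullet above, with an AlphaGo-style selection rule (Q plus a prior-weighted exploration bonus) and the $\lambda$-blended leaf evaluation. This is a simplification under my own assumptions: the `Node` class and `c_puct` constant are illustrative, the game-environment functions `play` and `rollout_result` are caller-supplied, and player-perspective sign flips are omitted.

```python
import math

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children = {}           # move -> Node
        self.N, self.W = 0, 0.0      # visit count, total value

    def Q(self):
        return self.W / self.N if self.N else 0.0

def mcts(root, policy_net, value_net, play, rollout_result,
         n_sims=1600, c_puct=5.0, lam=0.5):
    """policy_net(state) -> iterable of (move, prior); value_net(state) -> scalar;
    play(state, move) -> next state; rollout_result(state) -> fast-rollout outcome."""
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: descend by Q plus a prior-weighted exploration bonus.
        while node.children:
            node = max(
                node.children.values(),
                key=lambda ch: ch.Q()
                + c_puct * ch.prior * math.sqrt(node.N) / (1 + ch.N),
            )
            path.append(node)
        # 2. Expansion: create children with priors from the policy network.
        for move, p in policy_net(node.state):
            node.children[move] = Node(play(node.state, move), prior=p)
        # 3. Simulation/evaluation: blend value network and fast-rollout outcome.
        leaf_value = (1 - lam) * value_net(node.state) + lam * rollout_result(node.state)
        # 4. Backpropagation: update statistics along the visited path.
        for n in path:
            n.N += 1
            n.W += leaf_value
    # Final move: the most-visited child of the root.
    return max(root.children, key=lambda m: root.children[m].N)
```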

#### Technical details
- Summary of technical details, such as important training details, or bugs of previous benchmarks.

#### Notes
- Value iteration is to MCTS as Dijkstra's algorithm is to (hybrid) A-star: both value iteration and Dijkstra's systematically consider all possibilities (covering the entire search space) without heuristics, while MCTS and A* use heuristics to focus on the most promising options, making them more efficient for large and complex problems.
- Question: the policy network is essentially already e2e, so why are the value network and MCTS needed?
  - My guess: small-scale SL does not produce a strong enough policy network, so RL and MCTS are needed to boost performance.
  - E2e driving demonstrates that with enough data, e2e SL can produce a strong enough policy network by itself.
  - Maybe MCTS will come back later to generate superhuman driving behavior.