
Commit 6922609

committed
Add AlphaGo
1 parent 2bc6008 commit 6922609

4 files changed: +59 -9 lines changed


README.md

Lines changed: 3 additions & 2 deletions
@@ -62,13 +62,14 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
 - [FlowMap: Path Generation for Automated Vehicles in Open Space Using Traffic Flow](https://arxiv.org/abs/2305.01622) <kbd>ICRA 2023</kbd>
 - [EPSILON: An Efficient Planning System for Automated Vehicles in Highly Interactive Environments](https://arxiv.org/abs/2108.07993) <kbd>TRO 2021</kbd> [Wenchao Ding, encyclopedia of pnc]
 - [Hybrid A-star: Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments](https://www.semanticscholar.org/paper/Path-Planning-for-Autonomous-Vehicles-in-Unknown-Dolgov-Thrun/0e8c927d9c2c46b87816a0f8b7b8b17ed1263e9c) <kbd>IJRR 2010</kbd> [Dolgov, Thrun, Searching]
-- [Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame](https://www.semanticscholar.org/paper/Optimal-trajectory-generation-for-dynamic-street-in-Werling-Ziegler/6bda8fc13bda8cffb3bb426a73ce5c12cc0a1760) <kbd>ICRA 2010</kbd> [Werling, Thrun, Sampling]
+- [Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame](https://www.semanticscholar.org/paper/Optimal-trajectory-generation-for-dynamic-street-in-Werling-Ziegler/6bda8fc13bda8cffb3bb426a73ce5c12cc0a1760) <kbd>ICRA 2010</kbd> [Werling, Thrun, Sampling] [MUST READ for planning folks]
+- [Autonomous Driving on Curvy Roads Without Reliance on Frenet Frame: A Cartesian-Based Trajectory Planning Method](https://ieeexplore.ieee.org/document/9703250) <kbd>TITS 2022</kbd>
 - [Baidu Apollo EM Motion Planner](https://arxiv.org/abs/1807.08048) [[Notes](paper_notes/apollo_em_planner.md)][Optimization]
 - [Spatio-temporal Joint Planning Method for Intelligent Vehicles Based on Improved Hybrid A*](https://www.qichegongcheng.com/CN/abstract/abstract1500.shtml) <kbd>Automotive Engineering: Planning & Decision 2023</kbd> [Joint optimization, search]
 - [Enable Faster and Smoother Spatio-temporal Trajectory Planning for Autonomous Vehicles in Constrained Dynamic Environment](https://journals.sagepub.com/doi/abs/10.1177/0954407020906627) <kbd>JAE 2020</kbd> [Joint optimization, search]
 - [Focused Trajectory Planning for Autonomous On-Road Driving](https://www.ri.cmu.edu/pub_files/2013/6/IV2013-Tianyu.pdf) <kbd>IV 2013</kbd> [Joint optimization, Iteration]
 - [SSC: Safe Trajectory Generation for Complex Urban Environments Using Spatio-Temporal Semantic Corridor](https://arxiv.org/abs/1906.09788) <kbd>RAL 2019</kbd> [Joint optimization, SSC, Wenchao Ding, Motion planning]
-- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) <kbd>Nature 2016</kbd> [DeepMind, MCTS]
+- [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961) [[Notes](paper_notes/alphago.md)] <kbd>Nature 2016</kbd> [DeepMind, MCTS]
 - [AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play](https://www.science.org/doi/full/10.1126/science.aar6404) <kbd>Science 2017</kbd> [DeepMind]
 - [MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model](https://www.nature.com/articles/s41586-020-03051-4) <kbd>Nature 2020</kbd> [DeepMind]
 - [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295) [MobileEye, desire and traj optimization]

paper_notes/alphago.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961)

_June 2024_

tl;dr: MCTS guided by neural networks beats top humans in the game of Go.

#### Overall impression
Value iteration and policy iteration are systematic, iterative methods for solving MDP problems. Yet even the improved policy iteration still has to perform a time-consuming update of the value of EVERY state. A standard 19x19 Go board has roughly [2e170 possible states](https://senseis.xmp.net/?NumberOfPossibleGoGames). This vast state space is intractable for vanilla value iteration or policy iteration.

AlphaGo and its successors use a [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) algorithm to find moves, guided by a value network and a policy network trained on human and computer play. Both networks take in the current state of the board: the value network produces a scalar state value V(s) of the board position, and the policy network produces a probability distribution over all possible moves, which guides which state-action values Q(s, a) the search explores. The neural networks reduce the effective depth and breadth of the search tree: positions are evaluated with the value network, and actions are sampled with the policy network.

Every leaf node (an unexplored board position) in the MCTS is evaluated in two very different ways: by the value network, and by the outcome of a random rollout played out with the fast rollout policy. Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network, while using 15,000 times less computation. This is very similar to a fast-slow system design, intuition vs reasoning, [system 1 vs system 2](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow) by Nobel laureate Daniel Kahneman (we can see a similar design in more recent work such as [DriveVLM](https://arxiv.org/abs/2402.12289)).

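For reference, the two leaf evaluations are blended with a mixing parameter λ (around 0.5 in the paper, if I recall correctly):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

where v_θ(s_L) is the value network's prediction for the leaf and z_L is the outcome of the fast rollout played from it.
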
#### Key ideas
15+
- MCTS: policy estimation, focuses on decision-making from the current state. It has 4 steps process of selection-expansion-simulation-backprop.
16+
- **Selection**: Follow the most promising path based on previous simulations until you reach a leaf node (a position that hasn’t been fully explored yet).
17+
- **Expansion**: add one or more child nodes to represent the possible next moves.
18+
- **Simulation**: From the new node, play out a random game until the end (this is called a “rollout”).
19+
- **Backpropagation**: Go back through the nodes on the path you took and update their values based on the result of the game. If you won, increase the value; if you lost, decrease it.
20+
- MCTS guided by value network and policy network.
21+
- Value network reduce the search depth by summarizing values of sub-trees, so we can avoid going deep for good estimations. Policy network to prune search space. Balanced breadth and width.
22+
- MCTS used both value network and reward from rollout.
23+
- Policy network reduce the breadth of the search tree by identifying sensible moves, so we can avoid non-sensible moves.
24+
- Value network V evaluates winning rate from a state (棋面).
25+
- Trained with state-outcome pairs. Trained with much more self-play data to reduce overfit.
26+
- Policy network evaluates action distribution
27+
- Value network is more like instinct (heuristic), value network provides policy gradient to update policy network. Tesla learned collision network, and heuristic network for hybrid A-star.
28+
- With autonomous driving
29+
- World model
30+
- AlphaGo tells us how to extract very good policy with a good world model (simulation)
31+
- Autonomous driving still needs a very good simulation to be able to leverage alphaGo algorithm. —> It this a dead end, vs FSD v12?
32+
- Tesla AI day 2021 and 2022 are heavily affected by AlphaGo. FSDv12 is a totally different ballgame though.
33+
- Go is limited and noiseless.
34+
- Policy networks
35+
- P_s trained with SL, more like e2e
36+
- P_p trained with SL, shallower for fast rollout in MCTS
37+
- P_r trained with RL to play with P_s
38+
39+
40+
41+
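Below is a minimal Python sketch of the selection-expansion-simulation-backpropagation loop described in the list above. The game hooks (`legal_actions`, `step`, `evaluate`) and the uniform priors are toy stand-ins for illustration; AlphaGo would supply priors from its policy network and evaluate leaves with the λ-blend of value network and fast rollout.

```python
import math

# Toy MCTS following the 4 steps above. `legal_actions`, `step`, and `evaluate`
# are placeholders supplied by the caller; priors are uniform here, whereas
# AlphaGo uses policy-network priors and a value-network/rollout blend.

class Node:
    def __init__(self, state, prior=1.0, parent=None):
        self.state = state
        self.prior = prior          # P(s, a): prior probability of this move
        self.parent = parent
        self.children = {}          # action -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def q(self):                    # running mean value Q(s, a)
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct(parent, child, c_puct=1.0):
    # Selection score Q + U: exploit the mean value, but give high-prior,
    # rarely visited children an exploration bonus.
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q() + u

def mcts(root_state, legal_actions, step, evaluate, n_simulations=200):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: walk down the tree along the highest-scoring child.
        while node.children:
            _, node = max(node.children.items(), key=lambda kv: puct(node, kv[1]))
        # 2. Expansion: create children for all legal moves (none if terminal).
        actions = legal_actions(node.state)
        for a in actions:
            node.children[a] = Node(step(node.state, a),
                                    prior=1.0 / len(actions), parent=node)
        # 3. Simulation / evaluation: value of the leaf (value net and/or rollout).
        value = evaluate(node.state)
        # 4. Backpropagation: update statistics along the visited path.
        #    (Single-player convention; a two-player game flips the sign each ply.)
        while node is not None:
            node.visit_count += 1
            node.value_sum += value
            node = node.parent
    # Like AlphaGo, play the most visited root action.
    return max(root.children.items(), key=lambda kv: kv[1].visit_count)[0]
```

Plugging in a real game only requires `legal_actions(state)`, `step(state, action)`, and a leaf evaluator; the neural networks change what `evaluate` and the priors return, not the structure of the loop.
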
#### Technical details
- Summary of technical details, such as important training details, or bugs of previous benchmarks.

#### Notes
- Value iteration is to MCTS as Dijkstra's algorithm is to (hybrid) A-star: both value iteration and Dijkstra systematically consider all possibilities (covering the entire search space) without heuristics, while MCTS and A* use heuristics to focus on the most promising options, making them more efficient for large and complex problems.
- Question: the policy network is essentially already e2e, so why do we need the value network and MCTS?
    - My guess: small-scale SL does not produce a policy network that is strong enough, so RL and MCTS are needed to boost performance.
    - E2E driving demonstrates that with enough data, e2e SL can produce a strong enough policy network by itself.
    - Maybe MCTS will come back later to generate superhuman behavior for driving.

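To make the Dijkstra/A* analogy concrete, here is a minimal sketch of textbook value iteration (with assumed `transition`/`reward` interfaces); the sweep over EVERY state in the inner loop is exactly what a ~2e170-state game rules out, and what MCTS's focused sampling avoids.

```python
def value_iteration(states, actions, transition, reward, gamma=0.99, tol=1e-6):
    # transition(s, a) -> list of (prob, next_state); reward(s, a, s2) -> float.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:                      # sweeps EVERY state, every iteration
            acts = actions(s)
            if not acts:                      # terminal state keeps its value
                continue
            best = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for p, s2 in transition(s, a))
                for a in acts
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                       # converged to the optimal values
            return V
```
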

paper_notes/eudm.md

Lines changed: 4 additions & 5 deletions
@@ -23,19 +23,18 @@ It is further improved by [MARC](marc.md) where it considers risk-aware continge
 #### Key ideas
 - DCP-Tree (domain-specific closed-loop policy tree), ego-centric
 - Guided branching in action space
-- Each trace only contains ONE change of action (more flexible than MPDM but still manageable). This is a tree with a pruning mechanism built in. [MCDM](mcdm.md) essentially has much more aggressive pruning, as only one type of action is allowed (KKK, RRR, LLL, etc.)
+- Each trace only contains ONE change of action (more flexible than MPDM but still manageable). This is a tree with a pruning mechanism built in. [MPDM](mpdm.md) essentially has much more aggressive pruning, as only one type of action is allowed (KKK, RRR, LLL, etc.)
 - Each semantic action is 2s, 4 levels deep, so a planning horizon of 8s.
 - CFB (conditional focused branching), for other agents
 - Conditioned on ego intention
 - Pick out the potentially risky scenarios using **open loop** safety assessment. (Open loop ignores interaction among agents, and allows checking how serious the situation will be if surrounding agents are completely uncooperative and do not react to other agents.)
 - Select key vehicles first, only a subset of all vehicles. --> Like Tesla's AI Day 2022.
-- Forward simulation
-- IDM for longitudinal simulation
-- PP (Pure pursuit) for lateral simulation
 - EUDM outputs the best policy represented by ego waypoints (0.4s apart). It is then sent to a motion planner (such as [SCC](scc.md)) for trajectory generation.
 
 #### Technical details
-- Summary of technical details, such as important training details, or bugs of previous benchmarks.
+- Forward simulation
+- IDM for longitudinal simulation
+- PP (Pure pursuit) for lateral simulation
 
 #### Notes
 - What predictions are fed into the MP alongside the BP results from EUDM?

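As a companion to the forward-simulation bullets above (IDM longitudinal, pure pursuit lateral), here is a minimal sketch of one closed-loop rollout step; the parameter values, the kinematic bicycle update, and the 0.4 s step are illustrative assumptions, not numbers reported by EUDM.

```python
import math

# One forward-simulation step: IDM for longitudinal acceleration, pure pursuit
# for steering, kinematic bicycle model for the state update.

WHEELBASE = 2.8  # [m], assumed

def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, s0=2.0, a_max=1.5, b=2.0, delta=4.0):
    """Intelligent Driver Model acceleration given ego speed v [m/s],
    lead speed v_lead [m/s] and bumper-to-bumper gap [m]."""
    s_star = s0 + v * T + v * (v - v_lead) / (2.0 * math.sqrt(a_max * b))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 0.1)) ** 2)

def pure_pursuit_steer(x, y, yaw, target, lookahead=8.0):
    """Pure pursuit steering angle toward a lookahead point on the reference path."""
    alpha = math.atan2(target[1] - y, target[0] - x) - yaw   # bearing error
    return math.atan2(2.0 * WHEELBASE * math.sin(alpha), lookahead)

def simulate_step(state, lead, target, dt=0.4):
    """Advance ego state (x, y, yaw, v) by one dt with a kinematic bicycle model."""
    x, y, yaw, v = state
    a = idm_accel(v, lead["v"], lead["gap"])
    steer = pure_pursuit_steer(x, y, yaw, target)
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / WHEELBASE * math.tan(steer) * dt
    v = max(0.0, v + a * dt)
    return (x, y, yaw, v)
```
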

paper_notes/mpdm.md

Lines changed: 2 additions & 2 deletions
@@ -27,10 +27,10 @@ Despite simple design, MPDM is a pioneering work in decision making, and improve
 - The decoupling of vehicle behavior: the instantaneous behaviors are independent of each other.
 - The formulation is highly inspiring and is the foundation of [EPSILON](epsilon.md) and all follow-up works.
 - The horizon is 10s with 0.25s timesteps, so a 40-layer deep tree.
-- MPDM: how important is the closed-loop realism? The paper seems to argue that the inaccuracy in closed-loop simulation does not affect final algorithm performance that much. Closed-loop or not is the key.
 
 #### Technical details
-- Summary of technical details, such as important training details, or bugs of previous benchmarks.
+- MPDM: how important is the closed-loop realism? The paper seems to argue that the inaccuracy in closed-loop simulation does not affect final algorithm performance that much. Closed-loop or not seems to be the key.
+
 
 #### Notes
 - The white paper from [May Mobility](https://maymobility.com/resources/autonomy-at-scale-white-paper/) explains the idea with plainer language and examples.
