# [AlphaGo: Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961)

_June 2024_

tl;dr: MCTS guided by neural networks beats top human players in the game of Go.

#### Overall impression
Value iteration and policy iteration are systematic, iterative methods for solving MDPs. Yet even the improved policy iteration has to perform a time-consuming update of the value of EVERY state. A standard 19x19 Go board has roughly [2e170 possible states](https://senseis.xmp.net/?NumberOfPossibleGoGames). A state space this vast is intractable for vanilla value iteration or policy iteration.
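To make the per-iteration cost concrete, here is a minimal value-iteration sketch for a hypothetical tabular MDP (a toy example of my own, not Go). The point is the full sweep: every state is touched on every iteration, which is hopeless at ~2e170 states.

```python
# Value iteration over a toy tabular MDP (hypothetical example, not Go).
# P[s][a] is a list of (prob, next_state, reward) transitions.
def value_iteration(P, gamma=0.99, tol=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:  # full sweep: every state is updated on every iteration
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta, V[s] = max(delta, abs(best - V[s])), best
        if delta < tol:
            return V
```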

AlphaGo and its successors use a [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) algorithm to find moves, guided by a value network and a policy network trained on human and computer play. Both networks take in the current state of the board: the value network produces a single scalar value V(s) for the board position, while the policy network produces a probability distribution over all possible moves given that position. Neural networks are used to reduce the effective depth and breadth of the search tree: positions are evaluated with the value network, and actions are sampled with the policy network.
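A rough sketch of these two interfaces, under my own assumptions (PyTorch, 19x19 input planes, placeholder layer sizes; not the paper's actual architecture):

```python
import torch
import torch.nn as nn

BOARD, IN_PLANES = 19, 48  # channel count and layer sizes here are placeholders

class PolicyNet(nn.Module):
    """p(a|s): probability distribution over the 361 board points."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, planes):  # planes: (B, IN_PLANES, 19, 19)
        return torch.softmax(self.conv(planes).flatten(1), dim=-1)  # (B, 361)

class ValueNet(nn.Module):
    """V(s): scalar evaluation of the position, in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * BOARD * BOARD, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, planes):
        return self.head(self.conv(planes))  # (B, 1)
```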

Every leaf node (an unexplored board position) in the MCTS is evaluated in two very different ways: first by the value network, and second by the outcome of a random rollout played out with the fast rollout policy. Notably, a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts that use the RL policy network, while using 15,000 times less computation. This is very similar to a fast-slow system design, intuition vs reasoning, [system 1 vs system 2](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow) in the sense of Nobel laureate Daniel Kahneman (a similar design appears in more recent work such as [DriveVLM](https://arxiv.org/abs/2402.12289)).
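Concretely, the paper blends the two evaluations of a leaf position $s_L$ with a mixing parameter $\lambda$, where $v_\theta$ is the value network and $z_L$ is the fast-rollout outcome (the paper reports best results with $\lambda = 0.5$):

$$V(s_L) = (1-\lambda)\, v_\theta(s_L) + \lambda\, z_L$$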

#### Key ideas
- MCTS: policy estimation that focuses on decision-making from the current state. It is a four-step process of selection-expansion-simulation-backpropagation (a code sketch follows this list).
  - **Selection**: Follow the most promising path based on previous simulations until you reach a leaf node (a position that hasn’t been fully explored yet).
  - **Expansion**: Add one or more child nodes to represent the possible next moves.
  - **Simulation**: From the new node, play out a random game until the end (this is called a “rollout”).
  - **Backpropagation**: Go back through the nodes on the path you took and update their values based on the result of the game. If you won, increase the value; if you lost, decrease it.
- MCTS guided by a value network and a policy network.
  - The value network reduces the search depth by summarizing the values of sub-trees, so the search can get good estimates without going deep; the policy network prunes the search space. Together they balance depth and breadth.
  - MCTS uses both the value network output and the reward from rollouts.
  - The policy network reduces the breadth of the search tree by identifying sensible moves, so non-sensible moves are not explored.
  - The value network V evaluates the winning rate from a state (the board position).
    - Trained with state-outcome pairs; much more self-play data was generated to reduce overfitting.
  - The policy network outputs a distribution over actions.
  - The value network is more like instinct (a learned heuristic); the value signal also provides the policy gradient used to update the policy network. Compare Tesla's learned collision network and heuristic network for hybrid A-star.
- Connections with autonomous driving
  - World model
    - AlphaGo shows how to extract a very good policy given a good world model (simulation).
    - Autonomous driving still needs a very good simulation to be able to leverage the AlphaGo algorithm. --> Is this a dead end, compared with FSD v12?
  - Tesla AI Day 2021 and 2022 were heavily influenced by AlphaGo. FSD v12 is a totally different ballgame, though.
  - Go is a limited and noiseless domain; driving is not.
- Policy networks (update rules in the blocks below)
  - p_σ: trained with SL on expert human moves, the most e2e-like component.
  - p_π: also trained with SL but much shallower, used for fast rollouts inside MCTS.
  - p_ρ: trained with RL, initialized from p_σ, by playing against earlier versions of itself.
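For reference, the update rules for the three learned components as given in the paper: SGD on expert moves for the SL policy $p_\sigma$, REINFORCE on the game outcome $z_t$ for the RL policy $p_\rho$, and regression of the value network $v_\theta$ toward the outcome $z$:

$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}, \qquad \Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t, \qquad \Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\bigl(z - v_\theta(s)\bigr)$$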
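A compressed sketch of the four MCTS steps from the first bullet above, with an AlphaGo-style selection rule (Q plus a prior-weighted exploration bonus) and the $\lambda$-blended leaf evaluation. This is a simplification under my own assumptions: the `Node` class and `c_puct` constant are illustrative, the game-environment functions `play` and `rollout_result` are caller-supplied, and player-perspective sign flips are omitted.

```python
import math

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children = {}           # move -> Node
        self.N, self.W = 0, 0.0      # visit count, total value

    def Q(self):
        return self.W / self.N if self.N else 0.0

def mcts(root, policy_net, value_net, play, rollout_result,
         n_sims=1600, c_puct=5.0, lam=0.5):
    """policy_net(state) -> iterable of (move, prior); value_net(state) -> scalar;
    play(state, move) -> next state; rollout_result(state) -> fast-rollout outcome."""
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. Selection: descend by Q plus a prior-weighted exploration bonus.
        while node.children:
            node = max(
                node.children.values(),
                key=lambda ch: ch.Q()
                + c_puct * ch.prior * math.sqrt(node.N) / (1 + ch.N),
            )
            path.append(node)
        # 2. Expansion: create children with priors from the policy network.
        for move, p in policy_net(node.state):
            node.children[move] = Node(play(node.state, move), prior=p)
        # 3. Simulation/evaluation: blend value network and fast-rollout outcome.
        leaf_value = (1 - lam) * value_net(node.state) + lam * rollout_result(node.state)
        # 4. Backpropagation: update statistics along the visited path.
        for n in path:
            n.N += 1
            n.W += leaf_value
    # Final move: the most-visited child of the root.
    return max(root.children, key=lambda m: root.children[m].N)
```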

#### Technical details
- Summary of technical details, such as important training details, or bugs of previous benchmarks.

#### Notes
- Value iteration is to MCTS as Dijkstra's algorithm is to (hybrid) A-star: both value iteration and Dijkstra's systematically consider all possibilities (covering the entire search space) without heuristics, while MCTS and A* use heuristics to focus on the most promising options, making them more efficient for large and complex problems.
- Question: the policy network is essentially already e2e, so why are the value network and MCTS needed?
  - My guess: small-scale SL does not produce a strong enough policy network, so RL and MCTS are needed to boost performance.
  - E2e driving demonstrates that with enough data, e2e SL can produce a strong enough policy network by itself.
  - Maybe MCTS will come back later to generate superhuman driving behavior.