An Empirical Evaluation of the Effectiveness of the SHOT Algorithm in Go and Gobang

Masahiro Honjo 1,a)   Yoshimasa Tsuruoka 2

Abstract: Today, UCT is the most widely used Monte-Carlo Tree Search (MCTS) method in games such as Go. Recently, a new MCTS method called SHOT has been proposed and compared with UCT in several games. In this paper, we compare the performance of SHOT and UCT in the games of Go and Gobang. Experimental results suggest that SHOT is inferior to UCT when many playouts are performed, and that SHOT is superior to UCT when the number of legal moves is large in comparison with the number of playouts. We have also conducted experiments with composed Go problems and found that SHOT can perform poorly in situations that require a deep search.

1 Department of Information and Communication Engineering, The University of Tokyo
2 Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo
a) honjo@logos.t.u-tokyo.ac.jp

1. Introduction

Monte-Carlo Tree Search (MCTS) became the dominant approach to computer Go after the success of the program CrazyStone [5]. Since around 2008, the most widely used MCTS method has been UCT (Upper Confidence bound applied to Trees), popularized by the Go program MoGo [6], and UCT-based programs now play strongly even on the 19×19 board. Recently, Cazenave [4] proposed a new MCTS method called SHOT (Sequential Halving applied to Trees). Whereas UCT selects moves with the UCB1 policy for the multi-armed bandit problem, SHOT allocates its playouts with the Sequential Halving algorithm [7].
Sequential Halving allocates its budget by a fixed schedule rather than by a UCB-style confidence bound. Cazenave [4] compared SHOT with UCT in the game of Nogo and reported that SHOT outperformed UCT there; it is not yet clear, however, how the two methods compare in other games. In this paper we evaluate SHOT against UCT in Gobang and in Go. The remainder of this paper is organized as follows: Section 2 reviews the notions of regret and the algorithms based on them, Section 3 presents our experiments, and Section 4 concludes.

2. Background

2.1 Regret
2.1.1 Multi-armed bandit problem
In the multi-armed bandit problem, a player repeatedly chooses one of K arms; pulling arm j yields a stochastic reward X_j. The player is given a total budget of T pulls.

2.1.2 Cumulative regret
Cumulative regret [2], [8] measures the total loss incurred while playing:

    R_n = Σ_{t=1}^{n} (X* − X_t)    (1)

where R_n is the cumulative regret after n pulls, X* is the expected reward of the best of the K arms, and X_t is the expected reward of the arm pulled at step t.

2.1.3 Simple regret
In the pure-exploration setting, the player spends the budget of T pulls on the K arms as in Section 2.1.1 and then recommends a single arm S(n). Simple regret [3] measures only the quality of this recommendation:

    r_n = X* − X_{S(n)}    (2)

where r_n is the simple regret after n pulls, X* is as above, and X_{S(n)} is the expected reward of the recommended arm S(n).

The cumulative regret of Section 2.1.2 charges for every pull of a suboptimal arm, whereas the simple regret of Section 2.1.3 charges only for the final recommendation; an algorithm that minimizes one does not necessarily minimize the other [9].

2.2 Regret-minimizing algorithms
Bandit algorithms can be classified by the regret they aim to minimize: UCB1 targets cumulative regret, while Sequential Halving targets simple regret.

2.2.1 Sequential Halving
Sequential Halving [7], unlike UCB1, is designed to minimize the simple regret of Section 2.1.3. Its pseudocode is shown in Algorithm 2.1.
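The difference between the two regret notions can be made concrete with a short sketch. The arm values, the pull sequence, and the recommendation in the example below are illustrative, not taken from this paper:

```python
def regrets(means, pulls, recommended):
    """Compute both regret notions in expectation.

    means       : true expected reward of each arm (the X_j)
    pulls       : indices of the arms pulled at steps t = 1..n
    recommended : index S(n) of the arm recommended after the n pulls
    """
    best = max(means)                                  # X*
    cumulative = sum(best - means[a] for a in pulls)   # Eq. (1)
    simple = best - means[recommended]                 # Eq. (2)
    return cumulative, simple

# Three arms with (hypothetical) means 0.0, 0.5, 1.0.  A player that
# always pulls the middling arm but finally recommends the best one
# suffers a large cumulative regret yet zero simple regret.
cum, sim = regrets([0.0, 0.5, 1.0], pulls=[1] * 10, recommended=2)
print(cum, sim)  # -> 5.0 0.0
```

This is why a pure-exploration algorithm such as Sequential Halving can afford to sample suboptimal arms freely during the search: only the final recommendation enters Eq. (2).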
Fig. 1  An illustration of Sequential Halving. Starting from K arms (K = 10 in the figure), each round spreads its share of the budget evenly over the surviving arms and then discards the worse half, reducing the candidate set from K to about K/2 arms per round.

Algorithm 2.1  Sequential Halving
 1: INPUT:
 2:   T: total budget,
 3:   K arms {Arm_1, Arm_2, Arm_3, ..., Arm_K}
 4: OUTPUT:
 5:   the arm recommended by Sequential Halving
 6: S_0 ← {Arm_1, Arm_2, Arm_3, ..., Arm_K}
 7: k ← 0
 8: B ← T
 9: while |S_k| > 1 do
10:   n_k ← max(1, ⌊T / (|S_k| ⌈log_2 K⌉)⌋)
11:   for i = 1 to |S_k| do
12:     for j = 1 to n_k do
13:       pull the i-th arm of S_k and observe the reward
14:       update the empirical mean of Arm_i
15:       B ← B − 1
16:       if B = 0 then
17:         break out of the while loop
18:       end if
19:     end for
20:   end for
21:   S_{k+1} ← the ⌈|S_k|/2⌉ arms of S_k with the best empirical means
22:   k ← k + 1      (k runs to at most ⌈log_2 K⌉ − 1)
23: end while
24: return the single arm remaining in S_k

2.2.2 SHOT
SHOT applies Sequential Halving to Monte-Carlo Tree Search; its pseudocode is given in Algorithm 2.2 [4], and Fig. 2 illustrates it. Instead of descending the tree once per playout as ordinary MCTS does, SHOT runs Sequential Halving over the children of the current node: each surviving child receives an equal share of the remaining budget, that share is spent by calling SHOT recursively on the child, and the worse half of the children is then discarded. When the budget reaching a node is 1, a single playout is performed.

Fig. 2  An illustration of SHOT.
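The Sequential Halving procedure of Algorithm 2.1 can be sketched in a few lines of Python. This is a minimal reading of the pseudocode (it divides the budget per round as n_k does, but omits the hard stop on the remaining budget B for brevity); the Bernoulli arms in the example are illustrative:

```python
import math
import random

def sequential_halving(arms, budget):
    """Return the index of the arm recommended by Sequential Halving.

    arms   : list of zero-argument functions, each returning a reward in [0, 1]
    budget : total number of pulls T
    """
    k = len(arms)
    survivors = list(range(k))
    totals = [0.0] * k                       # summed rewards per arm
    counts = [0] * k                         # number of pulls per arm
    rounds = max(1, math.ceil(math.log2(k)))
    while len(survivors) > 1:
        # n_k: this round's share of the budget, spread evenly (line 10).
        n_k = max(1, budget // (len(survivors) * rounds))
        for i in survivors:
            for _ in range(n_k):
                totals[i] += arms[i]()
                counts[i] += 1
        # Keep the better half, ranked by empirical mean (line 21).
        survivors.sort(key=lambda i: totals[i] / counts[i], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]

# Example: nine Bernoulli arms with means 0.1, 0.2, ..., 0.9; with a
# reasonable budget the last (best) arm is almost always recommended.
random.seed(0)
arms = [lambda p=p: float(random.random() < p)
        for p in (0.1 * j for j in range(1, 10))]
print(sequential_halving(arms, budget=2000))
```

In contrast to UCB, no confidence bonus is needed: the allocation schedule is fixed in advance, which is what makes the number of playouts reaching each node predictable when SHOT applies the same idea recursively in a tree.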
Algorithm 2.2  SHOT

struct NODE {
  INT possible_moves_num
  CHILD child[possible_moves_num]
  INT budgetused
  MOVE move
}
struct CHILD {
  INT games
  FLOAT rate
  NODE my_node
}

Shot(INT budget, NODE node) {
  budgetused ← 0
  if the board is terminal then
    result ← CheckWin()
    return result and update rate and games
  if budget = 1 then
    result ← Playout()
    return result and update rate and games
  if possible_moves_num = 1 then
    Play(child.move); Shot(budget, child.my_node); UnPlay()
    return result and update rate and games
  if children with 0 playouts exist then
    for all children with 0 playouts do
      result ← Playout()
      update child.games, child.rate and budgetused
      if budgetused ≥ budget then
        return result and update rate and games
    end for
  S ← child[]; sort the children in S by their child.rate
  b ← 0
  while |S| > 1 do
    b ← b + max(1, ⌊(budget + budgetused) / (|S| ⌈log_2 possible_moves_num⌉)⌋)
    for each child in S by decreasing child.rate do
      if child.games < b then
        b1 ← b − child.games
        if at the root, |S| = 2 and child is the first in S then
          b1 ← budget − budgetused − (b − next child.games)
        b1 ← min(b1, budget − budgetused)
        Play(child.move); Shot(b1, child.my_node); UnPlay()
        update budgetused, child.rate and child.games
      break if budgetused ≥ budget
    end for
    S ← the ⌈|S|/2⌉ children of S with the best child.rate
    break if budgetused ≥ budget
  end while
  update budgetused, child.rate and child.games
  return the child of S with the best child.rate
}

3. Experiments

3.1 Sequential Halving vs. UCB on multi-armed bandits
Before comparing SHOT with UCT in games, we first compared Sequential Halving with UCB on multi-armed bandit problems with n arms (n = 2, 4, 8, 16, ..., 1024). Arm x_j (j = 1, 2, 3, ..., n) pays a reward of 1 with probability

    P(x_j) = 0.1 + 0.8 (j − 1)/(n − 1)

and 0 otherwise, so that arm x_n is always the best. Each algorithm was given a budget of 20,000 pulls, and each configuration was repeated 1,000 times; we measured how often each algorithm recommended the best arm. Figures 3 and 4 show the results for UCB and Sequential Halving respectively: with up to about 128 arms UCB recommends the best arm more often, while with more arms than that Sequential Halving is the more reliable of the two.
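The bandit instances of this experiment are easy to reproduce. The sketch below implements the reward model P(x_j) and a UCB1 player that, after spending the budget, recommends its most-pulled arm; the exploration constant c = 1.0 and the most-pulled recommendation rule are assumptions, since the text does not specify them:

```python
import math
import random

def make_probs(n):
    """Success probability of arm j (1-based), as in Section 3.1."""
    return [0.1 + 0.8 * (j - 1) / (n - 1) for j in range(1, n + 1)]

def ucb1_recommend(probs, budget, c=1.0):
    """Run UCB1 on Bernoulli arms and recommend the most-pulled arm."""
    k = len(probs)
    counts = [0] * k
    totals = [0.0] * k
    for t in range(budget):
        if t < k:          # initialisation: pull every arm once
            i = t
        else:              # UCB1: empirical mean + exploration bonus
            i = max(range(k),
                    key=lambda j: totals[j] / counts[j]
                    + c * math.sqrt(math.log(t) / counts[j]))
        counts[i] += 1
        totals[i] += float(random.random() < probs[i])
    return max(range(k), key=lambda j: counts[j])

# How often does UCB1 identify the best (last) arm of a 16-armed instance?
random.seed(1)
probs = make_probs(16)
hits = sum(ucb1_recommend(probs, budget=2000) == 15 for _ in range(100))
print(f"best arm recommended in {hits} of 100 runs")
```

Replacing ucb1_recommend with a Sequential Halving player (Algorithm 2.1) and sweeping n over 2, 4, ..., 1024 reproduces the comparison of this section. Note that the gap between neighbouring arms shrinks as 0.8/(n − 1), which is what makes large n hard for a fixed budget.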
3.2 Gobang: SHOT vs. UCT
We next compared SHOT with UCT in Gobang (five in a row) on 10×10 and 20×20 boards. First, both players were given the same number of playouts per move (10,000 to 50,000); each match consisted of 1,000 games (500 as each color). Table 1 shows the results on the 10×10 board: SHOT is ahead at 10,000 playouts per move, but its win rate falls below 50% as the number of playouts grows.

Since SHOT executes playouts faster than UCT, we also compared the two with approximately equal thinking time per move: SHOT was given 10,000, 20,000, 50,000 and 100,000 playouts per move, and UCT was given the number of playouts that consumes about the same time, on both the 10×10 and 20×20 boards. Again each match consisted of 1,000 games (500 as each color). Table 2 shows the results together with the measured times per move. The exploration constant of UCT was set to C = 0.41. On the 10×10 board SHOT's advantage shrinks as the budget grows and is lost at the largest budget, while on the 20×20 board, where the number of legal moves is far larger, SHOT wins more than 90% of the games at every budget. This is consistent with the results reported for Nogo in [4]: SHOT is strong when the number of legal moves is large relative to the number of playouts.

Table 1  Win rate of SHOT against UCT (10×10 Gobang, equal playouts per move)
  Playouts   SHOT win rate [%]
  10,000     60.9
  20,000     54.2
  30,000     49.9
  40,000     45.9
  50,000     42.9

Table 2  SHOT vs. UCT with approximately equal time per move
  Board   SHOT playouts   SHOT time [s/move]   UCT playouts   UCT time [s/move]   SHOT win rate [%]
  10×10   10,000          0.072                4,900          0.071               85.6
  10×10   20,000          0.142                9,200          0.142               78.4
  10×10   50,000          0.343                21,500         0.345               67.4
  10×10   100,000         0.686                41,000         0.682               45.3
  20×20   10,000          0.216                4,900          0.232               92.1
  20×20   20,000          0.434                9,200          0.440               93.7
  20×20   50,000          1.078                21,500         1.048               95.8
  20×20   100,000         2.146                41,000         2.080               95.4
One question is whether SHOT's success in Table 2 comes from the quality of its search or merely from its higher playout speed. To separate the two factors, we fixed one player to UCT with 41,000 playouts per move (the time-equal partner of SHOT with 100,000 playouts in Table 2) and gave the other player 1.5, 2.0 and 3.0 times its base budget: SHOT played with 150,000, 200,000 and 300,000 playouts (100,000 being the base), and UCT played with 60,000, 80,000 and 120,000 playouts. Each match again consisted of 1,000 games on the 10×10 board. Table 3 shows the results. Even with three times as many playouts SHOT stays below a 50% win rate against UCT with 41,000 playouts, and its win rate remains only 49.9% and 49.3% even at 400,000 and 500,000 playouts, whereas UCT converts additional playouts directly into a higher win rate. With large budgets, then, additional playouts benefit SHOT far less than they benefit UCT.

Table 3  Win rate against UCT with 41,000 playouts per move (10×10 Gobang)
  Player   Playouts   Win rate [%]
  SHOT     100,000    47.7
  SHOT     150,000    44.2
  SHOT     200,000    47.9
  SHOT     300,000    47.6
  UCT       60,000    63.4
  UCT       80,000    68.9
  UCT      120,000    70.7

3.3 Go: SHOT vs. UCT
We next compared SHOT with UCT in 9×9 Go. Both players were built on the same MCTS Go implementation [10], [11], differing only in the search-control method. Each player was given the same number of playouts per move: 500, 1,000, 2,000, 5,000 and 10,000, and each match consisted of 100 games (50 as each color). Playouts in Go are much more costly than in Gobang (on the order of 10^5 playouts per second in the experiments of Section 3.2, but an order of magnitude or more fewer in Go), so the time per move is dominated by the playouts themselves and SHOT's speed advantage over UCT largely disappears: at equal playouts the two players also use nearly equal time. Table 4 shows the results together with the measured times per move.

Table 4  SHOT vs. UCT in 9×9 Go with equal playouts per move
  Playouts   SHOT time [s/move]   UCT time [s/move]   SHOT win rate as Black [%]   as White [%]   overall [%]
  500        0.144                0.145               14                           44             29
  1,000      0.296                0.298               24                           68             46
  2,000      0.596                0.600               38                           62             50
  5,000      1.486                1.511               26                           54             40
  10,000     3.002                3.034               26                           36             31
Under these conditions SHOT never clearly outperformed UCT: its overall win rate peaked at 50% with 2,000 playouts per move and declined again for larger budgets. On the small 9×9 board the number of legal moves is small relative to the number of playouts, which, by the results of Section 3.2, is the regime that favors UCT.

Finally, to probe the qualitative differences between the two methods, we tested UCT and SHOT on composed Go problems (tsumego). We used 15 problems taken from a collection on the Web [1], set up on the 9×9 board, and with 5,000, 10,000 and 50,000 playouts per move we checked whether each algorithm chose the correct first move. The problem numbers as listed on the source page were: (1) 1, (2) 4, (3) 2 1, (4) 2 2, (5) 2 4, (6) 2 5, (7) 2 7, (8) 2 8, (9) 3 8, (10) 1 2, (11) 1 3, (12) 32 3, (13) 32 6, (14) 32 1, (15) 27 1. Table 5 shows the results. On problems (1), (9), (12), (13) and (15), which require a relatively deep search to verify the correct answer, UCT found the correct move in settings where SHOT did not. This suggests that SHOT can perform poorly in situations that require a deep search.

4. Conclusion
In this paper we have empirically evaluated the SHOT algorithm against UCT in Gobang and Go.
Table 5  Whether UCT and SHOT chose the correct first move for each of the problems (1)-(15) with 5,000, 10,000 and 50,000 playouts per move.

SHOT was superior to UCT when the number of legal moves was large in comparison with the number of playouts, as on the 20×20 Gobang board, but inferior when many playouts were available, and it failed on composed Go problems that require a deep search. These results suggest that SHOT is attractive for games with very large branching factors or for settings with small playout budgets, while UCT remains the better choice when a large number of playouts can be performed.

References
[1] Tsumego collection. http://www.h-eba.com/tsumego/japan/j310.html.
[2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2012.
[3] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, pp. 1832-1852, 2010.
[4] T. Cazenave. Sequential Halving applied to Trees. IEEE Transactions on Computational Intelligence and AI in Games, Vol. PP, No. 99, 2014.
[5] Rémi Coulom. Computing Elo ratings of move patterns in the game of Go. ICGA Journal, Vol. 30, pp. 198-208, 2007.
[6] Sylvain Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
[7] Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. Proc. of the Int. Conf. on Machine Learning, pp. 1238-1246, 2013.
[8] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, Vol. 6, pp. 4-22, 1985.
[9] D. Tolpin and S. E. Shimony. MCTS based on simple regret. Proc. of AAAI, pp. 570-576, 2012.
[10] http://www.yss-aya.com/book2011/. (In Japanese.)
[11] (Title in Japanese), 2012.