Contents

1 Introduction
2 Existing Methods for MDPs
 2.1 Policies
 2.2 Playouts
 2.3 The Multi-armed Bandit Problem
 2.4 Monte Carlo Tree Search
 2.5 Bandit Algorithms for Continuous Spaces
 2.6 Monte Carlo Tree Search for Continuous Spaces
  2.6.1 Progressive Widening
  2.6.2 Double Progressive Widening
  2.6.3 Weighted UCT
3 Proposed Method
4 Experiments
 4.1 Trap Problem
 4.2 Treasure Hunt Problem
5 Application to Digital Curling
6 Conclusion
References
1 Introduction

This thesis studies sequential decision making formalized as a Markov decision process (Markov Decision Process; MDP). In an MDP, an agent interacts with an environment as illustrated in Figure 1.1: at time steps 0, 1, 2, ... it observes states s_0, s_1, s_2, ..., takes actions A_0, A_1, A_2, ..., and receives rewards R_0, R_1, R_2, ....

Figure 1.1: Interaction between agent and environment in an MDP [1]

As Figure 1.1 shows, the agent in state s_0 chooses action A_0, receives reward R_0, and moves to the next state s_1; the cycle then repeats. Writing x for a state, a for an action, and r for a reward, an MDP is specified by a state space X, an action set A, a transition function P that gives the distribution of the successor state x' ∈ X when action a ∈ A is taken in x, and a reward function r. The agent's goal is to maximize the (discounted) sum of rewards

\[
  r_{\mathrm{sum}} = \sum_{t=0}^{\infty} \gamma^{t} r_{t} \tag{1.1}
\]

where γ is the discount rate with 0 ≤ γ ≤ 1. For an infinite horizon, γ < 1 is required for the sum to converge; γ = 1 can be used for finite-horizon MDPs.
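As a small worked example of (1.1), with illustrative numbers rather than values from the thesis, take γ = 0.9 and rewards r_0 = 1, r_1 = 0, r_2 = 2, with all later rewards zero:

\[
  r_{\mathrm{sum}} = 1 + 0.9 \cdot 0 + 0.9^{2} \cdot 2 = 1 + 1.62 = 2.62
\]

so rewards arriving later contribute less; with γ = 1 the same episode would simply sum to 3.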
The remainder of this thesis is organized as follows. Chapter 2 surveys existing methods for decision making in MDPs. Chapter 3 presents the proposed method. Chapter 4 evaluates it on two benchmark problems with continuous state spaces. Chapter 5 applies it to digital curling, and Chapter 6 concludes.
2 Existing Methods for MDPs

This chapter reviews existing approaches to decision making in MDPs. Methods based on deep learning have recently achieved striking results: AlphaGo combined deep neural networks with tree search to master the game of Go [2], and deep reinforcement learning has produced strong play across video games [3][4][5]. Of these, [3][4] learn Atari 2600 games directly from RGB screen input, and [5] also handles continuous control of simulated 3D robots in MuJoCo¹, all trained end-to-end. The Predictron [6] learns an abstract model, formulated as a Markov reward process (MRP), end-to-end for planning in MDPs. In contrast, simulation-based search methods, as studied for example in general game playing [7], can decide actions in an MDP without prior training.

¹ http://www.mujoco.org/
2.1 Policies

A policy is a rule that determines which action an agent takes in each state. Decision making in an MDP amounts to constructing a good policy, whether it is computed in advance or online.

2.2 Playouts

A playout evaluates a state by simulating a complete sequence of decisions from that state to a terminal state with a simple, typically random, policy and recording the cumulative reward. Adaptive Playout [8] improves the playout policy during search with policy-gradient reinforcement learning. Playouts are a building block of the Monte Carlo tree search methods [13], such as UCT [14], described in Section 2.4.

2.3 The Multi-armed Bandit Problem

In the multi-armed bandit problem [9], a player repeatedly selects one of several arms whose reward distributions are unknown, observing a stochastic reward after each pull. A bandit can be regarded as an MDP with a single state. The player must reconcile two aims:

1. exploiting the arms that look best given the rewards observed so far, and
2. exploring arms whose value is still uncertain.

Bandit algorithms make this exploration-exploitation trade-off explicit, and they supply the action-selection rules used inside the tree-search methods below.
Representative algorithms include Thompson Sampling [10] and the UCB family UCB1, UCB1-tuned, and UCB2 [11]. Thompson Sampling and UCB1 are described here.

Thompson Sampling [10] maintains for each arm a ∈ A a posterior distribution D_a of the arm's expected reward and plays

\[
  a_{\mathrm{TS}} = \operatorname*{argmax}_{a \in A} v_a, \qquad v_a \sim D_a \tag{2.1}
\]

that is, it draws a sample v_a from each posterior D_a and selects the arm whose sample is largest; for rewards in [0, 1] the posterior update is simple [10].

UCB1 [11] records for each arm a the number of pulls N_a and the total reward R_a, and computes

\[
  v_a = \frac{R_a}{N_a} + C_{\mathrm{UCB}} \sqrt{\frac{2 \ln N}{N_a}}, \qquad N = \sum_{a \in A} N_a \tag{2.2}
\]
\[
  a_{\mathrm{UCB1}} = \operatorname*{argmax}_{a \in A} v_a \tag{2.3}
\]

The first term of (2.2) is the empirical mean reward of arm a, and the second is a confidence bonus that grows for rarely pulled arms; the constant C_UCB balances exploitation against exploration, with C_UCB = 1 giving the original UCB1. UCB1 is an instance of the principle of optimism in the face of uncertainty (OFU) [12].
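The following is a minimal sketch of the two selection rules above, assuming Bernoulli (0/1) rewards; the function names and data layout are illustrative, not from the thesis.

    import math
    import random

    def select_ucb1(N, R, c_ucb=1.0):
        """Eqs. (2.2)-(2.3): N[a] pulls of arm a, R[a] total reward of arm a."""
        total = sum(N.values())
        def v(a):
            if N[a] == 0:
                return float("inf")  # force every arm to be tried once
            return R[a] / N[a] + c_ucb * math.sqrt(2.0 * math.log(total) / N[a])
        return max(N, key=v)

    def select_thompson(wins, losses):
        """Eq. (2.1) with a Beta posterior D_a for Bernoulli rewards."""
        return max(wins, key=lambda a: random.betavariate(wins[a] + 1, losses[a] + 1))

    # usage: two arms "a" and "b" with statistics gathered so far
    arm = select_ucb1({"a": 3, "b": 5}, {"a": 2.0, "b": 2.5})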
2.4 Monte Carlo Tree Search

Monte Carlo tree search (MCTS) [13] builds a search tree incrementally while evaluating states by simulation. As shown in Figure 2.1, each iteration consists of four steps: Selection, in which the tree is descended using a bandit-style rule; Expansion, in which a new node is added; Simulation, in which a playout is run from the new node; and Backpropagation, in which the playout result is propagated back along the visited path.

Figure 2.1: The four steps of Monte Carlo tree search [13]

UCT [14] is the MCTS variant that treats action selection at every tree node as a bandit problem and applies UCB1. Algorithm 1 shows one UCT simulation; transition(x, a) samples the reward and successor state of taking action a in state x, and γ is the discount rate.

Algorithm 1 UCT
  N_{x,a}: number of times action a was selected in state x
  R_{x,a}: total reward obtained after selecting a in x
  N_x = Σ_{a∈A_x} N_{x,a}

  function simulationInUCT(x)
    if x is terminal then return the value of x end if
    a_sim ← argmax_{a∈A_x} ( R_{x,a} / N_{x,a} + C_UCB √(2 ln N_x / N_{x,a}) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInUCT(x')
    N_x ← N_x + 1;  N_{x,a_sim} ← N_{x,a_sim} + 1;  R_{x,a_sim} ← R_{x,a_sim} + r_sim
    return r_sim

The simulation is run repeatedly from the root state x_0, and when the budget is exhausted the root action with the largest N_{x_0,a} is played.
2.5 Bandit Algorithms for Continuous Spaces

UCT as given in Algorithm 1 presupposes a finite action set; in an MDP with continuous actions, every candidate would be sampled at most once and the UCB1 statistics would never accumulate. Bandit algorithms for continuous spaces address this by generalizing across similar arms. GP-UCB [15] models the reward function with a Gaussian process and applies a UCB1-style selection rule, and [16] extends this to the contextual setting. Weighted UCB (WUCB) [17] instead smooths the raw observations with a kernel k((state, action), (state, action)) over state-action pairs, with k((s_1, a_1), (s_2, a_2)) = 0 whenever the actions differ (a_1 ≠ a_2). Given the set P of past samples (y, b, r), where r is the observed reward, WUCB computes

\[ n(x, a) = \sum_{(y,b,r) \in P} k((x,a),(y,b)) \tag{2.4} \]
\[ r(x, a) = \sum_{(y,b,r) \in P} k((x,a),(y,b)) \, r \tag{2.5} \]
\[ n(x) = \sum_{a \in A} n(x, a) \tag{2.6} \]
\[ v_a = \frac{r(x,a)}{n(x,a)} + C_{\mathrm{UCB}} \sqrt{\frac{2 \ln n(x)}{n(x,a)}} \tag{2.7} \]
\[ a_{\mathrm{UCB1}} = \operatorname*{argmax}_{a \in A} v_a \tag{2.8} \]

so that every past sample contributes to nearby state-action pairs in proportion to the kernel value.
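A hedged sketch of the WUCB statistics (2.4)-(2.8) follows; the kernel function and the sample list P of (state, action, reward) triples are assumed inputs matching the notation above, not code from [17].

    import math

    def wucb_action(x, actions, P, kernel, c_ucb=1.0):
        n = {a: 1e-9 for a in actions}   # eq. (2.4); epsilon avoids division by zero
        r = {a: 0.0 for a in actions}    # eq. (2.5)
        for (y, b, reward) in P:
            for a in actions:
                wgt = kernel((x, a), (y, b))   # zero whenever a != b in the thesis
                n[a] += wgt
                r[a] += wgt * reward
        n_x = sum(n.values())                  # eq. (2.6)
        def v(a):                              # eq. (2.7)
            return r[a] / n[a] + c_ucb * math.sqrt(2.0 * math.log(max(n_x, 2.0)) / n[a])
        return max(actions, key=v)             # eq. (2.8)

Note the loop over all of P on every call: this is the O(#A·n) per-selection cost revisited in Section 3.9.1.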
2.6 Monte Carlo Tree Search for Continuous Spaces

This section describes MCTS methods that operate in continuous state and action spaces, namely progressive widening, Double Progressive Widening, and Weighted UCT, which are compared against the proposed method in Chapter 4.

2.6.1 Progressive Widening

Progressive Widening (PW) adapts UCT to large or continuous action spaces by letting the candidate action set A_x of a node grow gradually with the node's visit count, so that UCB1 always chooses among a finite set whose statistics are meaningful. Algorithm 2 shows UCT with PW (UCT+PW); newaction(x) samples a new candidate action for state x, and α is the widening parameter.

Algorithm 2 UCT+PW
  A_x: candidate action set of state x

  function simulationInUCT-PW(x)
    if x is terminal then return the value of x end if
    if N_x^α ≥ #A_x then A_x ← A_x ∪ {newaction(x)} end if
    a_sim ← argmax_{a∈A_x} ( R_{x,a} / N_{x,a} + C_UCB √(2 ln Σ_{a∈A_x} N_{x,a} / N_{x,a}) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInUCT-PW(x')
    N_x ← N_x + 1;  N_{x,a_sim} ← N_{x,a_sim} + 1;  R_{x,a_sim} ← R_{x,a_sim} + r_sim
    return r_sim

A new action is added whenever N_x^α reaches the current number of candidates, so the action set grows polynomially with the visit count; α is taken in 0 < α < 1.

2.6.2 Double Progressive Widening

PW widens only the action side. When state transitions are continuous or stochastic with continuous noise, every call to transition(x, a) produces a state never seen before, each tree node is visited only once, and the search degenerates into isolated playouts. Double Progressive Widening (DPW) [18] therefore applies progressive widening to the successor states of each state-action pair as well as to the actions, yielding UCT+DPW.
Algorithm 3 shows UCT+DPW, following [18]; α is the PW parameter for actions and β that for successor states.

Algorithm 3 DPW
  M_{x,a}: set of (successor state, reward) pairs observed after taking a in x
  N_{x,a,(x',r)}: number of times (x', r) followed the pair (x, a)

  function simulationInDPW(x)
    if x is terminal then return the value of x end if
    if N_x^α ≥ #A_x then A_x ← A_x ∪ {newaction(x)} end if
    a_sim ← argmax_{a∈A_x} ( R_{x,a} / N_{x,a} + C_UCB √(2 ln Σ_{a∈A_x} N_{x,a} / N_{x,a}) )
    if N_{x,a_sim}^β ≥ #M_{x,a_sim} then
      M_{x,a_sim} ← M_{x,a_sim} ∪ {transition(x, a_sim)}
    end if
    choose (x', r) ∈ M_{x,a_sim} with probability N_{x,a_sim,(x',r)} / Σ_{(x'',r'')∈M_{x,a_sim}} N_{x,a_sim,(x'',r'')}
    r_sim ← r + γ · simulationInDPW(x')
    N_x ← N_x + 1;  N_{x,a_sim} ← N_{x,a_sim} + 1;  R_{x,a_sim} ← R_{x,a_sim} + r_sim;  N_{x,a_sim,(x',r)} ← N_{x,a_sim,(x',r)} + 1
    return a_sim ... return r_sim
  end function

DPW thus applies progressive widening twice, to the action set and to the successor-state set, which allows UCT to search continuous state and action spaces. The next section describes WUCT, which instead builds UCT on top of WUCB.
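A hedged sketch of DPW's two widening rules from Algorithm 3; the Node layout and helper names are illustrative assumptions, not the implementation used in this thesis.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        N: int = 0                                # N_x
        N_a: dict = field(default_factory=dict)   # N_{x,a}
        M: dict = field(default_factory=dict)     # M_{x,a}: {(x', r): count}

    def widened_actions(node, actions, alpha, new_action):
        # first widening: allow a new action once N_x^alpha >= #A_x
        if node.N ** alpha >= len(actions):
            actions.append(new_action())
        return actions

    def widened_successor(node, x, a, beta, transition):
        # second widening: sample a fresh successor while N_{x,a}^beta >= #M_{x,a};
        # otherwise revisit an old one in proportion to its count
        m = node.M.setdefault(a, {})
        if node.N_a.get(a, 0) ** beta >= len(m):
            x2, r = transition(x, a)
            m[(x2, r)] = m.get((x2, r), 0)   # count is incremented on backpropagation
            return x2, r
        keys = list(m)
        return random.choices(keys, [m[k] + 1 for k in keys])[0]  # +1 smooths fresh entries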
2.6.3 Weighted UCT

In DPW, statistics are kept separately per discretized tree node, so nothing is shared between similar states or actions. Weighted UCT (WUCT) [17] instead replaces the UCB1 rule of UCT with WUCB, so that all past samples are reused through the kernel. Algorithm 4 shows WUCT. Unlike DPW, it maintains a single global set P of (state, action, reward) triples rather than per-node statistics.

Algorithm 4 WUCT
  P: set of (state, action, reward) triples

  function simulationInWUCT(x)
    if x is terminal then return the value of x end if
    n ← 0;  r ← 0
    for (y, a, r') ∈ P do
      n[a] ← n[a] + k((x,a),(y,a))
      r[a] ← r[a] + k((x,a),(y,a)) · r'
    end for
    n_sum ← Σ_{a∈A} n[a]
    a_sim ← argmax_{a∈A} ( r[a] / n[a] + C_UCB √(2 ln n_sum / n[a]) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInWUCT(x')
    P ← P ∪ {(x, a_sim, r_sim)}
    return r_sim

Because information generalizes across similar states, WUCT can make good use of few samples; the price is that every action selection scans all of P, which becomes expensive as samples accumulate (Section 3.9.1).
3 Proposed Method

3.1 Overview

Chapter 2 described DPW [18], which keeps cheap per-node statistics but shares nothing between similar states, and WUCT [17], which generalizes across states with a kernel but at high computational cost. The method proposed in this chapter aims to combine the strengths of both: it shares statistics among nearby states, like WUCT, while keeping the per-simulation cost low, like UCT. To this end the continuous state space is discretized hierarchically, at several resolutions at once, and the statistics stored at the different resolutions are mixed when selecting an action. Figure 3.1 illustrates the idea.

Figure 3.1: Overview of the hierarchical state discretization used by the proposed method
As Figure 3.1 indicates, the proposed method maps each state x to a real number in [0, 1) and otherwise proceeds as UCT.

3.2 State Tree

The interval [0, 1) is refined recursively into a tree of intervals, the state tree: a state is encoded as a binary fraction of D bits, and level d of the tree branches on the d-th bit, so a node at depth d covers an interval of width 2^{-d} and the tree has maximum depth D (between 5 and 30 in the experiments of this thesis).

3.3 Search on the State Tree

One simulation proceeds as in UCT, except that the statistics used for action selection are not stored per concrete state but gathered from all the state-tree nodes whose interval contains the current state; the precise procedures are given in Section 3.7 (Algorithms 5 and 6).

3.4 Statistics

Each state-tree node s stores, for every action a, the number of times N_{s,a} the action was selected at s and the accumulated reward R_{s,a}, together with the node visit count N_s.
3.5 Threshold Function

The threshold function th(d) gives, for each depth d, the number of visits a node at depth d must accumulate before the search descends into (and, if necessary, creates) its child at depth d + 1. It therefore controls how quickly the discretization is refined: small thresholds refine aggressively, large ones keep statistics coarse but well populated.

3.6 Weight Function

The weight function w(d) determines how strongly the statistics stored at depth d influence action selection. For a concrete state x, let S be the set of state-tree nodes whose interval contains x and let d(s) denote the depth of node s. From the per-node statistics N_{s,a} and R_{s,a}, weighted statistics for x and action a are formed as

\[ \tilde{n}(x, a) = \sum_{s \in S} w(d(s)) N_{s,a} \tag{3.1} \]
\[ \tilde{r}(x, a) = \sum_{s \in S} w(d(s)) R_{s,a} \tag{3.2} \]

and, summing over the action set A of x,

\[ \tilde{n}(x) = \sum_{a \in A} \tilde{n}(x, a) \tag{3.3} \]

Equivalently, ñ(x) = Σ_{s∈S} w(d(s)) N_s, since N_s = Σ_{a∈A} N_{s,a}; two states x_0 and x_1 contained in exactly the same nodes receive exactly the same statistics.

The weighted values ñ(x,a) and r̃(x,a) cannot be fed to UCB1 directly, because depending on w(d) they are no longer actual visit counts. The proposed method therefore converts ñ(x) into an effective visit count through a function f,

\[ n(x) = f(\tilde{n}(x)) \tag{3.4} \]

and rescales the per-action statistics proportionally:

\[ n(x, a) = \tilde{n}(x, a) \, \frac{n(x)}{\tilde{n}(x)} \tag{3.5} \]
\[ r(x, a) = \tilde{r}(x, a) \, \frac{n(x)}{\tilde{n}(x)} \tag{3.6} \]

With n(x), n(x,a), and r(x,a) in place of the usual counts, the action is chosen by UCB1 [11]:

\[ a_{\mathrm{try}} = \operatorname*{argmax}_{a \in A} \left( \frac{r(x,a)}{n(x,a)} + C_{\mathrm{UCB}} \sqrt{\frac{2 \ln n(x)}{n(x,a)}} \right) \tag{3.7} \]

where C_UCB is the UCB1 exploration constant. A sketch of this computation follows.
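The following is a minimal sketch of (3.1)-(3.7), assuming `path` holds the state-tree nodes covering x from the root downward and that each node carries dictionaries N and R of per-action statistics; these names are illustrative.

    import math

    def select_action(path, actions, w, f, c_ucb=1.0):
        n_t = {a: 0.0 for a in actions}       # weighted counts, eq. (3.1)
        r_t = {a: 0.0 for a in actions}       # weighted rewards, eq. (3.2)
        for depth, node in enumerate(path):
            for a in actions:
                n_t[a] += w(depth) * node.N.get(a, 0)
                r_t[a] += w(depth) * node.R.get(a, 0.0)
        n_t_sum = sum(n_t.values())           # eq. (3.3)
        n_sum = f(n_t_sum)                    # effective visit count, eq. (3.4)
        scale = n_sum / max(n_t_sum, 1e-12)   # rescaling of eqs. (3.5)-(3.6)
        def v(a):
            n_a = max(n_t[a] * scale, 1e-12)
            bonus = c_ucb * math.sqrt(2.0 * math.log(max(n_sum, 2.0)) / n_a)
            return (r_t[a] * scale) / n_a + bonus
        return max(actions, key=v)            # eq. (3.7)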
3.7 Simulation Procedure

Algorithms 5 and 6 give one simulation of the proposed method. simulation(x) initializes the weighted statistics with the priors N_prior and R_prior and starts a recursive descent of the state tree at the root; simulationOnStateTree accumulates the statistics of the node at each depth, descends while a child covering x exists, and selects an action by (3.7) at the deepest available node. Below, s_x denotes the child of s whose interval contains x, and transition(x, a) samples the successor state and reward.

Algorithm 5 Proposed simulation (descent controlled by the node count N_s)
  N_{s,a}: number of times a was selected at state-tree node s
  R_{s,a}: total reward obtained after selecting a at s
  N_s: visit count of node s
  w(depth), th(depth), f(frequency): parameter functions
  N_prior, R_prior: prior values of the statistics

  function simulation(x)
    if x is terminal then return the value of x end if
    A ← the action set of x
    ñ ← array of #A entries, each N_prior
    r̃ ← array of #A entries, each R_prior
    s ← the root of the state tree
    a_sim, r_sim ← simulationOnStateTree(x, 0, s, A, ñ, r̃)
    return r_sim

  function simulationOnStateTree(x, d, s, A, ñ, r̃)
    W ← w(d)
    for i ← 0 to #A − 1 do
      a ← A[i];  ñ[i] ← ñ[i] + W · N_{s,a};  r̃[i] ← r̃[i] + W · R_{s,a}
    end for
    if N_s ≥ th(d) and s_x does not exist then create s_x end if
    if s_x does not exist then
      ñ_sum ← Σ_i ñ[i];  n_sum ← f(ñ_sum)
      n[i] ← ñ[i] · n_sum / ñ_sum and r[i] ← r̃[i] · n_sum / ñ_sum for all i
      a_sim ← A[ argmax_{i∈0..#A−1} ( r[i] / n[i] + C_UCB √(2 ln n_sum / n[i]) ) ]
      x', r ← transition(x, a_sim)
      r_sim ← r + γ · simulation(x')
    else
      a_sim, r_sim ← simulationOnStateTree(x, d + 1, s_x, A, ñ, r̃)
    end if
    N_{s,a_sim} ← N_{s,a_sim} + 1;  R_{s,a_sim} ← R_{s,a_sim} + r_sim;  N_s ← N_s + 1
    return a_sim, r_sim

Because the statistics update after the branch refers to the node s of the current level, every node along the descent path records the selected action and the obtained reward, at every resolution simultaneously.

Algorithm 6 differs from Algorithm 5 in a single point: the descent condition compares the per-action count N_{s,a} with th(d) instead of the node count N_s, and N_s is not maintained; the rest of the procedure is unchanged.
3.8 Overall Search

Algorithm 7 shows the overall decision procedure built on Algorithm 5 or 6. rootaction(state) chooses which root action to simulate next (for example by UCB1), and the results of the simulations are accumulated per root action; when the time budget is spent, the root action with the best record is returned.

Algorithm 7 Overall search
  rootaction(state): root action selection rule (e.g., UCB1)
  x_root: the root state

  while time remains do
    a_root ← rootaction(x_root)
    x'_root, r_root ← transition(x_root, a_root)
    r_sim ← r_root + γ · simulation(x'_root)
    record r_sim as a sample of a_root
  end while
  return the root action with the best record

3.9 Complexity

3.9.1 Time Complexity

In WUCB and WUCT, one action selection computes the kernel-weighted statistics n[a] and r[a] over all past samples, costing O(#A·n) after n samples; over n simulations the total work is O(n²). In the proposed method, the statistics at each level of the state tree are accumulated in O(#A), so the per-simulation cost is proportional to the depth reached, which is governed by th(d):

1. If th(d) = 0, a new level is created at every visit, the depth after n simulations grows as O(n), and the per-simulation cost becomes O(#A·n), the same order as WUCT on a continuous-state MDP.
2. If th(d) grows geometrically with d, as in the experiments of this thesis, the number of visits needed to reach depth d grows geometrically as well, so after n simulations the depth is O(log n) and one simulation costs O(#A log n).

A short derivation of the second case follows.
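For a worked version of case 2, assume a geometric threshold th(d) = cρ^d with ρ > 1 (for instance th(d) = 8 · 1.5^d, as used in Chapter 4); the constants c and ρ are the only assumptions here. A node at depth D is created only after its parent at depth D − 1 has been visited th(D − 1) times, and the total number of simulations n bounds every visit count, so

\[
  n \;\ge\; th(D-1) \;=\; c\rho^{D-1}
  \quad\Longrightarrow\quad
  D \;\le\; 1 + \log_{\rho}\frac{n}{c} \;=\; O(\log n)
\]

hence each simulation touches O(log n) levels and costs O(#A log n).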
3.9.2 Space Complexity

WUCT stores every sample in P, so after n simulations it uses O(n) memory, each sample being O(1) in size. In the proposed method, the statistics of Section 3.4 are O(1) numbers per action, so a state-tree node occupies O(#A) memory. A node is created only when a visit count reaches th(d), so at most one node is added per simulation and the memory after n simulations is O(#A·n) in the worst case; with a growing th(d), nodes are created far more rarely, and the depth of the tree is in any case bounded by the encoding length D of Section 3.2. Memory use is thus of the same order as WUCT's, with the action-set size as the constant factor.
4 Experiments

This chapter evaluates the proposed method on two benchmark problems with continuous state spaces taken from the literature on MCTS for continuous MDPs [18][19]. In both problems the discount rate is γ = 1.

4.1 Trap Problem

4.1.1 Problem Definition

The Trap Problem follows [18]. The agent starts on the real line at x_0 = 0. At each time step t it chooses one of 8 actions: action a_i moves it forward by (i + 0.5)/8 for 0 ≤ i < 8. A perturbation d drawn uniformly from [0, R) is added, so the successor position is x' = x + a + d. An episode lasts two steps (t = 0 and t = 1) and ends at t = 2, when the agent receives a reward determined by its final position x: 0 for x < 1, 0.7 for 1 ≤ x < l, 0 inside the trap region l ≤ x < h, and 1 for x ≥ 1.7, where the trap used here runs from l = 1.4 to h = 1.7. Experiments are run with the noise scale R = 0.01, as in [18], and additionally with R = 0.1.
The difficulty, as noted in [18], is the following. The maximum reward 1 requires x ≥ 1.7 at t = 2, which is only possible by taking actions close to a = 1 at both t = 0 and t = 1, even though after the first step the agent still sits below 1.7 and sees no reward yet. A search that settles for what already pays off is drawn to the region just short of the trap at 1.4, collecting the safe reward 0.7; this is the behavior a plain UCT over coarse statistics tends to exhibit, and the trap before 1.4 is what gives the problem its name.

4.1.2 Experimental Setup

The proposed method is compared with DPW and WUCT. DPW is run with C_UCB = 1, α = 0.3, and β = 0.25; since the problem has only 8 discrete actions, the action-side widening saturates quickly and the state-side widening dominates. WUCT is run with C_UCB = 1 and stores its samples as the (state, action, reward) triples P of Algorithm 4. For the i-th stored sample (x_i, a_i), the kernel bandwidth and the kernel itself are

\[ \sigma_{i,a} = 0.5 \prod_{j \in 0..i-1} \bigl( 1 + k((x,a),(x_j,a_j)) \bigr)^{-\frac{1}{2}} \tag{4.1} \]
\[ k((x,a),(x_i,a_i)) = \frac{1}{\sqrt{2\pi}\,\sigma_{i,a}} \exp\!\left( -\frac{(x - x_i)^2}{2\sigma_{i,a}^2} \right) \tag{4.2} \]

so the bandwidth starts at 0.5 and shrinks where samples accumulate, sharpening the kernel in well-explored regions. Computing the weighted counts n[a] of Algorithm 4 with this kernel costs O(#P) per selection.
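A hedged sketch of the adaptive kernel (4.1)-(4.2); the bookkeeping (a per-action list of earlier samples with their stored bandwidths) is an assumption made for illustration.

    import math

    def kernel_1d(x, xi, sigma):
        """Gaussian kernel of eq. (4.2)."""
        return math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

    def add_sample(samples, x, a, sigma0=0.5):
        """Store (x, a) with the bandwidth of eq. (4.1).

        samples[a] is the list of (x_j, sigma_j) pairs recorded earlier for
        action a; the bandwidth shrinks wherever previous samples lie close by.
        """
        sigma = sigma0
        for (xj, sj) in samples.get(a, []):
            sigma *= (1.0 + kernel_1d(x, xj, sj)) ** -0.5
        samples.setdefault(a, []).append((x, sigma))
        return sigma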
The proposed method is run with C_UCB = 1, th(d) = 8 · 1.5^d, w(d) = (d + 1)², and f(ñ_sum) = ñ_sum^{2/3}. Positions, which fall in [−2R, 2 + 2R], are mapped linearly to [0, 1) to build the state tree, whose nodes hold the (s, a) statistics of Section 3.4. In all methods the root (t = 0) is handled by UCB1 with C_UCB = 1, as in Algorithm 7. All measurements were taken on an Intel Core i7-4790K at 4.00 GHz.

4.1.3 Evaluation

Each method is evaluated both per simulation count, at 1, 10, 100, 1000, and 10000 simulations per move (five settings, each averaged over 10000 episodes), and per time budget, at 1, 10, 100, and 1000 ms per move (four settings, averaged over 10000 episodes for 1-100 ms and over 1000 episodes for 1000 ms).
4.1.4 Results

Figures 4.1 and 4.2 plot the mean reward for R = 0.01 and R = 0.1, against the number of simulations and against the time budget respectively. In the per-simulation comparison (Figure 4.1), the proposed method and WUCT learn to cross the trap with far fewer simulations than DPW, which at around 100 simulations still settles for the safe region short of 1.4; WUCT requires on the order of 1000 simulations. In the per-time comparison (Figure 4.2), WUCT loses its edge between 10 ms and 1000 ms as its per-simulation cost grows, while the proposed method keeps its advantage over DPW at both R = 0.01 and R = 0.1.

4.2 Treasure Hunt Problem

4.2.1 Problem Definition

The Treasure Hunt Problem follows [19] (Figure 4.3). The agent moves on the square board 0 ≤ x ≤ D, 0 ≤ y ≤ D, starting at Start = (0, 0). The treasure occupies the corner square D − G ≤ x ≤ D, D − G ≤ y ≤ D; reaching it yields R_treasure and ends the episode. A hole of size h occupies the center of the board, (D−h)/2 ≤ x ≤ (D+h)/2 and (D−h)/2 ≤ y ≤ (D+h)/2, and entering it incurs R_hole. An action is a direction θ: the agent moves distance A in direction θ, perturbed by noise of magnitude ϵ, and every time step incurs R_time. An episode also terminates at t = T. After each move the position (x', y') is clipped to the board:

\[ x_{t+1} = \max(0, \min(x', D)), \qquad y_{t+1} = \max(0, \min(y', D)) \]
Because the hole blocks the direct diagonal route, the agent starting from (0, 0) must either skirt the hole closely or detour via the corners (0, D) or (D, 0), so there are three qualitatively different routes to the treasure. The parameters are G = 1, h = D/2, A = 1, ϵ = 1, R_treasure = 1, R_hole = −0.5, R_time = −0.001, and T = 10·D. The board size D = 5 follows [19]; D = 15 is added to examine behavior on a larger instance.

4.2.2 Experimental Setup

The compared methods are again WUCT, DPW, and the proposed method. Since the action (the direction θ) is continuous, WUCT is combined with progressive widening over actions, with PW parameter α = 0.2. For DPW the exploration constant is set to C = 0.3 on the Treasure Hunt Problem (versus 1 on the Trap Problem); WUCT keeps C_UCB = 1, as on the Trap Problem. The WUCT kernel is the two-dimensional analogue of (4.2), with (x − x_i)² replaced by (x − x_i)² + (y − y_i)², and the initial bandwidth of (4.1) scaled from the 0.5 of the Trap Problem to D/2 to match the domain ([0, D], [0, D]). As on the Trap Problem, WUCT's per-simulation cost grows with the number of stored samples, which limits how far it can be run.

4.2.3 Evaluation

For D = 5, the methods are compared at 1, 10, 100, 1000, and 10000 simulations per move (averaging 10000 episodes for 1-100 simulations and 1000 episodes for 1000-10000) and at time budgets of 1, 10, 100, and 1000 ms per move (10000 episodes for 1 and 10 ms, 1000 episodes for 100 and 1000 ms). For D = 15, the same five simulation counts are used (10000 episodes for 1-100, 1000 episodes for 1000-10000) and the same four budgets (10000 episodes for 1-100 ms, 1000 episodes for 1000 ms); WUCT is evaluated only up to 1000 simulations because of its computational cost.
4.2.4 Results

Figures 4.4 and 4.5 show the mean reward against the number of simulations and against the time budget, each for D = 5 and D = 15; Figures 4.6 and 4.7 give supplementary views of the same runs. In the per-simulation comparison (Figure 4.4), the differences between the methods widen in going from D = 5 to D = 15, where DPW remains strong. In the per-time comparison (Figure 4.5), DPW dominates at the smallest budgets of 1 and 10 ms, where WUCT completes very few simulations: on D = 15 roughly a single WUCT simulation fits into 1 ms, and about ten into 10 ms. Figure 4.6 reflects WUCT's cost per simulation growing past roughly 100 simulations, consistent with Section 3.9.1. As Figures 4.4 and 4.5 together indicate, on D = 15 WUCT is effectively limited to small sample counts, while DPW reaches 10000 simulations within 1000 ms.
Figure 4.1: Trap Problem: mean reward versus number of simulations
Figure 4.2: Trap Problem: mean reward versus time budget
Figure 4.3: The Treasure Hunt Problem [19]
Figure 4.4: Treasure Hunt Problem: mean reward versus number of simulations
Figure 4.5: Treasure Hunt Problem: mean reward versus time budget
Figure 4.6: Treasure Hunt Problem (supplementary comparison)
Figure 4.7: Treasure Hunt Problem (supplementary comparison)
5 Application to Digital Curling

This chapter applies the method of Chapter 3 to digital curling and evaluates it in matches against an existing program.

5.1 Curling

Curling is a sport in which two teams slide stones across the ice toward a circular target. The teams throw alternately, each delivering 8 stones per end; when all stones have been thrown, the team with the stone closest to the center scores one point for every one of its stones closer than the opponent's best, and the other team scores 0. A game consists of a fixed number of ends, 8 or 10 in official play, and the team with the higher total wins.

5.2 Digital Curling

Digital curling [21][22] is a framework in which computer programs play curling against each other: the motion of the stones is simulated in two dimensions, and a random perturbation is added to every throw, reproducing the execution uncertainty of the real game. Figure 5.1 shows a scene from the simulator.
Figure 5.1: A digital curling match in progress

The physics of the stones is computed with the Box2D physics engine¹, and digital-curling competitions between programs have been held [23].

¹ http://box2d.org/
5.3 Existing Curling AI

Strategy for digital curling has been studied with game-tree search [24][25], and [26] describes a curling AI based on expectimax search. Closest to the present work, Yee et al. [27] proposed kernel regression UCT (KR-UCT), a Monte Carlo tree search for continuous action spaces with execution uncertainty that shares information between similar actions through kernel regression; KR-UCT was also evaluated on curling.

5.4 Curling as an MDP

A state x consists of the positions of the stones in play together with the progress of the game; an action a is the initial velocity (v_x, v_y) of the delivered stone and its rotation direction w. After a throw, execution noise is added to the delivered velocity. Inside the search the noise is discretized: Table 5.1 lists the discretized distribution used, with offsets given in steps of 0.25 of the noise unit for the two velocity components.
Table 5.1: Discretized noise distribution used inside the search (offsets in steps of 0.25; the two columns give the probabilities for the two velocity components)

  offset   probability (1)   probability (2)
  8        0.000195461       0.00000547291
  7        0.000781844       0.0000547291
  6        0.00312738        0.0000547291
  5        0.0125095         0.000547291
  4        0.0390922         0.00547291
  3        0.070366          0.0156369
  2        0.195461          0.0547291
  1        0.390922          0.172006
  0        0.0390922         -

5.5 Applying the Proposed Method

5.5.1 State Encoding

To apply the mapping of Section 3.1, each stone's position is encoded into the state tree: the x and y coordinates are each discretized into 2^15 levels (15 bits), giving a resolution of about 0.5 mm over the roughly 15.8 m playing area, and the resulting bits index the state tree; hashing in the style of Zobrist hashing [29] is used to recognize identical positions efficiently.
5.5.2 Parameter Functions

The parameter functions are set to

\[ th(d) = 1.3^{d} \tag{5.1} \]
\[ w(d) = (d + 1)^{4} \tag{5.2} \]
\[ f(\tilde{n}_{\mathrm{sum}}) = \tilde{n}_{\mathrm{sum}}^{0.4} \tag{5.3} \]

with the UCB1 constant C_UCB = 1. Algorithm 8 adapts Algorithm 5 to curling, keeping the statistics N_{s,a}, R_{s,a}, N_s and the functions w(depth), th(depth), f(frequency); the discount rate is γ = 1, and transition(state, action) invokes the physics simulation with execution noise. Two additions are made, a prior of weight N_new for actions not yet tried at a node and a term N_policy that strengthens exploration, both explained after the listing. A small code sketch of the parameter functions (5.1)-(5.3) precedes the listing below.
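The concrete parameter functions (5.1)-(5.3), written out as a sketch together with the descent test of Section 3.5; `should_descend` is an illustrative helper name.

    def th(d):
        """Descent threshold at depth d, eq. (5.1)."""
        return 1.3 ** d

    def w(d):
        """Depth weight for mixing statistics, eq. (5.2)."""
        return (d + 1) ** 4

    def f(n_tilde_sum):
        """Effective sample size from the weighted counts, eq. (5.3)."""
        return n_tilde_sum ** 0.4

    def should_descend(node_visits, depth):
        # the state tree is refined to the next (finer) level once a node
        # has been visited at least th(depth) times
        return node_visits >= th(depth)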
Algorithm 8 Proposed method applied to curling
  A_s: set of actions already tried at node s
  R_s: total reward recorded at node s
  N_new, N_policy: prior weights

  function simulationInCurling(x)
    if x is terminal then return the value of x end if
    A_x ← candidate actions for x
    ñ ← 0;  r̃ ← 0
    s ← the root of the state tree
    a_sim, r_sim ← simulationOnStateTreeInCurling(x, 0, s, A_x, ñ, r̃)
    return r_sim

  function simulationOnStateTreeInCurling(x, d, s, A_x, ñ, r̃)
    W ← w(d)
    for a ∈ A_x do
      if a ∈ A_s then
        ñ[a] ← ñ[a] + W · N_{s,a};  r̃[a] ← r̃[a] + W · R_{s,a}
      else
        ñ[a] ← ñ[a] + W · N_new;  r̃[a] ← r̃[a] + W · N_new · R_s / N_s
      end if
    end for
    if N_s ≥ th(d) and s_x does not exist then create s_x with N_{s_x} ← 0 end if
    if s_x does not exist then
      ñ_sum ← Σ ñ;  n_sum ← f(ñ_sum)
      n[a] ← ñ[a] · n_sum / ñ_sum and r[a] ← r̃[a] · n_sum / ñ_sum for all a ∈ A_x
      n_sum ← n_sum + N_policy · #A_x
      a_sim ← argmax_{a∈A_x} ( r[a] / n[a] + C_UCB √(2 ln n_sum / n[a]) )
      r_sim ← simulationInCurling(transition(x, a_sim))
    else
      a_sim, r_sim ← simulationOnStateTreeInCurling(x, d + 1, s_x, A_x, ñ, r̃)
    end if
    if a_sim ∉ A_s then
      A_s ← A_s ∪ {a_sim};  N_{s,a_sim} ← N_new;  R_{s,a_sim} ← 0;  N_s ← N_s + N_new
    end if
    N_{s,a_sim} ← N_{s,a_sim} + 1;  R_{s,a_sim} ← R_{s,a_sim} + r_sim
    N_s ← N_s + 1;  R_s ← R_s + r_sim
    return a_sim, r_sim
The two additions work as follows. First, an action that has not yet been tried at node s contributes a prior of N_new virtual visits at the node's mean reward R_s/N_s, weighted by W like the real statistics; untried actions are thereby assumed to be as good as the node's average rather than all being forced to be sampled. Second, before the UCB computation the effective count n_sum is increased by N_policy · #A_x, which enlarges the confidence bonus and keeps rarely tried candidates in play. When the chosen action turns out to be new at s, its statistics are initialized with the N_new prior before the regular update.

5.5.3 Candidate Actions

Because the action space is continuous, each node chooses among a fixed set of candidate shots, constructed following [24]. Draw-shot targets are laid out on grids over the playing area: a grid with spacing (l/2, 2l) provides 63 × 31 = 1953 candidates, and a finer grid with spacing l/4 in a limited region adds 189 more, for 2142 target positions in total; with the two rotation directions this yields 4284 candidate actions [24]. Figure 5.2 illustrates the placement.
Figure 5.2: Placement of the candidate shots [24]

For comparison, the game-tree-search programs [24][25] evaluate their candidates with handcrafted evaluation functions (Q_p and Q_f in [24]); [25] uses 1440 candidate positions (2880 actions with both rotations) against the 2142 (4284) used here. In the present method the candidates are evaluated purely by search, with UCB-based selection inside the tree.
At the root, actions are treated differently from interior nodes, as shown in Algorithm 9: all root actions are simulated in a fixed cyclic order until the simulation budget C_limit is exhausted, so every candidate receives the same number of samples; transitionWithoutNoise(state, action) applies a throw without execution noise.

Algorithm 9 Root action evaluation
  allrootactions(state): all candidate actions of the root state
  candidaterootactions(state): the root actions retained as final candidates
  C_limit: simulation budget

  c ← 0
  A_all ← allrootactions(x)
  N_root ← array of #A_all zeros;  R_root ← array of #A_all zeros
  while c < C_limit do
    seq ← an ordering of 0 .. #A_all − 1
    for i ← 0 to #A_all − 1 do
      x' ← transitionWithoutNoise(x, A_all[seq[i]])
      N_root[seq[i]] ← N_root[seq[i]] + 1
      R_root[seq[i]] ← R_root[seq[i]] + simulationInCurling(x')
      c ← c + 1
    end for
  end while
  A_cand ← candidaterootactions(x)
  for a ∈ A_cand do
    evaluate a under the execution-noise distribution of Table 5.1, using the accumulated N_root and R_root
  end for

Each candidate's final value accounts for the noise of Table 5.1, since a throw aimed at one target may realize a neighboring one; the action with the best noise-averaged value is played. A bandit rule such as UCB1 [11] or a variance-aware variant [28] could allocate the root simulations instead of the uniform cyclic order used here.
5.5.4 Simulation

The physics of every simulated throw, including the perturbed ones, is computed with Box2D, the same engine used by the digital curling framework.

5.6 Experimental Setup

The proposed program is evaluated in matches against a program from the GAT digital curling competition⁴, playing 400 games per condition in both 2-end and 10-end formats.

⁴ http://minerva.cs.uec.ac.jp/curling/wiki.cgi (GAT, 2016)

5.7 Results
In digital curling the team that throws last in an end holds a strong advantage: in the measured games the side throwing first won only 0.219 of the time, against 0.781 for the side with the last stone, so the tables below report results per configuration. In each table a row lists game counts followed by the observed win rates, and the p-values test the difference in strength.

Table 5.2: Results of the 2-end matches
  128   34   67   171   0.423   0.617
  289   43    8    60   0.811   (p = 4 × 10⁻¹¹)

Table 5.3: Results of the 10-end matches
  271    8    9   121   0.698   0.803
  356    8    3    33   0.907   (p = 1 × 10⁻⁶⁵)

The win rates of 0.811 in the 2-end matches and 0.907 in the 10-end matches are statistically significant (p = 4 × 10⁻¹¹ and p = 1 × 10⁻⁶⁵, respectively).
6 Conclusion

This thesis proposed a Monte Carlo tree search method for MDPs with continuous state spaces that stores statistics in a hierarchy of progressively refined discretizations and mixes them with depth-dependent weights, and compared it with WUCT and DPW. In the benchmark experiments of Chapter 4, the proposed method and WUCT exploited small numbers of simulations on the Trap Problem far better than DPW, and the proposed method retained its advantage under wall-clock budgets thanks to its low per-simulation cost; on the Treasure Hunt Problem the comparison was more mixed, with DPW remaining competitive, notably on the D = 5 board. In Chapter 5 the method was applied to digital curling, where the resulting program defeated an existing program by a statistically significant margin in both 2-end and 10-end matches.

Several directions remain for future work. A comparison with DPW-RAVE [30], which augments DPW with rapid action value estimates (RAVE), is of interest, as are alternative backpropagation schemes. The parameter functions th, w, and f of Chapter 3 were set by hand; their tuning, and the method's behavior on larger instances such as the D = 15 Treasure Hunt Problem, deserve further study.
References

[1] Poole, D. and Mackworth, A.: Artificial Intelligence, http://artint.info/ (accessed June 29, 2017).
[2] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. and Hassabis, D.: Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature, Vol. 529, pp. 484-489 (2016).
[3] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. and Hassabis, D.: Human-level Control through Deep Reinforcement Learning, Nature, Vol. 518, pp. 529-533 (2015).
[4] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D. and Kavukcuoglu, K.: Asynchronous Methods for Deep Reinforcement Learning, International Conference on Machine Learning (ICML 2016), pp. 1928-1937 (2016).
[5] Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K. and de Freitas, N.: Sample Efficient Actor-Critic with Experience Replay, ICLR 2017 (2017).
[6] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A. and Degris, T.: The Predictron: End-to-end Learning and Planning, https://arxiv.org/abs/1612.08810 (2017).
[7] Genesereth, M. and Thielscher, M.: General Game Playing, Synthesis Lectures on Artificial Intelligence and Machine Learning, Vol. 8, No. 2, pp. 1-229 (2014).
[8] Graf, T. and Platzer, M.: Adaptive Playouts in Monte Carlo Tree Search with Policy Gradient Reinforcement Learning, Advances in Computer Games, Lecture Notes in Computer Science, Vol. 9525, Springer, Cham (2015).
[9] Robbins, H.: Some Aspects of the Sequential Design of Experiments, Bull. Amer. Math. Soc., Vol. 58, No. 5, pp. 527-535 (1952).
[10] Thompson, W. R.: On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, Vol. 25, No. 3-4, pp. 285-294 (1933).
[11] Auer, P., Cesa-Bianchi, N. and Fischer, P.: Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, Vol. 47, pp. 235-256 (2002).
[12] Lai, T. L. and Robbins, H.: Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, Vol. 6, pp. 4-22 (1985).
[13] Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S. and Colton, S.: A Survey of Monte Carlo Tree Search Methods, IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, pp. 1-49 (2012).
[14] Kocsis, L. and Szepesvari, C.: Bandit Based Monte-Carlo Planning, European Conference on Machine Learning (ECML 2006), pp. 282-293 (2006).
[15] Srinivas, N., Krause, A., Kakade, S. M. and Seeger, M.: Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 1015-1022 (2010).
[16] Krause, A. and Ong, C. S.: Contextual Gaussian Process Bandit Optimization, Advances in Neural Information Processing Systems, pp. 2447-2455 (2011).
[17] Weinstein, A.: Local Planning for Continuous Markov Decision Processes, Ph.D. thesis (2014).
[18] Couetoux, A., Hoock, J., Sokolovska, N., Teytaud, O. and Bonnard, N.: Continuous Upper Confidence Trees, International Conference on Learning and Intelligent Optimization, pp. 433-445 (2011).
[19] Couetoux, A.: Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems, Doctoral thesis, Université Paris-Sud (2013).
[20] Chaslot, G., Fiter, C., Hoock, J. P., Rimmel, A. and Teytaud, O.: Adding Expert Knowledge and Exploration in Monte-Carlo Tree Search, Advances in Computer Games, LNCS, Vol. 6048, pp. 1-13 (2009).
[21] IPSJ SIG Technical Report (Game Informatics), 2014-GI-31, No. 2, pp. 1-5 (2014), in Japanese.
[22] Ito, T. and Kitasei, Y.: Proposal and Implementation of Digital Curling, 2015 IEEE Conference on Computational Intelligence and Games, pp. 469-473 (2015).
[23] IPSJ SIG Technical Report (Game Informatics), 2016-GI-36, No. 2, pp. 1-6 (2016), in Japanese.
[24] IPSJ Journal, Vol. 57, No. 11, pp. 2354-2364 (2016), in Japanese.
[25] Yamamoto, M., Kato, S. and Iizuka, H.: Digital Curling Strategy on Game Tree Search, 2015 IEEE Conference on Computational Intelligence and Games, pp. 474-480 (2015).
[26] Curling AI, 2016, pp. 172-179 (2016), in Japanese.
[27] Yee, T., Lisy, V. and Bowling, M.: Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty, International Joint Conference on Artificial Intelligence (IJCAI 2016) (2016).
[28] Audibert, J. Y., Munos, R. and Szepesvari, C.: Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits, Theoretical Computer Science, Vol. 410, No. 19, pp. 1876-1902 (2009).
[29] Zobrist, A.: A New Hashing Method with Applications for Game Playing, ICCA Journal, Vol. 13, No. 2, pp. 69-73 (1970).
[30] Couetoux, A., Milone, M., Brendel, M., Doghmen, H., Sebag, M. and Teytaud, O.: Continuous Rapid Action Value Estimates, The 3rd Asian Conference on Machine Learning, Vol. 20, pp. 19-31 (2011).