2017 31-156861

31-156861 2 30

Contents

1                                          1
2  MDP                                     4
   2.1  5   2.2  5   2.3  6   2.4  7   2.5  9   2.6  9
   2.6.1  Progressive Widening            10
   2.6.2  Double Progressive Widening     10
   2.6.3  Weighted UCT                    12
3                                         13
   3.1  13   3.2  14   3.3  14   3.4  14   3.5  14   3.6  15   3.7  16   3.8  16
   3.9  19   3.9.1  19   3.9.2  20
4                                         21
   4.1  Trap Problem                      21
   4.1.1  21   4.1.2  22   4.1.3  23   4.1.4  24
   4.2  Treasure Hunt Problem             24
   4.2.1  24   4.2.2  25   4.2.3  26   4.2.4  27
5                                         34
   5.1  34   5.2  34   5.3  36   5.4  36   5.5  37
   5.5.1  37   5.5.2  38   5.5.3  40   5.5.4  43
   5.6  43   5.7  43
6                                         45

1

Sequential decision making is modeled in this thesis as a Markov Decision Process (MDP). As illustrated in Fig. 1.1, the agent interacts with the environment over discrete time steps 0, 1, 2, ...: at each step it observes a state s_0, s_1, s_2, ..., chooses an action A_0, A_1, A_2, ..., and receives a reward R_0, R_1, R_2, ....

Figure 1.1: A Markov Decision Process (after [1]).

In Fig. 1.1, taking action A_0 in state s_0 yields the reward R_0 and moves the agent to the next state s_1, and so on. In the following, x \in X denotes a state, a \in A an action, and r a reward, with P the transition probability. The agent's objective is to maximize the cumulative discounted reward

    r_{\mathrm{sum}} = \sum_{t=0}^{\infty} \gamma^t r_t,    (1.1)

where the discount factor \gamma satisfies 0 \le \gamma \le 1; \gamma < 1 gives a properly discounted sum, while \gamma = 1 can be used for MDPs that terminate after a finite number of steps.
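As a small illustration of Eq. (1.1), the following Python sketch accumulates the discounted return of a finite reward sequence; the function name and the example values are illustrative and not taken from the thesis.

    def discounted_return(rewards, gamma):
        """Cumulative discounted reward of Eq. (1.1): sum over t of gamma^t * r_t."""
        total = 0.0
        for t, r in enumerate(rewards):
            total += (gamma ** t) * r
        return total

    # Example: rewards over three steps with discount factor 0.9.
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9^2 * 2.0 = 2.62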

The remainder of this thesis is organized as follows. Chapter 2 reviews search methods for MDPs, Chapter 3 describes the proposed method, Chapter 4 evaluates it on two benchmark problems, Chapter 5 applies it to digital curling, and Chapter 6 concludes.

2

This chapter reviews background on search and learning for MDPs. Silver et al. [2] combine deep neural networks with tree search to master the game of Go. Deep reinforcement learning has also been applied to single-agent control [3][4][5]: the methods of [3] and [4] learn end-to-end from Atari 2600 RGB screen images, and [5] additionally handles 3D physics tasks in the MuJoCo simulator (1). The Predictron [6] learns planning end-to-end for Markov reward processes (MRPs), a restricted form of MDP. General game playing [7] studies players that must handle games specified only at run time.

(1) http://www.mujoco.org/

2.1

A policy determines which action the agent takes in each state.

2.2

A playout (simulation) runs a policy from a given state to estimate its value. Adaptive Playout [8] adjusts the playout policy during the search itself.

A broad survey of Monte Carlo tree search is given in [13], and UCT [14] is the representative algorithm.

2.3

The multi-armed bandit problem [9] can be seen as an MDP with a single state: at every step the player chooses one of several arms and receives a stochastic reward. The player must balance exploiting the arm that currently looks best against exploring arms whose value is still uncertain.

Representative bandit algorithms include Thompson Sampling [10] and the UCB family (UCB1, UCB1-tuned, UCB2) [11]; Thompson Sampling and UCB1 are described below.

Thompson Sampling [10] maintains, for each arm a, a distribution D_a over its value and selects

    a_{TS} = \operatorname{argmax}_{a \in A} v_a,  where  v_a \sim D_a.    (2.1)

In the original formulation [10], rewards take the values 0 or 1.

UCB1 [11] records, for each arm a, the number of times N_a it has been played and the sum of rewards R_a obtained from it, and computes

    v_a = \frac{R_a}{N_a} + C_{UCB} \sqrt{\frac{2 \ln \sum_{a' \in A} N_{a'}}{N_a}},    (2.2)

    a_{UCB1} = \operatorname{argmax}_{a \in A} (v_a).    (2.3)

The constant C_{UCB} controls the strength of the exploration term: larger values explore more. UCB1 is an instance of the principle of optimism in the face of uncertainty (OFU) [12].

2.4

Monte Carlo tree search (MCTS) [13] estimates action values by repeated simulations while growing a search tree. As illustrated in Fig. 2.1, each iteration consists of four phases: Selection, Expansion, Simulation, and Backpropagation.
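The two selection rules can be sketched in a few lines of Python. This is an illustrative sketch only: the Beta posterior used for Thompson Sampling assumes 0/1 rewards, and the function names are not from the thesis.

    import math
    import random

    def ucb1_select(counts, reward_sums, c_ucb=1.0):
        """UCB1 (Eqs. 2.2-2.3): counts[a] = N_a, reward_sums[a] = R_a."""
        total = sum(counts)  # total number of plays over all arms
        best_arm, best_value = 0, -float("inf")
        for a in range(len(counts)):
            if counts[a] == 0:
                return a  # play every arm once before using the bonus term
            value = reward_sums[a] / counts[a] + c_ucb * math.sqrt(2.0 * math.log(total) / counts[a])
            if value > best_value:
                best_arm, best_value = a, value
        return best_arm

    def thompson_select(successes, failures):
        """Thompson Sampling (Eq. 2.1) with Beta posteriors for 0/1 rewards:
        draw v_a ~ Beta(successes[a] + 1, failures[a] + 1) and play the argmax."""
        draws = [random.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(len(successes))]
        return max(range(len(draws)), key=lambda a: draws[a])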

Figure 2.1: The four phases of Monte Carlo tree search (after [13]).

In the Simulation phase the value of the newly expanded node is estimated by a playout. UCT [14] applies UCB1 to the action selection at every node of the tree. In the following, transition(state x, action a) returns the next state and the reward obtained by taking a in x, and γ is the discount factor. Algorithm 1 shows one simulation of UCT, where N_{x,a} is the number of times action a has been selected in state x, R_{x,a} is the sum of rewards credited to that pair, and N_x = \sum_{a \in A_x} N_{x,a}.

Algorithm 1  UCT (one simulation)
  function simulationInUCT(x)
    if x is terminal then return the value of x end if
    a_sim ← argmax_{a ∈ A_x} ( R_{x,a} / N_{x,a} + C_{UCB} √( 2 ln N_x / N_{x,a} ) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInUCT(x')
    N_x ← N_x + 1,  N_{x,a_sim} ← N_{x,a_sim} + 1,  R_{x,a_sim} ← R_{x,a_sim} + r_sim
    return r_sim
  end function

After the simulations, the action with the largest N_{x_0,a} at the root x_0 is played.
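Below is a minimal runnable Python version of the simulation in Algorithm 1. The environment interface (actions, transition, is_terminal, terminal_value) and the class name are assumptions made for this sketch, not part of the thesis.

    import math
    from collections import defaultdict

    class UCTSketch:
        """Illustrative version of Algorithm 1; states must be hashable."""

        def __init__(self, env, c_ucb=1.0, gamma=1.0):
            self.env, self.c_ucb, self.gamma = env, c_ucb, gamma
            self.N = defaultdict(int)     # N_x:     visits of state x
            self.Na = defaultdict(int)    # N_{x,a}: visits of (x, a)
            self.Ra = defaultdict(float)  # R_{x,a}: reward sum of (x, a)

        def simulate(self, x):
            if self.env.is_terminal(x):
                return self.env.terminal_value(x)
            # Selection: UCB1 over the actions available in x.
            best_a, best_v = None, -float("inf")
            for a in self.env.actions(x):
                if self.Na[(x, a)] == 0:
                    best_a = a  # try unvisited actions first
                    break
                v = (self.Ra[(x, a)] / self.Na[(x, a)]
                     + self.c_ucb * math.sqrt(2.0 * math.log(self.N[x]) / self.Na[(x, a)]))
                if v > best_v:
                    best_a, best_v = a, v
            # Simulation: recurse into the sampled successor.
            x_next, r = self.env.transition(x, best_a)
            r_sim = r + self.gamma * self.simulate(x_next)
            # Backpropagation: update the statistics of the visited edge.
            self.N[x] += 1
            self.Na[(x, best_a)] += 1
            self.Ra[(x, best_a)] += r_sim
            return r_sim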

2.5

For problems with continuous states or actions, several extensions of UCB1 have been proposed, including GP-UCB [15] and contextual Gaussian process bandits [16]. Weighted UCB (WUCB) [17] generalizes UCB1 by weighting past samples with a kernel k((state, action), (state, action)); in WUCB the kernel between samples with different actions is zero, i.e. k((s_1, a_1), (s_2, a_2)) = 0 for a_1 ≠ a_2. Let P be the set of past samples (y, b, r) of states, actions, and rewards. WUCB computes

    n(x, a) = \sum_{(y,b,r) \in P} k((x, a), (y, b)),    (2.4)

    r(x, a) = \sum_{(y,b,r) \in P} k((x, a), (y, b)) \, r,    (2.5)

    n(x) = \sum_{a \in A} n(x, a),    (2.6)

    v_a = \frac{r(x, a)}{n(x, a)} + C_{UCB} \sqrt{\frac{2 \ln n(x)}{n(x, a)}},    (2.7)

    a_{UCB1} = \operatorname{argmax}_{a \in A} (v_a),    (2.8)

where r in Eq. (2.5) is the reward recorded with the sample.

2.6

For MDPs with continuous state or action spaces, this section describes two extensions of UCT, Double Progressive Widening and Weighted UCT, which are also the baselines of Chapter 4.
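The WUCB statistics of Eqs. (2.4)-(2.8) can be written directly as a short Python sketch. The sample layout, the kernel argument convention, and the guard for unvisited actions are assumptions of this illustration.

    import math

    def wucb_select(samples, x, actions, kernel, c_ucb=1.0):
        """WUCB action choice (Eqs. 2.4-2.8). samples is the set P of (y, b, r)
        tuples; kernel((x, a), (y, b)) is assumed to return 0 when a != b."""
        n = {a: 0.0 for a in actions}  # n(x, a), Eq. (2.4)
        r = {a: 0.0 for a in actions}  # r(x, a), Eq. (2.5)
        for (y, b, reward) in samples:
            for a in actions:
                w = kernel((x, a), (y, b))
                n[a] += w
                r[a] += w * reward
        n_x = sum(n.values())          # n(x), Eq. (2.6)

        def value(a):                  # v_a, Eq. (2.7)
            if n[a] == 0.0:
                return float("inf")    # force unexplored actions to be tried
            return r[a] / n[a] + c_ucb * math.sqrt(2.0 * math.log(max(n_x, 1.0)) / n[a])

        return max(actions, key=value)  # Eq. (2.8)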

2.6.1 Progressive Widening

When the action set A is very large or continuous, UCB1 cannot be applied to all actions at once. Progressive Widening (PW) therefore expands only a gradually growing subset of actions at each node and applies UCB1 to that subset. Algorithm 2 shows UCT with PW (UCT+PW), where newaction(state x) samples a new candidate action for x, A_x is the set of actions expanded so far at x, and α is the widening parameter.

Algorithm 2  UCT + PW (one simulation)
  function simulationInUCT-PW(x)
    if x is terminal then return the value of x end if
    if N_x^α ≥ #A_x then A_x ← A_x ∪ { newaction(x) } end if
    a_sim ← argmax_{a ∈ A_x} ( R_{x,a} / N_{x,a} + C_{UCB} √( 2 ln ( \sum_{a ∈ A_x} N_{x,a} ) / N_{x,a} ) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInUCT-PW(x')
    N_x ← N_x + 1,  N_{x,a_sim} ← N_{x,a_sim} + 1,  R_{x,a_sim} ← R_{x,a_sim} + r_sim
    return r_sim
  end function

With PW, a new action is added whenever N_x^α reaches the number of expanded actions, so the number of candidate actions at a node grows on the order of N_x^α with 0 < α < 1.

2.6.2 Double Progressive Widening

When the state transition is stochastic over a continuous range, each action can lead to a different successor state on every visit, so the search never revisits and deepens its successors. Double Progressive Widening (DPW) [18] addresses this by applying progressive widening not only to the actions but also to the successor states of each state-action pair; the resulting algorithm, UCT+DPW, is shown in Algorithm 3 (after [18]).

The differences from Algorithm 2 are the additional widening test and the way the successor state is chosen. Algorithm 3 uses the same action-widening parameter α as PW together with a second parameter β for widening the set of successor states.

Algorithm 3  DPW (one simulation)
  M_{x,a}: set of successor pairs (x', r) recorded for (x, a)
  N_{x,a,(x',r)}: number of times (x', r) was reached from (x, a)
  function simulationInDPW(x)
    if x is terminal then return the value of x end if
    if N_x^α ≥ #A_x then A_x ← A_x ∪ { newaction(x) } end if
    a_sim ← argmax_{a ∈ A_x} ( R_{x,a} / N_{x,a} + C_{UCB} √( 2 ln ( \sum_{a ∈ A_x} N_{x,a} ) / N_{x,a} ) )
    if N_{x,a_sim}^β ≥ #M_{x,a_sim} then M_{x,a_sim} ← M_{x,a_sim} ∪ { transition(x, a_sim) } end if
    x', r ← a pair drawn from M_{x,a_sim} with probability N_{x,a_sim,(x',r)} / \sum_{(x'',r'') ∈ M_{x,a_sim}} N_{x,a_sim,(x'',r'')}
    r_sim ← r + γ · simulationInDPW(x')
    N_x ← N_x + 1,  N_{x,a_sim} ← N_{x,a_sim} + 1,  R_{x,a_sim} ← R_{x,a_sim} + r_sim,  N_{x,a_sim,(x',r)} ← N_{x,a_sim,(x',r)} + 1
    return r_sim
  end function

Progressive widening thus lets UCT handle both continuous actions and continuous stochastic transitions.
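The two widening tests are simple threshold checks; the sketch below states them explicitly. The helper names and the commented successor step are illustrative, not the thesis's own code.

    def needs_new_action(n_x, num_expanded_actions, alpha):
        """Progressive Widening (Sec. 2.6.1): expand a new action at state x
        whenever N_x ** alpha has caught up with the number of expanded actions."""
        return n_x ** alpha >= num_expanded_actions

    def needs_new_successor(n_xa, num_recorded_successors, beta):
        """Double Progressive Widening (Sec. 2.6.2): likewise widen the set of
        recorded successor states of (x, a) at the rate N_{x,a} ** beta."""
        return n_xa ** beta >= num_recorded_successors

    # Sketch of the DPW successor step (variable names illustrative):
    #   if needs_new_successor(N_xa, len(M_xa), beta):
    #       M_xa.append(transition(x, a))   # sample and record a new (x', r)
    #   else:
    #       draw (x', r) from M_xa with probability proportional to its visit count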

2.6.3 Weighted UCT

Whereas DPW copes with continuous spaces by discretizing them on the fly, Weighted UCT (WUCT) [17] replaces the UCB1 rule inside the tree search by WUCB, so that every past sample contributes to every decision through the kernel. Algorithm 4 shows one simulation of WUCT; P is the set of past samples (state, action, reward).

Algorithm 4  WUCT (one simulation)
  function simulationInWUCT(x)
    if x is terminal then return the value of x end if
    n ← 0,  r ← 0  (arrays indexed by the actions in A)
    for (y, a, r') ∈ P do
      n[a] ← n[a] + k((x, a), (y, a))
      r[a] ← r[a] + k((x, a), (y, a)) · r'
    end for
    n_sum ← \sum_{a ∈ A} n[a]
    a_sim ← argmax_{a ∈ A} ( r[a] / n[a] + C_{UCB} √( 2 ln n_sum / n[a] ) )
    x', r ← transition(x, a_sim)
    r_sim ← r + γ · simulationInWUCT(x')
    P ← P ∪ { (x, a_sim, r_sim) }
    return r_sim
  end function

3

3.1

Chapter 2 reviewed DPW [18] and WUCT [17]; this chapter presents the proposed method. Figure 3.1 gives an overview of the approach.

Figure 3.1: Overview of the proposed method.

3.1 [0, 1) x x UCT 3.2 3.1 D 1 2 D D 2 D 5 30 3.3 1 4, 5 3.4, a s a N s,a R s,a N s a 1 N s,a R s,a 2 3.5 14

The threshold th(d) specifies how many visits a node at depth d must accumulate before the search proceeds to a finer node one level deeper.

3.6

Each depth d is also assigned a weight w(d). For a state x and an action a, let S be the set of state-tree nodes consulted for x and let d(s) denote the depth of node s. Using the statistics N_{s,a} and R_{s,a} stored at each node, the weighted visit count and reward are

    \tilde{n}(x, a) = \sum_{s \in S} w(d(s)) \, N_{s,a},    (3.1)

    \tilde{r}(x, a) = \sum_{s \in S} w(d(s)) \, R_{s,a},    (3.2)

and, summing over the actions available in x,

    \tilde{n}(x) = \sum_{a \in A} \tilde{n}(x, a).    (3.3)

Equivalently, using the per-node totals N_s, \tilde{n}(x) = \sum_{s \in S} w(d(s)) \, N_s. The weighted quantities \tilde{n}(x, a) and \tilde{r}(x, a) are converted with a function f before UCB1 is applied:

    n(x) = f(\tilde{n}(x)),    (3.4)

    n(x, a) = \tilde{n}(x, a) \, \frac{n(x)}{\tilde{n}(x)},    (3.5)

    r(x, a) = \tilde{r}(x, a) \, \frac{n(x)}{\tilde{n}(x)},    (3.6)

so that n(x) = \sum_{a \in A} n(x, a). UCB1 [11] is then applied to these rescaled statistics, selecting

    a_{try} = \operatorname{argmax}_{a \in A} \left( \frac{r(x, a)}{n(x, a)} + C_{UCB} \sqrt{\frac{2 \ln n(x)}{n(x, a)}} \right),    (3.7)

where C_{UCB} is the UCB1 exploration constant.

3.7

Algorithms 5 and 6 give the simulation procedures of the proposed method.

3.8

Algorithm 7 shows the root-level loop that repeatedly runs one simulation and finally returns an action.
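The following Python sketch shows how the weighted statistics of Eqs. (3.1)-(3.7) can be combined into one selection step. It illustrates the equations only and is not a reconstruction of Algorithms 5-7; the data layout (a list of per-node depths and statistics) and all names are assumptions.

    import math

    def weighted_ucb_action(nodes, actions, w, f, c_ucb=1.0):
        """Action selection following Eqs. (3.1)-(3.7).
        nodes: iterable of (depth, N_sa, R_sa), where N_sa / R_sa map each action
        to the visit count / reward sum stored at one state-tree node."""
        n_tilde = {a: 0.0 for a in actions}  # \tilde{n}(x, a), Eq. (3.1)
        r_tilde = {a: 0.0 for a in actions}  # \tilde{r}(x, a), Eq. (3.2)
        for depth, N_sa, R_sa in nodes:
            weight = w(depth)
            for a in actions:
                n_tilde[a] += weight * N_sa.get(a, 0)
                r_tilde[a] += weight * R_sa.get(a, 0.0)
        n_tilde_x = sum(n_tilde.values())    # \tilde{n}(x), Eq. (3.3)
        if n_tilde_x == 0.0:
            return actions[0]                # no statistics recorded yet
        n_x = f(n_tilde_x)                   # n(x) = f(\tilde{n}(x)), Eq. (3.4)

        def ucb(a):                          # Eqs. (3.5)-(3.7)
            n_xa = n_tilde[a] * n_x / n_tilde_x
            r_xa = r_tilde[a] * n_x / n_tilde_x
            if n_xa == 0.0:
                return float("inf")          # unexplored actions are tried first
            return r_xa / n_xa + c_ucb * math.sqrt(2.0 * math.log(max(n_x, 1.0)) / n_xa)

        return max(actions, key=ucb)

The depth weight w and the count transform f are passed in as functions; the experiments in Chapter 4 use simple closed-form schedules for them.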

Algorithm 5 N s,a s a R s,a s a N s s nextstate(state, action) w(depth) th(depth) f(frequency) N prior, R prior x function simulation(x) if x then return x end if A x ñ #A N prior r #A R prior s x, r sim simulationonstatet ree(x, 0, s, A, ñ, r) return r sim function simulationonstatet ree(x, d, s, A, ñ, r) W w(d) for i 0 to #A 1 do a A[i] ñ[i] ñ[i] + W N s,a, r[i] r[i] + W R s,a end for if N s th(d) s x then x s s end if if s x then ñ sum summation of ñ n sum f(ñ sum ) n #A i ñ[i] n sum /ñ sum r #A i r[i] n sum /ñ sum a sim A[argmax i 0..#A 1 ( r[i] n[i] + C UCB 2 ln n sum n[i] x, r transition(x, a) r sim r + γ simulation(x ) else s x s a sim, r sim simulationonstatet ree(x, d + 1, s, A, ñ, r) end if N s,asim N s,asim + 1, R s,asim R s,asim + r sim N s N s + 1 return a sim, r sim 17 ) ]

Algorithm 6 N s,a s a R s,a s a nextstate(state, action) w(depth) th(depth) f(frequency) N prior, R prior x function simulation(x) if x then return x end if A x ñ #A N prior r #A R prior s x a, r sim simulationonstatet ree(x, 0, s, A, ñ, r) return r sim function simulationonstatet ree(x, d, s, A, ñ, r) W w(d) for i 0 to #A 1 do a A[i] ñ[i] ñ[i] + W N s,a, r[i] r[i] + W R s,a end for if N s,a th(d) s x then x s s end if if s x then ñ sum summation of ñ n sum f(ñ sum ) n #A i ñ[i] n sum /ñ sum r #A i r[i] n sum /ñ sum a sim A[argmax i 0..#A 1 ( r[i] n[i] + C UCB 2 ln n sum n[i] x, r transition(x, a) r sim r + γ simulation(x ) else s x s a sim, r sim simulationonstatet ree(x, d + 1, s, A, ñ, r) end if N s,asim N s,asim + 1, R s,asim R s,asim + r sim return a sim, r sim 18 ) ]

Algorithm 7 rootaction(state) (ex. UCB1) x root while do a root rootaction(x root ) x root, r root transition(x root, a root ) r sim r root + γ simulation(x root) r sim a root end while return 1 4 3.9 3.9.1 WUCB WUCT,, n #A O(#An) n O(n 2 ) O(#A ) 1. th(d) O(n) 0 MDP WUCT 2. th(d) d th(d) d d 1 th(d) 19

n O(logn) th(d) 3.9.2 WUCT O(n) 1 O(n) 3.4 O(1) O(#A) D d O(Dd) O(1) O(Dd) W UCT 20

4

The proposed method is evaluated against existing methods on two benchmark problems taken from [18] and [19]. In both problems the discount factor is γ = 1.

4.1 Trap Problem

The Trap Problem follows [18].

4.1.1 Problem setting

The state is a scalar x with initial value x_0 = 0. At each step t the agent chooses an action a_t from eight candidates a_i = (i + 0.5)/8 for 0 ≤ i < 8, so all actions lie in [0, 1] [18]. A noise term d drawn uniformly from [0, R) is added, and the state becomes x' = x + a + d. An episode consists of two actions (t = 0 and t = 1), and the reward is determined by the position reached at t = 2: it is 0.7 for x < 1, 1 for x ≥ 1.7, and 0 inside the trap interval l ≤ x < h between them. Following [18], the noise range is set to R = 0.01 and R = 0.1.
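A small environment sketch in Python, following the reading of the damaged problem statement given above; the exact reward boundaries and the trap interval [l, h) are assumptions that should be checked against [18].

    import random

    def trap_step(x, action_index, R, n_actions=8):
        """One step of the Trap Problem (Sec. 4.1.1): add the chosen action
        a_i = (i + 0.5) / 8 and uniform noise d in [0, R) to the position x."""
        a = (action_index + 0.5) / n_actions
        d = random.uniform(0.0, R)
        return x + a + d

    def trap_reward(x, trap_low, trap_high):
        """Terminal reward as reconstructed above: 0 inside the trap interval,
        1 at or beyond 1.7, 0.7 below 1, and 0 otherwise."""
        if trap_low <= x < trap_high:
            return 0.0
        if x >= 1.7:
            return 1.0
        if x < 1.0:
            return 0.7
        return 0.0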

To obtain the reward of 1, the agent must reach x ≥ 1.7 by t = 2, which requires choosing actions close to 1 at both t = 0 and t = 1; a search that evaluates the first step myopically tends to end up around 1.4 and fall into the trap, which is what makes this problem difficult for plain UCT [18].

4.1.2 Experimental setup

The proposed method is compared with DPW and WUCT. DPW is run with C_{UCB} = 1, α = 0.3, and β = 0.25, and WUCT with C_{UCB} = 1. For WUCT, the kernel over the sample set P is defined as follows: for the i-th sample (x_i, a_i) in P,

    σ_{i,a} = 0.5 \left( 1 + \sum_{j \in 0..i-1} k((x, a), (x_j, a_j)) \right)^{-1/2},    (4.1)

    k((x, a), (x_i, a_i)) = \frac{1}{\sqrt{2π} \, σ_{i,a}} \exp\!\left( -\frac{(x - x_i)^2}{2 σ_{i,a}^2} \right),    (4.2)

so the bandwidth starts at 0.5 and shrinks as kernel mass accumulates near (x, a). Evaluating the kernel for one decision costs O(#P).

The proposed method uses C_{UCB} = 1, th(d) = 8 · 1.5^d, w(d) = (d + 1)^2, and f(\tilde{n}_{sum}) = \tilde{n}_{sum}^{2/3}. The state range covered by the state tree is [−2R, 2 + 2R], and each node stores the (s, a) statistics of Chapter 3. At the root (t = 0), UCB1 with C_{UCB} = 1 is used to choose the action actually played. All experiments were run on an Intel Core i7-4790K (4.00 GHz).

4.1.3 Conditions

Each method is run with R = 0.01 and R = 0.1, with budgets of 1, 10, 100, 1000, and 10000 simulations per decision and, separately, with time limits of 1, 10, 100, and 1000 ms per decision. Each setting is evaluated over 10000 episodes, except that the 1000 ms time limit uses 1000 episodes.

4.1.4 Results

Figures 4.1 and 4.2 show the results for R = 0.01 and R = 0.1, plotted against the number of simulations (Fig. 4.1) and against the time limit per decision (Fig. 4.2), comparing the proposed method with WUCT and DPW.

4.2 Treasure Hunt Problem

The Treasure Hunt Problem follows [19].

4.2.1 Problem setting

Figure 4.3 illustrates the problem [19]. The agent moves on the square [0, D] × [0, D], starting at Start = (0, 0). The treasure occupies the corner region D − G ≤ x ≤ D, D − G ≤ y ≤ D, and entering it gives the reward R_{treasure}. A hole of width H occupies the central region (D − H)/2 ≤ x ≤ (D + H)/2, (D − H)/2 ≤ y ≤ (D + H)/2, and entering it gives R_{hole}. At each step the agent moves a distance A in a direction θ of its choice, perturbed by a noise of magnitude ε, and receives the per-step reward R_{time}; the episode ends at t = T [19]. Positions are clipped to the square:

    x_{t+1} = max(0, min(x', D)),    y_{t+1} = max(0, min(y', D)).

The corners (0, 0), (0, D), and (D, 0) lie outside both the treasure and the hole (cf. Section 3.9.1). The parameters follow [19]: D = 5 and D = 15, G = 1, H = D/2, A = 1, ε = 1, R_{treasure} = 1, R_{hole} = −0.5, R_{time} = −0.001, and T = 10D.

4.2.2 Experimental setup

WUCT and DPW are again the baselines. Since the direction θ is continuous, progressive widening over actions is used with α = 0.2; for this problem DPW uses C_{UCB} = 0.3, unlike in the Trap Problem, while WUCT keeps C_{UCB} = 1. The WUCT kernel is the same as in the Trap Problem except that the squared distance (x − x_i)^2 in Eq. (4.2) is replaced by (x − x_i)^2 + (y − y_i)^2.

In Eq. (4.1), the initial bandwidth 0.5 used for the Trap Problem is replaced by D/2, since the state space is now [0, D] × [0, D]. Several variants of WUCT were also run.

4.2.3 Conditions

For both D = 5 and D = 15, each method is run with 1, 10, 100, 1000, and 10000 simulations per decision and with time limits of 1, 10, 100, and 1000 ms per decision; each setting is evaluated over 10000 episodes, except for the largest budgets, which use 1000 episodes.

4.2.4 4.4 4.5 - D = 5 D = 15 4.6-4.7-4.4 D = 5 D = 15 1 2 D = 15 DPW D = 5 D = 15 4.5 DPW 1ms 10ms WUCT D = 5 D = 15 D = 5 DPW DPW D = 15 1ms WUCT 1 1ms 10 4.6 WUCT 100 10 100 4.7 4.4 4.5 D = 15 WUCT 1 DPW 10000 1000ms 27

Figure 4.1: Trap Problem.

Figure 4.2: Trap Problem.

Figure 4.3: Treasure Hunt Problem [19].

Figure 4.4: Treasure Hunt Problem.

Figure 4.5: Treasure Hunt Problem.

Figure 4.6: Treasure Hunt Problem.
Figure 4.7: Treasure Hunt Problem.

5

This chapter applies the proposed method to digital curling.

5.1

In curling, the two teams alternately deliver eight stones each per end, and the team whose stones finish closest to the center of the house scores (Fig. 5.1). A full game consists of 10 ends.

5.2

Digital curling [21][22] is a simulated version of the game used as a platform for game AI research.

Figure 5.1: Digital curling.

The digital curling environment [23] simulates the physics of the stones with the Box2D physics engine (1).

(1) http://box2d.org/

5.3

Prior work on computer curling includes game-tree search [24][25], expectimax search [26], and KR-UCT (kernel regression UCT) by Yee et al. [27], which handles continuous actions with execution uncertainty.

5.4

A turn of digital curling is formulated as an MDP in which the action a = (v_x, v_y, w) specifies the initial velocity (v_x, v_y) and the rotation w of the delivered stone, and execution noise is added to the shot. The reward is 1 for a win and 0 for a loss, and Table 5.1 lists the probability assigned to each end score.

Table 5.1: Probability assigned to each score.

    Score  Probability      Score  Probability
      8    0.000195461       −8    0.00000547291
      7    0.000781844       −7    0.0000547291
      6    0.00312738        −6    0.0000547291
      5    0.0125095         −5    0.000547291
      4    0.0390922         −4    0.00547291
      3    0.070366          −3    0.0156369
      2    0.195461          −2    0.0547291
      1    0.390922          −1    0.172006
      0    0.0390922

5.5

5.5.1

States are placed in the state tree of Section 3.1 using the x and y coordinates of the stones. Each coordinate is refined by 15 successive binary splits, so the finest cells are about 0.5 mm wide over the roughly 15.8 m playing area, and tree nodes are identified with hash keys in the manner of Zobrist hashing [29].

5.5.2

The threshold, weight, and count transform are set to

    th(d) = 1.3^d,    (5.1)

    w(d) = (d + 1)^4,    (5.2)

    f(\tilde{n}_{sum}) = \tilde{n}_{sum}^{0.4},    (5.3)

and UCB1 is used with C_{UCB} = 1. Algorithm 8 adapts Algorithm 5 to curling: it keeps the same statistics N_{s,a}, R_{s,a}, N_s and the same functions w(depth), th(depth), f(frequency), uses γ = 1 and the curling simulator as transition(state, action), and additionally introduces a prior count N_{new} for actions that have not yet been visited at a node.

Algorithm 8 A s s R s s N new N policy function simulationincurling(x) if x then return x end if A x x ñ 0 r 0 s x, r sim simulationonstatet ree(x, 0, s, A x, ñ, r) return r sim function simulationonstatet reeincurling(x, d, s, A x, ñ, r) W w(d) for a A x do if a A s then ñ[a] ñ[a] + W N s,a, r[a] r[a] + W R s,a else ñ[a] ñ[a] + W N new, r[a] r[a] + R s /N s end if end for if N s th(d) s x then x s s N s 0 end if if s x then ñ sum summation of ñ n sum f(ñ sum ) n A x a ñ[a] n sum /ñ sum r A x a r[a] n sum /ñ sum n sum n sum + N policy #A ( x a sim A[argmax r[i] + C i 0..#A 1 n[i] UCB ) 2 ln n sum ] n[i] r sim simulationincurling(transition(x, a sim )) else s x s a sim, r sim simulationonstatet reeincurling(x, d + 1, s, A, ñ, r) end if if s N s,asim, s,asim then N s,asim N new, R s,asim 0, N s N s + N new end if 39 N s,asim N s,asim + 1, R s,asim R s,asim + r sim N s N s + 1, R s R s + r sim return a sim, r sim

2 N new 3 n sum N policy #A x 2 r[a] R s /N s W 3 n(x, a) r(x, a) N policy n sum 8 5.5.3 3 1 2 4 3 [24] l ( l 2, 2l) 63 31 ( l 4, ) 189 1 2142 4284 [24] 5.2 3 URL 40

5.2: [24] 5.2 Q p Q f 1440 2142 [24] 5.2 3 2 [24][25] [25] 2142 1440 2880 UCB 41

1 9 transitionw ithoutnoise(state, action) Algorithm 9 allrootactions(state) candidaterootactions(state) C limit c 0 A all allrootactions(x) N root #A all 0 R root #A all 0 while c < C limit do seq 0 #A all 1 for i 0 to #A all 1 do x transitionw ithoutnoise(x, A all [seq[i]]) N root [seq[i]] N root [seq[i]] + 1 R root [seq[i]] R root [seq[i]] + simulationincurling(x) c c + 1 end for end while A cand candidaterootactions(x) for a A cand do end for 3 3 5.1 0 1 UCB1[11] [28] 42

5.5.4 4 Box2D 1 5.6 1 1 4284 47 201348 400 800 GAT 4 2 10 5.7 0.219 4 http://minerva.cs.uec.ac.jp/curling/wiki.cgi GAT (2016) 43

The win rate was 0.219 when playing first and 0.781 when playing second.

Table 5.2: Results (2-end games).
    1   128   34   67   171   0.423   0.617
    1   289   43    8    60   0.811   (p = 4 × 10^{−11})

Table 5.3: Results (10-end games).
    1   271    8    9   121   0.698   0.803
    1   356    8    3    33   0.907   (p = 1 × 10^{−65})

6

This thesis compared the proposed method with WUCT and DPW, the two existing approaches reviewed in Chapter 2, on the Trap Problem and the Treasure Hunt Problem in Chapter 4, and applied it to digital curling in Chapter 5. A remaining direction for future work is a comparison with DPW-RAVE [30], which combines DPW with rapid action value estimates and has been evaluated on the Treasure Hunt Problem with D = 5; incorporating a similar idea into the backpropagation of the proposed method is another possibility.

2 3 Treasure Hunt Problem D = 15 46


References

[1] Poole, D., Mackworth, A.: Artificial Intelligence, http://artint.info/ (accessed 2017-06-29)
[2] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search, Nature, Vol. 529, pp. 484-503 (2016)
[3] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning, Nature, Vol. 518, pp. 529-533 (2015)
[4] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937 (2016)
[5] Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N.: Sample efficient actor-critic with experience replay. ICLR 2017 (2017)
[6] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., Degris, T.: The predictron: End-to-end learning and planning, https://arxiv.org/abs/1612.08810 (2017)
[7] Genesereth, M., Thielscher, M.: General game playing. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(2), pp. 1-229 (2014)
[8] Graf, T., Platzer, M.: Adaptive Playouts in Monte Carlo Tree Search with Policy Gradient Reinforcement Learning, Advances in Computer Games, Lecture Notes in Computer Science, Vol. 9525, Springer, Cham (2015)
[9] Robbins, H.: Some aspects of the sequential design of experiments, Bull. Amer. Math. Soc. 58, No. 5, pp. 527-535 (1952)
[10] Thompson, W. R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, 25 (3-4), pp. 285-294 (1933)

[11] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, Vol. 47, pp. 235-256 (2002)
[12] Lai, T. L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, Vol. 6, pp. 4-22 (1985)
[13] Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, pp. 1-49 (2012)
[14] Kocsis, L., Szepesvari, C.: Bandit based Monte-Carlo Planning. European Conference on Machine Learning (ECML 2006), pp. 282-293 (2006)
[15] Srinivas, N., Krause, A., Kakade, S. M., Seeger, M.: Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 1015-1022 (2010)
[16] Krause, A., Ong, C. S.: Contextual Gaussian Process Bandit Optimization, Advances in Neural Information Processing Systems, pp. 2447-2455 (2011)
[17] Weinstein, A.: Local Planning for Continuous Markov Decision Processes. Ph.D. thesis (2014)
[18] Couetoux, A., Hoock, J., Sokolovska, N., Teytaud, O., Bonnard, N.: Continuous upper confidence trees. International Conference on Learning and Intelligent Optimization, pp. 433-445 (2011)
[19] Couetoux, A.: Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems. Doctoral thesis, Paris University (2013)
[20] Chaslot, G., Fiter, C., Hoock, J. P., Rimmel, A., Teytaud, O.: Adding expert knowledge and exploration in Monte-Carlo Tree Search, Advances in Computer Games, LNCS, Vol. 6048, pp. 1-13 (2009)
[21] 2014-GI-31, No. 2, pp. 1-5 (2014) (in Japanese)
[22] Ito, T., Kitasei, Y.: Proposal and Implementation of Digital Curling, 2015 IEEE Conference on Computational Intelligence and Games, pp. 469-473 (2015)
[23] 2016-GI-36, No. 2, pp. 1-6 (2016) (in Japanese)
[24] Vol. 57, No. 11, pp. 2354-2364 (2016) (in Japanese)

[25] Yamamoto, M., Kato, S., Iizuka, H.: Digital Curling Strategy on Game Tree Search, 2015 IEEE Conference on Computational Intelligence and Games, pp. 474-480 (2015)
[26] AI, 2016, pp. 172-179 (2016) (in Japanese)
[27] Yee, T., Lisy, V., Bowling, M.: Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty. International Joint Conference on Artificial Intelligence (2016)
[28] Audibert, J. Y., Munos, R., Szepesvari, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Theoretical Computer Science, 410(19), pp. 1876-1902 (2009)
[29] Zobrist, A.: A New Hashing Method with Applications for Game Playing. ICCA Journal, Vol. 13, No. 2, pp. 69-73 (1970)
[30] Couetoux, A., Milone, M., Brendel, M., Doghmen, H., Sebag, M., Teytaud, O.: Continuous rapid action value estimates. The 3rd Asian Conference on Machine Learning, Vol. 20, pp. 19-31 (2011)