(EC2013) October 2013

Evaluating Human-like Video-Game Agents Autonomously Acquired with Biological Constraints

Fujii Nobuto 1,2,a)   Sato Yuichi 1   Wakama Hironori 1   Kazai Koji 1,b)   Katayose Haruhiro 1,c)

Abstract: While various systems aimed at automatically acquiring behavioral patterns have been proposed, and some have successfully obtained patterns stronger than those of human players, the acquired patterns have looked mechanical. We propose the autonomous acquisition of human-like NPC behaviors that emulate the behaviors of human players. In our previous study, the behaviors were acquired using reinforcement-learning and pathfinding techniques on which biological constraints were imposed. In this paper, we evaluate the acquired behavioral patterns through subjective assessments of their human-likeness and discuss the feasibility of the proposed system.

1 Graduate School of Science and Technology, Kwansei Gakuin University
2 DC2 Research Fellow of Japan Society for the Promotion of Science
a) nobuto@kwansei.ac.jp
b) kazai@kwansei.ac.jp
c) katayose@kwansei.ac.jp

1. Introduction

Various systems that automatically acquire the behavioral patterns of COM (computer-controlled) players have been proposed [1], [2], [3].
The acquisition of human-like COM players has also been studied [4], [5]. Our previous study [6] proposed a method that acquires such behaviors on Infinite Mario Bros. This paper is organized as follows: Section 2 reviews related work, Section 3 describes the biological constraints, Section 4 describes the implementation on Infinite Mario Bros., and Section 5 reports the evaluation experiment.

2. Related Work

2.1 Acquiring Strong COM Players

Much prior work has aimed at acquiring COM players that play strongly [1], [2]. In shogi, Bonanza [7] introduced machine learning of its evaluation function, and later work refined such optimization [3], [8]. Robin Baumgarten's A*-based COM player won the 2009 Mario AI Competition [1], which used Infinite Mario Bros. as its platform [9]. Fujita and Ishii applied reinforcement learning based on Q-learning to the four-player card game Hearts [2]. These COM players play strongly, but their behavior looks mechanical rather than human-like.

2.2 Acquiring Human-like COM Players

Jacob Schrum and colleagues won the 2012 edition of The 2K BotPrize [4], a competition that judges the humanness of COM players in a first-person shooter, by combining neuroevolution of combat behavior with replay of human traces.
Human-likeness of game AI has also been discussed for other games [5].

3. Biological Constraints

3.1 Background

Human motor control is not noise-free: Cabrera and Milton report on-off intermittency in a human balancing task [10], [11]. Maslow's theory of human motivation [12] organizes human needs into a five-level hierarchy: 1) physiological needs, 2) safety needs, 3) needs of love and belonging, 4) esteem needs, and 5) self-actualization needs.

3.2 Constraints Imposed on the COM Player

Drawing on these observations [10], [11], [12], four biological constraints, (1)-(4), are imposed on the COM player's sensing and motor control so that the acquired behavioral patterns fluctuate in the way a human player's do.
4. Implementation

4.1 Q-learning

The COM player's behavior is acquired with Q-learning [13]. At time t, in state s_t, the player takes the action a_t that maximizes the action value:

    argmax_{a_t} Q(s_t, a_t)                                              (1)

where Q(s_t, a_t) is the value of taking action a_t in state s_t. After each step, the Q value is updated by

    Q(s_t, a_t) = (1 - α) Q(s_t, a_t) + α (r + γ max_p Q(s_{t+1}, p))     (2)

where α is the learning rate, r is the reward obtained by taking a_t in s_t, and γ (0 ≤ γ ≤ 1) is the discount rate. Actions are selected with the ε-greedy strategy: with probability 1 − ε the Q-maximizing action is chosen, and with probability ε an action is chosen at random [2], [14]. The definitions of the state s_t and the reward r are given in Section 4.4.

4.2 A* Search

The A*-based COM player described in Section 2.1 searches with the evaluation function

    f(n) = g(n) + h(n)                                                    (3)

where, for a node n, g(n) is the cost of the path from the start node to n and h(n) is the estimated cost from n to the goal; A* repeatedly expands the node with the smallest f(n).

4.3 Infinite Mario Bros.

The target platform is Infinite Mario Bros. (Fig. 1), the Super Mario Bros. clone used in the Mario AI Competition [9]. The player controls Mario with five keys (LEFT, RIGHT, DOWN, SPEED, JUMP), and the screen is represented as a 22 × 22 grid of cells.
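The update rule of Eq. (2) and the ε-greedy selection of Eq. (1) can be sketched with a tabular Q-function. This is a minimal illustration, not the paper's implementation; the Q-table is a dict keyed by (state, action), and ALPHA, GAMMA, and EPSILON are illustrative values, not the paper's settings.

```python
import random

# Illustrative hyperparameters (not the paper's values).
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def update_q(q, s, a, r, s_next, actions):
    """Eq. (2): Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_p Q(s',p)).
    Unvisited entries default to 0.0."""
    best_next = max(q.get((s_next, p), 0.0) for p in actions)
    q[(s, a)] = (1 - ALPHA) * q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next)

def select_action(q, s, actions, rng=random):
    """Epsilon-greedy: with probability epsilon pick a random action,
    otherwise take the action of Eq. (1), argmax_a Q(s, a)."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((s, a), 0.0))
```

With ε = 0.2 the agent explores on roughly one step in five, which is how the stochastic, slightly "imperfect" play of the ε = 0.2 condition in Section 5 arises.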
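Eq. (3) can be illustrated with a minimal A* on a 2D grid. The 4-connected grid, unit step costs, and Manhattan-distance heuristic below are illustrative assumptions; they are not the Mario-specific g(n) and h(n) of the A* COM player in [1].

```python
import heapq

def astar(start, goal, walls, width, height):
    """Return the length of a shortest 4-connected path from start to
    goal on a width x height grid, or None if no path exists."""
    def h(n):  # admissible heuristic: Manhattan distance to the goal
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])
    open_heap = [(h(start), 0, start)]  # entries are (f(n), g(n), n)
    best_g = {start: 0}
    while open_heap:
        f, g, n = heapq.heappop(open_heap)  # expand smallest f(n)
        if n == goal:
            return g
        if g > best_g.get(n, float("inf")):
            continue  # stale heap entry; a cheaper path to n was found
        x, y = n
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue  # off the grid
            if nxt in walls:
                continue  # blocked cell
            g2 = g + 1
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_heap, (g2 + h(nxt), g2, nxt))
    return None
```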
Fig. 1  A screenshot of Infinite Mario Bros.

4.4 State, Action, and Reward Design

The state s given to Q-learning is the 7 × 7 block of cells surrounding Mario, extracted from the 22 × 22 cell screen. The action a is one of the 11 key combinations listed in Table 1, each a tuple of ON/OFF flags for the five keys.

Table 1  The 11 actions (LEFT, RIGHT, DOWN, JUMP, SPEED)
  (OFF, ON, OFF, OFF, OFF)   (OFF, ON, OFF, OFF, ON)
  (OFF, ON, OFF, ON, OFF)    (OFF, ON, OFF, ON, ON)
  (ON, OFF, OFF, OFF, OFF)   (ON, OFF, OFF, OFF, ON)
  (ON, OFF, OFF, ON, OFF)    (ON, OFF, OFF, ON, ON)
  (OFF, OFF, OFF, ON, OFF)   (OFF, OFF, ON, OFF, OFF)
  (OFF, OFF, OFF, OFF, OFF)

The reward r is the sum of four terms:

    r = distance + damaged + death + keypress        (4)
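The state, action, and reward design of Section 4.4 can be sketched as follows. The ON/OFF tuples follow Table 1, and the reward uses the per-term weights tabulated in the next section (distance 2.0, damaged −50.0, death −100.0, keypress −5.0); the function names and the exact per-event semantics of each term (per-cell progress, per-event penalties) are assumptions for illustration.

```python
# The 11 key combinations (LEFT, RIGHT, DOWN, JUMP, SPEED); 1 = ON.
ACTIONS = [
    (0, 1, 0, 0, 0), (0, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 0, 1, 1),
    (1, 0, 0, 0, 0), (1, 0, 0, 0, 1), (1, 0, 0, 1, 0), (1, 0, 0, 1, 1),
    (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 0, 0, 0),
]

def state_window(screen, row, col, size=7):
    """State s: the size x size block of the 22 x 22 cell screen
    centered on Mario at (row, col); off-screen cells read as 0."""
    half = size // 2
    return tuple(
        tuple(
            screen[r][c] if 0 <= r < len(screen) and 0 <= c < len(screen[0]) else 0
            for c in range(col - half, col + half + 1)
        )
        for r in range(row - half, row + half + 1)
    )

def reward(cells_advanced, was_damaged, died, keys_changed):
    """Eq. (4): r = distance + damaged + death + keypress."""
    return (2.0 * cells_advanced      # progress toward the goal
            - 50.0 * was_damaged      # took damage
            - 100.0 * died            # died
            - 5.0 * keys_changed)     # changed the pressed keys
```

The keypress penalty is worth noting: it discourages the rapid, jittery key changes that make machine play look mechanical.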
Fig. 2  The Q-learning COM player and the A* COM player

The four terms of Eq. (4) are weighted as shown in Table 2: the COM player is rewarded for advancing through the stage and penalized for taking damage, dying, and changing the pressed keys.

Table 2  Terms of the reward function of Eq. (4)
  distance     2.0
  damaged    -50.0
  death     -100.0
  keypress    -5.0

The A* COM player uses the g(n) and h(n) of the agent described in Section 2.1 [1].

5. Evaluation Experiment

5.1 Conditions

Subjective assessments of human-likeness were collected for gameplay videos of COM players and of human players.

5.2 Procedure

Twenty participants took part in the experiment. Each watched gameplay videos of Infinite Mario Bros. and rated the human-likeness of each player. The players compared were human players of three skill levels (beginner, intermediate, and advanced) and COM players: Q-learning COM players (with ε = 0.0 and with ε = 0.2) and A* COM players, with and without the biological constraints of Section 3.
Table 3  Players compared (time and score)
  [, ] (COM)          10.62   5448
  [, ] (COM)          14.25   4069
  [,, ] (COM) ( )     15.57   3458
  [, ] (COM)           7.29   7926
  [, ] (COM)           9.34   3118
  [ ] ( )             10.08   6031
  [ ] ( )             14.25   3644
  [ ] ( )              7.68   7371

5.3 Results

Fig. 3 shows the human-likeness ratings of the COM players and the human players.

Fig. 3  Human-likeness ratings, on a scale from "not human-like" to "human-like", of the COM players (no search / search introduced; no reinforcement / reinforcement introduced / reinforcement introduced (challenge only)) and of the human players (beginner, intermediate, advanced)

Between the two Q-learning COM conditions, the difference of 0.66 − 0.29 = 0.37 did not reach the 5% significance level (0.37 < critical value 0.48). For the A* COM conditions the difference was significant at the 1% level (1.35 > 0.72). Further comparisons between conditions were significant at the 1% level (1.12 > 0.58 and 1.33 > 0.58) and at the 5% level (0.71 > 0.59).

6. Discussion

The ratings in Fig. 3 suggest that imposing the biological constraints made the acquired behavior look more human-like: constrained COM players were rated significantly more human-like than their unconstrained counterparts, and some were rated close to the human players.
7. Conclusion

We evaluated, through subjective assessment experiments, the human-likeness of COM players whose behavioral patterns were autonomously acquired under biological constraints, and discussed the feasibility of the proposed system.

References
[1] Togelius, J., Karakovskiy, S. and Baumgarten, R.: The 2009 Mario AI Competition, Proc. IEEE Congress on Evolutionary Computation (CEC 2010), pp. 1–8 (2010).
[2] Fujita, H. and Ishii, S.: Model-based Reinforcement Learning for Partially Observable Games with Sampling-based State Estimation, Neural Computation, Vol. 19, pp. 3051–3087 (2007).
[3] Hoki, K. and Kaneko, T.: The Global Landscape of Objective Functions for the Optimization of Shogi Piece Values with a Game-Tree Search, Advances in Computer Games 2012, Lecture Notes in Computer Science, Vol. 7168, pp. 184–195 (2012).
[4] Schrum, J., Karpov, I. V. and Miikkulainen, R.: Human-like Behavior via Neuroevolution of Combat Behavior and Replay of Human Traces, Proc. IEEE Conference on Computational Intelligence and Games (CIG 2011), pp. 329–336 (2011).
[5] Viennot, S.: Proc. Game Programming Workshop (GPW 2012), pp. 47–54 (2012) (in Japanese).
[6] IPSJ SIG Technical Report, Vol. 2013-EC-27, No. 16, pp. 1–6 (2013) (in Japanese).
[7] Proc. Game Programming Workshop (GPW 2006), pp. 78–83 (2006) (in Japanese).
[8] Sugiyama, T., Obata, T., Hoki, K. and Ito, T.: Optimistic Selection Rule Better Than Majority Voting System, Computers and Games, Lecture Notes in Computer Science, Vol. 6515, pp. 166–175 (2011).
[9] Togelius, J., Karakovskiy, S., Koutnik, J. and Schmidhuber, J.: Super Mario Evolution, Proc. IEEE Conference on Computational Intelligence and Games (CIG 2009), pp. 156–161 (2009).
[10] Cabrera, J. L. and Milton, J. G.: On-Off Intermittency in a Human Balancing Task, Physical Review Letters, Vol. 89, No. 15 (2002).
[11] pp. 19–22 (2004) (in Japanese).
[12] Maslow, A. H.: A Theory of Human Motivation, Psychological Review, Vol. 50, pp. 370–396 (1943).
[13] Watkins, C.: Learning from Delayed Rewards, PhD thesis, Cambridge University, Cambridge, England (1989).
[14] Patel, P. G., Carver, N. and Rahimi, S.: Tuning Computer Gaming Agents using Q-Learning, pp. 581–588 (2011).