The 23rd Game Programming Workshop

Imitation Learning and Reinforcement Learning using Hierarchical Structure

Yutaro Fujimura 1,a)  Tomoyuki Kaneko 2,3,b)

Abstract: Deep Q-Network (DQN) has achieved above-human performance on classic Atari2600 games, and it is therefore expected to be applicable to many video games. However, DQN has difficulty learning in games with sparse feedback, such as Minecraft. To address this problem, we investigated the performance of hg-dagger/q, a framework that combines imitation learning and reinforcement learning using a hierarchical structure. We demonstrate the strength of hg-dagger/q in a Minecraft environment.

1 Graduate School of Arts and Sciences, The University of Tokyo
2 Interfaculty Initiative in Information Studies, The University of Tokyo
3 JST, PRESTO
a) yut-mak874@g.ecc.u-tokyo.ac.jp
b) kaneko@acm.org

1. Introduction
Deep Q-Network (DQN), proposed by Mnih et al., reached human-level performance on Atari2600 games [1], and DQN is expected to be applicable to video games beyond Atari2600. Johnson et al. released Malmo [2], a platform for AI experimentation built on Minecraft*1, which makes it possible to study AI agents that play Minecraft. Because Minecraft gives sparse feedback, it is difficult for DQN alone to learn in it. We therefore evaluate hg-dagger/q [3] in a Minecraft environment.

*1 https://minecraft.net/ (Accessed: 2018-10-16)

2018 Information Processing Society of Japan

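2. Minecraft
Minecraft is a 3D game. Malmo [2] is a platform for developing AI agents that play Minecraft, and competitions for Minecraft AI built on Malmo were held in 2017*2 and 2018*3.

*2 https://www.microsoft.com/en-us/research/academic-program/collaborative-ai-challenge/ (Accessed: 2018-10-16)
*3 https://www.crowdai.org/challenges/marlo-2018 (Accessed: 2018-10-16)

3.

3.1
Following [4], we consider a Markov Decision Process (MDP) defined by a state set $S$, an action set $A$, a reward function $R : S \times A \to \mathbb{R}$, and a transition probability $P : S \times A \times S \to [0, 1]$. For $s \in S$ and $a \in A$, the agent acts according to a policy $\pi : S \to A$; taking action $a$ in state $s$ yields a next state $s'$ drawn from $P(s, a)$ and a reward $r = R(s, a)$. Q-learning estimates the action-value function $Q(s, a)$.

3.2 Deep Q-Network (DQN)
Deep Q-Network (DQN) [1] approximates the action-value function $Q(s, a)$ with a neural network $Q(s, a; \theta)$ with parameters $\theta$. DQN learns $\theta$ by minimizing the loss

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right] \tag{1}$$

where $\theta_i$ denotes the parameters at iteration $i$. DQN performs well on many Atari2600 games, but it struggles on games with sparse rewards such as Montezuma's Revenge.

3.3
In hierarchical reinforcement learning [5], the MDP of Section 3.1 is handled at 2 levels. With subgoals $g \in G$, a meta-controller $\mu : S \to G$ selects a subgoal, and a subpolicy $\pi_g : S \to A$ selects the action $a$ for the current subgoal. Alg. 1 shows how the two levels interact.

Algorithm 1
 1: repeat
 2:   observe state s
 3:   select subgoal: g ← µ(s)
 4:   loop
 5:     observe state s
 6:     if g has terminated then
 7:       break
 8:     select action: a ← π_g(s)
 9:     execute a
10: until the episode ends

Both µ and π_g are policies to be learned. In Alg. 1, the trajectory generated by π_g is $\tau = (s_1, a_1, \ldots, s_T, a_T, s_{T+1})$, and the hierarchical trajectory is $\sigma = (s_1, g_1, \tau_1, s_2, g_2, \tau_2, \ldots)$. The last state of $\tau_h$ is $s_{h+1}$, which is also the first state of $\tau_{h+1}$; $\sigma$ is thus a sequence of the low-level trajectories $\tau_h$.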
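The two-level control loop of Alg. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the corridor environment, the waypoint subgoals, and all function names (`run_hierarchical_episode`, `mu`, `pi`, `step`, `terminal`, `done`) are assumptions made for the example. Unlike Alg. 1, the sketch executes at least one action per subgoal so that the loop is guaranteed to advance.

```python
from typing import Callable, List, Tuple

State = int
Goal = int
Action = int  # -1: step left, +1: step right (toy action set, an assumption)


def run_hierarchical_episode(
    start: State,
    mu: Callable[[State], Goal],
    pi: Callable[[Goal, State], Action],
    step: Callable[[State, Action], State],
    terminal: Callable[[State, Goal], bool],
    done: Callable[[State], bool],
    max_steps: int = 100,
) -> List[Tuple[State, Goal, List[Tuple[State, Action]]]]:
    """Roll out one episode and return sigma = [(s_h, g_h, tau_h), ...]."""
    sigma = []
    s, steps = start, 0
    while not done(s) and steps < max_steps:   # outer "repeat ... until"
        s_hi, g = s, mu(s)                     # meta-controller selects a subgoal
        tau = []
        while steps < max_steps:               # inner loop for the current subgoal
            a = pi(g, s)                       # subpolicy selects an action
            tau.append((s, a))
            s = step(s, a)                     # execute the action
            steps += 1
            if terminal(s, g):                 # subgoal finished: back to the meta level
                break
        sigma.append((s_hi, g, tau))
    return sigma


# Toy corridor with positions 0..6: reach waypoint 3 first, then the goal at 6.
mu = lambda s: 3 if s < 3 else 6
pi = lambda g, s: 1 if s < g else -1
step = lambda s, a: max(0, min(6, s + a))
terminal = lambda s, g: s == g
done = lambda s: s == 6

sigma = run_hierarchical_episode(0, mu, pi, step, terminal, done)
subgoals = [g for _, g, _ in sigma]
```

Each element of `sigma` pairs the state in which a subgoal was chosen with that subgoal and the low-level trajectory produced while pursuing it, mirroring the structure of $\sigma$ in the text.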

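The DQN loss of Eq. (1) in Section 3.2 can be checked numerically on a tiny hand-made minibatch. The sketch below is only an illustration under assumptions: the function names and the sample values are invented for the example, Q-values are passed in as plain numbers instead of coming from a network, and the rule that no bootstrapping happens at the end of an episode is common practice rather than something stated in Eq. (1).

```python
from typing import List, Sequence, Tuple

# One sample: (Q(s, a; theta_i), r, [Q(s', a'; theta_{i-1}) for all a'], episode_ended)
Sample = Tuple[float, float, Sequence[float], bool]


def td_target(r: float, q_next: Sequence[float], gamma: float, ended: bool) -> float:
    """r + gamma * max_a' Q(s', a'; theta_{i-1}); no bootstrap once the episode ended."""
    return r if ended else r + gamma * max(q_next)


def dqn_loss(batch: List[Sample], gamma: float) -> float:
    """Mean squared TD error over the minibatch (sample estimate of Eq. (1))."""
    errors = [
        (td_target(r, q_next, gamma, ended) - q_sa) ** 2
        for q_sa, r, q_next, ended in batch
    ]
    return sum(errors) / len(errors)


batch = [
    (0.5, 1.0, [0.0, 2.0], False),   # target = 1.0 + 0.5 * 2.0 = 2.0, error = 1.5
    (1.0, -1.0, [3.0, 4.0], True),   # target = -1.0 (episode ended), error = -2.0
]
loss = dqn_loss(batch, gamma=0.5)    # (1.5**2 + 2.0**2) / 2
```

In an actual DQN the target values come from a periodically updated copy of the network, which is what the index $i-1$ on $\theta$ expresses.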
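The high-level trajectory is $\tau_{HI} = (s_1, g_1, s_2, g_2, \ldots)$, and $\tau_{FULL}$ denotes the full low-level trajectory. Hierarchical methods built on DQN include h-DQN [6] and the Hierarchical Deep RL Network (H-DRLN) [7].

3.4 Hybrid Imitation and Reinforcement Learning
Le et al. proposed Hierarchically Guided DAgger/Q-learning (hg-dagger/q) [3], which learns the meta-controller µ by imitation learning and the subpolicies π_g by reinforcement learning. Alg. 2 shows the procedure.

Algorithm 2 Hierarchically Guided DAgger/Q-learning (hg-dagger/q)
Input: Function Pseudo(s; g)
Input: Function Terminal(s; g)
Input: ϵ_g > 0, g ∈ G
 1: Initialize:
 2:   D_HI ← ∅ and D_g ← ∅, g ∈ G
 3:   Q_g, g ∈ G
 4: for t = 1, ..., T do
 5:   obtain the initial state s
 6:   σ ← ∅
 7:   repeat
 8:     s_HI ← s, g ← µ(s) and initialize τ ← ∅
 9:     repeat
10:       a ← ϵ_g-greedy(Q_g, s)
11:       execute a and observe the next state s'
12:       r' ← Pseudo(s'; g)
13:       update Q_g: a gradient step on a minibatch from D_g
14:       append (s, a, s', r') to τ
15:       s ← s'
16:     until Terminal(s; g)
17:     append (s_HI, g, τ) to σ
18:   until the episode ends
19:   extract τ_FULL and τ_HI from σ
20:   if Inspect_FULL(τ_FULL) = Fail then
21:     D* ← Label_HI(τ_HI)
22:     for (s_h, g_h, τ_h) in σ do
23:       g*_h ← the expert's subgoal for s_h in D*
24:       if g_h ≠ g*_h then
25:         break
26:       D_{g_h} ← D_{g_h} ∪ τ_h
27:     D_HI ← D_HI ∪ D*
28:   else
29:     D_{g_h} ← D_{g_h} ∪ τ_h for all (s_h, g_h, τ_h) ∈ σ
30:   update µ: µ ← Train(µ, D_HI)

In Alg. 2, Pseudo(s; g) gives the pseudo-reward of state $s$ under subgoal $g$, and Terminal(s; g) indicates whether subgoal $g$ has terminated in state $s$. Following [3], with Success(s; g) indicating that subgoal $g$ has been achieved in state $s$, the pseudo-reward is defined as

$$\mathrm{Pseudo}(s; g) = \begin{cases} 1 & \text{if } \mathrm{Success}(s; g) \\ -1 & \text{if } \neg\mathrm{Success}(s; g) \text{ and } \mathrm{Terminal}(s; g) \\ -\kappa & \text{otherwise} \end{cases} \tag{2}$$

where $\kappa > 0$.

Algorithm 3 Inspect_FULL(τ_FULL)
1: if τ_FULL completes the overall task then
2:   return Pass
3: else
4:   return Fail

For a sequence of states and actions $(s_1, a_1, s_2, a_2, \ldots)$, $\tau_{FULL} = \{(s_1, a_1), (s_2, a_2), \ldots\}$. The meta-controller µ is trained on the expert labels $\mathrm{Label}_{HI}(\tau_{HI}) = \{(s_1, g^*_1), (s_2, g^*_2), \ldots\}$, while the subpolicies π_g are trained by Q-learning. The expert therefore only provides Terminal(s; g) and Label_HI. In [3], hg-dagger/q was compared with h-DQN [6] on Montezuma's Revenge. The Q-learning in hg-dagger/q can be carried out with DQN.

3.5 Deep Recurrent Q-Network (DRQN)
DQN takes a stack of 4 frames as input, but in Minecraft the information contained in 4 frames is not always sufficient. The Deep Recurrent Q-Network (DRQN) [8] adds an LSTM layer to DQN.

4.
With Montezuma's Revenge in mind, we created an environment in Minecraft.

4.1
The map is shown schematically in Fig. 2.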
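The pseudo-reward of Eq. (2) and the data-aggregation step of Alg. 2 (the branch on Inspect_FULL) can be written out as a small Python sketch. It is a hedged illustration of the procedure described in [3], not the authors' code: the function names, the value `kappa = 0.01`, the subgoal names `"axe"`, `"log"`, `"goal"`, and the string placeholders standing in for transitions are all assumptions made for the example.

```python
from typing import Dict, Hashable, List, Tuple

Transition = Hashable
Segment = Tuple[int, str, List[Transition]]  # (s_h, g_h, tau_h)


def pseudo(success: bool, terminal: bool, kappa: float = 0.01) -> float:
    """Pseudo-reward of Eq. (2); kappa = 0.01 is an illustrative value only."""
    if success:
        return 1.0
    if terminal:          # subgoal ended without success
        return -1.0
    return -kappa         # small per-step penalty


def aggregate(
    sigma: List[Segment],
    expert_goals: List[str],
    passed: bool,
    d_hi: List[Tuple[int, str]],
    d_lo: Dict[str, List[Transition]],
) -> None:
    """Update the buffers D_HI and D_g from one hierarchical trajectory."""
    if passed:
        # Inspect_FULL = Pass: keep every low-level trajectory, no labels needed.
        for _, g, tau in sigma:
            d_lo.setdefault(g, []).extend(tau)
        return
    # Inspect_FULL = Fail: keep tau_h only while g_h agrees with the expert's g*_h.
    for (_, g, tau), g_star in zip(sigma, expert_goals):
        if g != g_star:
            break
        d_lo.setdefault(g, []).extend(tau)
    # All expert labels (s_h, g*_h) go to the high-level buffer.
    d_hi.extend((s_h, g_star) for (s_h, _, _), g_star in zip(sigma, expert_goals))


sigma = [(0, "axe", ["t1"]), (5, "goal", ["t2"]), (7, "log", ["t3"])]
expert_goals = ["axe", "log", "goal"]
d_hi: List[Tuple[int, str]] = []
d_lo: Dict[str, List[Transition]] = {}
aggregate(sigma, expert_goals, passed=False, d_hi=d_hi, d_lo=d_lo)
```

Only the first segment reaches the low-level buffer in this example, because the agent's second subgoal already disagrees with the expert; low-level samples collected under a wrong subgoal are discarded, while the high-level buffer still receives every expert label.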

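The agent is placed at the pink cell, facing in the direction of the arrow. Its objective is to reach the goal while observing an image such as Fig. 1 once every 0.05 seconds. The game is cleared as soon as the agent touches the gold block at the goal, and the episode then ends. Fig. 3 shows the actual map in Minecraft, photographed from above with the ceiling blocks removed.

Fig. 1 Example of an input that the agent can observe
Fig. 2 Schematic diagram of the map used in the experiments
Fig. 3 Overhead view of the map used in the experiments

4.2 Room design
The walls, floor, and ceiling of each room, drawn as gray cells in Fig. 2, are built from bedrock blocks, which the agent cannot break. The two rooms are separated by an iron door at the green cell in Fig. 2. Because an iron door takes a very long time to break*4, the agent has to solve the mechanism in the first room, where it is placed, and then open and pass through the iron door in order to reach the goal in time.

4.3 Mechanism of the first room
In the room where the agent is first placed, a stone pressure plate is located at the light-blue cell in Fig. 2 and a log block at the brown cell. First, when the agent steps on the stone pressure plate, the adjacent dispenser ejects a diamond axe, which the agent can pick up. Next, when the agent breaks the log block, the adjacent observer opens the iron door through a redstone circuit.

4.4 Reward design
The agent receives +1 when it obtains the diamond axe, when it breaks and obtains the log block, and when it touches the gold block at the goal. In addition, the agent receives -0.01 every time it selects an action, and -1 if 100 seconds elapse without it reaching the goal.

4.5 Actions available to the agent
Minecraft offers a wide variety of action commands, but in our experiments we restricted them to the following 7:
- do nothing
- walk (forward, backward)
- rotate the camera (left, right) at 180 degrees per second
- start attacking
- stop attacking

5. Experiments

5.1 Overview
We refer to the agent that performs the Q-learning of hg-dagger/q (Section 3.4) with DQN as hg-dagger/dqn, and to the agent that performs it with DRQN as hg-dagger/drqn. We trained DQN, DRQN, hg-dagger/dqn, and hg-dagger/drqn agents in the Minecraft environment created in Section 4. The implementation uses Python 3.5. To run experiments in Minecraft we used MarLO*5 (0.0.1-dev16), a wrapper that exposes Malmo (0.36.0.0) through the OpenAI Gym [9] interface. As the deep learning framework we used Keras (2.2.2) with tensorflow (1.9.0) as the backend.

*4 About 25 seconds with bare hands or with any tool other than a pickaxe.
*5 https://github.com/crowdai/marlo (Accessed: 2018-10-16)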
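The reward design of Section 4.4 and the action set of Section 4.5 can be summarized as a small Python sketch. It makes several assumptions that are not in the paper: the action names are placeholders rather than Malmo command strings, the event flags in `StepEvents` are hypothetical, and the per-step and timeout rewards are treated as penalties (-0.01 and -1), which is an interpretation, since the extracted source shows these two values without a sign.

```python
from dataclasses import dataclass

# Placeholder names for the 7 allowed actions (not actual Malmo commands).
ACTIONS = (
    "noop", "forward", "backward",
    "turn_left", "turn_right",
    "attack_start", "attack_stop",
)

STEP_REWARD = -0.01      # assumed penalty for every selected action
TIMEOUT_REWARD = -1.0    # assumed penalty when the time limit is hit
TIME_LIMIT_SEC = 100.0   # time limit stated in Section 4.4


@dataclass
class StepEvents:
    """What happened during one step (hypothetical event flags)."""
    got_axe: bool = False
    broke_log: bool = False
    touched_goal: bool = False
    elapsed_sec: float = 0.0


def external_reward(ev: StepEvents) -> float:
    """External reward for one step under the stated assumptions."""
    reward = STEP_REWARD
    # +1 for each of the three milestones reached in this step.
    reward += 1.0 * (ev.got_axe + ev.broke_log + ev.touched_goal)
    # Timeout applies only if the goal has not been reached.
    if not ev.touched_goal and ev.elapsed_sec >= TIME_LIMIT_SEC:
        reward += TIMEOUT_REWARD
    return reward


r_idle = external_reward(StepEvents())
r_axe = external_reward(StepEvents(got_axe=True))
r_timeout = external_reward(StepEvents(elapsed_sec=100.0))
```

With only three +1 events in an episode that can last thousands of steps, almost every step returns the same small value, which is the sparse-feedback setting the paper targets.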

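5.2
The Minecraft screen is given to the agent as an 84 x 84 image, and 4 frames are used as input. The DQN network follows [1] and is shown in Table 1; the DRQN network follows [8] and is shown in Table 2. Both are optimized with Adam [10] at a learning rate of 0.00025. The network for µ follows [3] and is shown in Table 3; it is optimized with RMSProp at a learning rate of 0.00025.

Table 1 DQN
  Layer             Configuration
  Input             84 x 84 x 4
  Convolution       8 x 8, 32 filters
  Convolution       4 x 4, 64 filters
  Convolution       3 x 3, 64 filters
  Fully connected   512
  Output            7

Table 2 DRQN
  Layer             Configuration
  Input             84 x 84 x 4
  Convolution       8 x 8, 32 filters
  Convolution       4 x 4, 64 filters
  Convolution       3 x 3, 64 filters
  LSTM              512, tanh
  Output            7

Table 3
  Layer             Configuration
  Input             84 x 84 x 4
  Convolution       8 x 8, 32 filters
  Dropout           rate 0.5
  Convolution       4 x 4, 64 filters
  Dropout           rate 0.5
  Convolution       3 x 3, 64 filters
  Dropout           rate 0.5
  Fully connected   512
  Dropout           rate 0.5
  Output            7

Table 4 hg-dagger/dqn, hg-dagger/drqn
  Parameter                          Value
  replay memory size                 500000
  target network update frequency    2000
  final exploration frame            2000000
  replay start size                  20000

5.3 DQN DRQN [1] [1] 4 Malmo 1
1 hg-dagger/dqn hg-dagger/drqn DQN DRQN 4

5.4 DQN, DRQN
DQN DRQN 4 DRQN 4 DQN DQN DRQN

5.5 hg-dagger/dqn, hg-dagger/drqn
hg-dagger/dqn, hg-dagger/drqn 4.4 0 1 2 [3]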
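The layer sizes in Table 1 determine how large the feature map is when it reaches the 512-unit fully connected layer. The short Python sketch below works this out. The strides (4, 2, 1) and the absence of padding are assumptions taken from the DQN architecture in [1]; the extracted tables list only kernel sizes and filter counts, so the paper's actual settings may differ.

```python
from typing import List, Tuple

# (kernel size, number of filters, stride); the strides are assumed from [1].
CONV_LAYERS = [(8, 32, 4), (4, 64, 2), (3, 64, 1)]


def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def feature_shapes(size: int = 84) -> List[Tuple[int, int, int]]:
    """(height, width, channels) after each convolution for a square input."""
    shapes = []
    for kernel, filters, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
        shapes.append((size, size, filters))
    return shapes


shapes = feature_shapes()          # 84 -> 20 -> 9 -> 7
h, w, c = shapes[-1]
flat_features = h * w * c          # inputs to the 512-unit layer
```

Under these assumptions the 84 x 84 x 4 input shrinks to 7 x 7 x 64, so the fully connected (or LSTM) layer receives 3136 features.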

Fig. 4 DQN, DRQN (y-axis: external rewards; x-axis: steps)
Fig. 5 hg-dagger/dqn, hg-dagger/drqn (y-axis: meta agent accuracy; x-axis: LO-level reinforcement learning samples)
Fig. 6 hg-dagger/dqn (y-axis: subgoal success rate; x-axis: LO-level reinforcement learning samples)
Fig. 7 hg-dagger/drqn (y-axis: subgoal success rate; x-axis: LO-level reinforcement learning samples)
Fig. 8 hg-dagger/dqn (y-axis: LO-level samples; x-axis: episode (HI-level labeling samples))
Fig. 9 hg-dagger/drqn (y-axis: LO-level samples; x-axis: episode (HI-level labeling samples))

5 5 hg-dagger/dqn hg-dagger/drqn 5 6 7 0 100% 1 30% 2 20% 8 9 6 8 7 9 1 2 10 11 0 1 1 1000 2

6.
Minecraft DQN DRQN hg-dagger/q Minecraft DQN DRQN 1 Experience Replay Prioritized Experience Replay [11]

JSPS 16H02927 JST

References
[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. and Hassabis, D.: Human-level control through deep reinforcement learning, Nature, Vol. 518, No. 7540, pp. 529-533 (online), available from http://dx.doi.org/10.1038/nature14236 (2015).
[2] Johnson, M., Hofmann, K., Hutton, T. and Bignell, D.: The Malmo platform for artificial intelligence experimentation, IJCAI International Joint Conference on Artificial Intelligence, Vol. 2016-Janua, pp. 4246-4247 (2016).
[3] Le, H., Jiang, N., Agarwal, A., Dudik, M., Yue, Y. and Daumé, III, H.: Hierarchical Imitation and Reinforcement Learning, Proceedings of the 35th International Conference on Machine Learning (Dy, J. and Krause, A., eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, PMLR, pp. 2923-2932 (online), available from http://proceedings.mlr.press/v80/le18a.html (2018).
[4] Sutton, R. S., Precup, D. and Singh, S.: Intra-option learning about temporally abstract actions, Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pp. 556-564 (1998).
[5] Sutton, R. S., Precup, D. and Singh, S.: Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artif. Intell., Vol. 112, No. 1-2, pp. 181-211 (online), DOI: 10.1016/S0004-3702(99)00052-1 (1999).
[6] Kulkarni, T. D., Narasimhan, K. R., Saeedi, A. and Tenenbaum, J. B.: Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, No. Nips (online), DOI: 10.1023/A:1025696116075 (2016).
[7] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J. and Mannor, S.: A Deep Hierarchical Approach to Lifelong Learning in Minecraft, AAAI, Vol. 3, p. 6 (2017).
[8] Hausknecht, M. and Stone, P.: Deep Recurrent Q-Learning for Partially Observable MDPs, (online), DOI: 10.1.1.696.1421 (2015).
[9] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W.: OpenAI Gym, pp. 1-4 (online), available from http://arxiv.org/abs/1606.01540 (2016).
[10] Kingma, D. P. and Ba, J.: Adam: A Method for Stochastic Optimization, pp. 1-15 (online), available from http://arxiv.org/abs/1412.6980 (2014).
[11] Schaul, T., Quan, J., Antonoglou, I. and Silver, D.: Prioritized Experience Replay, CoRR, Vol. abs/1511.05952 (online), available from http://arxiv.org/abs/1511.05952 (2015).

Fig. 10 hg-dagger/dqn (y-axis: steps; x-axis: episode (HI-level labeling samples))
Fig. 11 hg-dagger/drqn (y-axis: steps; x-axis: episode (HI-level labeling samples))