The 23rd Game Programming Workshop 2018


Imitation Learning and Reinforcement Learning using Hierarchical Structure

Yutaro Fujimura 1,a)  Tomoyuki Kaneko 2,3,b)

Abstract: Deep Q-Network (DQN) has achieved above-human performance on the domain of classic Atari2600 games, and DQN is therefore expected to apply to many video games. However, it is difficult for DQN to learn on games with sparse feedback, such as Minecraft. To solve this problem, we investigated the performance of hg-dagger/q, a framework that combines imitation learning and reinforcement learning using a hierarchical structure. We demonstrate the strength of hg-dagger/q on a Minecraft environment.

1. Introduction

Mnih et al. proposed Deep Q-Network (DQN), which achieved human-level control on classic Atari2600 games [1], and DQN is therefore expected to apply to many other video games. Johnson et al. released Malmo [2], a platform for artificial-intelligence experimentation built on Minecraft*1, and Malmo has made Minecraft a popular testbed for AI research. However, feedback in Minecraft is sparse, which makes learning with plain DQN difficult. To address this problem, we evaluate hg-dagger/q [3], a framework that combines imitation learning and reinforcement learning through a hierarchical structure, on a Minecraft environment.

1 Graduate School of Arts and Sciences, The University of Tokyo
2 Interfaculty Initiative in Information Studies, The University of Tokyo
3 JST, PRESTO
a) yut-mak874@g.ecc.u-tokyo.ac.jp
b) kaneko@acm.org
*1 https://minecraft.net/ (Accessed: 2018-10-16)

2018 Information Processing Society of Japan

2. Minecraft

Minecraft is a 3D sandbox game in which players gather blocks and craft tools in an open world. The Malmo platform [2] exposes Minecraft as an environment for AI experimentation, and AI competitions on Minecraft were held using Malmo in 2017*2 and 2018*3.

*2 https://www.microsoft.com/en-us/research/academic-program/collaborative-ai-challenge/ (Accessed: 2018-10-16)
*3 https://www.crowdai.org/challenges/marlo-2018 (Accessed: 2018-10-16)

3. Background

3.1 Reinforcement Learning

Following [4], a Markov Decision Process (MDP) is defined by a state space S, an action space A, a reward function R : S × A → ℝ, and a transition function P : S × A × S → [0, 1]. In state s ∈ S, the agent selects an action a ∈ A with a policy π : S → A, moves to a next state s' drawn according to P(s, a, ·), and receives the reward r = R(s, a). The action-value function Q(s, a) is the expected discounted sum of rewards obtained after taking action a in state s.

3.2 Deep Q-Network (DQN)

Deep Q-Network (DQN) [1] approximates the action-value function Q(s, a) with a neural network Q(s, a; θ) with parameters θ. DQN updates θ by minimizing the loss

  L_i(θ_i) = E_{s,a,r,s'} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) )² ]   (1)

where θ_i denotes the parameters at iteration i. Although DQN reached human-level scores on many Atari2600 games, it performs poorly on games with sparse rewards such as Montezuma's Revenge.

3.3 Hierarchical Reinforcement Learning

Following [5], we consider a two-level decomposition of an MDP with a set of subgoals G. A meta-controller µ : S → G selects a subgoal g ∈ G, and a subpolicy π_g : S → A selects primitive actions until the subgoal terminates. Algorithm 1 shows the execution scheme.

Algorithm 1 Hierarchical execution with µ and π_g
1: repeat
2:   observe state s
3:   choose a subgoal: g ← µ(s)
4:   loop
5:     observe state s
6:     if g has terminated then
7:       break
8:     choose an action: a ← π_g(s)
9:     execute a
10: until the episode ends

Running a subpolicy π_g yields a low-level trajectory τ = (s_1, a_1, ..., s_T, a_T, s_{T+1}), and a whole episode yields the hierarchical trajectory σ = (s_1, g_1, τ_1, s_2, g_2, τ_2, ...), where the last state of each segment τ_h coincides with the first state s_{h+1} of the next segment τ_{h+1}.
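The one-step target inside Eq. (1) can be made concrete with a small sketch. The snippet below is illustrative only: it assumes a toy setting where Q-values are plain Python lists rather than the paper's convolutional network, and `td_target` and `td_loss` are hypothetical helper names, not part of any library.

```python
# Minimal sketch of the learning target in Eq. (1), assuming a toy
# setting with Q-values as plain lists instead of a neural network.
GAMMA = 0.99  # discount factor (illustrative value)

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """r + gamma * max_a' Q(s', a'; theta_{i-1}); no bootstrap at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def td_loss(q_sa, reward, next_q_values, done):
    """Squared TD error: the per-sample quantity averaged in L_i(theta_i)."""
    return (td_target(reward, next_q_values, done) - q_sa) ** 2
```

Note that in DQN the maximization uses the frozen parameters θ_{i−1} of a separate target network, which is why the target itself carries no gradient.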

3.4 Hybrid Imitation and Reinforcement Learning

Le et al. proposed Hierarchically Guided DAgger/Q-learning (hg-dagger/q) [3], in which the meta-controller µ is trained by imitation learning with expert feedback while the subpolicies π_g are trained by reinforcement learning. Algorithm 2 shows the procedure.

Algorithm 2 Hierarchically Guided DAgger/Q-learning (hg-dagger/q)
Input: functions Pseudo(s; g) and Terminal(s; g); ϵ_g > 0 for each g ∈ G
1: Initialize:
2:   datasets D_HI and D_g for all g ∈ G
3:   action-value functions Q_g for all g ∈ G
4: for t = 1, ..., T do
5:   observe the initial state s
6:   initialize the episode record σ
7:   repeat
8:     s_HI ← s, g ← µ(s), and initialize τ
9:     repeat
10:      a ← ϵ_g-greedy(Q_g, s)
11:      execute a and observe the next state s̃
12:      r ← Pseudo(s̃; g)
13:      update Q_g using D_g
14:      append (s, a, s̃, r) to τ
15:      s ← s̃
16:    until Terminal(s; g)
17:    append (s_HI, g, τ) to σ
18:  until the episode ends
19:  extract τ_FULL and τ_HI from σ
20:  if Inspect_FULL(τ_FULL) = Fail then
21:    D* ← Label_HI(τ_HI)
22:    for (s_h, g_h, τ_h) in σ do
23:      read the expert label g*_h from D*
24:      if g_h ≠ g*_h then
25:        break
26:      D_{g_h} ← D_{g_h} ∪ τ_h
27:    D_HI ← D_HI ∪ D*
28:  else
29:    D_{g_h} ← D_{g_h} ∪ τ_h for all (s_h, g_h, τ_h) ∈ σ
30:  update µ: µ ← Train(µ, D_HI)

Here τ_HI = (s_1, g_1, s_2, g_2, ...) is the high-level part of σ, and τ_FULL is the flattened sequence of primitive states and actions (s_1, a_1, s_2, a_2, ...). Pseudo(s; g) gives the pseudo-reward of state s under subgoal g, and Terminal(s; g) indicates whether the subgoal episode for g ends at s. Following [3], with a predicate Success(s; g) that holds when state s achieves subgoal g, the pseudo-reward is

  Pseudo(s; g) = 1   if Success(s; g),
                 −1  if ¬Success(s; g) and Terminal(s; g),
                 −κ  otherwise,   (2)

where κ > 0 is a small step penalty.

Algorithm 3 Inspect_FULL(τ_FULL)
1: if τ_FULL reaches the final goal then
2:   return Pass
3: else
4:   return Fail

When the episode fails, the expert labeler Label_HI(τ_HI) = {(s_1, g*_1), (s_2, g*_2), ...} supplies the correct subgoal for every high-level state, and µ is trained on the aggregated dataset D_HI in the manner of DAgger; the subpolicies π_g are trained only on segments whose subgoal agrees with the expert label, and segment termination is judged by Terminal(s; g). Related hierarchical methods include h-DQN [6], which was evaluated on Montezuma's Revenge, and the Hierarchical Deep RL Network (H-DRLN) [7]. In this work, the Q-learning of Algorithm 2 is performed with DQN.

3.5 Deep Recurrent Q-Network (DRQN)

DQN stacks the last 4 frames as its input, but Minecraft is only partially observable from such a short history. Deep Recurrent Q-Network (DRQN) [8] replaces the first fully connected layer of the DQN network with an LSTM so that the agent can integrate information over longer histories.

4. Experimental Environment

In the spirit of Montezuma's Revenge, we built a Minecraft task in which the agent must solve a puzzle before it can reach a distant goal.

4.1 Overview

The map used in the experiments consists of two rooms, shown schematically in Fig. 2.
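The interplay of the meta-controller µ, the subpolicies π_g, and the pseudo-reward of Eq. (2) can be illustrated on a toy chain task. Everything below is an assumption made for illustration (the chain environment, the hand-coded `mu` and `pi_g`, and the value of κ); it mirrors only the two-level control flow of Algorithms 1 and 2, not the paper's Minecraft setup.

```python
GOAL = 4      # illustrative chain of states 0..4 (assumption)
KAPPA = 0.01  # small step penalty; the paper only requires kappa > 0

def pseudo_reward(success, terminal, kappa=KAPPA):
    """Eq. (2): +1 on subgoal success, -1 on an unsuccessful terminal, -kappa otherwise."""
    if success:
        return 1.0
    if terminal:
        return -1.0
    return -kappa

def mu(state):
    """Meta-controller stand-in: propose the next waypoint as the subgoal (assumption)."""
    return min(state + 2, GOAL)

def pi_g(state, subgoal):
    """Subpolicy stand-in: step toward the current subgoal (assumption)."""
    return 1 if subgoal > state else -1

def run_episode(start=0, max_steps=20):
    """Mirror of Algorithm 1's two-level loop, logging (s, g, a, Pseudo)."""
    state, steps, trajectory = start, 0, []
    while state != GOAL and steps < max_steps:
        g = mu(state)                      # HI level: choose a subgoal
        while True:                        # LO level: act until Terminal(s; g)
            action = pi_g(state, g)
            next_state = state + action
            steps += 1
            success = next_state == g
            terminal = success or steps >= max_steps
            trajectory.append((state, g, action, pseudo_reward(success, terminal)))
            state = next_state
            if terminal:
                break
    return state, trajectory
```

In hg-dagger/q, the logged low-level segments would feed the Q-learning datasets D_g, and the (state, subgoal) pairs would be checked against expert labels before training µ.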

Fig. 1 An example of the input the agent observes.
Fig. 2 Schematic of the map used in the experiments.
Fig. 3 Overhead view of the map used in the experiments.

The agent is placed at the pink cell of Fig. 2, facing the direction of the arrow. It observes an image like Fig. 1 once every 0.05 seconds, and its objective is to reach the goal. The moment the agent touches the gold block at the goal, the game is cleared and the episode ends. Fig. 3 shows the actual map in Minecraft, photographed from above with the ceiling blocks removed.

4.2 Room design

The walls, floors, and ceilings of the two rooms, drawn as gray cells in Fig. 2, are built from bedrock blocks, which the agent cannot break. The two rooms are separated by an iron door at the green cell of Fig. 2. Since breaking the iron door takes a very long time*4, the agent can reach the goal in time only by solving the puzzle in its starting room and opening the iron door.

*4 About 25 seconds with bare hands or with any tool other than a pickaxe.

4.3 Puzzle in the first room

In the room where the agent starts, a stone pressure plate is placed at the light-blue cell of Fig. 2 and a log block at the brown cell. First, when the agent steps on the stone pressure plate, a dispenser installed next to it ejects a diamond axe, which the agent can pick up. Next, when the agent breaks the log block, an observer installed next to it opens the iron door through a redstone circuit.

4.4 Reward design

The agent receives +1 when it obtains the diamond axe, when it breaks and obtains the log block, and when it touches the gold block at the goal. In addition, the agent receives −0.01 every time it selects an action, and −1 if 100 seconds elapse without reaching the goal.

4.5 Available actions

Although Minecraft provides many action commands, we restricted the agent to the following 7 actions in our experiments:
- do nothing
- walk (forward, backward)
- rotate the camera (left, right) at 180 degrees per second
- start attacking
- stop attacking

5. Experiments

5.1 Overview

We call the agent that performs the Q-learning part of hg-dagger/q (Sec. 3.4) with DQN hg-dagger/dqn, and the agent that performs it with DRQN hg-dagger/drqn. We trained DQN, DRQN, hg-dagger/dqn, and hg-dagger/drqn agents on the Minecraft environment of Sec. 4. The implementation uses Python 3.5; to run experiments on Minecraft, we used MarLO*5 (0.0.1-dev16), a wrapper that exposes Malmo (0.36.0.0) in the OpenAI Gym [9] interface. As the deep-learning framework we used Keras (2.2.2) with the TensorFlow (1.9.0) backend.

*5 https://github.com/crowdai/marlo (Accessed: 2018-10-16)
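The reward rules of Sec. 4.4 can be summarized as a small function. This is a sketch under stated assumptions: the boolean event names are hypothetical, whether the timeout penalty is added to or replaces the step penalty is not specified in the text, and the function restates only the scoring rules, not how Malmo reports the underlying events.

```python
# Sketch of the reward rules in Sec. 4.4. `got_axe`, `got_log`, and
# `reached_goal` are hypothetical boolean event flags for a single step.
MILESTONE = 1.0         # diamond axe obtained, log block obtained, goal touched
STEP_PENALTY = -0.01    # charged every time an action is selected
TIMEOUT_PENALTY = -1.0  # 100 s elapsed without reaching the goal

def step_reward(got_axe, got_log, reached_goal, timed_out):
    """Reward for one step: milestone bonuses plus the per-action penalty
    (timeout penalty assumed additive)."""
    reward = STEP_PENALTY
    reward += MILESTONE * (got_axe + got_log + reached_goal)
    if timed_out:
        reward += TIMEOUT_PENALTY
    return reward
```

Under this design, a perfect episode earns at most +3 in milestone rewards minus the accumulated step penalties, which is the sparse external signal the hierarchical agents must cope with.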

Table 1 Network architecture of DQN
  Input: 84×84×4
  Convolution: 8×8, 32 filters
  Convolution: 4×4, 64 filters
  Convolution: 3×3, 64 filters
  Fully connected: 512 units
  Output: 7 units

Table 2 Network architecture of DRQN
  Input: 84×84×4
  Convolution: 8×8, 32 filters
  Convolution: 4×4, 64 filters
  Convolution: 3×3, 64 filters
  LSTM: 512 units (tanh)
  Output: 7 units

Table 3 Network architecture of the meta-controller µ
  Input: 84×84×4
  Convolution: 8×8, 32 filters, Dropout (0.5)
  Convolution: 4×4, 64 filters, Dropout (0.5)
  Convolution: 3×3, 64 filters, Dropout (0.5)
  Fully connected: 512 units, Dropout (0.5)
  Output: 7 units

Table 4 Hyperparameters of hg-dagger/dqn and hg-dagger/drqn
  replay memory size: 500000
  target network update frequency: 2000
  final exploration frame: 2000000
  replay start size: 20000

5.2 Network architectures

The Minecraft screen is captured at 84×84 with the in-game UI hidden, and the last 4 frames are stacked as input. The DQN network follows [1] (Table 1) and the DRQN network follows [8] (Table 2); both are trained with Adam [10] at a learning rate of 0.00025. The meta-controller µ follows [3] (Table 3) and is trained with RMSProp at a learning rate of 0.00025.

5.3 Hyperparameters

The hyperparameters of the DQN and DRQN agents follow [1]. Table 4 lists the hyperparameters of hg-dagger/dqn and hg-dagger/drqn.

5.4 Results of DQN and DRQN

Fig. 4 shows the external rewards obtained by the DQN and DRQN agents over training steps. Neither agent succeeded in learning to reach the goal.

5.5 Results of hg-dagger/dqn and hg-dagger/drqn

For hg-dagger/dqn and hg-dagger/drqn, we define three subgoals, numbered 0 to 2, corresponding to the three rewarded events of Sec. 4.4, following [3].
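As a sanity check of Tables 1 and 2, the spatial size of the convolutional feature maps can be computed. The stride values (4, 2, 1) and "valid" padding are assumptions taken from the DQN architecture of [1] that the tables follow; they are not stated in the tables themselves.

```python
def conv_out(size, kernel, stride):
    """Output side length of a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1

def final_feature_side(size=84):
    """84x84 input through 8x8/s4, 4x4/s2, 3x3/s1 (strides assumed from [1])."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        size = conv_out(size, kernel, stride)
    return size
```

The result is a 7×7×64 feature map (3136 values), which is what feeds the 512-unit fully connected layer of Table 1 and the 512-unit LSTM of Table 2.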

Fig. 4 External rewards of the DQN and DRQN agents over training steps.
Fig. 5 Accuracy of the meta-controller of hg-dagger/dqn and hg-dagger/drqn over LO-level reinforcement-learning samples.

Fig. 5 shows the accuracy of the meta-controller, and Figs. 6 and 7 show the per-subgoal success rates of hg-dagger/dqn and hg-dagger/drqn: the success rate of subgoal 0 reaches about 100%, while those of subgoals 1 and 2 remain around 30% and 20%. Figs. 8 and 9 show the number of LO-level reinforcement-learning samples consumed per episode (HI-level labeling samples), and Figs. 10 and 11 show the number of steps per episode.

Fig. 6 Subgoal success rates of hg-dagger/dqn over LO-level reinforcement-learning samples.
Fig. 7 Subgoal success rates of hg-dagger/drqn over LO-level reinforcement-learning samples.
Fig. 8 LO-level samples of hg-dagger/dqn per episode (HI-level labeling samples).
Fig. 9 LO-level samples of hg-dagger/drqn per episode (HI-level labeling samples).
Fig. 10 Steps per episode of hg-dagger/dqn.
Fig. 11 Steps per episode of hg-dagger/drqn.

6. Conclusion

We built a Minecraft environment with sparse rewards and compared DQN, DRQN, hg-dagger/dqn, and hg-dagger/drqn on it. While DQN and DRQN alone failed to learn the task, the hg-dagger/q agents made progress by combining expert subgoal labels with reinforcement learning. As future work, replacing Experience Replay with Prioritized Experience Replay [11] may improve learning.

Acknowledgments: This work was supported by JSPS KAKENHI Grant Number 16H02927 and JST, PRESTO.

References

[1] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. and Hassabis, D.: Human-level control through deep reinforcement learning, Nature, Vol. 518, No. 7540, pp. 529–533 (2015).
[2] Johnson, M., Hofmann, K., Hutton, T. and Bignell, D.: The Malmo platform for artificial intelligence experimentation, Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016), pp. 4246–4247 (2016).
[3] Le, H., Jiang, N., Agarwal, A., Dudik, M., Yue, Y. and Daumé III, H.: Hierarchical Imitation and Reinforcement Learning, Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Proceedings of Machine Learning Research, Vol. 80, pp. 2923–2932 (2018).
[4] Sutton, R. S., Precup, D. and Singh, S.: Intra-option learning about temporally abstract actions, Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pp. 556–564 (1998).
[5] Sutton, R. S., Precup, D. and Singh, S.: Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, Artificial Intelligence, Vol. 112, No. 1–2, pp. 181–211, DOI: 10.1016/S0004-3702(99)00052-1 (1999).
[6] Kulkarni, T. D., Narasimhan, K. R., Saeedi, A. and Tenenbaum, J. B.: Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, Advances in Neural Information Processing Systems (NIPS 2016) (2016).
[7] Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J. and Mannor, S.: A Deep Hierarchical Approach to Lifelong Learning in Minecraft, AAAI 2017 (2017).
[8] Hausknecht, M. and Stone, P.: Deep Recurrent Q-Learning for Partially Observable MDPs, AAAI Fall Symposium Series (2015).
[9] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W.: OpenAI Gym, arXiv:1606.01540 (2016).
[10] Kingma, D. P. and Ba, J.: Adam: A Method for Stochastic Optimization, arXiv:1412.6980 (2014).
[11] Schaul, T., Quan, J., Antonoglou, I. and Silver, D.: Prioritized Experience Replay, arXiv:1511.05952 (2015).

2018 Information Processing Society of Japan