An Implementation of C++ Task Mapping Library and Evaluation on Heterogeneous Environments

Takeo YAMASAKI,*1  Daisuke MIYAMOTO*2  and  Masaya NAKAYAMA*2

*1 Graduate School of Engineering, The University of Tokyo
*2 Information Technology Center, The University of Tokyo

Modern computing architectures are increasingly parallel and distributed. This trend is driven by multi-core processors, grid, cluster, and cloud computing. These systems are complicated because of their scale, their heterogeneous structures, and their asymmetric architectures. Therefore, a more productive paradigm that assists the development of parallel distributed processing applications is required and has been studied. In this paper we focus on the task mapping paradigm, design a C++ parallel distributed programming library, TPDPL (Template Parallel Distributed Processing Library), and develop PE (Processing Element) containers and task mapping algorithms. Finally, we report a performance evaluation of them on a T2K open supercomputer, a private cluster, and a cloud environment, and confirm the performance of the TPDPL task mapping system.

1. Introduction

Modern computing environments are parallel and distributed, and increasingly heterogeneous: multi-core CPUs are combined with accelerators such as GPUs and FPGAs, and with cluster and cloud resources. Existing C/C++ approaches such as OpenMP, MPI, and CORBA each cover only part of this space, while C++11 adds standard threading support to the language itself. TPDPL is a C++ template library that abstracts such heterogeneous resources as processing elements (PEs) and maps tasks onto them.

This paper describes the design and implementation of TPDPL (Template Parallel Distributed Processing Library) and its evaluation on a heterogeneous environment. The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the PE containers and the task mapping mechanism of TPDPL, Section 4 reports the evaluation, and Section 5 concludes.

2. Related Work

2.1 X10
X10 1) expresses locality with Places and concurrency with Activities running at those Places. Hierarchical place trees 2) extend the Place abstraction to hierarchical and heterogeneous machines.

2.2 C/C++-based Approaches
For C/C++, shared-memory parallelism is commonly written with OpenMP and distributed-memory parallelism with MPI. Extensions of these models include OpenMP on clusters (Omni/SCASH 3), XcalableMP 4)), grid-enabled MPI (GridMPI 5)), C++ language extensions (MPC++ 6)), and task-based libraries such as Intel TBB 7), as well as related work 8) 9).

2.3 C++11
The C++11 standard 10) adds concurrency support to C++ itself: thread, future, promise, async, mutex, condition_variable, and atomic operations.
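As a minimal, self-contained illustration of these standard C++11 facilities (plain standard-library code, independent of TPDPL):

#include <atomic>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter(0);           // atomic operation
    std::mutex io_mutex;                   // mutual exclusion for std::cout
    std::vector<std::thread> workers;

    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&, i] {
            counter.fetch_add(i);
            std::lock_guard<std::mutex> lock(io_mutex);
            std::cout << "worker " << i << " done" << std::endl;
        });
    }
    for (auto &t : workers) t.join();      // join all threads
    std::cout << "sum = " << counter.load() << std::endl;
    return 0;
}

The PE containers described in Section 3 expose a similar allocate-then-join pattern, but through talloc and join_set rather than raw threads.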

3. Design of TPDPL

Like the Place/Activity model of X10, TPDPL abstracts each computing resource as a PE (processing element) and maps tasks onto PEs 12). A PE wraps a single execution resource: thread_pe wraps a local thread, tcp_pe a remote node reached over TCP, and mpi_pe an MPI process. Further PE kinds, for example for GPUs or FPGAs, can be added within the same framework.

TPDPL handles PEs through containers modeled on the C++ STL (Standard Template Library) (Fig. 1). Just as the STL provides containers such as vector and list for data, TPDPL provides containers whose elements are PEs 11) 12), and tasks are mapped onto PEs by iterating over these containers in the same way a for loop iterates over an STL container.

3.1 PE Containers
TPDPL provides three kinds of PE containers: pe_vector, which holds PEs of a kind chosen by the user; the PE pools thread_pp, mpi_pp, and tcp_pp, which discover and manage PEs of their respective kind; and hetero, which aggregates pools of different kinds.

3.1.1 pe_vector
Listing 1 shows the use of pe_vector. Like std::vector in the STL, pe_vector is a template container; here it holds four thread_pe elements. A task is allocated to a PE with talloc, whose handle is accumulated in a join_set, and join_all() waits for every task registered in the join_set to finish.

Listing 1: Example of pe_vector
 1 int add(int a, int b){ return a+b; }
 2 void test(){
 3   // create four thread_pe elements
 4   pe_vector<thread_pe> pevec(4);
 5   join_set js;
 6   for(int i=0; i<4; i++){
 7     js += pevec[i].talloc(add, 1, i);
 8   }
 9   js.join_all();
10 }
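TPDPL's actual implementation of thread_pe is not shown here. Purely as a conceptual sketch, a thread-backed PE with a talloc-style interface could be built on the C++11 facilities of Section 2.3 roughly as follows (all names below are hypothetical; this is not the library's implementation):

#include <future>
#include <utility>

// Hypothetical thread-backed processing element: talloc launches the
// given callable asynchronously and returns a future that a
// join_set-like object could collect and wait on.
class sketch_thread_pe {
public:
    template <class F, class... Args>
    auto talloc(F&& f, Args&&... args)
        -> std::future<decltype(f(args...))> {
        return std::async(std::launch::async,
                          std::forward<F>(f),
                          std::forward<Args>(args)...);
    }
};

int add(int a, int b) { return a + b; }

int main() {
    sketch_thread_pe pe;
    std::future<int> r = pe.talloc(add, 1, 2);  // analogous to the talloc calls in Listing 1
    return r.get() == 3 ? 0 : 1;
}

An mpi_pe or tcp_pe offers the same talloc interface but forwards the call to a remote process, which is what allows the containers of Section 3.1 to treat all PE kinds uniformly.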

3.1.2 PE Pools
A PE pool manages the PEs of one kind and, unlike pe_vector, discovers the available resources itself. Listing 2 shows how pools are constructed. thread_pp queries the local CPU, and full_assign() allocates one thread_pe per detected core (lines 3-4). mpi_pp does the same for the MPI processes (lines 6-7); each process other than rank 0 runs mpi_pe_slave() (line 12) and serves incoming tasks. For tcp_pp, each remote node runs tcp_pe_slave(), which connects back to the master (lines 23-24) and registers its local thread_pe objects through tcp_pe_singleton::set_pe (lines 27-30); since the master cannot detect remote resources itself, tcp_pp uses assign() with an explicit PE count instead of full_assign() (line 10).

Listing 2: Construction of PE pools
 1 void test(){
 2   // thread_pe pool
 3   thread_pp tpp;
 4   tpp.full_assign();
 5   // mpi_pe pool
 6   mpi_pp mpp;
 7   mpp.full_assign();
 8   // tcp_pe pool
 9   tcp_pp spp;
10   spp.assign(64);
11 }
12 void mpi_pe_slave(){
13   // executed by MPI processes other than rank 0
14   mpi_pe_singleton::start_server();
15   while(mpi_pe_singleton::is_server_working()){
16     Sleep(10);
17   }
18 }
19 void tcp_pe_slave(){
20   // initialize the socket layer on the slave node
21   network_tools::init_sock();
22
23   int port = 50000;
24   tcp_pe mta("127.0.0.1", port);
25   thread_pp tpp;
26   tpp.full_assign();
27   for(uint32_t i=0; i<4; i++){
28     mta.talloc(&tcp_pe_singleton::set_pe,
29                (void*)&tpp.at(i));
30   }
31   mta.talloc(set_pes, inst);
32   while(tcp_pe_singleton::is_server_working()){
33     Sleep(10);
34   }
35 }

3.1.3 hetero Container
The hetero container aggregates PE pools of different kinds into a single container. Listing 3 shows its use. The N-th pool is obtained with get_pecN(): line 4 accesses the thread_pp, line 6 the mpi_pp, and line 8 the tcp_pp. Lines 11-14 iterate over all PEs in the container with begin() and end() and allocate a task to each PE, regardless of whether the PE belongs to the thread_pp, the mpi_pp, or the tcp_pp.

Listing 3: Example of the hetero container
 1 void test(){
 2   hetero<thread_pp, mpi_pp, tcp_pp> pec;
 3   // thread_pe: one per local CPU core
 4   pec.get_pec0().full_assign();
 5   // mpi_pe: one per MPI process
 6   pec.get_pec1().full_assign();
 7   // tcp_pe: 64 PEs
 8   pec.get_pec2().assign(64);
 9   pec.reflush();
10
11   hetero<thread_pp, mpi_pp, tcp_pp>::iterator it;
12   for(it=pec.begin(); it!=pec.end(); it++){
13     it.talloc(/* task */);
14   }
15   pec.join_all();
16 }

3.2 Task Mapping
Because the PEs in a heterogeneous container differ in performance, distributing a loop evenly over them is not always appropriate. TPDPL therefore provides mapping objects that split an iteration range over the PEs of a container according to a policy. Three policies are provided: even divides the range equally, clock weights each PE by its CPU clock, and test weights each PE by the result of a short test run. Listing 4 shows their use: a for_xxx object wraps a PE container, its talloc splits the given range over the PEs, and the partial results are collected in a reducer, whose jreduce() joins the tasks and reduces their return values.

Listing 4: Task mapping with for_even, for_clock, and for_test
 1 void test(){
 2   thread_pp pec;
 3   {
 4     reducer<int> ret;
 5     ret += for_even(pec).talloc(load, 1, 10000);
 6     ret.jreduce(); // join & reduce
 7   }
 8   {
 9     reducer<int> ret;
10     ret += for_clock(pec).talloc(load, 1, 10000);
11     ret.jreduce();
12   }
13   {
14     reducer<int> ret;
15     ret += for_test(pec).talloc(load, 1, 10000);
16     ret.jreduce();
17   }
18 }
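The internal partitioning used by the for_xxx objects is not shown here. As a standalone illustration of the idea behind these policies in plain C++ (not TPDPL's implementation; the function and parameter names are invented for this sketch), an iteration range can be split in proportion to per-PE weights as follows:

#include <stddef.h>
#include <stdint.h>
#include <utility>
#include <vector>

// Split the inclusive range [start, end] into one sub-range per PE,
// sized in proportion to that PE's weight.
std::vector<std::pair<int64_t, int64_t> >
split_range(int64_t start, int64_t end, const std::vector<double>& weights) {
    double total = 0;
    for (size_t i = 0; i < weights.size(); ++i) total += weights[i];

    std::vector<std::pair<int64_t, int64_t> > ranges;
    const int64_t n = end - start + 1;
    int64_t pos = start;
    for (size_t i = 0; i < weights.size(); ++i) {
        // The last PE takes whatever remains so the shares cover the range exactly.
        int64_t share = (i + 1 == weights.size())
                            ? (end - pos + 1)
                            : (int64_t)(n * weights[i] / total);
        ranges.push_back(std::make_pair(pos, pos + share - 1));
        pos += share;
    }
    return ranges;
}

Equal weights reproduce the even policy, weights proportional to each PE's clock frequency correspond to the clock policy, and weights obtained from a short calibration run on each PE correspond to the test policy; each PE then receives its pair as the (start, end) arguments of its task.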

4. Evaluation of Task Mapping

4.1 Evaluation Environment
TPDPL was evaluated on a heterogeneous environment consisting of the T2K Open Supercomputer (HA8000 cluster; hereafter S.C.), a private cluster, and a HaaS (hardware-as-a-service) cloud environment (StarBED 13)):

- S.C. (HA8000): AMD Opteron 8356 2.3 GHz, 4 cores × 4 CPUs per node; RedHat Enterprise Linux 5; gcc 4.1.2; MPICH-MX (MPI 1.2)
- Private cluster: Intel Xeon W3530 2.8 GHz, 4 cores; Ubuntu 10.04 LTS; gcc 4.4.3; MPICH2
- StarBED: Intel Xeon X5670 2.93 GHz, 6 cores × 2 CPUs; Debian 6.0.2; gcc 4.4.5

In total 128 PEs are used (Fig. 2): 4 thread_pe on node0, 4 mpi_pe on node1, 60 tcp_pe on the S.C., and 60 tcp_pe on StarBED. Listing 5 shows how this pool is built as a hetero container holding a thread_pp, an mpi_pp, and two tcp_pp. The tcp_pe on the S.C. share files with node0 over NFS and communicate over a 1.25 GB/s (up to 7.5 GB/s) network; the tcp_pe on StarBED are reached through JGN-X 14) over 1 Gb/s links, so the S.C. side and StarBED communicate over a WAN and share files via NFS.

Listing 5: PE configuration used in the evaluation
 1 hetero<thread_pp, mpi_pp, tcp_pp, tcp_pp> pec;
 2 // thread_pe pool: one per local CPU core (4 PEs)
 3 pec.pec0.full_assign();
 4 // mpi_pe pool: one per CPU (4 PEs)
 5 pec.pec1.full_assign();
 6 // S.C.: 60 PEs
 7 pec.pec2.assign(60);
 8 // StarBED: 60 PEs
 9 for(int i=0; i<60; i++){
10   pec.pec3.assign(ip[i], port[i]);
11 }

Fig. 2: PE configuration of the evaluation environment.

4.2 Workloads
Two workloads are used (Listing 6): load_int performs integer arithmetic and load_double performs double-precision floating-point arithmetic. In each measurement the outer iteration range 1 to 10,000,000 is split over the PEs, and each PE runs the for loop over its assigned sub-range.

Listing 6: Workload functions
 1 uint64_t load_int(int64_t start, int64_t end){
 2   uint64_t a=0;
 3   for(uint32_t i=start; i<=end; i++)
 4     for(uint32_t j=0; j<=1000; j++)
 5       a += (uint64_t)(i+j)*(i*j)*(i*j)*(i/j);
 6   return a;
 7 }
 8 uint64_t load_double(int64_t start, int64_t end){
 9   uint64_t a=0;
10   for(uint32_t i=start; i<=end; i++)
11     for(uint32_t j=0; j<=1000; j++){
12       double ii=(double)i, jj=(double)j;
13       a += ((jj/ii)*(ii*jj)/(ii*jj)*(ii/jj));
14     }
15   return a;
16 }

The three mapping policies of Section 3.2 are compared. even gives each PE an equal share of the range (10,000,000/68 iterations per PE); clock gives each PE a share proportional to its clock frequency (10,000,000 × clock / total clock); test first runs a short calibration task on each PE and gives it a share proportional to the measured performance (10,000,000 × score / total score). even and clock are determined statically from the CPU specifications, whereas test requires one extra run before the main loop.
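As a usage sketch of how load_int can be dispatched over the pool of Listing 5 with the mapping interface of Listing 4 (assuming the for_xxx wrappers accept a hetero container in the same way they accept a single pool), one measurement run can be written roughly as:

// pec is the heterogeneous 128-PE container built in Listing 5;
// load_int and the range 1..10,000,000 are those of Listing 6.
reducer<uint64_t> ret;
ret += for_even(pec).talloc(load_int, 1, 10000000);
ret.jreduce();   // join the distributed tasks and reduce their partial sums

Replacing for_even with for_clock or for_test selects the other two policies compared in the next subsection.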

4.3 Results
Figure 3 shows the execution times of load_int under the even, clock, and test policies, using the PEs on node0, node1, the S.C., and StarBED. Figure 4 shows the corresponding comparison for load_double, and Fig. 5 shows the execution time of load_double on each individual PE under the test policy (4 PEs on node0, 4 on node1, 60 on the S.C., and 60 on StarBED). These measurements confirm the behavior of the TPDPL task mapping system across all four kinds of PEs.

5. Conclusion
We designed and implemented a C++ task mapping library, TPDPL, consisting of PE containers and task mapping algorithms, and evaluated it on a heterogeneous environment combining the S.C. (T2K Open Supercomputer), a private cluster, and a StarBED-based cloud.

References
1) Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier Tardieu and David Grove: Report on the Programming Language X10 version 2.1, http://dist.codehaus.org/x10/documentation/languagespec/x10-latest.pdf (2011).
2) Yonghong Yan, Jisheng Zhao, Yi Guo and Vivek Sarkar: Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement, Proc. 22nd Workshop on Languages and Compilers for Parallel Computing (LCPC), October 2009.
3) Omni/SCASH: OpenMP over Ethernet, IPSJ SIG Technical Report, 2002-HPC-91-21, pp. 119-124 (2002).
4) XcalableMP, IPSJ Transactions on Advanced Computing Systems (ACS), Vol. 3, No. 3, pp. 153-165 (2010).
5) Y. Ishikawa, M. Matsuda, T. Kudoh, H. Tezuka and S. Sekiguchi: GridMPI, SWoPP 2003 (2003).
6) Yutaka Ishikawa, Atsushi Hori, Mitsuhisa Sato, Motohiko Matsuda, Jorg Nolte, Hiroshi Tezuka, Hiroki Konaka, Munenori Maeda and Kazuto Kubota: Design and Implementation of Metalevel Architecture in C++ -- MPC++ Approach --, Proc. Reflection '96 Conference, April 20-23, 1996.
7) Threading Building Blocks web site, http://threadingbuildingblocks.org/ (2011).
8) IPSJ SIG Technical Report, 2011-HPC-129 (2011).
9) IPSJ SIG Technical Report, 2011-HPC-129 (2011).

10) The C++ Standards Committee, http://www.open-std.org/jtc1/sc22/wg21/
11) C++ parallel processing library, HPCS2011, IPSJ Symposium Series, Vol. 2011, p. 82 (2011).
12) C++ tpdplib: evaluation on the T2K open supercomputer with NPB, IPSJ SIG Technical Report, 2011-HPC-129(26) (Mar. 2011).
13) StarBED Project, http://www.starbed.org/
14) JGN-X, http://www.jgn.nict.go.jp/