Introduction to XcalableMP


Introduction to XcalableMP (HPC-Phys, August 22, 2018)

Outline: what XcalableMP (XMP) is, basic XMP programming, and a Lattice QCD implementation in XMP.

Background: parallel programming with MPI and where XMP fits in.

XMP overview (1/2): a directive-based parallel language extension for distributed-memory systems such as PC clusters. The base languages are Fortran and C, coarray features are supported, and XMP code can be combined with MPI and OpenMP. See http://xcalablemp.org

XMP overview (2/2): the execution model is SPMD (Single Program Multiple Data), the same as MPI. XMP offers two programming models: a global-view model using HPF-like directives and a local-view model using coarrays. A minimal global-view sketch follows.
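For orientation, here is a minimal global-view sketch assembled only from directives that appear later in this deck (nodes, template, distribute, align, loop); the array, its size, and the computation are placeholder choices, not code from the original slides. The corresponding local-view style appears on the coarray slides later in the deck.

#pragma xmp nodes p[4]                  /* execute on 4 nodes                 */
#pragma xmp template t[16]              /* index template of size 16          */
#pragma xmp distribute t[block] onto p  /* block-distribute the template      */
int a[16];
#pragma xmp align a[i] with t[i]        /* distribute a according to t        */

int main(void){
#pragma xmp loop on t[i]                /* each node runs only its own i      */
  for (int i = 0; i < 16; i++){
    a[i] = i * i;                       /* placeholder computation            */
  }
  return 0;
}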

Next section: basic XMP programming.

Global-view programming: arrays and loops are distributed across nodes with directives. The following slides show the same example in XMP (using the loop directive) and in MPI for comparison.

The example code is shown in both C and Fortran.

Comparison of the XMP version and the MPI version of the same computation (MAX and func are assumed to be defined elsewhere).

XMP version:

int array[MAX], res = 0;
#pragma xmp nodes p[*]
#pragma xmp template t[MAX]
#pragma xmp distribute t[block] onto p
#pragma xmp align array[i] with t[i]

int main(){
#pragma xmp loop on t[i] reduction(+:res)
  for (int i = 0; i < MAX; i++){
    array[i] = func(i);
    res += array[i];
  }
  return 0;
}

MPI version:

#include <mpi.h>

int array[MAX], res = 0;

int main(int argc, char **argv){
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int dx = MAX/size;
  int llimit = rank * dx;
  int ulimit = (rank != (size - 1)) ? llimit + dx : MAX;
  int temp_res = 0;
  for (int i = llimit; i < ulimit; i++){
    array[i] = func(i);
    temp_res += array[i];
  }
  MPI_Allreduce(&temp_res, &res, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

The loop directive parallelizes a for loop over the nodes:

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
int a[16];
#pragma xmp align a[i] with t[i]

#pragma xmp loop on t[i]
for(int i=0;i<16;i++){ ... }

With the block distribution of a[16] over 4 nodes, iterations 0-3 run on p[0], 4-7 on p[1], 8-11 on p[2], and 12-15 on p[3].

If the loop covers only part of the index range, each node still executes only the iterations that fall into its own block:

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
int a[16];
#pragma xmp align a[i] with t[i]

#pragma xmp loop on t[i]
for(int i=2;i<11;i++){ ... }

Here p[0] executes i = 2-3, p[1] executes i = 4-7, p[2] executes i = 8-10, and p[3] executes nothing.

The loop directive (continued).

XMP + OpenMP: XMP distributes a loop across nodes and OpenMP threads each node's share, as sketched below.
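A minimal hybrid sketch (the array, function name, and computation are placeholders, not from the original slide), following the same pattern as the WD kernel on the Lattice QCD slide: the xmp loop directive splits the iteration space across nodes, and omp parallel for threads each node's portion.

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
int a[16];
#pragma xmp align a[i] with t[i]

void scale_all(void){
  /* node-level distribution by XMP, thread-level parallelism by OpenMP */
#pragma xmp loop on t[i]
#pragma omp parallel for
  for (int i = 0; i < 16; i++){
    a[i] = 2 * a[i];
  }
}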

XMP also provides communication directives such as bcast and reduction (a sketch follows).
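The slide's own examples did not survive the transcription, so the following is a hedged sketch assuming the standard XMP directive forms #pragma xmp bcast (var) and #pragma xmp reduction (op:var); the surrounding variables are placeholders, and the available clauses should be checked against the XcalableMP specification.

#pragma xmp nodes p[*]

int n = 0;
int sum = 0;

void communicate(void){
  /* assume each node has computed a partial result in sum */
  sum = 1;

  /* reduction: combine sum over all nodes with + and share the result */
#pragma xmp reduction (+:sum)

  /* bcast: copy the value of n to all nodes (a from clause can name the source node) */
#pragma xmp bcast (n)
}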

The gmove directive assigns between sections of distributed arrays, generating the required communication. In XMP/C an array section is written array-name[base:length].

#pragma xmp gmove
a[2:4] = b[3:4];

Here a[8] and b[8] are both block-distributed over p[0]-p[3], and elements b[3]-b[6] are copied into a[2]-a[5], crossing node boundaries where necessary.

The same operation in XMP/Fortran, where a section is written array-name(lower:upper):

!$xmp gmove
a(3:6) = b(4:7)

Arrays a(8) and b(8) are block-distributed over p(1)-p(4), and elements b(4)-b(7) are copied into a(3)-a(6).

The shadow directive adds halo (shadow) elements around each node's block of a distributed array, and the reflect directive fills them with the neighboring nodes' boundary values.

XMP/C:
#pragma xmp nodes p[3]
#pragma xmp template t[9]
#pragma xmp distribute t[block] onto p
int a[9];
#pragma xmp align a[i] with t[i]
#pragma xmp shadow a[1:1]
...
#pragma xmp reflect (a)

XMP/Fortran:
!$xmp nodes p(3)
!$xmp template t(9)
!$xmp distribute t(block) onto p
integer :: a(9)
!$xmp align a(i) with t(i)
!$xmp shadow a(1:1)
...
!$xmp reflect (a)

shadow a[1:1] declares one shadow element on each side of every node's block.

The same code as above; this slide highlights the reflect step, which copies each neighbor's boundary element into the local shadow region.

With the shadow regions filled, a stencil loop can read its neighbors' elements directly:

XMP/C:
#pragma xmp loop on t[i]
for(int i=1;i<9;i++){
  b[i] = a[i-1] + a[i] + a[i+1];
}

XMP/Fortran:
!$xmp loop on t(i)
do i = 2, 8
  b(i) = a(i-1) + a(i) + a(i+1)
end do

Putting it together: reflect is issued before the stencil loop so that the shadow elements are up to date.

XMP/C:
#pragma xmp shadow a[1:1]
...
#pragma xmp reflect (a)
#pragma xmp loop on t[i]
for(int i=1;i<9;i++){
  b[i] = a[i-1] + a[i] + a[i+1];
}

XMP/Fortran:
!$xmp shadow a(1:1)
...
!$xmp reflect (a)
!$xmp loop on t(i)
do i = 2, 8
  b(i) = a(i-1) + a(i) + a(i+1)
end do

Local-view programming: as with MPI, each node works on its own data, and remote data is accessed one-sidedly with coarrays. The coarray notation follows Fortran 2008; XMP/C provides an equivalent extension for C.

Coarray in XMP/C: remote data on another image is read and written with coarray syntax. XMP/C uses the :[image] notation corresponding to Fortran's [image].

Fortran coarray:
real a(8)
real b(8)[*]
if(this_image() == 1) then
  b(6)[2] = b(4)    ! put: write local b(4) into b(6) on image 2
  a(0) = b(3)[2]    ! get: read b(3) from image 2
end if
sync all

XMP/C:
double a[8];
double b[8]:[*];
if(xmpc_this_image() == 1){
  b[6]:[2] = b[4];  /* put: write local b[4] into b[6] on image 2 */
  a[0] = b[3]:[2];  /* get: read b[3] from image 2 */
}
xmpc_sync_all(NULL);

Array sections can be combined with coarray notation in XMP/C. Here image 1 writes into arrays on image 2 (c is declared as c[10][10] on both images):

if(xmpc_this_image() == 1){
  b[10:5]:[2] = b[0:5];   /* copy 5 elements into b[10..14] on image 2 */
  a[:]:[2] = b[:];        /* copy all of b into a on image 2 */
  c[:][9]:[2] = c[:][0];  /* copy column 0 of c into column 9 on image 2 */
}

Compiling and running: an XMP/C source file (e.g. a.c) is compiled with the Omni compiler,

$ xmpcc a.c -o a.out

which produces the executable a.out; the executable is then launched in the usual way for an MPI program (e.g. with mpirun). The Omni compiler is available from https://omni-compiler.org

Next section: applying XMP to Lattice QCD.


The main kernel of the Lattice QCD code is a CG solver:

S = B                       // COPY
R = B                       // COPY
X = B                       // COPY
sr = norm(S)                // NORM
T = WD(U,X)                 // Main kernel
S = WD(U,T)                 // Main kernel
R = R - S                   // AXPY
P = R                       // COPY
rrp = rr = norm(R)          // NORM
do{
  T = WD(U,P)               // Main kernel
  V = WD(U,T)               // Main kernel
  pap = dot(V,P)            // DOT
  cr = rr/pap
  X = cr * P + X            // AXPY
  R = -cr * V + R           // AXPY
  rr = norm(R)              // NORM
  bk = rr/rrp
  P = bk * P                // SCAL
  P = P + R                 // AXPY
  rrp = rr
}while(rr/sr > 1.E-16)

The main kernel WD is parallelized with XMP and OpenMP; the halo of X is exchanged with reflect before each call:

#pragma xmp reflect (X) width(/periodic/..) orthogonal
WD(X, ...);

void WD(Quark_t X[NT][NZ][NY][NX], ...){
  :
#pragma xmp loop on t[t][z]
#pragma omp parallel for collapse(4)
  for(int t=0;t<NT;t++)
    for(int z=0;z<NZ;z++)
      for(int y=0;y<NY;y++)
        for(int x=0;x<NX;x++){
          :

Variants of the reflect directive:

#pragma xmp reflect (a)
#pragma xmp reflect (a) width(/periodic/1,..) orthogonal
#pragma xmp reflect (a) orthogonal

The width clause with the /periodic/ modifier also updates the shadow regions across the periodic (wrap-around) boundary, and the orthogonal clause updates shadows only in the orthogonal directions, skipping diagonal exchanges.


Performance (GFlops) versus the number of processes (1 to 256), comparing the Intel compiler and the Omni compiler (two panels, higher is better). The performance of the Omni compiler reaches 94-105% of that of the Intel compiler.

Productivity comparison, measured in lines of code:

- Total lines: 854 for the OpenMP base code, 968 for XMP+OpenMP, and 979 for MPI+OpenMP (almost the same).
- Lines changed from the base code to the parallel code: 118 for XMP+OpenMP (4 modifications, 114 additions) versus 178 for MPI+OpenMP (53 modifications, 125 additions), a 34% reduction (118/178).
- Qualitatively, most of the 114 added lines in XMP+OpenMP are XMP directives, while the 125 added lines in MPI+OpenMP create new functions for communication.

It is easier to develop a parallel application with XMP+OpenMP than with MPI+OpenMP.
