Introduction to XcalableMP


XcalableMP. HPC-Phys, August 22, 2018.

Outline: an overview of XcalableMP (XMP); programming with XMP; a Lattice QCD implementation in XMP.

Relationship to MPI: XMP programs are translated into MPI-level communication, and XMP code can coexist with hand-written MPI.

Overview of XMP (1/2): a directive-based parallel language for PC clusters and other distributed-memory systems, specified by the PC Cluster Consortium. It extends Fortran and C (coarrays included; C++ is also targeted). XMP programs are translated into MPI and can be mixed with MPI and OpenMP code, and the directive style follows OpenMP. http://xcalablemp.org

Overview of XMP (2/2): the execution model is SPMD (Single Program Multiple Data), the same as MPI. XMP offers two programming models: a global-view model that inherits its data-distribution directives from HPF, and a local-view model based on coarrays.
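
As a minimal sketch of the SPMD model (assuming the xmp.h runtime header and its xmp_node_num() / xmp_num_nodes() query functions), every node runs the same program:

#include <stdio.h>
#include <xmp.h>
#pragma xmp nodes p[*]

int main(void){
    /* Each node executes this same main() (SPMD).             */
    /* xmp_node_num() is the 1-origin node number,             */
    /* xmp_num_nodes() the total number of executing nodes.    */
    printf("node %d of %d\n", xmp_node_num(), xmp_num_nodes());
    return 0;
}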

Programming with XMP.

XMP features, part 1: the global-view model and the loop directive.

XMP directives are written as #pragma xmp ... in C and as !$xmp ... comment directives in Fortran; the same constructs are available in both base languages.

The same computation in XMP and in MPI. The XMP version parallelizes the loop and reduces res with directives alone; the MPI version must compute the loop bounds and combine the partial sums by hand.

XMP version:

int array[MAX], res = 0;
#pragma xmp nodes p[*]
#pragma xmp template t[MAX]
#pragma xmp distribute t[block] onto p
#pragma xmp align array[i] with t[i]

int main(){
#pragma xmp loop on t[i] reduction(+:res)
    for (int i = 0; i < MAX; i++){
        array[i] = func(i);
        res += array[i];
    }
    return 0;
}

MPI version:

int array[MAX], res = 0;

int main(int argc, char **argv){
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dx = MAX / size;
    int llimit = rank * dx;
    int ulimit = (rank != (size - 1)) ? llimit + dx : MAX;
    int temp_res = 0;
    for (int i = llimit; i < ulimit; i++){
        array[i] = func(i);
        temp_res += array[i];
    }
    MPI_Allreduce(&temp_res, &res, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

The loop directive distributes the iterations of a for loop according to a template:

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
int a[16];
#pragma xmp align a[i] with t[i]

#pragma xmp loop on t[i]
for (int i = 0; i < 16; i++){ ... }

The 16 elements of a[16] are distributed in blocks of four: p[0] owns indices 0-3, p[1] 4-7, p[2] 8-11, and p[3] 12-15, and each node executes only the iterations it owns.

The loop bounds need not cover the whole template; each node still executes only the owned part of the iteration space:

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
int a[16];
#pragma xmp align a[i] with t[i]

#pragma xmp loop on t[i]
for (int i = 2; i < 11; i++){ ... }

With the same block distribution, p[0] executes i = 2-3, p[1] i = 4-7, p[2] i = 8-10, and p[3] executes nothing.


XMP can be combined with OpenMP: the loop directive splits iterations across nodes, and OpenMP parallelizes each node's share across its threads.
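
A minimal hybrid sketch (the function, array, and computed values are invented for illustration):

#pragma xmp nodes p[4]
#pragma xmp template t[16]
#pragma xmp distribute t[block] onto p
double a[16];
#pragma xmp align a[i] with t[i]

void compute(void){
    /* xmp loop distributes i across the 4 nodes; OpenMP then */
    /* threads the node-local iterations.                     */
#pragma xmp loop on t[i]
#pragma omp parallel for
    for (int i = 0; i < 16; i++)
        a[i] = 2.0 * i;
}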

XMP also provides global communication directives such as bcast (broadcast) and reduction.
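
A hedged sketch of both directives (assuming the 1-origin xmp_node_num() query and the p[0] node reference; the value 42 is arbitrary):

#include <xmp.h>
#pragma xmp nodes p[*]

int x, sum;

int main(void){
    /* bcast: the value of x on node p[0] is sent to every node. */
    if (xmp_node_num() == 1) x = 42;
#pragma xmp bcast (x) from p[0]

    /* reduction: sums the node-local values of sum and leaves   */
    /* the total on every node.                                  */
    sum = xmp_node_num();
#pragma xmp reduction (+:sum)
    return 0;
}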

The gmove directive assigns between sections of distributed arrays, inserting whatever communication is required. In XMP/C the section notation is array-name[base:length]:

#pragma xmp gmove
a[2:4] = b[3:4];

With a[8] and b[8] block-distributed over p[0]-p[3], this copies the four elements b[3]..b[6] into a[2]..a[5].

In Fortran, gmove uses the usual section notation array-name(lower:upper):

!$xmp gmove
a(3:6) = b(4:7)

With a(8) and b(8) block-distributed over p(1)-p(4), this copies b(4)..b(7) into a(3)..a(6).
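
A self-contained gmove sketch in XMP/C (the arrays, sizes, and distribution are chosen for illustration):

#pragma xmp nodes p[4]
#pragma xmp template t[8]
#pragma xmp distribute t[block] onto p
int a[8], b[8];
#pragma xmp align a[i] with t[i]
#pragma xmp align b[i] with t[i]

void copy_section(void){
    /* Copies b[3]..b[6] into a[2]..a[5]; the runtime moves */
    /* elements between the owning nodes behind the scenes. */
#pragma xmp gmove
    a[2:4] = b[3:4];
}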

The shadow directive allocates halo (shadow) regions around each node's block, and the reflect directive fills them with the neighbours' boundary elements:

C:

#pragma xmp nodes p[3]
#pragma xmp template t[9]
#pragma xmp distribute t[block] onto p
int a[9];
#pragma xmp align a[i] with t[i]
#pragma xmp shadow a[1:1]
...
#pragma xmp reflect (a)

Fortran:

!$xmp nodes p(3)
!$xmp template t(9)
!$xmp distribute t(block) onto p
integer :: a(9)
!$xmp align a(i) with t(i)
!$xmp shadow a(1:1)
...
!$xmp reflect (a)

shadow a[1:1] reserves one halo element on each side of every node's block; reflect (a) exchanges boundary data between neighbouring nodes to keep those halos up to date.

With the halos filled, a three-point stencil becomes an ordinary parallel loop:

C:

#pragma xmp loop on t[i]
for (int i = 1; i < 8; i++){
    b[i] = a[i-1] + a[i] + a[i+1];
}

Fortran:

!$xmp loop on t(i)
do i = 2, 8
    b(i) = a(i-1) + a(i) + a(i+1)
end do

The complete pattern: declare the shadow once, then reflect before each stencil loop:

C:

#pragma xmp shadow a[1:1]
...
#pragma xmp reflect (a)
#pragma xmp loop on t[i]
for (int i = 1; i < 8; i++){
    b[i] = a[i-1] + a[i] + a[i+1];
}

Fortran:

!$xmp shadow a(1:1)
...
!$xmp reflect (a)
!$xmp loop on t(i)
do i = 2, 8
    b(i) = a(i-1) + a(i) + a(i+1)
end do
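
Putting the pieces together, a minimal complete stencil sketch in XMP/C (the initialization is invented for illustration):

#pragma xmp nodes p[3]
#pragma xmp template t[9]
#pragma xmp distribute t[block] onto p
int a[9], b[9];
#pragma xmp align a[i] with t[i]
#pragma xmp align b[i] with t[i]
#pragma xmp shadow a[1:1]

int main(void){
#pragma xmp loop on t[i]
    for (int i = 0; i < 9; i++)
        a[i] = i;                 /* each node fills its own block */

    /* Refresh the one-element halos, then update interior points. */
#pragma xmp reflect (a)
#pragma xmp loop on t[i]
    for (int i = 1; i < 8; i++)
        b[i] = a[i-1] + a[i] + a[i+1];
    return 0;
}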

XMP features, part 2: the local-view model, built on coarrays adopted from Fortran 2008 and also provided in XMP/C.

Coarray one-sided communication, in Fortran and in XMP/C:

Fortran:

real :: a(8)
real :: b(8)[*]
if (this_image() == 1) then
  b(6)[2] = b(4)     ! put to image 2
  a(1) = b(3)[2]     ! get from image 2
end if
sync all

XMP/C:

double a[8];
double b[8]:[*];
if (xmpc_this_image() == 1){
  b[6]:[2] = b[4];   /* put to image 2 */
  a[0] = b[3]:[2];   /* get from image 2 */
}
xmpc_sync_all(NULL);

The codimension ([*] in Fortran, :[*] in XMP/C) makes b remotely accessible from every image; sync all / xmpc_sync_all() is a barrier that completes the one-sided transfers.

Array sections can also be used on coarrays in XMP/C:

if (xmpc_this_image() == 1){
  b[10:5]:[2] = b[0:5];    /* put five elements to image 2  */
  a[:]:[2] = b[:];         /* put a whole array             */
  c[:][9]:[2] = c[:][0];   /* put one column of a 2-D array */
}

With c declared as c[10][10] on every image, the last statement copies image 1's column 0 into column 9 on image 2.
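
A minimal runnable sketch of a coarray put, using only the calls shown above (the array, sizes, and image numbers are invented):

#include <xmp.h>

double buf[10]:[*];    /* coarray: one instance per image */

int main(void){
    if (xmpc_this_image() == 1){
        /* Put: copy this image's buf[8] into buf[0] on image 2. */
        buf[0]:[2] = buf[8];
    }
    xmpc_sync_all(NULL);   /* barrier: remote writes complete here */
    return 0;
}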

Compilation: user code (a.c) is compiled with the xmpcc command of the Omni compiler into an ordinary executable (a.out):

$ xmpcc a.c -o a.out

https://omni-compiler.org
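
Since the Omni compiler translates XMP into an MPI program, the executable is presumably launched through the usual MPI launcher; a sketch, assuming a four-node run:

$ xmpcc a.c -o a.out
$ mpirun -np 4 ./a.out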

Implementation and evaluation of Lattice QCD in XMP.


The solver is a CG iteration over the Wilson-Dirac operator WD; each statement is tagged with the kernel it exercises:

S = B                  // COPY
R = B                  // COPY
X = B                  // COPY
sr = norm(S)           // NORM
T = WD(U,X)            // Main Kernel
S = WD(U,T)            // Main Kernel
R = R - S              // AXPY
P = R                  // COPY
rrp = rr = norm(R)     // NORM
do{
  T = WD(U,P)          // Main Kernel
  V = WD(U,T)          // Main Kernel
  pap = dot(V,P)       // DOT
  cr = rr/pap
  X = cr * P + X       // AXPY
  R = -cr * V + R      // AXPY
  rr = norm(R)         // NORM
  bk = rr/rrp
  P = bk * P           // SCAL
  P = P + R            // AXPY
  rrp = rr
}while(rr/sr > 1.E-16)

The Wilson-Dirac kernel is parallelized with an XMP loop plus OpenMP, and a periodic reflect updates the halos of X before each call:

#pragma xmp reflect (X) width(/periodic/..) orthogonal
WD(X,...);

void WD(Quark_t X[NT][NZ][NY][NX], ... ){
  :
#pragma xmp loop on t[t][z]
#pragma omp parallel for collapse(4)
  for(int t=0;t<NT;t++)
    for(int z=0;z<NZ;z++)
      for(int y=0;y<NY;y++)
        for(int x=0;x<NX;x++){
          :
        }
}

Three forms of reflect are used: the basic halo exchange, an exchange with periodic boundary width, and the orthogonal clause, which suppresses the diagonal-neighbour transfers of a multi-dimensional halo exchange:

#pragma xmp reflect (a)
#pragma xmp reflect (a) width(/periodic/1,..)
#pragma xmp reflect (a) orthogonal
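
A sketch of the extended reflect on a 2-D block distribution (the array, sizes, and node shape are invented; orthogonal is the extension shown above):

#pragma xmp nodes p[2][2]
#pragma xmp template t[16][16]
#pragma xmp distribute t[block][block] onto p
double u[16][16];
#pragma xmp align u[i][j] with t[i][j]
#pragma xmp shadow u[1:1][1:1]

void halo_update(void){
    /* Width-1 periodic halo exchange in both dimensions;   */
    /* orthogonal skips the diagonal-neighbour transfers.   */
#pragma xmp reflect (u) width(/periodic/1,/periodic/1) orthogonal
}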


(Figure: performance in GFlops versus number of processes, 1 to 256, for binaries built with the Intel compiler and with the Omni compiler; higher is better.) The Omni compiler achieves 94-105% of the Intel compiler's performance.

(Figure: total lines of code — OpenMP base code 854, XMP+OpenMP 968, MPI+OpenMP 979 — almost the same.) Measured against the base code, XMP+OpenMP required 4 modified plus 114 added lines, versus 53 modified plus 125 added for MPI+OpenMP: a 34% reduction (118/178). Qualitatively, most of the 114 lines added for XMP+OpenMP are XMP directive insertions, whereas the 125 lines added for MPI+OpenMP create new functions for communication. It is easier to develop this parallel application in XMP+OpenMP than in MPI+OpenMP.
