Parallel Programming for Multicore Processors using OpenMP
Part III: Parallel Version + Exercise
Kengo Nakajima, Information Technology Center
Programming for Parallel Computing ( ), Seminar on Advanced Computing ( )

OMP-3 Parallel Version: OpenMP
- OpenMP version of -sol
- The number of threads (= PEsmpTOT) can be controlled in the program.
- Fundamental idea: meshes of the same color/level are independent of each other, so they can be processed in parallel (concurrently).

[Figures: multicoloring example with 4 threads, showing the initial mesh and the mesh after renumbering according to color ID]

Meshes of the same color/level are independent, so they can be processed in parallel. Within each color, the renumbered meshes are divided among the threads (thread #0 to thread #3).

Files on FX10

  >$ cd <$O-TOP>
  >$ cp /home/ss/aics60//multicore-c.tar .
  >$ cp /home/ss/aics60/f/multicore-f.tar .
  >$ tar xvf multicore-c.tar
  >$ tar xvf multicore-f.tar
  >$ cd multicore

Confirm the following directories: L3, omp (<$O-L3>, <$O-stream>)

Files on FX10 (cont.)

  Location:           <$O-L3>/src, <$O-L3>/run
  Compile/run
    Main part:        cd <$O-L3>/src ; make
                      (creates the executable <$O-L3>/run/L3-sol)
    Control data:     <$O-L3>/run/INPUT.DAT
    Batch job script: <$O-L3>/run/go1.sh

Running the Code

  % cd <$O-L3>
  % ls
  run  src  src0  reorder0
  % cd src
  % make
  % cd ../run
  % ls L3-sol
  L3-sol
  % <modify INPUT.DAT>
  % <modify go1.sh>
  % pjsub go1.sh

Running the Program

[Figure: L3-sol (Poisson solver, FVM) reads the control file INPUT.DAT and writes test.inp, a file for ParaView.]

Control Data: INPUT.DAT

INPUT.DAT has five lines, read in this order:

  Line 1  NX, NY, NZ   Number of meshes in the X/Y/Z directions
  Line 2  DX, DY, DZ   Size of meshes
  Line 3  EPSICCG      Convergence criterion for ICCG (1.0e-08 in the sample file)
  Line 4  PEsmpTOT     Number of threads (16 in the sample file)
  Line 5  NCOLORtot    Reordering method + initial number of colors/levels (100 in the sample file)
                       >= 2: MC, = 0: CM, = -1: RCM, <= -2: CM-RCM

[Figure: NX x NY x NZ mesh in the x/y/z coordinate system]

go1.sh

  #!/bin/sh
  #PJM -L "node=1"
  #PJM -L "elapse=00:10:00"
  #PJM -L "rscgrp=lecture"
  #PJM -g "gt71"
  #PJM -j
  #PJM -o "arcm.lst"

  export OMP_NUM_THREADS=16
  ./L3-sol

OMP_NUM_THREADS is set to the same value as PEsmpTOT.

- Applying OpenMP to -sol
- Examples
- Optimization + Exercise

Applying OpenMP to -sol

On the ICCG solver:
- Dot products, DAXPY, Mat-Vec: there is no data dependency, so it is enough to insert directives.
- Preconditioning (IC factorization, forward/backward substitution): there is no data dependency within the same color, so meshes of the same color can be processed in parallel.

Just inserting directives works fine, but... (1/2)

(Mat-Vec)

!$omp parallel do private(i,VAL,k)
      do i= 1, N
        VAL= D(i)*W(i,P)
        do k= indexL(i-1)+1, indexL(i)
          VAL= VAL + AL(k)*W(itemL(k),P)
        enddo
        do k= indexU(i-1)+1, indexU(i)
          VAL= VAL + AU(k)*W(itemU(k),P)
        enddo
        W(i,Q)= VAL
      enddo
!$omp end parallel do

In this form, the number of threads cannot be controlled from within the program.

Just inserting directives works fine, but... (2/2)

(Forward Substitution)

      do icol= 1, NCOLORtot
!$omp parallel do private (i,VAL,k)
        do i= COLORindex(icol-1)+1, COLORindex(icol)
          VAL= D(i)
          do k= indexL(i-1)+1, indexL(i)
            VAL= VAL - (AL(k)**2) * DD(itemL(k))
          enddo
          DD(i)= 1.d0/VAL
        enddo
!$omp end parallel do
      enddo

In this form, the number of threads cannot be controlled from within the program.

Parallelizing the ICCG Method with OpenMP
- Dot product: OK
- DAXPY: OK
- Matrix-vector multiply: OK
- Preconditioning

Main Program (1/2)

      program MAIN
      use STRUCT
      use PCG
      use solver_ICCG_mc

      implicit REAL*8 (A-H,O-Z)
      real(kind=8), dimension(:), allocatable :: WK

      call INPUT
      call POINTER_INIT
      call BOUNDARY_CELL
      call CELL_METRICS
      call POI_GEN

      PHI= 0.d0
      call solve_ICCG_mc                                          &
     &    ( ICELTOT, NPL, NPU, indexL, itemL, indexU, itemU, D,   &
     &      BFORCE, PHI, AL, AU, NCOLORtot, PEsmpTOT,             &
     &      SMPindex, SMPindexG, EPSICCG, ITR, IER)

Main Program (2/2)

      allocate (WK(ICELTOT))

!C-- Renumbering of PHI back to the original numbering
      do ic0= 1, ICELTOT
        icel= NEWtoOLD(ic0)
        WK(icel)= PHI(ic0)
      enddo

      do icel= 1, ICELTOT
        PHI(icel)= WK(icel)
      enddo

      call OUTUCD

      stop
      end
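The renumbering step above can be tried in isolation. The following is a minimal sketch, not the lecture's code: the 6-entry NEWtoOLD table and the test values are hypothetical, chosen only so that the result is easy to check by eye.

```fortran
program renumber_demo
  implicit none
  integer, parameter :: ICELTOT = 6
  integer :: NEWtoOLD(ICELTOT)
  real(kind=8) :: PHI(ICELTOT), WK(ICELTOT)
  integer :: ic0, icel

  ! Hypothetical table (assumption): new ID -> old ID
  NEWtoOLD = (/ 1, 3, 5, 2, 4, 6 /)

  ! Pretend solution stored in NEW numbering: value = 10 * (old ID)
  do ic0 = 1, ICELTOT
    PHI(ic0) = 10.0d0 * dble(NEWtoOLD(ic0))
  enddo

  ! Scatter into a work array indexed by the OLD ID, then copy back
  do ic0 = 1, ICELTOT
    icel = NEWtoOLD(ic0)
    WK(icel) = PHI(ic0)
  enddo
  PHI = WK

  ! After renumbering, entry icel must hold 10 * icel
  do icel = 1, ICELTOT
    if (abs(PHI(icel) - 10.0d0*dble(icel)) > 1.0d-12) error stop 'renumbering failed'
  enddo
  print '(6f6.1)', PHI
end program renumber_demo
```

The work array is needed because the scatter cannot be done in place: writing PHI(icel) directly would overwrite values that have not been moved yet.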

module STRUCT

      module STRUCT
      use omp_lib
      include 'precision.inc'
!C
!C-- METRICs & FLUX
      integer (kind=kint) :: ICELTOT, ICELTOTp, N
      integer (kind=kint) :: NX, NY, NZ, NXP1, NYP1, NZP1, IBNODTOT
      integer (kind=kint) :: NXc, NYc, NZc
      real (kind=kreal) ::                                         &
     &     DX, DY, DZ, XAREA, YAREA, ZAREA, RDX, RDY, RDZ,         &
     &     RDX2, RDY2, RDZ2, R2DX, R2DY, R2DZ
      real (kind=kreal), dimension(:), allocatable ::              &
     &     VOLCEL, VOLNOD, RVC, RVN
      integer (kind=kint), dimension(:,:), allocatable ::          &
     &     XYZ, NEIBcell
!C
!C-- BOUNDARYs
      integer (kind=kint) :: ZmaxCELtot
      integer (kind=kint), dimension(:), allocatable :: BC_INDEX, BC_NOD
      integer (kind=kint), dimension(:), allocatable :: ZmaxCEL
!C
!C-- WORK
      integer (kind=kint), dimension(:,:), allocatable :: IWKX
      real(kind=kreal), dimension(:,:), allocatable :: FCV

      integer (kind=kint) :: PEsmpTOT
      end module STRUCT

  ICELTOT:              Number of meshes (NX x NY x NZ)
  N:                    Number of nodes
  NX, NY, NZ:           Number of meshes in the x/y/z directions
  NXP1, NYP1, NZP1:     Number of nodes in the x/y/z directions
  IBNODTOT:             = NXP1 x NYP1
  XYZ(ICELTOT,3):       Location of meshes
  NEIBcell(ICELTOT,6):  Neighboring meshes
  PEsmpTOT:             Number of threads

module PCG

      module PCG
      integer, parameter :: N2= 256
      integer :: NUmax, NLmax, NCOLORtot, NCOLORk, NU, NL
      integer :: NPL, NPU
      integer :: METHOD, ORDER_METHOD
      real(kind=8) :: EPSICCG
      real(kind=8), dimension(:), allocatable :: D, PHI, BFORCE
      real(kind=8), dimension(:), allocatable :: AL, AU
      integer, dimension(:), allocatable :: INL, INU, COLORindex
      integer, dimension(:), allocatable :: SMPindex, SMPindexG
      integer, dimension(:), allocatable :: OLDtoNEW, NEWtoOLD
      integer, dimension(:,:), allocatable :: IAL, IAU
      integer, dimension(:), allocatable :: indexL, itemL
      integer, dimension(:), allocatable :: indexU, itemU
      end module PCG

  NCOLORtot:                       Total number of colors/levels
  COLORindex(0:NCOLORtot):         Index of the number of meshes in each color/level;
                                   color icol holds COLORindex(icol) - COLORindex(icol-1) meshes
  SMPindex(0:NCOLORtot*PEsmpTOT),
  SMPindexG(0:PEsmpTOT):           Index arrays for OpenMP operations
  OLDtoNEW, NEWtoOLD:              Reference tables before/after renumbering

Variables/Arrays for Matrix (1/2)

  Name                        Type  Content
  D(N)                        R     Diagonal components of the matrix (N= ICELTOT)
  BFORCE(N)                   R     RHS vector
  PHI(N)                      R     Unknown vector
  indexL(0:N), indexU(0:N)    I     # of L/U non-zero off-diag. comp. (CRS)
  NPL, NPU                    I     Total # of L/U non-zero off-diag. comp. (CRS)
  itemL(NPL), itemU(NPU)      I     Column ID of L/U non-zero off-diag. comp. (CRS)
  AL(NPL), AU(NPU)            R     L/U non-zero off-diag. comp. (CRS)

  Name                        Type  Content
  NL, NU                      I     Max. # of L/U non-zero off-diag. comp. for each mesh (=6)
  INL(N), INU(N)              I     # of L/U non-zero off-diag. comp.
  IAL(NL,N), IAU(NU,N)        I     Column ID of L/U non-zero off-diag. comp.

Variables/Arrays for Matrix (2/2)

  Name                            Type  Content
  NCOLORtot                       I     Input: reordering method + initial number of colors/levels
                                        (>= 2: MC, = 0: CM, = -1: RCM, <= -2: CM-RCM)
                                        Output: final number of colors/levels
  COLORindex(0:NCOLORtot)         I     Number of meshes in each color/level, as a 1D compressed array.
                                        Meshes in the icol-th color/level are stored from
                                        COLORindex(icol-1)+1 to COLORindex(icol).
  NEWtoOLD(N)                     I     Reference array from new to old numbering
  OLDtoNEW(N)                     I     Reference array from old to new numbering
  PEsmpTOT                        I     Number of threads
  SMPindex(0:NCOLORtot*PEsmpTOT)  I     Array for OpenMP operations (for loops with data dependency)
  SMPindexG(0:PEsmpTOT)           I     Array for OpenMP operations (for loops without data dependency)
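To make the CRS arrays in the tables above concrete, here is a small sketch of a matrix-vector product using D, indexL/itemL/AL and indexU/itemU/AU. The 4x4 matrix tridiag(-1, 2, -1) and the vector X are hypothetical test data, not taken from the lecture; only the array names follow the tables.

```fortran
program crs_matvec_demo
  implicit none
  integer, parameter :: N = 4, NPL = 3, NPU = 3
  real(kind=8) :: D(N), AL(NPL), AU(NPU), X(N), Y(N), VAL
  integer :: indexL(0:N), indexU(0:N), itemL(NPL), itemU(NPU)
  integer :: i, k

  ! Hypothetical 4x4 matrix tridiag(-1, 2, -1) (assumption)
  D      = 2.0d0
  indexL = (/ 0, 0, 1, 2, 3 /)   ! row i owns AL(indexL(i-1)+1 : indexL(i))
  itemL  = (/ 1, 2, 3 /)         ! column IDs of the lower entries
  AL     = -1.0d0
  indexU = (/ 0, 1, 2, 3, 3 /)   ! row i owns AU(indexU(i-1)+1 : indexU(i))
  itemU  = (/ 2, 3, 4 /)         ! column IDs of the upper entries
  AU     = -1.0d0
  X      = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)

  ! Y = A*X: diagonal term, then lower part, then upper part
  do i = 1, N
    VAL = D(i)*X(i)
    do k = indexL(i-1)+1, indexL(i)
      VAL = VAL + AL(k)*X(itemL(k))
    enddo
    do k = indexU(i-1)+1, indexU(i)
      VAL = VAL + AU(k)*X(itemU(k))
    enddo
    Y(i) = VAL
  enddo

  ! For this data Y = (0, 0, 0, 5)
  print '(4f6.1)', Y
end program crs_matvec_demo
```

Row 1 has no lower entries and row 4 has no upper entries, which is why indexL(1) equals indexL(0) and indexU(4) equals indexU(3).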

input: reading INPUT.DAT

!C
!C***
!C*** INPUT
!C***
!C
!C    INPUT CONTROL DATA
!C
      subroutine INPUT
      use STRUCT
      use PCG

      implicit REAL*8 (A-H,O-Z)
      character*80 CNTFIL
!C
!C-- CNTL. file
      open (11, file='input.dat', status='unknown')
        read (11,*) NX, NY, NZ
        read (11,*) DX, DY, DZ
        read (11,*) EPSICCG
        read (11,*) PEsmpTOT
        read (11,*) NCOLORtot
      close (11)
!C===
      return
      end

The five lines of INPUT.DAT are read in the order NX/NY/NZ, DX/DY/DZ, EPSICCG, PEsmpTOT, NCOLORtot.

  PEsmpTOT:   Number of threads
  NCOLORtot:  Reordering method + initial number of colors/levels
              >= 2: MC, = 0: CM, = -1: RCM, <= -2: CM-RCM

cell_metrics

!C
!C***
!C*** CELL_METRICS
!C***
!C
      subroutine CELL_METRICS
      use STRUCT
      use PCG

      implicit REAL*8 (A-H,O-Z)
!C
!C-- ALLOCATE
      allocate (VOLCEL(ICELTOT))
      allocate (   RVC(ICELTOT))
!C
!C-- VOLUME, AREA, PROJECTION etc.
      XAREA= DY * DZ
      YAREA= DX * DZ
      ZAREA= DX * DY

      RDX= 1.d0 / DX
      RDY= 1.d0 / DY
      RDZ= 1.d0 / DZ

      RDX2= 1.d0 / (DX**2)
      RDY2= 1.d0 / (DY**2)
      RDZ2= 1.d0 / (DZ**2)
      R2DX= 1.d0 / (0.50d0*DX)
      R2DY= 1.d0 / (0.50d0*DY)
      R2DZ= 1.d0 / (0.50d0*DZ)

      V0 = DX * DY * DZ
      RV0= 1.d0/V0

      VOLCEL= V0
      RVC   = RV0

      return
      end

[Figure: a single mesh of size DX x DY x DZ in the x/y/z coordinate system; XAREA is the face normal to the x axis]

poi_gen (1/9)

      subroutine POI_GEN
      use STRUCT
      use PCG

      implicit REAL*8 (A-H,O-Z)
!C
!C-- INIT.
      nn = ICELTOT
      nnp= ICELTOTp

      NU= 6
      NL= 6

      allocate (BFORCE(nn), D(nn), PHI(nn))
      allocate (INL(nn), INU(nn), IAL(NL,nn), IAU(NU,nn))

      PHI   = 0.d0
      D     = 0.d0
      BFORCE= 0.d0

      INL= 0
      INU= 0
      IAL= 0
      IAU= 0

poi_gen (2/9)

!C
!C-- CONNECTIVITY
!C===
      do icel= 1, ICELTOT
        icn1= NEIBcell(icel,1)
        icn2= NEIBcell(icel,2)
        icn3= NEIBcell(icel,3)
        icn4= NEIBcell(icel,4)
        icn5= NEIBcell(icel,5)
        icn6= NEIBcell(icel,6)

        if (icn5.ne.0.and.icn5.le.ICELTOT) then
          icou= INL(icel) + 1
          IAL(icou,icel)= icn5
          INL(     icel)= icou
        endif

        if (icn3.ne.0.and.icn3.le.ICELTOT) then
          icou= INL(icel) + 1
          IAL(icou,icel)= icn3
          INL(     icel)= icou
        endif

        if (icn1.ne.0.and.icn1.le.ICELTOT) then
          icou= INL(icel) + 1
          IAL(icou,icel)= icn1
          INL(     icel)= icou
        endif

        if (icn2.ne.0.and.icn2.le.ICELTOT) then
          icou= INU(icel) + 1
          IAU(icou,icel)= icn2
          INU(     icel)= icou
        endif

        if (icn4.ne.0.and.icn4.le.ICELTOT) then
          icou= INU(icel) + 1
          IAU(icou,icel)= icn4
          INU(     icel)= icou
        endif

        if (icn6.ne.0.and.icn6.le.ICELTOT) then
          icou= INU(icel) + 1
          IAU(icou,icel)= icn6
          INU(     icel)= icou
        endif
      enddo
!C===

Lower triangular part (neighbors 5, 3 and 1):
  NEIBcell(icel,5)= icel - NX*NY
  NEIBcell(icel,3)= icel - NX
  NEIBcell(icel,1)= icel - 1

[Figure: mesh icel and its six neighbors NEIBcell(icel,1) to NEIBcell(icel,6)]

poi_gen (2/9), continued

In the same connectivity loop, the upper triangular part is built from neighbors 2, 4 and 6:
  NEIBcell(icel,2)= icel + 1
  NEIBcell(icel,4)= icel + NX
  NEIBcell(icel,6)= icel + NX*NY

poi_gen (3/9)

!C
!C-- MULTICOLORING
!C===
      allocate (OLDtoNEW(ICELTOT), NEWtoOLD(ICELTOT))
      allocate (COLORindex(0:ICELTOT))

 111  continue
      write (*,'(//a,i8,a)') 'You have', ICELTOT, ' elements.'
      write (*,'(  a     )') 'How many colors do you need ?'
      write (*,'(  a     )') '  #COLOR must be more than 2 and'
      write (*,'(  a,i8  )') '  #COLOR must not be more than', ICELTOT
      write (*,'(  a     )') '   CM if #COLOR .eq. 0'
      write (*,'(  a     )') '  RCM if #COLOR .eq.-1'
      write (*,'(  a     )') 'CMRCM if #COLOR .le.-2'
      write (*,'(  a     )') '=>'

      if (NCOLORtot.gt.0) then
        call MC    (ICELTOT, NL, NU, INL, IAL, INU, IAU,            &
     &              NCOLORtot, COLORindex, NEWtoOLD, OLDtoNEW)
      endif

      if (NCOLORtot.eq.0) then
        call CM    (ICELTOT, NL, NU, INL, IAL, INU, IAU,            &
     &              NCOLORtot, COLORindex, NEWtoOLD, OLDtoNEW)
      endif

      if (NCOLORtot.eq.-1) then
        call RCM   (ICELTOT, NL, NU, INL, IAL, INU, IAU,            &
     &              NCOLORtot, COLORindex, NEWtoOLD, OLDtoNEW)
      endif

      if (NCOLORtot.lt.-1) then
        call CMRCM (ICELTOT, NL, NU, INL, IAL, INU, IAU,            &
     &              NCOLORtot, COLORindex, NEWtoOLD, OLDtoNEW)
      endif

      write (*,'(//a,i8,// )') '### FINAL COLOR NUMBER', NCOLORtot

Reordering method selected by NCOLORtot:
  NCOLORtot >  1: Multicolor (MC)
  NCOLORtot =  0: CM
  NCOLORtot = -1: RCM
  NCOLORtot < -1: CM-RCM

poi_gen (4/9)

      allocate (SMPindex(0:PEsmpTOT*NCOLORtot))
      SMPindex= 0
      do ic= 1, NCOLORtot
        nn1= COLORindex(ic) - COLORindex(ic-1)
        num= nn1 / PEsmpTOT
        nr = nn1 - PEsmpTOT*num
        do ip= 1, PEsmpTOT
          if (ip.le.nr) then
            SMPindex((ic-1)*PEsmpTOT+ip)= num + 1
          else
            SMPindex((ic-1)*PEsmpTOT+ip)= num
          endif
        enddo
      enddo

      do ic= 1, NCOLORtot
        do ip= 1, PEsmpTOT
          j1= (ic-1)*PEsmpTOT + ip
          j0= j1 - 1
          SMPindex(j1)= SMPindex(j0) + SMPindex(j1)
        enddo
      enddo

      allocate (SMPindexG(0:PEsmpTOT))
      SMPindexG= 0
      nn= ICELTOT / PEsmpTOT
      nr= ICELTOT - nn*PEsmpTOT
      do ip= 1, PEsmpTOT
        SMPindexG(ip)= nn
        if (ip.le.nr) SMPindexG(ip)= nn + 1
      enddo

      do ip= 1, PEsmpTOT
        SMPindexG(ip)= SMPindexG(ip-1) + SMPindexG(ip)
      enddo
!C===

SMPindex is used for preconditioning:

      do ic= 1, NCOLORtot
!$omp parallel do
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            ( )
          enddo
        enddo
!$omp end parallel do
      enddo
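The construction of SMPindex can be checked on a small case. The sketch below repeats the two steps above (per-thread counts, then a running sum) for a hypothetical case of 2 colors and 4 threads; the COLORindex values are made up for illustration.

```fortran
program smpindex_demo
  implicit none
  integer, parameter :: NCOLORtot = 2, PEsmpTOT = 4
  integer :: COLORindex(0:NCOLORtot)
  integer :: SMPindex(0:NCOLORtot*PEsmpTOT)
  integer :: ic, ip, j1, nn1, num, nr

  ! Hypothetical coloring (assumption): color 1 has 10 meshes, color 2 has 7
  COLORindex = (/ 0, 10, 17 /)

  ! Step 1: number of meshes given to each thread within each color
  SMPindex = 0
  do ic = 1, NCOLORtot
    nn1 = COLORindex(ic) - COLORindex(ic-1)
    num = nn1 / PEsmpTOT
    nr  = nn1 - PEsmpTOT*num
    do ip = 1, PEsmpTOT
      j1 = (ic-1)*PEsmpTOT + ip
      if (ip <= nr) then
        SMPindex(j1) = num + 1
      else
        SMPindex(j1) = num
      endif
    enddo
  enddo

  ! Step 2: running sum turns the counts into end positions
  do j1 = 1, NCOLORtot*PEsmpTOT
    SMPindex(j1) = SMPindex(j1-1) + SMPindex(j1)
  enddo

  ! The last entry must coincide with the end of the last color
  if (SMPindex(NCOLORtot*PEsmpTOT) /= COLORindex(NCOLORtot)) error stop 'bad SMPindex'

  do ic = 1, NCOLORtot
    do ip = 1, PEsmpTOT
      j1 = (ic-1)*PEsmpTOT + ip
      print '(a,i2,a,i2,a,i3,a,i3)', 'color', ic, ' ip', ip, &
            ': meshes', SMPindex(j1-1)+1, ' to', SMPindex(j1)
    enddo
  enddo
end program smpindex_demo
```

With these inputs the 10 meshes of color 1 are split 3/3/2/2 and the 7 meshes of color 2 are split 2/2/2/1, so the difference in load between threads within a color is at most one mesh.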

SMPindex: for preconditioning

      do ic= 1, NCOLORtot
!$omp parallel do
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            ( )
          enddo
        enddo
!$omp end parallel do
      enddo

[Figure: an initial vector is colored (5 colors) and reordered in ascending order of color ID (color=1 to color=5); each color block is then divided among 8 threads]

- Meshes of the same color are independent, so they can be processed in parallel.
- Meshes are reordered in ascending order according to color ID.

poi_gen (4/9), continued

SMPindexG, built in the second half of the poi_gen (4/9) listing, is used for loops without data dependency: dot products, DAXPY, Mat-Vec, and poi_gen itself.

SMPindexG: for dot products, DAXPY, Mat-Vec, and poi_gen

!$omp parallel do
      do ip= 1, PEsmpTOT
        do i= SMPindexG(ip-1)+1, SMPindexG(ip)
          ( )
        enddo
      enddo
!$omp end parallel do

[Figure: the whole vector divided into 8 contiguous blocks, ip=1 to ip=8]

poi_gen (5/9)

!C
!C-- 1D array
      nn = ICELTOT
      allocate (indexL(0:nn), indexU(0:nn))
      indexL= 0
      indexU= 0
!C===
      do icel= 1, ICELTOT
        indexL(icel)= INL(icel)
        indexU(icel)= INU(icel)
      enddo

      do icel= 1, ICELTOT
        indexL(icel)= indexL(icel) + indexL(icel-1)
        indexU(icel)= indexU(icel) + indexU(icel-1)
      enddo

      NPL= indexL(ICELTOT)
      NPU= indexU(ICELTOT)

      allocate (itemL(NPL), AL(NPL))
      allocate (itemU(NPU), AU(NPU))

      itemL= 0
      itemU= 0
      AL= 0.d0
      AU= 0.d0

The new numbering is applied after this point. The CRS arrays (indexL, indexU, NPL, NPU, itemL, itemU, AL, AU) are those listed in "Variables/Arrays for Matrix (1/2)".
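The step from the per-mesh counts INL to the CRS index array can be sketched as follows. The connectivity data are hypothetical, and, as a simplification of the lecture's code (which fills itemL while the coefficients are generated), this sketch copies the column IDs directly from IAL.

```fortran
program crs_index_demo
  implicit none
  integer, parameter :: N = 4, NL = 2
  integer :: INL(N), IAL(NL,N)
  integer :: indexL(0:N)
  integer, allocatable :: itemL(:)
  integer :: i, j, NPL

  ! Hypothetical lower-triangular connectivity (assumption)
  INL = (/ 0, 1, 1, 2 /)
  IAL = 0
  IAL(1,2) = 1
  IAL(1,3) = 2
  IAL(1,4) = 1
  IAL(2,4) = 3

  ! Running sum of the counts gives the CRS row index
  indexL = 0
  do i = 1, N
    indexL(i) = indexL(i-1) + INL(i)
  enddo
  NPL = indexL(N)

  ! Simplification: copy the column IDs straight from IAL
  allocate (itemL(NPL))
  do i = 1, N
    do j = 1, INL(i)
      itemL(indexL(i-1)+j) = IAL(j,i)
    enddo
  enddo

  ! For this data: indexL = 0 0 1 2 4 and itemL = 1 2 1 3
  print '(a,5i3)', 'indexL =', indexL
  print '(a,4i3)', 'itemL  =', itemL
end program crs_index_demo
```

The position j+indexL(i-1) used here is the same offset that appears in the coefficient loops of poi_gen (6/9) to (9/9).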

poi_gen (6/9)

The new numbering is applied here (icel: new ID, ic0: old ID).

!C
!C-- INTERIOR & NEUMANN BOUNDARY CELLs
!C===
!$omp parallel do private (ip,icel,ic0,icn1,icn2,icn3,icn4,icn5,icn6) &
!$omp&            private (VOL0,coef,j,ii,jj,kk)
      do ip  = 1, PEsmpTOT
      do icel= SMPindexG(ip-1)+1, SMPindexG(ip)
        ic0 = NEWtoOLD(icel)
        icn1= NEIBcell(ic0,1)
        icn2= NEIBcell(ic0,2)
        icn3= NEIBcell(ic0,3)
        icn4= NEIBcell(ic0,4)
        icn5= NEIBcell(ic0,5)
        icn6= NEIBcell(ic0,6)
        VOL0= VOLCEL (ic0)

        if (icn5.ne.0) then
          icn5= OLDtoNEW(icn5)
          coef= RDZ * ZAREA
          D(icel)= D(icel) - coef
          if (icn5.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn5) then
                itemL(j+indexL(icel-1))= icn5
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn5) then
                itemU(j+indexU(icel-1))= icn5
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

Discretized equation for mesh icel:

  $\frac{\phi_{neib(icel,1)} - \phi_{icel}}{\Delta x}\,\Delta y \Delta z + \frac{\phi_{neib(icel,2)} - \phi_{icel}}{\Delta x}\,\Delta y \Delta z + \frac{\phi_{neib(icel,3)} - \phi_{icel}}{\Delta y}\,\Delta z \Delta x + \frac{\phi_{neib(icel,4)} - \phi_{icel}}{\Delta y}\,\Delta z \Delta x + \frac{\phi_{neib(icel,5)} - \phi_{icel}}{\Delta z}\,\Delta x \Delta y + \frac{\phi_{neib(icel,6)} - \phi_{icel}}{\Delta z}\,\Delta x \Delta y + f_{icel}\,\Delta x \Delta y \Delta z = 0$

Coefficient Matrix: Parallel, SMPindexG, private

The coefficient matrix is generated in parallel using SMPindexG. The loop variables and temporaries (ip, icel, ic0, icn1 to icn6, VOL0, coef, j, ii, jj, kk) are declared private:

!$omp parallel do private (ip,icel,ic0,icn1,icn2,icn3,icn4,icn5,icn6) &
!$omp&            private (VOL0,coef,j,ii,jj,kk)
      do ip  = 1, PEsmpTOT
      do icel= SMPindexG(ip-1)+1, SMPindexG(ip)
        ic0 = NEWtoOLD(icel)
        ...

poi_gen (6/9), continued

In the loop of poi_gen (6/9), the coefficient for neighbor 5 is

  coef= RDZ * ZAREA, i.e. $\frac{1}{\Delta z}\,\Delta x \Delta y$

and its destination depends on the new ID of the neighbor:
- icn5 < icel: lower part, stored in itemL and AL
- icn5 > icel: upper part, stored in itemU and AU

poi_gen (7/9)

        if (icn3.ne.0) then
          icn3= OLDtoNEW(icn3)
          coef= RDY * YAREA
          D(icel)= D(icel) - coef
          if (icn3.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn3) then
                itemL(j+indexL(icel-1))= icn3
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn3) then
                itemU(j+indexU(icel-1))= icn3
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

        if (icn1.ne.0) then
          icn1= OLDtoNEW(icn1)
          coef= RDX * XAREA
          D(icel)= D(icel) - coef
          if (icn1.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn1) then
                itemL(j+indexL(icel-1))= icn1
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn1) then
                itemU(j+indexU(icel-1))= icn1
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

poi_gen (8/9)

        if (icn2.ne.0) then
          icn2= OLDtoNEW(icn2)
          coef= RDX * XAREA
          D(icel)= D(icel) - coef
          if (icn2.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn2) then
                itemL(j+indexL(icel-1))= icn2
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn2) then
                itemU(j+indexU(icel-1))= icn2
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

        if (icn4.ne.0) then
          icn4= OLDtoNEW(icn4)
          coef= RDY * YAREA
          D(icel)= D(icel) - coef
          if (icn4.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn4) then
                itemL(j+indexL(icel-1))= icn4
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn4) then
                itemU(j+indexU(icel-1))= icn4
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

poi_gen (9/9)

        if (icn6.ne.0) then
          icn6= OLDtoNEW(icn6)
          coef= RDZ * ZAREA
          D(icel)= D(icel) - coef
          if (icn6.lt.icel) then
            do j= 1, INL(icel)
              if (IAL(j,icel).eq.icn6) then
                itemL(j+indexL(icel-1))= icn6
                   AL(j+indexL(icel-1))= coef
                exit
              endif
            enddo
          else
            do j= 1, INU(icel)
              if (IAU(j,icel).eq.icn6) then
                itemU(j+indexU(icel-1))= icn6
                   AU(j+indexU(icel-1))= coef
                exit
              endif
            enddo
          endif
        endif

        ii= XYZ(ic0,1)
        jj= XYZ(ic0,2)
        kk= XYZ(ic0,3)

        BFORCE(icel)= -dfloat(ii+jj+kk) * VOL0
      enddo
      enddo
!$omp end parallel do
!C===

- BFORCE is computed using the original mesh ID (ic0).
- ii, jj, kk and VOL0 are private, as declared in the directive that opens the loop:

!$omp parallel do private (ip,icel,ic0,icn1,icn2,icn3,icn4,icn5,icn6) &
!$omp&            private (VOL0,coef,j,ii,jj,kk)

solve_ICCG_mc (1/6)

!C***
!C*** module solver_ICCG_mc
!C***
!
      module solver_ICCG_mc
      contains
!C
!C*** solve_ICCG
!C
      subroutine solve_ICCG_mc                                      &
     &    ( N, NPL, NPU, indexL, itemL, indexU, itemU, D, B, X,     &
     &      AL, AU, NCOLORtot, PEsmpTOT, SMPindex, SMPindexG,       &
     &      EPS, ITR, IER)

      implicit REAL*8 (A-H,O-Z)

      integer :: N, NL, NU, NCOLORtot, PEsmpTOT

      real(kind=8), dimension(N)   :: D
      real(kind=8), dimension(N)   :: B
      real(kind=8), dimension(N)   :: X
      real(kind=8), dimension(NPL) :: AL
      real(kind=8), dimension(NPU) :: AU

      integer, dimension(0:N) :: indexL, indexU
      integer, dimension(NPL) :: itemL
      integer, dimension(NPU) :: itemU

      integer, dimension(0:NCOLORtot*PEsmpTOT) :: SMPindex
      integer, dimension(0:PEsmpTOT)           :: SMPindexG

      real(kind=8), dimension(:,:), allocatable :: W

      integer, parameter ::  R= 1
      integer, parameter ::  Z= 2
      integer, parameter ::  Q= 2
      integer, parameter ::  P= 3
      integer, parameter :: DD= 4

solve_ICCG_mc (2/6)

!C
!C-- INIT
!C===
      allocate (W(N,4))

!$omp parallel do private(ip,i)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          X(i)  = 0.d0
          W(i,2)= 0.0D0
          W(i,3)= 0.0D0
          W(i,4)= 0.0D0
        enddo
      enddo
!$omp end parallel do

!C-- Incomplete Modified Cholesky Factorization
      do ic= 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,VAL,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            VAL= D(i)
            do k= indexL(i-1)+1, indexL(i)
              VAL= VAL - (AL(k)**2) * W(itemL(k),DD)
            enddo
            W(i,DD)= 1.d0/VAL
          enddo
        enddo
!$omp end parallel do
      enddo

Incomplete Modified Cholesky Factorization

  $d_i = \left( a_{ii} - \sum_{k=1}^{i-1} a_{ik}^2 \, d_k \right)^{-1} = l_{ii}^{-1}$

  W(i,DD):   $d_i$
  D(i):      $a_{ii}$
  IAL(j,i):  $k$
  AL(j,i):   $a_{ik}$

      do i= 1, N
        VAL= D(i)
        do k= indexL(i-1)+1, indexL(i)
          VAL= VAL - (AL(k)**2) * W(itemL(k),DD)
        enddo
        W(i,DD)= 1.d0/VAL
      enddo

Incomplete Modified Cholesky Factorization: Parallel Version

The same recurrence for $d_i$ is applied color by color. Within one color the meshes are independent, so the loop over ip can run in parallel:

      do ic= 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,VAL,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            VAL= D(i)
            do k= indexL(i-1)+1, indexL(i)
              VAL= VAL - (AL(k)**2) * W(itemL(k),DD)
            enddo
            W(i,DD)= 1.d0/VAL
          enddo
        enddo
!$omp end parallel do
      enddo
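A tiny worked case may help to see why the color loop is safe. The sketch below assumes a hypothetical 1D chain of 6 meshes with red-black ordering (new IDs 1 to 3 are one color, 4 to 6 the other), diagonal 2 and off-diagonals -1; none of these numbers come from the lecture. Meshes of the second color only read the d values of the first color, which are complete before the second color starts.

```fortran
program ic_color_demo
  implicit none
  integer, parameter :: N = 6, NPL = 5, NCOLORtot = 2, PEsmpTOT = 2
  real(kind=8) :: D(N), AL(NPL), DD(N), VAL
  integer :: indexL(0:N), itemL(NPL)
  integer :: COLORindex(0:NCOLORtot), SMPindex(0:NCOLORtot*PEsmpTOT)
  integer :: ic, ip, ip1, i, k, nn1, num, nr

  ! Hypothetical red-black ordered chain (assumption):
  ! new meshes 1-3 have no lower neighbors, 4-6 depend on 1-3 only
  D          = 2.0d0
  AL         = -1.0d0
  indexL     = (/ 0, 0, 0, 0, 2, 4, 5 /)
  itemL      = (/ 1, 2, 2, 3, 3 /)
  COLORindex = (/ 0, 3, 6 /)
  DD         = 0.0d0

  ! End position of each thread's block within each color
  SMPindex = 0
  do ic = 1, NCOLORtot
    nn1 = COLORindex(ic) - COLORindex(ic-1)
    num = nn1 / PEsmpTOT
    nr  = nn1 - PEsmpTOT*num
    do ip = 1, PEsmpTOT
      ip1 = (ic-1)*PEsmpTOT + ip
      SMPindex(ip1) = SMPindex(ip1-1) + num
      if (ip <= nr) SMPindex(ip1) = SMPindex(ip1) + 1
    enddo
  enddo

  ! Colors are processed in order; threads share the work inside a color
  do ic = 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,VAL,k)
    do ip = 1, PEsmpTOT
      ip1 = (ic-1)*PEsmpTOT + ip
      do i = SMPindex(ip1-1)+1, SMPindex(ip1)
        VAL = D(i)
        do k = indexL(i-1)+1, indexL(i)
          VAL = VAL - (AL(k)**2) * DD(itemL(k))
        enddo
        DD(i) = 1.0d0/VAL
      enddo
    enddo
!$omp end parallel do
  enddo

  ! For this data: 0.5 0.5 0.5 1.0 1.0 0.6667
  print '(6f8.4)', DD
end program ic_color_demo
```

The program also compiles without OpenMP support, in which case the directives are treated as comments and the result is the same.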

solve_ICCG_mc (3/6)

!C
!C-- {r0}= {b} - [A]{xini}
!C===
!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          VAL= D(i)*X(i)
          do k= indexL(i-1)+1, indexL(i)
            VAL= VAL + AL(k)*X(itemL(k))
          enddo
          do k= indexU(i-1)+1, indexU(i)
            VAL= VAL + AU(k)*X(itemU(k))
          enddo
          W(i,R)= B(i) - VAL
        enddo
      enddo
!$omp end parallel do

      BNRM2= 0.0D0
!$omp parallel do private(ip,i) reduction(+:BNRM2)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          BNRM2 = BNRM2 + B(i)**2
        enddo
      enddo
!$omp end parallel do
!C===

Preconditioned CG algorithm:

  Compute $r^{(0)} = b - [A]x^{(0)}$
  for i = 1, 2, ...
    solve $[M]z^{(i-1)} = r^{(i-1)}$
    $\rho_{i-1} = r^{(i-1)} \cdot z^{(i-1)}$
    if i = 1
      $p^{(1)} = z^{(0)}$
    else
      $\beta_{i-1} = \rho_{i-1} / \rho_{i-2}$
      $p^{(i)} = z^{(i-1)} + \beta_{i-1}\, p^{(i-1)}$
    endif
    $q^{(i)} = [A]p^{(i)}$
    $\alpha_i = \rho_{i-1} / (p^{(i)} \cdot q^{(i)})$
    $x^{(i)} = x^{(i-1)} + \alpha_i\, p^{(i)}$
    $r^{(i)} = r^{(i-1)} - \alpha_i\, q^{(i)}$
    check convergence $|r|$
  end

Mat-Vec

There is no data dependency, so SMPindexG is used. This is the first loop of solve_ICCG_mc (3/6), which computes W(i,R)= B(i) - VAL for each mesh i from the diagonal (D), lower (AL) and upper (AU) components.

Dot Products: SMPindexG, reduction

      BNRM2= 0.0D0
!$omp parallel do private(ip,i) reduction(+:BNRM2)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          BNRM2 = BNRM2 + B(i)**2
        enddo
      enddo
!$omp end parallel do
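The pattern above can be run as a stand-alone program. This is a sketch under assumed data: a vector B = (1, 2, ..., 10) and 4 blocks, neither of which comes from the lecture. SMPindexG is built in the same way as in poi_gen (4/9).

```fortran
program dot_demo
  implicit none
  integer, parameter :: N = 10, PEsmpTOT = 4
  integer :: SMPindexG(0:PEsmpTOT)
  real(kind=8) :: B(N), BNRM2
  integer :: ip, i, nn, nr

  ! Block boundaries: N entries split as evenly as possible
  SMPindexG = 0
  nn = N / PEsmpTOT
  nr = N - nn*PEsmpTOT
  do ip = 1, PEsmpTOT
    SMPindexG(ip) = nn
    if (ip <= nr) SMPindexG(ip) = nn + 1
  enddo
  do ip = 1, PEsmpTOT
    SMPindexG(ip) = SMPindexG(ip-1) + SMPindexG(ip)
  enddo

  ! Hypothetical test vector (assumption): B(i) = i
  do i = 1, N
    B(i) = dble(i)
  enddo

  ! Each thread accumulates a private partial sum; OpenMP adds them up
  BNRM2 = 0.0d0
!$omp parallel do private(ip,i) reduction(+:BNRM2)
  do ip = 1, PEsmpTOT
    do i = SMPindexG(ip-1)+1, SMPindexG(ip)
      BNRM2 = BNRM2 + B(i)**2
    enddo
  enddo
!$omp end parallel do

  ! 1**2 + 2**2 + ... + 10**2 = 385
  if (abs(BNRM2 - 385.0d0) > 1.0d-12) error stop 'unexpected sum'
  print '(f8.1)', BNRM2
end program dot_demo
```

Without the reduction clause all threads would update the shared BNRM2 at the same time, which is a race condition. In general a reduction may sum in a different order from the serial loop, so results can differ in the last digits; here every term is a small integer, so the sum is exact.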

solve_ICCG_mc (4/6)

PCG step: solve $[M]z^{(i-1)} = r^{(i-1)}$. Both substitution loops use SMPindex, and the off-diagonal components are accessed through the CRS arrays (indexL/itemL/AL and indexU/itemU/AU).

      ITR= N
      do L= 1, ITR
!C
!C-- {z}= [Minv]{r}
!C===
!$omp parallel do private(ip,i)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          W(i,Z)= W(i,R)
        enddo
      enddo
!$omp end parallel do

      do ic= 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,WVAL,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
!$omp end parallel do
      enddo

      do ic= NCOLORtot, 1, -1
!$omp parallel do private(ip,ip1,i,SW,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            SW = 0.0d0
            do k= indexU(i-1)+1, indexU(i)
              SW= SW + AU(k) * W(itemU(k),Z)
            enddo
            W(i,Z)= W(i,Z) - W(i,DD) * SW
          enddo
        enddo
!$omp end parallel do
      enddo
!C===

solve_ICCG_mc (4/6), continued

  $[M]\{z\} = [L][D][L]^T \{z\} = \{r\}$

  Forward substitution:   $[L]\{z'\} = \{r\}$
  Backward substitution:  $[D][L]^T \{z\} = \{z'\}$

In the code, the loop over colors from 1 to NCOLORtot is the forward substitution and the loop from NCOLORtot down to 1 is the backward substitution. Both overwrite W(i,Z) in place and use SMPindex.

Forward Substitution: SMPindex

      do ic= 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,WVAL,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
!$omp end parallel do
      enddo

solve_ICCG_mc (5/6)

PCG steps: $p^{(i)} = z^{(i-1)} + \beta_{i-1}\, p^{(i-1)}$ and $q^{(i)} = [A]p^{(i)}$.

!C
!C-- {p} = {z} if      ITER=1
!C   BETA= RHO / RHO1  otherwise
!C===
      if ( L.eq.1 ) then
!$omp parallel do private(ip,i)
        do ip= 1, PEsmpTOT
          do i = SMPindexG(ip-1)+1, SMPindexG(ip)
            W(i,P)= W(i,Z)
          enddo
        enddo
!$omp end parallel do
      else
        BETA= RHO / RHO1
!$omp parallel do private(ip,i)
        do ip= 1, PEsmpTOT
          do i = SMPindexG(ip-1)+1, SMPindexG(ip)
            W(i,P)= W(i,Z) + BETA*W(i,P)
          enddo
        enddo
!$omp end parallel do
      endif
!C===

!C
!C-- {q}= [A]{p}
!C===
!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          VAL= D(i)*W(i,P)
          do k= indexL(i-1)+1, indexL(i)
            VAL= VAL + AL(k)*W(itemL(k),P)
          enddo
          do k= indexU(i-1)+1, indexU(i)
            VAL= VAL + AU(k)*W(itemU(k),P)
          enddo
          W(i,Q)= VAL
        enddo
      enddo
!$omp end parallel do
!C===

62 OMP-3 61 solve_ICCG_mc (6/6)

! ALPHA= RHO / {p}{q}
      C1= 0.d0
!$omp parallel do private(ip,i) reduction(+:C1)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          C1= C1 + W(i,P)*W(i,Q)
        enddo
      enddo
!$omp end parallel do
      ALPHA= RHO / C1

! {x}= {x} + ALPHA*{p}
! {r}= {r} - ALPHA*{q}
!$omp parallel do private(ip,i)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          X(i) = X(i) + ALPHA * W(i,P)
          W(i,R)= W(i,R) - ALPHA * W(i,Q)
        enddo
      enddo
!$omp end parallel do

      DNRM2= 0.d0
!$omp parallel do private(ip,i) reduction(+:DNRM2)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          DNRM2= DNRM2 + W(i,R)**2
        enddo
      enddo
!$omp end parallel do

Algorithm (this slide covers alpha, the updates of x and r, and the convergence check):
  Compute r(0) = b - [A]x(0)
  for i = 1, 2, ...
    solve [M]z(i-1) = r(i-1)
    rho_{i-1} = r(i-1) . z(i-1)
    if i = 1
      p(1) = z(0)
    else
      beta_{i-1} = rho_{i-1} / rho_{i-2}
      p(i) = z(i-1) + beta_{i-1} p(i-1)
    endif
    q(i) = [A]p(i)
    alpha_i = rho_{i-1} / (p(i) . q(i))
    x(i) = x(i-1) + alpha_i p(i)
    r(i) = r(i-1) - alpha_i q(i)
    check convergence |r|
  end
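The dot product {p}{q} above is a sum over thread blocks delimited by SMPindexG, combined with a reduction clause. A minimal C sketch of that pattern follows. The helper names build_smp_index_g and dot_blocked are hypothetical, indices are 0-based, and the way the blocks are sized (remainder entries given to the first threads) is an assumption, since the construction of SMPindexG is not shown on these slides.

```c
#include <assert.h>

/* Split n entries into nthreads contiguous blocks.
   idx must hold nthreads+1 entries; block ip is [idx[ip], idx[ip+1]). */
void build_smp_index_g(int n, int nthreads, int *idx)
{
    assert(n >= 0 && nthreads > 0);
    int num = n / nthreads;      /* base block length */
    int nr  = n % nthreads;      /* first nr blocks get one extra entry */
    int ip;
    idx[0] = 0;
    for (ip = 0; ip < nthreads; ip++)
        idx[ip + 1] = idx[ip] + num + (ip < nr ? 1 : 0);
}

/* Blocked dot product: each outer iteration ip handles one thread block,
   and the partial sums are combined by the reduction clause. */
double dot_blocked(const double *p, const double *q,
                   const int *idx, int nthreads)
{
    double c1 = 0.0;
    int ip, i;
#pragma omp parallel for private(i) reduction(+:c1)
    for (ip = 0; ip < nthreads; ip++)
        for (i = idx[ip]; i < idx[ip + 1]; i++)
            c1 += p[i] * q[i];
    return c1;
}
```

With 10 entries and 4 threads the blocks have lengths 3, 3, 2, 2. Without the reduction clause, all threads would update c1 concurrently and the result would be unreliable.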

65 OMP-3 64 Applying OpenMP to -sol Examples Optimization + Exercise

66 OMP-3 65 Results: Hitachi SR11000/J2, 1 node, 16 cores
[Figure: node diagram showing cores, L3 caches, and memory]

67 OMP-3 66 SR11000, 1 node/16 cores (MC, RCM, CM-RCM)
[Figure, four panels plotted against COLOR#: iterations; number of incompatible points (IP#); solver time (sec.); time per iteration (sec./iteration). One annotation on the plots reads "strange behavior".]

68 OMP-3 67 FX10, 1 node/16 cores (MC, RCM, CM-RCM)
[Figure, four panels plotted against COLOR# for MC, RCM, and CM-RCM: iterations; number of incompatible nodes (IP#); solver time (sec.); time per iteration (sec./iteration)]

69 OMP-3 68 Applying OpenMP to -sol Examples Optimization + Exercise

70 OMP-E 69
Running the Code
Further Optimization
Profiler, Analyzing Compile Lists

71 OMP-1 70 Compile & Run
>$ cd <$O-L3>/src
>$ make
>$ ls ../run/L3-sol
L3-sol
>$ cd ../run
>$ pjsub go1.sh

72 OMP-3 71 Running L3-sol
Input: INPUT.DAT (control file)
Program: L3-sol (Poisson solver, FVM)
Output: test.inp (ParaView file)

73 OMP-3 72 Control Data: INPUT.DAT
  ...        NX/NY/NZ
  ...        DX/DY/DZ
  1.0e-08    EPSICCG
  16         PEsmpTOT
  100        NCOLORtot

NX,NY,NZ: number of meshes in the X/Y/Z directions
DX,DY,DZ: size of meshes
EPSICCG: convergence criterion for ICCG
PEsmpTOT: number of threads
NCOLORtot: reordering method + initial number of colors/levels
  ≥2: MC, =0: CM, =-1: RCM, ≤-2: CM-RCM
[Figure: mesh with NX, NY, NZ cells along the X, Y, Z axes]

74 OMP-1 73 go1.sh
#!/bin/sh
#PJM -L "node=1"
#PJM -L "elapse=00:10:00"
#PJM -L "rscgrp=lecture"
#PJM -g "gt71"
#PJM -j
#PJM -o test.lst

# OMP_NUM_THREADS must equal PEsmpTOT in INPUT.DAT
export OMP_NUM_THREADS=16
./L3-sol

75 OMP-3 74 Results on FX10, 10^6 meshes
Iterations: MC(2): 333, RCM (298 levels): 224, CM-RCM(Nc=20): 249
[Figure: time (sec.) and speed-up vs. thread# for MC=2, RCM(298), CM-RCM(20)]
16 threads: MC(2): 2.42 sec., CM-RCM(20): 2.01 sec.

76 75 Exercise
Try various configurations:
Problem size
Number of threads
Number of colors, reordering method (MC, RCM, CM-RCM)

77 OMP-E 76
Running the Code
Further Optimization: OpenMP Statement, Sequential Reordering, ELL
Profiler, Analyzing Compile Lists

78 OMP-E 77 Forward Subst.: Current Impl. (F)

      do ic= 1, NCOLORtot
!$omp parallel do private(ip,ip1,i,WVAL,k)
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
!$omp end parallel do
      enddo

At "!$omp parallel do", threads (up to 16) are created and destroyed.
This happens once for every color, which causes some overhead.
The overhead grows as the number of colors increases.

79 OMP-E 78 Forward Subst.: Current Impl. (C)

    for(ic=0; ic<NCOLORtot; ic++) {
    #pragma omp parallel for private (ip, ip1, i, WVAL, j)
      for(ip=0; ip<PEsmpTOT; ip++) {
        ip1 = ic * PEsmpTOT + ip;
        for(i=SMPindex[ip1]; i<SMPindex[ip1+1]; i++){
          WVAL = W[Z][i];
          for(j=indexL[i]; j<indexL[i+1]; j++){
            WVAL -= AL[j] * W[Z][itemL[j]-1];
          }
          W[Z][i] = WVAL * W[DD][i];
        }
      }
    }

At "#pragma omp parallel for", threads (up to 16) are created and destroyed.
This happens once for every color, which causes some overhead.
The overhead grows as the number of colors increases.

80 OMP-E 79 For. Subst.: Reduced Overhead (F)

!$omp parallel private(ip,ip1,i,WVAL,k)
      do ic= 1, NCOLORtot
!$omp do
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
      enddo
!$omp end parallel

Threads are created just once, before the forward substitution starts.
Only the loops marked with "!$omp do" are shared among the threads.

81 OMP-E 80 For. Subst.: Reduced Overhead (C)

    /* ic must be private in C: the color loop runs on every thread */
    #pragma omp parallel private (ic, ip, ip1, i, WVAL, j)
    for(ic=0; ic<NCOLORtot; ic++) {
    #pragma omp for
      for(ip=0; ip<PEsmpTOT; ip++) {
        ip1 = ic * PEsmpTOT + ip;
        for(i=SMPindex[ip1]; i<SMPindex[ip1+1]; i++){
          WVAL = W[Z][i];
          for(j=indexL[i]; j<indexL[i+1]; j++){
            WVAL -= AL[j] * W[Z][itemL[j]-1];
          }
          W[Z][i] = WVAL * W[DD][i];
        }
      }
    }

Threads are created just once, before the forward substitution starts.
Only the loops marked with "#pragma omp for" are shared among the threads.
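The two loop structures can be compared side by side in a self-contained form. The sketch below is an illustration, not the lecture code: the function names are hypothetical, itemL is assumed to be 0-based (the slide code uses itemL[j]-1), and z and DD are plain arrays instead of rows of W. Both variants should give identical results; they differ only in how often a parallel region is opened.

```c
#include <assert.h>

/* Variant A: one parallel region per color (fork/join ncolor times). */
void fsub_region_per_color(int ncolor, int nthreads, const int *SMPindex,
                           const int *indexL, const int *itemL,
                           const double *AL, const double *DD, double *z)
{
    int ic, ip, i, j;
    assert(ncolor > 0 && nthreads > 0);
    for (ic = 0; ic < ncolor; ic++) {
#pragma omp parallel for private(i, j)
        for (ip = 0; ip < nthreads; ip++) {
            int ip1 = ic * nthreads + ip;
            for (i = SMPindex[ip1]; i < SMPindex[ip1 + 1]; i++) {
                double wval = z[i];
                for (j = indexL[i]; j < indexL[i + 1]; j++)
                    wval -= AL[j] * z[itemL[j]];
                z[i] = wval * DD[i];
            }
        }
    }
}

/* Variant B: a single parallel region; each color only has a
   work-sharing loop, whose implicit barrier separates the colors. */
void fsub_single_region(int ncolor, int nthreads, const int *SMPindex,
                        const int *indexL, const int *itemL,
                        const double *AL, const double *DD, double *z)
{
    int ic, ip, i, j;
    assert(ncolor > 0 && nthreads > 0);
#pragma omp parallel private(ic, ip, i, j)
    for (ic = 0; ic < ncolor; ic++) {
#pragma omp for
        for (ip = 0; ip < nthreads; ip++) {
            int ip1 = ic * nthreads + ip;
            for (i = SMPindex[ip1]; i < SMPindex[ip1 + 1]; i++) {
                double wval = z[i];
                for (j = indexL[i]; j < indexL[i + 1]; j++)
                    wval -= AL[j] * z[itemL[j]];
                z[i] = wval * DD[i];
            }
        }
    }
}
```

The implicit barrier at the end of each "omp for" in Variant B matters: a color may only start after every mesh of the previous color has been updated.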

82 OMP-E 81 Programs
% cd <$O-L3>
% ls
run reorder0 src src0
% cd src0
% make
% cd ../run
% ls L3-sol0
L3-sol0
% <modify INPUT.DAT>
% <modify go0.sh>
% pjsub go0.sh

83 OMP-E 82 Results: L3-sol0 is better (N=128^3)
Compared programs: L3-sol, L3-sol0
NCOLORtot= -20: CM-RCM(20), 318 iterations
NCOLORtot= -1: RCM (382 levels), 287 iterations
Time: 5.69 sec. ... sec. ... sec. ... sec.

84 OMP-E 83
Running the Code
Further Optimization: OpenMP Statement, Sequential Reordering, ELL
Profiler, Analyzing Compile Lists

85 OMP-3 84 Problems in Reordering
Coloring: MC, RCM, CM-RCM
Renumbering follows the color/level ID.
On each thread, the numbering is therefore not contiguous, which reduces performance.

86 OMP-3 85 SMPindex: for preconditioning

      do ic= 1, NCOLORtot
!$omp parallel do
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT+ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            (...)
          enddo
        enddo
!$omp end parallel do
      enddo

[Figure: initial vector; coloring (5 colors) + ordering, color=1 to color=5; 5 colors, 8 threads]
Meshes in the same color are independent, so they can be processed in parallel.
Meshes are reordered in ascending order of color ID.

87 OMP-3 86 Sequential Reordering
Reordering for contiguous memory access on each thread (core).
Performance is expected to be better: arrays such as the coefficient matrices are accessed at contiguous addresses, which improves locality (see two pages later).
Inconsistent numbering can appear: itemL(k) > icel for indexL(icel-1)+1 ≤ k ≤ indexL(icel)

88 OMP-3 87 Sequential Reordering
Further reordering for contiguous memory access on each thread (5 colors, 8 threads).
[Figure: initial vector; coloring (5 colors) + ordering, color=1 to color=5 (Coalesced); the same blocks regrouped thread by thread (Sequential)]
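One way to picture the further reordering is as a permutation from the color-major (coalesced) numbering to a thread-major (sequential) numbering. The C sketch below is an assumption about how such a map could be built, not the code in reorder0: the function name is hypothetical, indices are 0-based, and both offset arrays are taken as given (SMPindex grouped by color, SMPindex_new grouped by thread).

```c
#include <assert.h>

/* perm[old] = new mesh number under thread-major (sequential) ordering.
   SMPindex:     block offsets, color-major,  entry ic*nthreads + ip
   SMPindex_new: block offsets, thread-major, entry ip*ncolor   + ic
   Both arrays have ncolor*nthreads + 1 entries. */
void sequential_permutation(int ncolor, int nthreads, const int *SMPindex,
                            const int *SMPindex_new, int *perm)
{
    int ic, ip, k;
    assert(ncolor > 0 && nthreads > 0);
    for (ic = 0; ic < ncolor; ic++) {
        for (ip = 0; ip < nthreads; ip++) {
            int old0 = SMPindex[ic * nthreads + ip];            /* old block start */
            int len  = SMPindex[ic * nthreads + ip + 1] - old0; /* block length    */
            int new0 = SMPindex_new[ip * ncolor + ic];          /* new block start */
            for (k = 0; k < len; k++)
                perm[old0 + k] = new0 + k;
        }
    }
}
```

With 2 colors of 4 and 6 meshes on 2 threads, thread 0 ends up owning the new numbers 0 to 4 and thread 1 the numbers 5 to 9, so each thread walks through one contiguous range as the colors advance.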

89 OMP-3 88 Sequential Reordering: CM-RCM(2), 4 threads
Contiguous data access on each thread: better utilization of cache and prefetching.
[Figure: mesh numbering for CM-RCM(2) and for sequential reordering, 4 threads]

90 OMP-3 89 Sequential Reordering: CM-RCM(2), 4 threads, 1st color
[Figure: meshes of the 1st color assigned to threads #0, #1, #2, #3, for CM-RCM(2) and for sequential reordering]

91 OMP-3 90 Sequential Reordering: CM-RCM(2), 4 threads, 2nd color
[Figure: meshes of the 2nd color assigned to threads #0, #1, #2, #3, for CM-RCM(2) and for sequential reordering]

92 OMP-3 91 Sequential Reordering
Coalesced (good for GPU): coloring (5 colors) + ordering of the initial vector, color=1 to color=5. Memory access on each thread is non-contiguous, because meshes are numbered in color order.
Sequential: after coloring (5 colors) + ordering, meshes are numbered contiguously within each thread.

93 OMP-3 92 Files on FX10
Location: <$O-L3>/src, <$O-L3>/run
Compile/run
  Main part: cd <$O-L3>/reorder0, then make; builds <$O-L3>/run/L3-rsol0 (exec)
  Control data: <$O-L3>/run/INPUT.DAT
  Batch job script: <$O-L3>/run/gor.sh

94 OMP-3 93 INPUT.DAT
  ...        NX/NY/NZ
  ...        DX/DY/DZ
  1.0e-08    EPSICCG
  16         PEsmpTOT
  100        NCOLORtot
  0          NFLAG
  0          METHOD

PEsmpTOT: number of threads
NCOLORtot: reordering method + initial number of colors/levels
  ≥2: MC, =0: CM, =-1: RCM, ≤-2: CM-RCM
NFLAG: =0: without first touch, =1: with first touch
METHOD: loop structure for Mat-Vec
  =0: conventional way, =1: similar to forward/backward substitution
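The idea behind NFLAG=1 (first touch) can be sketched in C as follows. This is a generic illustration and not the code of the lecture programs: the function name alloc_first_touch is hypothetical, and it relies on the common operating-system policy that a memory page is placed near the core that writes to it first. The array is therefore initialized with the same thread-block loop that later computes on it.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and let each thread block write its own part first.
   idx holds nthreads+1 block offsets (0-based), with idx[nthreads] == n.
   The caller releases the array with free(). */
double *alloc_first_touch(int n, const int *idx, int nthreads)
{
    assert(n > 0 && nthreads > 0 && idx[nthreads] == n);
    double *w = malloc((size_t)n * sizeof *w);
    int ip, i;
    if (w == NULL) return NULL;

    /* schedule(static) keeps the block-to-thread mapping the same as in
       later loops that use the same schedule and thread count */
#pragma omp parallel for private(i) schedule(static)
    for (ip = 0; ip < nthreads; ip++)
        for (i = idx[ip]; i < idx[ip + 1]; i++)
            w[i] = 0.0;          /* first write by the owning thread */
    return w;
}
```

On a machine where all cores share one memory, this initialization should make little difference; it is aimed at NUMA nodes, where a serial initialization would tend to put all pages next to a single socket.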

95 OMP-3 94 Sequential Reordering

      allocate (SMPindex(0:PEsmpTOT*NCOLORtot))
      SMPindex= 0
      do ic= 1, NCOLORtot
        nn1= COLORindex(ic) - COLORindex(ic-1)
        num= nn1 / PEsmpTOT
        nr = nn1 - PEsmpTOT*num
        do ip= 1, PEsmpTOT
          if (ip.le.nr) then
            SMPindex((ic-1)*PEsmpTOT+ip)= num + 1
          else
            SMPindex((ic-1)*PEsmpTOT+ip)= num
          endif
        enddo
      enddo

      allocate (SMPindex_new(0:PEsmpTOT*NCOLORtot))
      SMPindex_new(0)= 0
      do ic= 1, NCOLORtot
        do ip= 1, PEsmpTOT
          j1= (ic-1)*PEsmpTOT + ip
          j0= j1 - 1
          SMPindex_new((ip-1)*NCOLORtot+ic)= SMPindex(j1)
          SMPindex(j1)= SMPindex(j0) + SMPindex(j1)
        enddo
      enddo

      do ip= 1, PEsmpTOT
        do ic= 1, NCOLORtot
          j1= (ip-1)*NCOLORtot + ic
          j0= j1 - 1
          SMPindex_new(j1)= SMPindex_new(j0) + SMPindex_new(j1)
        enddo
      enddo

[Figure: block layouts of SMPindex and SMPindex_new]
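A small C translation of the index construction above makes the two layouts easy to inspect. It is a sketch with 0-based arrays and a hypothetical function name; COLORindex is assumed to hold ncolor+1 cumulative mesh counts, as in the Fortran code.

```c
#include <assert.h>

/* Build block offsets for both layouts.
   SMPindex:     color-major,  block (ic, ip) ends at entry ic*nthreads + ip + 1
   SMPindex_new: thread-major, block (ic, ip) ends at entry ip*ncolor + ic + 1
   Both arrays need ncolor*nthreads + 1 entries. */
void build_smp_indices(int ncolor, int nthreads, const int *COLORindex,
                       int *SMPindex, int *SMPindex_new)
{
    int ic, ip, j;
    assert(ncolor > 0 && nthreads > 0);
    SMPindex[0] = 0;
    SMPindex_new[0] = 0;

    /* Step 1: store the length of every (color, thread) block */
    for (ic = 0; ic < ncolor; ic++) {
        int nn1 = COLORindex[ic + 1] - COLORindex[ic];  /* meshes in color */
        int num = nn1 / nthreads;
        int nr  = nn1 % nthreads;                       /* leftover meshes */
        for (ip = 0; ip < nthreads; ip++) {
            int cnt = num + (ip < nr ? 1 : 0);
            SMPindex[ic * nthreads + ip + 1]   = cnt;
            SMPindex_new[ip * ncolor + ic + 1] = cnt;
        }
    }

    /* Step 2: turn the lengths into running offsets */
    for (j = 1; j <= ncolor * nthreads; j++) {
        SMPindex[j]     += SMPindex[j - 1];
        SMPindex_new[j] += SMPindex_new[j - 1];
    }
}
```

For COLORindex = {0, 4, 10} on 2 threads this gives SMPindex = {0, 2, 4, 7, 10} and SMPindex_new = {0, 2, 5, 7, 10}: the block lengths are the same, only their order differs.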

96 OMP-3 95 Mat-Vec: METHOD=0

Original:
!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i = SMPindexG(ip-1)+1, SMPindexG(ip)
          VAL= D(i)*W(i,P)
          do k= indexL(i-1)+1, indexL(i)
            VAL= VAL + AL(k)*W(itemL(k),P)
          enddo
          do k= indexU(i-1)+1, indexU(i)
            VAL= VAL + AU(k)*W(itemU(k),P)
          enddo
          W(i,Q)= VAL
        enddo
      enddo
!$omp end parallel do

New:
!$omp parallel do private(ip,i,VAL,k)
      do ip= 1, PEsmpTOT
        do i= SMPindex((ip-1)*NCOLORtot)+1, SMPindex(ip*NCOLORtot)
          VAL= D(i)*W(i,P)
          do k= indexL(i-1)+1, indexL(i)
            VAL= VAL + AL(k)*W(itemL(k),P)
          enddo
          do k= indexU(i-1)+1, indexU(i)
            VAL= VAL + AU(k)*W(itemU(k),P)
          enddo
          W(i,Q)= VAL
        enddo
      enddo
!$omp end parallel do

97 OMP-3 96 Forward Substitution

Original:
!$omp parallel private(ip,ip1,i,WVAL,k)
      do ic= 1, NCOLORtot
!$omp do
        do ip= 1, PEsmpTOT
          ip1= (ic-1)*PEsmpTOT + ip
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
      enddo
!$omp end parallel

New (only the computation of ip1 changes):
!$omp parallel private(ip,ip1,i,WVAL,k)
      do ic= 1, NCOLORtot
!$omp do
        do ip= 1, PEsmpTOT
          ip1= (ip-1)*NCOLORtot + ic
          do i= SMPindex(ip1-1)+1, SMPindex(ip1)
            WVAL= W(i,Z)
            do k= indexL(i-1)+1, indexL(i)
              WVAL= WVAL - AL(k) * W(itemL(k),Z)
            enddo
            W(i,Z)= WVAL * W(i,DD)
          enddo
        enddo
      enddo
!$omp end parallel

98 OMP-3 97 Matrix Storage Format
ELL (Ellpack-Itpack): fixed loop length, good for prefetching
[Figure: (a) CRS, (b) ELL]
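The difference between the two formats can be made concrete with a short C sketch. It is an illustration under assumptions that are not in the original: the function names are hypothetical, the whole matrix is held in one CRS structure (the lecture code keeps D, AL and AU separately), indices are 0-based, and the ELL arrays are stored row by row. Rows shorter than the fixed width are padded with zero coefficients.

```c
#include <assert.h>

/* Convert CRS (index[n+1], item[], val[]) into ELL with a fixed row width.
   ell_item and ell_val must each hold n*width entries. */
void crs_to_ell(int n, int width, const int *index, const int *item,
                const double *val, int *ell_item, double *ell_val)
{
    int i, k;
    for (i = 0; i < n; i++) {
        int len = index[i + 1] - index[i];
        assert(len <= width);                /* width must cover the longest row */
        for (k = 0; k < width; k++) {
            if (k < len) {
                ell_item[i * width + k] = item[index[i] + k];
                ell_val[i * width + k]  = val[index[i] + k];
            } else {
                ell_item[i * width + k] = i;     /* padding: any valid column */
                ell_val[i * width + k]  = 0.0;   /* contributes nothing */
            }
        }
    }
}

/* y = A x in ELL format: the inner loop length is the same for every row. */
void ell_matvec(int n, int width, const int *ell_item, const double *ell_val,
                const double *x, double *y)
{
    int i, k;
#pragma omp parallel for private(k)
    for (i = 0; i < n; i++) {
        double s = 0.0;
        for (k = 0; k < width; k++)
            s += ell_val[i * width + k] * x[ell_item[i * width + k]];
        y[i] = s;
    }
}
```

The padding costs some extra memory and arithmetic, but the inner loop no longer depends on index[], which is what makes the access pattern easier to prefetch.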

99 OMP-3 98 Cases (... meshes)

          Directory        Coloring   Further reordering   First touch data placement   Matrix storage format
Case-1    src0             CM-RCM     Coalesced            NO                           CRS
Case-2    reorder0         CM-RCM     Sequential           YES                          CRS
Case-3    reorder0 + ELL   CM-RCM     Sequential           YES                          ELL

Coalesced (Fig. 4(a)): coloring (5 colors) + ordering of the initial vector; memory access on each thread is non-contiguous, because meshes are numbered in color order.
Sequential (Fig. 4(b)): meshes are numbered contiguously within each thread.

100 OMP-3 99 Color# vs. Iterations
[Figure: iterations vs. COLOR# for CM-RCM]

101 OMP Results: FX10
CASE-1 (src0) to CASE-2 (reorder0): slightly improved when the number of colors is larger.
Generally speaking, performance gets worse as the number of colors increases.
In CASE-2, the data on each thread is contiguous when the computation proceeds to the next color.
First touch: no effect.
ELL: big effect.
[Figure: time (sec.) vs. COLOR# for Case-1, Case-2, Case-3]
Case-1: src0, Case-2: reorder0, Case-3: reorder0 + ELL

102 OMP-3 Fujitsu FX10: CASE-1, CM-RCM(2)
Demand miss: 25.6%, memory throughput: 41.8 GB/sec. Forward/Backward Substitution. src0: CRS, Coalesced.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15. Legend: integer load memory access wait; floating-point load memory access wait; store wait; integer load cache access wait; floating-point load cache access wait; integer operation wait; floating-point operation wait; branch instruction wait; instruction fetch wait; barrier synchronization wait; uop commit; other waits; 1 instruction commit; integer register write constraint; 2/3 instruction commit; 4 instruction commit]

103 OMP Fujitsu FX10: CASE-2, CM-RCM(2)
Demand miss: 25.6%, memory throughput: 41.8 GB/sec. reorder0: CRS, Sequential.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15, same legend]

104 OMP Fujitsu FX10: CASE-1, CM-RCM(382)
Demand miss: 37.7%, memory throughput: 28.7 GB/sec. src0: CRS, Coalesced.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15, same legend]

105 OMP Fujitsu FX10: CASE-2, CM-RCM(382)
Demand miss: 29.3%, memory throughput: 32.6 GB/sec. reorder0: CRS, Sequential.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15, same legend]

106 OMP Summary: Fujitsu FX10, Analysis by Profiler
Upper: demand miss rate, lower: memory throughput

              CASE-1 (src0)      CASE-2 (reorder0)    CASE-3
              CRS + Coalesced    CRS + Sequential     ELL + Sequential
CM-RCM(2)     25.5 %             25.6 %               5.42 %
              41.8 GB/sec.       41.8 GB/sec.         64.0 GB/sec.
CM-RCM(382)   37.7 %             29.3 %               16.5 %
              28.7 GB/sec.       32.6 GB/sec.         52.2 GB/sec.

107 OMP Summary: Fujitsu FX10, Analysis by Profiler
Upper: CM-RCM(20), lower: CM-RCM(382)
[Table: Case-2 (CRS) and Case-3 (ELL); columns: Instructions, SIMD (%), Memory Access Throughput (%)]
Case-1: src0, Case-2: reorder0, Case-3: reorder0 + ELL

108 OMP Results: Cray XE6
CASE-1 (src0) to CASE-2 (reorder0): significant improvement, from the optimization for the NUMA architecture + first touch.
CRS to ELL: the improvement is not so large.
[Figure: time (sec.) vs. COLOR# for Case-1, Case-2, Case-3]
Case-1: src0, Case-2: reorder0, Case-3: reorder0 + ELL

109 OMP
[Figure: node architectures (L3 caches and memory) of T2K/Tokyo, Cray XE6 (Hopper), and Fujitsu FX10 (Oakleaf-FX)]

110 OMP Summary
[Table: time (sec.) and time per iteration (sec.) on Fujitsu FX10 and Cray XE6, for CM-RCM(20) and CM-RCM(382) (= RCM), Case-1 to Case-3]
Case-1: src0, Case-2: reorder0, Case-3: reorder0 + ELL

111 OMP Fujitsu FX10: CASE-3, CM-RCM(2)
Demand miss: 5.4%, memory throughput: 64.0 GB/sec. ELL, Sequential.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15, split into wait categories (memory access, cache access, store, operations, branch, instruction fetch, barrier synchronization) and instruction commit categories]

112 OMP Fujitsu FX10: CASE-3, CM-RCM(382)
Demand miss: 16.5%, memory throughput: 52.2 GB/sec. ELL, Sequential.
[Figure: profiler breakdown of execution time (sec.) for Threads 0 to 15, same categories]

113 OMP
Running the Code
Further Optimization
Profiler, Analyzing Compile Lists
  Users Portal > Document > Programming Development Support Tool > Profiler User's Guide > Chap. 3: Advanced Profiler

114 113 Compile & Run (Default)
>$ cd <$O-L3>/src
>$ make
>$ ls ../run/L3-sol
L3-sol
>$ cd ../run
>$ pjsub go1.sh

F90         = frtpx
F90OPTFLAGS = -Kfast,openmp -Qt
F90FLAGS    = $(F90OPTFLAGS)

-Qt: list of messages by the compiler (compile list), written to *.lst. Fortran only.
In C, -Qt is not available. Please use -Nsrc; the list is displayed on screen.

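115 114 The current version of the C/C++ compiler can also produce a list of messages
Fortran/C/C++
  -Nlst=p  standard optimization information (default)
  -Nlst=t  detailed optimization information
Fortran ONLY
  -Nlst=a  attribute information of names
  -Nlst=d  structure information of derived types
  -Nlst=i  program lists of included files and a list of include file names
  -Nlst=m  source program output in which the state of automatic parallelization is expressed with OpenMP directives
  -Nlst=x  cross-reference information of names and statement numbers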

116 Info in *.lst 115

117 SIMD Information 116

118 Automatic Parallelization 117
