ARTED Xeon Phi Xeon Phi 2. ARTED ARTED (Ab-initio Real-Time Electron Dynamics simulator) RTRS- DFT (Real-Time Real-Space Density Functional Theory, )


Xeon Phi 1,a) 1,3 2 2,3

a) hirokawa@hpcs.cs.tsukuba.ac.jp

Intel Xeon Phi PC RTRSDFT ( ) ARTED (Ab-initio Real-Time Electron Dynamics simulator) Xeon Phi OpenMP Intel E5-2670v2 (Ivy-Bridge 10 ) CPU Xeon Phi Symmetric CPU Symmetric CPU 90% CPU 32 Symmetric

Intel Xeon Phi (Xeon Phi) MIC (Many Integrated Cores) KNC (Knights Corner) Xeon Phi GPU (Graphics Processing Units) PCI-Express Xeon Phi Linux Xeon Phi Xeon Phi CPU Xeon Phi 3

  1 GPU Offload
  2 CPU Xeon Phi Native
  3 CPU Xeon Phi MPI (Message Passing Interface) Symmetric

Offload GPU CPU Native Symmetric Native CPU Xeon Phi Symmetric Xeon Phi KNL (Knights Landing) PCI-Express CPU KNL Native Symmetric Native Symmetric Xeon Phi [1], [2], [3] Xeon Phi CPU Xeon Phi Xeon Phi RTRSDFT ( )

ARTED Xeon Phi Xeon Phi

2. ARTED ARTED (Ab-initio Real-Time Electron Dynamics simulator) RTRSDFT (Real-Time Real-Space Density Functional Theory, ) [4] RTRSDFT RSDFT (Real-Space Density Functional Theory, ) RTRSDFT RSDFT RSDFT 10 [5] ARTED 10 100 ARTED 1 10 ARTED ARTED [6] ARTED 11520 90%

3. ARTED ARTED k k NK NB x y z (NLx, NLy, NLz) (NK, NB, NLx, NLy, NLz) (NK, NB, NL)

    SiO2    (NK, NB, NL) = (4^3, 48, 36000 = (20, 36, 50))
    Si      (NK, NB, NL) = (24^3, 32, 4096 = (16, 16, 16))

MPI k

Figure 1. Box labels: hpsi, psi_rho_rt, Hartree, Exc_Cor, current, Iteration mod 10 = 0 (Yes / No), Ion_Force, Total_Energy, Logging. Data labels: REAL8, SIZE=3; REAL8, SIZE=3NI; REAL8, SIZE=NL; REAL8, SIZE=1; REAL8, SIZE=5; REAL8, SIZE=3NI.

ARTED OpenMP MPI (NK / # of process, NB, NL) ARTED ARTED RSDFT OpenMP (NL) (NK / # of process) NB k
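As a rough size check on the two test systems in Section 3, the arithmetic can be written out as follows. This is a sketch under two assumptions: that "4 3" and "24 3" denote 4^3 and 24^3 k-points, and that each orbital is stored as complex(8), i.e. 16 bytes per element, as in the hpsi listings. The symbols N_elem, M and P are illustrative and do not appear in the paper.

```latex
\begin{align*}
N_L &= N_{Lx}\,N_{Ly}\,N_{Lz} \\
\mathrm{SiO_2}:\quad (N_K, N_B, N_L) &= (4^3,\ 48,\ 20\cdot 36\cdot 50) = (64,\ 48,\ 36000) \\
\mathrm{Si}:\quad (N_K, N_B, N_L) &= (24^3,\ 32,\ 16^3) = (13824,\ 32,\ 4096) \\
N_{\mathrm{elem}} &= N_K\,N_B\,N_L =
  \begin{cases}
    110{,}592{,}000 & (\mathrm{SiO_2}) \\
    1{,}811{,}939{,}328 & (\mathrm{Si})
  \end{cases} \\
M &= 16\,\mathrm{B}\times N_{\mathrm{elem}} \approx
  \begin{cases}
    1.77\ \mathrm{GB} & (\mathrm{SiO_2}) \\
    29.0\ \mathrm{GB} & (\mathrm{Si})
  \end{cases}
\end{align*}
```

With the k-points divided over P MPI processes, each process holds an array of shape (NK / P, NB, NL), so the per-process figure is M / P.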

Table 1

    Number of Node     32
    CPU                Intel E5-2670v2 x 2
    Number of Cores    20 (10 cores x 2 sockets)/node
    Memory             64GB
    Xeon Phi           7110P x 2/Node
    Compiler           Intel 15.0.0
    InfiniBand         FDR Connect-X3 (56 Gbit/s)
    MPI                Intel MPI 5.0.1
    Intel MKL          11.1.2
    OS                 Red Hat Enterprise Linux Server 6.4
    MPSS
    OFED

Figure 2: COMA. Labels: InfiniBand Network, InfiniBand FDR 2, Gen3 x8 PCI-Express, CPU 0, QPI, CPU 1, Gen2 x16, MIC 0, Gen2 x16, MIC 1.

NL NI SiO2 18 Si 8 1 hpsi 25 Byte/Flop 20 current hpsi 4 25 hpsi CPU Native Symmetric

4. Xeon Phi COMA 32 [7] COMA 2 Xeon Phi 5TFLOPS 1PFLOPS 1 2 COMA InfiniBand Xeon Phi CPU0 Xeon Phi COMA QPI Xeon CPU PCI-Express InfiniBand NVIDIA GPU QPI [8] CPU 1 OpenMP 10 Xeon Phi 1 240 Symmetric Symmetric 2 CPU 2 Xeon Phi 1 4 2 CPU 20 1 Symmetric 3

5. Xeon Phi

5.1 hpsi Xeon Phi 512bit SIMD Intrinsics Intel Intel Xeon Phi Xeon Phi 240 OpenMP OpenMP SiO2 SiO k k

Figure 3: hpsi Xeon Phi CPU

Table 2: SiO2. Columns: # of Process, Loop Size, Field Size/Process.

Xeon Phi 240 Xeon Phi CPU Xeon Phi Xeon Phi CPU 40% Xeon Phi ARTED CPU 56% NL (NLx, NLy, NLz) unroll j Intel Intel [9]

( )

    -ipo -fp-model fast=2 -complex-limited-range -no-vec-guard-write -qopt-ra-region-strategy=block

(Xeon Phi )

    -qopt-threads-per-core=4 -qopt-gather-scatter-unroll=4 -qopt-assume-safe-padding -opt-streaming-cache-evict=0 -qopt-streaming-stores always

-opt-assume-safe-padding -qopt-threads-per-core Xeon Phi -qopt-threads-per-core=4 4 Xeon Phi CPU -qopt-streaming-stores -qopt-streaming-stores -opt-streaming-cache-evict 0 65% OpenMP COMA Xeon Phi 240 SiO OpenMP 4

    real(8) :: cef
    complex(8) :: zi,sx,sy,sz,tx,ty,tz
    real(8) :: lapx(4),lapy(4),lapz(4)
    real(8) :: nabx(4),naby(4),nabz(4)
    ! first dimension spans -4:4 because the loop reads offsets -4..-1 and 1..4
    integer :: idx(-4:4,NL),idy(-4:4,NL),idz(-4:4,NL)
    complex(8) :: u(NL),s(NL)

    do i=1,NL
      sx=lapx(1)*(u(idx(1,i))+u(idx(-1,i)))&
      &+lapx(2)*(u(idx(2,i))+u(idx(-2,i)))&
      &+lapx(3)*(u(idx(3,i))+u(idx(-3,i)))&
      &+lapx(4)*(u(idx(4,i))+u(idx(-4,i)))
      tx=nabx(1)*(u(idx(1,i))-u(idx(-1,i)))&
      &+nabx(2)*(u(idx(2,i))-u(idx(-2,i)))&
      &+nabx(3)*(u(idx(3,i))-u(idx(-3,i)))&
      &+nabx(4)*(u(idx(4,i))-u(idx(-4,i)))
      ! sy, ty, sz, tz: same form along y and z
      s(i)=cef*u(i)-0.5d0*(sx+sy+sz)-zi*(tx+ty+tz)
    end do

Figure 4

    real(8) :: cef
    complex(8) :: zi,st,tt
    complex(8) :: lapt(12),nabt(12)
    integer :: idpt(12,NL),idmt(12,NL)
    complex(8) :: u(NL),s(NL)

    !dir$ vector aligned
    do iz=1,NLz
    do iy=1,NLy
    do ix=1,NLx
      ! linear index of grid point (ix,iy,iz), x fastest
      i=((iz-1)*NLy*NLx)+((iy-1)*NLx)+ix
      st=0.d0
      tt=0.d0
    !dir$ unroll(12)
      do j=1,12
        st=st+lapt(j)*(u(idpt(j,i))+u(idmt(j,i)))
        ! difference of the +/- neighbors, matching tx in the listing of Figure 4
        tt=tt+nabt(j)*(u(idpt(j,i))-u(idmt(j,i)))
      end do
      s(i)=cef*u(i)-0.5d0*st-zi*tt
    end do
    end do
    end do

Figure 5

(OMP_SCHEDULE=static) (OMP_SCHEDULE=static,1) (OMP_SCHEDULE=dynamic,1) (KMP_AFFINITY) compact balanced scatter 25

Table 3: Si (diamond). Columns: # of Process, Loop Size, Field Size/Process.

OpenMP 75% IPO (Interprocedural Optimization) 0.78 CPU CPU 94% -qopt-streaming-stores CPU 117% CPU Xeon Phi 66%

5.2 Si hpsi 5.1 SiO2 2 Si hpsi Si 3 Si k SiO Xeon Phi SiO Xeon Phi hpsi 6 Xeon Phi OpenMP SiO Si
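Read as a formula, the hpsi listings in Section 5.1 apply a 25-point finite-difference stencil at each grid point (the center plus four neighbors in each direction along each of three axes). The following is a transcription of the first listing, not a statement taken from the paper. It assumes that `zi` holds the imaginary unit and writes u_{i ± j e_d} for the neighbor of point i that lies j grid steps away along axis d, which the code looks up through idx, idy and idz.

```latex
\begin{align*}
s_i &= c_{ef}\,u_i
      - \frac{1}{2}\sum_{d\in\{x,y,z\}}\sum_{j=1}^{4}
        \mathrm{lap}_{d,j}\,\bigl(u_{i+j\mathbf{e}_d} + u_{i-j\mathbf{e}_d}\bigr)
      - \mathtt{zi}\sum_{d\in\{x,y,z\}}\sum_{j=1}^{4}
        \mathrm{nab}_{d,j}\,\bigl(u_{i+j\mathbf{e}_d} - u_{i-j\mathbf{e}_d}\bigr) \\
i &= (i_z - 1)\,N_{Ly}\,N_{Lx} + (i_y - 1)\,N_{Lx} + i_x
\end{align*}
```

The second listing flattens the 3 x 4 coefficients into lapt(1:12) and nabt(1:12) and the neighbor lookups into idpt and idmt, so the double sum becomes a single loop over j = 1, ..., 12 that the compiler is asked to unroll.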

Figure 6: Si hpsi Xeon Phi CPU

2 CPU 1 CPU Xeon Phi MPI

5.3 current hpsi hpsi psi_rho_RT hpsi hpsi CPU Xeon Phi Xeon Phi CPU Xeon Phi Xeon Phi Xeon Phi

6. ARTED CPU Native Symmetric 3 CPU Xeon Phi Symmetric Si 100 OpenMP omp do schedule(dynamic,1) (hpsi current)

6.1 CPU Xeon Phi 8 32 CPU Xeon Phi 7 1 hpsi current psi_rho_RT 3 Xeon Phi hpsi CPU Xeon Phi CPU current psi_rho_RT CPU psi_rho_RT CPU Xeon Phi 3 omp parallel do OpenMP omp parallel omp do nowait CPU CPU Xeon Phi CPU 90% CPU Symmetric Xeon Phi 32 CPU CPU 64 Symmetric CPU Xeon Phi Xeon Phi

Figure 7

Figure 8 ()

Figure 9 ( )

CPU 32 Xeon Phi Symmetric 16 90% CPU 1 Xeon Phi 1 COMA 2 Symmetric 2 Xeon Phi CPU Symmetric Xeon Phi CPU 10 OpenMP Xeon Phi Symmetric

7. ARTED Xeon Phi CPU Symmetric CPU ARTED CPU 1 Xeon Phi 1 CPU 90% Symmetric Xeon Phi CPU Symmetric CPU Symmetric CPU Xeon Phi 26

[1] Xeon Phi (Knights Corner) Vol HPC-143, No. 32 (2014).
[2] GPU/MIC Vol HPC-144, No. 4 (2014).
[3] Xeon Phi Vol HPC-139, No. 20 (2013).
[4] Shunsuke A. Sato and Kazuhiro Yabana: Maxwell + TDDFT multi-scale simulation for laser-matter interactions, J. Adv. Simulat. Sci. Eng., Vol. 1, No. 1, pp (2014).
[5] Hasegawa, Y., Iwata, J.-I., Tsuji, M., Takahashi, D., Oshiyama, A., Minami, K., Boku, T., Shoji, F., Uno, A., Kurokawa, M., Inoue, H., Miyoshi, I. and Yokokawa, M.: First-principles Calculations of Electron States of a Silicon Nanowire with 100,000 Atoms on the K Computer, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, ACM, (online), DOI: / (2011).
[6] Schultze, M., Ramasesha, K., Pemmaraju, C., Sato, S., Whitmore, D., Gandman, A., Prell, J. S., Borja, L. J., Prendergast, D., Yabana, K., Neumark, D. M. and Leone, S. R.: Attosecond band-gap dynamics in silicon, Science 12 December 2014, Vol. 346, No. 6215, pp (online), DOI: /science
[7] COMA (PACS-IX) ac.jp/files/coma-general/coma_outline.pdf.
[8] HA-PACS/TCA TCA InfiniBand Vol HPC-147, No. 32 (2014).
[9] Intel: User and Reference Guide for the Intel Fortran Compiler 15.0, compiler_15.0_ug_f.
