THE PARALLEL Issue UNIVERSE James Reinders Parallel Building Blocks: David Sekowski Parallel Studio XE Cluster Studio Sanjay Goil John McHug

THE PARALLEL Issue 5 2010 11 UNIVERSE James Reinders Parallel Building Blocks: David Sekowski Parallel Studio XE Cluster Studio Sanjay Goil John McHugh

JAMES REINDERS 3 Parallel Studio XE Cluster Studio SANJAY GOIL JOHN MCHUGH 5 Parallel Studio XE Cluster Studio Windows* Linux* C/C++ Fortran Parallel Building Blocks: DAVID SEKOWSKI 17 3 TBB Cilk Plus Array Building Blocks ArBB Array Building Blocks MICHAEL MCCOOL 23 Array Building Blocks ArBB SIMD MKL GREG HENRY SHANE STORY 29 MKL hotspot Print : DON GUNNING NICK MENG PAUL BESL 31 / print 2010 Intel Corporation. Intel Intel Intel Core Itanium Xeon Intel Corporation * 2

James Reinders Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism Parallel Universe 1 SIMD 2010 / 3 http://software.intel.com/en-us/articles/optimization-notice

2010 11 9 Parallel Studio XE Cluster Studio 2 Parallel Studio XE C C++ Fortran Cluster Studio MPI C C++ Fortran 12.0 VTune MPI C/C++ Parallel Building Blocks Co-Array Fortran.NET Parallel Studio XE Linux* Windows* Composer XE VTune Amplifier XE Inspector XE Cluster Studio Linux* Windows* Composer XE MPI MPI / OpenMP* MPI Co-Array Fortran MKL IPP TBB Cilk Plus Array Building Blocks ArBB Parallel Building Blocks PBB Parallel Building Blocks Fortran MKL / Intel Inspector XE: An essential tool during development along with Intel Composer XE Inspector XE: Composer XE Inspector XE hotspot hotspot VTune Amplifier XE MPI MPI On a path to petascale with commodity clusters and Intel MPI MPI HPC MPI / Parallel Studio XE Cluster Studio JAMES REINDERS 2010 11 4

Parallel Studio XE Cluster Studio Sanjay Goil John McHugh 5 http://software.intel.com/en-us/articles/optimization-notice

Parallel Studio XE 2011 Composer XE Inspector XE VTune Amplifier XE 1 2010 9 Parallel Studio 2011 Microsoft* Visual Studio* Windows* C++ Parallel Building Blocks PBB C/C++ Windows*/Linux* C/C++ Fortran Parallel Studio XE 2011 C/C++ Fortran MKL IPP PBB TBB Cilk Plus ArBB Inspector XE / VTune Amplifier XE HPC 10 HPC Cluster Studio MPI HPC MPI / C/C++ Fortran 1 / 1 IA-32 64 MPI HPC HPC 2 / 6

Composer XE C/C++ Fortran > / Inspector XE / > VTune Amplifier XE > 2 Parallel Studio XE 2011 > : Parallel Studio XE Windows* Linux* Mac OS* X C/C++ Fortran > : Inspector XE / > : VTune Amplifier XE MKL IPP / > : Parallel Studio XE > : Composer XE AVX C/ C++ PBB Fortran Fortran 2008 Co-Array Fortran > : Parallel Studio XE 1 hotspot / 7 http://software.intel.com/en-us/articles/optimization-notice

Composer XE C++ C++ XE Visual Fortran Visual Fortran Composer XE Visual Fortran IMSL* Visual Fortran Composer XE IMSL* VTune VTune Amplifier XE Inspector XE Cluster Studio 3 Parallel Studio XE 2011 Windows*/Linux* x86 C/C++ Fortran 3 8

Parallel Studio XE 2011 Windows* C++ Cilk Plus Parallel Studio XE 2011 BR&E Inc. Jorge Martinis 4 SIMD 5 9 http://software.intel.com/en-us/articles/optimization-notice

hotspot BlueJeans Network IPP 7.0 / IPP IPP 7.0 CPU BlueJeans Network Emmanuel Weber 6 Composer XE Composer XE C/C++ Fortran (v12.0) MKL 10.3 IPP 7.0 TBB 3.0 2 C/C++ C++ XE 12.0 AVX Sandy Bridge PBB Cilk Plus TBB ArBB 4 AVX SIMD x86 (GAP) Windows* Visual Studio* 2010 Fortran XE 12.0 x86 Fortran 2003 Fortran 2008 (Co-Array Fortran AVX ) 5 / MKL 10.3 AVX LAPACK C IPP 7.0 AVX AES 10

Inspector XE > Inspector XE 2011 OTRADA Inc. CEO CTO Alex Migdalski 7 / Parallel Studio XE 2011 Parallel Studio Linux* Windows* Windows* Linux* 6 Inspector XE / 7 / / C/C++ Fortran Inspector XE / : > > > > > > Windows*/Linux* GUI > 11 http://software.intel.com/en-us/articles/optimization-notice

8 VTune Amplifier XE - hotspot > > CPU VTune Amplifier XE - VTune Amplifier XE 2011 VTune... > > {+} 9 VTune Amplifier XE 2011 VTune Windows*/Linux* 8 9 VTune : > > hotspot > > > > Windows* > > > > Visual Studio > root (Linux*) > EBS root 12

Inspector XE - > > (SSA) SSA Envivio Mikael Le Guerroue Parallel Studio XE 10 Parallel Studio XE 2011 10 11 Parallel Studio XE (SSA) > > > > 250 > > > > Windows*/Linux* GUI 13 http://software.intel.com/en-us/articles/optimization-notice

Linux* Windows* Windows* Linux* C/C++ Parallel Building Blocks C/C++ Co-Array Fortran (CAF) Fortran 2008 Fortran Fortran Cluster Studio 2011 Fortran 1 GUI 11 Cluster Studio > Composer XE > MPI > / HPC Cluster Studio 2011 IA-32 64 MPI MPI 4.0 / 8.0 MPI C/C++ Fortran MKL 10.3 IPP 7.0 PBB Composer XE Cluster Studio 2011 Co-Array Fortran (CAF) (ICR) IA (TCO) Cluster Studio 2011 MPI MPI OpenMP* PBB Cluster Studio 2011 C/C++ Fortran Linux* Windows* 1 14

PO 12 / > > > > hotspot Cluster Studio 2011 > : MPI > : Composer XE C/C++ PBB Fortran Fortran 2008 Co-Array Fortran SIMD > MPI : / MPI > : Windows* Linux* > : IA HPC (TCO) > : Cluster Studio 1 15 http://software.intel.com/en-us/articles/optimization-notice

MPI MPI InfiniBand* (IB) MPI MPI 1 InfiniBand (IB) C/C++ Parallel Building Blocks SMP C/ C++ Coarray Fortran (CAF) Fortran 2008 Fortran Linux* Windows* Fortran Fortran 2008 Fortran 2003 MKL IPP Linux* Windows* Windows* Linux* 13 Parallel Studio XE Cluster Studio Windows* Linux* C/C++ Fortran Parallel Studio XE 2011 C/C++ Fortran MKL IPP PBB TBB Cilk Plus ArBB Inspector XE / VTune Amplifier XE Cluster Studio 2011 MPI / C/C++ Fortran MKL IPP PBB TBB Cilk Plus ArBB : http://www.intel.co.jp/jp/software/products/ 16

Parallel Building Blocks David Sekowski 17 http://software.intel.com/en-us/articles/optimization-notice

: 5 C++ TBB Adobe* TBB TBB Parallel Building Blocks PBB > > HPC 1 1967 1988 Andy Grove Only the Paranoid Survive 1 / CPU / > > CPU > > > 1 1 > 1 > 18

1 / MPI 3 2 AI I/O for for ( ) : SIMD (Single Instruction Multiple Data) & SIMD 2 19 http://software.intel.com/en-us/articles/optimization-notice

2 : (1) (2) 3 OS 3 ( ) 20

Parallel Building Blocks - OS - Parallel Amplifier Parallel Inspector - - - - - - - - - - OS HW IDE 4 OpenMP* OpenCL OpenMP* Fortran C OpenMP* OpenMP* Web OpenMP* Parallel Building Blocks TBB TBB C++ TBB API TBB (1) (2) Cilk Plus Array Building Blocks ArBB 5 Cilk Plus C/C++ C/C++ 3 21 http://software.intel.com/en-us/articles/optimization-notice

www.threadingbuildingblocks.org TBB www.threadingbuildingblocks.com http://cilk.com Cilk Plus Windows* Microsoft* C++ Linux* GCC Cilk ++ SDK http://software.intel.com/en-us/articles/intel-cilk/ http://intel.com/go/arbb ArBB http://software.intel.com/en-us/articles/ intel-array-building-blocks 5 ArBB Ct TBB C++ JIT CPU C++ ArBB ArBB 3 TBB Cilk Plus ArBB 5 3 3 Parallel Building Blocks PBB Parallel Studio 2011 Parallel Studio XE 2011 PBB ArBB Ct TBB C++ 22

Array Michael McCool Building Blocks

Array Building Blocks ArBB Parallel Building Blocks ArBB Microsoft* GNU C++ ArBB http://intel.com/go/arbb Windows* Linux* ArBB?? ArBB SIMD ArBB API SIMD 2 SIMD SIMD SIMD SIMD Parallel Building Blocks PBB SIMD SIMD SSE 1 SSE 4 AVX SIMD 8 (MIC) SIMD 16 24

Parallel Building Blocks PBB 3 SIMD SIMD 2 1 SIMD 2 SIMD SSE AVX MIC AVX SSE SSE AVX 2 1 C++ 1 SSE AVX AVX SIMD SIMD SIMD C/C++ Parallel Building Blocks PBB 3 1 MKL IPP 2 1 C/C++ Cilk Plus C/C++ 2 ArBB ArBB C++ API ISO C++ C++ - C/C++ ArBB ArBB ArBB C++ API ArBB C++ float) integer) f32 i32 32 C++ ArBB dense<t,d> T D D 1 T ArBB ArBB 25 http://software.intel.com/en-us/articles/optimization-notice

ArBB 2 A B C D 4 dense<f32> A += (B/C) * D; ArBB call void doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D) { A += (B/C) * D; }... call(doit)(a,b,c,d); call 1 ArBB call C++ ArBB C++ C++ ArBB ArBB ArBB call capture call C++ ArBB call BLOG highlights Cilk Plus ABI JAMES REINDERS Cilk Plus 2010 11 2 Cilk Plus ABI cilk.com Cilk Plus Windows* Linux* Cilk Plus Cilk Plus 2009 Cilk Arts JAMES Go-Parallel.com Go Parallel : Translating Multicore Power into Application Performance 26

void doit dense<f32>& A, f32 b, f32 c, dense<f32> D) { map(kernel)(a, b, c, D); } call(doit)(a,b,c,d); ArBB void kernel(f32& a, f32 b, f32 c, f32 d) { a += (b/c)*d; } kernel b c f32 kernel A D b c call map 2 ArBB prefix scan) / / ) 1 1 2 A B map call map f32 A _ dot _ B = add _ reduce A * B); void doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D) { map(kernel)(a, B, C, D); } call(doit)(a,b,c,d); C++ 1 ArBB ArBB 27 http://software.intel.com/en-us/articles/optimization-notice

capture ArBB C++ ArBB (_end_for ) _for (;) (,) ArBB / ArBB STL bind ArBB C++ call doit map 1 2 1 1 1 ArBB int max _ count = MAX _ COUNT; void mandel(i32& d, std::complex<f32> c) { i32 i; std::co m plex<f32> z = 0.0f; _ for (i = 0, i < max _ count, i++) { _ if (abs z) >= 2.0f) { _ break; } _ end _ if; z = z*z + c; } _ end _ for; d = i; } v o i d doit(dense<i32,2>& D, dense<std::co m plex<f32>,2> P) { map(mandel)(d,p); } dense<std::co m plex<f32>,2> pos; bind(pos, c _ pos, cols, rows) dense<i32,2> dest; bind(dest, c _ dest, cols, rows) call(doit)(dest, pos); Array Building Blocks ArBB ArBB ArBB C++ C++ ArBB ArBB http://intel.com/go/arbb std::complex ArBB C++ max_count call 28

MKL MKL Greg Henry Shane Story MKL MKL hotspot MKL BLAS (Basic Linear Algebra Subroutines) LAPACK (Linear Algebra PACKage) (FFT) (VML) (VSL) (PARDISO) BLAS MKL ScaLAPACK BLAS (PBLAS) FFT IA-32/ 64 AMD* Linux* Windows* Mac OS* X MKL DGEMM BLAS MKL MKL Core 2 Duo Core i7 MKL MKL OpenMP* LAPACK BLAS BLAS VML FFT FFT LAPACK BLAS LAPACK MKL 29 http://software.intel.com/en-us/articles/optimization-notice

MKL OpenMP* MKL MKL_NUM_THREADS OpenMP* MKL MKL OpenMP* MKL OpenMP* / OpenMP* OpenMP* MKL MKL Microsoft* GNU OpenMP* OpenMP* MKL_DOMAIN_NUM_THREADS mkl_domain_set_num_threads() MKL MKL 2 BLAS 4 MKL_ALL=2, MKL_BLAS=4 Cilk Plus TBB pthreads (Linux*) OpenMP* MKL MKL MKL_NUM_THREADS 1 MKL LAPACK BLAS LAPACK BLAS BLAS LAPACK DGETRF BLAS LAPACK (MPI) MKL MP LINPACK (DGETRF ) MPI-OpenMP* 1 1 MPI OpenMP* MPI MKL OpenMP* MPI MKL 1 OpenMP* 1 OpenMP* MKL MKL MKL_DYNAMIC FALSE TRUE mkl_set_dynamic() FFT 2 MKL Web : [BLAS] http://www.netlib.org/blas/index.html [LAPACK] http://www.netlib.org/lapack/index.html [MKL] http://software.intel.com/en-us/intel-mkl/ [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [MPI] http://www.intel.com/go/mpi [OPENMP] www.openmp.org [SCALAPACK] http://www.netlib.org/scalapack/index.html 30

print : Don Gunning Nick Meng Paul Besl MPI 4.0 / / print MPI 4.0 30 50 % LSTC dyna 1,500 fluent 3,000 4.0 / ISV 1 64 1 31 http://software.intel.com/en-us/articles/optimization-notice

5.5 5 4.5 4 3.5 3 2.5 2 (H) 1.5 1.0.5 0.0 0 2.5k 5k 7.5k 10k 15k 20k 1. source /opt/intel/itac/8.0.1.001/bin/itacvars.sh export LD _ PRELOAD=/opt/intel/itac/8.0.1.001/slib/libVT.so # 8p run runexec small _ model.pre -np 8 --mpi-options -trace --machines-file $PBS _ NODEFILE #256p run runexec large _ model.pre -np 256 --mpi-options -trace --machines-file $PBS _ NODEFILE / MPI trace mpi-options trace ITAC / 8 256 2 STF 2 3 2 P0 P1 P7 P0 P1 32

MPI BCAST 2. / MPI MPI : [ ] : [ ] MPI BCAST 3. / MPI MPI : : 33 http://software.intel.com/en-us/articles/optimization-notice

4. / 256 : MPI : 5. / MPI MPI 3 MPI_Bcast 256 4 5 6 7 4 MPI MPI MPI 6 7 MPI_BCAST 6 7 MPI MPI 34

6. / 0 60 7. / 21.136 21.176 : MPI : MPI / 35 http://software.intel.com/en-us/articles/optimization-notice

MPI 4 MPI MPI 5 6 7 MPI _BCAST MPI_BCAST MPI_BCAST I_MPI_ADJUST_BCAST=4 / ISV 64 Livermore Software Technology Corporation (LSTC) LSTC OpenMP* MPI / LSTC DYNA : LSTC LS-DYNA 1,000 43 LSTC DEVELOPER SPOTLIGHT & UNIX System V 10 8 6 (ISS) IA 32 IA IA 32 IA ERP Sparc* Sun* Solaris* IA Linux* Larrabee Larrabee IA-32 Itanium Larrabee 36

8. / 4 9. / 32 37 http://software.intel.com/en-us/articles/optimization-notice

Xeon 7560 1 32 Xeon 5560 8 1 8 64 MPP (MPI) 44013 18521 MPI OpenMP* 7047 5541 6.25 3.34 10. LSTC CYL1E6 : LSTC / print 100 MPP DYNA MPI LSTC / 8 9 MPI 4 32 2 OpenMP* MPI MPI MPI 100 : LSTC LS DYNA MPP DYNA HYBRID LS-DYNA > OpenMP* MPI 1 OpenMP* LS-DYNA > MPI I/O 10 38

11. / 128 100 MPI_BCAST MPI MPI_RECV 12. / 128 100 39 http://software.intel.com/en-us/articles/optimization-notice

13. / 128 100 ITAC MPI 4.0 HYBRID LS-DYNA 100 / 8.0.1 11 12 13 MPI_RECV MPI_BCAST / MPI_RECV 40

SIMD SIMD 2 SSE2 SIMD 3 SSE3 SIMD 3 (SSSE3) #20110307 Web http://www.intel.co.jp/jp/software/products/ 41 http://software.intel.com/en-us/articles/optimization-notice