: (1), ( ) 1 1.1,, 1 OpenMP [3, 5, 21, 22], MPI [13, 18, 23].., (C Fortran)., OS,. C Fortran,,,,. ( ),,.,,.,,,.,,,.,.,. 1

てるえさかいざわ
4 years ago

2 1.2,.,,,,.. CPU,,., (, ). (NUMA ).,.,. Flat MPI,,.,,. GPU, SIMD, [11]. C Fortran,., SIMD GPU,,..,., C Fortran,,.,,. 2

3 ,,.,.,, x FLOPS ( ), x/2 8 = 4x /., /FLOPS = 4,., ,,,. C Fortran, Python Perl, MATLAB Excel,.,,,,.,, Java concurrency framework [12], fork/join [16] (JDK 7 ), CUDA [31] OpenCL GPU, Hadoop [15, 29],., DARPA High Productivity Computer Systems (HPCS)., IBM X10 [26, 30], Cray Chapel [4, 8], Sun Microsystems ( ) Fortress [1, 10],. HPCS, Ubiquitous High Performance Computing (UHPC) Program, (easier to program than current systems) , MPI OpenMP,,,.,.. 3

,. HPC.,., Linux, HA8000.,, , TBB, C++.,.

Cilk [6, 14]: MIT Leiserson, C,.,. Cilk Arts, Intel Cilk Plus [7], Intel Parallel Building Blocks [24]. MIT [6] Cilk

Intel Threading Building Blocks (TBB) [25, 27]: Intel Parallel Building Blocks, C++., Cilk,.,.

Unified Parallel C (UPC) [9, 28]:. C.,,,. MPI SPMD,.

Chapel [4, 8]: Cray,. UPC,,. MPI UPC, 1,.

X10 [26, 30]: IBM,. Chapel. Chapel, ( ).,,.

, MPI, OpenMP, Co-Array Fortran (CAF) [2, 19].,,., 1., 2.,,, 3.,, ,.

1. (Embarrassingly Parallel) ,,,,.,.,.,,...

6 1: ( ), ( ),, 1.4,,.. 2,,.,,.,.,,,,.,,.,,,,,. (, ).,,,,..,. 2.1,,. 6

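2.1.1

Consider the expression f() + g(), in which f() and g() can be evaluated in parallel.

Cilk:

    int x = spawn f();
    int y = g();
    sync;
    ... x + y ...

spawn f() creates a task that executes f(). Chapel provides the begin statement for the same purpose:

    var x : sync int;
    var y : int;
    begin x = f();
    y = g();
    ... x + y ...

begin creates a task that executes the statement following it. Where Cilk pairs spawn with an explicit sync, in Chapel the read of x (a sync variable) waits until x has been written. The same computation can also be written with cobegin:

    var x, y: int;
    cobegin {
      x = f();
      y = g();
    }
    ... x + y ...

Chapel also provides coforall. OpenMP supports task creation through the task construct, introduced in version 3.0.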

    int x, y;
    #pragma omp task shared(x)
    x = f();
    #pragma omp task shared(y)
    y = g();
    #pragma omp taskwait
    ... x + y ...

X10, TBB..,,,.,, ,,.,,,.,.,.,,., =1 OS. OS, X10, Chapel., CPU =1 OS,.,, ( ;) OpenMP parallel. 1,,. 3.,,..,. Cilk, TBB, OpenMP task. OpenMP.
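The OpenMP fragment above omits its surrounding context. As a minimal sketch, assuming hypothetical stand-in bodies for f() and g() (the text does not define them), the two tasks can be placed inside a parallel region, with a single construct so that only one thread creates them. The function name sum_with_tasks is likewise introduced only for this example.

```c
#include <assert.h>

/* Hypothetical stand-ins for the f() and g() of the text. */
static int f(void) { return 20; }
static int g(void) { return 22; }

/* Compute f() + g() with two OpenMP tasks.
 * The parallel construct creates a team of threads; single lets exactly
 * one of them create the tasks, which any thread of the team may run. */
int sum_with_tasks(void)
{
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(x)
        x = f();
        #pragma omp task shared(y)
        y = g();
        /* Wait for both child tasks before x and y are read. */
        #pragma omp taskwait
    }
    return x + y;
}
```

Built with an OpenMP-aware compiler (for example gcc -fopenmp), the two tasks may run on different threads. Without OpenMP support the pragmas are ignored and the function computes the same sum serially.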

GCC 4.3..,, 3.,. Cilk TBB,,.,. Chapel X10 Locale Place, 1., (Locale Place) OS., on at Single Program Multiple Data (SPMD),,,., ( ),,., 1 OS 1 1,,., OS,.,, (Bulk Synchronous, Loosely Synchronous ).,, (, ). Single Program Multiple Data (SPMD)., Multiple Data ( ),, Single Program ( ). SPMD, SPMD,,,,.

SPMD MPI. 2 MPI, main, ( mpirun -np )., MPI. UPC CAF SPMD., OpenMP, main, parallel (#pragma omp parallel). parallel, ([22] 35 ). parallel,., parallel, parallel., parallel, SPMD. OpenMP SPMD., parallel, parallel,. X10 (clock resume/next ; [26] 167 )., SPMD, Work sharing SPMD,.,,,.,,.,., SPMD (fragmented view)., (global view)., for :

    for (i = 0; i < n; i++) { f(i); }

, coforall

    coforall (i in 0..n - 1) { f(i); }

2, MPI,.

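(coforall is Chapel's parallel for loop.) In the fragmented view, the same loop is written as:

    begin_idx = (my_rank * n) / n_procs;
    end_idx = ((my_rank + 1) * n) / n_procs;
    for (i = begin_idx; i < end_idx; i++) { f(i); }

Consider quicksort, which makes two recursive calls:

    quicksort(a, p, q) {
      ...
      quicksort(a, p, r);
      quicksort(a, r, q);
    }

In Cilk it is parallelized as:

    quicksort(a, p, q) {
      ...
      spawn quicksort(a, p, r);
      quicksort(a, r, q);
      sync;
    }

In SPMD models, for loops are parallelized by work sharing. In UPC:

    upc_forall (i = 0; i < n; i++; 7*i) { f(i); }

This is a C for statement with one more expression after an additional semicolon. The expression 7*i is called the affinity: iteration i is executed by thread (7*i mod P), where P is the number of threads. That is, the thread whose ID equals (7*i mod P) executes iteration i of the loop. OpenMP provides the for construct (#pragma omp for), which shares the iterations among the threads of the enclosing parallel region.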
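As a rough illustration of the two iteration-to-process mappings on this page, the following C sketch computes the block range used in the fragmented-view loop and the thread that runs an iteration under an integer upc_forall affinity. The names range_t, block_range and affinity_owner are hypothetical helpers introduced here; they are not part of MPI or UPC.

```c
#include <assert.h>

/* Half-open range [begin, end) of loop indices owned by one process. */
typedef struct { long begin; long end; } range_t;

/* Block partition from the fragmented-view loop:
 * begin = rank*n/P and end = (rank+1)*n/P, using integer division. */
range_t block_range(long my_rank, long n_procs, long n)
{
    assert(n >= 0 && n_procs > 0 && my_rank >= 0 && my_rank < n_procs);
    range_t r = { (my_rank * n) / n_procs, ((my_rank + 1) * n) / n_procs };
    return r;
}

/* Thread that executes an iteration whose integer affinity expression
 * evaluates to `affinity`, e.g. 7*i in the upc_forall example. */
long affinity_owner(long affinity, long n_threads)
{
    assert(affinity >= 0 && n_threads > 0);
    return affinity % n_threads;
}
```

With n = 10 and 4 processes, block_range yields [0, 2), [2, 5), [5, 7) and [7, 10), so every index is covered exactly once even though n is not a multiple of the process count. With the affinity 7*i and 4 threads, iteration i = 3 runs on thread 21 mod 4 = 1.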

Table 2:.

                              SPMD   work sharing
    MPI      N    N           Y      N
    CAF      N    N           Y      N
    UPC      N    N           Y      Y              upc_forall affinity
    OpenMP   Y    Y/N (*3)    Y      Y              #pragma omp for schedule
    Cilk     Y    Y           N      -
    TBB      Y    Y           N      -
    X10      Y    N           Y      -              : at, : OS
    Chapel   Y    N           N      -              : on, : OS

    (*3) OpenMP 3.0 task..

Chapel coforall, SPMD.,,, for.. Work sharing. for :? Y. :,, ( )? Y. SPMD: SPMD ( )? Y. work sharing :, for (work sharing )? Y. :,.

2.2

,,.,,. :

    double a[n][n];
    ...
    a[i][j] = 0.25 * (a[i+1][j] + a[i][j+1] + a[i-1][j] + a[i][j-1]);

, a,,.,,.,,,. OpenMP, Cilk, TBB. a, ,,. send/receive.. SPMD, (1-N ), (N-1 ), (N-N ).,..

1. ( ),.
2., ( ), ( ).,,.
3.,,.
4.,,., A a, B b, A B, b B A, a (fetch deadlock).,.
5.,, (send deadlock).,,,,.

Fetch deadlock,,, a/b,. SPMD,,,.,., a b, send deadlock,. (MPI_Isend )., A a b B a b. MPI ,,,. 2 ( ).

(one-sided communication) (Remote Memory Access; RMA). put, get API,,.,,. MPI-2 MPI_Get, MPI_Put API, RMA.,,,., NIC,, CPU RMA. RMA,,. MPI-2, MPI_Get, MPI_Put (MPI_Win_create).,., (MPI_Win_fence)., ( ) RMA (Global Address Space).,,,.. Partitioned Global Address Space (PGAS). PGAS,. PGAS 2,, p x, p a 5,, ( ) a 23., a, a 5. (local view) PGAS, (global view) PGAS. CAF, UPC, Chapel, X10.,, (global address space),, a 5

,, PGAS CAF, co-array.

    real, dimension(n) :: a

    real, dimension(n)[*] :: a

a co-array, n a (CAF, )., 3 5,

    a(5)[3]

PGAS UPC Chapel,,. UPC (shared),.

    shared int x;

, x, x.,.,

    shared int b[100*THREADS];

100*THREADS ( ), 100., 0 ≤ i < 100*THREADS.,,. ( ). UPC shared ( ), ( ),,
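To make the layout of a UPC shared array concrete, the sketch below maps a global index to the thread that has affinity to the element and to the element's position within that thread's portion. It assumes the standard UPC rule that element i of an array declared with block size B has affinity to thread (i / B) mod THREADS, with B = 1 when no block size is given. The names place_t and block_cyclic_place are hypothetical, introduced only for this example.

```c
#include <assert.h>

/* Owner thread and position within that thread's local portion. */
typedef struct { long owner; long local; } place_t;

/* Block-cyclic mapping of global index i for block size `block`
 * over `n_threads` threads (block = 1 gives a purely cyclic layout). */
place_t block_cyclic_place(long i, long block, long n_threads)
{
    assert(i >= 0 && block > 0 && n_threads > 0);
    place_t p;
    /* Blocks are dealt out to the threads round-robin. */
    p.owner = (i / block) % n_threads;
    /* Full rounds completed before this block, plus offset inside it. */
    p.local = (i / (block * n_threads)) * block + i % block;
    return p;
}
```

For shared int b[100*THREADS] with the default block size of 1, block_cyclic_place(i, 1, THREADS) gives owner i mod THREADS, so each thread holds 100 elements. With 4 threads, for instance, global index 23 is element 5 of thread 3's portion.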

. shared,. Chapel (class ). shared,. X10., Chapel UPC.,. (1 ), (GlobalRef),.,,., PGAS,., UPC, Chapel, X10, -.,. Chapel,., Chapel X10,., on at,. UPC, (upc_alloc)., (upc_global_alloc, upc_all_alloc) :?,,? Y. RMA: Y, (RMA)? Y.

Table 3:

                  RMA       PGAS   global view
    MPI      Y    Y (*4)    N      N             -
    CAF      Y    Y         Y      N             -
    UPC      Y    Y         Y      Y             block-cyclic
    OpenMP   N
    Cilk     N
    TBB      N
    X10      Y    Y (*5)    Y      Y             block
    Chapel   Y    Y         Y      Y             block-cyclic,

    (*4) MPI-2
    (*5) at

PGAS: Y, (PGAS)? Y. Global View: Y, global view, local view. Y. : Y,, 3,,,,,.,,. X10 Chapel.,,,., PGAS,,. (aggregation),. UPC upc_memget, upc_memput API., upc_memget (

).,, MPI., Chapel X10.,, 2.,.,,,,.,, MPI.,, GPU/CPU,, GPU., (Cilk ), GPU, [17, 20].,,.

[1] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt. The Fortress language specification version 1.0. Technical report, Sun Microsystems, Inc.
[2] Co-Array Fortran.
[3] Rohit Chandra, Ramesh Menon, Leo Dagum, David Kohr, Dror Maydan, and Jeff McDonald. Parallel Programming in OpenMP. Morgan Kaufmann.
[4] The Chapel parallel programming language.
[5] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press.
[6] The Cilk project.
[7] Intel Cilk Plus.
[8] Cray. Chapel language specification. Technical report, Cray Inc.
[9] Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick. UPC: Distributed Shared Memory Programming. John Wiley & Sons Inc.
[10] Project Fortress community.
[11] Michael Garland and David B. Kirk. Understanding throughput-oriented architectures. Communications of the ACM, 53(11):58-66.
[12] Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. Java Concurrency in Practice. Addison-Wesley.
[13] William Gropp, Ewing Lusk, and Rajeev Thakur. Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press.
[14] Supercomputing Technologies Group. Cilk Reference Manual. MIT Laboratory for Computer Science.
[15] Hadoop.
[16] Doug Lea. A Java fork/join framework. In JAVA '00: Proceedings of the ACM 2000 conference on Java Grande, pages 36-43, New York, NY, USA. ACM.
[17] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM.
[18] The message passing interface (MPI) standard.
[19] Robert W. Numrich and John Reid. Co-array Fortran for parallel programming. SIGPLAN Fortran Forum, 17:1-31.
[20] Satoshi Ohshima, Shoichi Hirasawa, and Hiroki Honda. OMPCUDA: OpenMP execution framework for CUDA based on Omni OpenMP compiler. In Proceedings of International Workshop on OpenMP, volume 6132 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg.
[21] OpenMP.
[22] OpenMP Application Program Interface Version
[23] Peter Pacheco. Parallel Programming with MPI. Morgan Kaufmann.
[24] Intel Parallel Building Blocks.
[25] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly & Associates Inc.
[26] Vijay Saraswat, Bard Bloom, Igor Peshansky, Olivier Tardieu, and David Grove. Report on the programming language X10 version 2.1. Technical report, IBM. latest.pdf.
[27] Intel Threading Building Blocks 3.0 for open source.
[28] Unified Parallel C.
[29] Tom White. Hadoop: The Definitive Guide. O'Reilly & Associates Inc.
[30] X10.
[31] and. CUDA. I O,
