
Lecture Series: Making the Most of High-Productivity Parallel Languages (5) Molecular Dynamics Simulation

Kenjiro Taura
Graduate School of Information Science and Technology / Information Technology Center, The University of Tokyo

Contents
1 Problem definition
2 The sequential program
2.1 Molecules (particles)
2.2 Cells
2.3 State of the system
2.4 One simulation step
2.5 Force calculation
2.6 Updating velocities and positions
2.7 Migrating molecules between cells
3 Parallelization with OpenMP
4 Parallelization with Cilk
5 Parallelization with Intel TBB
6 Experiments
6.1 Evaluation environment
6.2 Initial conditions
6.3 Sequential performance
6.4 Speedup
7 Summary
8 Plans for future installments

1 Problem definition

This installment takes up molecular dynamics simulation [1, 2, 4, 5]. We simulate a system of molecules in which every pair of molecules interacts through the Lennard-Jones potential

  U(r) = 4ε { (σ/r)^12 - (σ/r)^6 },                                  (1)

where ε and σ are constants of the potential and r is the distance between the two molecules. Writing r for the distance between molecules A and B and r_AB for the vector from A to B, the force that molecule B receives from molecule A is

  f = 24ε { 2σ^12 / r^14 - σ^6 / r^8 } r_AB.                          (2)

The force is repulsive when r is small and attractive when r is large, and its magnitude decays quickly (like 1/r^7) toward 0 as r grows, so molecules separated by more than a few σ exert practically no force on each other; in the simulation, interactions beyond a cutoff distance are simply ignored. The molecules are kept inside a cubic bounding box, and a molecule that crosses a face of the bounding box re-enters it from the opposite face (periodic boundary conditions).

2 The sequential program

2.1 Molecules (particles)

A molecule (particle) carries its index, mass, position, velocity, and the force currently acting on it:

typedef struct particle {
  int idx;        /* index of the molecule */
  double m;       /* mass */
  double x[3];    /* position */
  double v[3];    /* velocity */
  double f[3];    /* force acting on the molecule */
} particle, * particle_t;

2.2 Cells

Computing the force between every pair of N molecules would cost O(N^2) per step. Since the force is essentially zero beyond the cutoff distance, it is enough to examine nearby molecules only, which brings the cost per step down to O(N); the standard technique for finding the nearby molecules quickly is the cell method [2].
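To get a feel for how quickly the interaction becomes negligible, evaluate Eq. (1) at r = 3σ, a cutoff of roughly the size used later in this article:

  U(3σ) = 4ε { (1/3)^12 - (1/3)^6 } ≈ 4ε (1.9×10^-6 - 1.4×10^-3) ≈ -5.5×10^-3 ε,

which is only about 0.5% of the depth ε of the potential well (the minimum of U, attained at r = 2^(1/6) σ, is -ε). Contributions from molecules farther away than a few σ can therefore be ignored with little loss of accuracy.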

The simulation box is divided into cubic cells whose side is no smaller than the cutoff distance; with a cutoff of about 3σ, this article uses cells whose side is 3.2σ. Each molecule then only needs to examine the molecules in its own cell and in the neighboring cells. A cell holds the particles that are currently inside it:

/* a cell */
typedef struct cell {
  particle_array particles_in_cell;   /* molecules currently in this cell */
} cell, * cell_t;

Here particle_array is a growable array of particle: n is the number of elements stored, a points to the storage, and sz is the capacity of a.

typedef struct particle_array {
  int n;          /* number of elements stored in a */
  particle * a;   /* storage */
  int sz;         /* capacity of a */
} particle_array, * particle_array_t;

2.3 State of the system

The state of the whole system is kept in a config struct, which holds the cells together with a few global parameters. cells is logically a three-dimensional (x, y, z) array of cells, stored flattened in C (row-major) order.

typedef struct config {
  int step;              /* current step number */
  int n_steps;           /* total number of steps to run */
  int n_particles;       /* total number of molecules */
  double side_len;       /* side length of the simulation box */
  int n_cells_on_side;   /* number of cells along one side */
  double cell_len;       /* side length of a cell */
  cell_t cells;          /* the cells (flattened 3D array) */
} config, * config_t;
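Since this flattened indexing recurs in every routine below, it may be easier to read the code through a small helper; cell_at is a hypothetical function introduced here only for illustration and is not part of the article's code:

/* Hypothetical helper, not part of the article's code: return the cell
   at grid coordinates (i, j, k), stored in row-major (C) order. */
static cell_t cell_at(config_t g, int i, int j, int k) {
  int nos = g->n_cells_on_side;
  return &g->cells[i * nos * nos + j * nos + k];
}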

2.4 One simulation step

One step of the simulation consists of three phases:

1. force calculation (calc_force_cells),
2. updating velocities and positions (update_cells),
3. migrating molecules between cells (migrate_cells).

The step function runs these three phases in order and records the potential energy U and the kinetic energy K of the system. U is halved because the force calculation below counts every pair of molecules twice, once from each side.

void step(config_t g) {
  /* force calculation */
  double U = 0.5 * calc_force_cells(g);
  /* update velocities and positions */
  double K = update_cells(g);
  record_u_and_k(U, K);
  /* migrate molecules between cells */
  migrate_cells(g);
}

2.5 Force calculation

calc_force_cells computes the force on every molecule and returns the total potential energy. It simply visits all cells; nos here and in the rest of the code stands for g->n_cells_on_side, the number of cells along one side of the box.

double calc_force_cells(config_t g) {
  double U = 0.0;
  int i, j, k;
  /* visit every cell */
  for (i = 0; i < nos; i++) {
    for (j = 0; j < nos; j++) {
      for (k = 0; k < nos; k++) {
        U += calc_force_cell(g, i, j, k);
      }
    }
  }
  return U;
}

calc_force_cell(g, i, j, k) computes, for every molecule of cell (i, j, k), the force it receives from molecules in the same cell and in the neighboring cells, and returns the corresponding potential energy.

double calc_force_cell(config_t g, int i, int j, int k) {
  cell_t c = &g->cells[i*nos*nos + j*nos + k];
  double U = 0.0;
  int ii, jj, kk;
  int idx;
  if (c->particles_in_cell.n == 0) return 0.0;
  /* zero the force on every molecule in this cell */
  for (idx = 0; idx < c->particles_in_cell.n; idx++) {
    particle_t p = &c->particles_in_cell.a[idx];
    v_zero(p->f);
  }
  /* accumulate forces from this cell and the 26 neighboring cells */
  for (ii = i - 1; ii <= i + 1; ii++) {
    for (jj = j - 1; jj <= j + 1; jj++) {
      for (kk = k - 1; kk <= k + 1; kk++) {
        U += calc_force_cell_cell(g, i, j, k, ii, jj, kk);
      }
    }
  }
  return U;
}

calc_force_cell_cell(g, i, j, k, i', j', k') computes the forces that the molecules of cell (i, j, k) receive from the molecules of cell (i', j', k'), together with the corresponding potential energy; its body is straightforward pairwise iteration and is omitted here.
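For reference, a minimal sketch of what calc_force_cell_cell might look like follows. Everything specific in it is an assumption rather than code from the article: reduced units (σ = ε = 1), a cutoff equal to the cell side, and out-of-range neighbor indices handled by wrapping them and shifting the neighbor's coordinates by the box length (periodic images).

static double calc_force_cell_cell(config_t g, int i, int j, int k,
                                   int ii, int jj, int kk) {
  int nos = g->n_cells_on_side;
  double L = g->side_len;
  double rc2 = g->cell_len * g->cell_len;   /* cutoff^2; assumed = cell side */
  /* wrap neighbor indices outside [0, nos) and remember the coordinate
     shift of the corresponding periodic image */
  int w[3] = { ii, jj, kk };
  double shift[3] = { 0.0, 0.0, 0.0 };
  int d, a, b;
  for (d = 0; d < 3; d++) {
    if (w[d] < 0)         { w[d] += nos; shift[d] = -L; }
    else if (w[d] >= nos) { w[d] -= nos; shift[d] = +L; }
  }
  cell_t c  = &g->cells[i*nos*nos + j*nos + k];
  cell_t c2 = &g->cells[w[0]*nos*nos + w[1]*nos + w[2]];
  double U = 0.0;
  for (a = 0; a < c->particles_in_cell.n; a++) {
    particle_t p = &c->particles_in_cell.a[a];
    for (b = 0; b < c2->particles_in_cell.n; b++) {
      particle_t q = &c2->particles_in_cell.a[b];
      double dr[3], r2 = 0.0;
      if (c == c2 && a == b) continue;      /* skip self interaction */
      for (d = 0; d < 3; d++) {
        dr[d] = p->x[d] - (q->x[d] + shift[d]);
        r2 += dr[d] * dr[d];
      }
      if (r2 > rc2) continue;               /* beyond the cutoff */
      {
        /* Lennard-Jones in reduced units (sigma = epsilon = 1, assumed) */
        double s6  = 1.0 / (r2 * r2 * r2);  /* (sigma/r)^6  */
        double s12 = s6 * s6;               /* (sigma/r)^12 */
        double fr  = 24.0 * (2.0 * s12 - s6) / r2;   /* Eq. (2) */
        U += 4.0 * (s12 - s6);
        for (d = 0; d < 3; d++) p->f[d] += fr * dr[d];
      }
    }
  }
  return U;
}

Each unordered pair of molecules is visited twice in total, once from each side, which is why step halves the potential energy returned by calc_force_cells.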

2.6 Updating velocities and positions

Velocities and positions are advanced with the leapfrog method [5]. The velocity v stored in a particle represents the velocity at the half step t - Δt/2; given the force f computed for time t, one step updates velocity and position by

  v(t + Δt/2) = v(t - Δt/2) + (f / m) Δt,                            (3)
  x(t + Δt)   = x(t) + v(t + Δt/2) Δt.                               (4)

update_cells has the same structure as calc_force_cells: it visits every cell, updates the molecules in it with update_cell, and returns the total kinetic energy K.

double update_cells(config_t g) {
  int i, j, k;
  double K = 0.0;
  for (i = 0; i < nos; i++) {
    for (j = 0; j < nos; j++) {
      for (k = 0; k < nos; k++) {
        K += update_cell(g, i, j, k);
      }
    }
  }
  return K;
}
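update_cell is not shown in the article; a minimal sketch that applies Eqs. (3) and (4) directly might look like the following. The time step dt and its value are assumptions (the article does not show where the time step is kept), and the kinetic energy is computed from the half-step velocity, a common simplification of the leapfrog scheme.

static const double dt = 1.0e-3;   /* time step; an assumed value */

static double update_cell(config_t g, int i, int j, int k) {
  int nos = g->n_cells_on_side;
  cell_t c = &g->cells[i*nos*nos + j*nos + k];
  double K = 0.0;
  int idx, d;
  for (idx = 0; idx < c->particles_in_cell.n; idx++) {
    particle_t p = &c->particles_in_cell.a[idx];
    double v2 = 0.0;
    for (d = 0; d < 3; d++) {
      p->v[d] += p->f[d] / p->m * dt;   /* Eq. (3): v(t + dt/2) */
      p->x[d] += p->v[d] * dt;          /* Eq. (4): x(t + dt)   */
      v2 += p->v[d] * p->v[d];
    }
    K += 0.5 * p->m * v2;   /* kinetic energy from the half-step velocity */
  }
  return K;
}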

2.7 Migrating molecules between cells

After positions have been updated, a molecule may no longer lie inside the cell that stores it, so the last phase moves such molecules to their new cells (migrate_cells). Its overall structure is again the same as that of calc_force_cells and update_cells.

void migrate_cells(config_t g) {
  int i, j, k;
  for (i = 0; i < nos; i++) {
    for (j = 0; j < nos; j++) {
      for (k = 0; k < nos; k++) {
        migrate_cell(g, i, j, k);
      }
    }
  }
}

migrate_cell(g, i, j, k) examines every molecule of cell (i, j, k) and moves it to its new cell when necessary (migrate_particle). Note that idx is advanced only when the molecule stays: when a molecule is migrated away it is removed from the array, and the element that then occupies position idx has not been examined yet.

void migrate_cell(config_t g, int i, int j, int k) {
  cell_t c = &g->cells[i*nos*nos + j*nos + k];
  int idx = 0;
  while (idx < c->particles_in_cell.n) {
    particle_t p = &c->particles_in_cell.a[idx];
    if (migrate_particle(g, p, i, j, k) == 0) {
      idx++;
    }
  }
}

migrate_particle determines, from the updated position of molecule p, which cell it now belongs to. If that is still cell (i, j, k), nothing is done and 0 is returned; otherwise the molecule is moved from cell (i, j, k) to its new cell (i', j', k') and 1 is returned. translate_to_inside first wraps the position back into the bounding box (periodic boundary).

int migrate_particle(config_t g, particle_t p, int i, int j, int k) {
  int new_i, new_j, new_k;
  translate_to_inside(g, p);
  new_i = (int)floor(p->x[0] / g->cell_len) % nos;
  new_j = (int)floor(p->x[1] / g->cell_len) % nos;
  new_k = (int)floor(p->x[2] / g->cell_len) % nos;
  if (new_i < 0) new_i += nos;
  if (new_j < 0) new_j += nos;
  if (new_k < 0) new_k += nos;
  if (i == new_i && j == new_j && k == new_k) return 0;
  cell_t cell = &g->cells[i*nos*nos + j*nos + k];
  cell_t new_cell = &g->cells[new_i*nos*nos + new_j*nos + new_k];
  /* add the molecule to its new cell */
  particle_array_add(&new_cell->particles_in_cell, p);
  /* and remove it from its old cell */
  particle_array_remove(&cell->particles_in_cell, p);
  return 1;
}
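particle_array_add and particle_array_remove are used above but not defined in the article. A minimal sketch consistent with the particle_array struct and with the idx logic of migrate_cell follows; the doubling growth policy and removal by swapping in the last element are assumptions.

#include <stdlib.h>

/* Append a copy of *p to the array, growing the storage when it is full.
   The doubling policy is an assumption; the article does not show it. */
void particle_array_add(particle_array_t pa, particle_t p) {
  if (pa->n == pa->sz) {
    pa->sz = (pa->sz == 0 ? 8 : 2 * pa->sz);
    pa->a = (particle *)realloc(pa->a, sizeof(particle) * pa->sz);
  }
  pa->a[pa->n++] = *p;
}

/* Remove *p (which must point into pa->a) by overwriting it with the last
   element.  Not preserving the order is what lets migrate_cell leave idx
   unchanged after a migration. */
void particle_array_remove(particle_array_t pa, particle_t p) {
  int idx = (int)(p - pa->a);
  pa->a[idx] = pa->a[pa->n - 1];
  pa->n--;
}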

In the rest of this article, this program is parallelized with OpenMP, Cilk, and Intel TBB in turn.

3 Parallelization with OpenMP

All three phases (calc_force_cells, update_cells, migrate_cells) are triply nested loops over the cells, so OpenMP's parallel for applies directly. For calc_force_cells a single pragma suffices:

double calc_force_cells(config_t g) {
  double U = 0.0;
  int i, j, k;
#pragma omp parallel for collapse(3) reduction(+:U)
  for (i = 0; i < nos; i++) {
    for (j = 0; j < nos; j++) {
      for (k = 0; k < nos; k++) {
        U += calc_force_cell(g, i, j, k);
      }
    }
  }
  return U;
}

The collapse(3) clause makes the pragma treat the three perfectly nested loops as a single loop of nos^3 iterations and distribute those iterations among the threads; without it, the default (static) schedule would split only the nos iterations of the outermost i loop, limiting parallelism and load balance when nos is small. The reduction clause sums the partial values of U computed by the threads. migrate_cells needs more care than the other phases: migrating a molecule updates the particle arrays of two cells, the source and the destination, so threads working on different cells (i, j, k) may update the same destination cell at the same time, and those updates have to be synchronized. update_cells, on the other hand, touches only the molecules of its own cell and reduces the kinetic energy K, so it is parallelized exactly like calc_force_cells, as sketched below.
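The article states that update_cells receives the same treatment; this exact code is not shown there, but it presumably looks like the following, with the reduction now applied to the kinetic energy K:

double update_cells(config_t g) {
  double K = 0.0;
  int i, j, k;
#pragma omp parallel for collapse(3) reduction(+:K)
  for (i = 0; i < nos; i++) {
    for (j = 0; j < nos; j++) {
      for (k = 0; k < nos; k++) {
        K += update_cell(g, i, j, k);
      }
    }
  }
  return K;
}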

4 Parallelization with Cilk

MIT Cilk expresses parallelism by spawning procedure calls (spawn) rather than by annotating loops, so the natural way to parallelize the loops over cells is divide and conquer: the three-dimensional block of cells is split in two along its longest axis (x, y, or z), the two halves are processed in parallel, and the splitting is repeated recursively until a block contains a single cell. This decomposition is sometimes called Orthogonal Recursive Bisection. Compared with attaching a pragma to an existing loop, it requires rewriting the loop as a recursion, but the same divide-and-conquer style carries over directly to other task-parallel systems such as Chapel, X10, and TBB.

A block of cells, that is, the 3D index domain [i0,i1) x [j0,j1) x [k0,k1), is represented by:

/* the 3D index domain [i0,i1) x [j0,j1) x [k0,k1) */
typedef struct dom3 {
  int i0; int i1;
  int j0; int j1;
  int k0; int k1;
} dom3;

split_dom3 splits domain d into two halves p[0] and p[1] along its longest axis. It returns 0 when d consists of a single cell and cannot be split any further.

int split_dom3(dom3 d, dom3 p[2]) {
  int i0 = d.i0, i1 = d.i1;
  int j0 = d.j0, j1 = d.j1;
  int k0 = d.k0, k1 = d.k1;
  if (i1 - i0 == 1 && j1 - j0 == 1 && k1 - k0 == 1) return 0;
  else {
    p[0] = p[1] = d;     /* start from two copies of d */
    /* cut along the longest of the three axes */
    if (i1 - i0 >= j1 - j0 && i1 - i0 >= k1 - k0) {
      p[0].i1 = p[1].i0 = (i0 + i1) / 2;
    } else if (j1 - j0 >= k1 - k0) {
      p[0].j1 = p[1].j0 = (j0 + j1) / 2;
    } else {
      p[0].k1 = p[1].k0 = (k0 + k1) / 2;
    }
    return 1;
  }
}

d. cilk double calc_force_cells_aux(config_t g, dom3 d) { dom3 p[2]; if (split_dom3(d, p) == 0) { /* (1 ) */ double K; spawn K = calc_force_cell(g, d.i0, d.j0, d.k0); sync; return K; else { double K0, K1; K0 = spawn calc_force_cells_aux(g, p[0]); K1 = spawn calc_force_cells_aux(g, p[1]); sync; return K0 + K1; split dom3,,. OpenMP,. cilk double calc_force_cells(config_t g) { dom3 d = { 0, nos, 0, nos, 0, nos ; double K; K = spawn calc_force_cells_aux(g, d); sync; return K; update cells, migrate cells., Cilk, OpenMP task, Intel TBB taskgroup., Intel Cilk Plus, GCC 4.7 Cilk, cilk for for, for 2. cilk for,,,. 5 Intel TBB Intel TBB 3 (blocked range, blocked range2d, blocked range3d),, parallel for, parallel reduce. calc force cells, (reduction). parallel reduce [3]. parallel reduce,

5 Parallelization with Intel TBB

Intel TBB provides range types for one-, two-, and three-dimensional index spaces (blocked_range, blocked_range2d, blocked_range3d) and loop templates such as parallel_for and parallel_reduce that run over them. Since calc_force_cells not only visits all cells but also sums up the potential energy (a reduction), parallel_reduce is the appropriate template; see [3] for the design patterns behind it. parallel_reduce is invoked as

parallel_reduce(range, body);

where range is a blocked_range, blocked_range2d, or blocked_range3d describing the whole iteration space, and body is an object of a user-defined class C (the reduction body) that provides the following three members.

Splitting constructor, C(const C& c, tbb::split): creates a new body (this) from an existing body c, inheriting whatever it needs from c (such as pointers to the data) but starting from an empty partial result. parallel_reduce calls it when it decides to hand part of the range to another thread; the argument of type tbb::split only distinguishes it from a copy constructor.

void operator()(const range& r): processes the subrange r handed to this body by parallel_reduce and accumulates the result into the body.

void join(const C& c): merges into this the partial result of another body c, one created by the splitting constructor and filled in by operator().

The following array_sum class, which computes the sum of the elements of an array, illustrates the protocol:

class array_sum {
  double * a;
public:
  double val;                 /* the (partial) sum computed so far */
  /* ordinary constructor */
  array_sum(double * a) { this->a = a; val = 0.0; }
  /* splitting constructor */
  array_sum(const array_sum &c, tbb::split) { a = c.a; val = 0.0; }
  /* accumulate a subrange */
  void operator()(const blocked_range<int>& r) {
    int i;
    for (i = r.begin(); i != r.end(); i++) {
      val += a[i];
    }
  }
  /* merge another body's partial sum */
  void join(const array_sum &other) { val += other.val; }
};

With this class, the sum of an array is computed as:

double sum(double * a, int n) {
  array_sum s(a);
  parallel_reduce(blocked_range<int>(0, n), s);
  return s.val;
}

If C++ lambda expressions are available, the same computation can be written more compactly with the functional form of parallel_reduce, which takes the identity value of the reduction and the reduction operator as extra arguments:

#include <tbb/tbb.h>
#include <functional>
using namespace tbb;

double sum_without_class(double * a, int n) {
  return parallel_reduce(blocked_range<int>(0, n),
                         0.0,   /* identity value of the reduction */
                         [=](const blocked_range<int>& r, double ps)->double {
                           int i;
                           for (i = r.begin(); i != r.end(); ++i) {
                             ps += a[i];
                           }
                           return ps;
                         },
                         std::plus<double>());
}

calc_force_cells is written in exactly the same way, using blocked_range3d for the three-dimensional space of cells; the three nested loops inside the lambda run over the pages, rows, and columns of the subrange r.

double calc_force_cells_iter(config_t g) {
  return parallel_reduce(blocked_range3d<int>(0, nos, 0, nos, 0, nos),
                         0.0,   /* identity value of the reduction */
                         [=](const blocked_range3d<int>& r, double U)->double {
                           int i, j, k;
                           for (i = r.pages().begin(); i != r.pages().end(); ++i) {
                             for (j = r.rows().begin(); j != r.rows().end(); ++j) {
                               for (k = r.cols().begin(); k != r.cols().end(); ++k) {
                                 U += calc_force_cell(g, i, j, k);
                               }
                             }
                           }
                           return U;
                         },
                         std::plus<double>());
}

Figure 1: Sequential performance of the C version (GFLOPS) as a function of the density.

6 Experiments

6.1 Evaluation environment

The measurements below use a 24-core machine (48 hardware threads):

CPU: Intel Nehalem-EX (E7540), 2.0 GHz (6 cores / 12 hardware threads per socket, 4 sockets)
L3 cache: 18 MB
memory: 24 GB DDR3

6.2 Initial conditions

Molecules are initially laid out on a regular lattice, and each of the x, y, and z coordinates is then perturbed by a random offset of up to ±0.05σ. The cells are 3.2σ on a side, which puts a few tens of molecules (roughly 35-40) in each cell, and each run uses 500,000 molecules in total.

Figure 2: Single-thread performance of the OpenMP, Cilk, and TBB versions relative to the C version.

6.3 Sequential performance

Figure 1 shows the performance of the sequential C version, expressed in GFLOPS by counting the floating-point operations of the pairwise force evaluations. Performance improves as the density grows: with more molecules n per cell, the pairwise force computation grows as n^2 while the per-cell and per-molecule overheads grow only as n, so a larger fraction of the time goes into the force arithmetic itself. At density 1.0 the C version reaches about 1.5 GFLOPS; since the code makes no use of SIMD instructions, this is still far below the peak of the processor. Figure 2 compares the single-thread performance of the OpenMP, Cilk, and TBB versions with that of the C version; the overhead each of the three systems adds is small, and their relative performance stays close to 1.

6.4 Speedup

Figure 3 shows how the three versions speed up as the number of cores grows, relative to execution on a single core. The speedup scales well at small core counts (with a parallel efficiency of around 95%) and reaches about 17-18 with all 24 cores.

Figure 3: Speedup of the OpenMP, Cilk, and TBB versions as the number of cores grows.

7 Summary

This installment parallelized a molecular dynamics simulation based on the Lennard-Jones potential with OpenMP, Cilk, and TBB. OpenMP's parallel for, Cilk's spawn-based divide and conquer, and TBB's parallel_reduce and parallel_for each let the cell-based sequential code be parallelized with only small changes, and all three achieved speedups of roughly 17-19 on 24 cores.

8 Plans for future installments

Future installments will continue the series with other problems and with other high-productivity parallel languages (Chapel, X10, UPC, and others).

References

[1] Daan Frenkel and Berend Smit. Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, 1996.
[2] D. C. Rapaport. The Art of Molecular Dynamics Simulation. Cambridge University Press.
[3] Intel Threading Building Blocks Design Patterns. Available from http://threadingbuildingblocks.org/documentation.php. Last accessed Nov. 9, 2011.
[4] (2nd edition), 2011.
[5] HOW TO, 2004.