Slide 1

1 High Performance and Productivity: Issues and Challenges in Parallel Programming

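2 Background to the growing use of HPC systems
Demands on simulation: more complex problems must be simulated with greater accuracy.
Growing demand for parallel processing on HPC systems:
1. Models, algorithms, and analysis targets are all becoming more complex and larger in scale.
2. Microprocessors are moving to multicore.
3. Parallel computer systems, typified by clusters, are making HPC systems commonplace and less expensive.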

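3 Microprocessor performance
Improving microprocessor performance: from clock frequency to multicore.
Performance gains now come from parallel processing on multicore processors.
Gains beyond what was previously possible can be achieved by making full use of multicore technology (multithreading and multitasking).
Acquiring the necessary skills and putting development environments in place is an urgent task.
[Figure labels: performance gains from higher clock frequency; 2005; future processors]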

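4 Parallel applications
The conventional development process for parallel applications: a batch-workflow process that takes months to years.
Desktop: prototype development
Programming for the HPC system: C/C++, Fortran, MPI
Testing and performance tuning on the HPC system
Production simulation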

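5 Desktop
Prototype computation on the desktop:
High-level languages (for example, MATLAB)
Processing in tools such as Microsoft Excel
Users program without having to be aware of the system architecture or the microprocessor's microarchitecture.
GUI-based development environments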

6 Challenges in parallel programming
Programming APIs used on the PC
[Chart; multiple answers allowed]
Source: The Development of Custom Parallel Computing Applications, Simon Management Group, September 2006

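7 Parallel application programming
Requires specialist knowledge and study of parallel APIs.
Programs are built with C, Fortran, or similar compilers, and communication for parallel processing is written explicitly with MPI (Message Passing Interface) or the like.
Users must program with the architecture of the target HPC system in mind.
When an HPC system is used during development, jobs have to be submitted through a batch system to check the program.
Verifying the actual model and algorithm against the real analysis target is difficult until the program is complete.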

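8 Programming productivity
Development cycle: many tasks are required besides writing the code.
Debugging the program:
  Requires the actual analysis model and input data.
  Requires serial and interactive processing.
Achieving scalability:
  High scalability on an HPC system requires advanced programming.
  The choice of algorithm and the modeling of the data must be considered.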

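9 Challenges in parallel programming
Bottlenecks in parallel application development: dissatisfaction with debugging environments and development tools.
Debugging 21.6%
HPC Software tools 20.7%
Code writing 18.1%
(Multiple answers allowed)
Source: The Development of Custom Parallel Computing Applications, Simon Management Group, September 2006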

10 Parallel programming
Tools surrounding the code: parallel compiler, parallel debugger, parallel math libraries, parallel code optimization tools, thread analysis and optimization tools, MPI task analysis and optimization tools, integrated interface.

    #include <omp.h>
    #include <stdio.h>
    #include <time.h>
    static int num_steps = 1000000000;
    double step;
    int main ()
    {
      int i, nthreads;
      double start_time, stop_time;
      double x, pi, sum = 0.0;
      step = 1.0/(double) num_steps;
    #pragma omp parallel private(x)
      {
        nthreads = omp_get_num_threads();
    #pragma omp for reduction(+:sum)
        for (i=0;i< num_steps; i++){
          x = (i+0.5)*step;
          sum = sum + 4.0/(1.0+x*x);
        }
      }
      pi = step * sum;
      printf("%5d Threads : The value of PI is %10.7f\n",nthreads,pi);
    }
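As a reading aid (the slide itself does not state the formula), the loop in this sample can be read as the midpoint rule applied to a standard integral for pi. Here N stands for num_steps and Delta for step; both symbols are introduced only for this note.

```latex
\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
\;\approx\; \Delta \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\qquad x_i = (i + 0.5)\,\Delta, \quad \Delta = \frac{1}{N}
```

Each loop iteration adds one term of the sum, which is why the terms can be divided among threads and combined with a reduction.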

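11 Closing the software gap
Desktop: Windows environment; thread-based parallel processing; interactive processing; rich debugging tools and development environments.
Cluster systems: use in batch environments; complex debugging; programming with message passing such as MPI; Linux (Unix).
[Axis labels: workstation, server, cluster; #Processors]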

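12 The programming gap
[Diagram labels]
Months to years
Production simulation
Development of applications with scalable performance
Prototype development; parallel programming (C/C++, MPI, OpenMP); testing and performance tuning
Desktop: prototype development; testing and performance tuning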

13 Improving programming productivity
Scalable SMP system
[Diagram labels: desktop; prototype development; testing and performance tuning on the HPC system; months]

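14 High-end virtualization
[Diagram labels: multiple virtual machines; server (no virtualization); virtualization software; applications; operating system. A single virtual machine; applications; operating system; virtualization software.]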

15 ScaleMP vSMP architecture
Applications: 100% binary compatibility with other x86 systems.
OS: a standard Linux distribution can be used.
Hardware: systems can be built from commodity x86 chipsets and standard interconnects.
vSMP Foundation extends the system into an SMP.

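16 VXSMP 1440 system
What is the VXSMP 1440 system?
A vSMP system based on the VXPRO R1440.
Four Xeon 55xx/54xx/52xx processors in a 1U chassis, providing an SMP of up to 16 cores.
Up to 96GB of shared memory space.
Four HDDs shared by the OS (up to 6TB).
The photo shows the Xeon 5400 model.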

17 VXSMP1400, Xeon 5400 model

18 Stream (OMP) benchmark
[Figure: bandwidth (MB/sec.) versus number of threads (cores). Series: 1333MHz FSB (128 cores/16 boards), 1600MHz FSB (128 cores/16 boards), 6.4GT/s QPI (128 cores/16 boards).]
[Table: columns Machine ID, ncpus, COPY, SCALE, ADD, TRIAD. All results are in MB/s, with MB=10^6 B, *not* 2^20 B. Rows include SGI Altix, NEC SX, IBM Power, IBM System p5, HP AlphaServer GS, HP Integrity SuperDome, Cray, Fujitsu/Sun Enterprise M, SGI Origin, and Azul Vega2 systems alongside ScaleMP_XeonX5570_vSMP_16B, ScaleMP_XeonX5570_vSMP_8B, and VXSMP 2800 (Xeon X5550).]
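For readers unfamiliar with the column names, the following is a minimal sketch of the TRIAD kernel and of how a bandwidth figure in MB/s (MB = 10^6 bytes) could be derived from it. It is not the STREAM source code; the function names triad, triad_bytes, and bandwidth_mb_per_s are made up for this illustration, and timing is left to the caller.

```c
#include <stddef.h>

/* TRIAD kernel: a[i] = b[i] + q * c[i].
 * The pragma is simply ignored if the compiler is run without OpenMP. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    long i;
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        a[i] = b[i] + q * c[i];
    }
}

/* Bytes counted for one TRIAD pass: two arrays read, one array written. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Bandwidth in MB/s, using MB = 10^6 bytes rather than 2^20 bytes. */
double bandwidth_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1.0e6;
}
```

A caller would time one call to triad and pass triad_bytes(n) and the elapsed seconds to bandwidth_mb_per_s.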

19 OpenMP benchmark: NAS Parallel Benchmark (Multi-Zone)
[Figure: MFLOPS/S versus number of OpenMP threads (N processor cores) for LU-MZ and SP-MZ]
NPB-MZ (NPB Multi-Zone), part of the well-known public NAS Parallel Benchmark (NPB) suite, provides coarser-grained parallelism.
NPB-MZ can be used to test hybrid parallel processing and nested OpenMP.
The results here evaluate performance with OpenMP parallelism only.

20 OpenMP program: compile and run example

    $ cat -n pi.c
         1  #include <omp.h>                      // header file for calling
         2  #include <stdio.h>                    // OpenMP runtime functions
         3  #include <time.h>
         4  static int num_steps = 1000000000;
         5  double step;
         6  int main ()
         7  {
         8    int i, nthreads;
         9    double start_time, stop_time;
        10    double x, pi, sum = 0.0;
        11    step = 1.0/(double) num_steps;       // OpenMP sample program:
        12  #pragma omp parallel private(x)        // set up the parallel region
        13    { nthreads = omp_get_num_threads();  // get the thread count via a runtime function
        14  #pragma omp for reduction(+:sum)       // for work-sharing construct
        15      for (i=0;i< num_steps; i++){       // with private and reduction
        16        x = (i+0.5)*step;                // clauses
        17        sum = sum + 4.0/(1.0+x*x);
        18      }
        19    }
        20    pi = step * sum;
        21    printf("%5d Threads : The value of PI is %10.7f\n",nthreads,pi);
        22  }
    $ icc -O -openmp pi.c
    pi.c(14) : (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
    pi.c(12) : (col. 2) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
    $ setenv OMP_NUM_THREADS 2
    $ a.out
        2 Threads : The value of PI is
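pi.c declares start_time and stop_time but never uses them. The sketch below shows one way the same loop could be timed; it is an illustration rather than part of the original slides. The helper names now_seconds and timed_pi are hypothetical, and the fallback to clock() (CPU time, not wall-clock time) when OpenMP is disabled is an assumption made so that the code builds either way.

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

/* Current time in seconds: wall-clock time from omp_get_wtime() when
 * OpenMP is enabled, CPU time from clock() as a fallback otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Midpoint-rule estimate of pi using the same loop body as pi.c.
 * The time spent in the loop is stored in *elapsed. */
double timed_pi(long num_steps, double *elapsed)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    double start_time = now_seconds();
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* declared in the loop body, so private */
        sum += 4.0 / (1.0 + x * x);
    }

    *elapsed = now_seconds() - start_time;
    return step * sum;
}
```

Running it with OMP_NUM_THREADS set to 1, 2, 4, and so on, and comparing the elapsed times, gives a rough speedup curve for this loop.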

21 Shared data
Distributed virtual shared memory (DVSM)
Intel Cluster OpenMP
[Diagram labels: DVSM; multithreaded program ...]

22 Cluster OpenMP program: compile and run example

    $ cat -n cpi.c
         1  #include <omp.h>                      // header file for calling
         2  #include <stdio.h>                    // OpenMP runtime functions
         3  #include <time.h>
         4  static int num_steps = 1000000;
         5  double step;
         6  #pragma intel omp sharable(num_steps)
         7  #pragma intel omp sharable(step)
         8  int main ()
         9  {
        10    int i, nthreads;
        11    double start_time, stop_time;
        12    double x, pi, sum = 0.0;
        13  #pragma intel omp sharable(sum)
        14    step = 1.0/(double) num_steps;       // OpenMP sample program:
        15  #pragma omp parallel private(x)        // set up the parallel region
        16    {
        17      nthreads = omp_get_num_threads();  // get the thread count via a runtime function
        18  #pragma omp for reduction(+:sum)       // for work-sharing construct
        19      for (i=0;i< num_steps; i++){       // with private and reduction
        20        x = (i+0.5)*step;                // clauses
        21        sum = sum + 4.0/(1.0+x*x);
        22      }
        23    }
        24    pi = step * sum;
        25    printf("%5d Threads : The value of PI is %10.7f\n",nthreads,pi);
        26  }
        27
    $ icc -cluster-openmp -O -xt cpi.c
    cpi.c(18) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
    cpi.c(15) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
    $ cat kmp_cluster.ini
    --hostlist=node0,node1 --processes=2 --process_threads=2 --no_heartbeat --startup_timeout=500
    $ ./a.out
        4 Threads : The value of PI is

23 Cluster OpenMP program benchmark
System scalability sample
System: NEXXUS 4820-PT, 2.66GHz/1066MHz FSB/16GB Memory/InfiniBand
Program samples:
NAS Parallel Benchmark / EP benchmark
OpenMP sample program (molecular dynamics sample, run with nparts=10000)
OpenMP sample (Jacobi method sample, 5000x5000)

24 Parallel processing with a single API
Cluster OpenMP allows programming with the same parallel API both within a node (SMP) and across nodes.
[Diagram labels: Vertical Scaling (MPI, OpenMP); Horizontal Scaling (MPI)]

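25 MPI/OpenMP hybrid model
MPI handles coarse-grained parallelism such as domain decomposition.
OpenMP handles finer-grained parallelism, such as loop parallelization, within each MPI task.
The computation has a task-thread hierarchy.
[Diagram: MPI tasks connected by a high-performance interconnect; each node has its own memory and four processors (P) running OpenMP threads]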
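The two levels of the hierarchy can be sketched without any MPI calls. In the illustration below, block_range shows one common way an index range might be divided among tasks (the coarse level), and local_sum shows the loop inside one task being shared among OpenMP threads (the fine level). Both function names and the choice of a simple block distribution are assumptions made for this example; the slides do not prescribe a particular decomposition.

```c
/* Coarse level: split n iterations into `size` contiguous blocks and
 * return the half-open range [*lo, *hi) owned by task `rank`.
 * The first n % size tasks each receive one extra iteration. */
void block_range(long n, int size, int rank, long *lo, long *hi)
{
    long base = n / size;
    long extra = n % size;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Fine level: the work inside one task's block is shared among OpenMP
 * threads. The "work" here is just summing the indices in the block. */
double local_sum(long lo, long hi)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = lo; i < hi; i++) {
        sum += (double)i;
    }
    return sum;
}
```

In an actual hybrid program, rank and size would come from MPI_Comm_rank and MPI_Comm_size, and the per-task results would be combined with a call such as MPI_Reduce.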

26 MPI/OpenMP hybrid code
Add OpenMP parallelization to an application already parallelized with MPI.
Efficient parallel processing is achieved by combining MPI communication with OpenMP work-sharing.

Fortran:

          program hybsimp
          include 'mpif.h'
          call MPI_Init(ierr)
          call MPI_Comm_rank (...,irank,ierr)
          call MPI_Comm_size (...,isize,ierr)
    ! Setup shared mem, comp. & Comm
    !$OMP parallel do
          do i=1,n
            <work>
          enddo
    ! compute & communicate
          call MPI_Finalize(ierr)
          end

C/C++:

    #include <mpi.h>
    int main(int argc, char **argv){
      int rank, size, ierr, i;
      ierr= MPI_Init(&argc,&argv);
      ierr= MPI_Comm_rank (...,&rank);
      ierr= MPI_Comm_size (...,&size);
      //Setup shared mem, compute & Comm
    #pragma omp parallel for
      for(i=0; i<n; i++){
        <work>
      }
      // compute & communicate
      ierr= MPI_Finalize();
    }

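27 OpenMP/MPI hybrid model
MPI is a proven, high-performance communication library.
Computation and communication can also be executed asynchronously.
Communication can be performed by the master thread, by a single thread, or by all threads.
[Diagram: MPI tasks connected by a high-performance interconnect; each node has its own memory and four processors (P) running OpenMP threads]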

28 OpenMP/MPI hybrid code
Add MPI communication to an OpenMP program.
An option when extending an existing OpenMP program or when developing a new threaded program.
MPI is a very fast, well-optimized data communication library.

Fortran:

          program hybmas
          include 'mpif.h'
    !$OMP parallel
    !$OMP barrier
    !$OMP master
          call MPI_<Whatever>(...,ierr)
    !$OMP end master
    !$OMP barrier
    !$OMP end parallel
          end

C/C++:

    #include <mpi.h>
    int main(int argc, char **argv){
      int rank, size, ierr, i;
    #pragma omp parallel
      {
    #pragma omp barrier
    #pragma omp master
        {
          ierr=MPI_<Whatever>(...);
        }
    #pragma omp barrier
      }
    }

29 ScaleMP vSMP architecture
Applications: 100% binary compatibility with other x86 systems.
OS: a standard Linux distribution can be used.
Hardware: systems can be built from commodity x86 chipsets and standard interconnects.
vSMP Foundation extends the system into an SMP.

30 OpenMP/MPI/hybrid
[Figure: speedup versus number of OpenMP threads and number of MPI tasks]
Hybrid OpenMP MPI Benchmark project ("homb")
This project was registered on SourceForge.net on May 16, 2009, and is described by the project team as follows: HOMB is a simple benchmark based on a parallel iterative Laplace solver aimed at comparing the performance of MPI, OpenMP, and hybrid codes on SMP and multi-core based machines.
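To give a feel for what a parallel iterative Laplace solver computes, here is a minimal sketch of a single Jacobi sweep on a 2-D grid with the loop shared among OpenMP threads. It is not taken from HOMB; the function name jacobi_sweep, the row-major layout, and the use of the summed squared change as a convergence measure are all assumptions made for this illustration.

```c
/* One Jacobi sweep for the 2-D Laplace equation on an nx-by-ny grid stored
 * row-major. Each interior point of `out` becomes the average of its four
 * neighbours in `in`; boundary values are copied through unchanged.
 * Returns the sum of squared changes over the interior points. */
double jacobi_sweep(const double *in, double *out, int nx, int ny)
{
    double change = 0.0;
    int i, j;

    /* Boundary values are fixed: copy first/last rows and columns. */
    for (j = 0; j < ny; j++) {
        out[j] = in[j];
        out[(nx - 1) * ny + j] = in[(nx - 1) * ny + j];
    }
    for (i = 0; i < nx; i++) {
        out[i * ny] = in[i * ny];
        out[i * ny + ny - 1] = in[i * ny + ny - 1];
    }

    /* Rows are divided among threads; `change` is combined by reduction. */
    #pragma omp parallel for private(j) reduction(+:change)
    for (i = 1; i < nx - 1; i++) {
        for (j = 1; j < ny - 1; j++) {
            double v = 0.25 * (in[(i - 1) * ny + j] + in[(i + 1) * ny + j]
                             + in[i * ny + j - 1] + in[i * ny + j + 1]);
            double d = v - in[i * ny + j];
            out[i * ny + j] = v;
            change += d * d;
        }
    }
    return change;
}
```

A solver would call this repeatedly, swapping the two buffers, until the returned value falls below a tolerance. An MPI or hybrid version would additionally exchange the edge rows of each task's sub-grid between sweeps.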

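31 OpenMP benchmark: NAS Parallel Benchmark (Multi-Zone)
[Figure: MFLOPS/S versus number of OpenMP threads (N processor cores) for SP-MZ; inset shows zones along the x, y, and z axes (x-zones)]
NPB-MZ (NPB Multi-Zone), part of the well-known public NAS Parallel Benchmark (NPB) suite, provides coarser-grained parallelism.
NPB-MZ can be used to test hybrid parallel processing and nested OpenMP.
The results here evaluate performance with OpenMP parallelism only.
System: Xeon 5550 (2.66GHz), vSMP Foundation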

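32 Closing the software gap
Desktop: Windows environment; thread-based parallel processing; interactive processing; rich debugging tools and development environments.
Cluster systems: use in batch environments; complex debugging; programming with message passing such as MPI; Linux (Unix).
Bridging the two: the vSMP Foundation platform and Cluster OpenMP.
[Axis labels: workstation, server, cluster; #Processors]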

33 Contact
Phone: 0120-090715
From mobile phones and PHS (charges apply): 03-5875-4718
Hours: 9:00-18:00 (excluding weekends and public holidays)
Web inquiries: www.sstc.co.
Unauthorized quotation or reproduction of this material is prohibited. Company and product names are generally trademarks or registered trademarks of their respective companies. TM marks are not shown in the text.
Copyright Scalable Systems Co., Ltd., Unauthorized use is strictly forbidden. 10/16/2009
