Parallel Studio XE 2013
XLsoft K.K. (エクセルソフト株式会社)
www.xlsoft.com
Rev. 1.0 (2012/10/30)
Parallel Studio XE 2013 (C++)

Parallel Studio XE 2013 works with Microsoft* Visual Studio* and consists of four tools:

- Advisor XE 2013
- Composer XE 2013
    - C++ Compiler 13.0 (IA-32 / Intel 64)
    - Fortran Compiler 13.0 (IA-32 / Intel 64)
    - IPP 7.1
    - MKL 11.0
    - TBB 4.1
- Inspector XE 2013
- VTune Amplifier XE 2013 (CPU performance analysis)
Visual Studio integration: Composer XE (VS2008), Advisor XE, VTune Amplifier XE, Inspector XE.

The four tools are covered in this order: Advisor XE, Composer XE, Inspector XE, VTune Amplifier XE.

Environment used in this guide:

- CPU: Intel(R) Core(TM) i7-2600 CPU 3.4GHz (Sandy Bridge, 8 logical processors)
- OS: Microsoft Windows 7 Professional (x64)
- IDE: Microsoft Visual Studio 2010 Professional (VS2010)
[Figure: workflow across Advisor XE, Composer XE, Inspector XE, and VTune Amplifier XE]
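Advisor XE: the N-Queens sample

This guide uses the N-Queens sample that ships with Advisor XE as a Zip file:

C:\Program Files (x86)\Intel\Advisor XE 2013\samples\en\C++\nqueens_advisor.zip

Extract the Zip file. The serial version is the 1_nqueens_serial project in nqueens_advisor, and its source file is nqueens_serial.cpp. The N-Queens problem counts the ways to place N Queens on an N x N chessboard so that no Queen attacks another.

Structure of nqueens_serial.cpp:

- main() takes the board size N as an argument; the default N is 14.
- main() solves the N-Queens problem by calling solve(), and measures the time with timeGetTime() before and after the call.
- solve() calls setqueen() N times, placing a Queen in row 0 at column i for each i.
- setqueen() checks the Queen against the rows already placed, then calls setqueen() recursively for the next row. When a Queen is placed in the last row, it increments nrofsolutions.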
    int main(int argc, char*argv[])
    {
        if(argc!=2) {
            cerr << "Usage: 1_nqueens_serial.exe boardsize [default is 14].\n";
            size = 14;   // default chessboard size
        }
        else {
            size = atoi(argv[1]);
            if (size < 4 || size > 15) {
                cerr << "Boardsize should be between 4 and 15; setting it to 14.\n" << endl;
                size = 14;   // default chessboard size
            }
        }
        // (omitted)
        DWORD starttime=timeGetTime();
        solve();   // call the function that solves the N-Queens problem
        DWORD endtime=timeGetTime();
        // (omitted)
        return 0;
    }

    void solve()
    {
        int * queens = new int[size];
        for(int i=0; i<size; i++) {
            setqueen(queens, 0, i);   // call the compute function once per column (N times)
        }
    }

    void setqueen(int queens[], int row, int col)
    {
        // check all previously placed rows for attacks
        for(int i=0; i<row; i++) {
            // vertical attacks
            if (queens[i]==col) {
                return;
            }
            // diagonal attacks
            if (abs(queens[i]-col) == (row-i) ) {
                return;
            }
        }
        // column is ok, set the queen
        queens[row]=col;
        if(row==size-1) {
            nrofsolutions++;   // count one N-Queens solution
        }
        else {
            // try to fill next row
            for(int i=0; i<size; i++) {
                setqueen(queens, row+1, i);   // recursive call that computes the next row
            }
        }
    }
Composer XE: building with the C++ compiler in VS2010

1. Extract nqueens_advisor.zip.
2. Open nqueens_advisor.sln in VS2010.
3. Switch the 1_nqueens_serial project to the Intel C++ compiler: [ ] > [Intel(R) Composer XE 2013] > [Intel(R) C++ ].
4. Select the Release configuration.
5. Build and run: [ ] > [ ]. With the default board size of 14, the number of solutions is 365596.
Advisor XE

The Advisor XE Workflow has five steps:

1. Survey Target: find the Hotspots.
2. Annotate Sources: mark the Hotspot as a candidate for parallel execution.
3. Check Suitability
4. Check Correctness
5. Add Parallel Framework

Survey Target

The Survey Report shows the Hotspots. Top Loops lists loops by Total Time.
In the Summary, Top Loops shows the loop in solve (nqueens_serial.cpp, line 106).
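Annotate Sources

Annotate the loop in solve by uncommenting the three annotation lines:

    void solve()
    {
        int * queens = new int[size];
        //ANNOTATE_SITE_BEGIN(solve);               // uncomment this line
        for(int i=0; i<size; i++) {
            // try all positions in first row
            //ANNOTATE_ITERATION_TASK(setQueen);    // uncomment this line
            setqueen(queens, 0, i);
        }
        //ANNOTATE_SITE_END();                      // uncomment this line
    }

ANNOTATE_SITE_BEGIN and ANNOTATE_SITE_END enclose the region to be parallelized. ANNOTATE_ITERATION_TASK marks each loop iteration, that is each setqueen call, as a task, so Check Suitability models setqueen being executed size times in parallel.

The annotations are declared in advisor-annotate.h. Uncomment the include in nqueens_serial.cpp as well:

    //#include <advisor-annotate.h>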
Add the header location to the project's include directories:

    $(ADVISOR_XE_2013_DIR)/include

Check Suitability

Run Check Suitability to open the Suitability Report. The Suitability Report has two views, All Sites and Selected Site.
Check Correctness

Run Check Correctness on the Debug configuration. The value 8 is used for this run.
The Correctness Report lists 2 entries under Problems and Messages:

- P2: Data Communication
- P3: Memory reuse

Both also appear in the Summary.
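Composer XE: adding parallelism

Composer XE supports the following threading methods:

- Win 32 API: threads are created explicitly with CreateThread or _beginthread.
- OpenMP 3.1: parallel regions are marked with #pragma omp. Unlike the Win 32 API, thread creation is left to OpenMP and the OS, and the code is changed only by adding #pragma lines. OpenMP uses the Fork-Join model.
- Cilk Plus: a C++ language extension with the keywords cilk_spawn, cilk_sync, and cilk_for.
- Threading Building Blocks (TBB)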
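TBB is a C++ template library in the style of the STL.

This guide uses OpenMP to parallelize the C++ function solve. Two OpenMP constructs are involved:

- #pragma omp parallel { ... } runs the enclosed block, for example a printf call, on every thread. The number of threads is decided by OpenMP and the OS.
- #pragma omp for divides the iterations of the following for loop, for( i=0; i<n; i++ ), from 0 to N among those threads.

The loop that Advisor XE identified is parallelized by combining both into #pragma omp parallel for:

    void solve()
    {
        int * queens = new int[size];
        #pragma omp parallel for
        for(int i=0; i<size; i++) {
            // try all positions in first row
            setqueen(queens, 0, i);
        }
    }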
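OpenMP divides the size iterations of the for loop, and with them the setqueen calls, among the threads. With size 14 (i = 0 to 13) and the 8 threads of the Intel(R) Core(TM) i7-2600, the figure shows this assignment:

    Thread   i
    0        0, 1
    1        2, 3
    2        4, 5
    ...
    6        12
    7        13

    solve()
        int * queens = new int[size];
        #pragma omp parallel for
        Fork
            <thread 0>  setqueen(queens,0,0)  setqueen(queens,0,1)
            <thread 1>  setqueen(queens,0,2)  setqueen(queens,0,3)
            <thread 2>  setqueen(queens,0,4)  setqueen(queens,0,5)
            ...
            <thread 7>  setqueen(queens,0,13)
        Join

Adding #pragma omp parallel for to the for loop is not enough for the Release build: #pragma omp is honored only when the C++ compiler option /Qopenmp is enabled. Enable OpenMP for the 1_nqueens_serial project.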
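The assignment in the figure (14 iterations over 8 threads, two iterations for the first six threads and one for the last two) matches an even block split. The sketch below reproduces that arithmetic only; `static_chunk` is a hypothetical helper, not an OpenMP API, and the OpenMP specification leaves the default schedule to the implementation, so treat this as a model of the figure rather than a guarantee.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper (not part of OpenMP): returns the half-open range
// [begin, end) of loop iterations that thread `tid` would receive when
// `n` iterations are split into near-equal contiguous blocks.
// Assumption: the first (n % num_threads) threads get one extra iteration.
std::pair<int, int> static_chunk(int n, int num_threads, int tid) {
    assert(n >= 0 && num_threads > 0 && tid >= 0 && tid < num_threads);
    const int base = n / num_threads;    // iterations every thread gets
    const int extra = n % num_threads;   // leftover iterations to hand out
    const int begin = tid * base + (tid < extra ? tid : extra);
    const int end = begin + base + (tid < extra ? 1 : 0);
    return {begin, end};
}
```

For example, `static_chunk(14, 8, 0)` covers i = 0 and 1, while `static_chunk(14, 8, 7)` covers only i = 13, as in the figure.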
For N-Queens with board size 14 the correct result is 365596. The problems that Advisor XE reported in Check Correctness are examined next with Composer XE and Inspector XE.
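Composer XE: static analysis

Composer XE can analyze source code that uses OpenMP* and Intel Cilk Plus. The results are viewed in the Inspector XE GUI.

1. Select the 1_nqueens_serial project: [ ].
2. Select the Release configuration.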
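Results are classified as Critical, Error, or Warning. A static analysis can produce both False Positive and False Negative results. Click [OK] to open the results in Inspector XE.

The analysis reports 2 problems: 1 Critical and 1 Warning. The Problems list is ordered by Weight. The reported problems involve nrofsolutions and solve().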
Complexity Metrics: Cyclomatic complexity (Intel_SSA)
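Inspector XE

Inspector XE analyzes a Debug build. Enable OpenMP for the Debug configuration as well: [C/C++] > [ [Intel(R) C++]] > [OpenMP ], which sets /Qopenmp. Under [ ] - [ ] > [ ], the value 8 is set for the Inspector XE run.

In Inspector XE:

1. Choose New Analysis.
2. In Configure Analysis Type, choose Threading Error Analysis.
3. Select the third level, Locate Deadlocks and Data Races.

The Stack frame depth and the Scope (Normal or Extremely Thorough) can be adjusted.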
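Click [Start]. When the analysis finishes, the Summary opens. Problems lists 2 Data race entries, P1 and P2.

- P1: queens. Several threads Write to queens; the Timeline shows which threads are involved.
- P2: nrofsolutions.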
Inspector XE has located data races on queens and nrofsolutions.
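Composer XE: fixing the data races

Both queens and nrofsolutions are shared by the OpenMP threads in the OpenMP version:

    int nrofsolutions=0;
    void solve()
    {
        int * queens = new int[size];
        #pragma omp parallel for
        for(int i=0; i<size; i++) {
            // try all positions in first row
            setqueen(queens, 0, i);
        }
    }

A. nrofsolutions is a global variable, so all OpenMP threads share it. queens is allocated in solve before the OpenMP loop starts, so all threads receive the same array.

B. Threads execute setqueen at the same time against that shared data:

    void setqueen(int queens[], int row, int col)
    {
        // column is ok, set the queen
        queens[row]=col;
        if(row==size-1) {
            nrofsolutions++;

queens: setqueen() records the position of each Queen in queens. solve() calls setqueen(), and setqueen() calls itself, always passing on the queens array that solve() allocated. Because the for loop in solve() is now parallel, several threads write their Queen positions into the same queens array inside setqueen.

nrofsolutions: setqueen() increments it whenever the last Queen is placed, and several threads can do so at once.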
The fix has two parts. Each loop iteration allocates its own queens array, and nrofsolutions++ is protected with OpenMP critical so that only one thread at a time executes nrofsolutions++:

    int nrofsolutions=0;
    void solve()
    {
        //int * queens = new int[size];
        #pragma omp parallel for
        for(int i=0; i<size; i++) {
            int * queens = new int[size];
            // try all positions in first row
            setqueen(queens, 0, i);
        }
    }

    void setqueen(int queens[], int row, int col)
    {
        // column is ok, set the queen
        queens[row]=col;
        if(row==size-1) {
            #pragma omp critical
            nrofsolutions++;   // count one N-Queens solution
        }
        else {
            // try to fill next row
            for(int i=0; i<size; i++) {
                setqueen(queens, row+1, i);
            }
        }
    }

Every setqueen() call chain started from solve() now works on its own queens array, so setqueen() no longer writes to memory used by another thread.
    solve()
        #pragma omp parallel for
        Fork
            <thread 0>  {int * queens = new []; setqueen(queens,0,0)}   {int * queens = new []; setqueen(queens,0,1)}
            <thread 1>  {int * queens = new []; setqueen(queens,0,2)}   {int * queens = new []; setqueen(queens,0,3)}
            <thread 2>  {int * queens = new []; setqueen(queens,0,4)}   {int * queens = new []; setqueen(queens,0,5)}
            ...
            <thread 7>  {int * queens = new []; setqueen(queens,0,13)}
        Join

Rebuild the Debug configuration and rerun Inspector XE to confirm the fix. Then enable OpenMP for the Release configuration as well: [ ] > [C/C++] > [Language] - [OpenMP Support], which sets /Qopenmp. The Release build is analyzed next with VTune Amplifier XE.
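The two-part fix can be tried in isolation with a compact solver. This is a minimal sketch, not the sample's source: `count_solutions`, `set_queen`, `g_size`, and `g_solutions` are names chosen here, and `std::vector` replaces `new int[size]` so the per-iteration board is released automatically. Without an OpenMP compiler option the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

static int g_size = 0;        // board size N (assumed name)
static int g_solutions = 0;   // shared solution counter (assumed name)

// Places a queen at (row, col) if it is not attacked, then recurses.
static void set_queen(int* queens, int row, int col) {
    for (int i = 0; i < row; ++i) {
        if (queens[i] == col) return;                       // same column
        if (std::abs(queens[i] - col) == row - i) return;   // same diagonal
    }
    queens[row] = col;
    if (row == g_size - 1) {
        // Only one thread at a time may update the shared counter.
        #pragma omp critical
        ++g_solutions;
    } else {
        for (int i = 0; i < g_size; ++i) set_queen(queens, row + 1, i);
    }
}

// Counts all N-Queens solutions for an n x n board.
int count_solutions(int n) {
    assert(n >= 1);
    g_size = n;
    g_solutions = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        std::vector<int> queens(n);   // private board for this iteration
        set_queen(queens.data(), 0, i);
    }
    return g_solutions;
}
```

Calling `count_solutions(8)` should return 92, the known count for the 8 x 8 board, whether or not OpenMP is enabled.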
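VTune Amplifier XE

VTune Amplifier XE shows how well the Release build uses the CPU.

1. In VTune Amplifier XE, choose [New Analysis].
2. Under [Analysis Type], choose [Concurrency].
3. Click [Start].

The [Summary] shows two histograms, each graded from Ideal to Poor:

- Thread Concurrency Histogram: how many threads were running at the same time.
- CPU Usage Histogram: how many CPUs were in use at the same time.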
Open [Bottom-up] to see where CPU time is spent. The Hotspot is setqueen, and its CPU utilization is rated Poor.

The timeline below the grid shows Transitions, CPU Time, CPU Usage, and Thread Concurrency. Compare it with the CPU figures in the [Summary].
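In the CPU Time timeline, select a range and choose [Zoom In on Selection] until the scale is about 1ms. Turn ON Running, Waits, and Transitions to see how the threads hand over to each other.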
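Set [Grouping] to Function / Thread / Call Stack to see how setqueen is spread over the threads and how much CPU each setqueen thread uses.

To find what the threads are waiting for, run the [Locks and Waits] analysis, open [Bottom-up], and set Grouping to Sync Object / Function / Call Stack.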
The top Sync Object is OMP Critical setqueen:#. The Object that the threads wait on is the OpenMP critical section in setqueen.
To remove the critical section, each thread counts its solutions in its own array element, selected by thread ID. solve() passes the thread ID to setqueen() and adds up the per-thread counts after the parallel region:

    #include <iostream>
    #include <windows.h>
    #include <mmsystem.h>
    #include "omp.h"   // header for the OpenMP functions
    using namespace std;

    int solcnt[32];    // per-thread solution counts (at most 32 threads)

    void solve()
    {
        int thrd_max = omp_get_max_threads();     // maximum number of threads available
        #pragma omp parallel
        {
            int thrd_id = omp_get_thread_num();   // ID of the thread executing this region
            #pragma omp for
            for(int i=0; i<size; i++) {
                int * queens = new int[size];
                // try all positions in first row
                setqueen(queens, 0, i, thrd_id);  // thread ID added as an argument
            }
        }   // end of pragma omp parallel (join)
        for(int i=0; i<thrd_max; i++) {
            nrofsolutions += solcnt[i];           // the answer is the sum of the per-thread results
        }
    }

    void setqueen(int queens[], int row, int col, int thrd_id)   // thread ID added as an argument
    {
        // column is ok, set the queen
        queens[row]=col;
        if(row==size-1) {
            solcnt[thrd_id]++;   // independent variable per thread (no synchronization needed)
        }
        else {
            // try to fill next row
            for(int i=0; i<size; i++) {
                setqueen(queens, row+1, i, thrd_id);   // thread ID added as an argument
            }
        }
    }
    void solve( void )
        int thrd_max = omp_get_max_threads();   // 8
        #pragma omp parallel
        Fork
            <thread 0>  thrd_id (0) = omp_get_thread_num();
            <thread 1>  thrd_id (1) = omp_get_thread_num();
            ...
            <thread 6>  thrd_id (6) = omp_get_thread_num();
            <thread 7>  thrd_id (7) = omp_get_thread_num();
        #pragma omp for
            <thread 0>  setqueen(queens,0,0, thrd_id(=0))   setqueen(queens,0,1, thrd_id(=0))
            <thread 1>  setqueen(queens,0,2, thrd_id(=1))   setqueen(queens,0,3, thrd_id(=1))
            ...
            <thread 6>  setqueen(queens,0,12, thrd_id(=6))
            <thread 7>  setqueen(queens,0,13, thrd_id(=7))
        Inside setqueen( ), each thread increments only its own element:
            <thread 0>  solcnt[thrd_id(=0)]++;
            <thread 1>  solcnt[thrd_id(=1)]++;
            ...
            <thread 6>  solcnt[thrd_id(=6)]++;
            <thread 7>  solcnt[thrd_id(=7)]++;
        Join
        for(int i=0; i<thrd_max; i++) { nrofsolutions += solcnt[i]; }

Rebuild the Release configuration and rerun VTune Amplifier XE to check setqueen() again.
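The per-thread counting scheme can be exercised with a self-contained sketch. It follows the same idea (one counter per thread, indexed by thread ID, summed after the join) but is not the sample's source: `count_solutions_per_thread`, `max_threads`, and `current_thread` are names assumed here, the counter array is sized from `omp_get_max_threads()` instead of a fixed 32, and the `_OPENMP` guards let it build as a serial program when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Thin wrappers so the sketch also compiles without OpenMP (assumed names).
static int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

static int current_thread() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Same search as before; a solution increments only this thread's counter.
static void set_queen(int* queens, int n, int row, int col,
                      int* solcnt, int thrd_id) {
    for (int i = 0; i < row; ++i) {
        if (queens[i] == col) return;                       // same column
        if (std::abs(queens[i] - col) == row - i) return;   // same diagonal
    }
    queens[row] = col;
    if (row == n - 1) {
        ++solcnt[thrd_id];   // no synchronization: element is thread-private
    } else {
        for (int i = 0; i < n; ++i)
            set_queen(queens, n, row + 1, i, solcnt, thrd_id);
    }
}

int count_solutions_per_thread(int n) {
    assert(n >= 1);
    std::vector<int> solcnt(max_threads(), 0);   // one counter per thread
    int* counts = solcnt.data();
    #pragma omp parallel
    {
        const int thrd_id = current_thread();
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            std::vector<int> queens(n);          // private board per iteration
            set_queen(queens.data(), n, 0, i, counts, thrd_id);
        }
    }   // join: all per-thread counts are final here
    int total = 0;
    for (int c : solcnt) total += c;
    return total;
}
```

Note that the counters in `solcnt` are adjacent `int` values, so several of them fit in one 64-byte cache line. The result is still correct, but the threads can slow each other down when they update neighbouring elements.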
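VTune Amplifier XE: CPU event analysis

VTune Amplifier XE can also analyze events inside the CPU. Its event-based sampling (EBS) uses the CPU's performance monitoring unit (PMU) to sample events such as:

- cache: L1 / L2 / LLC
- TLB: ITLB / DTLB
- operations: FP / MMX / SIMD / LOAD / STORE

With EBS, VTune Amplifier XE can show where data is passed between CPU cores. The CPU moves data in cache lines of 64 bytes, so variables that lie within the same 64 bytes are transferred together.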
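The CPU used here, the Intel(R) Core(TM) i7-2600, is a Sandy Bridge processor, so the Sandy Bridge analysis types apply.

1. Under [Analysis Type], choose [Sandy Bridge Analysis] > [Access Contention]. This analysis collects 6 events, including MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS.
2. Click [Start].
3. Open [PMU Events].

The events are concentrated in setqueen, on the accesses to queens and col and on solcnt: 2 places in total.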
In the assembly view, the EBS samples fall on the cmp instruction that reads queens and on the mov instruction that writes queens.
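Besides queens, solcnt is affected: samples fall on the inc instruction that increments solcnt.

queens is allocated inside the OpenMP for loop in solve(). One array is (sizeof(int) * 14) = 56 bytes with a 32-bit int, which is smaller than a cache line, so arrays that belong to different threads can end up in the same cache line. Allocating with _mm_malloc and releasing with _mm_free places every queens array on a 64-byte boundary (0x00, 0x40, 0x80, ...).

solcnt is an array of int, so the counters of neighbouring threads lie in the same cache line. Like queens, solcnt is changed so that each thread's element starts on its own 64-byte boundary.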
    #include <iostream>
    #include <windows.h>
    #include <mmsystem.h>
    #include <malloc.h>   // _mm_malloc / _mm_free
    #include "omp.h"
    using namespace std;

    //int solcnt[32];                             // commented out (avoid sharing a cache line)
    __declspec(align(64)) int solcnt[32][16];     // 64 bytes between elements, declared on a 64-byte boundary

    void solve()
    {
        int thrd_max = omp_get_max_threads();
        #pragma omp parallel
        {
            int thrd_id = omp_get_thread_num();
            #pragma omp for
            for(int i=0; i<size; i++) {
                //int * queens = new int[size];   // commented out
                int * queens = (int *) _mm_malloc(sizeof(int)*size, 64);   // 64-byte boundary
                // try all positions in first row
                setqueen(queens, 0, i, thrd_id);
                _mm_free(queens);                 // release the memory
            }
        }
        for(int i=0; i<thrd_max; i++) {
            nrofsolutions += solcnt[i][0];        // answer (changed to match the array declaration)
        }
    }

    void setqueen(int queens[], int row, int col, int thrd_id)
    {
        // column is ok, set the queen
        queens[row]=col;
        if(row==size-1) {
            solcnt[thrd_id][0]++;   // avoid sharing a cache line
        }
        else {
            // try to fill next row
            for(int i=0; i<size; i++) {
                setqueen(queens, row+1, i, thrd_id);
            }
        }
    }

Rerun the VTune Amplifier XE EBS analysis to confirm that the CPU events have decreased.
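The effect of the 64-byte spacing can be checked directly from addresses. The sketch below uses standard C++11 `alignas` in place of `__declspec(align(64))` and a padded struct in place of the `[32][16]` array; that substitution and the names `PaddedCounter`, `PlainPair`, and `same_cache_line` are assumptions made here for illustration. The 64-byte line size is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;   // cache line size assumed from the text

// One counter per cache line: alignas pads the struct to 64 bytes.
struct alignas(kCacheLine) PaddedCounter {
    int value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");

// Two adjacent ints placed at the start of a line, for comparison.
struct alignas(kCacheLine) PlainPair {
    int a = 0;
    int b = 0;
};

// True when both addresses fall into the same 64-byte line.
bool same_cache_line(const void* p, const void* q) {
    const auto x = reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    const auto y = reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
    return x == y;
}

// Neighbouring padded counters: expected to sit on different lines.
bool padded_neighbors_share_line() {
    static PaddedCounter counters[2];
    return same_cache_line(&counters[0].value, &counters[1].value);
}

// Neighbouring plain ints: expected to sit on the same line.
bool plain_neighbors_share_line() {
    static PlainPair pair;
    return same_cache_line(&pair.a, &pair.b);
}
```

The padding costs 60 unused bytes per counter, which is negligible for a handful of threads.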
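Composer XE: compiler options

/O3
    IDE: [C/C++] > [ ] > [ ]
    Applies more aggressive optimization than /O2. /O2 is the default.
    > icl /O3 main.cpp

/Qipo
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Enables interprocedural optimization (IPO). Used with the Release configuration.

/Qx{SSE2|SSE3|SSSE3|SSE4.1|SSE4.2|AVX|CORE-AVX2|CORE-AVX-I|Host}
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Generates code for the given instruction set, from SSE to AVX.
    For a CPU that supports AVX:
    > icl /O2 /QxAVX main.cpp
    For a CPU that supports AVX2:
    > icl /O2 /QxCORE-AVX2 main.cpp
    /QxHost targets the highest instruction set (SSE level or above) of the Host that compiles main.exe:
    > icl /O2 /QxHost main.cpp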
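/Qax{SSE2|SSE3|SSSE3|SSE4.1|SSE4.2|AVX|CORE-AVX2|CORE-AVX-I}
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Generates a code path for the given instruction set in addition to a baseline path. The baseline is SSE2 unless /Qx or /arch changes it.
    AVX path plus SSE2 baseline:
    > icl /O2 /QaxAVX main.cpp
    AVX path plus SSE4.2 baseline:
    > icl /O2 /QaxAVX /QxSSE4.2 main.cpp
    SSE4.2 path plus IA-32 (x86/x87) baseline:
    > icl /O2 /QaxSSE4.2 /arch:ia32 main.cpp

/arch:{SSE2|SSE3|SSSE3|SSE4.1|SSE4.2|AVX|IA-32}
    IDE: [C/C++] > [ ] > [ ]
    Selects the instruction set the CPU must support. SSE2 is used with /O2 in the Release configuration.
    > icl /O2 /arch:sse2 main.cpp

/Qparallel
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Enables automatic parallelization.
    > icl /Qparallel main.cpp

/Qpar-threshold[n] (n = 0 to 100)
    IDE: [C/C++] > [ ] > [ ]
    Sets the threshold for automatic parallelization. The default is 100.
    > icl /Qparallel /Qpar-threshold:90 main.cpp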
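OpenMP

/Qopenmp
    IDE: [C/C++] > [ [Intel(R) C++]] > [OpenMP ]
    Enables OpenMP, so that #pragma omp is honored. Programs built with OpenMP need libiomp5md.dll at run time.
    > icl /Qopenmp main-omp.cpp

/Qopenmp-stubs
    IDE: [C/C++] > [ [Intel(R) C++]] > [OpenMP ]
    Compiles an OpenMP program for serial execution: calls to OpenMP functions still resolve, but OpenMP itself is not used.
    > icl /Qopenmp-stubs main-omp.cpp

Reports

/Qopt-report{0-3}
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Optimization report, level 0 to 3.
    > icl /O3 /Qipo /Qopt-report2 main.cpp

/Qvec-report{0-6}
    IDE: [C/C++] > [ [Intel(R) C++]] > [ ]
    Vectorization report, level 0 to 6.
    > icl /O2 /QxAVX /Qvec-report3 main.cpp
    > icl /O2 /Qvec-report2 main.cpp    (with /arch:sse2)

/Qpar-report{0-3}
    IDE: [C/C++] > [ ] > [ ]
    Automatic parallelization report, level 0 to 3.
    > icl /Qparallel /Qpar-report3 main.cpp

/Qopenmp-report{0-2}
    IDE: [C/C++] > [ ] > [ ]
    OpenMP report, level 0 to 2.
    > icl /Qopenmp /Qopenmp-report2 main-omp.cpp

Reference for the static analysis problem types:
< >\Composer XE 2013\Documentation\en_us\ssadiag_docs\problem_type_reference.chm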
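Inspector XE

Besides threading errors, Inspector XE detects memory errors. Choose Memory Error Analysis as the Analysis Type and click Start. In the results, [Explain Problem] opens the description of the selected problem.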
Problems that Inspector XE detects (from Intel Inspector XE 2013 - Problem Type Reference):

- Cross-thread Stack Access
- Data Race
- Deadlock
- GDI Resource Leak
- Incorrect memcpy Call (Linux)
- Invalid Deallocation
- Invalid Memory Access
- Invalid Partial Memory Access
- Kernel Resource Leak
- Lock Hierarchy Violation
- Memory Growth
- Memory Leak
- Memory Not Deallocated
- Mismatched Allocation/Deallocation
- Missing Allocation
- Thread Start Information
- Unhandled Application Exception
- Uninitialized Memory Access
- Uninitialized Partial Memory Access

Inspector XE offers 3 ways to work with the debugger.
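- [Debug This Problem]: start the debugger from a reported problem.
- Analysis Type option [Enable debugger when problem detected]: break into the debugger when a problem is detected.
- Analysis Type option [Select analysis start location with debugger]: choose where the analysis starts from the debugger.

In Visual Studio, choose [ ] > [Continue With Intel Inspector XE Analysis] to continue the analysis.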
From [Problem Details], use [Disable Breakpoint] and [Re-enable Breakpoints] to control the breakpoints.

To limit what Inspector XE analyzes, open Advanced and use Modules: [Modify ] to choose one of:

- Include only the following module(s)
- Exclude the following module(s)
To watch an Inspector XE analysis while it runs, open Advanced and turn on Enable Collection progress information, then choose Show details.
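Build options for Inspector XE

    Debug information        /Zi or /ZI
    Linker                   /debug
    Optimization             /Od (not the Release setting)
    Run-time library         /MDd or /MD
    Basic run-time checks    /RTC[su1]

These settings affect the false positives and false negatives that Inspector XE reports.

Inspector XE vs. Parallel Studio XE Inspector XE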
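Tips

- Visual Studio: [ ] > [ ].
- The Collection Log window shows the Inspector XE log, and [Application Output] shows the output of the application.
- In the Summary, Create Problem Report writes the results to a report.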
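VTune Amplifier XE

VTune Amplifier XE collects data in two ways:

- User-mode sampling: interval of about 10ms, overhead of about 5%. Used by 3 analysis types: Hotspots (CPU), Concurrency, and Locks and Waits.
- PMU sampling: the PMU raises a sample each time an event reaches its Sample After Value. Interval of about 1ms, overhead of about 2%.

For a Core Microarchitecture processor, the analysis types are Hotspots, Concurrency, Locks and Waits, and the PMU-based analyses.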
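CPI

The Lightweight Hotspots analysis reports CPI. CPI (Clockticks per Instructions Retired) is the number of CPU clockticks spent per retired instruction:

    CPI = CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY

A Core Microarchitecture CPU can retire up to 4 instructions in 1 clocktick, so the theoretical best CPI is 1 / 4 = 0.25. VTune Amplifier XE highlights a CPI of 1.00 or more. Look at functions that use 5% or more of the CPU time; in the example, multiply2 has a CPI of 1.112.

Lightweight Hotspots collects two events:

- CPU_CLK_UNHALTED.THREAD: CPU clockticks
- INST_RETIRED.ANY: instructions retired

A high CPI alone does not identify a problem. Read CPI together with the CPU time shown by CPU_CLK_UNHALTED.THREAD.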
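The CPI arithmetic is a single division of the two event counts. The sketch below works through it; `cycles_per_instruction` and `cpi_needs_attention` are hypothetical helper names, and the 1.0 threshold is an assumption taken from the 1.00 figure mentioned in the text.

```cpp
#include <cassert>
#include <cstdint>

// CPI = CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY
// Both arguments are raw event counts for the same code region.
double cycles_per_instruction(std::uint64_t clockticks,
                              std::uint64_t instructions_retired) {
    assert(instructions_retired > 0);   // CPI is undefined with no instructions
    return static_cast<double>(clockticks) /
           static_cast<double>(instructions_retired);
}

// Assumed rule of thumb: a CPI at or above the threshold deserves a look.
bool cpi_needs_attention(double cpi, double threshold = 1.0) {
    return cpi >= threshold;
}
```

With 1000 clockticks and 4000 retired instructions, `cycles_per_instruction` gives 0.25, the theoretical best case of 4 instructions per clocktick. A value such as the 1.112 reported for multiply2 exceeds the threshold.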
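Build options for VTune Amplifier XE

    Debug information    /Zi
    Linker               /debug
    Optimization         /O2 (the Release setting)
    Run-time library     /MD or /MDd
    TBB                  /D"TBB_USE_THREADING_TOOLS"
    Inline functions     /debug:inline-debug-info, or /Ob0 to disable inlining

With /D"TBB_USE_THREADING_TOOLS", VTune Amplifier XE can analyze code that uses TBB. With inline debug information, VTune Amplifier XE can attribute time to inline functions (inline mode).

Pause and Resume

By default VTune Amplifier XE collects Hotspot and CPU data for the whole run. Pause and Resume limit the collection to the part of interest. They are available from the VTune Amplifier XE GUI and as an API. In the GUI, use [Start Paused] instead of [Start], then click [Resume] when collection should begin.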
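API

The VTune Amplifier XE API provides Resume as __itt_resume and Pause as __itt_pause. Enclose each region to be measured between a Resume call and a Pause call. To use the API, include ittnotify.h:

    #include <ittnotify.h>

    Func()
    {
        // not collected
        __itt_resume();
        // region A: collected
        __itt_pause();
        // not collected
        __itt_resume();
        // region B: collected
        __itt_pause();
        // not collected
    }

Location of ittnotify.h:

    x86 OS: C:\Program Files\Intel\VTune Amplifier XE 2013\include
    x64 OS: C:\Program Files (x86)\Intel\VTune Amplifier XE 2013\include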
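Location of the library:

    x86 OS: C:\Program Files\Intel\VTune Amplifier XE 2013\lib32
    x64 OS: C:\Program Files (x86)\Intel\VTune Amplifier XE 2013\lib32

For an Intel64 application, use lib64 instead of lib32. Link libittnotify.lib.

Start the analysis with [Start Paused], so that collection begins at the first Resume call and stops at each Pause call until the next Resume.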
Help

On Windows, open the documentation from Visual Studio with [ ], or press F1 in the Visual Studio* IDE with [ ] selected.
N-Queens

This guide parallelized the N-Queens sample 1_nqueens_serial with OpenMP. The N-Queens sample also includes versions for OpenMP 3.0, Intel Cilk Plus, and Intel TBB.

OpenMP

- OpenMP*: http://jp.xlsoft.com/documents/intel/compiler/525j-001.pdf
- OpenMP*: http://jp.xlsoft.com/documents/intel/compiler/526j-001.pdf
- C / Fortran: http://jp.xlsoft.com/documents/intel/compiler/527j-001.pdf
- OpenMP: http://openmp.org/wp/

Intel Cilk Plus

- Intel Cilk Plus: http://software.intel.com/en-us/intel-cilk-plus

Intel TBB

- TBB: http://threadingbuildingblocks.org/
- TBB: http://www.xlsoft.com/jp/products/intel/perflib/tbb/index.html
- http://software.intel.com/en-us/intel-tbb
http://xlsoft.com/jp/products/intel/studio_xe/index.html
https://www.xlsoft.com/jp/services/xlsoft_form.html
http://www.isus.jp/