IPSJ SIG Technical Report Vol.2021-HPC-178 No.19 2021/3/16

… 1,a) … 1 … 1 … 1 … 1

Abstract: … Extra-P … Extra-P … TSUBAME3.0 … NPB … up to 256 processes … classes A, C, D … 19.3% … 5% …

Keywords: MPI, …, …, …

1. Introduction

… Extra-P [5] … Extra-P … Extra-P … Extra-P … MPI …

2. …

2.1 …

… CPU …

1   The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan
a)  arima@hpc.is.uec.ac.jp

… [4] … TAU, Score-P … Extra-P … [3][4] …

2.2 Extra-P

Extra-P [5] … [2] … Extra-P … Extra-P …

2.3 Modeling Workflow

Modeling proceeds as follows (Fig. 1):
(1) Run the application under analysis several times under different conditions and collect a profile for each run.
(2) Build models, taking either the problem size or the number of execution processes as the variable.
(3) Among the constructed models, select the one with the highest goodness of fit.

Figure 1: … (modeling workflow)

3. …

3.1 …

… Extra-P …

3.2 Model Forms

… four model forms in a variable x …

Each form expresses the measured quantity y as a function of x:

    y = a*x + b                                            (1)
    y = a*log10(x) + b                                     (2)
    y = a/x + b                                            (3)
    y = a1*x + b1 (x < x0);  y = a2*x + b2 (x >= x0)       (4)

4. Evaluation

4.1 Environment

Table 1: TSUBAME3.0 (system)
    Nodes             540
    Peak performance  12.15 PFlops
    Total memory      138,240 GB

Table 2: TSUBAME3.0 (compute node)
    CPU     Intel Xeon E5-2680 V4, 14 cores (28 threads), 2.4 GHz
    Memory  256 GB, 153.6 GB/s
    GPU     NVIDIA Tesla P100

4.2 Setup

Measurements were taken on TSUBAME3.0 with TAU. TSUBAME3.0 has 540 nodes, each with two CPUs (Intel Xeon E5-2680 V4); Tables 1 and 2 summarize the system and its compute nodes. TAU (Tuning and Analysis Utilities) … supports C … and Python [4].

NAS Parallel Benchmarks: … the NAS Parallel Benchmarks (NPB) … [1]. NPB problem sizes are divided into classes A, B, C, D …: class B is 4x class A, class C is 4x class B, and class D is 16x class C.

4.3 Method

4.3.1 …

The four model forms of Sec. 3.2 are fitted …, and each fit is scored by the mean absolute percentage error (MAPE) between the forecast (F_t) and the actual value (A_t), Eq. (5).
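As a concrete illustration of Secs. 3.2 and 4.3.1, the following minimal Python sketch (our reconstruction under the equations above, not the authors' implementation; every function name is ours) fits the four candidate forms (1)-(4) to training measurements and ranks the fits by MAPE. It assumes x is a sorted numpy array with at least four points.

    import numpy as np

    def fit_linear(x, y):                  # Eq. (1): y = a*x + b
        a, b = np.polyfit(x, y, 1)
        return lambda t: a * t + b

    def fit_log10(x, y):                   # Eq. (2): y = a*log10(x) + b
        a, b = np.polyfit(np.log10(x), y, 1)
        return lambda t: a * np.log10(t) + b

    def fit_inverse(x, y):                 # Eq. (3): y = a/x + b
        a, b = np.polyfit(1.0 / x, y, 1)
        return lambda t: a / t + b

    def fit_piecewise(x, y):               # Eq. (4): two linear pieces split at x0
        best = None
        for x0 in x[2:-1]:                 # leave at least 2 points per side
            lo, hi = x < x0, x >= x0
            a1, b1 = np.polyfit(x[lo], y[lo], 1)
            a2, b2 = np.polyfit(x[hi], y[hi], 1)
            f = lambda t, p=(a1, b1, a2, b2, x0): np.where(
                t < p[4], p[0] * t + p[1], p[2] * t + p[3])
            sse = float(np.sum((f(x) - y) ** 2))
            if best is None or sse < best[0]:
                best = (sse, f)
        return best[1]

    def mape(actual, forecast):            # Eq. (5), used to rank the fits
        return 100.0 * np.mean(np.abs(actual - forecast) / actual)

    def select_model(x_train, y_train, x_test, y_test):
        """Fit all four forms, return the one with the lowest MAPE."""
        fits = [fit(x_train, y_train)
                for fit in (fit_linear, fit_log10, fit_inverse, fit_piecewise)]
        return min(fits, key=lambda f: mape(y_test, f(x_test)))

The breakpoint x0 of form (4) is chosen by exhaustive search over the interior training points, the simplest strategy consistent with Eq. (4); the actual selection procedure may differ.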

Table 3: NAS Parallel Benchmarks
    IS  EP  CG  MG  FT  BT  SP  LU

… LU …

4.3.2 …

    MAPE = (100% / N) * Σ_{t=1..N} |A_t - F_t| / A_t       (5)

… [%] … 128, 256 … classes A, C, D …

4.3.3 …

… PC …

4.4 Results

4.4.1 MAPE

Tables 4 and 5 report, per benchmark, how often each model form was selected, with the corresponding MAPE values ("NoData" marks forms that were never selected). Table 4 covers 64 processes and classes A, B, C; Table 5 covers class B. BT and SP run on square process counts (1, 4, 16, 64); the remaining benchmarks run on 1, 2, 4, 8, 16, 32, 64. … BT, SP … MAPE …

4.4.2 …

Table 4: 64 processes (share of functions [%]; in parentheses: MAPE [%], MAPE [%])
            (1)               (2)               (3)             (4)
    BT   99 (0.0, 0.0)     1 (0.0, 0.0)      (NoData)        (NoData)
    CG   69 (0.0, 0.0)    13 (21.2, 3.1)     (NoData)       18 (0.0, 0.0)
    EP  100 (0.0, 0.0)       (NoData)        (NoData)        (NoData)
    FT   57 (0.0, 0.0)     6 (22.5, 22.6)    (NoData)       37 (0.0, 0.0)
    IS  100 (0.0, 0.0)       (NoData)        (NoData)        (NoData)
    LU   81 (0.0, 0.0)    19 (0.1, 0.4)      (NoData)        (NoData)
    MG   48 (0.0, 0.0)    40 (27.0, 28.2)   3 (1.3, 1.3)     9 (0.0, 0.0)
    SP   98 (0.0, 0.0)     2 (0.0, 0.0)      (NoData)        (NoData)

Table 5: class B (share of functions [%]; in parentheses: MAPE [%], MAPE [%])
            (1)               (2)               (3)             (4)
    BT   78 (0.0, 0.0)    22 (0.0, 0.5)      (NoData)        (NoData)
    CG   69 (0.0, 0.0)       (NoData)        (NoData)       31 (0.0, 12.2)
    EP  100 (0.0, 0.0)       (NoData)        (NoData)        (NoData)
    FT   62 (0.0, 0.0)       (NoData)        (NoData)       38 (0.0, 88.7)
    IS   82 (0.0, 0.0)    14 (14.0, 14.0)    (NoData)        4 (88.7, 88.7)
    LU   77 (0.0, 0.0)    21 (0.0, 17.2)    2 (0.0, 0.0)     (NoData)
    MG   72 (0.0, 0.0)       (NoData)      14 (91.0, 91.0)  14 (19.4, 19.4)
    SP   79 (0.0, 0.0)    21 (0.0, 0.5)      (NoData)        (NoData)

Figure 2: … (64 processes)
Figure 3: … (class B)

… the four model forms of Sec. 3.2 …

Table 6: training sets over problem size
    4 classes   A, B, C, D
    3 classes   A, B, C
    2 classes   A, B
    1 class     A

Table 7: training sets over process count (BT, SP)
    5 points   1, 4, 16, 64, 256
    4 points   1, 4, 16, 64
    3 points   1, 4, 16
    2 points   1, 4
    1 point    1

Table 8: training sets over process count (other than BT, SP)
    9 points   1, 2, 4, 8, 16, 32, 64, 128, 256
    8 points   1, 2, 4, 8, 16, 32, 64, 128
    6 points   1, 2, 4, 8, 16, 32
    4 points   1, 2, 4, 8
    2 points   1, 2
    1 point    1

… MAPE …

4.4.3 …

Figure 4: … (64 processes)
Figure 5: … (class B)

… 2% … 

4.4.4 …

… Table 9 (64 processes) … Table 10 (class B) … BT and SP …

Table 9: 64 processes
          [%]     [%]
    BT    35.6    6.225
    CG     1.4    3.667
    EP     0.0    7.433
    FT    15.7    5.240
    IS    30.7    6.035
    LU    15.1    8.229
    MG    52.8    5.525
    SP     2.9    4.341
    avg   19.3    5.837

Table 10: class B
          [%]     [%]
    BT     0.1    265.1
    CG     1.2    449.1
    EP     0.0    669.4
    FT     6.3    538.2
    IS    15.0    566.0
    LU     4.2    390.8
    MG    10.0    451.5
    SP     0.1    233.5
    avg    4.6    445.5

… 19.32% … 5.83% … 4.65% … 445.5% …

5. Conclusion

5.1 Summary

… Extra-P … Extra-P … 4.65% … 19.3% … 5% …

5.2 Future Work

… the four model forms of Sec. 3.2 …

Acknowledgments: This work was supported by JSPS KAKENHI Grant Number JP20H04193.

References
[1] Bailey, D. H.: The NAS Parallel Benchmarks, RNR-94-007 (1994).
[2] Calotoiu, A., Hoefler, T., Schulz, M., Shudler, S. and Wolf, F.: Insightful Automatic Performance Modeling, https://apps.fz-juelich.de/scalasca/releases/extra-p/slides/InsightfulAutomaticPerformanceModelingTutorialPartI.pdf.
[3] Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A. D., Nagel, W. E., Oleynik, Y., Philippen, P., Saviankou, P., Schmidl, D., Shende, S., Tschüter, R., Wagner, M., Wesarg, B. and Wolf, F.: Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir (2011).
[4] Performance Research Lab: TAU, https://www.cs.uoregon.edu/research/tau/home.php.
[5] Technical University of Darmstadt: Extra-P, https://www.scalasca.org/scalasca/software/extra-p/download.html.

IPSJ SIG Technical Report

… MPI … 1,a) … 1 … 1 … 1 … 1

Abstract: … MPI … TSUBAME3.0 … NAS Parallel Benchmarks … L1 … 58.7% … 11.3% …

Keywords: MPI, …, …, …

1. Introduction

… CPU … [1] … [2], [5] …

… [3], [7] …

1   The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan
a)  hasegawa@hpc.is.uec.ac.jp

2. Related Work

… HPC … TAU [5] … TAU's THROTTLE … TAU … THROTTLE … [6] … Extra-P [7] … Extra-P … MPI … OpenMP …

3. …

3.1 …

… L1 … MPI … L1 … L1 … within the measured range (e.g., 8, 32, 128 processes) … L1 … beyond it (e.g., 256 processes) … L1 … L1 … L1 … L1 …

3.2 Model Forms

Four model forms are fitted: linear, inverse, log, and exponential.

    y = a*x + b                          (1)  linear
    y = a/x + b         (0 < a)          (2)  inverse
    y = log_a(x) + b    (1 < a)          (3)  log
    y = a*b^x + c       (1 < b, 0 < c)   (4)  exponential

Here x is the number of processes, y the number of L1 cache misses, and a, b, c are the fitted parameters. The four fits are compared by MAPE, and the form with the lowest MAPE is selected:

    MAPE = (100% / N) * Σ_{t=1..N} |A_t - F_t| / A_t       (5)

where A_t is the measured number of L1 cache misses and F_t the number predicted by the model.
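A concrete (assumed) rendering of this fit-and-select step in Python with scipy; it is our own construction, not the paper's code, and the initial guesses and the finite upper bound on b in the exponential form are our additions to keep the optimizer inside the stated constraints.

    import numpy as np
    from scipy.optimize import curve_fit

    def mape(actual, forecast):                            # Eq. (5)
        return 100.0 * np.mean(np.abs(actual - forecast) / actual)

    # (name, model, initial guess, (lower, upper) bounds on the parameters)
    FORMS = [
        ("linear",  lambda x, a, b: a * x + b,
         [1.0, 1.0], ([-np.inf, -np.inf], [np.inf, np.inf])),
        ("inverse", lambda x, a, b: a / x + b,                 # 0 < a
         [1.0, 1.0], ([0.0, -np.inf], [np.inf, np.inf])),
        ("log",     lambda x, a, b: np.log(x) / np.log(a) + b, # 1 < a
         [10.0, 1.0], ([1.000001, -np.inf], [np.inf, np.inf])),
        ("exponential", lambda x, a, b, c: a * b ** x + c,     # 1 < b, 0 < c
         [1.0, 1.01, 1.0],                                     # upper bound on b
         ([-np.inf, 1.000001, 0.0], [np.inf, 2.0, np.inf])),   # keeps b**x finite
    ]

    def fit_best(x, y):
        """Fit every form; return (mape, name, params) of the best fit."""
        best = None
        for name, fn, p0, bounds in FORMS:
            try:
                popt, _ = curve_fit(fn, x, y, p0=p0, bounds=bounds)
            except (RuntimeError, ValueError):   # non-convergence / infeasible
                continue
            err = mape(y, fn(x, *popt))
            if best is None or err < best[0]:
                best = (err, name, popt)
        return best

    procs = np.array([8.0, 16.0, 32.0, 64.0, 128.0])
    misses = 3.0e6 * procs + 1.0e7          # toy L1-miss data, linear trend
    print(fit_best(procs, misses)[1])       # expected: "linear"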

Table 1: TSUBAME3.0 compute node
    Nodes         540
    CPU           Intel Xeon E5-2680 V4 (Broadwell-EP, 14 cores, 2.4 GHz) x2
    RAM           256 GiB (DDR4-2400 32 GB x8)
    SSD           Intel DC P3500 2 TB (NVMe, PCI-E 3.0 x4, R2700/W1800)
    Interconnect  Intel Omni-Path 100 Gb/s x4
    Storage       DDN SFA14KXE (EXAScaler)

Table 2: TSUBAME3.0 cache sizes [KB]
    L1 (data)         32
    L1 (instruction)  32
    L2                256
    L3                35,840

… Eq. (6) uses the linear form

    y = a*x + b    (a > 0)                                 (6)

with x, y, a, b as in Eq. (1) … L1 …

4. Evaluation

4.1 Environment

TSUBAME3.0 [4] … TSUBAME3.0 … Table 1 … 540 nodes, two CPUs per node, 14 cores per CPU … Table 2 lists the L1 (data and instruction), L2, and L3 cache sizes per CPU [8].

NAS Parallel Benchmarks: six of the NAS Parallel Benchmarks (NPB) are used … [9] … classes A, B, C, D …: each class is 4x the previous one, except class D, which is 16x class C … 8 to 256 processes … FT, IS, LU … class D … 8 … 1 MPI process per … Table 3 lists the benchmarks.

Table 3: benchmarks used
    cg  Conjugate Gradient
    ep  Embarrassingly Parallel
    ft  3-D FFT
    is  Integer Sort
    lu  LU solver (Lower-Upper Gauss-Seidel)
    mg  Multi-Grid

L1 cache misses are measured with TAU and PAPI [10].

4.2 …

… four … L1 … L1 … PC …

4.3 …

… models are built from measurements at 8 to 128 processes, and the L1 miss count at 256 processes is predicted …
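Applied to toy numbers (ours), the Eq. (6) extrapolation step could look like:

    import numpy as np

    procs  = np.array([8.0, 16.0, 32.0, 64.0, 128.0])       # measured scales
    misses = np.array([1.2e8, 2.3e8, 4.6e8, 9.1e8, 1.8e9])  # toy miss counts

    a, b = np.polyfit(procs, misses, 1)
    if a > 0:                                   # Eq. (6) requires a > 0
        print(f"predicted L1 misses at 256 processes: {a * 256 + b:.3e}")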

Table 4: share of functions per selected form [%] (in parentheses: MAPE [%], MAPE [%])
          linear                inverse                log                 exponential
    cg    17.86 (0.67, 9.35)    57.14 (0.57, 14.86)    1.79 (1.75, 1.75)   23.21 (0.23, 1.6)
    ep     0.00 (-, -)         100.00 (0.0, 3.24)      0.00 (-, -)          0.00 (-, -)
    ft     7.14 (1.51, 155.19)  69.39 (0.0, 127.4)     3.06 (8.35, 15.84)  20.41 (0.31, 25.26)
    is     7.14 (1.1, 6.92)     48.21 (0.29, 9.71)     1.79 (1.27, 1.27)   42.86 (0.1, 32.82)
    lu     9.85 (0.34, 16.9)    57.58 (0.77, 61.84)    0.76 (3.17, 3.17)   31.82 (0.35, 29.87)
    mg     2.27 (2.26, 8.11)    72.73 (0.11, 1241.3)   0.00 (-, -)         25.00 (0.81, 25.96)

… 8 to 256 processes … 256 … L1 … classes A, B (2 sizes) … A to C (3) … A to D (4) … class D … L1 …

5. Results

5.1 …

… L1 … 8, 16, 32, 64, 128, 256 … MAPE … MAPE … classes A, B, C, D … Figure 1 … MAPE 1.8% … mg, class D … MAPE 2% … mg (class D): comm3_ex … MAPE … A, B, C, D … 8, 16, 32 … 32 … 32 … 64, 128, 256 … comm3_ex … MAPE …

Figure 1: MAPE per benchmark (bars: classes A, B, C, D; y-axis: MAPE [%])

Table 4 breaks the fits down over the four forms (linear, inverse, log, exponential) … linear … ft … inverse … for the inverse form, the extrapolation MAPE can exceed 100% (127% and 1241%) … inverse … inverse … log … MAPE … exponential … linear … log … MAPE … 33% …

5.2 …

… Eq. (7) …

Figure 2: average error per benchmark (class A; y-axis: average_error [%]; bars: profile sets)
Figure 3: average error per benchmark (class B)
Figure 4: average error per benchmark (class C)
Figure 5: average error per benchmark (class D)

The average error over the modeled functions is

    average error = (100% / N_f) * Σ_{t=1..N_f} |A_t - F_t| / A_t       (7)

where N_f is the number of functions, A_t the measured value, and F_t the predicted value of function t. Classes A to D are shown in Figures 2 to 5 … 8, 64, 128 … 13.1% … A, B, C, D … lu, ft, is … 8, 16, 32 … 16, 32, 64 … cg, class A … 8, 64, 128 … 256 …

The relative error of an individual function is

    relative error = 100% * |A - F| / A                                 (8)

where A is the measured and F the predicted value … 100% … relative error … relative error … 100% … Figure 6 … Figures 2, 3, 4 … class A … B … C … D …
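Both metrics reduce to a few lines; a sketch with our own variable names (A = measured, F = predicted):

    import numpy as np

    def average_error(A, F):
        """Eq. (7): 100%/N_f * sum over functions of |A_t - F_t| / A_t."""
        A, F = np.asarray(A, float), np.asarray(F, float)
        return 100.0 * np.mean(np.abs(A - F) / A)

    def relative_error(A, F):
        """Eq. (8): relative error of a single function, in percent."""
        return 100.0 * abs(A - F) / A

    print(average_error([1.0e8, 2.0e8], [1.1e8, 1.8e8]))  # -> 10.0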

Table 5: relative error per function (cg, class A)
    function name           relative error [%]
    .TAU application             3.8886
    main                         3.9389
    MAIN                         3.9416
    makea                        9.1129
    sprnvc                       9.8794
    conj_grad                    2.8559
    initialize_mpi               0.3995
    randlc                      14.2681
    icnvrt                      23.1941
    vecset                      12.894
    sparse                       2.638
    alloc_space                  4.7269
    setup_submatrix_info         0.27
    setup_proc_info              5.8198

Figure 6: average error vs. number of profiles
Figure 7: relative cost per benchmark (class A; y-axis: relative_cost [%])
Figure 8: relative cost per benchmark (class B)

… Eq. (9) … 2 … 199% … 3 … 58.7% … Figure 6 … classes A to C … 3 … 100% … A, B, C, D … class A … 1, 4, 16, 256 … 3 … 1, 4, 16 … 256 … 256 … 3 … 32, 64, 128 …

5.3 Measurement Cost

The cost of the reduced measurement relative to the full one is

    relative cost = 100% * C_p / C_E                                    (9)

where C_E is the cost of the full measurement and C_p that of the reduced one. Figure 7 … 8, 64, 128 … 18.1% … cg … classes A, B … ep … Figure 7 (class A) … Figure 10 (class D) …
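Eq. (9) is likewise a one-liner; the costs in the example below are invented:

    def relative_cost(c_partial, c_exhaustive):
        """Eq. (9): 100% * C_p / C_E."""
        return 100.0 * c_partial / c_exhaustive

    # e.g., profiling at 8/64/128 processes instead of every scale (toy hours):
    print(relative_cost(c_partial=2.0, c_exhaustive=17.7))  # -> ~11.3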

Figure 9: relative cost per benchmark (class C)
Figure 10: relative cost per benchmark (class D)
Figure 11: relative cost vs. number of profiles (256 processes)

… class D … class D … C_E … C_p … Eq. (9) … Figure 11 … relative cost … 2 … 3 … 11.3% … 2 … 3 … 25% …

6. Conclusion

6.1 Summary

… MPI … L1 … TSUBAME3.0 … NPB … 1.8% … 8, 64, 128 … 256 … 13.2% … 1.8 … classes A to C … 3 … class D … 58.7% … class D … 11.3% …

6.2 Future Work

… L1 … L2, L3 … MPI … MPI … 8 to 256 … classes A to D … 256 … classes E and F … HPC …

Acknowledgments: This work was supported by JSPS KAKENHI Grant Number JP20H04193.

References
[1]  TOP500, November 2020, https://www.top500.org/lists/top500/2020/11/ (accessed 2021/1/28).
[2]  Knüpfer, A. et al.: Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir, in Brunst, H., Müller, M., Nagel, W. and Resch, M. (eds.), Tools for High Performance Computing 2011, Springer, Berlin, Heidelberg (2012).
[3]  PMaC: Performance Modeling and Characterization, https://www.sdsc.edu/pmac/researchareas/index.html (accessed 2021/1/26).
[4]  TSUBAME3.0, https://helpdesk.t3.gsic.titech.ac.jp/manuals/handbook.ja/jobs/ (accessed 2021/2/23).
[5]  Shende, S. and Malony, A. D.: The TAU Parallel Performance System, International Journal of High Performance Computing Applications, SAGE Publications, 20(2):287-311, Summer 2006.
[6]  TAU throttle, https://www.cs.uoregon.edu/research/tau/docs/tutorial/ch01s05.html (accessed 2021/2/7).
[7]  Extra-P, https://www.scalasca.org/software/extra-p/download.html (accessed 2021/1/8).
[8]  Intel Xeon E5-2680 V4, https://ark.intel.com/content/www/jp/ja/ark/products/91754/intel-xeon-processor-e5-2680-v4-35m-cache-2-4-ghz.html (accessed 2021/1/2).
[9]  NAS Parallel Benchmarks, https://www.nas.nasa.gov/publications/npb.html#url (accessed 2021/1/1).
[10] Terpstra, D., Jagode, H., You, H. and Dongarra, J.: Collecting Performance Data with PAPI-C, Tools for High Performance Computing 2009, Springer, Berlin/Heidelberg, 3rd Parallel Tools Workshop, Dresden, Germany, pp. 157-173 (2010).