Accelerating Large-Scale Graph Processing with Memory-Hierarchy Awareness



NUMA

1. 18 (Moore's law) 1Hz CPU

2. 1 (Register) (RAM) Level 1 (L1) L2 L3 L4 TLB (translation look-aside buffer) (OS) TLB TLB

3. NUMA NUMA (Non-uniform memory access)

October 2014 issue. Copyright © by ORSJ. Unauthorized reproduction of this article is prohibited.

Intel Xeon X5460 Harpertown CPU 2 CPU 4 1 8 (= 2 × 4 × 1)

2 2-way Intel Xeon X5460

NUMA UMA (Uniform memory access) 2 UMA 3 NUMA UMA 2 CPU Intel Xeon X5460 2 CPU CPU (RAM) RAM NUMA NUMA NUMA CPU NUMA 3 CPU Intel Xeon E5-4640 4 NUMA 4

4. STREAM 1

1 STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/

Intel Xeon E5-4640 SandyBridge-EP CPU 4 CPU 8 2 64 (= 4 × 8 × 2)

3 4-way Intel Xeon E5-4640

STREAM 4 1 Triad $n$ $a, b, c \in \mathbb{R}^n$ $r$ $a \leftarrow b + r c$ 1 bytes 4 OpenMP Triad C/C++

4 OpenMP Triad

5 4-way Intel Xeon E5-4640 $n$ $n = \{2^{10}, \dots, 2^{30}\}$ Triad (GB/s) $2^{20}$ STREAM 20 16, 32, 64 95, 98, 92 GB/s 64
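The Triad kernel named above computes $a \leftarrow b + r c$ over vectors of length $n$. A minimal C sketch with an OpenMP loop might look like the following. It is not the STREAM source and not the article's listing: the function names `triad`, `triad_bytes` and `triad_gbps` are hypothetical, and the byte count assumes the usual STREAM accounting of two loads and one store of a `double` per element.

```c
#include <assert.h>
#include <stddef.h>

/* Triad: a[i] = b[i] + r * c[i] for i in [0, n). */
void triad(double *a, const double *b, const double *c, double r, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    /* The pragma has no effect unless OpenMP is enabled (e.g. -fopenmp). */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        a[i] = b[i] + r * c[i];
    }
}

/* Bytes moved by one Triad pass, assuming 2 loads + 1 store of a double
 * per element (the usual STREAM accounting; an assumption here). */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Bandwidth in GB/s (10^9 bytes per second) for one pass taking `seconds`. */
double triad_gbps(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return triad_bytes(n) / seconds / 1e9;
}
```

To measure bandwidth, a caller would time repeated calls to `triad` with a wall-clock timer and pass the best time to `triad_gbps`. With the default Linux policy, the thread that first writes a page decides which NUMA node holds it, so the arrays are normally initialised with the same loop schedule as the kernel.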

5 STREAM TRIAD 2 Hyper-threading 32

4.1 Linux numactl NUMA numactl NUMA node 0 16 16 NUMA node 3 --physcpubind --membind NUMA ID NUMA ID Linux /proc/cpuinfo processor ID physical id NUMA ID Portable Hardware Locality (HWLOC) [1] 6 $n = \{2^{10}, 2^{11}, \dots, 2^{30}\}$ NUMA NUMA NUMA 0 16 16 Triad NUMA

6 NUMA 0 NUMA (GB/s)

12 GB/s NUMA 3 GB/s NUMA 1/4

4.2 numactl --localalloc 32 NUMA 0, 1 32

4.3 4 KBytes NUMA NUMA numactl --interleave 32 NUMA 0, 1 32

NUMA 0, 1 4 Local allocation

5. NUMA

4.4 7(a) 7(b) STREAM TRIAD (GB/s) NUMA 1, 2, 4 $n = \{2^{10}, \dots, 2^{30}\}$ NUMA (Local-allocation) (Interleaving) $2^{20}$ NUMA 1 16 2 32 4 64 Local allocation 13 GB/s, 21 GB/s, 24 GB/s Interleaving 13 GB/s, 6 GB/s, 8 GB/s Interleaving TRIAD STREAM 4 Local allocation Interleaving 6 NUMA numactl Linux sched_setaffinity() sched_getaffinity() mbind() sched_setaffinity() sched_setaffinity() mbind() NUMA NUMA

5.1 STREAM TRIAD TRIAD $a$, $b$, $c$ 1 NUMA

7 STREAM TRIAD: (GB/s)
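The text above names the Linux calls `sched_setaffinity()` and `sched_getaffinity()` as an in-program alternative to `numactl`. The sketch below shows one way they could be used to pin the calling thread to a single logical CPU, assuming Linux with glibc. The helper names `pin_to_cpu`, `first_allowed_cpu` and `allowed_cpu_count` are hypothetical and are not taken from the article or from ULIBC. `mbind()` is left out because it needs `<numaif.h>` from libnuma.

```c
#define _GNU_SOURCE            /* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 means the caller */
}

/* Lowest-numbered CPU in the caller's affinity mask, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}

/* Number of CPUs in the caller's affinity mask, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    return CPU_COUNT(&set);
}
```

In a NUMA-aware program, each worker thread would typically call something like `pin_to_cpu` once at start-up and only then allocate and initialise the data it owns, so that the pages end up on the node local to that CPU.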

8 TRIAD 9 NUMA 8 TRIAD NUMA 1, 2, 4 24, 48, 96 GB/s 5

5.2 (Breadth-first search; BFS) BFS $G = (V, E)$ $n = |V|$ $m = |E|$ $O(n + m)$ HPC Graph500 1 Graph500 2 BFS 2010 11 2 SCALE edgefactor $= m/n$ 16 (a) (b) (c) (a) $n = 2^{\mathrm{SCALE}}$ $m = n \cdot \mathrm{edgefactor}$ Kronecker graph (b) (c) 64 BFS 1 (traversed edges per second; TEPS) (c) 64 TEPS Green Graph500 3 Graph500 TEPS TEPS/W 9 1 BFS (Level) Level-synchronized BFS Beamer [3] Top-down Bottom-up Small-world Top-down Bottom-up Beamer $2^{28}$ $2^{32}$ Kronecker graph 4-way Intel Xeon E7-8870 5.1 GTEPS ($10^9$ TEPS) NUMA 2.2 11.15 GTEPS [4] Bottom-up Small-world 2.68 [5] [4, 5] CSR (Compressed Sparse Row)

2 Graph500: http://www.graph500.org.
3 Green Graph500: http://green.graph500.org.
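The passage above refers to level-synchronized BFS, the Top-down and Bottom-up search directions, and CSR graph storage. The serial C sketch below shows how these pieces could fit together for an undirected graph. It is an illustration only, not the implementation of [3], [4] or [5]: the type `csr_t`, the function names, and the switching rule (go bottom-up when the frontier holds more than `n / switch_div` vertices) are assumptions made for this example and are much simpler than the published heuristic.

```c
#include <assert.h>
#include <stdlib.h>

/* CSR graph: neighbours of v are adj[row[v]] .. adj[row[v + 1] - 1]. */
typedef struct { int n; const int *row; const int *adj; } csr_t;

/* Top-down step: every frontier vertex claims its unvisited neighbours. */
static int top_down(const csr_t *g, int *parent, const int *cur, int ncur, int *next)
{
    int nnext = 0;
    for (int i = 0; i < ncur; ++i) {
        int v = cur[i];
        for (int e = g->row[v]; e < g->row[v + 1]; ++e) {
            int w = g->adj[e];
            if (parent[w] < 0) { parent[w] = v; next[nnext++] = w; }
        }
    }
    return nnext;
}

/* Bottom-up step: every unvisited vertex looks for a neighbour in the frontier
 * (the vertices whose level equals depth) and stops at the first one found. */
static int bottom_up(const csr_t *g, int *parent, const int *level, int depth, int *next)
{
    int nnext = 0;
    for (int w = 0; w < g->n; ++w) {
        if (parent[w] >= 0) continue;
        for (int e = g->row[w]; e < g->row[w + 1]; ++e) {
            int v = g->adj[e];
            if (level[v] == depth) { parent[w] = v; next[nnext++] = w; break; }
        }
    }
    return nnext;
}

/* Level-synchronized BFS from root. Fills parent[] and level[] (-1 where
 * unreachable) and returns the number of vertices reached. The switching
 * rule is a placeholder assumption, not the heuristic of [3]. */
int bfs(const csr_t *g, int root, int switch_div, int *parent, int *level)
{
    assert(g != NULL && root >= 0 && root < g->n && switch_div > 0);
    int *cur = malloc(sizeof(int) * (size_t)g->n);
    int *next = malloc(sizeof(int) * (size_t)g->n);
    assert(cur != NULL && next != NULL);
    for (int v = 0; v < g->n; ++v) { parent[v] = -1; level[v] = -1; }
    parent[root] = root;
    level[root] = 0;
    cur[0] = root;
    int ncur = 1, depth = 0, reached = 1;
    while (ncur > 0) {
        int nnext = (ncur > g->n / switch_div)
                  ? bottom_up(g, parent, level, depth, next)
                  : top_down(g, parent, cur, ncur, next);
        for (int i = 0; i < nnext; ++i) level[next[i]] = depth + 1;
        int *tmp = cur; cur = next; next = tmp;
        ncur = nnext;
        reached += nnext;
        ++depth;
    }
    free(cur);
    free(next);
    return reached;
}
```

Calling `bfs` with `switch_div = 1` keeps the search top-down throughout, while a value larger than `n` forces bottom-up at every level. On a small test graph both settings should produce the same levels. A TEPS figure would then be the number of edges traversed divided by the measured search time.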

Table 1: (n, m), TEPS

Reference      Machine                       (n, m)                  TEPS
Madduri        Cray MTA-2 (40 procs)         ($2^{21}$, $2^{30}$)    0.5 G
Agarwal [2]    4-way Intel Xeon X7560        ($2^{20}$, $2^{26}$)    1.3 G
Beamer [3]     4-way Intel Xeon E7-8870      ($2^{28}$, $2^{32}$)    5.1 G
Yasui [4]      4-way Intel Xeon E5-4640      ($2^{26}$, $2^{30}$)    11.1 G
Yasui [5]      4-way Intel Xeon E5-4640      ($2^{27}$, $2^{31}$)    29.0 G

$$V_k = \left\{ v_j \in V \;\middle|\; j \in \left[ \frac{k n}{l}, \frac{(k+1) n}{l} \right) \right\}$$

$A$ Top-down $v \in V$ $A^F_k(v)$ Bottom-up $w \in V_k$ $A^B_k(w)$ $l$ 1 $A^F_k(v)$ $A^B_k(w)$

$$A^F_k(v) = \{ w \mid w \in \{ V_k \cap A(v) \} \}, \quad v \in V,$$
$$A^B_k(w) = \{ v \mid v \in A(w) \}, \quad w \in V_k.$$

NUMA Graph500 2014 6 4 10 NUMA BFS Graph500 10(a) NUMA 10(b) NUMA $l$ $G$ $l$ $\{G_k\}$, $(k = \{0, 1, \dots, l-1\})$ NUMA $k$ $V_k$ $A_k$ $V_k$ SGI UV2000 $2^{32}$ $2^{36}$ Kronecker 640 131.4 GTEPS Green Graph500 2014 6 5 Big Data category 4-way Intel Xeon E5-4640 $2^{30}$, $2^{34}$ 28.5 GTEPS 59.1 MTEPS/W 1 UV 2000

5.3 SDPARA (SemiDefinite Programming Algorithm PARAllel version) [6] SDPA (Semidefinite Programming Algorithms) ZDD (Zero-suppressed decision diagram) [7] [8] NUMA ULIBC (Ubiquity Library for Intelligently Binding Cores)

4 http://www.graph500.org/results_jun_2014
5 http://green.graph500.org/list_2014_06_isc.php
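The vertex partition used above assigns vertex $v_j$ to block $V_k$ when $j \in [kn/l, (k+1)n/l)$, one block for each of the $l$ NUMA nodes. The C sketch below computes the block boundaries and the owning block of a vertex index. The function names `block_begin` and `block_owner` are hypothetical. Because $j$ is an integer, the lower bound is taken as $\lceil kn/l \rceil$, which matches the interval even when $l$ does not divide $n$.

```c
#include <assert.h>

/* First vertex index of block k: the smallest integer j with j >= k*n/l.
 * block_begin(n, l, l) equals n, so block k spans
 * [block_begin(n, l, k), block_begin(n, l, k + 1)). */
long block_begin(long n, int l, int k)
{
    assert(n >= 0 && l > 0 && k >= 0 && k <= l);
    return (long)(((long long)k * n + l - 1) / l);   /* ceil(k * n / l) */
}

/* Index k of the block (NUMA node) that owns vertex j: floor(j * l / n). */
int block_owner(long n, int l, long j)
{
    assert(l > 0 && j >= 0 && j < n);
    return (int)(((long long)j * l) / n);
}
```

For example, with `n = 16` and `l = 4` the blocks are `[0, 4)`, `[4, 8)`, `[8, 12)` and `[12, 16)`, and `block_owner(16, 4, 7)` is `1`.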

6. NUMA NUMA (JST) CREST SGI Silicon Graphics International Corp.

[1] F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin, G. Mercier, S. Thibault and R. Namyst, "hwloc: A generic framework for managing hardware affinities in HPC applications," Proc. IEEE Int. Conf. PDP2010, 2010.
[2] V. Agarwal, F. Petrini, D. Pasetto and D. A. Bader, "Scalable graph exploration on multicore processors," Proc. ACM/IEEE Int. Conf. SC10, 2010.
[3] S. Beamer, K. Asanović and D. A. Patterson, "Direction-optimizing breadth-first search," Proc. ACM/IEEE Int. Conf. SC12, 2012.
[4] Y. Yasui, K. Fujisawa and K. Goto, "NUMA-optimized parallel breadth-first search on multicore single-node system," Proc. IEEE Int. Conf. BigData 2013, 2013.
[5] Y. Yasui, K. Fujisawa and Y. Sato, "Fast and energy-efficient breadth-first search on a single NUMA system," Proc. IEEE Int. Conf. ISC'14, 2014.
[6] K. Fujisawa, T. Endo, Y. Yasui, H. Sato, N. Matsuzawa, S. Matsuoka and H. Waki, "Peta-scale general solver for semidefinite programming problems with over two million constraints," Proc. IEEE Int. Conf. IPDPS 2014, 2014.
[7] ULIBC 2014 (HPCS2014) HPCS2014 2014.
[8] Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama and M. Takamatsu, "NETAL: High-performance implementation of network analysis library considering computer memory hierarchy," J. Oper. Res. Soc. Jpn., 54, 259-280, 2011.