THE SCIENCE AND ENGINEERING, DOSHISHA UNIVERSITY, VOL.XX, NO.Y, NOVEMBER 2003

Construction of a Tera Flops PC Cluster System and Evaluation of Its Performance Using Benchmarks

Tomoyuki HIROYASU*, Mitsunori MIKI* and Hiroshi ARAKUTA**

* Department of Knowledge Engineering and Computer Sciences, Doshisha University, Kyoto
  Telephone: +81-774-65-6930, Fax: +81-774-65-6796, E-mail: tomo@is.doshisha.ac.jp, mmiki@mail.doshisha.ac.jp
** Graduate Student, Department of Knowledge Engineering and Computer Sciences, Doshisha University, Kyoto
  Telephone: +81-774-65-6716, Fax: +81-774-65-6716, E-mail: arakuta@mikilab.doshisha.ac.jp

(Received October 6, 2004)

As target problems become more complicated and diverse with the development of technology, the demand for high-performance computers keeps increasing. To meet this demand, PC cluster systems have attracted a great deal of attention in recent years as an alternative to vector supercomputers. A PC cluster system consists of many PCs connected by a network and is used for parallel or distributed computation. In scientific and engineering fields, HPC cluster systems are attractive for computationally intensive tasks. We set up an HPC cluster system called Supernova, targeting a performance of 1 Tera Flops. Supernova is composed of 256 nodes running the Linux operating system. We evaluated this cluster system using the High-Performance LINPACK Benchmark and the Himeno Benchmark. In this paper, we present the performance results and the parameter combinations obtained with these benchmarks on Supernova. Through these experiments, we gained knowledge about parameter tuning for better benchmark performance and about the construction of PC cluster systems.

Key words: PC Cluster, Himeno Benchmark, LINPACK Benchmark, Linux

1. Introduction

PC cluster systems 1, 2), which connect commodity PCs by a network, have become an increasingly common way to obtain high computational performance at low cost, and such systems now appear in the TOP500 Supercomputer Sites 3) list of the 500 fastest computer systems in the world.

In this paper we describe the construction of Supernova, a PC cluster system targeting 1 TFlops, and evaluate it with the Himeno Benchmark and the High-Performance LINPACK Benchmark.

2. PC Cluster Systems

A PC cluster connects commodity PCs, each with its own CPU and OS, by a network and uses them as a single HPC (High Performance Computing) resource. The best-known class of HPC clusters is the Beowulf cluster, which originated at NASA in the 1990s 4, 5, 6). Beowulf clusters typically run a free operating system such as Linux or FreeBSD and use message-passing libraries such as MPI or PVM. SCore 7, 8) is cluster system software for Beowulf-class clusters developed by the RWCP (Real-World Computing Partnership), and it also runs on Linux PCs. In addition to HPC clusters, there are HA (High Availability) clusters, which aim at service availability rather than computational performance.

3. Supernova Cluster System

Supernova is a cluster system targeting 1 TFlops and was constructed in September 2003 (Fig. 1). It uses AMD's 64-bit Opteron CPU, with 512 CPUs in total; its specification is summarized in Table 1.

Fig. 1. Supernova Cluster System.

Table 1. Specification of Supernova.
  Nodes     256
  CPU       AMD Opteron 1.8 GHz x 512
  Memory    2 GB x 256 nodes (512 GB in total)
  OS        Turbolinux for AMD64
  MPI       mpich-1.2.5 (TCP/IP)
  Network   Gigabit Ethernet

Each Opteron has two floating-point units, and each unit can complete one floating-point operation per clock cycle. The theoretical peak performance Rpeak of Supernova is therefore given by Eq. (1) and amounts to 1.8432 TFlops:

  R_{peak} = \#CPU \times ClockFrequency \times \#FPU = 512 \times 1.8\,\mathrm{GHz} \times 2 = 1.8432\,\mathrm{TFlops}    (1)

3.1 Processor

Supernova uses the AMD Opteron. The Opteron connects to memory and I/O through HyperTransport links 9); each processor provides three HyperTransport links with an aggregate I/O bandwidth of 19.2 GB/s.

3.2 Network switch

Supernova uses a Force10 Networks E1200 switch. The E1200 offers a switching capacity of 1.44 Tbps and can accommodate 336 Gigabit Ethernet ports, so all 256 nodes of Supernova are connected to a single E1200.

3.3 Interconnect

Given the HyperTransport I/O bandwidth of the Opteron nodes described in 3.1, candidate interconnects for a PC cluster include 100 Mbps and 1 Gbps Ethernet as well as Myricom Myrinet and InfiniBand 10). Supernova adopts Gigabit Ethernet.

4. Benchmarks

4.1 Benchmark selection

Typical benchmarks for PC clusters include the Himeno Benchmark 11), LINPACK 12), and the NAS Parallel Benchmarks 13). In this work we use the Himeno Benchmark and LINPACK.

4.2 Himeno Benchmark

The Himeno Benchmark measures the speed of solving the Poisson equation by the Jacobi iterative method.
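As a quick cross-check of Eq. (1), the following small C program reproduces the 1.8432 TFlops figure; it is only an illustration added here, with the constants taken from Table 1.

    #include <stdio.h>

    int main(void)
    {
        const int    num_cpus    = 512;     /* Table 1: 256 nodes x 2 CPUs */
        const double clock_hz    = 1.8e9;   /* 1.8 GHz Opteron             */
        const int    fpu_per_cpu = 2;       /* two FP units, 1 flop/cycle  */

        /* Eq. (1): Rpeak = #CPU x ClockFrequency x #FPU */
        double rpeak = (double)num_cpus * clock_hz * fpu_per_cpu;

        printf("Rpeak = %.4f TFlops\n", rpeak / 1e12);  /* prints 1.8432 */
        return 0;
    }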

4.2.1 Problem sizes

The Himeno Benchmark measures the performance of the Jacobi iteration for several array sizes, listed in Table 2. The three-dimensional array can be divided in the x, y, and z directions and distributed among the CPUs.

Table 2. Array sizes of the Himeno Benchmark.
  Size   #Array Elements
  XS     65 x 33 x 33
  S      128 x 64 x 64
  M      256 x 128 x 128
  L      512 x 256 x 256
  XL     1024 x 512 x 512

4.2.2 Poisson equation and Jacobi method

The benchmark solves the Poisson equation (2):

  \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = f(x, y, z)    (2)

Discretizing (2) with central differences gives (3):

  f_{i,j,k} = \frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{\Delta x^2}
            + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{\Delta y^2}
            + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{\Delta z^2}    (3)

With \Delta x = \Delta y = \Delta z, solving (3) for u_{i,j,k} yields (4):

  u_{i,j,k} = \frac{1}{6}\left[ u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1} - (\Delta x)^2 f_{i,j,k} \right]    (4)

Equation (4) is solved iteratively for u_{i,j,k} by the Gauss-Seidel or the Jacobi method. The Himeno Benchmark uses the Jacobi method, in which the values at iteration m+1 are computed only from the values at iteration m, as in (5):

  u^{m+1}_{i,j,k} = \frac{1}{6}\left[ u^m_{i+1,j,k} + u^m_{i-1,j,k} + u^m_{i,j+1,k} + u^m_{i,j-1,k} + u^m_{i,j,k+1} + u^m_{i,j,k-1} - (\Delta x)^2 f_{i,j,k} \right]    (5)

4.3 LINPACK Benchmark

LINPACK was developed by J. Dongarra at the University of Tennessee. It measures the speed of solving a dense system of linear equations by LU decomposition, is written in Fortran, and relies on BLAS (Basic Linear Algebra Subprograms). There are three classes of LINPACK benchmarks. The first, with N = 100, uses the LU routines DGEFA/DGESL (double precision) or SGEFA/SGESL (single precision) to factor the matrix and solve for x. The second, "Toward Peak Performance," uses N = 1000. The third, "Highly Parallel Computing," is the class used for the TOP500 ranking, and HPL is its implementation.
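To make the structure of (5) concrete, the following C fragment sketches one Jacobi sweep over the interior of a 3D grid. It is an illustrative sketch only; the actual Himeno Benchmark code carries additional coefficient arrays, and the names used here are assumptions rather than identifiers from the benchmark.

    /* One Jacobi sweep for Eq. (5): u_new depends only on u_old.          */
    /* NX, NY, NZ, dx and the flattened arrays are assumptions of this     */
    /* sketch, not the benchmark's own data layout.                        */
    void jacobi_sweep(int NX, int NY, int NZ, double dx,
                      const double *u_old, const double *f, double *u_new)
    {
    #define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    u_new[IDX(i, j, k)] = (1.0 / 6.0) *
                        (u_old[IDX(i + 1, j, k)] + u_old[IDX(i - 1, j, k)] +
                         u_old[IDX(i, j + 1, k)] + u_old[IDX(i, j - 1, k)] +
                         u_old[IDX(i, j, k + 1)] + u_old[IDX(i, j, k - 1)] -
                         dx * dx * f[IDX(i, j, k)]);
    #undef IDX
    }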

The number of floating-point operations performed by LINPACK for a problem of size N is given by (6):

  \frac{2}{3}N^3 + O(N^2)    (6)

The benchmark factorizes the coefficient matrix A into LU form and then solves for x.

4.3.1 LU decomposition

Consider the linear system (7), where A is an n x n matrix and b an n-vector:

  Ax = b    (7)

A is decomposed into the product of a lower triangular matrix L and an upper triangular matrix U, as in (8):

  A = LU    (8)

L is an n x n lower triangular matrix whose elements satisfy (9), and U is an n x n upper triangular matrix whose elements satisfy (10):

  L_{ij} = 0 \quad (i > j)    (9)
  U_{ij} = 0 \quad (i < j)    (10)

Substituting (8) into (7) gives (11):

  Ax = LUx = b    (11)

Putting y = Ux, the system is solved in two triangular steps. First, (12) is solved for y = (y_1, y_2, ..., y_n) by substitution in O(n^2) operations, and then (13) is solved for x, again in O(n^2) operations:

  Ly = b    (12)
  Ux = y    (13)

4.4 High-Performance LINPACK Benchmark

HPL (High-Performance LINPACK Benchmark) is the parallel implementation of LINPACK used in this work. Its performance depends strongly on the underlying BLAS library; here we use ATLAS (Automatically Tuned Linear Algebra Software) and the goto-library.

4.4.1 HPL algorithm

HPL distributes the global N x N array over a P x Q process grid in a two-dimensional block-cyclic manner, as illustrated in Fig. 2, and performs a blocked LU factorization. Each iteration consists of a Panel Factorization, a Panel Broadcast of the factored panel of L along the process rows, and an Update of the trailing submatrix, as illustrated in Fig. 3.

Fig. 2. Block-cyclic distribution of the global array over the process grid (global array and local arrays; Pn: process number).

Fig. 3. Panel Factorization, broadcast of the L panel, and Update of the trailing submatrix.
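The essence of the block-cyclic layout in Fig. 2 is that global block (I, J) is owned by process (I mod P, J mod Q). The following C helper is an illustrative sketch of that mapping; the names and the row-major process numbering are assumptions of this sketch, not code taken from HPL.

    /* Owner of global block (I, J) on a P x Q grid of processes.          */
    /* The linear rank shown assumes row-major process numbering, which    */
    /* is an assumption of this sketch.                                    */
    typedef struct { int p, q, rank; } owner_t;

    owner_t block_owner(int I, int J, int P, int Q)
    {
        owner_t o;
        o.p    = I % P;            /* process row owning block row I       */
        o.q    = J % Q;            /* process column owning block column J */
        o.rank = o.p * Q + o.q;    /* linear rank under row-major mapping  */
        return o;
    }

    /* A global element (gi, gj) belongs to block (gi / NB, gj / NB).      */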

4.4.2 HPL tuning parameters

HPL exposes about 16 tuning parameters, including the problem size N, the block size NB, the process grid P x Q, the Panel Broadcast algorithm, the look-ahead depth, the update/swap variant (long or mix), the storage forms of L1 and U, the memory alignment, and the Panel Factorization variant. In the following sections we tune the parameters with the largest influence on performance.

5. Himeno Benchmark results

5.1 Compiler comparison

Following 4.2.1, we measured the Himeno Benchmark with array size M on 1, 2, 4, 8, and 16 CPUs, comparing the GNU Fortran Compiler 3.2 and the PGI Fortran Compiler 5.0 over the possible grid divisions. Figure 4 shows the performance of each division for each CPU count, and Fig. 5 shows the performance difference between the two compilers as the number of CPUs increases.

Fig. 4. Himeno Benchmark performance [MFlops] for each grid division with the GNU and PGI compilers on (a) 1, (b) 2, (c) 4, (d) 8, and (e) 16 CPUs.

Fig. 5. Difference of performance [MFlops] between the PGI and GNU compilers versus the number of CPUs.

In all cases the PGI compiler outperforms the GNU compiler, and Fig. 5 shows that the gap widens as the number of CPUs increases. With the PGI compiler, the best grid divisions (x y z) in Fig. 4 were (1 1 1) for 1 CPU, (2 1 1) for 2 CPUs, (2 1 2) for 4 CPUs, (2 2 2) for 8 CPUs, and (4 2 2) for 16 CPUs.
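For reference, a grid division (x y z) = (px py pz) simply splits each dimension of the global array among that many processes. The C fragment below is a sketch with assumed names, not code from the benchmark; it computes the local subdomain each process works on for the M-size array under the best 16-CPU division.

    #include <stdio.h>

    /* Local extent when a dimension of size n is split among p processes.  */
    /* A ceiling split is used here; the benchmark's own decomposition may  */
    /* differ in how it handles remainders and halo cells.                  */
    static int local_extent(int n, int p) { return (n + p - 1) / p; }

    int main(void)
    {
        int nx = 256, ny = 128, nz = 128;   /* array size M (Table 2)       */
        int px = 4,   py = 2,   pz = 2;     /* best division for 16 CPUs    */

        printf("local block: %d x %d x %d per process (%d processes)\n",
               local_extent(nx, px), local_extent(ny, py),
               local_extent(nz, pz), px * py * pz);
        return 0;
    }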

5.2 Compiler option tuning

We then tuned the optimization options of the PGI compiler. The options examined include the optimization levels -O1 through -O3, -finline-functions, -fast, the vectorization options -Mvect=assoc and -Mvect=cachesize, and the cache-related options -Mcache_align and -Mnontemporal (non-temporal prefetch). The measurements used array size M on 16 CPUs with the grid division (4 2 2) found best in 5.1; the results are shown in Table 3. The combination -fast -Mvect=assoc -O2 gave the best performance of 11271 MFlops. As of December 2003, this result placed Supernova first among PC clusters in the RIKEN Himeno BMT ranking.

Table 3. Himeno Benchmark performance for each compile option (16 CPUs, array size M).
  Compile Option                    Performance [MFlops]
  None                              9911
  -fast -Mvect=assoc -O1            6610
  -fast -Mvect=assoc -O2            11271
  -Mcache_align -Mnontemporal       9661
  -Mvect=cachesize:1048576 -O3      10918
  -Mvect=assoc,cachesize:1048576    11209

6. HPL tuning parameters

Among the HPL parameters listed in 4.4.2, we tune the following.

6.1 Problem size N

N is the order of the coefficient matrix. Larger values of N generally give higher performance, but N is limited by the amount of memory available on the system.

6.2 Block size NB

NB is the block size used for the block-cyclic distribution and the blocked computation; it determines both the granularity of the data distribution and the cache behavior of the kernels.

6.3 Process grid P x Q

P and Q determine the shape of the process grid over which the matrix is distributed. For the same number of processes, different (P Q) shapes lead to different communication patterns and therefore different performance.

6.4 Panel Broadcast

The Panel Broadcast algorithm distributes the factored panel along each process row. HPL provides three basic topologies, Increasing-1ring, Increasing-2ring, and Bandwidth-reducing (Long), each in a normal and a modified variant, giving six choices in total. The normal and modified variants differ in how the Update and the next Panel Factorization are scheduled on the process that owns the next panel, the modified variants allowing the next Panel Factorization to begin earlier.
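Before turning to the measurements, note how HPL performance figures are obtained: the operation count of Eq. (6) is divided by the wall-clock time of the solve. The C sketch below illustrates only this conversion; the lower-order N^2 term is dropped, and the runtime in the example is hypothetical, chosen purely to show the arithmetic rather than taken from any measurement in this paper.

    #include <stdio.h>

    /* Delivered performance from Eq. (6): flops ~ (2/3) N^3 + O(N^2).      */
    /* The O(N^2) term is omitted; for the N values used in this paper it   */
    /* changes the result only marginally.                                  */
    double hpl_gflops(double N, double seconds)
    {
        double flops = (2.0 / 3.0) * N * N * N;
        return flops / seconds / 1.0e9;
    }

    int main(void)
    {
        /* Hypothetical example: if a run with N = 220000 finished in about */
        /* 6070 s, it would correspond to roughly 1.17 TFlops.              */
        printf("%.1f GFlops\n", hpl_gflops(220000.0, 6070.0));
        return 0;
    }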

7. HPL measurements

7.1 Block size NB

The HPL documentation recommends a block size NB in the range 32 to 256 14). Since the optimal NB depends on the ATLAS kernels generated for the CPU, we first compared NB values that are multiples of 24 and of 28. The conditions were N = 10000, BCAST = 1ring, gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops, and atlas-3.5.6; the results on 1 and 2 CPUs are shown in Fig. 6.

Fig. 6. Performance for NB values that are multiples of 24 (48, 72, 96, 120) and of 28 (56, 84, 112, 140) on (a) 1 CPU and (b) 2 CPUs.

Figure 6 shows that multiples of 28 perform better than multiples of 24, so we then measured multiples of 28 up to 252, as shown in Fig. 7.

Fig. 7. Performance for NB = 56, 84, 112, 140, 168, 196, 224, and 252 on (a) 1 CPU and (b) 2 CPUs.

In Fig. 7, NB = 224 gives the best performance among the multiples of 28, so NB = 224 is used in the following measurements.

7.2 Panel Broadcast

Next, the Panel Broadcast topologies were compared under the conditions of Table 4; the results are shown in Fig. 8. The compiler was gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops, and the BLAS library was atlas-3.5.6.

Table 4. Conditions of the BCAST comparison.
           64 CPUs   128 CPUs   256 CPUs   512 CPUs
  N        80000     110000     160000     220000
  NB       224       224        224        224
  (P Q)    (8 8)     (8 16)     (16 16)    (16 32)

Fig. 8. Performance of each Panel Broadcast topology (1rg, 1rM, 2rg, 2rM, Lng, LnM) on (a) 64, (b) 128, (c) 256, and (d) 512 CPUs.

Figure 8 shows that as the number of CPUs increases, the modified variants outperform the normal ones. On Supernova, the Long bandwidth-reducing modified broadcast (LnM) gives the best performance at large CPU counts, and it is used for the final measurement in 7.5.

7.3 BLAS library

We next compared the ATLAS and goto-library BLAS implementations under the conditions of Table 5; the results are shown in Fig. 9. The compiler was gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops.

Table 5. Conditions of the BLAS library comparison.
           4 CPUs   8 CPUs   16 CPUs   32 CPUs   64 CPUs
  N        20000    28000    40000     56000     80000
  NB       224      224      224       224       224
  (P Q)    (2 2)    (2 4)    (4 4)     (4 8)     (8 8)
  library  atlas-3.5.6 / libgoto_opteron-r0.7.so
  BCAST    Increasing-1ring

Fig. 9. (a) Performance of atlas-3.5.6 and libgoto_opteron-r0.7.so versus the number of CPUs; (b) difference of performance between the two libraries.

Figure 9(a) shows that the goto-library outperforms ATLAS, and Fig. 9(b) shows that the difference grows as the number of CPUs increases. The goto-library is therefore used in the following measurements.

7.4 Process grid

Following 6.3, we compared different shapes of the process grid (P Q) on 512 CPUs under the conditions of Table 6; the results are shown in Fig. 10. The compiler was gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops.

Table 6. Conditions of the process grid comparison.
  N        200000
  NB       224
  (P Q)    (1 512), (2 256), (4 128), (8 64), (16 32)
  library  libgoto_opteron-r0.7.so
  BCAST    Increasing-1ring

Fig. 10. Performance for each process grid shape on 512 CPUs.

Figure 10 shows that (P Q) = (16 32) gives the best performance, while the flat grids (1 512) and (2 256) perform far worse. A nearly square process grid balances the row-wise and column-wise communication in HPL, whereas an extremely flat grid concentrates the communication in one direction.

7.5 Problem size N

Following 6.1, the problem size N was chosen so that the coefficient matrix occupies about 80% of the total memory of Supernova. With 512 GB of total memory and 8-byte elements, this corresponds to an upper bound of approximately N = 226274. We therefore measured N from 200000 to 220000 in steps of 5000.
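The quoted limit of about 226274 can be reproduced with a few lines of C. This is only a check of the arithmetic; it assumes the 512 GB of Table 1 is counted as 512 x 10^9 bytes and that each matrix element is an 8-byte double, which is the interpretation that reproduces the figure above.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double mem_bytes = 512.0e9;   /* 512 GB total memory (Table 1)      */
        double fraction  = 0.8;       /* use about 80% of memory for A      */
        double elem_size = 8.0;       /* double precision, 8 bytes          */

        /* A is N x N, so N = sqrt(usable bytes / element size). */
        double n_max = sqrt(mem_bytes * fraction / elem_size);
        printf("N_max ~= %.0f\n", n_max);   /* prints about 226274 */
        return 0;
    }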

The measurement conditions were NB = 224, (P Q) = (16 32), BCAST = LnM, and gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops; the results are shown in Fig. 11.

Fig. 11. HPL performance for N = 200000 to 220000.

Figure 11 shows that performance increases with N and is highest at N = 220000, so N = 220000 is adopted. The final parameter set used for the HPL measurement on Supernova is summarized in Table 8.

Table 8. Parameters of the final HPL measurement.
  N        220000
  NB       224
  (P Q)    (16 32)
  BCAST    Long bandwidth-reducing, modified
  library  libgoto_opteron-r0.7.so

8. Network comparison

As described in 3.2, Supernova uses the Force10 E1200 switch. To examine the influence of the network, we compared the E1200 with a NETGEAR 15) Gigabit Switch GS524T, measuring HPL from 1 CPU to 512 CPUs under the conditions of Table 7. The compiler was gcc 3.2 with -fomit-frame-pointer -O3 -funroll-loops and the BLAS library was atlas-3.5.6. The results are shown in Fig. 12.

Table 7. Conditions of the network comparison.
           1cpu    2cpu    4cpu    8cpu    16cpu   32cpu   64cpu   128cpu   256cpu   512cpu
  N        14000   20000   28000   40000   56000   80000   113000  160000   220000   220000
  NB       224     224     224     224     224     224     224     224      224      224
  (P Q)    (1 1)   (1 2)   (2 2)   (2 4)   (4 4)   (4 8)   (8 8)   (8 16)   (16 16)  (16 32)
  BCAST    Increasing-1ring
  library  atlas-3.5.6

Fig. 12. (a) HPL performance [GFlops] with the GS524T and the E1200 from 1 to 512 CPUs; (b) difference of performance between the two switches.

Figure 12(a) compares the E1200 and the GS524T, and Fig. 12(b) shows the difference in performance as the number of CPUs increases. The difference becomes noticeable from 128 CPUs onward, where the E1200 clearly outperforms the GS524T, and the gap widens with the number of CPUs. This suggests that an inexpensive switch such as the GS524T is sufficient for smaller PC clusters, while a cluster of this scale benefits from a high-capacity switch such as the E1200.

9. Conclusions

We constructed the PC cluster system Supernova and evaluated it with the Himeno Benchmark and the High-Performance LINPACK (HPL) Benchmark. For the Himeno Benchmark, the PGI compiler outperformed the GNU compiler, and tuning the compile options further improved performance. For HPL, the goto-library outperformed ATLAS, and tuning the block size, panel broadcast, process grid, and problem size was essential to obtain good performance on a PC cluster.

With 16 CPUs, Supernova achieved 11271 MFlops on the Himeno Benchmark; as of December 2003, this placed Supernova first among PC clusters in the RIKEN Himeno BMT ranking. With HPL, Supernova achieved 1.169 TFlops, 63.4% of its theoretical peak performance, which ranked 93rd in the November 2003 TOP500 list.

References

1) Rajkumar Buyya. High Performance Cluster Computing: Architecture and Systems, Vol. 1. Prentice Hall, 1999.
2) Rajkumar Buyya. High Performance Cluster Computing: Programming and Applications, Vol. 2. Prentice Hall, 1999.
3) TOP500 Supercomputer Sites. http://www.top500.org/.
4) T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake, and C. V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the 24th International Conference on Parallel Processing, pp. 11-14, 1995.
5) Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawake, and Charles V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the International Conference on Parallel Processing, 1995.
6) T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. MIT Press, 1999.
7) PC Cluster Consortium. http://pdswww.rwcp.or.jp/.
8) H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An operating system coordinated high performance communication library. In High-Performance Computing and Networking '97, pp. 708-717, 1997.
9) HyperTransport Consortium. http://www.hypertransport.org/.
10) InfiniBand Trade Association Home Page. http://www.infinibandta.org/.
11) Himeno Benchmark Home Page. http://www.w3cic.riken.go.jp/hpc/HimenoBMT/index.html.
12) The LINPACK Benchmark. http://www.netlib.org/benchmark/top500/lists/linpack.html.
13) The NAS Parallel Benchmarks Home Page. http://www.nas.nasa.gov/software/npb/.
14) HPL Algorithm. http://www.netlib.org/benchmark/hpl/algorithm.html.
15) NETGEAR Home Page. http://www.netgear.com/.