THE SCIENCE AND ENGINEERING DOSHISHA UNIVERSITY, VOL. XX, NO. Y, NOVEMBER 2003

Construction of a Tera-Flops PC Cluster System and Evaluation of Its Performance Using Benchmarks

Tomoyuki HIROYASU*, Mitsunori MIKI* and Hiroshi ARAKUTA**

(Received October 6, 2004)

As the problems to be solved become more complicated and diverse with the development of technology, the demand for high-performance computers keeps increasing. To meet this demand, PC cluster systems have attracted a great deal of attention in recent years as an alternative to vector supercomputers. A PC cluster system consists of many PCs connected by a network and is used for parallel or distributed computation. In scientific and engineering fields, HPC cluster systems are attractive for computationally intensive tasks. We constructed an HPC cluster system called Supernova with a target performance of 1 Tera Flops. Supernova is composed of 256 nodes and runs the Linux operating system. We evaluated the cluster using the High-Performance LINPACK benchmark and the Himeno benchmark. In this paper, we present the measured performance and the parameter combinations obtained with these benchmarks on Supernova. Through the experiments we gained knowledge about tuning the benchmark parameters for better performance and about the construction of PC cluster systems.

Key words: PC cluster, Himeno benchmark, LINPACK benchmark, Linux PC

1. Introduction

With the development of science and technology, the problems to be solved have become larger, more complicated, and more diverse, and the demand for high-performance computers keeps increasing. Instead of conventional vector supercomputers, PC cluster systems, in which many commodity PCs are connected by a network and used for parallel computation, have attracted a great deal of attention in recent years 1, 2). The TOP500 Supercomputer Sites list 3), which ranks the 500 fastest computer systems in the world, now contains a growing number of such cluster systems.

In this study we constructed a PC cluster system called Supernova with a target performance of 1 TFlops and evaluated it with the Himeno benchmark and the High-Performance LINPACK (HPL) benchmark. This paper describes the construction of Supernova and reports the performance and the parameter combinations obtained with these benchmarks.

* Department of Knowledge Engineering and Computer Sciences, Doshisha University, Kyoto
  Telephone: +81-774-65-6930, Fax: +81-774-65-6796, E-mail: tomo@is.doshisha.ac.jp, mmiki@mail.doshisha.ac.jp
** Graduate Student, Department of Knowledge Engineering and Computer Sciences, Doshisha University, Kyoto
  Telephone: +81-774-65-6716, Fax: +81-774-65-6716, E-mail: arakuta@mikilab.doshisha.ac.jp
2. PC Cluster Systems

A PC cluster system is built from many commodity PCs, each with its own CPU and operating system, connected by a network and used together as a single parallel computer. PC clusters are roughly divided into HPC (High Performance Computing) clusters, which are used for computationally intensive parallel processing, and HA (High Availability) clusters, which keep a service running by having another node take over when a node fails. In this study we deal with HPC clusters.

A representative type of HPC cluster is the Beowulf cluster, which originated at NASA in the 1990s 4, 5, 6). A Beowulf cluster is built from commodity PCs running a free operating system such as Linux or FreeBSD and uses a message-passing library such as MPI or PVM for parallel programming. Another type of HPC cluster is the SCore cluster 7, 8). SCore is cluster system software developed by the RWCP (Real-World Computing Partnership); it runs on Linux PCs and, unlike a Beowulf cluster, provides its own high-performance communication library 8) instead of relying only on the standard TCP/IP stack.

3. Supernova Cluster System

We constructed the Supernova cluster system with a target peak performance of 1 TFlops; the system was completed in September 2003. Fig. 1 shows the Supernova cluster system. Supernova uses the AMD 64-bit CPU Opteron and contains 512 CPUs in total. Table 1 shows the specification of Supernova.

Fig. 1. Supernova Cluster System.

Table 1. Specification of Supernova.

  Nodes     256
  CPU       AMD Opteron 1.8 GHz, 512 CPUs in total (2 per node)
  Memory    2 GB per node, 256 nodes (512 GB in total)
  OS        Turbolinux for AMD64
  MPI       mpich-1.2.5 (TCP/IP)
  Network   Gigabit Ethernet
Each Opteron has floating-point units that can complete two floating-point operations per clock cycle (one addition and one multiplication). The theoretical peak performance of Supernova is therefore given by Eq. (1),

  R_{peak} = \#\mathrm{CPU} \times \mathrm{ClockFrequency} \times \#\mathrm{FLOP/clock}   (1)

which for 512 CPUs at 1.8 GHz gives 512 x 1.8 GHz x 2 = 1.8432 TFlops.

3.1 Opteron

Supernova uses the AMD Opteron processor. The Opteron is a 64-bit processor with an on-chip memory controller, and it connects to I/O devices and to other processors through HyperTransport links 9). Each Opteron provides three HyperTransport links, giving an aggregate bandwidth of 19.2 GB/s.

3.2 Network switch

For the interconnection network of Supernova we adopted the E1200 switch from Force10 Networks. The E1200 has a switching capacity of 1.44 Tbps and accommodates up to 336 Gigabit Ethernet ports, so all 256 nodes of Supernova can be connected to a single switch.

3.3 Interconnect

As described in 3.1, the Opteron provides high memory and inter-processor bandwidth, so the interconnect between nodes can become the bottleneck of a cluster. PC clusters commonly use Fast Ethernet (100 Mbps) or Gigabit Ethernet (1 Gbps), while dedicated interconnects such as Myricom's Myrinet and InfiniBand 10) provide higher bandwidth and lower latency at a higher cost. Supernova adopts Gigabit Ethernet.
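The bandwidth and latency that an MPI program actually obtains over the chosen interconnect are commonly estimated with a simple ping-pong test between two nodes. The sketch below is only an illustration of such a measurement and is not part of the benchmarks used in this paper; the message size, repetition count, and output format are arbitrary choices.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal MPI ping-pong between rank 0 and rank 1: a rough way to estimate
 * the latency and bandwidth of the cluster interconnect as seen by MPI. */
int main(int argc, char **argv)
{
    const int msg_bytes = 1 << 20;      /* 1 MB message (illustrative) */
    const int reps = 100;
    int rank, size;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    buf = malloc(msg_bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("round trip: %.3f us, bandwidth: %.1f MB/s\n",
               (t1 - t0) / reps * 1e6,
               2.0 * msg_bytes * reps / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}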
4. Benchmarks

4.1 Benchmarks for PC clusters

To evaluate the performance of a PC cluster, benchmark programs such as the Himeno benchmark 11), LINPACK 12), and the NAS Parallel Benchmarks 13) are commonly used. In this study we evaluate Supernova with the Himeno benchmark and with the LINPACK benchmark.

4.2 Himeno benchmark

The Himeno benchmark measures the speed of the main loop that solves a Poisson equation by the Jacobi iteration method.

4.2.1 Problem sizes

The Himeno benchmark provides the five problem sizes listed in Table 2.

Table 2. Problem sizes of the Himeno benchmark.

  Array size   Number of array elements
  XS           65 x 33 x 33
  S            128 x 64 x 64
  M            256 x 128 x 128
  L            512 x 256 x 256
  XL           1024 x 512 x 512

In the parallel version, the three-dimensional grid is divided in the x, y, and z directions and the subdomains are assigned to the CPUs; the number of divisions in each direction is specified by the user.

4.2.2 Poisson equation and the Jacobi method

The Poisson equation is given by Eq. (2):

  \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = f(x, y, z)   (2)

Discretizing Eq. (2) with central differences gives Eq. (3):

  f_{i,j,k} = \frac{u_{i+1,j,k} - 2u_{i,j,k} + u_{i-1,j,k}}{\Delta x^2}
            + \frac{u_{i,j+1,k} - 2u_{i,j,k} + u_{i,j-1,k}}{\Delta y^2}
            + \frac{u_{i,j,k+1} - 2u_{i,j,k} + u_{i,j,k-1}}{\Delta z^2}   (3)

Assuming \Delta x = \Delta y = \Delta z and solving Eq. (3) for u_{i,j,k} gives Eq. (4):

  u_{i,j,k} = \frac{1}{6}\left[ u_{i+1,j,k} + u_{i-1,j,k} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i,j,k+1} + u_{i,j,k-1} - (\Delta x)^2 f_{i,j,k} \right]   (4)

Iterative methods for computing u_{i,j,k} from Eq. (4) include the Gauss-Seidel method and the Jacobi method; the Himeno benchmark uses the Jacobi method. Writing the iteration count as m, the Jacobi update is given by Eq. (5):

  u^{m+1}_{i,j,k} = \frac{1}{6}\left[ u^m_{i+1,j,k} + u^m_{i-1,j,k} + u^m_{i,j+1,k} + u^m_{i,j-1,k} + u^m_{i,j,k+1} + u^m_{i,j,k-1} - (\Delta x)^2 f_{i,j,k} \right]   (5)

The Himeno benchmark repeatedly applies the update of Eq. (5) and reports the speed of this main loop.
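To make the structure of this main loop concrete, the following is a minimal serial C sketch of one Jacobi sweep corresponding to Eq. (5). It is only an illustration: the actual Himeno benchmark kernel uses additional coefficient arrays and, in the parallel version, an MPI domain decomposition; the array dimensions (taken from the S size in Table 2) and the function name are chosen for this sketch.

#include <stddef.h>

#define NX 128
#define NY 64
#define NZ 64   /* illustrative: corresponds to the "S" size in Table 2 */

/* One Jacobi sweep for the discretized Poisson equation (Eq. (5)):
 * u_new = (1/6) * (sum of the six neighbours of u_old - dx^2 * f).
 * Boundary points are left unchanged. */
static void jacobi_sweep(const double u_old[NX][NY][NZ],
                         double       u_new[NX][NY][NZ],
                         const double f[NX][NY][NZ],
                         double dx)
{
    const double dx2 = dx * dx;
    for (size_t i = 1; i < NX - 1; ++i)
        for (size_t j = 1; j < NY - 1; ++j)
            for (size_t k = 1; k < NZ - 1; ++k)
                u_new[i][j][k] = (u_old[i + 1][j][k] + u_old[i - 1][j][k] +
                                  u_old[i][j + 1][k] + u_old[i][j - 1][k] +
                                  u_old[i][j][k + 1] + u_old[i][j][k - 1] -
                                  dx2 * f[i][j][k]) / 6.0;
}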
4.3 LINPACK benchmark

The LINPACK benchmark was developed by J. Dongarra of the University of Tennessee. It measures the floating-point performance of solving a dense system of linear equations by LU decomposition, is written in Fortran, and performs its computations with the BLAS (Basic Linear Algebra Subprograms). There are three categories of the LINPACK benchmark. In the N = 100 category, a 100 x 100 system is solved with the routines DGEFA and DGESL (SGEFA and SGESL in single precision), which perform the LU factorization and the solution for x, and the source code may not be modified. The "Toward Peak Performance" category uses N = 1000. The "Highly Parallel Computing" category, which is used for the TOP500 ranking, allows any problem size and any implementation; HPL is a freely available implementation for this category. The LINPACK performance is obtained by dividing the operation count of Eq. (6) by the measured execution time:

  \mathrm{number\ of\ floating\mbox{-}point\ operations} = \frac{2}{3}N^3 + O(N^2)   (6)

4.3.1 LU decomposition

Consider solving the system of linear equations (7) for an n x n matrix A and an n-dimensional vector b:

  Ax = b   (7)

A is decomposed into the product of a lower triangular matrix L and an upper triangular matrix U:

  A = LU   (8)

where L and U are n x n matrices whose elements satisfy

  L_{ij} = 0 \quad (i < j)   (9)

  U_{ij} = 0 \quad (i > j)   (10)

With this decomposition, Eq. (7) becomes

  Ax = LUx = b   (11)

Putting y = Ux, we first solve

  Ly = b   (12)

for y = (y_1, y_2, ..., y_n) by forward substitution, computing y_1, y_2, ..., y_n in turn with O(n^2) operations, and then solve

  Ux = y   (13)

for x by back substitution, computing x_n, x_{n-1}, ..., x_1 with O(n^2) operations.

4.4 High-Performance LINPACK benchmark

HPL (High-Performance LINPACK) is an implementation of the Highly Parallel Computing LINPACK benchmark. HPL performs its local computations with the BLAS, so the choice of BLAS library strongly affects the performance; optimized libraries such as ATLAS (Automatically Tuned Linear Algebra Software) and the goto-library can be used.

4.4.1 Algorithm of HPL

HPL distributes the N x N coefficient matrix in NB x NB blocks onto a P x Q process grid in a two-dimensional block-cyclic manner, as illustrated in Fig. 2. The factorization proceeds in steps of NB columns: in each step the current panel of columns is factored (panel factorization), the factored panel of L is broadcast along the process rows (panel broadcast), and the trailing part of the matrix, including the corresponding block row of U, is updated (update), as illustrated in Fig. 3. After the factorization, the solution x is obtained by the substitutions of Eqs. (12) and (13).

Fig. 2. Two-dimensional block-cyclic distribution of the global array onto the process grid (Pn: process number).

Fig. 3. Broadcast of the L panel and update of the trailing matrix.
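The two substitutions of Eqs. (12) and (13) can be written compactly. The following is a minimal serial C sketch (dense row-major storage, no pivoting; the function and parameter names are illustrative and this is not HPL's distributed implementation):

#include <stddef.h>

/* Solve L y = b by forward substitution (L lower triangular, stored
 * row-major in L[i*n + j]), then solve U x = y by back substitution.
 * Each solve costs O(n^2) operations, while the LU factorization itself
 * costs (2/3) n^3 + O(n^2). */
void forward_subst(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t j = 0; j < i; ++j)
            s -= L[i * n + j] * y[j];
        y[i] = s / L[i * n + i];
    }
}

void back_subst(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {      /* x_n is computed first, then x_{n-1}, ... */
        double s = y[i];
        for (size_t j = i + 1; j < n; ++j)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}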
4.4.2 Tuning parameters of HPL

HPL has 16 tuning parameters, which are given in its input file. The main ones are the problem size N, the block size NB, the process grid P x Q, the panel broadcast algorithm, the look-ahead depth, the swap algorithm used in the update (binary-exchange, long, or a mix of the two), whether the L1 and U panels are stored in transposed form, the memory alignment, and the panel factorization algorithm.

5. Results of the Himeno Benchmark

5.1 Comparison of compilers and grid divisions

Using the size-M problem described in 4.2.1, we measured the Himeno benchmark on 1, 2, 4, 8, and 16 CPUs, compiling the code with GNU Fortran Compiler 3.2 and with PGI Fortran Compiler 5.0 and varying the division of the grid in the (x, y, z) directions. Fig. 4 shows the performance for every division of the grid, and Fig. 5 shows the difference in performance between the two compilers as the number of CPUs increases.

Fig. 4. Performance of the Himeno benchmark (size M) for each division of the grid with the GNU and PGI compilers: (a) 1 CPU, (b) 2 CPUs, (c) 4 CPUs, (d) 8 CPUs, (e) 16 CPUs (vertical axis: Performance [MFlops]; horizontal axis: Division of Grid).

Fig. 5. Difference in performance between the PGI and GNU compilers versus the number of CPUs (vertical axis: Difference of Performance [MFlops]).

As Fig. 4 shows, the PGI compiler gave higher performance than the GNU compiler for every division of the grid, and Fig. 5 shows that the difference grew as the number of CPUs increased. With the PGI compiler, the best divisions of the grid (x, y, z) in Fig. 4 were (1, 1, 1) for 1 CPU, (2, 1, 1) for 2 CPUs, (2, 1, 2) for 4 CPUs, (2, 2, 2) for 8 CPUs, and (4, 2, 2) for 16 CPUs.
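The divisions of the grid compared in Fig. 4 correspond to a three-dimensional decomposition of the Himeno grid over the processes; the benchmark itself takes the number of divisions in each direction as an input. The sketch below merely illustrates, with the standard MPI routines MPI_Dims_create and MPI_Cart_create, how such a near-uniform 3-D process grid can be generated; the program and its output are illustrative only.

#include <mpi.h>
#include <stdio.h>

/* Build a 3-D process grid (x, y, z divisions), such as the (2, 2, 2)
 * decomposition used for 8 CPUs in Fig. 4, and report this rank's
 * position in the grid. */
int main(int argc, char **argv)
{
    int nprocs, rank, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
    MPI_Comm grid_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Dims_create(nprocs, 3, dims);               /* e.g. 8 -> 2 x 2 x 2 */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &grid_comm);
    MPI_Cart_coords(grid_comm, rank, 3, coords);

    if (rank == 0)
        printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);
    printf("rank %d -> (%d, %d, %d)\n", rank, coords[0], coords[1], coords[2]);

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}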
5.2 Compiler options

Since the PGI compiler gave the higher performance in 5.1, we next examined the effect of its optimization options, including the optimization levels -O1, -O2, and -O3, the aggregate option -fast, the vectorization options -Mvect=assoc and -Mvect=cachesize:n, and the options -Mcache_align and -Mnontemporal, which control cache alignment and non-temporal prefetching. The measurement used the size-M problem on 16 CPUs with the grid division (4, 2, 2) found best in 5.1. Table 3 shows the results: the combination -fast -Mvect=assoc -O2 gave the best performance of 11271 MFlops. This result was registered in RIKEN's Himeno BMT ranking in December 2003, where Supernova placed first among PC clusters.

Table 3. Performance of the Himeno benchmark with different PGI compiler options (size M, 16 CPUs).

  Compile option                       Performance [MFlops]
  None                                 9911
  -fast -Mvect=assoc -O1               6610
  -fast -Mvect=assoc -O2               11271
  -Mcache_align -Mnontemporal          9661
  -Mvect=cachesize:1048576 -O3         10918
  -Mvect=assoc,cachesize:1048576       11209

6. Tuning Parameters of HPL

As described in 4.4.2, HPL has many tuning parameters. This section describes the parameters examined in our experiments.

6.1 Problem size N

N is the order of the coefficient matrix. In general, the larger N is, the higher the HPL performance becomes, but N is limited by the amount of main memory: if the matrix does not fit in memory, swapping occurs and the performance drops sharply.

6.2 Block size NB

NB is the size of the blocks used both for the block-cyclic data distribution and for the computational granularity. It affects the load balance and the efficiency of the BLAS routines.

6.3 Process grid P x Q

P x Q is the shape of the process grid onto which the matrix is mapped. The ratio of P to Q determines the communication pattern; HPL is generally said to perform best with a grid that is close to square, with P not larger than Q.

6.4 Panel broadcast and panel factorization

The panel broadcast algorithm determines how a factored panel is broadcast along each process row. HPL provides three topologies, increasing-1ring, increasing-2ring, and bandwidth-reducing (long), each in a normal and a modified version, giving six algorithms in total. In the normal versions every process simply takes part in the broadcast and then continues with the update and the next panel factorization, whereas the modified versions give priority to the process that holds the next panel so that its panel factorization can be overlapped with the remaining update. The panel factorization itself also has several variants.
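As an illustration of the simplest of these topologies, the sketch below shows what an increasing-1ring broadcast of a factored panel along a process row looks like at the MPI level. This is a hypothetical sketch, not HPL's code: HPL realizes its broadcasts with non-blocking point-to-point operations and also provides the 2-ring, bandwidth-reducing, and modified variants described above.

#include <mpi.h>

/* Increasing-ring broadcast of a factored panel along a process row:
 * the root sends to the next process, and each process forwards the
 * panel to its successor in the row communicator. */
static void ring_bcast(double *panel, int count, int root, MPI_Comm row_comm)
{
    int rank, nprocs;
    MPI_Comm_rank(row_comm, &rank);
    MPI_Comm_size(row_comm, &nprocs);
    if (nprocs == 1) return;

    int next = (rank + 1) % nprocs;
    int prev = (rank - 1 + nprocs) % nprocs;

    if (rank == root) {
        MPI_Send(panel, count, MPI_DOUBLE, next, 0, row_comm);
    } else {
        MPI_Recv(panel, count, MPI_DOUBLE, prev, 0, row_comm, MPI_STATUS_IGNORE);
        if (next != root)                      /* last process does not forward */
            MPI_Send(panel, count, MPI_DOUBLE, next, 0, row_comm);
    }
}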
7. Parameter Experiments with HPL

In this section we examine the influence of the HPL tuning parameters on the performance of Supernova.

7.1 Block size NB

The HPL documentation recommends a block size NB in the range of roughly 32 to 256 14). Because the ATLAS BLAS is blocked for the cache of the CPU, we first compared block sizes that are multiples of 24 and of 28 (48, 56, 72, 84, 96, 112, 120, and 140, i.e. 24n and 28n for n = 2 to 5) on 1 and 2 CPUs with N = 10000, BCAST = 1ring, compile options gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops, and atlas-3.5.6. Fig. 6 shows the results.

Fig. 6. HPL performance for block sizes that are multiples of 24 and of 28: (a) 1 CPU, (b) 2 CPUs.

As Fig. 6 shows, the multiples of 28 gave higher performance than the multiples of 24. We therefore compared the multiples of 28 from 56 to 252 under the same conditions; Fig. 7 shows the results.

Fig. 7. HPL performance for block sizes that are multiples of 28 (56 to 252): (a) 1 CPU, (b) 2 CPUs.

From Fig. 7, the best performance was obtained around NB = 224, which is a multiple of both 28 and 56, so NB = 224 is used in the following experiments.

7.2 Panel broadcast algorithm

We next compared the six panel broadcast algorithms described in 6.4. Table 4 shows the measurement conditions for 64 to 512 CPUs; the code was compiled with gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops and linked with atlas-3.5.6. Fig. 8 shows the results.

Table 4. Measurement conditions for the comparison of panel broadcast (BCAST) algorithms.

           64 CPU   128 CPU   256 CPU   512 CPU
  N        80000    110000    160000    220000
  NB       224      224       224       224
  P x Q    (8, 8)   (8, 16)   (16, 16)  (16, 32)

Fig. 8. HPL performance with each panel broadcast algorithm (1rg, 1rM, 2rg, 2rM, Lng, LnM): (a) 64 CPUs, (b) 128 CPUs, (c) 256 CPUs, (d) 512 CPUs.

As Fig. 8 shows, at every CPU count the modified versions tended to give higher performance than the corresponding normal versions, and as the number of CPUs increased, the bandwidth-reducing (Long) modified algorithm became the best. On Supernova we therefore adopt the Long bandwidth-reducing modified broadcast.
7.3 BLAS library

We compared HPL linked with ATLAS and with the goto-library. Table 5 shows the main measurement conditions, and Fig. 9 shows the results for 4 to 512 CPUs; the code was compiled with gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops.

Table 5. Measurement conditions for the comparison of BLAS libraries.

           4 CPU    8 CPU    16 CPU   32 CPU   64 CPU
  N        20000    28000    40000    56000    80000
  NB       224      224      224      224      224
  P x Q    (2, 2)   (2, 4)   (4, 4)   (4, 8)   (8, 8)
  library  atlas-3.5.6, libgoto_opteron-r0.7.so
  BCAST    Increasing-1ring

Fig. 9. Comparison of the ATLAS and goto BLAS libraries: (a) HPL performance [GFlops] versus the number of CPUs; (b) difference in performance.

Fig. 9(a) shows the performance with atlas-3.5.6 and with libgoto_opteron-r0.7.so, and Fig. 9(b) shows the difference between them. The goto-library gave higher performance than ATLAS, and the difference grew as the number of CPUs increased.

7.4 Process grid P x Q

As described in 6.3, the shape of the process grid affects the performance. Using all 512 CPUs, we compared the process grids (1, 512), (2, 256), (4, 128), (8, 64), and (16, 32). Table 6 shows the measurement conditions and Fig. 10 the results; the code was compiled with gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops.

Table 6. Measurement conditions for the comparison of process grids.

  N        200000
  NB       224
  P x Q    (1, 512), (2, 256), (4, 128), (8, 64), (16, 32)
  library  libgoto_opteron-r0.7.so
  BCAST    Increasing-1ring

Fig. 10. HPL performance [GFlops] for each process grid on 512 CPUs.

As Fig. 10 shows, the nearly square grid (16, 32) gave the best performance, while the flat grids (1, 512) and (2, 256) gave much lower performance. We therefore use (P, Q) = (16, 32) on Supernova.

7.5 Problem size N

As described in 6.1, a larger N generally gives higher HPL performance as long as the coefficient matrix fits in main memory. If 80% of Supernova's total memory of 512 GB is used for the coefficient matrix, the corresponding problem size is N = 226274.
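The value N = 226274 follows from this 80% rule with 8-byte (double-precision) matrix elements; counting the 512 GB of Table 1 as 512 x 10^9 bytes, the arithmetic is:

  N = \sqrt{\frac{0.8 \times 512 \times 10^{9}}{8}} = \sqrt{5.12 \times 10^{10}} \approx 226274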
We therefore measured HPL for N from 200000 to 220000 in steps of 5000, with NB = 224, (P, Q) = (16, 32), the Long-modified (LnM) broadcast, and compile options gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops. Fig. 11 shows the results.

Fig. 11. HPL performance [GFlops] versus problem size N (N = 200000 to 220000).

As Fig. 11 shows, the performance kept increasing up to N = 220000, so we use N = 220000 on Supernova. Table 8 summarizes the combination of HPL parameters finally adopted for Supernova.

Table 8. Best combination of HPL parameters for Supernova.

  N        220000
  NB       224
  P x Q    (16, 32)
  BCAST    Long bandwidth-reducing modified
  library  libgoto_opteron-r0.7.so
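With this configuration, the best HPL performance of Supernova reported in Section 9 is 1.169 TFlops; relative to the theoretical peak of Eq. (1), this corresponds to the execution efficiency quoted there:

  \frac{R_{max}}{R_{peak}} = \frac{1.169\ \mathrm{TFlops}}{1.8432\ \mathrm{TFlops}} \approx 0.634 = 63.4\%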
8. Influence of the Network Switch

As described in 3.2, Supernova uses the Force10 Networks E1200 switch. To examine the influence of the network switch on performance, we compared the E1200 with a NETGEAR 15) Gigabit switch GS524T, measuring HPL from 1 CPU to 512 CPUs under the conditions listed in Table 7; the code was compiled with gcc 3.2 -fomit-frame-pointer -O3 -funroll-loops and linked with atlas-3.5.6. Fig. 12 shows the results.

Table 7. Measurement conditions for the switch comparison (NB = 224, BCAST = Increasing-1ring, library = atlas-3.5.6).

  #CPU    1      2      4      8      16     32     64     128     256      512
  N       14000  20000  28000  40000  56000  80000  113000 160000  220000   220000
  P x Q   (1,1)  (1,2)  (2,2)  (2,4)  (4,4)  (4,8)  (8,8)  (8,16)  (16,16)  (16,32)

Fig. 12. Comparison of the GS524T and the E1200 switches: (a) HPL performance [GFlops] versus the number of CPUs; (b) difference in performance [GFlops].

Fig. 12(a) shows that the cluster with the E1200 attains higher performance than with the GS524T, and Fig. 12(b) shows that the difference grows as the number of CPUs increases, becoming especially large beyond 128 CPUs. The choice of network switch thus has a large influence on the performance of a large PC cluster.

9. Conclusions

In this paper we described the construction of the PC cluster system Supernova and its evaluation with the Himeno benchmark and the LINPACK (HPL) benchmark, together with the tuning of the benchmark parameters. For the Himeno benchmark, the PGI compiler gave higher performance than the GNU compiler and the choice of compiler options had a large effect. For HPL, the goto-library outperformed ATLAS, and the block size, panel broadcast algorithm, process grid, problem size, and network switch all influenced the performance.

With the Himeno benchmark, Supernova achieved 11271 MFlops on 16 CPUs; this result was registered in RIKEN's Himeno BMT ranking in December 2003, where Supernova placed first among PC clusters. With HPL, Supernova achieved 1.169 TFlops, which is 63.4% of the theoretical peak performance and corresponds to around 93rd place in the TOP500 list of November 2003. These results provide useful knowledge for the construction and parameter tuning of large PC cluster systems.

References

1) Rajkumar Buyya. High Performance Cluster Computing: Architecture and Systems, Vol. 1. Prentice Hall, 1999.
2) Rajkumar Buyya. High Performance Cluster Computing: Programming and Applications, Vol. 2. Prentice Hall, 1999.
3) TOP500 Supercomputer Sites. http://www.top500.org/.
4) T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake, and C. V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the 24th International Conference on Parallel Processing, pp. 11-14, 1995.
5) D. J. Becker, T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer. Beowulf: A parallel workstation for scientific computation. In Proceedings of the International Conference on Parallel Processing, 1995.
6) T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. MIT Press, 1999.
7) PC Cluster Consortium. http://pdswww.rwcp.or.jp/.
8) H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An operating system coordinated high performance communication library. In High-Performance Computing and Networking '97, pp. 708-717, 1997.
9) HyperTransport Consortium. http://www.hypertransport.org/.
10) InfiniBand Trade Association Home Page. http://www.infinibandta.org/.
11) Himeno Benchmark xp Home Page. http://www.w3cic.riken.go.jp/hpc/HimenoBMT/index.html.
12) The LINPACK Benchmark. http://www.netlib.org/benchmark/top500/lists/linpack.html.
13) The NAS Parallel Benchmarks Home Page. http://www.nas.nasa.gov/software/npb/.
14) HPL Algorithm. http://www.netlib.org/benchmark/hpl/algorithm.html.
15) NETGEAR Home Page. http://www.netgear.com/.