3rd Strategy Symposium: Midorikawa (public version)

May 15, 2010

(SDSM) SMS MpC DLM

Top500 Top 500 list of Supercomputers (http://www.top500.org)

Top 500 list of Supercomputers (http://www.top500.org), November 1998

SMP (Symmetric Multiprocessor) and cluster
[Figure: CPUs in an SMP and CPUs in a cluster]


[Figure: two processors exchanging data: send a / recv b and recv x / send y]

[Figure: data transfer: a = 10; b = a]

Two ways to pass data between CPUs:
1. send p0 a; receive p1 b; (message passing, e.g. MPI)
2. b=a; (shared memory, e.g. OpenMP)

Message passing:
  P0: int a; a=10; send P1 a;
  P1: int a; recv P0 a;
Shared memory:
  P0, P1: int a; a=10;   (one a, visible to both)

Copying a into b across two processors:
Message passing:
  P0: int a=10; send P1 a;
  P1: int b; recv P0 b;
Shared memory:
  P0, P1: int a=10; int b; b=a;   (a and b are visible to both)

SMS SDSM b = a x = y

(SDSM) SMS MpC DLM


[Figure: processes P0, P1, P2 running on PC 0, PC 1, PC 2 of a PC cluster]

MpC: C extended with the storage-class specifier shared
C storage-class specifiers: auto, register, static, extern, typedef; MpC adds: shared

MpC program example (PC cluster)
P0:
  shared int a;
  int b;
  main( ){ int c=10; a=20; }
P1:
  shared int a;
  int b;
  main( ){ int c=10; b=a; }

Compile: mpcc prg.mpc -o sms_prog use sms
[Figure: the MpC translator converts an MpC program into a C program for the chosen runtime, either an SDSM (SMS, TreadMarks, JIAJIA) or pthread]

MpC vs. Omni OpenMP (SCore: RWC)
floyd: shortest path search
Networks: 1Gbps Ethernet, 2Gbps Myrinet2000 (Myrinet PM)
[Figure: execution time (sec) of MpC and Omni OpenMP]
H. Midorikawa, et al.: "The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread", Proc. of IEEE/ACM Inter. Symp. on Cluster Computing and the Grid (CCGrid2005), Vol.2, pp.889-896

[Figure: execution time (sec) for ep and laplace]

[Figure: execution time (sec) for mm (blocking) and mm (nonblocking), MpC]

NPB3.0: API 74%, MpC 7%
H. Midorikawa, et al.: "The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread", Proc. of IEEE/ACM Inter. Symp. on Cluster Computing and the Grid (CCGrid2005), Vol.2, pp.889-896 (2005)
MpC, Vol.46 No.SIG4(ACS9), pp.69-85 (2005)

(SDSM) SMS MpC DLM

64-bit OS address space (x86_64: 48 bit, 256 TB)
48 bit: 256 tebibytes
56 bit: 64 pebibytes
64 bit: 16 exbibytes

DLM (Distributed Large Memory)
[Figure: on calhost, the Cal Process runs a Cal Thread and a Com Thread; memhost1, memhost2 and memhost3 each run a Memserv Process]
Run: usr_prog args -- -n 4 f hostfile
hostfile:
  calhost 2048 // 2GB
  memhost1 8192 // 8GB
  memhost2 4096 // 4GB
  memhost3 4096 // 4GB
  memhost4 4096 // 4GB

DLM program example 1: matv.c

  #include <stdio.h>
  #include <dlm.h>
  #define N 16384   // total memory just over 2 GiB (a alone is 2^31 B)

  dlm double a[N][N], x[N], y[N];   // DLM data

  int main(int argc, char *argv[])
  {
    int i, j;
    double temp;

    // initialize a
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        a[i][j] = i;

    // initialize x
    for (i = 0; i < N; i++)
      x[i] = i;

    // a[N][N]*x[N]=y[N]
    for (i = 0; i < N; i++) {
      temp = 0;
      for (j = 0; j < N; j++)
        temp += a[i][j] * x[j];
      y[i] = temp;
    }
    return 0;
  }

Compile: dlmcc matv.c -o matv
Data declared with the dlm specifier is DLM data.

10GbEthernet cluster (CSLM), swap 10GB

  Cluster:     HP DL585 G2 x 5 nodes
  Node CPU:    Opteron 2.8GHz x 4 (8 cores)
  Node memory: 64 GByte (64 GiB)
  PCI bus:     64bit/100MHz PCI-X, PCI-Express x4, PCI-Express x8
  OS:          Linux kernel 2.6.9-42 x86_64
  Compiler:    gcc version 3.4.6
  Network:     10GbEthernet
  Protocol:    TCP socket
  NIC:         Myri-10G
  Switch:      Fujitsu XG1200 (10GbE switch)
  Hard disk:   SAS 147GB 10krpm x 2, RAID1, Smart Array 5i; HP 431958-B21 (SAS 147GB, 10krpm, transfer rate 300MBps, seek time 4 (avg), 8.1 (max) ms)

hostfile:
  op4 6000 // 60GB
  op3 6000

Swap disk and DLM: 10GbEthernet, 64GB memory, 10GB swap (matv.c)
swap 15%, DLM 10; Disk, DLM 2; 67.1GB (64GiB); swap 3%, DLM
Reference: DLM, Vol.102, No.398, pp.29-34, 2007

Swap disk and DLM: 1GbEthernet, 1GB memory, 4GB swap (matv.c)
swap 160%, DLM 9.5; swap, DLM 4.5, 5.5
Reference: DLM, Vol.102, No.398, pp.29-34, 2007

DLM (STREAM Benchmark)
DLM: 380MB/sec ~ 40MB/sec (DLM: 1MB ~ 4KB)
Panda [2005]: InfiniBand RDMA, 119MB/s
[2006]: 10Gb Ethernet, NIC, RDMA, 131MB/s, 204MB/s

DLM (Himeno Benchmark, Large)
35; DLM 4KB; DLM 5, 10, 8%; DLM 1MB, 5
DLM [2006]: GbE, NIC, RDMA, 55, 128KB
H. Midorikawa et al.: "DLM: A Distributed Large Memory System using Remote Memory Swapping over Cluster Nodes", Proc. of IEEE Cluster2008, pp.268-273 (2008-09)
"DLM 10GbEthernet", Vol.1, No.3, pp.136-157, 2008

DLM-MPI: DLM over MPI (Ethernet, InfiniBand, Myrinet)
Two systems:
  T2K-Tokyo: Myri-10G x 4 (40Gbps)
  T2K-Tsukuba: InfiniBand (64Gbps)

T2K-Tokyo (University of Tokyo)
Top500: rank 45, 3rd in Japan (Nov. 2009)
Interconnect: Myri-10G x 4 links/node, 40Gbps bidirectional between nodes
Node: 4 CPUs (16 cores)/node, AMD Opteron 8356, memory 32GB/node; 8 nodes: memory 128GB/node
System diagram: Taisuke Boku, T2K Symposium Tsukuba 2008 materials, http://www.ccs.tsukuba.ac.jp/workshop/t2k-sympo2008/

STREAM

  Kernel  Code
  COPY    a(i) = b(i)
  SCALE   a(i) = q*b(i)
  ADD     a(i) = b(i) + c(i)
  TRIAD   a(i) = b(i) + q*c(i)

DLM-MPI (MPI-MX): 493MB/s with Myri-10G x 2, 613MB/s with Myri-10G x 4
DLM-socket (TCP/IP Ethernet on Myri-10G x 1): 380MB/s

DLM Himeno benchmark
XLARGE (112GB): 179.4 MFLOPS, Relative Time 2.32 (based on the time in Elarge, 15GB)
  float, 1025 x 1025 x 2049; 20GB/node x 6 nodes; local memory ratio 17.4%; Bonding = 4
XLARGE-d (241GB): 88.8 MFLOPS, Relative Time 4.68 (based on the time in Elarge, 15GB)
  double, 1025 x 1025 x 2049; 20GB/node x 12 nodes; local memory ratio 8.1%; Bonding = 4

T2K-Tsukuba
Top500: rank 56, 6th in Japan (Nov. 2009)
Interconnect: InfiniBand 4x DDR x 4 links/node, 64Gbps between nodes
Node: 4 CPUs (16 cores)/node, AMD Opteron 8356, memory 32GB/node
System diagram: T2K Symposium Tsukuba 2008 materials, http://www.ccs.tsukuba.ac.jp/workshop/t2k-sympo2008/

DLM-M DLM TCP/IP

DLM for a Cluster on LAN
[Figure: clients and memory servers]

WAN: InTrigger (http://www.intrigger.jp/)
WAN: 17, 21 (May 2010); WAN: 2008, 6, 11, 319/848
InTrigger: Vol.49, No.8, pp.939-944, Aug. 2008

DLM for clusters on WAN
[Figure: a client running the user program talks to the DLM-WAN Admin for a group of clusters (WAN); each cluster (LAN) has its own DLM-LAN Admin with memory servers and calculate nodes]

Thank you! http://www.ci.seikei.ac.jp/midori/paper