taisuke@cs.tsukuba.ac.jp http://www.hpcs.is.tsukuba.ac.jp/~taisuke/
CP-PACS HPC PC post CP-PACS CP-PACS II
1990 HPC RWCP,
HPC
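Even the former world's fastest computer: No. 1 on the TOP500 list of November 1996, with a peak performance of 614 GFLOPS and a Linpack performance of 368 GFLOPS (the last computer with which Japan took first place before the Earth Simulator). On the TOP500 list of November 2003 it finally dropped off!! CCS Symposium (2004/06/10)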
6Gflops 1000 6 Tflops Infiniband (x4): 1 Gbyte/s MyrinetXP (dual): 500 Mbyte/s CP-PACS 16 bank
Flare cluster: DELL PowerEdge 1750, Xeon 3.06GHz dual, 12 nodes, 72 GFLOPS, Gigabit Ethernet, Linux; CPU
Orion cluster: Compaq AlphaServer DS20L, Alpha EV68 833MHz dual, 30 nodes, 100 GFLOPS, Fast Ethernet, Linux + SCore; CPU
Perseus cluster: HP ProLiant DL360G3, Xeon 2.8GHz dual, 37 nodes, 414 GFLOPS, Myrinet2000, Linux + SCore (HMCS)
Corona cluster: HP ProLiant DL380G3, Xeon 3.06GHz dual, 8 nodes, 48 GFLOPS, Gigabit Ethernet x 6, Linux + SCore; trunk
Perseus cluster (Xeon dual, Myrinet2000, 37 nodes)
SCore+PBS+CMU MPI on PM/Myrinet, no SCore-D
GRAPE-6 (HMCS: Heterogeneous Multi-Computer System) 6 13 200GFLOPS
Myrinet2000 full connection
[Figure: HMCS configuration. Components: PC-Cluster (Xeon dual); Parallel I/O System PAVEMENT/PIO; MPP for Particle Simulation (GRAPE-6); Parallel File Server (SGI Origin2000); 100base-TX Switches; 32bit PCI N; Hybrid System Communication Cluster (Compaq Alpha); Parallel Visualization Server (SGI Onyx2); Parallel Visualization System PAVEMENT/VIZ]
CPU (Alpha EV68 dual, Xeon dual) QCDpost processing WS
CP-PACS CPU 99% 10 20TFLOPS
full QCD FFT + CG HMCS MPP
Intel Xeon, Opteron, Itanium2 Dual CPU Network bound SAN (System Area Network) MyrinetXP: dual connection Infiniband: x4 GbEthernet
ex. InfiniBand, Quadrics vs ex. GbE n
PC QCD Lattice QCD x, y, z, t PC
Term      Compute [flop]   Load [byte]   Store [byte]   Ratio [byte/flop]
t         288              672           192            3.00
x         336              864           192            3.14
y         336              864           192            3.14
z         336              864           192            3.14
clover    600              864           192            1.76
Total: 5088 [byte] / 1896 [flop] = 2.68 [byte/flop]
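The ratios in this table can be re-derived from the raw counts. The following is a minimal sketch that only restates the arithmetic; the dictionary and function names are illustrative and not part of the original slides.

```python
# Per-term operation counts taken from the table: (flop, load bytes, store bytes).
COUNTS = {
    "t": (288, 672, 192),
    "x": (336, 864, 192),
    "y": (336, 864, 192),
    "z": (336, 864, 192),
    "clover": (600, 864, 192),
}


def byte_per_flop(flop, load, store):
    """Memory traffic (load + store) divided by floating-point operations."""
    return (load + store) / flop


def totals(counts):
    """Return (total flop, total bytes moved) summed over all terms."""
    total_flop = sum(flop for flop, _, _ in counts.values())
    total_byte = sum(load + store for _, load, store in counts.values())
    return total_flop, total_byte


if __name__ == "__main__":
    for name, (flop, load, store) in COUNTS.items():
        print(f"{name:6s} {byte_per_flop(flop, load, store):.2f} byte/flop")
    flop, byte = totals(COUNTS)
    print(f"total  {byte} byte / {flop} flop = {byte / flop:.2f} byte/flop")
```

A ratio above 1 byte/flop for every term indicates that the kernel is limited by memory bandwidth rather than by arithmetic throughput on the processors discussed here.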
4-D lattice Nx*Ny*Nz*Nt, 3-D partitioning nx*ny*nz
Communication per direction [byte]:
  x: 12*2*(Nt/2+1)*(Ny/ny)*(Nz/nz)*16
  y: 12*2*(Nt/2+1)*(Nx/nx)*(Nz/nz)*16
  z: 12*2*(Nt/2+1)*(Nx/nx)*(Ny/ny)*16
With Nx=Ny=Nz=Ns and nx=ny=nz=ns: 0.608 * ((Nt+2)/Nt) / (Ns/ns) [byte/flop]
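The coefficient 0.608 can be reproduced from the per-direction byte counts and the 1896 flop total. This is a sketch under one assumption that is not stated in the surviving text: the flop count per CPU is taken as 1896 * (Nt/2) * (Ns/ns)^3, i.e. the operator is applied to half of the local sites. That assumption is chosen because it reproduces the slide's coefficient; the function names are illustrative.

```python
FLOP_PER_SITE = 1896     # total flop from the operation-count table
BYTES_PER_COMPLEX = 16   # double-precision complex number


def comm_bytes(nt, local):
    """Bytes exchanged over the three partitioned directions.

    `local` is the local spatial extent Ns/ns; each direction contributes
    12*2*(Nt/2+1)*local*local*16 bytes, as in the expressions above.
    """
    per_direction = 12 * 2 * (nt / 2 + 1) * local * local * BYTES_PER_COMPLEX
    return 3 * per_direction


def flops(nt, local):
    # ASSUMPTION (not in the source): 1896 flop on half of the local sites.
    return FLOP_PER_SITE * (nt / 2) * local ** 3


def byte_per_flop(nt, local):
    """Communication bytes per floating-point operation for one CPU."""
    return comm_bytes(nt, local) / flops(nt, local)


def closed_form(nt, local):
    """The closed form quoted on the slide."""
    return 0.608 * ((nt + 2) / nt) / local


if __name__ == "__main__":
    for local in (2, 4, 8, 16):
        print(local, byte_per_flop(64, local), closed_form(64, local))
```

The ratio falls as 1/(Ns/ns), which is the usual surface-to-volume effect: larger local lattices need less communication per unit of computation.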
P4 (Pentium4, 2.4GHz): HyperThreading, PC3200 memory, single CPU
Xeon (Pentium Xeon, 2.8 GHz): PC2100 memory, 2 CPU SMP
EV7 (Alpha EV7, 1.15 GHz): PC3200 memory, 16 CPU, HyperTransport connected
Alpha EV7 HP
CPU performance [Mflops] (P4, Xeon: SSE2)
Local lattice   P4 No copy   P4 Copy   Xeon No copy   Xeon Copy   EV7 No copy   EV7 Copy
2*2*2*64        1251         957       811            598         1190          949
4*4*4*64        1020         878       633            536         1144          1034
8*8*8*64        1045         958       686            625         1140          1082
16*16*16*64     N/A          N/A       604            573         1122          1101
EV7, 16 CPU (16*16*16*64)
CPUs                             1      2      4      8      16
Efficiency (relative to 1 CPU)   1      0.94   0.86   0.72   0.32
8 CPU HyperTransport
32*32*32*64 lattice, 1 trajectory = 20284 [Tflop]
CPUs     Computation [Tflop]   Communication [Tbyte]   Ratio [Tflop/Tbyte]
1        20284.0               -----                   -----
8        2535.50               77.1                    32.9
64       316.938               19.3                    16.4
512      39.6                  4.82                    8.22
4096     4.95                  1.20                    4.11
32768    0.619                 0.301                   2.06
Assumptions: 1 Gflops and 0.6 Gbyte/s per CPU, 32*32*32*64 lattice, 5000 trajectories
CPUs                 512 (ns=8)   4096 (ns=16)
Time [days]          2768         405
Communication [%]    17           29
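The estimate above can be approximated from the per-CPU computation and communication volumes for the 32*32*32*64 lattice (39.6 Tflop and 4.82 Tbyte at 512 CPUs, 4.95 Tflop and 1.20 Tbyte at 4096 CPUs). The sketch below assumes that computation and communication do not overlap and that the time unit is days; both are assumptions, and the function name is illustrative. With the rounded inputs it gives about 2757 and 402 days, close to but not exactly the quoted 2768 and 405.

```python
SECONDS_PER_DAY = 86400


def estimate(tflop_per_cpu, tbyte_per_cpu, gflops=1.0, gbyte_per_s=0.6,
             trajectories=5000):
    """Return (total days, communication share in %) for a full run.

    ASSUMPTION: computation and communication are serialized (no overlap).
    """
    compute_s = tflop_per_cpu * 1e12 / (gflops * 1e9)
    comm_s = tbyte_per_cpu * 1e12 / (gbyte_per_s * 1e9)
    per_trajectory_s = compute_s + comm_s
    days = per_trajectory_s * trajectories / SECONDS_PER_DAY
    comm_percent = 100.0 * comm_s / per_trajectory_s
    return days, comm_percent


if __name__ == "__main__":
    # Per-CPU volumes for one trajectory on the 32*32*32*64 lattice.
    for cpus, tflop, tbyte in ((512, 39.6, 4.82), (4096, 4.95, 1.20)):
        days, pct = estimate(tflop, tbyte)
        print(f"{cpus:5d} CPUs: {days:7.0f} days, communication {pct:.0f}%")
```

Going from 512 to 4096 CPUs shortens the run by roughly a factor of seven rather than eight, because the communication share grows as the local lattice shrinks.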
(Xeon 3.06GHz, 1 CPU)
(Xeon 3.06GHz)
PC3200 (3.2 Gbyte/s) x 2 3MB L2 short vector SSE2 SSE3
CPU CPU
CPU NIC+Switch
24.6 Tflops (4GHz CPU)
3072 CPU (1536 boards), 2 CPU
6.1 TB
1.05 PB (RAID0 mirror)
GbEthernet trunk (dual link x 3 dimensions)
PCI-X dual Gigabit Ethernet
3-D Hyper Crossbar
88 (Node=48, Switch=40)
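The headline figures are mutually consistent, which a short calculation shows. This is a sketch only: the value of 2 floating-point operations per cycle per CPU is an assumption (it matches SSE2 double precision on Xeon-class processors and reproduces 24.6 Tflops), and the function names are illustrative.

```python
def peak_tflops(cpus, clock_ghz, flop_per_cycle=2):
    """Aggregate peak performance in Tflops.

    ASSUMPTION: flop_per_cycle=2 is not stated in the source.
    """
    return cpus * clock_ghz * flop_per_cycle / 1000.0


def cpus_per_board(cpus, boards):
    """Number of CPUs mounted on each board."""
    return cpus // boards


if __name__ == "__main__":
    print(f"peak: {peak_tflops(3072, 4.0):.1f} Tflops")
    print(f"CPUs per board: {cpus_per_board(3072, 1536)}")
```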
[Figure: board block diagram with two nodes per board. Each node: CPU, chip-set, memory, IDE HDD. Data net (GbE x 6) per node: x0, x1: X dual link; y0, y1: Y dual link; z0, z1: Z dual link. Management net (100Mbps) per node, connected to the management network switch.]
[Figure: 3-D Hyper Crossbar network with X=16, Y=16, Z=12. CPUs are connected along the X, Y and Z directions by dual links.]
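The dimensions in this figure fix the node count, and the hyper crossbar idea (nodes that differ in only one coordinate share a crossbar switch) can be illustrated with a few lines of code. This is a minimal sketch: the dimension-order routing shown here is a common textbook scheme used for illustration and is not confirmed as the routing used on this machine, and all names are hypothetical. The partition helper reproduces the CPU([0-F],[0-F],[0-1]) subset, which the slides give as 1/6 of the system.

```python
from itertools import product

DIMS = (16, 16, 12)  # X, Y, Z extents from the figure


def nodes(dims=DIMS):
    """All node coordinates (x, y, z) in the 3-D hyper crossbar."""
    return list(product(*(range(d) for d in dims)))


def route(src, dst):
    """Illustrative dimension-order route: fix X, then Y, then Z.

    Each step changes exactly one coordinate, i.e. crosses one crossbar,
    so any pair of nodes is at most three hops apart.
    """
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        if cur[axis] != dst[axis]:
            cur[axis] = dst[axis]
            path.append(tuple(cur))
    return path


def partition(z_values=range(0, 2), dims=DIMS):
    """Nodes with x in 0..F, y in 0..F and z restricted to z_values."""
    return [(x, y, z)
            for x in range(dims[0])
            for y in range(dims[1])
            for z in z_values]


if __name__ == "__main__":
    print(len(nodes()))                    # total CPUs
    print(route((0, 0, 0), (5, 7, 3)))     # path through X, Y, Z switches
    print(len(partition()))                # size of one z=0..1 partition
```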
[Figure: switch and rack layout for one partition, CPU([0-F],[0-F],[0-1]), 1/6 of the system = 512 CPU. Panels: [Z] Y-direction switches (x=0..15, z=z); [Y] Z-direction switches (x=0..15, y=y); [Z] X-direction switches (y=0..15, z=z). A switch labeled Y1-Y2,Z serves CPU (x=0..F, y=Y1..Y2, z=Z). Racks are labeled 0-3,k / 8-B,k / 4-7,k / C-F,k for k = 0..B; switch positions are labeled [0] to [F].]
CPUSSE dual CPU SMP 3
PC