6 19 4 27
1. 2. 3. 3.1 3.2 A 3.3 B 4. 5. 2007/4/27 4 1
1.
NEC NH F 2 18 9 19 19 2 28
10 PFLOPS, 2.5 PB, 30 MW, 3,200
18 12 12: SimFold, GAMESS, Modylas, RSDFT, NICAM, LatticeQCD, LANS, HPL, NPB-FT
19 2 28
NH: system configuration
Nodes: 1,280 (SMP nodes); CPUs: 40,960; cores: 163,840
Peak performance: 10.48 PFLOPS; memory: 2.5 PB (2 TB per node)
Network: Fat-tree; 16 GB/s, 32, 32, 20 Gbps
Power: 17.5 MW (Linpack)
[Figure: Fat-tree with switch stages SW0 (#00 to #79), SW1 (#00 to #79) and SW2 (#00 to #63); each node attaches at 16 GB/s x 16 links x 2. Node (NUMA): 32 CPUs, 128 cores, 8.19 TFLOPS, 2 TB. CPU: 256 GFLOPS, L2$ 8 MB, MEM 64 GB, 128 GB/s. Core: 2 GHz, 64 GFLOPS (2 FMA x 8 VPP).]
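The NH totals on this slide follow from the per-core figures. The sketch below redoes that arithmetic; the function name is illustrative, and counting one FMA as 2 floating-point operations is an assumption that reproduces the quoted 64 GFLOPS per core.

```python
def nh_peak(nodes=1280, cpus_per_node=32, cores_per_cpu=4,
            clock_ghz=2.0, fma_units=2, vpp=8):
    """Recompute the NH peak figures from the per-core parameters."""
    # Assumption: one fused multiply-add counts as 2 floating-point operations.
    core_gflops = clock_ghz * fma_units * vpp * 2
    cpus = nodes * cpus_per_node
    cores = cpus * cores_per_cpu
    cpu_gflops = core_gflops * cores_per_cpu
    return {
        "cpus": cpus,
        "cores": cores,
        "core_gflops": core_gflops,
        "cpu_gflops": cpu_gflops,
        "node_tflops": cpu_gflops * cpus_per_node / 1e3,
        "system_pflops": core_gflops * cores / 1e6,
    }


nh = nh_peak()
for name, value in nh.items():
    print(f"{name}: {value}")
```

The computed system peak is 10.48576 PFLOPS, which the slide quotes as 10.48 PFLOPS; the node figure of 8.192 TFLOPS matches the quoted 8.19 TFLOPS.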
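NH: main features
CPU: 45 nm, 256 GFLOPS; 4 cores at 2 GHz, 2 FMA x 8 per core; 128 KB; 8 MB L2 (4 cores); RDB (Reusable Data Buffering)
1 CPU = 4-core SMP
System: 40,960 CPUs, 10.48 PFLOPS, 2.5 PB
Node: 32 CPUs; OS, MPI
Power: 140 W per CPU (Linpack)
Network: 328 TB/s; 3-level Fat tree; 1,280 nodes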
NH: software
OS: Linux; IO OS
OpenMP, MPI
Languages: Fortran, HPF, CAF, C/C++
MPI
F: system configuration
Nodes: 82,944; CPUs: 82,944; cores: 663,552
Peak performance: 10.61 PFLOPS; memory: 2.53 PB (32 GB per CPU)
Network: ToFu (full connection + 3D); 18 CPUs per group, 4,608 groups in 3D; 5.0 GB/s; 30 GB/s x 6
Power: 15.5 MW (Linpack)
[Figure: 3D network with 30 GB/s x 6 per group and 82,944 CPUs. CPU: 2 GHz, 128 GFLOPS (8 cores), L2$ 6 MB, MEM 32 GB, 64 GB/s. Core: SIMD (4 FMA), 16 GFLOPS. Links: 2.5 GB/s x 8 links x 2; 180 GB/s.]
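The F totals can be rebuilt from the group structure in the same way. In this sketch the function name is illustrative, one FMA is assumed to count as 2 floating-point operations, and memory is assumed to be quoted with binary prefixes (GiB, PiB), since that assumption reproduces 2.53 PB exactly.

```python
def f_system(groups=4608, cpus_per_group=18, cores_per_cpu=8,
             clock_ghz=2.0, simd_fma=4, mem_gib_per_cpu=32):
    """Recompute the F totals from the ToFu group structure."""
    cpus = groups * cpus_per_group
    cores = cpus * cores_per_cpu
    # Assumption: one fused multiply-add counts as 2 floating-point operations.
    core_gflops = clock_ghz * simd_fma * 2
    return {
        "cpus": cpus,
        "cores": cores,
        "core_gflops": core_gflops,
        "system_pflops": cores * core_gflops / 1e6,
        # Assumption: binary prefixes, 1 PiB = 1024 * 1024 GiB.
        "memory_pib": cpus * mem_gib_per_cpu / 1024**2,
    }


f = f_system()
for name, value in f.items():
    print(f"{name}: {value}")
```

With these inputs the peak comes out at 10.616832 PFLOPS (quoted as 10.61) and the memory at 2.53125 PiB (quoted as 2.53 PB).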
F: main features
CPU: 45 nm, 1 CPU per LSI, 128 GFLOPS; 8 cores at 2 GHz; 128 FP registers; SPARC-V9; SIMD (4 FMA); HPC; 6 MB L2 (8 cores)
System: 82,944 CPUs, 10.6 PFLOPS, 2.53 PB
Power: 58 W/CPU (Linpack); 20
Network: ToFu (Torus-connected Full connection); 18 CPUs per group; 3D; 2
F: software
OS: POSIX UNIX OS
OpenMP, MPI: 8-core SMP; 8-core SMP; D; ToFu
Languages: Fortran, XP Fortran, HPF, CAF, C/C++
MPI
Comparison of NH and F
Item                  NH                      F
Peak (PFLOPS)         10.48                   10.61
Memory (PB)           2.50                    2.53
(unlabeled) (PB)      140                     140
Area (m2)             1,446 / 2,976           1,475 / 3,198
Power (MW)            17.5 / 23 (Linpack)     15.5 / 22.8 (Linpack)
CPUs                  40,960                  82,944
Cores                 163,840                 663,552
Network               Fat Tree                D
Core and CPU comparison
Item                  NH                          F
Clock (GHz)           2                           2
Core (GFLOPS)         64 (16: 2 FMA x 8 VPP)      16 (SIMD 4 FMA)
(unlabeled)           256                         64, 128
CPU (GFLOPS)          256                         128
Cores per CPU         4                           8
Memory Byte/Flop      0.5                         0.5
L2 (MB)               8                           6
L2 Byte/Flop          4                           2
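The Byte/Flop entries are ratios of bandwidth to peak rate. As a minimal sketch, the memory ratio can be recomputed from the per-CPU bandwidths shown on the configuration slides (128 GB/s for NH, 64 GB/s for F); reading those two numbers as memory bandwidth is an assumption, and the L2 bandwidths printed here are only what the quoted L2 ratios would imply, not figures stated in the slides.

```python
def bytes_per_flop(bandwidth_gb_s, peak_gflops):
    """Ratio of data bandwidth (GB/s) to peak compute rate (GFLOPS)."""
    return bandwidth_gb_s / peak_gflops


def implied_bandwidth_gb_s(ratio, peak_gflops):
    """Bandwidth that a given Byte/Flop ratio implies for a given peak."""
    return ratio * peak_gflops


# Per-CPU peak (GFLOPS), assumed memory bandwidth (GB/s), quoted L2 ratio.
cpus = {
    "NH": {"peak": 256, "mem_bw": 128, "l2_ratio": 4},
    "F": {"peak": 128, "mem_bw": 64, "l2_ratio": 2},
}

for name, c in cpus.items():
    mem_ratio = bytes_per_flop(c["mem_bw"], c["peak"])
    l2_bw = implied_bandwidth_gb_s(c["l2_ratio"], c["peak"])
    print(f"{name}: memory {mem_ratio} B/FLOP, implied L2 bandwidth {l2_bw} GB/s")
```

Both designs land on the same 0.5 B/FLOP memory ratio even though the CPUs differ by a factor of two in peak rate.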
2GHz Thin Fat NH 40,960 F 82,944 () HPC
2 NH 4 16 F 8 66 NH F SIMD NH Fat Tree F D
21 / 9 / 7: SimFold, GAMESS, Modylas, RSDFT, NICAM, LatticeQCD, LANS; HPL (High Performance Linpack), NPB-FT
[Figure: bar chart of the 9 benchmarks for NH and F in PFLOPS (axis 0 to 12): SimFold, GAMESS, Modylas, RSDFT, NICAM, LatticeQCD, LANS, HPL, NPB-FT; a note refers to 7, HPL and NPB-FT.]
[Figure: bar chart of the 9 benchmarks, NH and F side by side (axis 0.0 to 2.5): SimFold, GAMESS, Modylas, RSDFT, NICAM, LatticeQCD, LANS, NPB-FT; notes refer to NH with LatticeQCD and LANS, and to RSDFT and NPB-FT.]
12 NH F PFLOPS BMT
10 PFLOPS, 2.5 PB, 30 MW, 3,200; BMT; CPU; F NH F NH
2 22 2 2
2.
1. LINPACK 10 PFLOPS
2. 10 PFLOPS; 10 PFLOPS; 3-5 PFLOPS; PC
3. 3 PFLOPS; 3 PFLOPS; 1 PFLOPS
F 10 PFLOPS, NH 3 PFLOPS; 121 3; A, B; Fat Tree, Fat Tree, ToFu, Fat Tree; NIC; F NH, F NH, F NH; ToFu; 10 3, 10 3, 10 3
F, NH: 10 PFLOPS, 3 PFLOPS; Linpack 10 PFLOPS; A, B; ToFu, Fat Tree; F, NH
1/3 +
2/3 SIMD
3/3 CPU
3.
A, B: ToFu, Fat Tree
LINPACK 10 PFLOPS; A: 10 PFLOPS; B: 3 PFLOPS
A: 11.2 PFLOPS x 85% (LINPACK) = 9.52 PFLOPS
B: 3.1 PFLOPS x 90% (LINPACK) = 2.79 PFLOPS
A+B LINPACK: 90%: 11.08 PFLOPS; 85%: 10.46 PFLOPS; 80%: 9.85 PFLOPS
1.2 TB/s; 15 PB; F: 80 PB; NH: 5 PB
A: 1/8 B/FLOPS; B: 1/4, 1/8 B/FLOPS; 100 PB; A B10 A B B
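The LINPACK estimates on this slide can be reproduced with a few multiplications. The sketch below assumes that the 90%, 85% and 80% rows apply a further efficiency factor to the sum of the A and B estimates; that reading is inferred from the numbers, which it reproduces, and is not stated in the surviving text.

```python
def linpack_pflops(peak_pflops, efficiency):
    """Estimated LINPACK rate as peak times an efficiency factor."""
    return peak_pflops * efficiency


a = linpack_pflops(11.2, 0.85)  # unit A: 11.2 PFLOPS peak at 85%
b = linpack_pflops(3.1, 0.90)   # unit B: 3.1 PFLOPS peak at 90%

# Assumption: the combined rows scale (A + B) by a coupling efficiency.
combined = {eff: round((a + b) * eff, 2) for eff in (0.90, 0.85, 0.80)}

print(round(a, 2), round(b, 2))
for eff, value in combined.items():
    print(f"A+B at {eff:.0%}: {value} PFLOPS")
```

Under this reading, the combined run stays above 10 PFLOPS only if the coupling efficiency is roughly 85% or better.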
On-the-fly
On-the-fly 10PFLOPS t 1 t 2 t 3 2, 2, 2, t 1 t 2 t 3 A B
On-the-fly 10A A 2 10TB B 10PFLOPS 3PFLOPS 10TB 10TB 1PFLOPS 2 on N 1 2 10TB 10TB 2 on N 2 2 10TB 10TB 2 on N n 2 21.6 16GB/CPU 1.0 1TB/ A B
- e - e I - I - 3 3PF 1PF 30GB 40TB 45GB 0.3GB SCF-CI 4GB 3GB
A: 10 PFLOPS, 15 PB; A: ToFu, F; 80 PB; B: Fat Tree, NH; 5 PB; B: 1 PFLOPS, 1 TB, NUMA
A 13PF A 10PF+B 3PF (1PF )10 A 13PF 100 30 130 1.7 1.16 A 10PF 100 51 B 3PF 151 5,000 800 4,500 B : NICAM 1 : 1.9 LANS 1 : 1.5
3.1
Integrated system
Total: 99,840 CPUs; 749,568 cores; 14.3 PFLOPS; memory 1.7-2.1 PB; 100 PB; 24 MW; 3,800; 1.68 MW/PFLOPS; 266 /PFLOPS
A: 15 PB; 87,552 CPUs; 700,416 cores; 11.2 PFLOPS; 1.34 PB; 15.2 MW; 1,900
B: 12,288 CPUs; 49,152 cores; 3.14 PFLOPS; 0.375-0.75 PB; 6.8 MW; 900; 5 PB
80 PB; 1.2 TB/s; 2.0 MW; 700
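The totals and the per-PFLOPS ratios on this slide can be cross-checked from the per-unit entries. In the sketch below the variable names are illustrative, the 2.0 MW entry is assumed to belong to the remaining shared equipment, and the ratios use the rounded totals as quoted (14.3 PFLOPS, 24 MW, 3,800).

```python
unit_a = {"cpus": 87_552, "cores": 700_416, "pflops": 11.2, "mw": 15.2}
unit_b = {"cpus": 12_288, "cores": 49_152, "pflops": 3.14, "mw": 6.8}
other_mw = 2.0  # assumption: power of the remaining shared equipment

quoted = {"pflops": 14.3, "mw": 24.0, "area": 3_800}

total_cpus = unit_a["cpus"] + unit_b["cpus"]
total_cores = unit_a["cores"] + unit_b["cores"]
total_pflops = unit_a["pflops"] + unit_b["pflops"]
total_mw = unit_a["mw"] + unit_b["mw"] + other_mw

# Ratios computed from the quoted (rounded) totals.
mw_per_pflops = quoted["mw"] / quoted["pflops"]
area_per_pflops = quoted["area"] / quoted["pflops"]

print(total_cpus, total_cores, round(total_pflops, 2), round(total_mw, 1))
print(round(mw_per_pflops, 2), round(area_per_pflops))
```

The sums reproduce 99,840 CPUs, 749,568 cores and 24 MW, and the ratios come out at 1.68 MW/PFLOPS and 266 per PFLOPS as quoted.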
MPI A MPI B
A B ACL MPI API
[Figure: floor layout with A and B. Dimensions shown: 57 m, 52 m, 54.5 m, 36 m, 70 m, 12.5 m, 17 m. Areas shown: 1,900, 900, 700; total 3,800.]
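The dimensions and areas on this layout slide appear to pair up as width times depth. The pairing below is an assumption inferred from the numbers (the labels did not survive): 52 m x 36 m and 54.5 m x 17 m also appear on the individual layout slides for A and B, and each product lands within a few percent of a quoted area.

```python
# (width in m, depth in m, quoted area); the pairing is an assumption.
layout = {
    "A": (52.0, 36.0, 1_900),
    "B": (54.5, 17.0, 900),
    "other": (57.0, 12.5, 700),
    "whole floor": (54.5, 70.0, 3_800),
}


def area_m2(width_m, depth_m):
    """Rectangular floor area in square metres."""
    return width_m * depth_m


for name, (width, depth, quoted_area) in layout.items():
    computed = area_m2(width, depth)
    deviation = abs(computed - quoted_area) / quoted_area
    print(f"{name}: {computed:.1f} m2 vs quoted {quoted_area} ({deviation:.1%} off)")
```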
[Figure: schedule chart for 2007 to 2011; rows for A and B include LSI and OS.]
3.2 A
A: system configuration
Nodes: 87,552; CPUs: 87,552; cores: 700,416
Peak performance: 11.2 PFLOPS; memory: 1.34 PB (16 GB per CPU)
Network: ToFu (full connection + 3D); 18 CPUs per group; 20 x 16 x 16 = 5,120 in 3D
Power: 15.2 MW (Linpack)
[Figure: 3D network with 30 GB/s x 6 per group and 87,552 CPUs. CPU: 2 GHz, 128 GFLOPS (8 cores), L2$ 6 MB, MEM 16 GB, 64 GB/s. Core: SIMD (4 FMA), 16 GFLOPS. Links: 2.5 GB/s x 8 links x 2; 180 GB/s.]
A: main features
CPU: 45 nm, 1 CPU (LSI), 128 GFLOPS; 8 cores at 2 GHz; 128 FP registers; SPARC-V9; SIMD (4 FMA); HPC; 6 MB L2 (8 cores)
Power: 42 W/CPU; Linpack: 58 W/CPU; 20
Network: ToFu (Torus-Full connection); 18 CPUs per group; 3D; 2
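A: CPU
8 cores, 128 FP registers; 2 GHz; SIMD (4 FMA); 16 GFLOPS per core; 128 GFLOPS per CPU; 6 MB L2
Memory: 64 GB/s (32 GB/s, 32 GB/s); L2 to L1: 2 B/FLOP; memory to L2: 0.5 B/FLOP
[Figure: CPU 128 GF (16 GF x 8), 2 GHz, 8 cores, each with 2 x 2-SIMD units; cache labels 8KB (2way), 116KB (2way); L2 6 MB (12way); memory 64 GB/s.]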
SIMD 4, 8: (1) SIMD 2; (2) SIMD 4; (3) SIMD1, SIMD2
[Figure: for each mode, the FP registers, Basic FPR (%b0-%b63) and Extend FPR (%e0-%e63), feed four FMA pipes (A-pipe, B-pipe, C-pipe, D-pipe).]
System board (SB): 16 GB per CPU; 2 CPUs per SB, 32 GB
ICC (Interconnect Controller): CPU-ICC 32 GB/s; ICC to PCI Express gen2
[Figure: each CPU with DIMMs at 32 GB/s + 32 GB/s; CPU to ICC at 32 GB/s; 82 GB/s; PCIe Gen2 4 GB/s x 3.]
ToFu: 6.4 Gbps / differential pair; PCIe Gen2: 5 Gbps / differential pair
Full / ToFu: 5 GB/s x 8; Torus / ToFu: 10 GB/s x 2 (+1)
ToFu (Torus-connected Full-connection)
Full connection: 2; 9 SBs; 2.5 GB/s (x 2)
Torus: ToFu 2; 20 x 16 x 16 (3D); 5 GB/s x 3 x 2 = 30 GB/s
0.1, 1.6, 0.8; MPI: 1.1, 2.6, 1.8; 3D
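The 30 GB/s figure is 5 GB/s per link times 3 dimensions times 2 directions. A minimal sketch of the torus part only, assuming a plain 3D torus with wraparound on the 20 x 16 x 16 grid (the full connection inside each group is not modelled, and the function name is illustrative):

```python
DIMS = (20, 16, 16)   # torus extent quoted on the slide
LINK_GB_S = 5.0       # per-link bandwidth quoted on the slide


def torus_neighbors(coord, dims=DIMS):
    """Return the 6 neighbours of a grid point in a 3D torus with wraparound."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % size
            neighbors.append(tuple(shifted))
    return neighbors


grid_points = DIMS[0] * DIMS[1] * DIMS[2]
corner = torus_neighbors((0, 0, 0))
torus_bandwidth = LINK_GB_S * len(corner)

print(grid_points)      # number of grid positions
print(corner)           # wraparound gives even a corner point 6 neighbours
print(torus_bandwidth)  # 5 GB/s x 3 dimensions x 2 directions
```

Because of the wraparound links, every position has the same six neighbours, which is why a single per-group bandwidth can be quoted.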
8 1600 750 2000 mm 3 52m 36m
[Figure: storage and I/O network. Elements shown: 10GbE SW, 1GbE SW, 50 TB RAID10 units (18, 8), SCFB, IO SB, 8GFC. Counts shown: 25, (8), (50), (320), 1 (12), 56. Capacity shown: 77 PB.]
A: software
OS: POSIX UNIX OS; SW
OpenMP, MPI
Languages: Fortran, HPF, CAF, XP Fortran, C/C++
A: 8-core SMP; 87,552; B: ToFu
RAS CPU ECC RAM 3
3.3 B
B: system configuration
Nodes: 384; CPUs: 12,288; cores: 49,152
Peak performance: 3.14 PFLOPS; memory: 0.375-0.75 PB (32-64 GB per CPU)
Node: 32 CPUs, NUMA, 1 TB-2 TB
Network: 2-level Fat-tree, (24 + 16) x 16
Power: 7 MW; 900
[Figure: Fat-tree with switch stages SW0 (#00 to #23, 24 switches) and SW2 (#00 to #15, 16 switches); each node attaches at 16 GB/s x 16 links x 2. Node (NUMA): 32 CPUs, 128 cores, 8.19 TFLOPS, 1-2 TB; 384 nodes. CPU: 256 GFLOPS, L2$ 8 MB, MEM 32-64 GB, 256 GB/s. Core: 2 GHz, 64 GFLOPS (2 FMA x 8 VPP).]
B: main features
CPU: 45 nm, 256 GFLOPS; 4 cores at 2 GHz, 8 FMA x 2 per core; 128 KB; 8 MB L2 (4 cores); RDB (Reusable Data Buffering)
System: 12,288 CPUs, 3.14 PFLOPS, 0.375-0.75 PB
Node: 32 CPUs; OS
Power: 140 W/CPU (Linpack)
Network: 98 TB/s; 2-level Fat tree; 384 nodes
B: CPU
4 cores sharing 1 8 MB L2; 2 GHz; 64 GFLOPS per core; 256 GFLOPS per CPU; 1 B/FLOP
8 MB L2; 256 GB/s, 128 GB/s; memory: 1 B/FLOP; L2: 4 B/FLOP; RDB (Reusable Data Buffering)
[Figure: CPU 256 GF (64 GF x 4); L2 8 MB, 8way, 64B/4, Unified; 1 B/FLOP; 16 GB/s; 2; 256 GB/s.]
128 4way 8 2 / 1
Node: 4 CPUs per unit (1U); 8 units (8U), 32 CPUs; I/O; NUMA
2 CPUs; N; 33x33; 16 GB/s x 2; I/O: x86; 16 GB/s x 16
[Figure: units U #0, U #1, ..., U #7, each with CPUs (C) and memory modules (MM), plus I/O.]
Inter-node network: 2-level Fat-tree; 16 GB/s, 32, 32, 20 Gbps
Node: 16, 16; 384 nodes; 98 TB/s
[Figure: switches SW0 (#00 to #23, 24 switches) and SW2 (#00 to #15, 16 switches); node with 16 links, CPU #0~3, CPU #4~7, ..., CPU #28~31.]
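The 98 TB/s figure is consistent with multiplying the node count by the per-node injection bandwidth. The sketch below assumes exactly that: 16 GB/s per link, 16 links per node, one direction, decimal units. This reading is inferred from the numbers and also reproduces the 328 TB/s quoted for the 1,280-node NH draft.

```python
def aggregate_tb_s(nodes, link_gb_s=16, links_per_node=16):
    """Sum of per-node injection bandwidth in TB/s (one direction, decimal units)."""
    return nodes * link_gb_s * links_per_node / 1000


b_unit = aggregate_tb_s(384)     # unit B: 384 nodes
nh_draft = aggregate_tb_s(1280)  # original NH draft: 1,280 nodes

print(b_unit)    # quoted on the slides as 98 TB/s
print(nh_draft)  # quoted on the slides as 328 TB/s
```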
54.5m 2 N ) 1I/O21 2000mm 2000mm 1000mm 2N 8 2000mm I/O SW 8SW 1000mm 600mm 800mm I/O 900 17m 800mm 1000mm
B: software
OS: Linux; IO OS
SW: OpenMP, MPI
Languages: Fortran, HPF, CAF, C/C++, UPC
RAS
CPU: ECC; RAM (L2); I/F; RAM; MOD-N; Out-of-N; BIST (Built-In Self-Test) / LSI; ECC
1 N / OS; CPU; N; I/O; NN; RAID6; I/O
4.
A SIMD ToFu SIMD RAS B Fat-tree VCSEL 20Gbps SerDes RAS
A LSI (1/2) LSI 45nm LSI 8 HPC SIMD 6MB 128GFLOPS /101 / ) - RAM - Vth - - Vdd, Vbs
A LSI (2/2) ( 10 ) LSI R A M L1$ L1 $ SEC DED ECC L2$ SEC DED ECC SEC DED ECC mtlb 2007 2008 2009 2010 GPR FPR GUB FUB PC PSTATE ALU SHIFT FMA
A (1/2): I/O; 6.25 Gbps, 6.25 Gbps; PT 15; IDC; 3.125 Gbps
[Figure: nine SystemBoard ICC blocks and nine SystemBoard ICC CN blocks.]
A (2/2) ToFu MPI 100PetaFlops HPC 2007 2008 2009 2010
A (1/2) (SB) SB (SB)
A (2/2) CPU0.006( ) / 2007 2008 2009 2010 / /
A: SIMD (1/2)
Basic, Extend 2 2/ Basic, Extend SIMD 2 SIMD
Loop with an IF statement (MASK is a placeholder; the name of the condition array is missing in the source):
      DO I=1,N
        IF (MASK(I)) THEN
          A(I)=B(I)+C(I)
        ELSE
          X(I)=Y(I)*Z(I)
        ENDIF
      ENDDO
L2, L1
The same loop unrolled by 2 (as written, it assumes N is even):
      DO I=1,N,2
        IF (MASK(I)) THEN
          IF (MASK(I+1)) THEN
            A(I)=B(I)+C(I)
            A(I+1)=B(I+1)+C(I+1)
          ELSE
            A(I)=B(I)+C(I)
            X(I+1)=Y(I+1)*Z(I+1)
          ENDIF
        ELSE
          IF (MASK(I+1)) THEN
            X(I)=Y(I)*Z(I)
            A(I+1)=B(I+1)+C(I+1)
          ELSE
            X(I)=Y(I)*Z(I)
            X(I+1)=Y(I+1)*Z(I+1)
          ENDIF
        ENDIF
      ENDDO
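The two loops on this slide compute the same result: the second one handles two iterations per trip and enumerates the four combinations of the two conditions. The Python sketch below checks that equivalence on small lists. The name `mask` stands in for the condition array, whose name is lost in the source, and the remainder step for an odd length is an addition that the slide's loop does not show.

```python
def scalar_loop(mask, b, c, y, z):
    """One element per iteration, branching on the condition."""
    n = len(mask)
    a, x = [0.0] * n, [0.0] * n
    for i in range(n):
        if mask[i]:
            a[i] = b[i] + c[i]
        else:
            x[i] = y[i] * z[i]
    return a, x


def unrolled_by_two(mask, b, c, y, z):
    """Two elements per iteration, one branch per condition pair."""
    n = len(mask)
    a, x = [0.0] * n, [0.0] * n
    for i in range(0, n - 1, 2):
        if mask[i] and mask[i + 1]:
            a[i] = b[i] + c[i]
            a[i + 1] = b[i + 1] + c[i + 1]
        elif mask[i]:
            a[i] = b[i] + c[i]
            x[i + 1] = y[i + 1] * z[i + 1]
        elif mask[i + 1]:
            x[i] = y[i] * z[i]
            a[i + 1] = b[i + 1] + c[i + 1]
        else:
            x[i] = y[i] * z[i]
            x[i + 1] = y[i + 1] * z[i + 1]
    if n % 2:  # remainder element when the length is odd (not on the slide)
        i = n - 1
        if mask[i]:
            a[i] = b[i] + c[i]
        else:
            x[i] = y[i] * z[i]
    return a, x


mask = [True, False, False, True, True]
b, c = [1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]
y, z = [2.0, 3.0, 4.0, 5.0, 6.0], [0.5, 0.5, 0.5, 0.5, 0.5]
print(scalar_loop(mask, b, c, y, z) == unrolled_by_two(mask, b, c, y, z))
```

In the both-true and both-false branches the two statements apply the same operation to adjacent elements, which is the case a 2-wide SIMD instruction can cover.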
A SIMD (2/2) Venus 8 ( ) SIMD 2007 2008 2009 2010 / /
B LSI(1/2) (1) NMOS PMOS N+ N+ P_well P+ P+ N_well P_sub / 90nm 65nm 45nm (2) 45nmCMOS Low etc /SRAM etc Vth etc etc
B LSI(2/2) LSI TEG LSI LSI 2007/1 2009/1 2009/4 2006 2007 2008 2009 2010 fix RTL LSI TEG TO LSI LSI LSI
B (1/2): (1); (2) 20 Gbps SerDes; ITRS 2; 5; 10 Gbps; 1000/LSI; 1/200; 1/100
[Figure: transmission rate in bps (1G, 10G, 100G) against year (2000, 2005, 2010), with ITRS 2 and 20 Gbps marked.]
B(2/2) FIX 2007/4 2007/4 2007/2 2008/2 2009/2 2009/3 2006 2007 2008 2009 2010 fix RTL LSI
B (1/2)
Program:
      DO i = 1, n
        ... = ... +B(i-1)+ ... +B(i)+ ... +B(i+1)
      END DO
[Figure: elements i-1, i, i+1 laid out over the vector length VL.]
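The loop on this slide reads three neighbouring elements of B per iteration, and the figure relates them to the vector length VL. As a hedged illustration only, the sketch below strip-mines a three-point sum into chunks of `vl` elements; the output expression, the value of `vl`, and the function names are assumptions, since the full statement is not recoverable from the slide.

```python
def stencil_scalar(b):
    """Reference: sum of B(i-1), B(i), B(i+1) for every interior i."""
    return [b[i - 1] + b[i] + b[i + 1] for i in range(1, len(b) - 1)]


def stencil_strip_mined(b, vl=4):
    """Same result, processed in chunks of at most vl elements."""
    n = len(b)
    out = []
    for start in range(1, n - 1, vl):
        stop = min(start + vl, n - 1)
        # Three shifted views of the same data: i-1, i and i+1.
        left = b[start - 1:stop - 1]
        mid = b[start:stop]
        right = b[start + 1:stop + 1]
        out.extend(l + m + r for l, m, r in zip(left, mid, right))
    return out


data = [1, 2, 3, 4, 5, 6, 7]
print(stencil_scalar(data))
print(stencil_strip_mined(data, vl=4))
```

The three slices overlap almost entirely, so most of each chunk is read three times; that overlap is the kind of reuse a buffer close to the vector unit can serve without going back to memory.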
B(2/2) 2007/4Q 2008/4Q 2009/4Q 2010/4Q 2006 2007 2008 2009 2010 fix RTL LSI
5.
21 / 9 / 7: SimFold, GAMESS, Modylas, RSDFT, NICAM, LatticeQCD, LANS; HPL (High Performance Linpack), NPB-FT
2007/4/27 4 80
2007/4/27 4 81
2007/4/27 4 82
2007/4/27 4 83
F 1/2 (SPARC64VI 1Core) SIMD SIMD or
F 2/2