Energy Performance for LINPACK Benchmark on DEGIMA
Abstract

GPU computing has lately attracted attention for its energy efficiency. Most GPU computing systems apply only coarse-grained optimization aimed at power consumption, not at energy efficiency. In this paper, we propose a fine-grained optimization method for energy-efficient GPU computing. We use an AMD/ATI Radeon HD 5870 GPU system and introduce a power consumption model that relates energy efficiency (Flops/Watt) to system parameters such as GPU frequency and voltage. We implement an energy-controllable library based on this model and apply it to the LINPACK benchmark. With our method, the energy efficiency of the LINPACK benchmark improved from 1.4698 GFlops/Watt to 1.9658 GFlops/Watt.
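1. High Performance Computing and GPU

High Performance Computing (HPC) systems are ranked twice a year in the TOP500 list according to their performance on the LINPACK benchmark 1).

1.1 HPC in the TOP500

In the November 2011 TOP500 list, 30 of the listed systems are in Japan 2). DEGIMA (DEstination for GPU Intensive MAchine) is a GPU cluster. Figure 1 compares the LINPACK benchmark result (Rmax) with the theoretical peak (Rpeak) for DEGIMA and for the K computer, TSUBAME2, and T2K-tsukuba.

[Figure 1: Rmax and Rpeak of DEGIMA, K computer, TSUBAME2, and T2K-tsukuba.]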
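1.2 GPUs in the TOP500

GPUs are used in a growing number of TOP500 systems. In the November 2011 TOP500 list, 39 systems use accelerators (2 use AMD GPUs, 2 use Cell processors, and the rest use Nvidia GPUs) 3).

1.3 Green500

The Green500 list ranks the TOP500 systems by energy efficiency (Flops/W). Among the top 50 systems of the November 2011 list 4), 60% are GPU systems.

[Figure 2: Top 50 of the November 2011 Green500 list. Horizontal axis: rank; vertical axis: energy efficiency (MFlops/W). The plot marks Blue Gene/Q (IBM), GPU clusters, DEGIMA (Nagasaki Univ), TSUBAME (Tokyo Tech), and K Computer (RIKEN).]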
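1.4 Power consumption toward 2019

In November 2011, the K computer exceeded 10 PFlops on the LINPACK benchmark. Extrapolating the trend of the top-ranked TOP500 system in Figure 3, performance reaches 1 EFlops, about 100 times that of the K computer, around 2019 5).

[Figure 3: Rmax of the TOP500 No. 1 system by year, 1993 to 2019, on a logarithmic axis from 10 GFlops to 1 EFlops.]

The K computer, ranked first in the TOP500, delivers 11.28 PFlops on LINPACK at 12.66 MW. At the same energy efficiency (Flops/W), a system 100 times faster than the K computer would consume 1266 MW. [The source also cites a figure of 2.75 from 6) at this point; its context is not recoverable from the extracted text.] As Figure 4 shows, the energy efficiency of the top-ranked Green500 system improved by a factor of 5.67 in 4 years. If this trend continues, efficiency improves by a factor of about 32 by 2019, and the power consumption becomes 39.38 MW (= 12.66 MW × 100 / 32.149).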
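The 2019 estimate can be retraced from the numbers quoted in this section. The worked equation below is a sketch that assumes the factor 32.149 is the 4-year improvement of 5.67 applied twice (2011 to 2019); this reading is inferred from the figures in the text rather than stated explicitly.

```latex
\begin{align*}
\text{efficiency gain, 2011 to 2019} &\approx 5.67^{2} \approx 32.149 \\
P_{\text{same efficiency}} &= 12.66\ \text{MW} \times 100 = 1266\ \text{MW} \\
P_{2019} &\approx \frac{1266\ \text{MW}}{32.149} \approx 39.38\ \text{MW}
\end{align*}
```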
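This is about 3 times the power consumption of the K computer.

[Figure 4: Energy efficiency (Flops/W) of the Green500 No. 1 system by year, 2007 to 2012, on a logarithmic axis.]

2. AMD/ATI Radeon 5870

DEGIMA uses the AMD/ATI Radeon 5870 GPU.

2.1 AMD/ATI Radeon 5870 specifications

Table 1 lists the specifications of the AMD/ATI Radeon 5870 7).

Table 1: Specifications of the ATI Radeon 5870.

  Engine clock speed                     850 MHz
  Processing power (single precision)    2.72 TFlops
  Processing power (double precision)    544 GFlops
  Memory clock speed                     1.2 GHz
  Memory bandwidth                       153.6 GB/s

Table 2 compares the AMD/ATI Radeon 5870 with other GPUs. The Tesla M2090 costs about 17 times as much as the Radeon 5870, and the Radeon 5870 delivers about 14 times the double-precision performance per unit price of the Tesla M2090.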
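The two ratios quoted above can be recomputed from the Table 2 figures. The following C sketch does so; the struct, array, and function names are illustrative and not from the paper, and the price unit is assumed to be yen because the prices are cited from Japanese retail sites.

```c
#include <assert.h>
#include <stddef.h>

/* One column of Table 2. Values are copied from the table; the type and
   function names are illustrative only. */
typedef struct {
    const char *name;
    double dp_gflops;  /* double-precision peak, GFlops */
    double tdp_watt;   /* thermal design power, W */
    double price_yen;  /* price as of December 2011 (unit assumed: yen) */
} GpuSpec;

static const GpuSpec TABLE2[] = {
    {"Radeon 5870", 544.0, 188.0,  25983.0},
    {"Tesla C2070", 515.0, 215.0, 212746.0},
    {"Tesla M2090", 665.0, 225.0, 452025.0},
    {"GTX 580",     198.0, 244.0,  39580.0},
};

/* Double-precision MFlops per yen. */
double mflops_per_yen(const GpuSpec *g) {
    assert(g != NULL && g->price_yen > 0.0);
    return g->dp_gflops * 1000.0 / g->price_yen;
}

/* Double-precision GFlops per watt of TDP. */
double gflops_per_watt_tdp(const GpuSpec *g) {
    assert(g != NULL && g->tdp_watt > 0.0);
    return g->dp_gflops / g->tdp_watt;
}

/* Price of b divided by price of a. */
double price_ratio(const GpuSpec *a, const GpuSpec *b) {
    assert(a != NULL && b != NULL && a->price_yen > 0.0);
    return b->price_yen / a->price_yen;
}
```

With these inputs, `price_ratio(&TABLE2[0], &TABLE2[2])` is a little over 17, and the ratio of `mflops_per_yen` between the Radeon 5870 and the Tesla M2090 is a little over 14.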
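Table 2: GPU comparison. Power is the TDP (Thermal Design Power). Prices are in yen as of December 2011 8)9)10)11)12)13)14).

                               Radeon 5870   Tesla C2070   Tesla M2090   GTX 580
  Process technology (nm)      40            40            40            40
  Single precision (TFlops)    2.72          1.03          1.33          1.581
  Double precision (GFlops)    544           515           665           198
  Power, TDP (W)               188           215           225           244
  Price (yen)                  25983         212746        452025        39580
  Double prec. (MFlops/yen)    20.94         2.42          1.45          5.00
  Double prec. (GFlops/W)      2.894         2.395         2.956         0.81

2.2 AMD GPU parameters

AMD/ATI GPUs have 3 performance levels. Each level is defined by 3 parameters (GPU engine frequency, GPU memory frequency, and GPU core voltage), so 3 × 3 = 9 parameters describe the operating states of the GPU.

The AMD Display Library (ADL) is a C library for controlling AMD/ATI GPUs 15). On the AMD Radeon 5870, ADL allows the GPU engine frequency to be set from 80 MHz to 1200 MHz in 5 MHz steps, the GPU memory frequency from 150 MHz to 1400 MHz in 5 MHz steps, and the GPU core voltage from 1.062 V to 1.212 V in 5 mV steps. Table 3 lists the 3 default levels of the GPU.

2.3 AMD GPU level control

This section examines how the AMD Radeon 5870 GPU changes its level during a run.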
Table 3: Default levels 0, 1, and 2 of the ATI Radeon 5870 GPU. Cells marked (missing) were lost in extraction.

  Level   Engine frequency (MHz)   Memory frequency (MHz)   Core voltage (V)
  2       (missing)                1200                     1.212
  1       600                      (missing)                1.112
  0       157                      (missing)                1.062

Figure 5 shows the state of the GPU during a LINPACK run of 25 seconds, obtained through ADL: the GPU load (Active Percent), the GPU temperature (temp), and the current level (Current lvl), together with the periods in which LINPACK calls the GPU (GPU Call).

[Figure 5: State of the Radeon 5870 GPU during LINPACK (25 s). Horizontal axis: time (sec). Plotted: Activity Percent, Current lvl, temp, and GPU Call.]
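In Figure 5, GPU Call switches ON and OFF as LINPACK alternates between GPU and CPU work (GPU Call is plotted as 100 while the GPU is being called). The GPU temperature (temp) and the GPU load follow GPU Call. The level (Current lvl), however, stays at 2 for the whole 25 seconds of the LINPACK run: it changes from 0 to 2 when LINPACK starts and from 2 to 0 when LINPACK ends.

Figure 6 shows the power consumption when the parameters of the Radeon 5870 are changed through ADL. Compared with the default settings, power consumption falls by 37.4% at idle, by 21.4% under LINPACK (high), and by 27.7% under LINPACK (low). Figure 7 shows the power consumption over the run of Figure 5; changing the parameters reduces it by 18.4%.

3. Energy library for AMD/ATI GPU

We implement a library that controls the power consumption of AMD/ATI GPUs.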
[Figure 6: Power consumption (watt) of the Radeon 5870 GPU system, default vs. this work: idle 95.25 vs. 59.65, LINPACK (high) 308.54 vs. 242.55, LINPACK (low) 152.12 vs. 110.04.]

[Figure 7: Power consumption (Watt) over the 25 s run of Figure 5, default vs. changed parameters ("change parameter").]

3.1 AMD/ATI GPU API

The API switches the GPU between 2 states.
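The API is called around each GPU call (GPU Call) to change the state of the GPU.

3.2 API usage

The library provides 2 API functions for C programs: (1) EL_SetHighestAutomatic and (2) EL_SetLowestAutomatic. Both take an EL_DEVICE, which is obtained from the EL_Init API; for the Radeon 5870 the device type is EN_DEVICE_TYPE_HD5870. Figure 8 shows an example.

Figure 8: Example use of the API.

    int main() {
        EL_DEVICE dev = EL_Init(EN_DEVICE_TYPE_HD5870);  // EnergyLibrary initialization
        /* ... host part ... */
        EL_SetHighestAutomatic(dev);                     // EnergyLibrary API
        /* ... GPU part ... */
        EL_SetLowestAutomatic(dev);                      // EnergyLibrary API
        /* ... host part ... */
    }

With the API, a C program switches the GPU between 2 states: EL_SetLowestAutomatic lowers the GPU setting while the GPU is not in use, and EL_SetHighestAutomatic raises it before the GPU is used.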
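The call pattern of Figure 8 can be exercised without a GPU by replacing the library with mocks. The sketch below is hypothetical throughout: the real EL_* functions are not documented in this text beyond their names, so the `ElState` struct, the handle representation, the level numbers (2 for highest and 0 for lowest, after Table 3), and the helper `run_gpu_section` are assumptions made only to illustrate bracketing a GPU section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the energy library of Figure 8. These mocks
   only record the requested level; they do not touch any hardware. */
typedef enum { EN_DEVICE_TYPE_HD5870 } EN_DEVICE_TYPE;
typedef struct ElState { int level; int switches; } ElState;
typedef ElState *EL_DEVICE;  /* assumed: the device is an opaque handle */

EL_DEVICE EL_Init(EN_DEVICE_TYPE type) {
    static ElState state;    /* single mock device */
    (void)type;
    state.level = 0;         /* assumed: start at the lowest level */
    state.switches = 0;
    return &state;
}

void EL_SetHighestAutomatic(EL_DEVICE dev) {
    assert(dev != NULL);
    dev->level = 2;          /* level 2 of Table 3 */
    dev->switches++;
}

void EL_SetLowestAutomatic(EL_DEVICE dev) {
    assert(dev != NULL);
    dev->level = 0;          /* level 0 of Table 3 */
    dev->switches++;
}

/* Placeholder for the GPU part: reports the level it ran at. */
int gpu_part(EL_DEVICE dev) {
    return dev->level;
}

/* Bracket the GPU part as in Figure 8: raise, run, lower. */
int run_gpu_section(EL_DEVICE dev) {
    EL_SetHighestAutomatic(dev);
    int level_during = gpu_part(dev);
    EL_SetLowestAutomatic(dev);
    return level_during;
}
```

Wrapping the raise and lower calls in one helper keeps every GPU section paired with a return to the low state, which is the property the power savings in this section depend on.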
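4. Measurement with DGEMM and LINPACK

Before applying the API to LINPACK, we measured how the GPU parameters affect the GPU, using 2 workloads: DGEMM and LINPACK. DGEMM (Double-precision General Matrix Multiply) is the main computational kernel of LINPACK.

4.1 Measurement environment

Figure 9 shows the measurement environment for the AMD Radeon 5870 GPU system.

[Figure 9: Measurement environment. AC 105 V passes through a digital multimeter, whose readings are logged by a log recorder (PC), into a power unit (500 W max). The power unit supplies DC 3.3 to 12 V to the host computer (CPU: Intel Core i5-2500T, memory: 16 GB DDR3-1600, GPU: AMD HD5870).]

4.2 DGEMM

We ran DGEMM with N=42000 while varying the GPU frequencies, at 3 GPU core voltages between 1.062 V and 1.212 V: V=1.062 V, V=1.137 V, and V=1.212 V.

4.2.1 GPU performance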
Figure 10 shows the DGEMM performance for each GPU frequency setting.

[Figure 10: DGEMM (N=M=42000) performance (GFlops) as a function of GPU engine frequency and GPU memory frequency, in 3 panels for GPU core voltages V=1.062 V, V=1.137 V, and V=1.212 V.]

4.2.2 GPU power consumption

Figure 11 shows the power consumption for each GPU frequency setting.

4.2.3 Energy efficiency
[Figure 11: DGEMM (N=M=42000) power consumption (Watt) as a function of GPU engine frequency and GPU memory frequency, in 3 panels for GPU core voltages V=1.062 V, V=1.137 V, and V=1.212 V.]

Figure 12 shows the energy efficiency. It is highest at a GPU engine frequency of 770 MHz and a GPU core voltage of 1.062 V; the corresponding GPU memory frequency is missing from the extracted text.

4.3 LINPACK

We measured LINPACK with N=39680 and NB=1280 while varying the GPU frequencies at V=1.062 V.

4.3.1 GPU performance
IPSJ SIG Technical Report

[Figure 12: DGEMM (N=M=42000) energy efficiency (GFlops/W) as a function of GPU engine frequency and GPU memory frequency, in 3 panels for GPU core voltages V=1.062 V, V=1.137 V, and V=1.212 V.]

Figure 13 shows the LINPACK performance for each GPU frequency setting. The trend is similar to that of DGEMM.

4.3.2 GPU power consumption

Figure 14 shows the LINPACK power consumption for each GPU frequency setting. The trend is again similar to that of DGEMM.

4.3.3 Energy efficiency

Figure 15 shows the LINPACK energy efficiency, which behaves like that of DGEMM.
[Figure 13: LINPACK (N=39680, NB=1280) performance (GFlops) as a function of GPU engine frequency and GPU memory frequency at V=1.062 V.]

[Figure 14: LINPACK (N=39680, NB=1280) power consumption (Watt) as a function of GPU engine frequency and GPU memory frequency at V=1.062 V.]

As with DGEMM, the energy efficiency of LINPACK is highest at a GPU engine frequency of 770 MHz and a GPU core voltage of 1.062 V; the GPU memory frequency is missing from the extracted text.

5. Modeling with DGEMM
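[Figure 15: LINPACK (N=39680, NB=1280) energy efficiency (GFlops/W) as a function of GPU engine frequency and GPU memory frequency at V=1.062 V.]

5.1 DGEMM

The models in this section are built from the DGEMM measurements.

5.2 Model 1

Model 1 is a fourth-order polynomial fit to the measured energy efficiency of Figure 12. It is given by Equation (1) and plotted in Figure 16. Its maximum is at a GPU engine frequency of 765 MHz, a GPU memory frequency of 930 MHz, and a GPU core voltage of 1.062 V. The magnitudes of the coefficients indicate that f_eng and f_mem are expressed in units of 10 kHz (76500 for 765 MHz), the unit that also appears on the axes of Figure 18.

\begin{align*}
f(f_{eng}, f_{mem}) ={}& -1.73659 \times 10^{-18} f_{eng}^{4} + 1.0627 \times 10^{-18} f_{mem}^{4} - 4.38584 \times 10^{-18} f_{eng}^{3} f_{mem} \\
& - 1.24579 \times 10^{-17} f_{eng} f_{mem}^{3} + 1.81337 \times 10^{-17} f_{eng}^{2} f_{mem}^{2} + 9.99745 \times 10^{-13} f_{eng}^{3} \\
& + 5.95166 \times 10^{-13} f_{mem}^{3} - 2.1966 \times 10^{-12} f_{eng}^{2} f_{mem} + 4.48017 \times 10^{-13} f_{eng} f_{mem}^{2} \\
& - 2.96352 \times 10^{-8} f_{eng}^{2} - 9.68834 \times 10^{-8} f_{mem}^{2} + 1.39058 \times 10^{-7} f_{eng} f_{mem} \\
& - 2.27374 \times 10^{-3} f_{eng} + 1.95304 \times 10^{-3} f_{mem} + 9.80607 \times 10^{-1} \tag{1}
\end{align*}

5.3 Model 2

Model 2 is built from the power model of Figure 17 and Equations (2) to (6), where W_eng and W_mem are the power consumed by the GPU engine and the GPU memory, f is a frequency, V is the GPU core voltage, and W_host is the power consumed by the host.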
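As a numerical check on Equation (1) of Section 5.2, the C sketch below evaluates the polynomial at a given frequency pair. It rests on two assumptions that are inferred rather than stated in the text: the minus signs and negative exponents lost in extraction are restored as shown, and frequencies are passed in units of 10 kHz. The function names are illustrative.

```c
#include <assert.h>

/* Convert MHz to the assumed model unit of 10 kHz (765 MHz -> 76500). */
double mhz_to_model_units(double mhz) {
    return mhz * 100.0;
}

/* Equation (1), model 1: fitted DGEMM energy efficiency in GFlops/W.
   fe and fm are the GPU engine and memory frequencies in units of 10 kHz.
   Signs and exponents follow the reconstructed form of the equation. */
double model1_gflops_per_watt(double fe, double fm) {
    assert(fe > 0.0 && fm > 0.0);
    double fe2 = fe * fe;
    double fm2 = fm * fm;
    return -1.73659e-18 * fe2 * fe2 + 1.0627e-18 * fm2 * fm2
           - 4.38584e-18 * fe2 * fe * fm
           - 1.24579e-17 * fe * fm2 * fm
           + 1.81337e-17 * fe2 * fm2
           + 9.99745e-13 * fe2 * fe + 5.95166e-13 * fm2 * fm
           - 2.1966e-12 * fe2 * fm + 4.48017e-13 * fe * fm2
           - 2.96352e-8 * fe2 - 9.68834e-8 * fm2
           + 1.39058e-7 * fe * fm
           - 2.27374e-3 * fe + 1.95304e-3 * fm
           + 9.80607e-1;
}
```

At the model 1 optimum quoted in the text (765 MHz, 930 MHz), `model1_gflops_per_watt(mhz_to_model_units(765), mhz_to_model_units(930))` comes out at about 2.5 GFlops/W, in line with the range of the measured DGEMM efficiency. Because the individual terms are of order 10^2 to 10^3 and largely cancel, the result is sensitive to every sign, which is why the reconstruction is worth checking this way.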
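[Figure 16: Model 1 for DGEMM (N=42000): energy efficiency (GFlops/W) as a function of the GPU frequencies, fitted to Figure 12.]

[Figure 17: Power model. The power unit draws W_in and delivers W_out to the host computer, which contains the GPU engine and the GPU memory. W_powerunit, W_host, W_eng, and W_mem are the power consumed in the power unit, the host, the GPU engine, and the GPU memory.]

Equations (4), (5), and (6) describe the GPU and host power and lead to Equation (10). Equations (7) and (8) describe the GPU performance and the energy efficiency.

\begin{align*}
W_{powerunit} &= W_{in} - W_{out} \tag{2} \\
W_{out} &= W_{eng} + W_{mem} + W_{host} \tag{3} \\
W_{eng} &= k_{eng} f_{eng} V^{2} \tag{4} \\
W_{mem} &= k_{mem} f_{mem} V^{2} \tag{5} \\
W_{host} &= \mathrm{Const} \tag{6} \\
S(\vec{f}) &= S(f_{eng}, f_{mem}) \tag{7} \\
E &= \frac{S(\vec{f})}{W_{in}(\vec{f}, V)} \tag{8}
\end{align*}

5.3.1 Model fitting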
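Equations (2) to (8) translate directly into code. The C sketch below is one possible reading, not the authors' implementation: it takes the linear coefficients and the constant of Equation (10) and assumes they were fitted at V = 1.062 V, so that k_eng and k_mem can be recovered by dividing by V squared. Frequencies are assumed to be in units of 10 kHz, the power-unit loss is left as a caller-supplied value, and all identifiers are illustrative.

```c
#include <assert.h>

#define V_FIT 1.062  /* assumed voltage at which Equation (10) was fitted */

/* k_eng and k_mem in W per (10 kHz * V^2), recovered from the linear
   coefficients of Equation (10) under the V_FIT assumption. */
static const double K_ENG = 3.60009e-4 / (V_FIT * V_FIT);
static const double K_MEM = 4.33553e-4 / (V_FIT * V_FIT);
static const double W_HOST = 89.9443;  /* W, constant of Equation (10) */

/* Equation (4): GPU engine power. */
double w_eng(double f_eng, double v) {
    return K_ENG * f_eng * v * v;
}

/* Equation (5): GPU memory power. */
double w_mem(double f_mem, double v) {
    return K_MEM * f_mem * v * v;
}

/* Equation (3): power delivered by the power unit. */
double w_out(double f_eng, double f_mem, double v) {
    return w_eng(f_eng, v) + w_mem(f_mem, v) + W_HOST;
}

/* Equation (2) rearranged: input power given the power-unit loss. */
double w_in(double wout, double w_powerunit) {
    return wout + w_powerunit;
}

/* Equation (8): energy efficiency in GFlops/W. */
double efficiency(double s_gflops, double win) {
    assert(win > 0.0);
    return s_gflops / win;
}
```

At the model 2 optimum quoted in the paper (765 MHz, 920 MHz, 1.062 V), `w_out(76500, 92000, 1.062)` evaluates to about 157.4 W. Raising V with the frequencies fixed increases only the engine and memory terms, which is why the lowest voltage is the most efficient setting in this model.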
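Fitting the DGEMM measurements gives Equation (9) for the performance and Equation (10) for the output power; they are plotted in Figures 18 and 20. The resulting energy efficiency, model 2, is shown in Figure 21, and Figure 22 shows the relative error. Model 2 has its maximum at a GPU engine frequency of 765 MHz, a GPU memory frequency of 920 MHz, and a GPU core voltage of 1.062 V.

\begin{align*}
S(f_{eng}, f_{mem}) ={}& -2.28649 \times 10^{-16} f_{eng}^{4} + 1.9017 \times 10^{-16} f_{mem}^{4} - 5.86035 \times 10^{-16} f_{eng}^{3} f_{mem} \\
& - 2.34738 \times 10^{-15} f_{eng} f_{mem}^{3} + 3.14867 \times 10^{-15} f_{eng}^{2} f_{mem}^{2} + 1.39896 \times 10^{-10} f_{eng}^{3} \\
& + 1.15289 \times 10^{-10} f_{mem}^{3} - 4.23504 \times 10^{-10} f_{eng}^{2} f_{mem} + 1.25702 \times 10^{-10} f_{eng} f_{mem}^{2} \\
& + 1.64027 \times 10^{-7} f_{eng}^{2} - 2.02001 \times 10^{-5} f_{mem}^{2} + 2.34208 \times 10^{-5} f_{eng} f_{mem} \\
& - 6.31176 \times 10^{-1} f_{eng} + 5.47227 \times 10^{-1} f_{mem} - 1.99416 \times 10^{-1} \tag{9}
\end{align*}

[Figure 18: Performance model s(x,y) for DGEMM (N=M=42000) in GFlops. The frequency axes are in units of 10 kHz (72000 to 84000).]

\begin{align*}
W_{out} = 3.60009 \times 10^{-4} f_{eng} + 4.33553 \times 10^{-4} f_{mem} + 89.9443 \tag{10}
\end{align*}

[Figure 19: Input power W_in (W) versus output power W_out (W) of the power unit, from 16).]

6. Application to LINPACK

This section starts from the DGEMM result of Figure 12.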
[Figure 20: Power model w(x,y) for DGEMM (N=M=42000) in Watt, corresponding to Figure 11. The frequency axes are in units of 10 kHz.]

[Figure 21: Model 2 for DGEMM (N=42000): energy efficiency (GFlops/W), derived from Figures 18 and 20.]

[Figure 22: Relative error, with respect to Figure 16, over the GPU frequency range; the plotted values lie between 0 and about 0.007.]

We run LINPACK at the setting with the highest measured DGEMM energy efficiency: a GPU engine frequency of 770 MHz and a GPU core voltage of 1.062 V (the GPU memory frequency is missing from the extracted text).

6.1 GPU control during LINPACK

Figure 23 shows the GPU setting of the AMD/ATI Radeon5870 during LINPACK.
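[Figure 23: GPU setting of the AMD/ATI Radeon 5870 during LINPACK over time (sec, 0 to 140): default control (Normal) vs. this work (This work). The vertical axis ranges from 0 to 1200.]

6.2 Power consumption

Figure 24 shows the power consumption of the AMD/ATI Radeon5870 system during LINPACK. It fell from 220.972 W to 159.552 W, a reduction of 27.8%.

Table 4: Settings with the highest energy efficiency, from the measurement (Figure 12) and from the models (Figures 16 and 21).

  Source                     Engine frequency   Memory frequency   Core voltage
  Measurement (Figure 12)    770 MHz            (missing) MHz      1.062 V
  Model 1 (Figure 16)        765 MHz            930 MHz            1.062 V
  Model 2 (Figure 21)        765 MHz            920 MHz            1.062 V

6.3 Applying the DGEMM models

We ran LINPACK with each of the settings in Table 4.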
[Figure 24: Power (W) and energy (Wh) during LINPACK on the AMD GPU system over time (sec, 0 to 140): default control (normal), this work (this work), and their difference (this work minus normal).]

Table 5 and Figure 25 show the LINPACK results for the settings of Table 4. The energy efficiency of LINPACK on the AMD/ATI Radeon5870 improved from 1.4698 Gflops/W to 1.9658 Gflops/W, an improvement of 33.7%.

Table 5: LINPACK results (N=39680, NB=1280).

  Setting                    Performance (Gflops)   Power (W)   Energy efficiency (Gflops/W)
  default                    324.8                  220.972     1.4698
  this work                  310.6                  159.552     1.9472
  this work (with model1)    309.1                  162.794     1.8993
  this work (with model2)    308.6                  157.017     1.9658
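The efficiency column of Table 5 and the two headline percentages (27.8% less power, 33.7% better efficiency) follow from the performance and power columns. The C sketch below recomputes them; the struct and function names are illustrative, and small differences in the last digit against the table are expected because the table inputs are rounded.

```c
#include <assert.h>
#include <stddef.h>

/* One row of Table 5. Values are copied from the table. */
typedef struct {
    const char *setting;
    double gflops;  /* LINPACK performance */
    double watt;    /* power consumption */
} LinpackRun;

static const LinpackRun TABLE5[] = {
    {"default",                 324.8, 220.972},
    {"this work",               310.6, 159.552},
    {"this work (with model1)", 309.1, 162.794},
    {"this work (with model2)", 308.6, 157.017},
};

/* Energy efficiency in Gflops/W. */
double linpack_efficiency(const LinpackRun *r) {
    assert(r != NULL && r->watt > 0.0);
    return r->gflops / r->watt;
}

/* Relative change from before to after, in percent. */
double percent_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `percent_change(TABLE5[0].watt, TABLE5[1].watt)` gives about -27.8, and `percent_change(1.4698, 1.9658)` gives about 33.7. The performance column shows what is traded for this: the model 2 setting gives up roughly 5% of the default LINPACK performance.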
[Figure 25: LINPACK results (N=39680, NB=1280): performance (GFlops), power (Watt), and energy efficiency (GFlops/W) for default, thiswork, thiswork (with model1), and thiswork (with model2).]

7. Conclusion

We optimized the GPU parameters for LINPACK using energy efficiency models built from DGEMM measurements. Applying the DGEMM-based settings to LINPACK improved its energy efficiency from 1.4698 Gflops/W to 1.9658 Gflops/W. The result is compared with the June 2011 and November 2011 Green500 lists.

References

1) J. Dongarra, LINPACK: Users' Guide, ser. Miscellaneous Bks. Society for Industrial and Applied Mathematics, 1979. [Online]. Available: http://books.google.co.jp/books?id=amsm1n3vw0cc
2) Top 500 countries share for 11/2011, 2011. [Online]. Available: http://www.top500.org/charts/list/38/countries
3) Top 500 press release, 2011. [Online]. Available: http://www.top500.org/lists/2011/11/press-release
4) The Green500 list, November 2011, 2011. [Online]. Available: http://www.green500.org/lists/2011/11/top/list.php?from=1&to=100
5) Top500 performance development, 2011. [Online]. Available: http://www.top500.org/lists/2011/06/performance_development
6) 2011. [Online]. Available: http://www.green500.org/lists/2011/11/top/list.php?from=1&to=100
7) ATI Radeon HD 5870 graphics specifications, 2011. [Online]. Available: http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5870/Pages/ati-radeon-hd-5870-overview.aspx#2
8) NVIDIA Tesla C2050 / C2070 GPU, 2011. [Online]. Available: http://www.nvidia.co.jp/object/product_tesla_C2050_C2070_jp.html
9) NextIO vCORE Extreme, 2011. [Online]. Available: http://www.elsa-jp.co.jp/products/nextio/vcore_extreme/index.html
10) G. Chen, L. Chacón, and D. C. Barnes, An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm, ArXiv e-prints, Nov. 2011.
11) Kakaku.com, EAH5870/2DIS/1GD5/V2 (PCIExp 1GB), 2011. [Online]. Available: http://kakaku.com/item/k0000102777/
12) NVIDIA Tesla C2070 [PCIExp 6GB], 2011. [Online]. Available: http://kakaku.com/item/k0000264157/?lid=ksearch_kakakuitem_title
13) NTT-X Store, 2011. [Online]. Available: http://nttxstore.jp/_II_HP13647981
14) Giada GTX580-DDR5 [PCIExp 1.5GB], 2011. [Online]. Available: http://kakaku.com/item/k0000321156/?lid=ksearch_kakakuitem_title
15) AMD Display Library (ADL) SDK, 2011. [Online]. Available: http://developer.amd.com/sdks/adlsdk/pages/default.aspx
16) 80 PLUS verification and testing report, 2011. [Online]. Available: http://www.acbel.com/productfile/80plus/acbel_PC6024_W_Report.pdf