GPU vs. Cell/B.E.

Ryōhei NISHIMURA,†1 Hidetsugu IRIE†1 and Kei HIRAKI†1

Recently, the performance of a single GPU has exceeded 1 TFLOPS, while the demand for ever higher graphics performance has begun to level off. GPGPU, which turns the growing computational power and programmability of GPUs to general-purpose processing, has therefore become a hot topic. A competing approach is a processor that integrates many SIMD cores on one chip; it likewise offers hundreds of GFLOPS and a wide memory bandwidth and has attracted attention in the HPC field. We compared a GPU of the most recent architecture with such a processor, the Cell/B.E., using six applications: matrix multiplication, FFT, sorting, password cracking of ZIP files, RandomAccess from the HPC Challenge benchmark, and calculation of the Levenshtein distance. The results show that the GPU was superior except on some of the applications.

1. Introduction

Driven by the transistor budgets predicted by Moore's law [15], the single-precision performance of a single GPU (Graphics Processing Unit) reached roughly 1 TFLOPS in June 2008. Using this computational power for workloads other than graphics is known as general-purpose computation on GPUs (GPGPU) [13]. A competing design is the Cell Broadband Engine (Cell/B.E.), which integrates one general-purpose core and eight 4-way SIMD cores on a single chip [17]; on the November 2008 TOP500 list, the Cell-based Roadrunner system was ranked first. This paper compares a GPU of the latest architecture with the Cell/B.E. on a set of general-purpose workloads.

2. Evaluation environment

2.1 GPU (GeForce GTX 280)

The GeForce GTX 280, released by NVIDIA in June 2008, is programmed for GPGPU through the CUDA environment [7]. The chip contains 30 Streaming Multiprocessors (SMs), each of which holds 8 Streaming Processors (SPs).

†1 Graduate School of Information Science and Technology, the University of Tokyo

Table 1: Evaluation environment

                                GeForce GTX 280                Cell/B.E. (PS3, Linux)                   Core 2 Quad Q9400
  Clock (GHz)                   1.296                          3.2                                      2.66
  Cores                         30 SMs (8 SPs each)            1 PPE + 7 SPEs (1 reserved by the OS)    4
  Peak performance (GFLOPS)     933.12                         153.6 (6 usable SPEs)                    85.12
  Registers                     16384 x 32 bit per SM          128 x 128 bit per SPE                    16 x 128 bit (SSE)
  Memory bandwidth (GiB/s)      141.696                        25.6                                     12.8
  Memory (MiB)                  1024                           256                                      4096
  On-chip memory (KiB)          16 (shared memory per SM)      256 (local store per SPE)                32 (L1 data cache)
  Power (W)                     236                            64 *1                                    95
  OS                            Linux 2.6.27.21 (NVIDIA driver 185.18.08)   Linux 2.6.27.21             Linux 2.6.27.21
  Compiler                      CUDA 2.2                       GCC 4.1.1                                GCC 4.3.2
  *1 At a 4 GHz clock the Cell/B.E. is rated at 80 W [21].

Besides its 8 SPs, each SM has 16 KiB of shared memory; there are also constant memory and texture memory, which are read through on-chip caches. An SM hides memory latency by interleaving many threads [11]. In CUDA, threads are scheduled in groups of 32 called warps, and memory instructions are issued per half-warp of 16 threads. Global-memory accesses of 32-, 64-, or 128-bit words by a half-warp are coalesced into a single transaction when the threads access consecutive addresses; shared memory is divided into 32-bit banks, so a half-warp proceeds without conflicts when its threads touch 16 different banks; constant memory is efficient when all threads of a half-warp read the same address; texture memory is cached. The host machine runs Fedora 10, and the GPU is attached by PCI Express 2.0 x16 (8 GB/s).

2.2 Cell/B.E.

The Cell Broadband Engine combines the PPE, a general-purpose CPU core, with SPEs, SIMD cores intended for computation. Each SPE has a 256 KiB local store and can reach main memory only through DMA transfers between the local store and main memory; DMA transfers can be overlapped with computation. The SPE operates on 128-bit SIMD registers and is a dual-issue (2-way) in-order core. On the PLAYSTATION 3, seven SPEs are enabled and one of them is reserved by the system software, so six SPEs are available to applications. The PS3 runs Fedora 10 with the IBM SDK 3.1.
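As a concrete illustration of these access rules, the following minimal CUDA kernel (our sketch, not code from the paper; the kernel name and the block size of 256 threads are arbitrary) stages data through shared memory so that each half-warp issues one coalesced global load and touches 16 distinct shared-memory banks:

    #include <cuda_runtime.h>

    // Each thread handles one 32-bit word. Thread k of a half-warp reads word
    // base+k, so the 16 loads fall in one aligned segment and coalesce into a
    // single transaction; the stride-1 shared-memory indexing maps the threads
    // to 16 different banks, so there are no bank conflicts.
    __global__ void scale_buffered(const float *in, float *out, int n, float a)
    {
        __shared__ float buf[256];                 // one element per thread of the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            buf[threadIdx.x] = in[i];              // coalesced load, conflict-free store
            out[i] = a * buf[threadIdx.x];         // no __syncthreads() needed: each thread
        }                                          // reuses only its own element
    }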

3. Benchmark applications

3.1 Matrix multiplication

Matrix multiplication is the kernel of the Linpack benchmark used by the TOP500 list; we multiply 2048 x 2048 single-precision matrices with a blocked algorithm [4]. The GPU version follows Volkov's implementation [20], in which each thread block computes a 64 x 16 block of the result matrix on one Streaming Multiprocessor. On the Cell/B.E., the matrices are handled in 64 x 64 blocks; each block occupies 16 KiB and is moved between main memory and the local stores by DMA.

3.2 FFT

We compute a one-dimensional complex single-precision FFT [9] of 2^19 points, repeated 1000 times. The transform uses the Stockham self-sorting formulation [19] and is decomposed by the Cooley-Tukey method into radix-4 and radix-2 stages. On the GPU, the 2^19-point transform is split into small FFTs (on the order of 2^6 to 2^7 points) that fit into the shared memory of a Streaming Multiprocessor. On the Cell/B.E., the transform is likewise split into small FFTs (64-, 16-, and 8-point pieces) that are computed inside the local store of each SPE, with the data staged in and out by DMA; 16 MiB pages are used to reduce TLB misses during these transfers.

3.3 Sorting

We sort an array of 2^20 elements. A bitonic sorting network [6] performs O(N (log N)^2) compare-exchange operations, more than the O(N log N) of sequential comparison sorts (though far fewer than the O(N^2) of naive methods), but its data-independent structure maps well onto SIMD and massively threaded hardware. On the GTX 280, the network is executed by thread blocks of 128, 256, or 512 threads per Streaming Multiprocessor; a sketch of the compare-exchange pass is given below.
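The paper does not reproduce its sorting kernels; the following is a minimal CUDA sketch of ours (kernel and variable names are arbitrary) of the compare-exchange pass that a bitonic sort repeats. Launching it once for every stage pair (k, j), with one thread per element, is what yields the O(N (log N)^2) operation count mentioned above.

    // One compare-exchange pass of a bitonic sort over d[0..n-1], n a power of two.
    // The host launches exactly n threads for every (k, j) produced by
    //   for (k = 2; k <= n; k <<= 1) for (j = k >> 1; j > 0; j >>= 1).
    __global__ void bitonic_step(float *d, unsigned int j, unsigned int k)
    {
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int ixj = i ^ j;                        // partner of element i in this pass
        if (ixj > i) {
            bool ascending = ((i & k) == 0);             // direction of this bitonic subsequence
            if ((ascending && d[i] > d[ixj]) || (!ascending && d[i] < d[ixj])) {
                float t = d[i]; d[i] = d[ixj]; d[ixj] = t;   // swap the out-of-order pair
            }
        }
    }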

On the Cell/B.E., each SPE sorts blocks of 64 KiB inside its local store, and the sorted blocks are then merged.

3.4 ZIP password cracking

The traditional ZIP encryption allows candidate passwords to be rejected quickly: each encrypted entry carries an 8-bit check byte derived from the CRC, so only about 1 in 256 wrong passwords survives this test and has to be verified further. We search passwords of up to four characters drawn from the 95 printable ASCII characters. On the GPU, each Streaming Multiprocessor runs 192 threads; candidate passwords are kept in shared memory and constant memory, and the CRC table used by the cipher, which is read with data-dependent 8-bit indices, is placed in the cached texture memory.

3.5 RandomAccess

RandomAccess, taken from the HPC Challenge benchmark suite, repeatedly XORs pseudo-random values into a table of N 64-bit words; we use N = 2^20 (Figure 1 shows the update loop). On the GPU, blocks of 256 threads perform the updates, and the 64-bit XOR of each update is carried out with 32-bit operations on the Streaming Processors. On the Cell/B.E., the SPEs fetch the table entries to be updated into their local stores by DMA, apply the XOR, and write them back by DMA.

3.6 Levenshtein distance

This problem is taken from the joint programming contests held with SACSIS 2009 [3], the GPU Challenge [2] and the Cell Challenge [1]: compute the Levenshtein (edit) distance [12] between two strings by dynamic programming. Each cell of the DP table depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. On the GPU, the table is processed in 128 x 128 blocks, with each Streaming Multiprocessor working on 64 x 64 tiles; on the Cell/B.E., each SPE handles the characters as 8-bit values using 16-way SIMD operations. A scalar reference version of the recurrence is sketched below.
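The contest code itself is not shown in the paper; as a plain-C reference for the recurrence that is being parallelized (our sketch; the function names levenshtein and min3 are hypothetical), the distance can be computed row by row with two buffers:

    /* Cell (i,j) depends only on cells (i-1,j), (i,j-1) and (i-1,j-1), which is
     * why all cells on one anti-diagonal can be computed in parallel. */
    #include <string.h>

    static int min3(int a, int b, int c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }

    int levenshtein(const char *s, const char *t)
    {
        int n = (int)strlen(s), m = (int)strlen(t);
        int prev[m + 1], cur[m + 1];               /* two rows are enough (C99 VLAs) */
        for (int j = 0; j <= m; j++) prev[j] = j;  /* distance from the empty prefix of s */
        for (int i = 1; i <= n; i++) {
            cur[0] = i;
            for (int j = 1; j <= m; j++) {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                cur[j] = min3(prev[j] + 1,         /* deletion      */
                              cur[j - 1] + 1,      /* insertion     */
                              prev[j - 1] + cost); /* substitution or match */
            }
            memcpy(prev, cur, (m + 1) * sizeof(int));
        }
        return prev[m];
    }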

    #define N (1 << 20)

    void init(unsigned long long t[]) {
        int i;
        for (i = 0; i < N; i++) {
            t[i] = i;
        }
    }

    void update(unsigned long long t[]) {
        int i;
        unsigned long long ran = 1;
        for (i = 0; i < N * 4; i++) {
            ran = (ran << 1) ^ (((signed long long) ran < 0) ? 7ULL : 0);
            t[ran & (N - 1)] ^= ran;
        }
    }

    int main() {
        unsigned long long t[N];
        init(t);
        update(t);
    }

Figure 1: The RandomAccess kernel: a table of N 64-bit words receives 4N pseudo-random XOR updates.

4. Results

4.1 Matrix multiplication

Table 2 shows the results for 2048 x 2048 matrix multiplication. The GPU reaches 289 GFLOPS including the PCI Express transfers and 367 GFLOPS without them, well above the 123 GFLOPS of the Cell/B.E.; in performance per watt, however, the Cell/B.E. stays ahead. Each Streaming Processor issues 32-bit operations, whereas each SPE operates on 128-bit SIMD registers.

Table 2: Matrix multiplication (2048 x 2048)

                                GPU (incl. transfer)   Cell/B.E.   GPU (excl. transfer)
  Time (ms)                     59.5                   140         (46.8)
  Performance (GFLOPS)          289                    123         (367)
  Per watt (GFLOPS/W)           1.22                   1.92        (1.56)

4.2 FFT

Table 3 shows the FFT results; the GFLOPS figures are derived from the nominal operation count of 5 N log2 N for a 2^19-point transform. When the host-device transfer is included, the Cell/B.E. is slightly faster than the GPU (1.68 ms against 1.97 ms), while the GPU kernel alone is far faster (0.483 ms); in performance per watt the Cell/B.E. is clearly ahead.

Table 3: FFT (2^19 points)

                                GPU (incl. transfer)   Cell/B.E.   GPU (excl. transfer)
  Time (ms)                     1.97                   1.68        (0.483)
  Performance (GFLOPS)          25.3                   29.6        (103)
  (GiB/s)                       1.98                   6.98        (40.4)
  Per watt (MFLOPS/W)           107                    462         (436)
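The per-watt rows of the result tables can be reproduced from the power column of Table 1 (236 W for the GTX 280, 64 W for the Cell/B.E.); for matrix multiplication, for example:

    GPU:        289 GFLOPS / 236 W = 1.22 GFLOPS/W
    Cell/B.E.:  123 GFLOPS /  64 W = 1.92 GFLOPS/W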

4.3 Sorting

Table 4 shows the sorting results. Including transfers, the GPU sorts more than three times faster than the Cell/B.E.; on the Cell/B.E. the sort is constrained by the capacity of the local stores.

Table 4: Sorting

                                GPU (incl. transfer)   Cell/B.E.   GPU (excl. transfer)
  Time (ms)                     7.20                   24.1        (5.42)
  Throughput (M elements/s)     15.3                   4.57        (20.3)
  (GiB/s)                       1.09                   2.92        (33.2)
  Per watt (M elements/s/W)     64.8                   71.4        (86.0)

4.4 ZIP password cracking

Table 5 shows the results of the password search. The GPU tests candidate passwords about three times faster than the Cell/B.E.; the table lookups of the CRC computation are served from the cached texture memory rather than from global memory.

Table 5: ZIP password cracking

                                GPU        Cell/B.E.
  Time (s)                      0.237      0.770
  Search speed (Mword/s)        362        111
  Per watt (Mword/s/W)          1.53       1.73

4.5 RandomAccess

Table 6 shows the RandomAccess results. Random 64-bit updates extract only a small fraction of either machine's peak memory bandwidth: the GPU sustains 2.31 GiB/s, while the Cell/B.E. reaches only 0.302 GiB/s because every update needs its own small DMA transfer between main memory and a local store.

Table 6: RandomAccess (N = 2^20)

                                GPU        Cell/B.E.
  Time (ms)                     27.0       207
  Effective bandwidth (GiB/s)   2.31       0.302
  Per watt (MiB/s/W)            10.0       4.83

4.6 Levenshtein distance

Table 7 shows the results for the Levenshtein distance. Here the Cell/B.E. wins: it processes the characters with 16-way 8-bit SIMD operations, whereas the GTX 280 has no 8-bit SIMD instructions and handles each character with 32-bit operations. The Cell/B.E. is roughly twice as fast and several times better in performance per watt.

Table 7: Levenshtein distance

                                GPU        Cell/B.E.
  Time (ms)                     477        218
  Per watt (10^3 /s/W)          8.88       71.7

Figure 3: Comparison across the applications (series: including transfer, excluding transfer, performance per watt including transfer, performance per watt excluding transfer).

The Levenshtein-distance problem is the one that was posed in the GPU Challenge and the Cell Challenge.

5. Related work

OpenCL [16] is being standardized as a common GPGPU programming environment covering NVIDIA and AMD GPUs as well as Intel, AMD, and IBM processors. Brook [8] and its AMD extension Brook+ [10] are stream-programming languages for GPGPU. Scherl et al. [18] implemented CT reconstruction with CUDA on a GeForce 8800 GTX, and Agarwal et al. [5] evaluated financial modeling on the Cell/B.E.; McCool [14] describes programming both the Cell/B.E. and GPUs with the RapidMind development platform.

6. Conclusion

We compared a GeForce GTX 280 and the Cell/B.E. using the six applications of Section 3. The GPU was faster on most of them, but the architectural differences show through in the results: the 16 KiB of shared memory per SM is far smaller than the 256 KiB local store of each SPE, the SPEs offer narrow (8- and 16-bit) SIMD operations that the scalar 32-bit SPs lack, and the 236 W power rating of the GTX 280 weighs on its performance per watt.

For dedicated GPGPU use, boards such as the Tesla series, which are intended for computation rather than graphics, are also available.*1

References
1) Cell Challenge 2009, 2009. http://www.hpcc.jp/sacsis/2009/cell/.
2) GPU Challenge 2009, 2009. http://www.hpcc.jp/sacsis/2009/gpu/.
3) SACSIS 2009, 2009. http://www.hpcc.jp/sacsis/2009/.
4) W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. Automatic program transformations for virtual memory computers. In Proceedings of the 1979 National Computer Conference, pp. 969-974, June 1979.
5) V. Agarwal, Lurng-Kuo Liu, and D. A. Bader. Financial modeling on the Cell Broadband Engine. In Parallel and Distributed Processing, 2008 (IPDPS 2008), IEEE International Symposium on, pp. 1-12, April 2008.
6) K. E. Batcher. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computer Conference, 1968.
7) I. Buck. GeForce 8800 & NVIDIA CUDA: A new architecture for computing on the GPU. Supercomputing '06 Workshop "General-Purpose GPU Computing: Practice and Experience", 2006.
8) Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pp. 777-786, New York, NY, USA, 2004. ACM.
9) James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., Vol. 19, pp. 297-301, 1965.
10) Advanced Micro Devices Inc. Brook+ SC07 BOF session. Supercomputing 2007 Conference, November 2007.
11) James Laudon, Anoop Gupta, and Mark Horowitz. Interleaving: a multithreading technique targeting multiprocessors and workstations. SIGPLAN Not., Vol. 29, No. 11, pp. 308-318, 1994.
12) Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, 1966.
13) David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff Woolley, and Aaron Lefohn. GPGPU: general-purpose computation on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Course Notes, p. 33, New York, NY, USA, 2004. ACM.
14) Michael D. McCool. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. The GSPx Multicore Applications Conference, 2006.
15) G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, Vol. 86, No. 1, pp. 82-85, 1998.
16) Aaftab Munshi. OpenCL. http://s08.idav.ucdavis.edu/munshi-opencl.pdf.
17) D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation Cell processor. In Solid-State Circuits Conference, 2005 (ISSCC 2005), Digest of Technical Papers, pp. 184-592, Vol. 1, 2005.
18) H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). Nuclear Science Symposium Conference Record, 2007 (NSS '07), IEEE, Vol. 6, pp. 4464-4466, 2007.
19) D. Takahashi. High-performance parallel FFT algorithms for the Hitachi SR8000. In High Performance Computing in the Asia-Pacific Region, 2000, Proceedings of the Fourth International Conference/Exhibition on, Vol. 1, pp. 192-199, 2000.
20) Vasily Volkov. Homepage of Vasily Volkov. http://www.cs.berkeley.edu/volkov/.
21) D. Wang. ISSCC 2005: The Cell microprocessor. Real World Technologies, February 2005. http://www.realworldtech.com/page.cfm?articleid=rwt021005084318&p=2.

*1 The Tesla C1060 is rated at 187.8 W.

© 2009 Information Processing Society of Japan