

[…] vs. […]

Ryōhei NISHIMURA, Hidetsugu IRIE and Kei HIRAKI
Graduate School of Information Science and Technology, the University of Tokyo

© 2009 Information Processing Society of Japan

Abstract

In recent years the performance of a GPU has exceeded one TFLOPS, while demand for higher graphics performance has peaked. As a result, GPGPU, which applies GPUs of growing capability and growing range of supported processing to general-purpose computing, has become a hot topic. Competing with GPGPU is the […] processor, which integrates many SIMD processors on one chip. It also offers hundreds of MFLOPS and wide memory bandwidth, and it has attracted attention in the HPC field. We compared […] of the latest GPU architecture with the […] processor using six applications: matrix multiplication, FFT, sorting, password cracking of ZIP files, RandomAccess from the HPC Challenge benchmark, and calculation of the Levenshtein distance. The results show that the performance of […] was superior except in some of the applications.

1. [Section title and Japanese body text lost in extraction. Surviving terms, in order: Moore 15); Graphics Processing Unit (GPU); TFLOPS; General Purpose GPU (GPGPU) 13); SIMD 17); TOP; Roadrunner; NVIDIA; CUDA 7); Streaming Multiprocessor (SM), 30; Streaming Processor (SP), 8.]

Table 1 [Caption and cell values lost in extraction. Surviving labels, in order: on PS3 Linux; Core 2 Quad; (GHz); SMs; 7 SPEs; OS; (GFLOPS); SM; PPE; SPE; bit; SSE; (GiB/s); (MiB); (L1); (KiB); (W); OS: NVIDIA, Linux, Linux; CUDA 2.2; GCC; GCC.]

[GPU architecture description; Japanese body text lost in extraction. Surviving terms, in order: SM; 16 KiB Shared Memory; Constant Memory; Texture Memory; 11); CUDA; 32; Warp; Half Warp; 64 bit; 128 bit; Shared Memory accessed per Half Warp, 32 bit; Constant Memory per Half Warp; Texture Memory per Warp; Fedora 10; GPU attached over PCI Express 2.0 x16, 8 GB/s.]

2.2 [Section title and Japanese body text lost in extraction. Surviving terms, in order: HPC; PC; PPE; CPU; SPE; SIMD; 256 KiB Local Store per SPE; DMA; 128 bit SIMD; 2 Way; 7 SPE; PLAYSTATION 3; Fedora 10; IBM SDK; GHz; 80 W 21).]

[Matrix multiplication; section number, title and Japanese body text lost in extraction. Surviving terms, in order: TOP500; Linpack; Volkov 20); Streaming Multiprocessor; DMA; 16 KiB.]

3.2 FFT

[Japanese body text lost in extraction. Surviving terms, in order: (FFT) 9); 1000; Stockham 19); Cooley-Tukey; 1 Streaming Multiprocessor; 256; Shared Memory; 64 KiB; Local Store; PPE; SPE; TLB; 16 MiB.]

[Sorting; section number, title and Japanese body text lost in extraction. Surviving terms, in order: O(N (log N)^2) 6); O(N log N); O(N^2); 1 Streaming Multiprocessor; GTX; SIMD.]
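
The O(N (log N)^2) complexity cited alongside reference 6) (Batcher's sorting networks) is the comparison count of a bitonic sorting network. The following is a minimal sequential sketch in C of such a network; it is offered as an illustration only, since the surviving text does not confirm which of Batcher's networks the authors used or how they mapped it to Streaming Multiprocessors or SPEs. The function name bitonic_sort is hypothetical, and the input length is assumed to be a power of two.

```c
#include <stddef.h>

/* Sort a[0..n-1] in ascending order with a bitonic sorting network.
 * Assumption: n is a power of two (hypothetical helper, not from the paper).
 * Every compare-exchange in the innermost loop is independent of the others
 * for fixed k and j, which is what makes the network attractive for SIMD
 * and many-core hardware. */
void bitonic_sort(unsigned int *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1) {           /* length of sequences being merged */
        for (size_t j = k >> 1; j > 0; j >>= 1) {   /* distance between compared elements */
            for (size_t i = 0; i < n; i++) {
                size_t p = i ^ j;                   /* partner index */
                if (p > i) {
                    int ascending = ((i & k) == 0); /* direction of this sub-sequence */
                    if ((ascending && a[i] > a[p]) ||
                        (!ascending && a[i] < a[p])) {
                        unsigned int tmp = a[i];
                        a[i] = a[p];
                        a[p] = tmp;
                    }
                }
            }
        }
    }
}
```

The two outer loops run O((log N)^2) times in total and each pass performs N/2 compare-exchanges, which gives the O(N (log N)^2) bound. The comparison pattern does not depend on the data, unlike O(N log N) sorts such as quicksort or merge sort.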

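[Sorting, continued. Surviving terms: SPE; 64 KiB.]

[ZIP password cracking; section number, title and Japanese body text lost in extraction. Surviving terms, in order: ZIP; 8 bit; CRC; Streaming Multiprocessor; 192; Shared Memory; Constant Memory; Texture Memory; CRC table indexed by 8 bit.]

3.5 RandomAccess

[Japanese body text lost in extraction. Surviving terms, in order: HPC Challenge; Fig. 1; N; N = […]; Streaming Processor; bit; XOR; SPE; DMA.]

[Levenshtein distance; section number, title and Japanese body text lost in extraction. Surviving terms, in order: SACSIS 3); GPU Challenge 2); […] Challenge 1); Levenshtein 12); CUDA; GPU Challenge; Streaming Multiprocessor; SPE; 8 bit; 16; SIMD.]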
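
For reference, the Levenshtein distance 12) between two strings is the minimum number of single-character insertions, deletions and substitutions that turns one into the other. The following is a minimal sequential sketch in C of the standard dynamic-programming formulation, using two rows of the table. It is an illustration under stated assumptions, not the authors' CUDA or SPE implementation, whose details did not survive; the names levenshtein and min3 are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between NUL-terminated strings s and t.
 * Returns (size_t)-1 if memory allocation fails.
 * Hypothetical helper for illustration; only two table rows are kept. */
size_t levenshtein(const char *s, const char *t)
{
    size_t m = strlen(s), n = strlen(t);
    size_t *prev = malloc((n + 1) * sizeof *prev);
    size_t *curr = malloc((n + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }
    for (size_t j = 0; j <= n; j++)
        prev[j] = j;                       /* distance from the empty prefix of s */
    for (size_t i = 1; i <= m; i++) {
        curr[0] = i;                       /* distance to the empty prefix of t */
        for (size_t j = 1; j <= n; j++) {
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* deletion */
                           curr[j - 1] + 1,      /* insertion */
                           prev[j - 1] + cost);  /* match or substitution */
        }
        size_t *tmp = prev;                /* the finished row becomes the previous row */
        prev = curr;
        curr = tmp;
    }
    size_t d = prev[n];
    free(prev);
    free(curr);
    return d;
}
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal of the table can be computed independently. That property is the usual starting point for parallel and SIMD versions of this computation.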

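    #ifndef N
    /* Placeholder: N is the table size and must be a power of two.
       The value used in the paper is not recoverable. */
    #define N (1ULL << 20)
    #endif

    void init(unsigned long long t[]) {
        unsigned long long i;
        for (i = 0; i < N; i++) {
            t[i] = i;
        }
    }

    void update(unsigned long long t[]) {
        unsigned long long i;
        /* The listing leaves ran uninitialized; a start value of 1 is assumed.
           A start value of 0 would never change. */
        unsigned long long ran = 1;
        for (i = 0; i < N * 4; i++) {
            /* Shift left; XOR in the constant 7 when the top bit was set. */
            ran = (ran << 1) ^ (((signed long long) ran < 0) ? 7ULL : 0);
            t[ran & (N - 1)] ^= ran;
        }
    }

    int main() {
        static unsigned long long t[N];  /* static: too large for the stack */
        init(t);
        update(t);
        return 0;
    }

Fig. 1 RandomAccess

[Evaluation results; section titles, table captions and Japanese body text lost in extraction. Surviving table values and units, in order: (46.8); (GFLOPS); (367); (GFLOPS/W); (1.56); GPU; (0.483); (GFLOPS); (103); (GiB/s); (40.4); (MFLOPS/W); (436). Surviving terms, in order: FFT; GPU; Streaming Processor, 32 bit; Streaming Multiprocessor, 128 bit; FFT; GPU; FLOP.]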

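[Table captions and Japanese body text lost in extraction. Surviving table values and units, in order: (5.42); G[…]/s; (20.3); (GiB/s); (33.2); (M[…]/s/W); (86.0); GPU; (Mword/s); (Mword/s/W). Surviving terms, in order: ZIP; Local Store; FFT; CPU; Local Store.]

4.4 ZIP

[Japanese body text lost in extraction. Surviving terms: Texture Memory.]

4.5 RandomAccess

[Japanese body text lost in extraction. Surviving units and terms, in order: G[…]/s; (GiB/s); (MiB/s/W); RandomAccess; ([…]/s/W); Levenshtein; DMA.]

4.6 Levenshtein

[Japanese body text lost in extraction. Surviving terms, in order: 8 bit; 32 bit; GTX; 32 bit.]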

Fig. 3 [Caption lost in extraction; surviving term: GPU Challenge. Legend: including transfer; excluding transfer; performance per watt (including transfer); performance per watt (excluding transfer).]

5. [Section title and Japanese body text lost in extraction. Surviving terms, in order: GPGPU; OpenCL 16); NVIDIA; Intel; AMD; GPU; IBM; Brook 8); Brook+ 10); Scherl 18), 8800 GTX; Agarwal 5); 8800 GTX; SDK; CUDA; RapidMind SDK 14); Shared Memory; Local Store; SIMD; SP; 236 W.]

[Closing section; title and Japanese body text lost in extraction. Surviving terms, in order: GPGPU; Tesla (footnote 1); GPU.]

Footnote 1: Tesla C[…], […] W.

References

1) […] Challenge 2009.
2) GPU Challenge 2009.
3) SACSIS2009, […]sacsis/2009/.
4) W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. Automatic program transformations for virtual memory computers. In Proceedings of the 1979 National Computer Conference, June.
5) V. Agarwal, Lurng-Kuo Liu, and D. A. Bader. Financial modeling on the Cell Broadband Engine. Parallel and Distributed Processing, IPDPS, IEEE International Symposium on, pp. 1-12, April.
6) K. E. Batcher. Sorting networks and their applications. Proceedings of the AFIPS Spring Joint Computer Conference.
7) I. Buck. GeForce 8800 & NVIDIA CUDA: A new architecture for computing on the GPU. Website of the Supercomputing 06 Workshop "General-Purpose GPU Computing: Practice and Experience".
8) Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH 04: ACM SIGGRAPH 2004 Papers, New York, NY, USA, ACM.
9) James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19.
10) Advanced Micro Devices Inc. Brook+ SC07 BOF session. Supercomputing 2007 Conference, November.
11) James Laudon, Anoop Gupta, and Mark Horowitz. Interleaving: a multithreading technique targeting multiprocessors and workstations. SIGPLAN Not., Vol. 29, No. 11.
12) Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8.
13) David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff Woolley, and Aaron Lefohn. GPGPU: general purpose computation on graphics hardware. In SIGGRAPH 04: ACM SIGGRAPH 2004 Course Notes, p. 33, New York, NY, USA, ACM.
14) Michael D. McCool. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. The GSPx Multicore Applications Conference.
15) G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, Vol. 86, No. 1.
16) Aaftab Munshi. OpenCL.
17) D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation Cell processor. In Solid-State Circuits Conference, Digest of Technical Papers, ISSCC, IEEE International, Vol. 1.
18) H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). Nuclear Science Symposium Conference Record, NSS 07, IEEE, Vol. 6, Nov.
19) D. Takahashi. High-performance parallel FFT algorithms for the Hitachi SR8000. High Performance Computing in the Asia-Pacific Region, Proceedings, The Fourth International Conference/Exhibition on, Vol. 1.
20) Vasily Volkov. Homepage for Vasily Volkov.
21) D. Wang. ISSCC 2005: The Cell microprocessor. Real World Technologies, February. […]p=2.
