
[…] vs. […]

Ryōhei NISHIMURA,1 Hidetsugu IRIE1 and Kei HIRAKI1

1 Graduate School of Information Science and Technology, the University of Tokyo

Recently, GPU performance has exceeded 1 TFLOPS, while demand for high-performance GPUs for graphics has peaked. As a result, GPGPU, which applies GPUs of growing capability and a growing range of supported operations to general-purpose computing, has become a hot topic. Competing with GPGPU, there is a processor that integrates many SIMD processors on one chip. It also offers performance of hundreds of MFLOPS and wide memory bandwidth, and it has attracted attention in the field of HPC. We compared a GPU of the latest architecture with this processor using six applications: matrix multiplication, FFT, sorting, password cracking of ZIP files, RandomAccess from HPC Challenge, and calculation of the Levenshtein distance. The results showed that the performance of […] was superior except for some of the applications.

1. Moore 15) Graphics Processing Unit (GPU) 2008 6 1 TFLOPS GPU 1 GPU General Purpose GPU (GPGPU) 13) GPGPU 1 9 4 way SIMD 17) GPU TOP 500 2008 11 Roadrunner GPU GPGPU

2.

2.1 2008 6 1 NVIDIA GPU GPGPU CUDA 7) Streaming Multiprocessor (SM) 30 Streaming Processor (SP) 8 SP 1

© 2009 Information Processing Society of Japan

Table 1 Evaluation environments (row labels and the first two column names are inferred from the surviving values)

                                GTX 280           Cell (PS3 Linux)               Core 2 Quad Q9400
Clock (GHz)                     1.296             3.2                            2.66
Cores                           30 SMs            7 SPEs (1 used by the OS)      4
FLOP per Hz per core            24 (per SM)       8                              8
Peak performance (GFLOPS)       933.12            153.6 (without PPE and 1 SPE)  85.12
[…]                             512               1                              1
Registers per core              32 bit x 16384    128 bit x 128                  128 bit x 16 (SSE)
Memory bandwidth (GiB/s)        141.696           25.6                           12.8
Memory capacity (MiB)           1024              256                            4096
Local memory per core (KiB)     16                256                            32 (L1)
Power (W)                       236               64 *1                          95
OS / driver                     NVIDIA 185.18.08  Linux 2.6.27.21                Linux 2.6.27.21
Compiler                        CUDA 2.2          GCC 4.1.1                      GCC 4.3.2

*1 4 GHz 80 W 21)

1 SM 2 1 SM 16 KiB Shared Memory Shared Memory Constant Memory Texture Memory SM 11) CUDA SM 4 CUDA 32 Warp SM 32 Half Warp 16 32 bit 64 bit 128 bit Half Warp Shared Memory Half Warp 32 bit Shared Memory Constant Memory Half Warp Texture Memory Warp 1 Fedora 10 GPU PCI Express 2.0 x16 8 GB/s

2.2 HPCPC PPE CPU SPE SIMD SPE 256 KiB Local Store SPE Local Store DMA SPE 128 bit SIMD 1 Local Store 2 Way Local Store 2 DMA SPE 7 SPE SPE PLAYSTATION 3 Fedora 10 IBM SDK 3.1

3.

3.1 TOP500 Linpack 2048 × 2048 4) Volkov 20) Volkov 64 16 1 Streaming Multiprocessor Volkov 64 64 64 DMA 16 KiB

3.2 FFT (FFT) 9) 2^19 1 FFT 1000 FFT Stockham 19) FFT FFT FFT Cooley-Tukey 4 2 4 FFT 1 2 3 FFT 1 2 7 FFT 2 2 6 FFT 1 Streaming Multiprocessor 2 4 FFT 128 2 3 FFT 256 FFT Shared Memory 64 KiB 3 0 1 FFT 2 6 FFT 2 7 FFT 2 2 2 7 FFT 64 8 16 FFT Local Store 64 16 8 FFT Local Store Local Store 2 6 FFT 64 2 64 FFT 1 2 PPE SPE 3 TLB 16 MiB

3.3 2^20 FFT O(N (log N)^2) 6) O(N log N) O(N^2) 7 1 Streaming Multiprocessor 128 256 512 3 GTX 280 8 SIMD
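The O(N (log N)^2) cost cited with reference 6) is that of Batcher's bitonic sorting network. The following is a minimal sequential sketch in C of such a network, not the GPU or SPE implementation evaluated here; the function name `bitonic_sort` and its compare-exchange counter are illustrative assumptions. Every iteration of the innermost loop is independent of the others, which is what makes the network a natural fit for SIMD hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sort a[0..n-1] in ascending order with a bitonic sorting network.
   n must be a power of two. Returns the number of compare-exchange
   operations, which is n * log2(n) * (log2(n) + 1) / 4.
   (Hypothetical helper, written for illustration only.) */
unsigned long long bitonic_sort(uint32_t *a, size_t n)
{
    unsigned long long compares = 0;

    assert(n != 0 && (n & (n - 1)) == 0);

    for (size_t k = 2; k <= n; k <<= 1) {          /* length of merged sequences */
        for (size_t j = k >> 1; j > 0; j >>= 1) {  /* distance between partners */
            for (size_t i = 0; i < n; i++) {       /* independent per element */
                size_t partner = i ^ j;
                if (partner > i) {
                    /* Blocks of length k alternate between ascending and descending. */
                    int ascending = ((i & k) == 0);
                    compares++;
                    if ((a[i] > a[partner]) == ascending) {
                        uint32_t tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
    return compares;
}
```

For N = 2^20 keys the formula gives 110,100,480 compare-exchange operations regardless of the input order.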

SPE 64 KiB 2

3.4 ZIP ZIP ZIP 8 bit CRC ZIP 1 1 8 bit 5 ZIP 1 4 95 1 Streaming Multiprocessor 192 Shared Memory Constant Memory ZIP CRC Texture Memory 8 bit CRC 8 bit CRC CRC

3.5 RandomAccess HPC Challenge 1 N 2 N = 2^20 1 Streaming Processor 256 32 bit XOR XOR 2 SPE SPE DMA SPE 1 1

3.6 Levenshtein SACSIS 3) GPU Challenge 2) Challenge 1) 2 Levenshtein 12) CUDA Levenshtein 2 2 1 1 1 Levenshtein Levenshtein 2 1 1 GPU Challenge 9 128 128 1 Streaming Multiprocessor 64 64 1 SPE 8 bit 16 SIMD

4.

4.1 2
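For reference, the Levenshtein distance of Section 3.6 (reference 12)) can be computed with the classic dynamic-programming recurrence. The sketch below is a plain scalar version in C that keeps only two rows of the table; it is an illustration of the recurrence, not the CUDA or SPE code evaluated in this paper, and the function names are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance between the NUL-terminated strings s and t.
   Uses two rows of the DP table, so memory is O(strlen(t)).
   Returns (size_t)-1 if memory allocation fails.
   (Hypothetical helper, written for illustration only.) */
size_t levenshtein(const char *s, const char *t)
{
    size_t n = strlen(s);
    size_t m = strlen(t);
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *curr = malloc((m + 1) * sizeof *curr);
    size_t result;

    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    /* Distance from the empty prefix of s to each prefix of t. */
    for (size_t j = 0; j <= m; j++) {
        prev[j] = j;
    }

    for (size_t i = 1; i <= n; i++) {
        curr[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t cost = (s[i - 1] != t[j - 1]) ? 1 : 0;
            curr[j] = min3(prev[j] + 1,         /* deletion */
                           curr[j - 1] + 1,     /* insertion */
                           prev[j - 1] + cost); /* substitution or match */
        }
        /* The row just computed becomes the previous row. */
        size_t *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

Each cell depends only on its left, upper and upper-left neighbours, so cells on the same anti-diagonal can be computed in parallel; small cell values also fit in 8 bits, which is consistent with the 8 bit and 16-way SIMD figures mentioned for the SPE.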

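#define N (1 << 20)   /* table size; N = 2^20 in this evaluation */

/* Fill the table with its own indices. */
void init(unsigned long long t[]) {
    int i;
    for (i = 0; i < N; i++) {
        t[i] = i;
    }
}

/* Apply 4N pseudo-random XOR updates to the table. */
void update(unsigned long long t[]) {
    int i;
    unsigned long long ran = 1;   /* no seed in the original listing; 1 is assumed here */
    for (i = 0; i < N * 4; i++) {
        ran = (ran << 1) ^ (((signed long long) ran < 0) ? 7ULL : 0);
        t[ran & (N - 1)] ^= ran;
    }
}

int main() {
    static unsigned long long t[N];   /* static: 8 MiB is too large for a typical stack */
    init(t);
    update(t);
    return 0;
}

Fig. 1 RandomAccess

Table 2 Matrix multiplication (values in parentheses exclude GPU transfer; column names inferred from the power figures)

                     GTX 280         Cell
Time (ms)            59.5 (46.8)     140
(GFLOPS)             289 (367)       123
(GFLOPS/W)           1.22 (1.56)     1.92

Table 3 FFT (values in parentheses exclude GPU transfer; column names inferred from the power figures)

                     GTX 280         Cell
Time (ms per FFT)    1.97 (0.483)    1.68
(GFLOPS)             25.3 (103)      29.6
(GiB/s)              1.98 (40.4)     6.98
(MFLOPS/W)           107 (436)       462

GPU 2 1 Streaming Processor 32 bit Streaming Multiprocessor 128 bit 1 1 1

4.2 FFT 3 GPU 4 10 20 47.5 10 20 FLOP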
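The matrix multiplication and FFT figures reported above can be cross-checked with a few lines of arithmetic. The sketch below assumes the usual operation counts, 2N^3 FLOP for an N × N matrix product and 5N log2 N FLOP for an N-point complex FFT (47.5 × 2^20 FLOP for N = 2^19), and assumes the reported times are milliseconds per operation; under those assumptions it reproduces the reported GFLOPS and GFLOPS/W values to rounding. The function names are illustrative.

```c
#include <assert.h>

/* GFLOPS of one n x n matrix multiplication (2 * n^3 FLOP)
   that takes time_ms milliseconds. */
double matmul_gflops(unsigned n, double time_ms)
{
    assert(time_ms > 0.0);
    double flop = 2.0 * n * n * n;
    return flop / (time_ms * 1e-3) / 1e9;
}

/* GFLOPS of one complex FFT of 2^log2n points that takes time_ms
   milliseconds, assuming the conventional 5 * N * log2(N) FLOP count. */
double fft_gflops(unsigned log2n, double time_ms)
{
    assert(time_ms > 0.0 && log2n < 63);
    double n = (double)(1ULL << log2n);
    double flop = 5.0 * n * log2n;
    return flop / (time_ms * 1e-3) / 1e9;
}

/* Performance per watt for any throughput value. */
double per_watt(double value, double watts)
{
    assert(watts > 0.0);
    return value / watts;
}
```

For example, `matmul_gflops(2048, 59.5)` is about 288.7 and `per_watt(288.7, 236.0)` is about 1.22, matching the first column; `fft_gflops(19, 1.68)` is about 29.6, and dividing it by 64 W gives about 0.46 GFLOPS/W, that is, 462 MFLOPS/W.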

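Table 4 (values in parentheses exclude GPU transfer; column names inferred from the power figures)

                     GTX 280         Cell
Time (ms)            7.20 (5.42)     24.1
(G […]/s)            15.3 (20.3)     4.57
(GiB/s)              1.09 (33.2)     2.92
(M […]/s/W)          64.8 (86.0)     71.4

Table 5 ZIP (column names inferred from the power figures)

                     GTX 280         Cell
Time (s)             0.237           0.770
(Mword/s)            362             111
(Mword/s/W)          1.53            1.73

Local Store FFT

4.3 4 14 CPU 1 Local Store

4.4 ZIP 5 3 Texture Memory Texture Memory

4.5 RandomAccess 6 1 10 11.0 G /s 1 3

Table 6 RandomAccess (column names inferred from the power figures)

                     GTX 280         Cell
Time (ms)            27.0            207
(GiB/s)              2.31            0.302
(MiB/s/W)            10.0            4.83

Table 7 Levenshtein (column names inferred from the power figures)

                     GTX 280         Cell
Time ([…])           477             218
(10^3 […]/s/W)       8.88            71.7

DMA

4.6 Levenshtein 7 8 bit 8 bit 32 bit GTX 280 8 bit 32 bit 4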
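The RandomAccess bandwidth figures reported above (2.31 and 0.302 GiB/s for times of 27.0 and 207) are consistent with a simple accounting rule. The sketch below assumes that the times are in milliseconds, that the table holds N = 2^20 64-bit words, that 4N updates are performed, and that each update is counted as one 8-byte read plus one 8-byte write. These are assumptions chosen because they reproduce the reported numbers, not a description confirmed by the surviving text; the function names are illustrative.

```c
#include <assert.h>

/* Effective bandwidth in GiB/s of 4 * N RandomAccess updates on a table
   of N = 2^log2n 64-bit words, counting 16 bytes of traffic per update
   (8-byte read + 8-byte write). time_ms is the run time in milliseconds. */
double randomaccess_gibps(unsigned log2n, double time_ms)
{
    assert(time_ms > 0.0 && log2n < 58);
    double updates = 4.0 * (double)(1ULL << log2n);
    double bytes = updates * 16.0;
    return bytes / (time_ms * 1e-3) / (1024.0 * 1024.0 * 1024.0);
}

/* Convert GiB/s to MiB/s per watt. */
double mibps_per_watt(double gibps, double watts)
{
    assert(watts > 0.0);
    return gibps * 1024.0 / watts;
}
```

With these assumptions, `randomaccess_gibps(20, 27.0)` is about 2.31 and `randomaccess_gibps(20, 207.0)` is about 0.302; dividing by 236 W and 64 W respectively gives about 10.0 and 4.83 MiB/s/W.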

Fig. 2 (vertical axis from 0 to 9; legend: including transfer, excluding transfer, performance per watt (including transfer), performance per watt (excluding transfer))

3 GPU Challenge GPU Challenge Challenge

5. GPGPU OpenCL 16) NVIDIA Intel AMD GPU IBM AMD GPU GPGPU Brook 8) Brook+ 10) GPGPU Scherl 18) 8800 GTX Agarwal 5) 8800 GTX SDK CUDA RapidMind SDK 14)

6. 3 6 Shared Memory Local Store Local Store SIMD SP 236 W GPGPU

GPGPU Tesla *1 GPGPU GPU

*1 Tesla C1060 187.8 W

1) Challenge 2009, 2009. http://www.hpcc.jp/sacsis/2009/cell/.
2) GPU Challenge 2009, 2009. http://www.hpcc.jp/sacsis/2009/gpu/.
3) SACSIS2009, 2009. http://www.hpcc.jp/sacsis/2009/.
4) W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. Automatic program transformations for virtual memory computers. In Proceedings of the 1979 National Computer Conference, pp. 969-974, June 1979.
5) V. Agarwal, Lurng-Kuo Liu, and D. A. Bader. Financial modeling on the Cell Broadband Engine. In IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), pp. 1-12, April 2008.
6) K. E. Batcher. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computer Conference, 1968.
7) I. Buck. GeForce 8800 & NVIDIA CUDA: A new architecture for computing on the GPU. Website of the Supercomputing 06 Workshop "General-Purpose GPU Computing: Practice and Experience", 2006.
8) Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH 04: ACM SIGGRAPH 2004 Papers, pp. 777-786, New York, NY, USA, 2004. ACM.
9) James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., Vol. 19, pp. 297-301, 1965.
10) Advanced Micro Devices Inc. Brook+ SC07 BOF session. Supercomputing 2007 Conference, November 2007.
11) James Laudon, Anoop Gupta, and Mark Horowitz. Interleaving: a multithreading technique targeting multiprocessors and workstations. SIGPLAN Not., Vol. 29, No. 11, pp. 308-318, 1994.
12) Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8, 1966.
13) David Luebke, Mark Harris, Jens Krüger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff Woolley, and Aaron Lefohn. GPGPU: general purpose computation on graphics hardware. In SIGGRAPH 04: ACM SIGGRAPH 2004 Course Notes, p. 33, New York, NY, USA, 2004. ACM.
14) Michael D. McCool. Data-parallel programming on the Cell BE and the GPU using the RapidMind development platform. In GSPx Multicore Applications Conference, 2006.
15) G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, Vol. 86, No. 1, pp. 82-85, 1998.
16) Aaftab Munshi. OpenCL. http://s08.idav.ucdavis.edu/munshi-opencl.pdf.
17) D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The design and implementation of a first-generation Cell processor. In IEEE International Solid-State Circuits Conference (ISSCC 2005), Digest of Technical Papers, Vol. 1, pp. 184-592, 2005.
18) H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). In IEEE Nuclear Science Symposium Conference Record (NSS 07), Vol. 6, pp. 4464-4466, Oct. 26-Nov. 3, 2007.
19) D. Takahashi. High-performance parallel FFT algorithms for the Hitachi SR8000. In Proceedings of the Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, Vol. 1, pp. 192-199, 2000.
20) Vasily Volkov. Homepage for Vasily Volkov. http://www.cs.berkeley.edu/volkov/.
21) D. Wang. ISSCC 2005: The Cell microprocessor. Real World Technologies, February 2005. http://www.realworldtech.com/page.cfm?articleid=rwt021005084318&p=2.