
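FFT

1 Introduction

This article deals with the fast Fourier transform (FFT).

2 Fast Fourier Transform

2.1 Discrete Fourier Transform

The FFT is an algorithm for computing the discrete Fourier transform (DFT) quickly. The n-point DFT is defined as

$$y_k = \sum_{j=0}^{n-1} x_j \, \omega_n^{jk}, \qquad 0 \le k \le n-1 \qquad (1)$$

where $x_j$ is the input, $y_k$ is the output, $\omega_n = e^{-2\pi i/n}$, and $i = \sqrt{-1}$.

Evaluating equation (1) directly takes $O(n^2)$ operations for an n-point DFT, whereas the FFT needs only $O(n \log n)$ operations [4, 16]. Well-known FFT algorithms include the Cooley-Tukey algorithm [6] and the Stockham algorithm [5, 13].

スーパーコンピューティングニュース - 123 -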
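Equation (1) can be evaluated literally with two nested loops, which is where the $O(n^2)$ count comes from. The following is a minimal sketch of that direct evaluation, not an FFT; the program name, the length n = 8 and the impulse test input are illustrative assumptions, and the sign of the exponent follows $\omega_n = e^{-2\pi i/n}$.

```fortran
! Direct O(n^2) evaluation of equation (1). Reference sketch only, not an FFT.
program direct_dft
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8                  ! transform length (illustrative)
  real(dp), parameter :: pi = 3.141592653589793238_dp
  complex(dp) :: x(0:n-1), y(0:n-1), w
  real(dp) :: theta, err
  integer :: j, k

  ! Test input: unit impulse at j = 1, so equation (1) gives y(k) = omega_n**k.
  x = (0.0_dp, 0.0_dp)
  x(1) = (1.0_dp, 0.0_dp)

  do k = 0, n - 1
     y(k) = (0.0_dp, 0.0_dp)
     do j = 0, n - 1
        ! omega_n**(j*k); the exponent is reduced modulo n to keep theta small
        theta = -2.0_dp * pi * real(mod(j*k, n), dp) / real(n, dp)
        w = cmplx(cos(theta), sin(theta), kind=dp)
        y(k) = y(k) + x(j) * w
     end do
  end do

  ! Self-check against the closed form for the impulse input.
  err = 0.0_dp
  do k = 0, n - 1
     theta = -2.0_dp * pi * real(k, dp) / real(n, dp)
     err = max(err, abs(y(k) - cmplx(cos(theta), sin(theta), kind=dp)))
  end do
  print '(a, es10.3)', 'max error vs omega_n**k: ', err
  if (err > 1.0e-12_dp) error stop 'direct DFT check failed'
end program direct_dft
```

The inner loop runs n times for each of the n outputs, so the work grows as $n^2$; an FFT reorganizes the same sums to reach $O(n \log n)$.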

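[Figure 1: Memory hierarchy. Labels: CPU; Levels in the memory hierarchy (Level 1, Level 2, ..., Level n); Increasing distance from the CPU in access time; Size of the memory at each level.]

2.2 FFT

FFT algorithms are described in [16, 4]. The basic computation of a radix-p FFT is

$$Y(k) = \sum_{j=0}^{p-1} X(j) \, \Omega^{j} \, \omega_p^{jk} \qquad (2)$$

evaluated for each of the p outputs, where $\Omega$ is the twiddle factor [4]. In other words, one step of a radix-p FFT multiplies $X(j)$ by $\Omega^{j}$ and then applies a p-point DFT [10]. Mixed-radix FFTs based on equation (2) are treated in [12, 15].

3 Memory Hierarchy

3.1 Memory Hierarchy and Locality

Figure 1 shows a memory hierarchy. A memory hierarchy relies on the locality of memory references.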
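As a concrete instance of equation (2) in Section 2.2, the radix-2 case can be written out in full. This is a worked example under the convention $\omega_p = e^{-2\pi i/p}$, so that $\omega_2 = -1$; it is not taken from the original text.

```latex
% Equation (2) with p = 2, using \omega_2 = e^{-\pi i} = -1
\begin{aligned}
Y(0) &= X(0)\,\Omega^{0}\omega_2^{0} + X(1)\,\Omega^{1}\omega_2^{0} = X(0) + \Omega\,X(1),\\
Y(1) &= X(0)\,\Omega^{0}\omega_2^{0} + X(1)\,\Omega^{1}\omega_2^{1} = X(0) - \Omega\,X(1).
\end{aligned}
```

One radix-2 step therefore costs one complex multiplication by the twiddle factor, one complex addition and one complex subtraction.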

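[Figure 2: Memory hierarchy of a RISC processor. Labels: CPU, L1 Cache, L2 Cache, Main Memory.]

3.2 Cache Memory

As in Figure 2, a first-level cache (L1 Cache) and a second-level cache (L2 Cache) sit between the CPU and main memory.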

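C     ZAXPY: Y = Y + A*X for double-complex vectors of length N
      SUBROUTINE ZAXPY(N,A,X,Y)
      IMPLICIT REAL*8 (A-H,O-Z)
      COMPLEX*16 A,X(*),Y(*)
      DO I=1,N
        Y(I)=Y(I)+A*X(I)
      END DO
      RETURN
      END

Figure 3: ZAXPY

3.3 ZAXPY

ZAXPY (A X plus Y), shown in Figure 3, is used here as an example related to the FFT. Figure 4 shows its performance on the following platform:

- Processor: Intel Xeon 3.06 GHz, FSB 533 MHz, 512 KB L2 cache
- Memory: PC2100 DDR-SDRAM
- Compiler: Intel C Compiler 8.0
- Instructions: the SIMD (Single Instruction Multiple Data) instructions of the Intel Pentium 4, SSE2 [8], and x87 instructions
- Library: the ZAXPY routine of the BLAS (Basic Linear Algebra Subprograms) in Intel MKL (Math Kernel Library) Version 6.1.1 [9]

In Figure 3, one iteration performs 4 loads and 2 stores of double-precision words.

[Figure 4: ZAXPY performance. Recoverable labels: L2, N = 8192, with SSE2, x87.]

The peak performance of the Xeon 3.06 GHz is 6.12 GFLOPS; the text cites a figure of 3 GFLOPS for the version with SSE2.

4 Six-Step FFT

This section describes the six-step FFT [3, 16].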
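Returning to the ZAXPY loop of Figure 3, the arithmetic and memory traffic per iteration can be counted directly from the statement Y(I)=Y(I)+A*X(I). The following is a rough sketch of that count. It considers only the traffic explicit in the loop (A is assumed to stay in a register), ignores cache-line and write-allocate effects, and uses the 6.12 GFLOPS peak and the 3 GFLOPS figure that survive in the text.

```latex
\begin{aligned}
\text{flops per iteration} &= 4\ \text{multiplications} + 4\ \text{additions} = 8,\\
\text{bytes per iteration} &= \underbrace{16}_{\text{load } X(I)} + \underbrace{16}_{\text{load } Y(I)} + \underbrace{16}_{\text{store } Y(I)} = 48,\\
\text{bytes per flop} &= 48 / 8 = 6,\\
\text{flops per cycle at peak} &= \frac{6.12\ \text{GFLOPS}}{3.06\ \text{GHz}} = 2,\\
\text{bandwidth for 3 GFLOPS} &= 3 \times 10^{9}\ \text{flop/s} \times 6\ \text{B/flop} = 18\ \text{GB/s}.
\end{aligned}
```

Under these assumptions, any memory level that delivers less than about 18 GB/s to the loop cannot sustain 3 GFLOPS, which is why ZAXPY performance depends so strongly on whether X and Y stay in cache.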



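 1      COMPLEX*16 X(N1,N2),Y(N2,N1),U(N2,N1)
 2      DO I=1,N1
 3        DO J=1,N2
 4          Y(J,I)=X(I,J)
 5        END DO
 6      END DO
 7      DO I=1,N1
 8        CALL IN_CACHE_FFT(Y(1,I),N2)
 9      END DO
10      DO I=1,N1
11        DO J=1,N2
12          Y(J,I)=Y(J,I)*U(J,I)
13        END DO
14      END DO
15      DO J=1,N2
16        DO I=1,N1
17          X(I,J)=Y(J,I)
18        END DO
19      END DO
20      DO J=1,N2
21        CALL IN_CACHE_FFT(X(1,J),N1)
22      END DO
23      DO I=1,N1
24        DO J=1,N2
25          Y(J,I)=X(I,J)
26        END DO
27      END DO

Figure 5: six-step FFT

Figure 6 shows the blocked six-step FFT, and Figure 7 illustrates it. In Figure 6, NB is the block size, NP is the padding size, and WORK is the work array. In Figure 7, the numbers 1 through 16 in the arrays X, WORK, and Y indicate the order in which the arrays are accessed. Data are moved between X and WORK, and the multicolumn FFTs are performed on WORK. The blocked six-step FFT is a two-pass algorithm [3, 16].

An n-point FFT performed with the six-step FFT requires $O(n \log n)$ arithmetic operations but only $O(n)$ main-memory accesses, provided that each column FFT in Step 2 and Step 4 fits in the L1 cache. When n is very large, a column FFT may no longer fit in the L1 cache [1, 2]. In that case the column FFTs cannot be completed in cache, and the two-pass six-step FFT has to be extended to a three-pass FFT.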
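The six steps of Figure 5 can be checked numerically on a small problem. The sketch below runs them for n = N1 x N2 = 4 x 8 and compares the result with a direct DFT. Two things are assumptions rather than content of the surviving text: the twiddle array is taken to be U(J,I) = $\omega_n^{(I-1)(J-1)}$, and a naive $O(m^2)$ DFT stands in for IN_CACHE_FFT. The helper names root and in_cache_fft are hypothetical.

```fortran
! Six-step FFT of Figure 5, checked against a direct DFT.
! Assumptions: U(J,I) = omega_n**((I-1)*(J-1)); a naive DFT replaces IN_CACHE_FFT.
program six_step_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 4, n2 = 8, n = n1*n2
  real(dp), parameter :: pi = 3.141592653589793238_dp
  complex(dp) :: x(n1,n2), y(n2,n1), u(n2,n1)
  complex(dp) :: xin(n), ref(n), res(n)
  real(dp) :: err
  integer :: i, j

  do j = 1, n                                   ! arbitrary test input
     xin(j) = cmplx(real(j, dp), real(mod(3*j, 7), dp), kind=dp)
  end do
  ref = xin
  call in_cache_fft(ref, n)                     ! direct n-point DFT as reference
  x = reshape(xin, [n1, n2])                    ! column-major view of the input

  do i = 1, n1                                  ! twiddle factors (assumed form)
     do j = 1, n2
        u(j,i) = root(n, (i-1)*(j-1))
     end do
  end do

  do i = 1, n1                                  ! Step 1: transpose
     do j = 1, n2
        y(j,i) = x(i,j)
     end do
  end do
  do i = 1, n1                                  ! Step 2: N1 individual N2-point FFTs
     call in_cache_fft(y(:,i), n2)
  end do
  do i = 1, n1                                  ! Step 3: twiddle multiplication
     do j = 1, n2
        y(j,i) = y(j,i)*u(j,i)
     end do
  end do
  do j = 1, n2                                  ! Step 4: transpose
     do i = 1, n1
        x(i,j) = y(j,i)
     end do
  end do
  do j = 1, n2                                  ! Step 5: N2 individual N1-point FFTs
     call in_cache_fft(x(:,j), n1)
  end do
  do i = 1, n1                                  ! Step 6: transpose
     do j = 1, n2
        y(j,i) = x(i,j)
     end do
  end do

  res = reshape(y, [n])
  err = maxval(abs(res - ref))
  print '(a, es10.3)', 'max |six-step - direct DFT| = ', err
  if (err > 1.0e-9_dp) error stop 'six-step check failed'

contains

  function root(m, e) result(w)                 ! omega_m**e, omega_m = exp(-2*pi*i/m)
    integer, intent(in) :: m, e
    complex(dp) :: w
    real(dp) :: theta
    theta = -2.0_dp*pi*real(mod(e, m), dp)/real(m, dp)
    w = cmplx(cos(theta), sin(theta), kind=dp)
  end function root

  subroutine in_cache_fft(a, m)                 ! naive in-place m-point DFT
    integer, intent(in) :: m
    complex(dp), intent(inout) :: a(0:m-1)
    complex(dp) :: b(0:m-1)
    integer :: ja, ka
    do ka = 0, m - 1
       b(ka) = (0.0_dp, 0.0_dp)
       do ja = 0, m - 1
          b(ka) = b(ka) + a(ja)*root(m, ja*ka)
       end do
    end do
    a = b
  end subroutine in_cache_fft
end program six_step_demo
```

With the input index written as j = j1 + j2 N1 and the output index as k = k2 + k1 N2, the three transposes only rearrange data so that every column FFT works on contiguous memory.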

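 1      COMPLEX*16 X(N1,N2),Y(N2,N1),U(N1,N2)
 2      COMPLEX*16 WORK(N2+NP,NB)
 3      DO II=1,N1,NB
 4        DO JJ=1,N2,NB
 5          DO I=II,II+NB-1
 6            DO J=JJ,JJ+NB-1
 7              WORK(J,I-II+1)=X(I,J)
 8            END DO
 9          END DO
10        END DO
11        DO I=1,NB
12          CALL IN_CACHE_FFT(WORK(1,I),N2)
13        END DO
14        DO J=1,N2
15          DO I=II,II+NB-1
16            X(I,J)=WORK(J,I-II+1)*U(I,J)
17          END DO
18        END DO
19      END DO
20      DO JJ=1,N2,NB
21        DO J=JJ,JJ+NB-1
22          CALL IN_CACHE_FFT(X(1,J),N1)
23        END DO
24        DO I=1,N1
25          DO J=JJ,JJ+NB-1
26            Y(J,I)=X(I,J)
27          END DO
28        END DO
29      END DO

Figure 6: blocked six-step FFT

When an out-of-place algorithm such as the Stockham algorithm [5, 13] is used for the multicolumn FFTs in Step 2 and Step 4, the FFTs need an additional work area of only $O(\sqrt{n})$. Lines 24 through 28 of Figure 6 carry out the transposition of Steps 5 and 6, and the work array WORK is likewise of size $O(\sqrt{n})$.

6 In-Cache FFT

Each column FFT of the multicolumn FFT is computed by an in-cache FFT, for which the Stockham algorithm [5, 13] is used. The Stockham algorithm differs from the Cooley-Tukey algorithm [6]; the radix-2 form of the Cooley-Tukey algorithm is given in [6]. The radix-2 Stockham algorithm is as follows. Let n = 2lm, where l and m are powers of 2, with l initially n/2.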
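The blocked loop nest of Figure 6 can be exercised on a small case to confirm that blocking and padding do not change the result. The sketch below uses N1 = 8, N2 = 16, NB = 4 and NP = 1, all of which are illustrative values rather than values from the text. As before, the twiddle array U(I,J) = $\omega_n^{(I-1)(J-1)}$ and the naive DFT standing in for IN_CACHE_FFT are assumptions, and NB is assumed to divide both N1 and N2.

```fortran
! Blocked six-step FFT of Figure 6, checked against a direct DFT.
! Assumptions: U(I,J) = omega_n**((I-1)*(J-1)); naive DFT replaces IN_CACHE_FFT;
! NB divides N1 and N2; NB = 4 and NP = 1 are illustrative only.
program blocked_six_step_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n1 = 8, n2 = 16, n = n1*n2, nb = 4, np = 1
  real(dp), parameter :: pi = 3.141592653589793238_dp
  complex(dp) :: x(n1,n2), y(n2,n1), u(n1,n2), work(n2+np,nb)
  complex(dp) :: xin(n), ref(n), res(n)
  real(dp) :: err
  integer :: i, j, ii, jj

  do j = 1, n                                   ! arbitrary test input
     xin(j) = cmplx(real(mod(j, 5), dp), real(mod(2*j, 3), dp), kind=dp)
  end do
  ref = xin
  call in_cache_fft(ref, n)                     ! direct n-point DFT as reference
  x = reshape(xin, [n1, n2])
  do j = 1, n2
     do i = 1, n1
        u(i,j) = root(n, (i-1)*(j-1))           ! twiddle factors (assumed form)
     end do
  end do

  do ii = 1, n1, nb
     do jj = 1, n2, nb                          ! partial transpose: NB rows of X to WORK
        do i = ii, ii + nb - 1
           do j = jj, jj + nb - 1
              work(j, i-ii+1) = x(i,j)
           end do
        end do
     end do
     do i = 1, nb                               ! NB individual N2-point FFTs
        call in_cache_fft(work(1:n2,i), n2)
     end do
     do j = 1, n2                               ! twiddle and partial transpose back to X
        do i = ii, ii + nb - 1
           x(i,j) = work(j, i-ii+1)*u(i,j)
        end do
     end do
  end do
  do jj = 1, n2, nb
     do j = jj, jj + nb - 1                     ! NB individual N1-point FFTs
        call in_cache_fft(x(:,j), n1)
     end do
     do i = 1, n1                               ! transpose the finished block into Y
        do j = jj, jj + nb - 1
           y(j,i) = x(i,j)
        end do
     end do
  end do

  res = reshape(y, [n])
  err = maxval(abs(res - ref))
  print '(a, es10.3)', 'max |blocked six-step - direct DFT| = ', err
  if (err > 1.0e-9_dp) error stop 'blocked six-step check failed'

contains

  function root(m, e) result(w)                 ! omega_m**e, omega_m = exp(-2*pi*i/m)
    integer, intent(in) :: m, e
    complex(dp) :: w
    real(dp) :: theta
    theta = -2.0_dp*pi*real(mod(e, m), dp)/real(m, dp)
    w = cmplx(cos(theta), sin(theta), kind=dp)
  end function root

  subroutine in_cache_fft(a, m)                 ! naive in-place m-point DFT
    integer, intent(in) :: m
    complex(dp), intent(inout) :: a(0:m-1)
    complex(dp) :: b(0:m-1)
    integer :: ja, ka
    do ka = 0, m - 1
       b(ka) = (0.0_dp, 0.0_dp)
       do ja = 0, m - 1
          b(ka) = b(ka) + a(ja)*root(m, ja*ka)
       end do
    end do
    a = b
  end subroutine in_cache_fft
end program blocked_six_step_demo
```

The padding rows WORK(N2+1:N2+NP, :) are never read; they only change the leading dimension of WORK so that successive columns do not map to the same cache lines.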

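[Figure 7: blocked six-step FFT. The diagram shows array X (N1 by N2) and array WORK (N2 plus padding NP, by NB) through four stages, with the numbers 1 to 16 marking array elements: 1. Partial transpose; 2. NB individual N2-point FFTs; 3. Partial transpose; 4. NB individual N1-point FFTs.]

At each stage l is halved, while m starts at 1 and is doubled. X is the input array and Y is the output array of a stage; after each stage the roles are exchanged, so that Y becomes the next X. With $\omega_p = e^{-2\pi i/p}$, one stage computes

$$c_0 = X(k + jm)$$
$$c_1 = X(k + jm + lm)$$
$$Y(k + 2jm) = c_0 + c_1$$
$$Y(k + 2jm + m) = \omega_{2l}^{j} \, (c_0 - c_1)$$

for $0 \le j < l$ and $0 \le k < m$.

Compared with the radix-2 FFT, radix-4 and radix-8 FFTs need fewer memory accesses and arithmetic operations [14]. An FFT of length $n = 2^p$ ($p \ge 2$) can be computed as $n = 4^q 8^r$ ($0 \le q \le 2$, $r \ge 0$), that is, with radix-4 and radix-8 FFTs only, so that no radix-2 FFT is needed for $n \ge 4$.

The six-step FFT is built on multicolumn FFTs [3, 16]. In the six-step FFT of Figure 5, the DO loops at lines 2, 7, 10, 15, 20, and 23 are parallelized with the OpenMP [11] directive !$OMP DO.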
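The radix-2 Stockham recurrence for c_0, c_1 and Y given above translates almost line by line into code. The sketch below applies it for n = 16 and checks the result against a direct DFT. For simplicity it copies Y back to X after every stage instead of exchanging the roles of the two arrays; that copy, the test input and the helper name root are assumptions of this example.

```fortran
! Radix-2 Stockham FFT following the recurrence for c0, c1 and Y.
! Simplification: Y is copied back to X after each stage instead of swapping roles.
program stockham_radix2
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: p = 4
  integer, parameter :: n = 2**p
  real(dp), parameter :: pi = 3.141592653589793238_dp
  complex(dp) :: x(0:n-1), y(0:n-1), ref(0:n-1), c0, c1
  real(dp) :: err
  integer :: l, m, j, k, t

  do j = 0, n - 1                               ! arbitrary test input
     x(j) = cmplx(real(j + 1, dp), real(mod(5*j, 3), dp), kind=dp)
  end do

  do k = 0, n - 1                               ! reference: direct DFT of the input
     ref(k) = (0.0_dp, 0.0_dp)
     do j = 0, n - 1
        ref(k) = ref(k) + x(j)*root(n, j*k)
     end do
  end do

  l = n/2
  m = 1
  do t = 1, p                                   ! log2(n) stages, n = 2*l*m throughout
     do j = 0, l - 1
        do k = 0, m - 1
           c0 = x(k + j*m)
           c1 = x(k + j*m + l*m)
           y(k + 2*j*m)     = c0 + c1
           y(k + 2*j*m + m) = root(2*l, j)*(c0 - c1)
        end do
     end do
     x = y                                      ! output of this stage feeds the next
     l = l/2
     m = 2*m
  end do

  err = maxval(abs(x - ref))
  print '(a, es10.3)', 'max |Stockham - direct DFT| = ', err
  if (err > 1.0e-9_dp) error stop 'Stockham check failed'

contains

  function root(q, e) result(w)                 ! omega_q**e, omega_q = exp(-2*pi*i/q)
    integer, intent(in) :: q, e
    complex(dp) :: w
    real(dp) :: theta
    theta = -2.0_dp*pi*real(mod(e, q), dp)/real(q, dp)
    w = cmplx(cos(theta), sin(theta), kind=dp)
  end function root
end program stockham_radix2
```

The output appears in natural order without a separate bit-reversal pass, at the price of the second array Y.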

Table 1: Intel Xeon 5150 2.66 GHz, FFTE 4.0 (SSE3)

        1 CPU, 1 core        1 CPU, 2 cores       2 CPUs, 4 cores
n       Time (s)  MFLOPS     Time (s)  MFLOPS     Time (s)  MFLOPS
2^12    0.00006   4128.46    0.00006   4128.80    0.00006   4141.40
2^13    0.00014   3912.61    0.00014   3900.81    0.00014   3925.46
2^14    0.00028   4030.83    0.00029   4020.14    0.00028   4036.37
2^15    0.00060   4121.60    0.00060   4113.43    0.00060   4106.24
2^16    0.00143   3676.79    0.00141   3713.05    0.00141   3717.98
2^17    0.00500   2228.17    0.00380   2931.55    0.00226   4921.67
2^18    0.01340   1761.12    0.00747   3159.97    0.00472   4995.93
2^19    0.02989   1666.54    0.01678   2968.24    0.01341   3715.39
2^20    0.06675   1570.84    0.03735   2807.18    0.03003   3491.69

In the blocked six-step FFT of Figure 6, the DO loops at lines 3 and 20 are parallelized, with a separate WORK array for each thread. The text also refers to an FFT using MPI.

7 Six-Step FFT Performance

To evaluate the six-step FFT, the FFT library FFTE version 4.0 (footnote 1) was compared with the FFT library FFTW version 3.1.2 (footnote 2) [7]. For $n = 2^m$, m was varied and each FFT was executed 10 times.

The platform was an Intel Xeon 5150 2.66 GHz with 4 GB of DDR2-SDRAM, a 32 KB L1 instruction cache, a 32 KB L1 data cache, and a 4 MB L2 cache, running Linux 2.6.18-1.2798.fc6. The compilers were Intel Fortran version 9.1 and Intel C version 9.1, with the options -O3 -xp -openmp.

Table 1 lists the results for FFTE version 4.0. The text compares FFTE, which uses the six-step FFT, with FFTW version 3.1.2 for the 2 CPUs, 4 cores case as a function of n.

8 Conclusion

Footnotes:
1 http://www.ffte.jp
2 http://www.fftw.org
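The MFLOPS column of Table 1 can be cross-checked against the Time column. The check below assumes the conventional operation count of $5 n \log_2 n$ floating-point operations for an n-point complex FFT and assumes that Time is in seconds; neither assumption is stated in the surviving text, but both reproduce the tabulated values to within the rounding of the times.

```latex
\begin{aligned}
\text{MFLOPS} &\approx \frac{5\,n\log_2 n}{10^{6}\,T},\\
n = 2^{17},\ T = 0.00500\ \text{s}: &\quad \frac{5 \cdot 131072 \cdot 17}{10^{6} \cdot 0.00500} = \frac{11{,}141{,}120}{5000} \approx 2228.2,\\
n = 2^{20},\ T = 0.06675\ \text{s}: &\quad \frac{5 \cdot 1048576 \cdot 20}{10^{6} \cdot 0.06675} = \frac{104{,}857{,}600}{66750} \approx 1570.9.
\end{aligned}
```

Table 1 lists 2228.17 and 1570.84 MFLOPS for these two single-core entries, so the small differences are consistent with the times having been rounded to five decimal places.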


[8] Intel Corporation. IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference, 2003.
[9] Intel Corporation. Intel Math Kernel Library Reference Manual, 2003.
[10] H. J. Nussbaumer. Fast Fourier Transform and Convolution Algorithms. Springer-Verlag, New York, second corrected and updated edition, 1982.
[11] OpenMP. Simple, portable, scalable SMP programming. http://www.openmp.org.
[12] R. C. Singleton. An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. Audio Electroacoust., 17:93-103, 1969.
[13] P. N. Swarztrauber. FFT algorithms for vector computers. Parallel Computing, 1:45-63, 1984.
[14] D. Takahashi. A parallel 1-D FFT algorithm for the Hitachi SR8000. Parallel Computing, 29(6):679-690, 2003.
[15] C. Temperton. Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys., 52:1-23, 1983.
[16] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM Press, Philadelphia, PA, 1992.