THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE.

[Invited] High-Performance Computing Programming for Image Processing Based on Computer Architecture

Norishige FUKUSHIMA
Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya, Aichi 466-8555
E-mail: fukushima@nitech.ac.jp

Abstract  In this report, we review parallelized and vectorized programming for high-performance image processing and its design patterns. Moore's law indicates that the number of transistors on a chip increases exponentially, but computer architecture is also becoming more complex. For high-performance computing, programming that exploits knowledge of the architecture is essential. In addition, memory transfer speed has grown more slowly than computational performance; this fact is also important for image processing programming. Experimental results show that a simple image transformation, convolution, and complex upsampling are all accelerated by architecture-aware programming.
Key words  Image Processing Programming, Design Pattern, High Performance Computing, Parallelization, Vectorization

1. Introduction

Moore's law [1] has kept transistor counts growing exponentially, but since the mid-2000s the gains have gone into parallelism rather than clock frequency: the Pentium 4 (2004) introduced simultaneous multithreading (SMT), the Intel Core 2 (2006) moved to multiple cores, and by 2017 the Core i9 7980XE offered 18 cores with the 512-bit AVX-512 instruction set.
With two fused multiply-add (FMA) units per core, its theoretical peak reaches the TFLOPS range (18 cores x 16-wide SIMD x 2 FMA operations per cycle). To approach such peaks, a program must exploit SMT, the cache hierarchy, and the vector units. Image processing libraries have tracked this evolution: Intel released the Image Processing Library (IPL) in 1997 and OpenCV followed in 1999, both heavily optimized for x86 CPUs, and the competition between Intel and AMD (e.g., Ryzen) continues to push core counts upward.

2. Computer Architecture

2.1 Clock Frequency and Cores

From the Intel 4004 (1971) to the Pentium 4 (2004), clock frequency rose steadily, peaking at 3.8 GHz; as of 2017, frequencies still sit around 4 GHz. Performance gains have therefore come from core counts: the Pentium D (2005) packaged two Pentium 4 dies into one CPU, the Core 2 (2006) offered up to four cores, and the Core i series (2009 onward) has scaled from two up to eighteen cores.

2.2 SIMD Instructions

In Flynn's taxonomy [2], vector execution is classified as SIMD. Intel's SIMD extensions have widened over time: MMX (1997, 64 bits), SSE (1999, 128 bits), AVX (2011, 256 bits), and AVX-512 (2013, 512 bits). Other vendors followed suit, e.g., AMD's 3DNow! preceding SSE and the 128-bit ARM NEON. Early implementations were narrower than their register width: the Pentium III (1999) executed a 128-bit SSE operation as two micro-operations on the FPU, with full-width execution arriving in the Pentium 4 (2000) and Core 2.

Fig. 1  FLOPS and memory bandwidth [GB/s] trends of Intel CPUs (1995-2020). Annotations: core counts (2, 4, 6, 8, 18 cores), AVX, AVX2/FMA, AVX-512, quad-channel memory, and the abolition of the FSB.

As Fig. 1 shows, peak FLOPS jumped with each widening of the vector units: 4-wide SSE in the Core 2 (2006), 8-wide AVX in the 2nd-generation Core i (Sandy Bridge, 2011), FMA in the 4th-generation Core i (Haswell), and 16-wide AVX-512 in the Xeon Phi (2016) and in consumer CPUs (2017). SIMD instruction sets now cover arithmetic (add, sub, mul, div), max/min, approximate reciprocal (rcp) and reciprocal square root (rsqrt), comparison (cmp), dot product (dp), rounding (ceil, floor, round), and popcount. Data-rearrangement support has also grown: interleaved data such as RGB can be reordered in registers, and indexed memory access arrived with gather (AVX2) and scatter (AVX-512).

2.3 Memory

While FLOPS has grown exponentially, memory speed has improved only moderately since the 1990s, so memory I/O is the dominant bottleneck. CPUs hide the gap with a cache hierarchy: a Core i processor has small, fast per-core L1 and L2 caches and a larger shared L3 cache, with latencies of roughly 4 (L1), 12 (L2), and 26 or more (L3) cycles.
2.4 Arithmetic Intensity and the Roofline Model

Whether a kernel is compute-bound or memory-bound is captured by its arithmetic intensity, the ratio of floating-point operations (FLOP) to bytes of memory traffic, or its inverse, the bytes-per-FLOP ratio (B/F). For example, a machine with 128 GFLOPS and 64 GB/s of bandwidth has 64/128 = 0.5 B/F, while the Core i9 7980XE has only 85.3 GB/s against 748.8 GFLOPS, i.e., 0.114 B/F: compute capability has far outgrown memory bandwidth. The roofline model [3] plots attainable performance against arithmetic intensity; below the machine's ridge point, performance is bounded by the memory line (performance = bandwidth x intensity), and above it by the compute ceiling. Tools such as Intel Parallel Studio can measure where a program sits relative to these roofs.

Fig. 2  Roofline model of an Intel CPU: performance [GFLOPS] vs. arithmetic intensity [flop/byte] on log-log axes. Annotations: the theoretical compute ceiling, the bandwidth ceilings of each memory level, and optimization strategies per regime: low-intensity kernels are bound by memory access, so apply memory optimization together with parallelization and vectorization; for medium-intensity kernels, raise arithmetic intensity even at the cost of a slower algorithm; high-intensity kernels can be driven to the theoretical peak by parallelization and vectorization.

3. Design Patterns for High-Performance Image Processing

3.1 Amdahl's Law

Amdahl's law [4] bounds the speedup of a program whose parallelizable fraction is P when run on N processors:

S(N) = 1 / ((1 - P) + P/N)    (1)

In practice, parallelization itself adds an overhead f(N) (thread startup, synchronization), giving

S(N) = 1 / ((1 - P) + P/N + f(N))    (2)

Fig. 3 plots these curves for P = 0.8, 0.9 with f(N) = 0 and f(N) = 0.001N: even a small serial fraction or overhead caps the attainable speedup.

3.2 Parallelization Patterns

Image processing is a natural fit for data parallelism: an image can be partitioned into N independent pieces, by rows, by tiles, or by color channel (R, G, B), and each piece processed on its own core.
Fig. 3  Speedup ratio predicted by Amdahl's law vs. number of cores (up to 100), for P = 0.8 and 0.9 with f(N) = 0 and f(N) = 0.001N.

Structured parallel programming [5] catalogs such patterns for efficient computation. Pointwise and FIR-type operations, whose outputs are independent, parallelize directly; recursive (IIR) filters, whose outputs depend on previous outputs, require restructuring before they can be parallelized.

3.3 Parallel Programming Environments

Thread-level parallelism can be written with Pthreads, OpenMP, Intel Cilk Plus, Intel TBB, the Microsoft Parallel Patterns Library, or the Concurrency Runtime. For image processing, directive-based OpenMP is attractive because a data-parallel loop can often be parallelized with a single pragma.

3.4 Vectorization

Vector code is sensitive to data layout: loads are fastest when data is aligned and contiguous, and element widths of 8, 16, 32, or 64 bits determine how many pixels fit in one register, so memory I/O shapes the design. The most direct way to write SIMD code is with intrinsic functions, which Visual Studio, GCC, and ICC all support. OpenMP 4.0 adds a simd directive for guided auto-vectorization, but Visual Studio implements only OpenMP 2.0, so intrinsics remain the portable choice there. Beyond arithmetic, the key intrinsics rearrange data: set fills a register from scalars, shuffle/permute and blend reorder and merge registers, and gather/scatter load from and store to arbitrary indices. For example, the neighboring pixels at x and (x + 1) needed by a convolution can be produced by shifted loads, by shuffle/permute, by gather, or assembled with set, and the fastest choice depends on the architecture.

3.5 Order of Computation
For a kernel radius r, a direct 2D convolution costs O(r^2) operations per pixel, while a separable implementation costs O(r). Such order-of-computation reductions, together with blocking the working set to fit the L2 cache, often yield larger gains than any amount of low-level tuning.

3.6 Languages and Domain-Specific Languages

Each target has its own programming model: SIMD intrinsics for the CPU, CUDA for the GPU, HDLs for FPGAs, and OpenCL across CPU, GPU, and FPGA. Domain-specific languages (DSLs) [6], [7] raise the abstraction level. Halide [6] is a DSL embedded in C++ that separates what is computed from how it is scheduled (parallelized, vectorized, tiled); it is used inside OpenCV, and an FPGA backend also exists (note 1).

4. Experiments

4.1 Pointwise Operations

We first evaluate pointwise (map-pattern) operations implemented in C++ with AVX intrinsics and OpenMP: the linear transform ax + b, and the exp function, which has no single SIMD instruction and is instead evaluated with a polynomial approximation

f(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n.    (3)

The cheap transform ax + b has low arithmetic intensity, so once the image exceeds the cache capacity (about 8 MB of last-level cache on the test machine) its speedup over plain C++ is bounded by memory I/O; the exp-based kernel is compute-bound, and vectorization and parallelization pay off at every image size (Fig. 4).

4.2 Bilateral Filter

The bilateral filter [8] is an edge-preserving smoothing filter. With a color pixel I = {r, g, b}, the output J is

J_p = (1/N_p) Σ_{q ∈ ω} exp(-|p - q|^2 / (2σ_s^2)) exp(-||I_p - I_q||^2 / (2σ_r^2)) I_q,    (4)

where p and q are pixel positions, ω is the neighborhood of p, N_p is the normalizing factor, and σ_s and σ_r are the space and range smoothing parameters.

Note 1: https://www.halide2fpga.com/
Fig. 4  Speedup ratio vs. image size [pixel] for the pointwise kernels (legend: ax+b-mem, ax+b-c++, exp-mem, exp-c++; -c++ denotes speedup over the plain C++ baseline).

Fig. 5  Speedup ratio of the bilateral filter implementations vs. image size [pixel] (legend: OMP, math, lut3set, lut3gather, lutset, lutgather).

Since the range kernel of Eq. (4) depends only on the color difference, it can be decomposed per channel,

exp(-||I_p - I_q||^2 / (2σ_r^2)) = exp(-(r_p - r_q)^2 / (2σ_r^2)) exp(-(g_p - g_q)^2 / (2σ_r^2)) exp(-(b_p - b_q)^2 / (2σ_r^2)),    (5)

or rewritten in terms of the Euclidean color distance,

exp(-||I_p - I_q||^2 / (2σ_r^2)) = exp(-(sqrt((r_p - r_q)^2 + (g_p - g_q)^2 + (b_p - b_q)^2))^2 / (2σ_r^2)).    (6)

Form (5) admits three small per-channel lookup tables indexed by the absolute channel difference, while form (6) admits a single table indexed by the rounded distance, EXP2[d] = exp(-d^2 / (2σ_r^2)), whose largest index for 8-bit color images is round(sqrt(3 · 255^2)) = 442. Fig. 5 compares, against a plain C++ FIR implementation, an OpenMP-parallelized version (OMP), a vectorized version computing exp directly (math), and the LUT versions with the table accessed either by set or by gather: lut3set/lut3gather for the three-table form (5) and lutset/lutgather for the one-table form (6). The LUT-based versions achieve the largest speedups, reaching several tens of times over plain C++.

4.3 Image Upsampling

As a more complex example, we accelerated directional cubic convolution interpolation [9], which upsamples an image while adapting to edge directions. A plain C++ implementation took 59.2 ms, while the optimized implementations took 0.3 ms, 0.52 ms, and 0.33 ms, competitive with OpenCV's far simpler cubic interpolation.

5. Conclusion

We reviewed parallelized and vectorized programming for high-performance image processing from the viewpoint of computer architecture, focusing on the CPU; applying the same design patterns to GPUs is a direction for future work.

Acknowledgment  This work was supported by JSPS KAKENHI Grant Number JP17H01764.

References
[1] G.E. Moore, Cramming more components onto integrated circuits, Electronics, vol.38, no.8, 1965.
[2] M. Flynn, Some computer organizations and their effectiveness, IEEE Trans. on Computers, vol.C-21, no.9, pp.948-960, 1972.
[3] S. Williams, A. Waterman, and D. Patterson, Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM, vol.52, no.4, pp.65-76, 2009.
[4] G.M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proc. Spring Joint Computer Conference, pp.483-485, AFIPS '67, 1967.
[5] M.D. McCool, A.D. Robison, and J. Reinders, Structured parallel programming: patterns for efficient computation, Elsevier, 2012.
[6] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, ACM SIGPLAN Notices, vol.48, no.6, pp.519-530, 2013.
[7] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, and P. Hanrahan, Darkroom: compiling high-level image processing code into hardware pipelines, ACM Trans. Graph., vol.33, no.4, p.144, 2014.
[8] C. Tomasi and R. Manduchi, Bilateral filtering for gray and color images, Proc. International Conference on Computer Vision, pp.839-846, 1998.
[9] D. Zhou, X. Shen, and W.
Dong, Image zooming using directional cubic convolution interpolation, IET Image Processing, vol.6, no.6, pp.627-634, 2012.