THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE.


[Invited] High-Performance Computing Programming for Image Processing Based on Computer Architecture

Norishige FUKUSHIMA
Nagoya Institute of Technology
466-8555 Gokiso-cho, Showa-ku, Nagoya, Aichi
E-mail: fukushima@nitech.ac.jp

Abstract  In this report, we review parallelized and vectorized programming for high-performance image processing and its design patterns. Moore's law indicates that the number of transistors on a chip increases exponentially, but computer architecture is also becoming more complex. For high-performance computing, programming that exploits knowledge of the architecture is essential. In addition, memory transfer speed has increased more moderately than computational performance, a fact that is also important for image processing programming. Experimental results show that simple image transformation, convolution, and complex upsampling are accelerated by effective programming.

Key words  Image Processing Programming, Design Pattern, High Performance Computing, Parallelization, Vectorization

1. [1] 2004 Pentium 4 SMT Intel Core 2 2006 2017 Core i9 7980XE 18 512 AVX-512

FMA FLOPS 576 18 16 2: SMT L2 CPU OpenCV 1999 Image Processing Library (IPL) 1997 AMD Ryzen Intel CPU CPU CPU Intel x86

2.

2.1 1971 Intel 4004 2004 Pentium 4 3.8 GHz 2017 CPU 4 GHz 2000 Pentium 4 2005 Pentium D CPU Pentium 4 2 2006 Core 2 4 4 CPU 2009 Core i 2 8

2.2 SIMD [2] Intel SIMD 1997 MMX 64 SSE 128 1999 AVX 256 2011 AVX-512 512 2013 Intel AMD 3DNow! SSE ARM NEON SIMD 128 SIMD Pentium III (1999) SSE SSE 2 SIMD SIMD FPU 2 3 Pentium 4 (2000) 2 Core 2

[Fig. 1: Intel CPU FLOPS. Axes: GFLOPS and bandwidth [GB/s] versus year (1995 to 2020). Annotations: core-count markers, AVX, AVX2/FMA, AVX-512, quad channel, abolition of the FSB.]

2006 4 SSE Core i 2 2011, Sandy Bridge AVX 8 FMA Core i 4 Haswell 2014 2016 Xeon Phi 16 AVX-512 2017 CPU FMA FLOPS add, sub, mul, div max/min rcp rsqrt cmp dp ceil, floor, round popcount SIMD RGB gather AVX2 2014 scatter AVX-512 2017

2.3 Intel CPU FLOPS 1990 I/O Core i L1 4 L2 12 L3 26-31
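The surviving terms of Section 2.2 (core counts, SIMD register widths, FMA) are the usual factors of a theoretical peak FLOPS figure. The sketch below shows that arithmetic under a stated assumption: peak = cores x clock x FLOP per cycle per core, where FLOP per cycle is the number of SIMD lanes, doubled when FMA is available, times the number of SIMD units. The function names and the decomposition are illustrative and are not taken from the report.

```cpp
// Hypothetical helpers for a back-of-the-envelope peak FLOPS estimate.

// FLOP issued per cycle by one core.
// simdBits:    SIMD register width in bits (e.g. 128 for SSE, 256 for AVX).
// elementBits: element size in bits (32 for float, 64 for double).
// hasFma:      a fused multiply-add counts as two floating-point operations.
// simdUnits:   number of SIMD execution units per core (an assumption you
//              must look up for the specific CPU).
double flopPerCyclePerCore(int simdBits, int elementBits, bool hasFma, int simdUnits) {
    const int lanes = simdBits / elementBits;
    return static_cast<double>(lanes) * (hasFma ? 2 : 1) * simdUnits;
}

// Theoretical peak in GFLOPS for the whole chip.
double peakGflops(int cores, double clockGhz, double flopPerCycle) {
    return cores * clockGhz * flopPerCycle;
}
```

For example, a hypothetical 4-core, 3.0 GHz CPU with 256-bit SIMD, FMA, and two SIMD units per core gives `peakGflops(4, 3.0, flopPerCyclePerCore(256, 32, true, 2))`, that is 4 x 3.0 x 32 GFLOPS in single precision. Sustained clocks under wide SIMD load are often lower than the nominal clock, so this is an upper bound.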

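2.4 FLOPS FLOP 4 2 2 I/O FLOPS F/B B/F B/F 2 128FLOPS/64GB/s=0.5 B/F Core i9 7980XE 748.8GFLOPS/85.3GB/s=0.114 B/F [3] 2 Core i9 7980XE F/B y = x I/O I/O Intel Parallel Studio CPU

[Fig. 2: Roofline model of an Intel CPU. Axes: performance [GFLOPS] versus arithmetic intensity [flop/byte]. Labels: theoretical arithmetic intensity of the machine; theoretical upper limit of compute performance (493.3 GFLOPS); upper limit of each memory bandwidth. Low-arithmetic-intensity processing (bound by memory access): memory optimization, parallelization, vectorization; optimize the program so that it approaches the ceiling. Medium-arithmetic-intensity processing: memory optimization, parallelization, vectorization; raise the arithmetic intensity even if that means switching to an algorithm with lower performance. High-arithmetic-intensity processing: parallelization, vectorization; accelerate up to the theoretical value.]

3.

3.1 Amdahl's law [4]: with N cores and a parallelizable fraction P, the speed-up is

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \tag{1}$$

With a parallelization overhead f(N):

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N} + f(N)} \tag{2}$$

3 P = 0.8, 0.9 f(N) = 0, 0.0N

3.2 R G B N N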

[Fig. 3: Speed-up ratio versus the number of cores (0 to 100). Curves: p=0.9, f(N)=0; p=0.9, f(N)=0.00n; p=0.8, f(N)=0; p=0.8, f(N)=0.00n.]

[5] 2 3 4 IIR 3 3

3.3 Pthreads OpenMP Intel Cilk Plus Intel TBB Microsoft Parallel Patterns Library Concurrency

3.4 load 16, 32, 64 I/O SIMD Intrinsic function Visual Studio GCC ICC OpenMP OpenMP 4.0 SIMD SIMD OpenMP Visual Studio OpenMP 2.0 set shuffle, permute blend gather/scatter gather scatter shuffle, permute, blend gather/scatter x, (x + 1) 3 set

3.5
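Sections 3.3 and 3.4 name OpenMP for thread-level parallelism and the OpenMP 4.0 SIMD directive for vectorization. As an illustration of how the two combine on a pixel-wise loop, here is a minimal sketch; the kernel (a linear transform a * x + b) and the function name are chosen for the example and are not the report's code.

```cpp
#include <vector>

// Pixel-wise linear transform: dst[i] = a * src[i] + b.
// "parallel for simd" (OpenMP 4.0 or later) asks the compiler to split the
// loop across threads and to vectorize each thread's chunk. A compiler built
// without OpenMP support ignores the pragma and runs the same loop serially,
// so the result does not depend on it.
void linearTransform(const std::vector<float>& src, std::vector<float>& dst,
                     float a, float b) {
    dst.resize(src.size());
    const float* in = src.data();
    float* out = dst.data();
    const long long n = static_cast<long long>(src.size());
    #pragma omp parallel for simd
    for (long long i = 0; i < n; ++i) {
        out[i] = a * in[i] + b;
    }
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`. On a compiler limited to OpenMP 2.0, which the text associates with Visual Studio, the `simd` clause is not available; there one would keep `#pragma omp parallel for` and vectorize the loop body with intrinsic functions or rely on auto-vectorization.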

r O(r^2) O(r) L2

3.6 CPU SIMD GPU CUDA FPGA HDL OpenCL CPU GPU FPGA CPU (Domain Specific Language: DSL) [6], [7] Halide DSL C++ Halide OpenCV FPGA

4.

4.1 ax + b

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \tag{3}$$

0 8 0 4 C++ AVX OpenMP exp C++ C++ exp C++ 2 exp I/O 8MB L1 L1 L2 map

4.2 [8] I = {r, g, b} J

$$J_p = \frac{1}{N} \sum_{q \in \omega} \exp\left(-\frac{\|p - q\|^2}{2\sigma_s}\right) \exp\left(-\frac{\|I_p - I_q\|^2}{2\sigma_r}\right) I_q \tag{4}$$

N p, q ω σ_s, σ_r

https://www.halide2fpga.com/
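Equation (4) of Section 4.2 maps directly onto four nested loops, which is a useful unoptimized baseline before any parallelization, vectorization, or lookup tables are applied. The sketch below makes several simplifying assumptions that are not from the report: a single-channel float image in row-major order instead of {r, g, b}, a square window of radius r for ω, and border pixels handled by clamping coordinates. It divides by 2σ as Eq. (4) is written; pass squared values if the more common 2σ² form is wanted.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Naive grayscale bilateral filter in the shape of Eq. (4).
// src is a width x height image stored row by row.
std::vector<float> bilateralFilterNaive(const std::vector<float>& src,
                                        int width, int height, int r,
                                        float sigmaS, float sigmaR) {
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float center = src[y * width + x];  // I_p
            float weightSum = 0.0f;                   // N in Eq. (4)
            float valueSum = 0.0f;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    // Clamp the neighbor position q to the image (assumption).
                    const int qx = std::min(std::max(x + dx, 0), width - 1);
                    const int qy = std::min(std::max(y + dy, 0), height - 1);
                    const float neighbor = src[qy * width + qx];  // I_q
                    const float dist2 = static_cast<float>(dx * dx + dy * dy);
                    const float diff = neighbor - center;
                    // Spatial weight times range weight.
                    const float w = std::exp(-dist2 / (2.0f * sigmaS)) *
                                    std::exp(-(diff * diff) / (2.0f * sigmaR));
                    weightSum += w;
                    valueSum += w * neighbor;
                }
            }
            // The center pixel always contributes weight 1, so weightSum > 0.
            dst[y * width + x] = valueSum / weightSum;
        }
    }
    return dst;
}
```

The cost per pixel is O(r^2) with two `exp` calls per neighbor. The spatial weight depends only on (dx, dy), so it can be precomputed once per window, and for 8-bit images the range weight depends only on a small set of intensity differences, which is what makes table lookup attractive.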

[Fig. 4: Speed-up ratio over C++ versus image size [pixel]. Curves: ax+b-mem, ax+b-c++, exp-mem, exp-c++.]

[Fig. 5: Speed-up ratio versus image size [pixel]. Curves: OMP, math, lut3set, lut3gather, lutset, lutgather.]

N

$$\exp\left(-\frac{\|I_p - I_q\|^2}{2\sigma_r}\right) = \exp\left(-\frac{(r_p - r_q)^2}{2\sigma_r}\right) \exp\left(-\frac{(g_p - g_q)^2}{2\sigma_r}\right) \exp\left(-\frac{(b_p - b_q)^2}{2\sigma_r}\right) \tag{5}$$

$$\exp\left(-\frac{\|I_p - I_q\|^2}{2\sigma_r}\right) = \exp\left(-\frac{\left(\sqrt{(r_p - r_q)^2 + (g_p - g_q)^2 + (b_p - b_q)^2}\right)^2}{2\sigma_r}\right) \tag{6}$$

$$\mathrm{EXP2}[d] = \exp\left(-\frac{d^2}{2\sigma_r}\right), \qquad \mathrm{round}\left(\sqrt{3 \cdot 255^2}\right) = 442$$

5 C++ FIR set gather LUT 3

4.3 [9] 2 2 C++ 59.2 ms 0.3 ms 0.52 ms 0.33 ms OpenCV Cubic

5. CPU GPU

JP7H0764

References

[1] G.E. Moore, "Cramming more components onto integrated circuits," Electronics Magazine, vol.38, no.8, 1965.
[2] M. Flynn, "Some computer organizations and their effectiveness," IEEE Trans. on Computers, vol.C-21, no.9, pp.948-960, 1972.
[3] S. Williams, A. Waterman, and D. Patterson, "Roofline: an insightful visual performance model for multicore architectures," Communications of the ACM, vol.52, no.4, pp.65-76, 2009.
[4] G.M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," Proc. Spring Joint Computer Conference, pp.483-485, AFIPS '67, 1967.
[5] M.D. McCool, A.D. Robison, and J. Reinders, Structured parallel programming: patterns for efficient computation, Elsevier, 2012.
[6] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol.48, no.6, pp.519-530, 2013.
[7] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, and P. Hanrahan, "Darkroom: compiling high-level image processing code into hardware pipelines," ACM Trans. Graph., vol.33, no.4, p.144, 2014.
[8] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," Proc. International Conference on Computer Vision, pp.839-846, 1998.
[9] D. Zhou, X. Shen, and W. Dong, "Image zooming using directional cubic convolution interpolation," IET Image Processing, vol.6, no.6, pp.627-634, 2012.