RaVioli SIMD


RaVioli SIMD 17 17115074

i RaVioli SIMD PC PC PC PC CPU RaVioli RaVioli CPU RaVioli CPU SIMD RaVioli RaVioli SIMD RaVioli SIMD

Contents

1
2 RaVioli
  2.1 RaVioli
    2.1.1
    2.1.2
  2.2
  2.3 RaVioli
3 SIMD RaVioli
  3.1 SIMD
  3.2 RaVioli SIMD
  3.3 SIMD
4 SIMD
  4.1
  4.2
  4.3
  4.4 SIMD
5
6
A.1
A.2

1 1 PC PC PC Linux OS PC 1/30 1/60 CPU OS PC CPU CPU VIGRA[1] OpenCV[2] VIGRA OpenCV

2 RaVioli[3] RaVioli CPU CPU 7 3 RaVioli VIGRA OpenCV RaVioli RaVioli RaVioli SIMD SIMD 2 RaVioli 3 SIMD RaVioli 4 SIMD 5 6 2 RaVioli RaVioli 2.1 RaVioli RaVioli

3 1: RaVioli 2 RaVioli procpix 1

4 2.1.1 PC CPU RaVioli CPU RaVioli 2.1.2 RaVioli RV Pixel RV Image RV Pixel RV Pixel RGB HSV 0 100% RV Image 2.2 1 0 2

    for(int j=0;j<height;j++){
        for(int i=0;i<width;i++){
            new_image[i][j]=Binalization(image[i][j]);
        }
    }

Figure 2: Binalization() 1 0 x y i j Binalization() width height 2 RaVioli RV_Image 2 RaVioli 3 3 4 main image RV_Image RV_Image RaVioli grain RaVioli
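The double loop of Figure 2 can be written out as a small self-contained program. This is a minimal sketch and not RaVioli code: the `GrayImage` struct, the `binarize_pixel` helper, the 0/255 output values and the threshold of 128 are all assumptions introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical grayscale image: row-major 8-bit pixels (not a RaVioli type).
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // size must equal width * height
};

// Maps one pixel to 0 or 255. The default threshold of 128 is an assumed value.
inline std::uint8_t binarize_pixel(std::uint8_t value, std::uint8_t threshold = 128) {
    return static_cast<std::uint8_t>(value >= threshold ? 255 : 0);
}

// Visits every pixel with an explicit double loop, as in Figure 2.
// The caller has to know the image size and the memory layout.
inline GrayImage binarize(const GrayImage& src, std::uint8_t threshold = 128) {
    assert(src.width >= 0 && src.height >= 0);
    assert(src.pixels.size() ==
           static_cast<std::size_t>(src.width) * static_cast<std::size_t>(src.height));
    GrayImage dst{src.width, src.height, std::vector<std::uint8_t>(src.pixels.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            const std::size_t idx = static_cast<std::size_t>(y) *
                                    static_cast<std::size_t>(src.width) +
                                    static_cast<std::size_t>(x);
            dst.pixels[idx] = binarize_pixel(src.pixels[idx], threshold);
        }
    }
    return dst;
}
```

The point of the sketch is that `width`, `height` and the index arithmetic all appear in user code, which is what the RaVioli style in the next figure hides.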

    void Binalization(Pixel* p){
        /* p */
    }

    void main(argc,argv[]){
        RV_Image image;
        /* */
        new_image=image.proc(Binalization);
        /* */
    }

Figure 3: RaVioli ( )

RaVioli

    RV_Image* RV_Image::proc(RV_Pixel (* UserProgram)(RV_Pixel)){
        RV_Image* tmpimage;
        for(int ny=0;ny<height;ny+=grain){
            for(int nx=0;nx<width;nx+=grain){
                tmpimage->pixel[ny*width+nx] = UserProgram(*_getPixel(nx,ny));
            }
        }
        return(tmpimage);
    }

Figure 4: RaVioli (RaVioli ) RaVioli 3 (0,0) 100.0
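The higher-order-function pattern of Figures 3 and 4 can be sketched in standalone C++ as follows. The `Pixel` and `Image` types are hypothetical stand-ins for `RV_Pixel` and `RV_Image`, and the averaging rule with a threshold of 128 in `binarize` is an assumption; only the idea of passing a per-pixel function to `proc` and stepping by `grain` is taken from the figures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for RV_Pixel: one RGB pixel.
struct Pixel {
    int r, g, b;
};

// Hypothetical stand-in for RV_Image: it owns the pixels and the iteration.
class Image {
public:
    Image(int width, int height, int grain = 1)
        : width_(width), height_(height), grain_(grain),
          pixels_(static_cast<std::size_t>(width) * static_cast<std::size_t>(height),
                  Pixel{0, 0, 0}) {
        assert(width > 0 && height > 0 && grain > 0);
    }

    Pixel& at(int x, int y) { return pixels_[index(x, y)]; }
    const Pixel& at(int x, int y) const { return pixels_[index(x, y)]; }

    // Applies user_program to every grain-th pixel and returns a new image.
    // Pixels skipped because grain > 1 stay at their initial value (0, 0, 0).
    Image proc(Pixel (*user_program)(const Pixel&)) const {
        Image result(width_, height_, grain_);
        for (int ny = 0; ny < height_; ny += grain_) {
            for (int nx = 0; nx < width_; nx += grain_) {
                result.at(nx, ny) = user_program(at(nx, ny));
            }
        }
        return result;
    }

private:
    std::size_t index(int x, int y) const {
        assert(x >= 0 && x < width_ && y >= 0 && y < height_);
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width_) +
               static_cast<std::size_t>(x);
    }

    int width_;
    int height_;
    int grain_;
    std::vector<Pixel> pixels_;
};

// User-side code: only the per-pixel rule, with no loops or sizes.
inline Pixel binarize(const Pixel& p) {
    const int v = ((p.r + p.g + p.b) / 3 >= 128) ? 255 : 0;
    return Pixel{v, v, v};
}
```

A call such as `img.proc(binarize)` then plays the role of `image.proc(Binalization)` in Figure 3: the user function never sees the image dimensions or the sampling step.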

    RaVioli (s)     2.520    0.040
    RaVioli (s)    11.032    0.156
    / ( )            4.38      3.9

Table 1: RaVioli

2.3 RaVioli

RaVioli 2.2 RaVioli 1 4.4 3.9 RaVioli SIMD SIMD

3 SIMD RaVioli

3.1 SIMD

SIMD (Single Instruction Multiple Data) SIMD 5 DSP( ) Intel Pentium III CPU PC CPU

8 5: SIMD SIMD PowerPC Pentium Cell SPE CPU Geforce RADEON GPU PC PC CPU SIMD PC CPU Intel Pentium AMD Athlon CPU SIMD SIMD Pentium SSE Athlon 3DNow! SIMD CPU Intel Pentium CPU SIMD SSE Intel CPU [4] SIMD [5] 3.2 RaVioli SIMD RaVioli SIMD RaVioli SIMD SIMD

9 SIMD SIMD RaVioli 2 RaVioli SIMD RaVioli SIMD SIMD RaVioli SIMD SIMD SIMD 2.1 RaVioli RaVioli RaVioli SIMD 3.3 SIMD RaVioli SIMD RaVioli SIMD SIMD 6 6 RGB input image input tp RGB allsum

    int min=2147483646;
    int allsum=0;
    for(j=0;j<input_image.height-input_tp.height;j++){
        for(i=0;i<input_image.width-input_tp.width;i++){
            for(jj=0;jj<input_tp.height;jj++){
                for(ii=0;ii<input_tp.width;ii+=16){
                    //SIMD
                    asm volatile (
                        "movdqu (%1),%%xmm0\n\t"
                        "movdqu (%2),%%xmm1\n\t"
                        "movdqu (%3),%%xmm2\n\t"
                        "movdqu (%4),%%xmm3\n\t"
                        "movdqu (%5),%%xmm4\n\t"
                        "movdqu (%6),%%xmm5\n\t"
                        "psadbw %%xmm1,%%xmm0\n\t"
                        "psadbw %%xmm3,%%xmm2\n\t"
                        "psadbw %%xmm5,%%xmm4\n\t"
                        "paddd %%xmm4,%%xmm2\n\t"
                        "paddd %%xmm2,%%xmm0\n\t"
                        "movdqu %%xmm0,%0\n\t"
                        "emms"
                        : "=g" (sum)
                        : "r" (&input_image.r[(j+jj)*input_image.width+i]),
                          "r" (&input_tp.r[(jj)*input_tp.width]),
                          "r" (&input_image.g[(j+jj)*input_image.width+i]),
                          "r" (&input_tp.g[(jj)*input_tp.width]),
                          "r" (&input_image.b[(j+jj)*input_image.width+i]),
                          "r" (&input_tp.b[(jj)*input_tp.width]));
                    //SIMD
                    allsum+=sum[0]+sum[2];
                }
            }
            if(min > allsum) {
                min=allsum;
                mini=i;
                minj=j;
            }
            allsum=0;
        }
    }

Figure 6: SIMD
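For comparison with the inline-assembly loop of Figure 6, the same search can be written as a scalar reference. This is a sketch under assumptions: `PlanarImage`, `make_planar`, `Match` and `find_template` are hypothetical names, the planar `r`/`g`/`b` byte layout mirrors the `input_image.r` style of the figure, and unlike the figure this version also tests the last valid offset in each direction.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical planar RGB image: one byte array per channel, row-major.
struct PlanarImage {
    int width;
    int height;
    std::vector<std::uint8_t> r, g, b;
};

// Builds an image whose three channels are filled with one value.
inline PlanarImage make_planar(int width, int height, std::uint8_t value) {
    const std::size_t n = static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    return PlanarImage{width, height,
                       std::vector<std::uint8_t>(n, value),
                       std::vector<std::uint8_t>(n, value),
                       std::vector<std::uint8_t>(n, value)};
}

// Best offset found and its sum of absolute differences (SAD).
struct Match {
    int x;
    int y;
    long sad;
};

// Slides the template over the image and keeps the offset with the smallest SAD.
inline Match find_template(const PlanarImage& image, const PlanarImage& tp) {
    assert(tp.width > 0 && tp.height > 0);
    assert(tp.width <= image.width && tp.height <= image.height);
    Match best{0, 0, LONG_MAX};
    for (int j = 0; j + tp.height <= image.height; ++j) {
        for (int i = 0; i + tp.width <= image.width; ++i) {
            long sum = 0;
            for (int jj = 0; jj < tp.height; ++jj) {
                for (int ii = 0; ii < tp.width; ++ii) {
                    const std::size_t a = static_cast<std::size_t>(j + jj) *
                                          static_cast<std::size_t>(image.width) +
                                          static_cast<std::size_t>(i + ii);
                    const std::size_t b = static_cast<std::size_t>(jj) *
                                          static_cast<std::size_t>(tp.width) +
                                          static_cast<std::size_t>(ii);
                    sum += std::abs(image.r[a] - tp.r[b]) +
                           std::abs(image.g[a] - tp.g[b]) +
                           std::abs(image.b[a] - tp.b[b]);
                }
            }
            if (sum < best.sad) {
                best = Match{i, j, sum};
            }
        }
    }
    return best;
}
```

A scalar version like this is useful mainly as a correctness check for a SIMD version: both should report the same offset and the same minimum SAD.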

11 RGB SIMD RaVioli 6 SIMD RaVioli 6 SIMD A.1,A.2 4 SIMD C++ SIMD SIMD SIMD SIMD SIMD SIMD mmintrin.h SIMD

    gcc (GNU)                            add %xmm0,%xmm1
    Intel (Microsoft Macro Assembler)    add xmm1,xmm0

Figure 7: SIMD SIMD C++ C++ asm asm asm CPU C++ C++ SIMD 7 7

13 2: 7 xmm0 xmm1 gcc x86 CPU gcc [6] 4 SIMD SIMD 4.1 SIMD 1. rv image.cpp 2. 3. 4. SIMD 5. 2 4 2

14 8: ( ) 4.2 RaVioli rv image.cpp 8 9 8

15 9: ( ) 8 image->procimagcomp(sad,input_tp); = = = procimgcomp void SAD input tp input tp RV image* rv image.cpp

16 10: ( ) 9 RaVioli 4.3 4.2 10

17 11: ( ) UserProgram User- Program UserProgram 11 11 12

    int sum=0; /* */

    void counttp(rv_doppelimage* image,RV_Coord Cstart,RV_Coord Cend){
        image->procImgComp(SAD,input_tp);
        if(min > sum) {
            min=sum;
            tmps=Cstart;
            tmpe=Cend;
        }
        sum=0;
    }

    void SAD(RV_Pixel* p1,RV_Pixel* p2){
        int r1,g1,b1,r2,g2,b2;
        p1->getrgb(r1,g1,b1);
        p2->getrgb(r2,g2,b2);
        sum+=abs(r1-r2)+abs(g1-g2)+abs(b1-b2);
    }

Figure 12: 11 sum sum int void int return( );
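Figure 12 accumulates the difference in a global `sum`, and the surviving text mentions changing a `void` to an `int` with a `return`. One way to read that is a comparison routine that returns the accumulated value instead of writing to a global. The sketch below shows that reading only; `Pixel`, `sad` and `proc_img_comp` are hypothetical names, and the flat `std::vector` image is an assumption rather than RaVioli's representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for RV_Pixel.
struct Pixel {
    int r, g, b;
};

// Per-pixel-pair rule: sum of absolute channel differences.
inline int sad(const Pixel& p1, const Pixel& p2) {
    return std::abs(p1.r - p2.r) + std::abs(p1.g - p2.g) + std::abs(p1.b - p2.b);
}

// Applies user_program to corresponding pixels of two equally sized images
// and returns the accumulated result, so no global counter is needed.
inline int proc_img_comp(const std::vector<Pixel>& image,
                         const std::vector<Pixel>& cmpimg,
                         int (*user_program)(const Pixel&, const Pixel&)) {
    assert(image.size() == cmpimg.size());
    int sum = 0;
    for (std::size_t k = 0; k < image.size(); ++k) {
        sum += user_program(image[k], cmpimg[k]);
    }
    return sum;
}
```

With a return value, the caller can compare the result against its running minimum directly, which removes the need to reset a shared `sum` after each comparison.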

Figure 13: ( )

    int RV_Image::procImgComp(void (* UserProgram)(RV_Pixel*, RV_Pixel*),RV_Image* cmpimg){
        /* */
        return(sum);
    }

4.4 SIMD

4.3 SIMD RaVioli SIMD RaVioli 13 SIMD SIMD SIMD RGB

    out = 0.299 r + 0.587 g + 0.114 b    (1)

SIMD SIMD 1

    for(ny=0;ny<bheight;ny+=grain){
        for(nx=0;nx<bwidth;nx+=grain){
            byte r1,g1,b1,r2,g2,b2;
            p1 = _getPixel(nx,ny);
            p2 = cmpimg->_getPixel(nx,ny);
            p1->getrgb(r1,g1,b1);
            p2->getrgb(r2,g2,b2);
            asm volatile (
                /* SIMD */
                : "=g" (sum)
                : "r" (&r1), "r" (&r2),
                  "r" (&g1), "r" (&g2),
                  "r" (&b1), "r" (&b2));
        }
    }

Figure 14: (1) r,g,b 8bit byte 0.299 r 0.299 float (32bit) r byte (8bit) float (32bit) float (1) 0.587 g 0.114 b float 32bit SIMD SIMD 14 SIMD 14 SIMD 8bit byte 128bit SIMD 16 SIMD 16 16 SIMD
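The passage contrasts Equation (1), whose weights are 32-bit floats, with byte-wise processing in a 128-bit register. A short worked sketch of that arithmetic follows. The constant names and the `to_gray` helper are assumptions for illustration, and the rounding to the nearest integer is a choice made here, not something the source states.

```cpp
#include <cassert>
#include <cstdint>

// Lane counts for a 128-bit SIMD register.
constexpr int kRegisterBits = 128;
constexpr int kByteLanes = kRegisterBits / 8;    // 8-bit values per register
constexpr int kFloatLanes = kRegisterBits / 32;  // 32-bit floats per register

// Scalar form of Equation (1): out = 0.299 r + 0.587 g + 0.114 b.
// The channels are widened from 8-bit bytes to float before multiplying.
inline std::uint8_t to_gray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const float out = 0.299f * static_cast<float>(r) +
                      0.587f * static_cast<float>(g) +
                      0.114f * static_cast<float>(b);
    const int rounded = static_cast<int>(out + 0.5f);  // round to nearest
    assert(rounded >= 0);
    return static_cast<std::uint8_t>(rounded > 255 ? 255 : rounded);
}
```

Because each operand of Equation (1) occupies 32 bits once widened, one register holds 4 of them, whereas a computation that stays in 8-bit bytes, such as a sum of absolute differences, can process 16 values per instruction.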

    for(ny=0;ny<bheight;ny+=4*grain){
        for(nx=0;nx<bwidth;nx+=4*grain){
            byte r1[16],g1[16],b1[16],r2[16],g2[16],b2[16];
            for(int i=0;i<16;i++){
                p1 = _getPixel(nx+(i%4),ny+(i/4));
                p2 = cmpimg->_getPixel((nx+(i%4)),((ny+i/4)));
                p1->getrgb(r1[i],g1[i],b1[i]);
                p2->getrgb(r2[i],g2[i],b2[i]);
            }
            asm volatile (
                /* SIMD */
                : "=g" (sum)
                : "r" (&r1), "r" (&r2),
                  "r" (&g1), "r" (&g2),
                  "r" (&b1), "r" (&b2));
        }
    }

Figure 15: 14 SIMD 16 16 14 16 4 14 15 5 SIMD RaVioli SIMD 3 16 1
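The gather-then-SIMD structure of Figure 15 can also be expressed with compiler intrinsics instead of inline assembly. The sketch below swaps in the SSE2 intrinsics from `<emmintrin.h>` (`_mm_loadu_si128`, `_mm_sad_epu8`, `_mm_cvtsi128_si32`, `_mm_extract_epi16`) for the `movdqu`/`psadbw` sequence and falls back to a scalar loop when SSE2 is not available. `Rgb`, `Block16`, `gather_block`, `sad16`, `sad16_scalar` and `block_sad` are hypothetical names, and the flat pixel vector is an assumption rather than RaVioli's layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical interleaved RGB pixel.
struct Rgb {
    std::uint8_t r, g, b;
};

// The three channels of a 4x4 block, each as 16 contiguous bytes.
struct Block16 {
    std::uint8_t r[16];
    std::uint8_t g[16];
    std::uint8_t b[16];
};

// Copies the 4x4 block with top-left corner (nx, ny) into per-channel arrays,
// using the same i%4 / i/4 addressing as Figure 15.
inline Block16 gather_block(const std::vector<Rgb>& pixels, int width, int height,
                            int nx, int ny) {
    assert(nx >= 0 && ny >= 0 && nx + 4 <= width && ny + 4 <= height);
    assert(pixels.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    Block16 block{};
    for (int i = 0; i < 16; ++i) {
        const std::size_t idx = static_cast<std::size_t>(ny + i / 4) *
                                static_cast<std::size_t>(width) +
                                static_cast<std::size_t>(nx + i % 4);
        block.r[i] = pixels[idx].r;
        block.g[i] = pixels[idx].g;
        block.b[i] = pixels[idx].b;
    }
    return block;
}

// Scalar reference: sum of absolute differences over 16 bytes.
inline int sad16_scalar(const std::uint8_t* a, const std::uint8_t* b) {
    int sum = 0;
    for (int i = 0; i < 16; ++i) {
        sum += std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    }
    return sum;
}

// SAD over 16 bytes; uses psadbw through intrinsics when SSE2 is available.
inline int sad16(const std::uint8_t* a, const std::uint8_t* b) {
    assert(a != nullptr && b != nullptr);
#ifdef SKETCH_HAVE_SSE2
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // psadbw leaves one partial sum in the low 16 bits of each 64-bit half.
    const __m128i sad = _mm_sad_epu8(va, vb);
    return _mm_cvtsi128_si32(sad) + _mm_extract_epi16(sad, 4);
#else
    return sad16_scalar(a, b);
#endif
}

// SAD of two gathered blocks over all three channels.
inline int block_sad(const Block16& x, const Block16& y) {
    return sad16(x.r, y.r) + sad16(x.g, y.g) + sad16(x.b, y.b);
}
```

Adding the two halves of the `psadbw` result corresponds to the `sum[0]+sum[2]` step in the thesis code, where the 128-bit result is viewed as four 32-bit integers.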

    CPU         Opteron 2.0GHz
    Memory      2GB
    Compiler    GNU C++ version 4.1.2

Table 3:

Figure 16: 16 22% 9% SIMD 16 8 16 8 2 34% SIMD 8 8 getrgb getr RV_Pixel

    (s)    0.028     7.378    0.024
    (s)    0.083     3.737    0.128
    (s)    0.111    11.115    0.152

Table 4: SIMD SIMD C++ SIMD 4 SIMD 6 RaVioli SIMD RaVioli SIMD RaVioli SIMD RaVioli if SIMD

SIMD 2

[1] Köthe, U.: VIGRA - Vision with Generic Algorithms, 1.6.0 edition (2008).
[2] Bradski, G. and Kaehler, A.: Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly & Associates Inc (2008).
[3] ,,, : RaVioli, CVIM, Vol. 1, No. 4 (2009).
[4] Intel Corp.: IA-32, http://www.intel.co.jp/jp/download/index.htm.
[5] : IA-32 SIMD, http://www.icnet.ne.jp/ nsystem/simd tobira/index.html.
[6] SAITOH, A.: GCC for x86, http://www.mars.sannet.ne.jp/sci10/on gcc asm.html.

A.1

    void SAD(RV_Pixel* p1,RV_Pixel* p2){
        byte r1,g1,b1,r2,g2,b2;
        p1->getrgb(r1,g1,b1);
        p2->getrgb(r2,g2,b2);
        sum+=abs(r1-r2)+abs(g1-g2)+abs(b1-b2);
    }

RaVioli

    void RV_Image::procImgComp(void (* UserProgram)(RV_Pixel*, RV_Pixel*),RV_Image* cmpimg){
        int nx,ny;
        _InputCheck();
        cmpgrain=cmpimg->getgrain();
        for(ny=0;ny<bheight;ny+=grain){
            for(nx=0;nx<bwidth;nx+=grain){
                UserProgram(_getPixel(nx,ny), cmpimg->_getPixel(nx,ny));
            }
        }
    }

A.2

    int RV_Image::SIMD_procImgComp(RV_Image* cmpimg){
        int nx,ny;
        int sum[4];
        int allsum;
        int i;
        byte r1[16],g1[16],b1[16],r2[16],g2[16],b2[16];
        RV_Pixel* p1;
        RV_Pixel* p2;
        _InputCheck();
        asm volatile ("pslldq $255,%xmm3"); //0
        for(ny=0;ny<bheight;ny+=4*grain){
            for(nx=0;nx<bwidth;nx+=4*grain){
                for(i=0;i<16;i++){
                    p1 = _getPixel(nx+(i%4),ny+(i/4));
                    p2 = cmpimg->_getPixel((nx+(i%4)),((ny+i/4)));
                    p1->getrgb(r1[i],g1[i],b1[i]);
                    p2->getrgb(r2[i],g2[i],b2[i]);
                }
                asm volatile (
                    "movdqu (%1),%%xmm0\n\t"
                    "movdqu (%2),%%xmm1\n\t"
                    "psadbw %%xmm1,%%xmm0\n\t"
                    "movdqu (%3),%%xmm1\n\t"
                    "movdqu (%4),%%xmm2\n\t"
                    "psadbw %%xmm2,%%xmm1\n\t"
                    "paddw %%xmm1,%%xmm0\n\t"
                    "movdqu (%5),%%xmm1\n\t"
                    "movdqu (%6),%%xmm2\n\t"
                    "psadbw %%xmm2,%%xmm1\n\t"
                    "paddw %%xmm1,%%xmm0\n\t"
                    "paddd %%xmm0,%%xmm3"
                    : "=g" (sum)
                    : "r" (&r1), "r" (&r2),
                      "r" (&g1), "r" (&g2),
                      "r" (&b1), "r" (&b2));
            }
        }
        asm volatile (
            "movdqu %%xmm3,%0\n\t"
            "emms"
            : "=g" (sum));
        allsum=sum[0]+sum[2];
        return(allsum);
    }