ストリーミング SIMD 拡張命令2 (SSE2) を使用した、倍精度浮動小数点ベクトルの最大/最小要素とそのインデックスの検出

Similar documents
ストリーミング SIMD 拡張命令2 (SSE2) を使用した SAXPY/DAXPY

Source: Intel.Config: Pentium III Processor-Intel Seattle SE440BX-2, 128MB PC100 CL2 SDRAM Intel 440BX-2 Chipset Platform- Diamond Viper 550 /

mate10„”„õŒì4

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

untitled

H.264/AVC 2 H.265/HEVC 1 H.265 JCT-VC HM(HEVC Test Model) HM 5 5 SIMD HM 33%

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

Express5800/110Ee Pentium 1. Express5800/110Ee N N Express5800/110Ee Express5800/110Ee ( /800EB(256)) ( /800EB(256) 20W) CPU L1 L2 CD-

連載講座 : 高生産並列言語を使いこなす (3) ゲーム木探索問題 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 概要 17 2 ゲーム木探索 必勝 必敗 引き分け 盤面の評価値 αβ 法 指し手の順序付け (mo

橡Webcamユーザーガイド03.PDF

はじめに

Complex Lab – Operating Systems - Graphical Console

AtCoder Regular Contest 073 Editorial Kohei Morita(yosupo) A: Shiritori if python3 a, b, c = input().split() if a[len(a)-1] == b[0] and b[len(

インテル(R) Visual Fortran Composer XE

1 # include < stdio.h> 2 # include < string.h> 3 4 int main (){ 5 char str [222]; 6 scanf ("%s", str ); 7 int n= strlen ( str ); 8 for ( int i=n -2; i

r07.dvi

ohp07.dvi

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

3 SIMPLE ver 3.2: SIMPLE (SIxteen-bit MicroProcessor for Laboratory Experiment) 1 16 SIMPLE SIMPLE 2 SIMPLE 2.1 SIMPLE (main memo

main.dvi

Oracle Change Management Pack, Oracle Diagnostics Pack, Oracle Tuning Packインストレーション・ガイド リリース2.2


price, style. Office. VAJ/DG5TFTSXGA+ Pentium III VA0J/DX.TFTXGA Pentium III VAJ/DF5TFTXGA Pentium III VA0H/DF5TFTXGA Celeron VA0J/DF5TFTXGA Pentium I

SQUFOF NTT Shanks SQUFOF SQUFOF Pentium III Pentium 4 SQUFOF 2.03 (Pentium 4 2.0GHz Willamette) N UBASIC 50 / 200 [

スパコンに通じる並列プログラミングの基礎

Getting Started Creative Sound Blaster Live! 5.1 Creative Sound Blaster Live! 5.1 Digital Audio Creative Technology Ltd. Creative Technology Ltd. 1 Co

スパコンに通じる並列プログラミングの基礎

Express5800/120Lf 1. Express5800/120Lf N N N Express5800/120Lf Express5800/120Lf Express5800/120Lf ( /1BG(256)) ( /1BG(256)) (

,,,,., C Java,,.,,.,., ,,.,, i

AxC_lj.fm


indd

インテル(R) Visual Fortran Composer XE 2013 Windows版 入門ガイド

hotspot の特定と最適化

HP Compaq Business Desktop dx7300シリーズ

Express5800/120Ed

SystemC言語概論

(Basic Theory of Information Processing) 1

スパコンに通じる並列プログラミングの基礎

Microsoft PowerPoint - NxLecture ppt [互換モード]

GPGPU

EPSON ES-D200 パソコンでのスキャンガイド

SonicStage Ver. 2.0

3 Powered by mod_perl, Apache & MySQL use Item; my $item = Item->new( id => 1, name => ' ', price => 1200,


Microsoft PowerPoint - iaca.ppt

Excel97関数編

untitled

平成29年度英語力調査結果(中学3年生)の概要

HP Compaq Business Desktop dc7700シリーズ

1 2 3

1 Code Generation Part I Chapter 8 (1 st ed. Ch.9) COP5621 Compiler Construction Copyright Robert van Engelen, Florida State University,

m

2 ( ) i

Introduction Purpose This training course demonstrates the use of the High-performance Embedded Workshop (HEW), a key tool for developing software for

HA8000シリーズ ユーザーズガイド ~BIOS編~ HA8000/RS110/TS10 2013年6月~モデル

Java Java Java Java Java 4 p * *** ***** *** * Unix p a,b,c,d 100,200,250,500 a*b = a*b+c = a*b+c*d = (a+b)*(c+d) = 225

untitled

GPU GPU CPU CPU CPU GPU GPU N N CPU ( ) 1 GPU CPU GPU 2D 3D CPU GPU GPU GPGPU GPGPU 2 nvidia GPU CUDA 3 GPU 3.1 GPU Core 1

I I / 47

FFTSS Library Version 3.0 User's Guide

アセンブラ入門(CASL II) 第3版

25 II :30 16:00 (1),. Do not open this problem booklet until the start of the examination is announced. (2) 3.. Answer the following 3 proble

HP COMPAQ BUSINESS DESKTOP DC7800シリーズ

XcalableMP入門

26 FPGA FPGA (Field Programmable Gate Array) ASIC (Application Specific Integrated Circuit) FPGA FPGA FPGA FPGA Linux FreeDOS skewed way L1

NotePC 8 10cd=m 2 965cd=m Note-PC Weber L,M,S { i {

Transcription:

SIMD 2(SSE2) / 2.0 2000 7 : 248602J-001 01/10/30 1

305-8603 115 Fax: 0120-47-8832 * Copyright Intel Corporation 1999-2001 01/10/30 2

1...5 2...5 2.1...5 2.1.1...5 2.1.2...8 3...9 3.1...9 3.2...9 4...9 5 C/C++...10 6 SSE2 DVEC...11 7 SSE2...15 A -... A-1... A-3 01/10/30 3

2.0 Pentium 4 2000 7 1.0 1999 9 1. Thomas H. Cormen Charles E. Leiserson Ronald L. Rivest Introduction to Algorithms The MIT Press Cambridge Massachusetts 1991 2. SIMD (Streaming SIMD Extensions) AP-805 243639J-002 01/10/30 4

1 SIMD 2(SSE2 Streaming SIMD Extensions 2) SIMD(Single Instruction Multiple Data) SIMD IA-32 SIMD SIMD (SSE) SIMD 128 SIMD 64 SIMD 3D (3D) / SSE2 / 2 1 ( ) SSE2 2 2.1 C 1 C if(maxdouble < the_array[i]) maxdouble = the_array[i]; maxindex = i; C N-1 2.1.1 C 1 (if ) 01/10/30 5

1 ( N-1 ) N 2N-2 2 2 1 SSE2 2 2 SSE2 8 128 64 2 SSE2 SIMD 1. 2 movapd 2. 2 128 ( 2 ) 2 3. 2 128 ( 2 ) cmpeqpd 4. 128 32 movmskpd 5. 128 OR orpd 6. 7 14 ( xmm0 ) maxloop: movapd xmm1,[edi - 16] movapd xmm2,[edi - 32] movapd xmm3,[edi - 48] movapd xmm4,[edi - 64] movapd xmm5,[edi - 80] movapd xmm6,[edi - 96] movapd xmm7,[edi - 112] sub edi,112 xmm0,xmm1 xmm2,xmm3 01/10/30 6

sub ecx,14 cmp ecx,14 xmm4,xmm5 xmm6,xmm7 xmm0,xmm2 xmm4,xmm6 xmm0,xmm4 AP-937 SSE2 jge maxloop 1 (C for ) indexloop: cmp ecx,12 jle indexlast movapd xmm0,[edi] movapd xmm1,[edi + 16] movapd xmm2,[edi + 32] movapd xmm3,[edi + 48] movapd xmm4,[edi + 64] movapd xmm6,[edi + 80] add edi,96 add edx,12 sub ecx,12 cmpeqpd xmm0,xmm5 cmpeqpd xmm1,xmm5 cmpeqpd xmm2,xmm5 cmpeqpd xmm3,xmm5 cmpeqpd xmm4,xmm5 cmpeqpd xmm6,xmm5 // OR our registers to see if the maximum value was found orpd xmm0,xmm1 orpd xmm2,xmm3 orpd xmm4,xmm6 orpd xmm0,xmm2 orpd xmm0,xmm4 // Move the result of the OR into eax movmskpd eax,xmm0 cmp eax,0 jz indexloop 01/10/30 7

1 6 1 (orpd ) movmskpd orpd ( movmskpd ) orpd movmskpd 2 SSE2 2 SSE2 2.1.2 1. 16 8 16 2. indexdone 3. minpd SSE2 C++ (DVEC) simd_max simd_min 4. 2 5. 3 2 01/10/30 8

3 3.1 SSE2 C 2 1. SSE2 2 SSE2 x87 2. C (if ) 1 SSE2 3.2 SSE2 ( ) (10,000 ) 1. ( ) 1 2. AP-805 4 SSE2 SSE2 2 01/10/30 9

5 C/C++ double max_c(double *the_array, int array_size, int *index) int maxindex = 0; // Initialize maxdouble with the value of the first item in the vector double maxdouble = the_array[0]; for(int i=1; i<array_size; i++) // Compare maxdouble with remaining vector elements // Keep track of the maximum value and its index if(maxdouble < the_array[i]) maxdouble = the_array[i]; maxindex = i; *index = maxindex; return(maxdouble); 01/10/30 10

6 SSE2 DVEC double maxw_dvec_unrolled(double *the_array, int array_size, int *index) // Assume 8 or 16 byte alignment assert(((unsigned int)&the_array[0] & (0x07)) == 0); // Use C code if array size is small if (array_size<=18) int maxindex = 0; double maxdouble = the_array[0]; for(int i=1; i<array_size; i++) if(maxdouble < the_array[i]) maxdouble = the_array[i]; maxindex = i; *index = maxindex; return(maxdouble); F64vec2 r1(0.0,0.0), r2(0.0,0.0), r3(0.0,0.0), r4(0.0,0.0), r5(0.0,0.0), r6(0.0,0.0), r7(0.0,0.0), r8(0.0,0.0); double max = 0.0; int i=0; int j, mask; F64vec2 *aligned_front_of_array; F64vec2 *aligned_end_of_array; *index = 0; int front_alignment,back_alignment; // Calculate alignments and compensate if 8 byte aligned if((((unsigned int)&the_array[0]) & (0x0F)) == 0) front_alignment = 1; else front_alignment = 0; if((((unsigned int)&the_array[array_size - 1]) & (0x0F)) == 0) back_alignment = 1; else back_alignment = 0; if(!back_alignment) aligned_end_of_array = (F64vec2 *)&the_array[array_size - 2]; else aligned_end_of_array = (F64vec2 *)&the_array[array_size - 1]; 01/10/30 11

if(!front_alignment) aligned_front_of_array = (F64vec2 *)&the_array[1]; else aligned_front_of_array = (F64vec2 *)&the_array[0]; r1 = _mm_loadu_pd(&the_array[array_size - 2]); r2 = _mm_loadu_pd(&the_array[0]); r1 = simd_max(r1,r2); Loop through the vector and find the maximum value j = array_size/16 * 8; for(i=1; i<j; i+=8) r1 = simd_max(r1,*(aligned_end_of_array - i)); r2 = simd_max(r2,*(aligned_end_of_array - (i+1))); r3 = simd_max(r3,*(aligned_end_of_array - (i+2))); r4 = simd_max(r4,*(aligned_end_of_array - (i+3))); r5 = simd_max(r5,*(aligned_end_of_array - (i+4))); r6 = simd_max(r6,*(aligned_end_of_array - (i+5))); r7 = simd_max(r7,*(aligned_end_of_array - (i+6))); r8 = simd_max(r8,*(aligned_end_of_array - (i+7))); r1 = simd_max(r1,*(aligned_front_of_array)); r2 = simd_max(r2,*(aligned_front_of_array+1)); r3 = simd_max(r3,*(aligned_front_of_array+2)); r4 = simd_max(r4,*(aligned_front_of_array+3)); r5 = simd_max(r5,*(aligned_front_of_array+4)); r6 = simd_max(r6,*(aligned_front_of_array+5)); r7 = simd_max(r7,*(aligned_front_of_array+6)); r8 = simd_max(r8,*(aligned_front_of_array+7)); r1 = simd_max(r1,r2); r3 = simd_max(r3,r4); r5 = simd_max(r5,r6); r7 = simd_max(r7,r8); r1 = simd_max(r1,r3); r5 = simd_max(r5,r7); r1 = simd_max(r1,r5); // Create a mask of maximum values in r5 r5 = unpack_low(r1,r1); r1 = unpack_high(r1,r1); r5 = simd_max(r5,r1); _mm_store_sd(&max,r5); // Store the max value // Calculate the index now (starting from the front of array in cache) if(!front_alignment) 01/10/30 12

r1 = _mm_loadu_pd(&the_array[0]); r1 = cmpeq(r1,r5); mask = move_mask(r1); // If we are lucky, the max is in the front of the array if(mask) if(mask == 3) mask = 1; *index = mask-1; return(max); *index = 1; // Last two doubles to look at r1 = _mm_loadu_pd(&the_array[array_size - 2]); r1 = cmpeq(r1,r5); mask = move_mask(r1); if(mask) if(mask == 2) *index = array_size - 1; else *index = array_size - 2; return(max); i = 0; // Go through array from the front and look for index while(!mask) r1 = cmpeq(*(aligned_front_of_array+i),r5); i++; r2 = cmpeq(*(aligned_front_of_array+i),r5); i++; r3 = cmpeq(*(aligned_front_of_array+i),r5); i++; r4 = cmpeq(*(aligned_front_of_array+i),r5); i++; r6 = cmpeq(*(aligned_front_of_array+i),r5); i++; r7 = cmpeq(*(aligned_front_of_array+i),r5); i++; r1 = _mm_or_pd(r1,r2); r3 = _mm_or_pd(r3,r4); r6 = _mm_or_pd(r6,r7); r1 = _mm_or_pd(r1,r3); r1 = _mm_or_pd(r1,r6); mask = move_mask(r1); if((i*2+12) >= array_size) break; 01/10/30 13

i -= 6; mask = 0; while(!mask) r1 = cmpeq(*(aligned_front_of_array+i),r5); mask = move_mask(r1); i++; i--; if(mask == 3) mask = 1; *index += 2*i + (mask-1); return(max); 01/10/30 14

7 SSE2 double maxw_asm(double *the_array, int array_size, int *index) double maximum; int indexvalue = 0; double *end_of_array,*aligned_end_of_array,*aligned_front_of_array; // Assume 8 or 16 byte alignment assert(((unsigned int)&the_array[0] & (0x07)) == 0); // Array size must be at least 18 elements or we use the C code if (array_size<=18) int maxindex = 0; float maxfloat = the_array[0]; for(int i=1; i<array_size; i++) if(maxfloat < the_array[i]) maxfloat = the_array[i]; maxindex = i; *index = maxindex; return(maxfloat); end_of_array = &the_array[array_size - 1]; int front_alignment,back_alignment; if((((unsigned int)&the_array[0]) & (0x0F)) == 0) front_alignment = 1; else front_alignment = 0; if((((unsigned int)&the_array[array_size - 1]) & (0x0F)) == 0) back_alignment = 1; else back_alignment = 0; if(!back_alignment) aligned_end_of_array = &the_array[array_size - 2]; else aligned_end_of_array = &the_array[array_size - 1]; 01/10/30 15

if(!front_alignment) aligned_front_of_array = &the_array[1]; indexvalue = 1; else aligned_front_of_array = &the_array[0]; asm mov mov mov mov mov esi,the_array ecx,array_size edx,ecx edi,aligned_end_of_array eax,end_of_array movupd xmm0,[eax - 8] movupd xmm1,[esi] xmm0,xmm1 sub ecx,1 // Loop where we find the maximum maxloop: movapd xmm1,[edi - 16] movapd xmm2,[edi - 32] movapd xmm3,[edi - 48] movapd xmm4,[edi - 64] movapd xmm5,[edi - 80] movapd xmm6,[edi - 96] movapd xmm7,[edi - 112] sub edi,112 xmm0,xmm1 xmm2,xmm3 xmm4,xmm5 xmm6,xmm7 xmm0,xmm2 xmm4,xmm6 xmm0,xmm4 sub cmp jge ecx,14 ecx,14 maxloop 01/10/30 16

mov edi,aligned_front_of_array maxdone: movapd xmm2,[edi] xmm0,xmm2 add sub cmp jg edi,16 ecx,2 ecx,0 maxdone mov edi,aligned_front_of_array shufpd xmm5,xmm0,3 xmm5,xmm0 shufpd xmm5,xmm5,1 xmm5,xmm0 // Created mask of maximum values in xmm5 movsd maximum,xmm5 // Stored maximum value sub edx,2 movupd xmm0,[eax - 8] cmpeqpd xmm0,xmm5 movmskpd eax,xmm0 cmp eax,0 jne indexdone xor edx,edx movupd xmm0,[esi] cmpeqpd xmm0,xmm5 movmskpd eax,xmm0 cmp eax,0 jne indexdone mov ecx,array_size // Loop where we find the index indexloop: cmp ecx,12 jle indexlast movapd xmm0,[edi] movapd xmm1,[edi + 16] 01/10/30 17

movapd xmm2,[edi + 32] movapd xmm3,[edi + 48] movapd xmm4,[edi + 64] movapd xmm6,[edi + 80] add edi,96 add edx,12 sub ecx,12 cmpeqpd xmm0,xmm5 cmpeqpd xmm1,xmm5 cmpeqpd xmm2,xmm5 cmpeqpd xmm3,xmm5 cmpeqpd xmm4,xmm5 cmpeqpd xmm6,xmm5 orpd xmm0,xmm1 orpd xmm2,xmm3 orpd xmm4,xmm6 orpd xmm0,xmm2 orpd xmm0,xmm4 movmskpd eax,xmm0 cmp eax,0 jz indexloop sub sub edx,12 edi,96 indexlast: movapd xmm0,[edi] cmpeqpd xmm0,xmm5 movmskpd eax,xmm0 add edi,16 add edx,2 cmp eax,0 jz indexlast sub add edx,2 edx,indexvalue indexdone: cmp eax,2 01/10/30 18

jne add end edx,1 end: mov indexvalue,edx *index = indexvalue; return(maximum); 01/10/30 19

A - 1: ( ) Pentium III (733 MHz) C * 7.74 4.94 SSE2 ASM * 1.09 SSE2 ASM * SSE2 ASM * SSE2 DVEC * 1.14 SSE2 DVEC * 1.86 SSE2 * 2.48 Pentium 4 (1.2 GHz) * ( ) 1.74 2.22 01/10/30 A-1

2: 1 Pentium 4 (SSE2 ASM vs.c ) 2.84 Pentium 4 (SSE2 DVEC vs.c ) 2.66 Pentium 4 C vs. Pentium III 1.57 C 1 2 1.2 GHz Pentium 4 733 MHz Pentium III Pentium 4 Pentium III 1 1000 C C C/C++ (ASM) (DVEC) SSE2 SIMD 2 SSE2 2.84 ( ) SSE2 01/10/30 A-2

3: Pentium III Pentium III (733 MHz) Desktop Board VC820 BIOS VC82010A.86A.0028.P10 2 256 KB 128 MB RDRAM PC800-45 Ultra ATA 6.00.012 IBM DJNA-371800 ATA-66 / Creative Labs 3D Blaster Annihilator Pro AGP nvidia GeForce256 DDR 32MB NVidia Reference Driver 5.22 Windows 2000 2195 4: Pentium 4 Pentium 4 (1.2 GHz) Desktop Board D850GB BIOS GB85010A.86A.0014.D.0007201756 2 256 KB 128 MB RDRAM PC800-45 Ultra ATA 6.00.012 IBM DJNA-371800 ATA-66 / Creative Labs 3D Blaster Annihilator Pro AGP nvidia GeForce256 DDR 32MB NVidia Reference Driver 5.22 Windows 2000 2195 01/10/30 A-3