SAD 23 (410M520)
Abstract

In recent years, video resolution has continued to rise. The encoders that compress the resulting growth in video data have become more sophisticated, and their computational load has increased sharply. Because motion estimation accounts for most of the encoding work, speeding it up has been studied for a long time. However, the SAD (sum of absolute differences) instructions built into general-purpose processors have not advanced since the MPSADBW instruction of the x86 processor, and because that instruction is not suited to H.264/AVC encoding, it is an obstacle to faster software encoding. In this paper, I therefore speed up motion estimation by realizing a highly parallel SAD instruction that supports all of the variable block sizes of H.264/AVC.

Motion estimation performs block matching between the current picture and the reference picture, and calculates a SAD for each match. The x86 processor provides the MPSADBW instruction for computing multiple SADs at once. This instruction is limited to SADs along the horizontal direction, so it cannot efficiently execute tracking-type motion estimation, which is the basic form of motion estimation in software today. Even for the tracking-type estimation with a 3x3 square pattern, which this laboratory uses for its high degree of data reuse, MPSADBW can parallelize only three points at a time. To solve this problem, I propose a highly parallel SAD instruction set that computes the SADs of all nine points of the square pattern at once, and I evaluate its effectiveness. I also designed the circuit structure that executes the proposed instruction set.

I evaluated the number of cycles required for the nine-point SAD operation and the speedup of the proposed instruction set over the MPSADBW instruction of the x86 processor. As a result, processing speed improved by a factor of about 4.6, and motion estimation was sped up accordingly.
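1 Introduction

1.1 Background

Video resolution keeps rising, from High Definition (HD) toward Ultra High Definition (UHD), which has 16 times as many pixels as HD and is planned for around 2020. Compressing this growing volume of data relies on video coding such as H.264/AVC [1,2], whose motion estimation supports 7 block sizes and dominates the encoding workload.

1.2 Purpose

Motion estimation evaluates each candidate position with the sum of absolute differences (SAD), so the SAD operation is executed a very large number of times. Tracking-type motion estimation evaluates a 3x3 square pattern of search points at each step. As an instruction for SAD operations, the x86 SSE4 (Streaming SIMD Extensions 4) extension provides a SIMD (Single Instruction stream Multiple Data stream) instruction that computes several SADs with 1 instruction, MPSADBW (Multiple Packed Sums of Absolute Difference Byte Word) [3-5]. However, MPSADBW computes SADs only along the horizontal direction, so for the 3x3 pattern it cannot compute the SADs of all points together. The purpose of this paper is to realize SAD instructions that compute the SADs of the whole 3x3 pattern in parallel with 1 instruction.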
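2 Motion estimation and the SAD operation

2.1 Motion estimation

Motion estimation performs block matching between 2 pictures, the current picture and the reference picture. A sum of absolute differences (SAD) is calculated for each candidate position, and the position with the smallest SAD is taken as the best match.

2.2 Tracking-type motion estimation

Tracking-type motion estimation evaluates, at each step, a 3x3 square pattern of search points centered on the current best position (Figure 2.1). Figure 2.1 shows the case of a 4x4 block: the 9 search points are the center and its 8 neighbors. The search moves to the best of the 9 points and repeats with a new 3x3 pattern around it.

Figure 2.1: Tracking-type motion estimation (search points 1 to 9 of the 3x3 pattern for a 4x4 block; drawing not recoverable)

EPZS is a representative example of this kind of search.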
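The tracking-type search with a 3x3 square pattern can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `sad_at` and `tracking_search`, the `max_steps` limit, and the rule "move only on a strict improvement" are all introduced here, and the test picture is a synthetic ramp.

```python
def sad_at(cur, ref, top, left):
    """SAD between block `cur` and the same-sized area of `ref` at (top, left)."""
    h, w = len(cur), len(cur[0])
    return sum(abs(cur[y][x] - ref[top + y][left + x])
               for y in range(h) for x in range(w))


def tracking_search(cur, ref, start, max_steps=64):
    """Follow the best of the nine 3x3 points until the centre is the best."""
    h, w = len(cur), len(cur[0])
    top, left = start
    best = sad_at(cur, ref, top, left)
    for _ in range(max_steps):          # max_steps is an assumed safety limit
        move = None
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                t, l = top + dy, left + dx
                inside = 0 <= t <= len(ref) - h and 0 <= l <= len(ref[0]) - w
                if inside:
                    s = sad_at(cur, ref, t, l)
                    if s < best:        # move only on a strict improvement
                        best, move = s, (t, l)
        if move is None:                # centre is already the best point
            break
        top, left = move
    return (top, left), best


# Synthetic 32x32 ramp picture; the current block is the 4x4 area at (12, 9).
ref = [[3 * x + 5 * y for x in range(32)] for y in range(32)]
cur = [row[9:13] for row in ref[12:16]]
print(tracking_search(cur, ref, (10, 6)))  # ((12, 9), 0)
```

Each step evaluates nine SADs around the current centre, which is the unit of work that a nine-point SAD instruction would cover in one go.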
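2.3 Sum of absolute differences

Block matching compares 2 blocks, one from the current picture and one from the reference picture, using the Sum of Absolute Differences (SAD). For a current block C and a reference block R of size MxN, the SAD is given by Equation (1).

$$\mathrm{SAD}(C, R) = \sum_{y=0}^{N-1} \sum_{x=0}^{M-1} \left| C_{xy} - R_{xy} \right| \qquad (1)$$

Figure 2.2 shows an example for a 4x4 block (M=4, N=4).

    C                  R                  |C - R|
    123  47  39  84    128  65  41  76     5  18   2   8
    124  49  38  86    133  69  41  74     9  20   3  12
    103  54  45  71     88  61  47  56    15   7   2  15
    126  47  35  76    132  78  36  67     6  31   1   9

    SAD = 163

Figure 2.2: Example of a SAD calculation

The smaller the SAD, the more similar the 2 blocks are.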
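As a quick check of Equation (1), the following Python sketch computes the SAD of the 4x4 example in Figure 2.2. The function name `sad` is introduced here for illustration only; the pixel values are the ones shown in the figure.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


# Current block C and reference block R of Figure 2.2.
C = [[123, 47, 39, 84],
     [124, 49, 38, 86],
     [103, 54, 45, 71],
     [126, 47, 35, 76]]
R = [[128, 65, 41, 76],
     [133, 69, 41, 74],
     [88, 61, 47, 56],
     [132, 78, 36, 67]]

print(sad(C, R))  # 163
```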
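2.4 Variable block sizes

Motion estimation in H.264/AVC uses 7 block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 (Figure 2.3).

Figure 2.3: Block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4)

The 6 sizes 16x16, 16x8, 8x16, 8x8, 8x4, and 4x8 out of the 7 can each be divided into 4x4 blocks, as shown in Figure 2.4. In terms of width, 16x16 and 16x8 are 16 pixels wide, 8x16, 8x8, and 8x4 are 8 pixels wide, and 4x8 and 4x4 are 4 pixels wide.

Figure 2.4: Division of each block size into 4x4 blocks

Because a SAD is a plain sum, the SAD of a larger block is the sum of the SADs of the 4x4 blocks it contains. For example, a 16x16 block consists of 16 4x4 blocks, so adding their 16 SADs gives the 16x16 SAD.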
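In this way, SADs calculated per 4x4 block cover every block size of H.264/AVC.

2.5 The MPSADBW instruction

MPSADBW is a SIMD instruction of the x86 processor that belongs to SSE4.1. It takes 2 source operands, computes SADs over 4 bytes, and produces 8 SADs with 1 instruction. A 4-byte block taken from one operand is compared with 8 windows of 4 bytes taken from the other operand, each window shifted by 1 byte from the previous one. The 8 SADs therefore all lie along one horizontal line. Figure 2.5 shows an example of the operation, and Figure 2.6 shows the 8 SAD operations it contains.

    Operand 1 (highest byte first):  23  5 73 56  8 13  5 72 28 35 43 16 18 34 27 95
    Operand 2 (highest byte first):  15  1 69 43  5 13  6 87 64  7 45  6 54 38 20 86

    Result of MPSADBW:   H    G    F    E    D    C    B    A
                        184  142   86  122   96   87  131   56

Figure 2.5: SAD operation of the MPSADBW instruction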
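The sliding behaviour shown in Figure 2.5 can be mimicked with a few lines of Python. This is only a sketch of the 8-window SAD pattern: the function name `mpsadbw_like` is made up for this example, and the real instruction also takes an immediate operand that selects the block and window offsets, which is left out here.

```python
def mpsadbw_like(block, window):
    """Eight SADs of a 4-byte block against the eight 4-byte windows of an
    11-byte run, each window shifted by one byte from the previous one."""
    if len(block) != 4 or len(window) != 11:
        raise ValueError("expected a 4-byte block and an 11-byte window run")
    return [sum(abs(b - w) for b, w in zip(block, window[i:i + 4]))
            for i in range(8)]


# Figure 2.5 prints the highest byte on the left; reverse to get byte 0 first.
operand1 = [23, 5, 73, 56, 8, 13, 5, 72, 28, 35, 43, 16, 18, 34, 27, 95][::-1]
operand2 = [15, 1, 69, 43, 5, 13, 6, 87, 64, 7, 45, 6, 54, 38, 20, 86][::-1]

sads = mpsadbw_like(operand2[:4], operand1[:11])
print(sads)  # SAD A first: [56, 131, 87, 96, 122, 86, 142, 184]
```

All eight windows come from the same row of bytes, which is why the results only cover horizontally adjacent search points.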
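    Block (from operand 2):  54 38 20 86

    SAD [A]:  block against  18 34 27 95
    SAD [B]:  block against  16 18 34 27
    SAD [C]:  block against  43 16 18 34
    SAD [D]:  block against  35 43 16 18
    SAD [E]:  block against  28 35 43 16
    SAD [F]:  block against  72 28 35 43
    SAD [G]:  block against   5 72 28 35
    SAD [H]:  block against  13  5 72 28

Figure 2.6: The 8 SAD operations of the MPSADBW instruction

2.6 Problems of the SAD operation with MPSADBW

The 8 SADs of MPSADBW are all horizontal neighbors, as Figure 2.6 shows. With the 3x3 search pattern, only 3 of them fall inside the pattern, so most of the parallelism of the instruction goes unused. Table 2.1 shows the processing speed for 3 cases: without MPSADBW (3x3 pattern), with MPSADBW (3x3 pattern), and with MPSADBW (5x3 pattern). The speed without MPSADBW is taken as 1.

Table 2.1: Speed ratio with the MPSADBW instruction

    Search pattern [WxH]    [3x3]       [3x3]    [5x3]
    MPSADBW                 not used    used     used
    Speed ratio             1           1.08     1.18

Even with the 5x3 pattern, MPSADBW gives a speedup of only 1.18 times. To go further, the SADs of all 9 points of the pattern have to be computed in parallel.

3 Proposed highly parallel SAD instruction set

3.1 Instruction set

To compute 9 SADs in parallel, the following instructions are proposed.

    Highly parallel SAD instruction (Highly Parallel Multiple Packed Sums of Absolute Difference Byte Word : HPMPSADBW) : computes the 9 SADs of the 3x3 pattern in parallel
    Input instruction (Move Input : MovIn) : transfers data into the SAD operation unit
    Output instruction (Move Output : MovOut) : transfers SAD results out of the SAD operation unit
    Add-and-compare instruction (Add and Compare : AddCom) : adds SADs and compares the resulting SADs

3.1.1 Highly parallel SAD instruction

The highly parallel SAD instruction (Highly Parallel Multiple Packed Sums of Absolute Difference Byte Word : HPMPSADBW) computes the 9 SADs of the 3x3 pattern in parallel. It works in units of 4x1 SADs and computes 36 4x1 SADs with 1 instruction. It has 2 modes, a 16-pixel mode and an 8-pixel mode: the 16-pixel SAD mode processes 1 row per instruction, and the 8-pixel SAD mode processes 2 rows per instruction.

The hpmpsadbw instruction takes 3 operands. The first (r1) and second (r2) are registers, and the third is an immediate (i) that is either 0 or 1. When i is 0 the 16-pixel SAD mode is used, and when i is 1 the 8-pixel SAD mode is used.

    hpmpsadbw r1, r2, i

The hpmpsadbw and addcom instructions can be executed in parallel.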
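One way to read the 16-pixel mode is sketched below in Python: for one 16-pixel row of the current block and three 18-pixel reference rows, it produces 36 4x1 SADs (9 search points times 4 column groups), and summing them over 16 rows gives the nine 16x16 SADs. This is an interpretation under stated assumptions, not the circuit's actual data layout: the names `hp_row_sads` and `nine_point_sad_16x16`, the ordering of the outputs, and the 18-pixel reference row width are all chosen for this example.

```python
import random


def hp_row_sads(cur_row, ref_rows):
    """36 4x1 SADs for one 16-pixel row: 9 search points x 4 column groups.

    ref_rows holds three 18-pixel reference rows (vertical offsets -1, 0, +1).
    The result is indexed as out[3 * dy + dx][group].
    """
    assert len(cur_row) == 16 and len(ref_rows) == 3
    assert all(len(r) == 18 for r in ref_rows)
    out = []
    for dy in range(3):                 # vertical offset -1, 0, +1
        for dx in range(3):             # horizontal offset -1, 0, +1
            ref = ref_rows[dy][dx:dx + 16]
            out.append([sum(abs(cur_row[x] - ref[x]) for x in range(g, g + 4))
                        for g in (0, 4, 8, 12)])
    return out


def nine_point_sad_16x16(cur, ref):
    """Nine 16x16 SADs from an 18x18 reference window, one row at a time."""
    totals = [0] * 9
    for y in range(16):
        for point, groups in enumerate(hp_row_sads(cur[y], ref[y:y + 3])):
            totals[point] += sum(groups)
    return totals


random.seed(0)
cur = [[random.randrange(256) for _ in range(16)] for _ in range(16)]
ref = [[random.randrange(256) for _ in range(18)] for _ in range(18)]
totals = nine_point_sad_16x16(cur, ref)
print(totals.index(min(totals)))  # index of the best of the nine points
```

Keeping the 4x1 partial SADs separate, rather than summing them immediately, is what allows the same results to be regrouped into the smaller block sizes.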
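3.1.2 Input instruction

The input instruction (Move Input : MovIn) transfers data into the SAD operation unit. The SAD of 1 row of the current block needs 3 rows of the reference picture, so the first 2 reference rows are loaded with movin before the first SAD instruction is issued. The movin instruction takes 1 operand, the register (r1) to transfer.

    movin r1

3.1.3 Output instruction

The output instruction (Move Output : MovOut) transfers the SADs held in the SAD operation unit to a register and resets them to 0. The movout instruction takes 1 operand, the destination register (r1).

    movout r1

3.1.4 Add-and-compare instruction

The add-and-compare instruction (Add and Compare : AddCom) adds 2 sets of SADs and compares the resulting SADs. The addcom instruction takes 4 operands: the first (r1) receives the sum, the second and third (r2, r3) are the 2 sources, and the fourth (r4) receives the SAD selected by the comparison.

    addcom r1, r2, r3, r4

The addcom instruction can be executed in parallel with hpmpsadbw.

3.1.5 SAD programs with the proposed instruction set

This section shows how the 9 SADs are computed with hpmpsadbw and addcom. There are 2 kinds of registers: 256-bit registers R0 to R36 and 576-bit registers T0 to T5. Each 256-bit register can also be accessed as 8 32-bit registers (r0 to r295), and each 576-bit register as 4 144-bit registers (t0 to t19). Figure 3.7 shows the register structure. R0 to R15 hold the current picture and R16 to R33 hold the reference picture. The 32-bit registers from r272 upward, which receive the addcom results in the programs below, belong to R34 to R36.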
The SAD program for the 16x16 block is shown in Figure 3.8, for 16x8 in Figure 3.9, for 8x16 in Figure 3.10, for 8x8 in Figure 3.11, and for 8x4 in Figure 3.12.

    256-bit register, 8 x 32 bit       576-bit register, 4 x 144 bit
    R0 = r7 r6 r5 r4 r3 r2 r1 r0       T0 = t3 t2 t1 t0
    R1                                 T1
    ...                                ...
    Rn                                 Tn

Figure 3.7: Register structure

     1  movin     R16
     2  movin     R17
     3  hpmpsadbw R0, R18, 0
     4  hpmpsadbw R1, R19, 0
     5  hpmpsadbw R2, R20, 0
     6  hpmpsadbw R3, R21, 0
     7  movout    T0
     8  hpmpsadbw R4, R22, 0        # 8, 9
     9  addcom    t4, t3, t2, r272
    10  hpmpsadbw R5, R23, 0        # 10, 11
    11  addcom    t5, t1, t0, r272
    12  hpmpsadbw R6, R24, 0
    13  hpmpsadbw R7, R25, 0
    14  movout    T0
    15  hpmpsadbw R8, R26, 0        # 15, 16
    16  addcom    t6, t3, t2, r272
    17  hpmpsadbw R9, R27, 0        # 17, 18
    18  addcom    t7, t1, t0, r272
    19  hpmpsadbw R10, R28, 0       # 19, 20
    20  addcom    t4, t4, t6, r272
    21  hpmpsadbw R11, R29, 0       # 21, 22
    22  addcom    t5, t5, t7, r272
    23  movout    T0
    24  hpmpsadbw R12, R30, 0       # 24, 25
    25  addcom    t6, t3, t2, r272
    26  hpmpsadbw R13, R31, 0       # 26, 27
    27  addcom    t7, t1, t0, r272
    28  hpmpsadbw R14, R32, 0       # 28, 29
    29  addcom    t8, t4, t5, r272
    30  hpmpsadbw R15, R33, 0
    31  movout    T0
    32  addcom    t3, t3, t2, r273
    33  addcom    t1, t1, t0, r273
    34  addcom    t6, t6, t3, r273
    35  addcom    t7, t7, t1, r273
    36  addcom    t9, t6, t7, r273
    37  addcom    t10, t4, t6, r274
    38  addcom    t11, t5, t7, r274
    39  addcom    t12, t8, t9, r274
    40  addcom    t13, t12, t12, r275
    41  addcom    t14, t10, t11, r276

Figure 3.8: SAD program for the 16x16 block
     1  movin     R16
     2  movin     R17
     3  hpmpsadbw R0, R18, 0
     4  hpmpsadbw R1, R19, 0
     5  hpmpsadbw R2, R20, 0
     6  hpmpsadbw R3, R21, 0
     7  movout    T0
     8  hpmpsadbw R4, R22, 0        # 8, 9
     9  addcom    t8, t3, t2, r272
    10  hpmpsadbw R5, R23, 0        # 10, 11
    11  addcom    t9, t1, t0, r273
    12  hpmpsadbw R6, R24, 0
    13  hpmpsadbw R7, R25, 0
    14  movout    T1
    15  addcom    t10, t7, t6, r274
    16  addcom    t11, t5, t4, r275
    17  addcom    t12, t8, t10, r276
    18  addcom    t13, t9, t11, r277
    19  addcom    t14, t12, t13, r278
    20  addcom    t15, t3, t7, r279
    21  addcom    t16, t2, t6, r279
    22  addcom    t17, t1, t5, r279
    23  addcom    t18, t0, t4, r279
    24  addcom    t19, t14, t14, r279
    25  addcom    t19, t15, t16, r280
    26  addcom    t19, t17, t18, r281

Figure 3.9: SAD program for the 16x8 block
     1  movin     R16
     2  movin     R17
     3  hpmpsadbw R0, R18, 1
     4  hpmpsadbw R1, R20, 1
     5  movout    T0
     6  hpmpsadbw R4, R22, 1        # 6, 7
     7  addcom    t4, t3, t1, r272
     8  hpmpsadbw R5, R24, 1        # 8, 9
     9  addcom    t5, t2, t0, r273
    10  movout    T0
    11  hpmpsadbw R8, R26, 1        # 11, 12
    12  addcom    t6, t3, t1, r274
    13  hpmpsadbw R9, R28, 1        # 13, 14
    14  addcom    t7, t2, t0, r275
    15  movout    T0
    16  hpmpsadbw R12, R30, 1       # 16, 17
    17  addcom    t8, t3, t1, r276
    18  hpmpsadbw R13, R32, 1       # 18, 19
    19  addcom    t9, t2, t0, r277
    20  movout    T0
    21  addcom    t3, t3, t1, r278
    22  addcom    t2, t2, t0, r279
    23  addcom    t0, t4, t5, r280
    24  addcom    t1, t6, t7, r281
    25  addcom    t10, t8, t9, r282
    26  addcom    t11, t3, t2, r283
    27  addcom    t12, t4, t6, r284
    28  addcom    t13, t5, t7, r284
    29  addcom    t14, t8, t3, r284
    30  addcom    t15, t9, t2, r284
    31  addcom    t16, t0, t1, r284
    32  addcom    t17, t10, t11, r285
    33  addcom    t18, t16, t17, r286
    34  addcom    t19, t18, t18, r288
    35  addcom    t19, t12, t13, r289
    36  addcom    t19, t14, t15, r290

Figure 3.10: SAD program for the 8x16 block

     1  movin     R16
     2  movin     R17
     3  hpmpsadbw R0, R18, 1
     4  hpmpsadbw R1, R20, 1
     5  movout    T0
     6  hpmpsadbw R4, R22, 1        # 6, 7
     7  addcom    t4, t3, t1, r272
     8  hpmpsadbw R5, R24, 1        # 8, 9
     9  addcom    t5, t2, t0, r273
    10  movout    T0
    11  addcom    t3, t3, t1, r274
    12  addcom    t2, t2, t0, r275
    13  addcom    t0, t4, t5, r276
    14  addcom    t1, t3, t2, r277
    15  addcom    t6, t4, t3, r278
    16  addcom    t7, t5, t2, r278
    17  addcom    t8, t0, t1, r278
    18  addcom    t9, t8, t8, r279
    19  addcom    t9, t6, t7, r280

Figure 3.11: SAD program for the 8x8 block
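     1  movin     R16
     2  movin     R17
     3  hpmpsadbw R0, R18, 1
     4  hpmpsadbw R1, R20, 1
     5  movout    T0
     6  addcom    t3, t3, t1, r272
     7  addcom    t2, t2, t0, r273
     8  addcom    t1, t3, t2, r274
     9  addcom    t0, t1, t1, r275

Figure 3.12: SAD program for the 8x4 block

The way the SADs are added is shown in Figure 3.13 for the 16-pixel mode and in Figure 3.14 for the 8-pixel mode. In the 16-pixel mode, 1 instruction processes 1 row, and the results are accumulated into 4x4 SADs, which are then added to obtain the SADs of the larger blocks. In the 8-pixel mode, 1 instruction processes 2 rows, so the results form 4x2 SADs, which are added in the same way.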
Figure 3.13: SAD addition in the 16-pixel mode (the 16x16 block is divided into 16 4x4 blocks A to P; the 4x4 SADs are added into 8x4 and 4x8 SADs, then 8x8, then 16x8 and 8x16, and finally 16x16)

Figure 3.14: SAD addition in the 8-pixel mode (each 4x4 block A to P is split into 4x2 parts such as a0 and a1, built from 4x1 SADs; the 4x2 SADs are added into 4x4, then 8x4 and 4x8, then 8x8, and finally 8x16)
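The addition tree of Figure 3.13 can be written out as a short Python sketch. It assumes the 16 4x4 SADs of one search point are given as a 4x4 grid in the order A to P, row by row; the function name `combine_block_sads` and the dictionary layout are invented for this example, and the order of the additions is one possible choice rather than the one fixed by the programs in Figures 3.8 to 3.12.

```python
def combine_block_sads(s):
    """Build the SADs of the larger block sizes from a 4x4 grid of 4x4 SADs."""
    # 8x4: two horizontally adjacent 4x4 blocks.
    s8x4 = [[s[r][c] + s[r][c + 1] for c in (0, 2)] for r in range(4)]
    # 4x8: two vertically adjacent 4x4 blocks.
    s4x8 = [[s[r][c] + s[r + 1][c] for c in range(4)] for r in (0, 2)]
    # 8x8: two vertically adjacent 8x4 blocks.
    s8x8 = [[s8x4[r][c] + s8x4[r + 1][c] for c in range(2)] for r in (0, 2)]
    # 16x8: the two 8x8 blocks of one row; 8x16: the two of one column.
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]
    return {"4x4": s, "8x4": s8x4, "4x8": s4x8, "8x8": s8x8,
            "16x8": s16x8, "8x16": s8x16, "16x16": s16x8[0] + s16x8[1]}


# Dummy 4x4 SADs 1..16 standing in for blocks A..P.
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = combine_block_sads(grid)
print(result["16x16"])  # 136
```

Every pixel difference is computed once for the 4x4 blocks and only additions are needed afterwards, which is why one pass over the macroblock can serve all block sizes.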
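3.2 Circuit structure

The proposed instruction set is executed by 2 circuits: the SAD operation circuit and the add-and-compare circuit.

3.2.1 SAD operation circuit

The SAD operation circuit executes the highly parallel SAD instruction. In the 16-pixel SAD mode it processes 1 row per instruction, and in the 8-pixel SAD mode it processes 2 rows per instruction. The mode is selected by a control input: 0 selects the 16-pixel mode (1 row) and 1 selects the 8-pixel mode (2 rows). The SADs are computed in SIMD fashion by 4x1 SAD units. Figure 3.15 shows the SAD operation circuit.

Figure 3.15: SAD operation circuit (diagram not recoverable; surviving signal widths: 256, 256, 80, 80, 80, 80, 576)

The 4x1 SAD unit used inside the SAD operation circuit is shown in Figure 3.16.

Figure 3.16: 4x1 SAD unit (diagram not recoverable; surviving signal widths: 32, 32, 16, 8)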
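3.2.2 Add-and-compare circuit

The add-and-compare circuit executes the addcom instruction. As shown in Figure 3.7, each 576-bit register Tn is divided into 144-bit registers. One 144-bit register holds the 9 SADs of the 3x3 pattern, each as a 16-bit value. The circuit takes 2 144-bit registers, adds the 9 pairs of SADs in SIMD fashion, and writes the result of the comparison to a 32-bit register. Figure 3.17 shows the add-and-compare circuit.

Figure 3.17: Add-and-compare circuit (diagram not recoverable; surviving signal widths: 32, 144, 16, 576)

4 Evaluation

For the block sizes of H.264/AVC, the SAD programs in the 16-pixel and 8-pixel SAD modes were compared with the MPSADBW instruction of the x86 processor on the nine-point SAD operation.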
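Table 4.2 shows the number of cycles required for the nine-point SAD with the MPSADBW instruction and with the proposed SAD instruction set.

Table 4.2: Number of cycles for the nine-point SAD

    Block size [WxH]            16x16   16x8   8x16   8x8   8x4
    MPSADBW                     219     123    123    60    27
    Proposed instruction set    35      24     32     18    10

Figure 4.18 shows the number of cycles of the nine-point SAD with the proposed instruction set relative to MPSADBW.

Figure 4.18: Ratio of the number of SAD cycles (MPSADBW = 1; plot not recoverable)

The speedup relative to MPSADBW is shown in Figure 4.19.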
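The cycle counts of Table 4.2 can be compared directly with a few lines of Python. The sketch below only restates the table; the variable names are made up, and reading the second row of numbers as the proposed instruction set and averaging the per-block-size ratios with equal weight are assumptions of this example.

```python
sizes = ["16x16", "16x8", "8x16", "8x8", "8x4"]
mpsadbw_cycles = [219, 123, 123, 60, 27]     # Table 4.2, MPSADBW row
proposed_cycles = [35, 24, 32, 18, 10]       # Table 4.2, proposed row

# Cycle ratio (proposed / MPSADBW) and speedup (MPSADBW / proposed) per size.
ratios = [p / m for p, m in zip(proposed_cycles, mpsadbw_cycles)]
speedups = [m / p for p, m in zip(proposed_cycles, mpsadbw_cycles)]

mean_ratio = sum(ratios) / len(ratios)
mean_speedup = sum(speedups) / len(speedups)

for size, r, s in zip(sizes, ratios, speedups):
    print(f"{size:>5}: ratio {r:.2f}, speedup {s:.2f}")
print(f"mean ratio {mean_ratio:.2f}, mean speedup {mean_speedup:.2f}")
```

With equal weights the mean cycle ratio comes out at about 0.26 and the mean speedup at about 4.25, and the smallest gain is for the 8x4 block.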
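Figure 4.19: Speedup ratio (MPSADBW = 1; plot not recoverable)

Figure 4.18 shows that the nine-point SAD with the proposed instruction set needs on average about 26% of the cycles of MPSADBW, and Figure 4.19 shows that it is on average about 4.3 times faster than MPSADBW. The proposed method uses both the SAD instruction (hpmpsadbw) and the add-and-compare instruction (addcom), and the gain is smallest for the 8x4 block, at 2.7 times.

Next, the effect was evaluated on x264, an H.264/AVC encoder, by measuring the SAD processing inside x264. Three test sequences, CrowdRun, DucksTakeOff, and OldTownCross, of 500 frames each, were used at 3 resolutions: HD (1920x1080), 4Kx2K (3840x2160), and UHD (7680x4320). For each of the 3 resolutions, the SAD processing for the block sizes 16x16, 16x8, 8x16, 8x8, and 8x4 was compared with MPSADBW. The result for HD is shown in Figure 4.20, for 4Kx2K in Figure 4.21, and for UHD in Figure 4.22.

Figure 4.20: SAD processing for HD (plot not recoverable)

Figure 4.21: SAD processing for 4Kx2K (plot not recoverable)

Figure 4.22: SAD processing for UHD (plot not recoverable)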
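The SAD processing time in x264 is summarized in Table 4.3 and the speedup in Table 4.4. In both tables MPSADBW is taken as 1.

Table 4.3: Ratio of SAD processing time

                                HD      4Kx2K   UHD     Average
    MPSADBW                     1       1       1       1
    Proposed instruction set    0.22    0.21    0.22    0.22

Table 4.4: Speedup ratio

                                HD      4Kx2K   UHD     Average
    MPSADBW                     1       1       1       1
    Proposed instruction set    4.52    4.78    4.62    4.64

In x264, the SAD processing time was reduced by about 78%, a speedup of about 4.6 times, by computing the nine-point SADs with the proposed SAD instruction set.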
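The relation between Tables 4.3 and 4.4 can be checked with a short Python sketch. It assumes that the fourth column of each table is the plain average of the three resolutions; the variable names are introduced here for illustration.

```python
time_ratio = {"HD": 0.22, "4Kx2K": 0.21, "UHD": 0.22}   # Table 4.3
speedup = {"HD": 4.52, "4Kx2K": 4.78, "UHD": 4.62}      # Table 4.4

mean_time_ratio = sum(time_ratio.values()) / len(time_ratio)
mean_speedup = sum(speedup.values()) / len(speedup)
reduction = 1.0 - round(mean_time_ratio, 2)   # share of SAD time removed

print(f"mean time ratio {mean_time_ratio:.2f}")   # 0.22
print(f"mean speedup    {mean_speedup:.2f}")      # 4.64
print(f"time reduction  {reduction:.0%}")         # 78%

# A speedup is the reciprocal of a time ratio; the rounded table entries
# agree with that to within rounding.
for name in time_ratio:
    print(name, f"{1 / time_ratio[name]:.2f}", speedup[name])
```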
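5 Conclusion

This paper proposed SIMD instructions that compute the 9 SADs of the 3x3 search pattern in parallel and that support the block sizes of H.264/AVC. Compared with the MPSADBW instruction of the x86 processor, SAD processing became about 4.6 times faster.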
References

[1] ITU-T, H.264, http://www.itu.int/rec/T-REC-H.264/e, 2011/6.
[2] H.264/AVC, 2006.
[3] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture, 253665-041US, 2011/12.
[4] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes 2A, 2B, and 2C: Instruction Set Reference, A-Z, 325383-041US, 2011/12.
[5] Intel, Intel 64 and IA-32 (Japanese edition), 248966-024JA, 2011/4.
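A Creating the input video (mjpegtools)

A.1 Installing mjpegtools

mjpegtools is used to create the Y4M files that x264 reads. Download the mjpegtools source (mjpegtools-[version].tar.gz) from the following page.

    http://sourceforge.net/projects/mjpeg/files/mjpegtools/

Build mjpegtools (on the x86 machine [moule]):

    ./configure --prefix=/home/username/mjpegtools
    make install

ppmtoy4m is then created in /home/username/mjpegtools/bin/.

A.2 Creating the video

Concatenate the PPM images into 1 file and convert it to Y4M:

    cat *.ppm > input.txt
    ./ppmtoy4m -o 0 -n 500 -I p -F 25:1 -S 420mpeg2 < input.txt > output.y4m

Options of ppmtoy4m (excerpt):

    -o : frame offset
    -n : number of frames
    -F : frame rate (fps)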
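B x264

B.1 Installation

Download the x264 source (last_x264.tar.bz2) from the following page.

    http://www.videolan.org/developers/x264.html

Build x264:

    ./configure --prefix=/home/username/x264
    make
    make install

x264 is then installed in /home/username/x264/bin. If the installed yasm is too old, ./configure stops with the following message. Either install a newer yasm (Appendix C) or disable the assembly code with --disable-asm.

    Found yasm 0.7.2.2153
    Minimum version is yasm-1.0.0
    If you really want to compile without asm, configure with --disable-asm.

To use a yasm installed in the home directory, edit the configure script before running ./configure:

    before : AS="yasm"
    after  : AS="/home/username/yasm/bin/yasm"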
B.2 Running x264

Example command:

    ./x264 --psnr --partitions p8x8,p4x4,b8x8,i8x8,i4x4 -o output.264 input.y4m

C yasm

Download the yasm source (yasm-[version].tar.gz) from the [Source .tar.gz] link on the following page.

    http://yasm.tortall.net/download.html

Build yasm:

    ./configure --prefix=/home/username/yasm
    make install

yasm is then created in /home/username/yasm/bin/.