H.265/HEVC 2014 (410808)
Abstract In recent years, high-resolution video technology has been developed with the aim of starting UHD broadcasting, which has 16 times the resolution of HD, in 2020. In January of last year, standardization of the new video compression standard H.265/HEVC, which offers about twice the compression performance of the conventional standard H.264/AVC, was completed. In this laboratory, the study and evaluation of efficient HEVC encoding techniques rely on the reference software HM (HEVC Test Model) provided by JCT-VC. However, the original HM takes about 5 minutes to encode only 5 frames, which is too slow for repeated experimental evaluation. It is therefore necessary to optimize the HM code and raise evaluation efficiency. I profiled HM to detect the bottlenecks of HEVC encoding and implemented SIMD parallel processing under the condition that compression performance remains unchanged. As a result, the execution time was reduced by about 33 percent compared with the original HM.
1
1.1
(1920x1080) 16 (7680x4320) 2020 H.264/AVC (Advanced Video Coding) 10 MPEG (Moving Picture Experts Group) VCEG (Video Coding Experts Group) JCT-VC (Joint Collaborative Team on Video Coding) 2013 H.264/AVC H.265/HEVC (High Efficiency Video Coding) ([1]) Apple iPad H.265
H.265 H.265
1.2
JCT-VC HM (HEVC Test Model) H.265 Codec HM H.265 HM HM 5 5 H.265 H.265 x265 x265
HM x265 HM
2 HM
2.1
HM C++ Visual Studio (Figure 2.1)
Figure 2.1:
HM SAD
2.2
HM xgetsad SAD( ) xcalchads SATD( ) SAD SATD 4 (Figure 2.2)
[Figure 2.2: 4x4 blocks X (elements a00 to a33) and Y (elements b00 to b33); their difference Z = X - Y (elements c00 to c33) is used to compute SAD and SATD.]
Figure 2.2: SAD SATD

xgetsad 8x8 16x16 32x32 64x64 xcalchads 4x4 8x8 2.2 4x4 SAD 48
2.3
filter FIR (Finite Impulse Response) filter
(1)

$$ d[n_0] = c_0 s[n_0] + c_1 s[n_1] + \dots + c_7 s[n_7] \tag{1} $$

8 d c s FIR s c d SAD SATD
2.4
HM 2013 Ver10 HM
3
3.1
HM SIMD
3.2
HM
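3.3 SIMD (Single Instruction Multiple Data)
SIMD is a technique in which a single instruction processes multiple data elements at once, for example four 32-bit values. On x86 CPUs, SIMD is provided by SSE (Streaming SIMD Extensions), which operates on 128-bit registers. In 32-bit x86 mode, SSE provides 8 xmm registers. Each 128-bit xmm register holds four 32-bit int values or eight 16-bit short values. Figure 3.3 shows an example of a SIMD operation with SSE.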
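[Figure 3.3: two 128-bit registers each hold four 32-bit values, xmm0 = A0 A1 A2 A3 and xmm1 = B0 B1 B2 B3. One SIMD addition computes A0 + B0 = C0, A1 + B1 = C1, A2 + B2 = C2, and A3 + B3 = C3 at once, leaving C0 C1 C2 C3 in xmm0.]
Figure 3.3: SIMD

SSE has been extended through SSE2, SSE3, and SSSE3 (Supplemental Streaming SIMD Extensions 3). Table 3.1 lists examples of SSE instructions.

Table 3.1: SSE

Instruction              Operation
MOVDQU xmm1, xmm2/m128   Move 128 bits from xmm2 to xmm1
PADDW  xmm1, xmm2/m128   Add xmm1 and xmm2 in 16-bit units
PSUBW  xmm1, xmm2/m128   Subtract xmm2 from xmm1 in 16-bit units
PMULLW xmm1, xmm2/m128   Multiply xmm1 and xmm2 in 16-bit units (low 16 bits of each product)
PABSW  xmm1, xmm2/m128   Store the absolute value of each 16-bit element of xmm2 in xmm1

HM (Figure 3.4)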
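    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;

Figure 3.4: HM

8 HM SIMD SIMD SSE AVX
3.3.1 Intel AVX (Intel Advanced Vector Extensions) AVX2
AVX is the SIMD extension that succeeds SSE. It doubles the SIMD width of SSE to 256 bits, so one instruction processes eight 32-bit values instead of four. AVX extends the xmm registers to 256-bit ymm registers; the lower 128 bits of each ymm register are the corresponding xmm register. AVX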
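Whereas SSE instructions take 2 operands, AVX instructions can take 3. AVX2, introduced with Haswell, extends integer instructions to 256 bits. Table 3.2 lists examples of AVX instructions.

Table 3.2: AVX

Instruction                     Operation
VMOVDQU ymm1, ymm2/m256         Move 256 bits from ymm2 to ymm1
VPADDW  xmm1, xmm2, xmm3/m128   Add xmm2 and xmm3 in 16-bit units and store the result in xmm1
VPSUBW  xmm1, xmm2, xmm3/m128   Subtract xmm3 from xmm2 in 16-bit units and store the result in xmm1
VPMULLW xmm1, xmm2, xmm3/m128   Multiply xmm2 and xmm3 in 16-bit units and store the low 16 bits of each product in xmm1

SIMD
3.3.2
C C++ SIMD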
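SIMD (Figure 3.5)

    // Scalar version: four separate additions
    int i[4] = {1, 2, 3, 4};
    int j[4] = {5, 6, 7, 8};
    i[0] += j[0];
    i[1] += j[1];
    i[2] += j[2];
    i[3] += j[3];

    // SIMD version: MSVC inline assembly (32-bit x86 builds only)
    int i[4] = {1, 2, 3, 4};
    int j[4] = {5, 6, 7, 8};
    __asm {
        movdqu xmm0, i      ; load i[0..3] into xmm0
        movdqu xmm1, j      ; load j[0..3] into xmm1
        paddd  xmm0, xmm1   ; four 32-bit additions in one instruction
        movdqu i, xmm0      ; store the sums back to i
    }

Figure 3.5:

4 SIMD
4.1
Sandy Bridge x86 SSE AVX x86 64 xmm 16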
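As a hedged aside, the four-element addition of Figure 3.5 can also be written with compiler intrinsics, which avoids inline assembly (MSVC does not support inline assembly in 64-bit builds). The sketch below is not taken from HM: the helper name `add4` and the macro `SKETCH_HAVE_SSE2` are hypothetical, and a scalar fallback is used when SSE2 is not detected at compile time.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not an HM function): i[k] += j[k] for k = 0..3.
inline void add4(std::int32_t* i, const std::int32_t* j) {
#ifdef SKETCH_HAVE_SSE2
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(i));        // movdqu xmm0, i
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(j));  // movdqu xmm1, j
    a = _mm_add_epi32(a, b);                                                 // paddd  xmm0, xmm1
    _mm_storeu_si128(reinterpret_cast<__m128i*>(i), a);                      // movdqu i, xmm0
#else
    for (int k = 0; k < 4; ++k) i[k] += j[k];  // portable fallback without SSE2
#endif
}
```

Calling `add4(i, j)` with the arrays of Figure 3.5 leaves `{6, 8, 10, 12}` in `i`.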
SSE x86 32 8 xmm x86 SIMD
4.2 filter
4.2.1
filter (Figure 4.6)

    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;
    s[n0] = sum >> shift;

    sum: 32bit, shift: 32bit, s[n0]: 16bit, c0 to c7: 16bit

Figure 4.6: filter

s[n0] sum 16bit shift sum 16bit shift 0 6 12
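The scalar computation of Figure 4.6 can be sketched as a small function, which is convenient as a reference when checking a SIMD version against it. This is a minimal sketch, not the HM source: the name `filter8` and its parameters are hypothetical, and the eight input samples are assumed to be contiguous.

```cpp
#include <cstdint>

// Hypothetical scalar model of Figure 4.6 (not an HM function):
//   sum = s[0]*c[0] + ... + s[7]*c[7];  result = sum >> shift
inline std::int16_t filter8(const std::int16_t* s, const std::int16_t* c, int shift) {
    std::int32_t sum = 0;  // 32-bit accumulator, as annotated in the figure
    for (int k = 0; k < 8; ++k) {
        sum += static_cast<std::int32_t>(s[k]) * c[k];
    }
    return static_cast<std::int16_t>(sum >> shift);  // narrowed to 16 bits after the shift
}
```

The coefficients used in any check of this function are arbitrary illustration values, not the HEVC interpolation coefficients.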
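SIMD pmaddwd pmaddwd (Figure 4.7)

[Figure 4.7: xmm0 holds the 16-bit values s[n0] s[n1] ... s[n7] and xmm1 holds c0 c1 ... c7. pmaddwd multiplies them element by element and adds adjacent products, leaving four 32-bit results in xmm0: s[n0]c0 + s[n1]c1, s[n2]c2 + s[n3]c3, s[n4]c4 + s[n5]c5, s[n6]c6 + s[n7]c7.]
Figure 4.7: pmaddwd

32bit 1 8 4 4.6 s[n0] s[n7] s SIMD SIMD punpcklwd punpckhwd (Figure 4.8)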
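The pmaddwd behaviour of Figure 4.7 can be sketched with the SSE2 intrinsic `_mm_madd_epi16`. This is an illustrative sketch rather than the code used in the thesis: `madd_pairs`, `dot8`, and `SKETCH_HAVE_SSE2` are hypothetical names, and a scalar fallback with the same semantics is used when SSE2 is not detected.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: out[k] = s[2k]*c[2k] + s[2k+1]*c[2k+1] for k = 0..3 (pmaddwd).
inline void madd_pairs(const std::int16_t* s, const std::int16_t* c, std::int32_t* out) {
#ifdef SKETCH_HAVE_SSE2
    const __m128i vs = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    const __m128i p  = _mm_madd_epi16(vs, vc);  // pmaddwd: four 32-bit pair sums
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), p);
#else
    for (int k = 0; k < 4; ++k) {
        out[k] = static_cast<std::int32_t>(s[2 * k]) * c[2 * k] +
                 static_cast<std::int32_t>(s[2 * k + 1]) * c[2 * k + 1];
    }
#endif
}

// Hypothetical helper: full 8-tap sum, obtained by adding the four pair sums.
inline std::int32_t dot8(const std::int16_t* s, const std::int16_t* c) {
    std::int32_t p[4];
    madd_pairs(s, c, p);
    return p[0] + p[1] + p[2] + p[3];
}
```

The final horizontal addition is done in scalar code here for clarity; how the thesis combines the four partial sums is not recoverable from the text.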
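[Figure 4.8: before the operation, xmm0 = a0 a2 a4 a6 b1 b3 b5 b7 and xmm1 = a1 a3 a5 a7 b2 b4 b6 b8. punpcklwd yields xmm0 = a0 a1 a2 a3 a4 a5 a6 a7; punpckhwd yields xmm0 = b1 b2 b3 b4 b5 b6 b7 b8.]
Figure 4.8: punpcklwd punpckhwd

punpcklwd interleaves, in 16-bit units, the lower 64 bits of xmm0 (a0 a2 a4 a6) with the lower 64 bits of xmm1 (a1 a3 a5 a7). punpckhwd does the same for the upper 64 bits. punpcklwd punpckhwd (Figure 4.9)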
[Figure 4.9: eight registers holding (a0 ... h0), (a1 ... h1), ..., (a7 ... h7) are rearranged by successive punpcklwd (L) and punpckhwd (H) operations into registers holding a0 ... a7, b0 ... b7, c0 ... c7, d0 ... d7, e0 ... e7, f0 ... f7, g0 ... g7, and h0 ... h7.]
Figure 4.9: punpckhwd punpcklwd

punpcklwd (L) punpckhwd (H) a h xmm a b c d e f g h pmaddwd SIMD SIMD SIMD
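The interleaving performed by punpcklwd and punpckhwd (Figure 4.8) can be sketched with the SSE2 intrinsics `_mm_unpacklo_epi16` and `_mm_unpackhi_epi16`. The helper name `interleave16` and the macro `SKETCH_HAVE_SSE2` are hypothetical, and the sketch covers a single interleave step only, not the full rearrangement of Figure 4.9.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: interleave two vectors of eight 16-bit values.
//   lo = x0[0] x1[0] x0[1] x1[1] x0[2] x1[2] x0[3] x1[3]   (punpcklwd)
//   hi = x0[4] x1[4] x0[5] x1[5] x0[6] x1[6] x0[7] x1[7]   (punpckhwd)
inline void interleave16(const std::int16_t* x0, const std::int16_t* x1,
                         std::int16_t* lo, std::int16_t* hi) {
#ifdef SKETCH_HAVE_SSE2
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x0));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x1));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lo), _mm_unpacklo_epi16(a, b));  // punpcklwd
    _mm_storeu_si128(reinterpret_cast<__m128i*>(hi), _mm_unpackhi_epi16(a, b));  // punpckhwd
#else
    for (int k = 0; k < 4; ++k) {
        lo[2 * k]     = x0[k];
        lo[2 * k + 1] = x1[k];
        hi[2 * k]     = x0[4 + k];
        hi[2 * k + 1] = x1[4 + k];
    }
#endif
}
```

With `x0 = {0, 2, 4, 6, 11, 13, 15, 17}` and `x1 = {1, 3, 5, 7, 12, 14, 16, 18}`, mirroring the a/b layout of Figure 4.8, `lo` becomes 0 to 7 and `hi` becomes 11 to 18.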
4.2.2

    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;
    s[n0] = sum >> shift;

    sum: 32bit, shift: 32bit, s[n0]: 16bit, c0 to c7: 16bit
    += : paddw,  * : pmullw

Figure 4.10: 2 filter SIMD

4.10 SIMD paddw (16-bit addition) pmullw (16-bit multiplication) SIMD (Figure 4.11) 8

Figure 4.11: 8
8 xmm0 (Figures 4.12, 4.13, 4.14) SIMD
Figure 4.12: pmullw 1
Figure 4.13: 4.12 8
Figure 4.14: 8 SIMD
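One plausible reading of the pmullw/paddw scheme of Figures 4.12 to 4.14 is that eight output samples are computed at once: for each of the eight coefficients, eight consecutive inputs are multiplied by that coefficient with pmullw and accumulated with paddw. The sketch below illustrates that reading only; it is an assumption, not the thesis code. The names `filter8x8_16bit` and `SKETCH_HAVE_SSE2` are hypothetical, and all arithmetic wraps at 16 bits, so results match the 32-bit scalar filter only when every sum fits in 16 bits.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: d[n] = sum over k of c[k] * s[n + k], for n = 0..7,
// computed with 16-bit wraparound arithmetic. s must provide 15 samples.
inline void filter8x8_16bit(const std::int16_t* s, const std::int16_t* c, std::int16_t* d) {
#ifdef SKETCH_HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (int k = 0; k < 8; ++k) {
        const __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + k));
        const __m128i ck = _mm_set1_epi16(c[k]);           // broadcast one coefficient
        acc = _mm_add_epi16(acc, _mm_mullo_epi16(v, ck));  // paddw(acc, pmullw(v, ck))
    }
    _mm_storeu_si128(reinterpret_cast<__m128i*>(d), acc);
#else
    for (int n = 0; n < 8; ++n) {
        std::uint32_t sum = 0;  // unsigned accumulation, truncated to 16 bits below
        for (int k = 0; k < 8; ++k) {
            sum += static_cast<std::uint32_t>(static_cast<std::int32_t>(s[n + k]) * c[k]);
        }
        d[n] = static_cast<std::int16_t>(static_cast<std::uint16_t>(sum));
    }
#endif
}
```

The design trade-off is that 16-bit lanes give eight results per register instead of four, at the cost of a restricted value range.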
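pmullw (Figure 4.12) 8 (Figure 4.13) paddw (Figure 4.14) 8 sum sum 16bit sum 32bit SIMD shift=0 sum 16bit
4.3 xgetsad8
4.3.1 SAD
SAD (Sum of Absolute Difference)

$$ \mathrm{SAD} = \sum_{x,y} \left| \mathrm{Diff}(x, y) \right| \tag{2} $$

Here x, y are the pixel coordinates within the block, and Diff(x, y) is the difference between the two blocks at position (x, y). SAD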
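4.3.2

[Figure 4.15: SAD of one row. The inputs are X[0] ... X[7] and Y[0] ... Y[7]; the absolute differences are accumulated into sum.]

    sum += abs(X[0] - Y[0]);
    ...
    sum += abs(X[7] - Y[7]);

    sum: 32bit; X[0], Y[0]: 16bit

Figure 4.15: SAD

xgetsad8 SAD 4.15 SAD 1 (8 ) sum SAD 1 SAD SIMD (Figure 4.16)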
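The one-row SAD of Figure 4.15 and a SIMD counterpart can be sketched as follows. This is an illustration under stated assumptions, not the HM or thesis code: `sad8_scalar`, `sad8_simd`, and `SKETCH_HAVE_SSE2` are hypothetical names. To stay within SSE2, the absolute value is formed as max(d, -d) instead of the SSSE3 instruction pabsw, and the lanes are summed with pmaddwd against a vector of ones instead of paddusw. Sample differences are assumed to fit in 16 bits.

```cpp
#include <cstdint>
#include <cstdlib>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical scalar reference: sum of |X[k] - Y[k]| over one row of 8 samples.
inline int sad8_scalar(const std::int16_t* X, const std::int16_t* Y) {
    int sum = 0;
    for (int k = 0; k < 8; ++k) sum += std::abs(X[k] - Y[k]);
    return sum;
}

// Hypothetical SIMD version of the same row SAD.
inline int sad8_simd(const std::int16_t* X, const std::int16_t* Y) {
#ifdef SKETCH_HAVE_SSE2
    const __m128i x   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(X));
    const __m128i y   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(Y));
    const __m128i d   = _mm_sub_epi16(x, y);                    // psubw
    const __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), d);  // -d
    const __m128i a   = _mm_max_epi16(d, neg);                  // |d|, SSE2 stand-in for pabsw
    const __m128i s32 = _mm_madd_epi16(a, _mm_set1_epi16(1));   // pair sums widened to 32 bits
    alignas(16) std::int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), s32);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return sad8_scalar(X, Y);  // portable fallback without SSE2
#endif
}
```

Comparing `sad8_simd` against `sad8_scalar` on the same inputs is a simple way to confirm that a SIMD rewrite leaves the result unchanged.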
    sad += abs(X[0] - Y[0]);
    ...
    sad += abs(X[7] - Y[7]);

    sad: 32bit; X[0], Y[0]: 16bit
    += : paddusw,  abs() : pabsw,  - : psubw

Figure 4.16: SAD SIMD

paddusw (unsigned saturating 16-bit addition) pabsw (16-bit absolute value) psubw (16-bit subtraction) SIMD SIMD xgetsad16 32
4.4 xcalchads8x8
4.4.1 SATD
SAD SATD (Hadamard transformed SAD)

$$ \mathrm{SATD} = \Bigl( \sum_{x,y} \left| \mathrm{DiffT}(x, y) \right| \Bigr) / 2 \tag{3} $$

x, y
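DiffT(x,y) Diff(x,y) 8x8 2x2 4x4 4x16 16x4 SIMD
4.4.2
xcalchads8x8 8x8 1 (Figure 4.17)

[Figure 4.17: one-dimensional butterfly over elements 0 to 7 in three stages (a, b, c). The value held at each position after each stage is:]

Position   Stage a   Stage b   Stage c
0          a0        b0        c0
1          a4        b2        c1
2          a1        b4        c2
3          a5        b6        c3
4          a2        b1        c4
5          a6        b3        c5
6          a3        b5        c6
7          a7        b7        c7

Figure 4.17: 1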
0 7 1 8 0+1 0-1 a SIMD c 1 1 8x8 4.17 SIMD SIMD phaddw (Figure 4.18)

[Figure 4.18: xmm0 = a[0] a[1] ... a[7] and xmm1 = b[0] b[1] ... b[7]. phaddw adds adjacent 16-bit elements, giving xmm0 = a[0]+a[1], a[2]+a[3], a[4]+a[5], a[6]+a[7], b[0]+b[1], b[2]+b[3], b[4]+b[5], b[6]+b[7].]
Figure 4.18: phaddw

4.17 SIMD 4.18
[Figure 4.19: xmm0 holds elements 0 1 2 3 4 5 6 7. pmullw produces xmm1 = 0 -1 2 -3 4 -5 6 -7, and phaddw of xmm0 and xmm1 yields a0 a1 ... a7. The same pmullw and phaddw steps turn a0 ... a7 (with xmm1 = a0 -a1 a2 -a3 a4 -a5 a6 -a7) into b0 ... b7, and b0 ... b7 (with xmm1 = b0 -b1 b2 -b3 b4 -b5 b6 -b7) into c0 ... c7 in xmm0.]
Figure 4.19: 4.17 SIMD

xmm1 phaddw xmm0 pmullw 4.17 a b c 4.19 xmm0
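The pmullw/phaddw butterfly of Figure 4.19 can be modelled in plain C++ to check its arithmetic without requiring SSSE3. The sketch below is a scalar model under stated assumptions, not the thesis implementation: `Vec8`, `phaddw_model`, `butterfly_stage`, `hadamard8`, and `sum_abs` are hypothetical names. Each stage negates every second element (the pmullw step), then adds adjacent pairs (the phaddw step). Three stages give an 8-point Hadamard transform with the outputs in a permuted order, which does not affect a sum of absolute values.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

using Vec8 = std::array<std::int16_t, 8>;  // models one xmm register of eight 16-bit lanes

// Scalar model of phaddw: adjacent pair sums of a fill lanes 0..3, those of b fill lanes 4..7.
inline Vec8 phaddw_model(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int k = 0; k < 4; ++k) {
        r[k]     = static_cast<std::int16_t>(a[2 * k] + a[2 * k + 1]);
        r[4 + k] = static_cast<std::int16_t>(b[2 * k] + b[2 * k + 1]);
    }
    return r;
}

// One butterfly stage: pair sums in lanes 0..3, pair differences in lanes 4..7.
inline Vec8 butterfly_stage(const Vec8& x) {
    Vec8 neg = x;
    for (int k = 1; k < 8; k += 2) {
        neg[k] = static_cast<std::int16_t>(-neg[k]);  // pmullw by {1, -1, 1, -1, ...}
    }
    return phaddw_model(x, neg);
}

// Three stages: 8-point Hadamard transform (output order is permuted).
inline Vec8 hadamard8(const Vec8& x) {
    return butterfly_stage(butterfly_stage(butterfly_stage(x)));
}

// Sum of absolute values of the transformed coefficients.
inline int sum_abs(const Vec8& v) {
    int s = 0;
    for (std::int16_t e : v) s += std::abs(static_cast<int>(e));
    return s;
}
```

For a constant input of eight ones, all energy lands in lane 0 (value 8). The normalization applied by xcalchads8x8 is outside the scope of this sketch.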
5
5.1
HM HD ( :speed bag 1080p.yuv) 30 SIMD (Table 5.3)

Table 5.3: (30 )

Function        Original (s)   SIMD (s)   Reduction (%)
Overall         2421.9         1601.2     33.9
filter          776.4          258.0      66.8
xcalchads8x8    369.9          179.5      51.5
xgetsad8        65.4           38.8       40.6
xgetsad16       58.0           20.3       64.9
xgetsad32       45.4           9.3        79.5

5.2
filter SIMD 33% filter 40
xgetsad8 32 xgetsad32 xgetsad8 4 SIMD PSNR HM HM 2 SIMD HM 3 SIMD HM 6 SIMD HM Visual Studio SIMD 33% SIMD SSE xmm 8 AVX x86 64 xmm 16
[1] H.265/HEVC 2013
A Visual Studio
URL pgomgr http://d.hatena.ne.jp/crest/20120108/1326049212