



H.265/HEVC 2014 (410808)

16 2020 H.264/AVC 2 H.265/HEVC 1 H.265 JCT-VC HM(HEVC Test Model) HM 5 5 SIMD HM 33%

Abstract In recent years, high-resolution video technology has been developed toward the start of UHD broadcasting in 2020, at 16 times the resolution of HD. In January of last year, standardization was completed for the new video compression standard H.265/HEVC, which offers about twice the compression performance of the conventional standard H.264/AVC. To evaluate efficient HEVC encoding techniques, our laboratory must use the reference software HM (HEVC Test Model) provided by JCT-VC. However, the original HM takes about 5 minutes to encode only 5 frames, which is too slow for repeated experimental evaluation. It is therefore necessary to optimize the HM code and raise evaluation efficiency. I profiled HM to detect the bottlenecks of HEVC encoding and implemented SIMD parallel processing under the condition that compression performance remains unchanged. As a result, the execution time was reduced by about 33 percent compared with the original HM.

1
1.1
1.2
2 HM
2.1
2.2
2.3
2.4
3
3.1
3.2
3.3 SIMD (Single Instruction Multiple Data)
3.3.1 Intel AVX (Intel Advanced Vector Extensions) AVX2
3.3.2
4 SIMD
4.1
4.2 filter
4.2.1
4.2.2
4.3 xgetsad8
4.3.1 SAD
4.3.2
4.4 xcalchads8x8
4.4.1 SATD
4.4.2
5
5.1
5.2
6
A Visual Studio

Figures
2.1
2.2 SAD SATD
3.3 SIMD
3.4 HM
3.5
4.6 filter
4.7 pmaddwd
4.8 punpcklwd punpckhwd
4.9 punpckhwd punpcklwd
4.10 2 filter SIMD
4.11 8
4.12 pmullw 1
4.13 4.12 8
4.14 8 SIMD
4.15 SAD
4.16 SAD SIMD
4.17 1
4.18 phaddw
4.19 4.17 SIMD

Tables
3.1 SSE
3.2 AVX
5.3 (30 )

1 1.1 (1920x1080) 16 (7680x4320) 2020 H.264/AVC (Advanced Video Coding) 10 MPEG (Moving Picture Experts Group) VCEG (Video Coding Experts Group) JCT-VC (Joint Collaborative Team on Video Coding) 2013 H.264/AVC H.265/HEVC (High Efficiency Video Coding) ([1]) Apple iPad H.265

H.265 H.265 1.2 JCT-VC HM(HEVC Test Model) H.265 Codec HM H.265 HM HM 5 5 H.265 H.265 x265 x265 2

HM x265 HM 2 HM 2.1 HM C++ Visual Studio ( 2.1) 2.1: HM SAD 3

2.2 HM xgetsad SAD( ) xcalchads SATD( ) SAD SATD 4 2.2 4

[Figure 2.2: SAD and SATD. The 4x4 blocks X = (a00 ... a33) and Y = (b00 ... b33) give the difference block Z = X - Y = (c00 ... c33); SAD and SATD are both derived from Z.]

xgetsad 8x8 16x16 32x32 64x64 xcalchads 4x4 8x8 2.2 4x4 SAD 48

2.3 filter FIR (Finite Impulse Response) filter

(1)

$$d[n_0] = c_0\, s[n_0] + c_1\, s[n_1] + \cdots + c_7\, s[n_7] \tag{1}$$

8 d c s FIR s c d SAD SATD

2.4 HM 2013 Ver10 HM
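Equation (1) can be written directly as a scalar loop, which is useful as a reference when checking any SIMD version. The following is a minimal sketch, not HM code: the function name `fir8` and the `stride` parameter are hypothetical, and the 32-bit accumulator follows the widths given for `sum` in the filter figures.

```cpp
#include <cassert>
#include <cstdint>

// Reference 8-tap FIR following equation (1):
//   d[n0] = c0*s[n0] + c1*s[n1] + ... + c7*s[n7]
// `stride` (hypothetical) is the distance between consecutive taps:
// 1 for a horizontal filter, the row pitch for a vertical one.
int32_t fir8(const int16_t* s, const int16_t c[8], int stride) {
    assert(stride != 0);
    int32_t sum = 0;  // 32-bit accumulator, samples and coefficients are 16-bit
    for (int k = 0; k < 8; ++k) {
        sum += static_cast<int32_t>(s[k * stride]) * c[k];
    }
    return sum;
}
```

Calling `fir8(s, c, 1)` on eight consecutive samples gives one filtered value; the caller applies any shift afterwards.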

3 3.1 HM SIMD 3.2 HM 7

3.3 SIMD (Single Instruction Multiple Data) SIMD 32 4 SIMD CPU SIMD SSE(Streaming SIMD Extensions) 128 SSE x86 32 8 xmm 128 32 4 int xmm 16 short 8 xmm SIMD SSE SIMD 3.3 8

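[Figure 3.3: SIMD. Four scalar 32-bit additions A0 + B0 = C0, A1 + B1 = C1, A2 + B2 = C2, A3 + B3 = C3 are carried out as one 128-bit SIMD addition: xmm0 = (A0, A1, A2, A3) plus xmm1 = (B0, B1, B2, B3) gives xmm0 = (C0, C1, C2, C3).]

SSE SSE2 SSE3 SSSE3 (Supplemental Streaming SIMD Extensions 3) SSE 3.1

Table 3.1: SSE
Instruction               Operation
MOVDQU xmm1, xmm2/m128    Move 128 bits from xmm2/m128 to xmm1
PADDW xmm1, xmm2/m128     Add the packed 16-bit integers of xmm1 and xmm2/m128
PSUBW xmm1, xmm2/m128     Subtract the packed 16-bit integers of xmm2/m128 from xmm1
PMULLW xmm1, xmm2/m128    Multiply the packed 16-bit integers of xmm1 and xmm2/m128, keeping the low 16 bits
PABSW xmm1, xmm2/m128     Store the absolute values of the packed 16-bit integers of xmm2/m128 in xmm1

HM 3.4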

    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;

Figure 3.4: HM

8 HM SIMD SIMD SSE AVX

3.3.1 Intel AVX (Intel Advanced Vector Extensions) AVX2

AVX SSE SIMD SSE SIMD 2 256 1 8 4 AVX xmm 256 ymm ymm 128 xmm AVX

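SSE 2 3 Haswell AVX2 AVX2 256 AVX 3.2

Table 3.2: AVX
Instruction                       Operation
VMOVDQU ymm1, ymm2/m256           Move 256 bits from ymm2/m256 to ymm1
VPADDW xmm1, xmm2, xmm3/m128      Add the packed 16-bit integers of xmm2 and xmm3/m128 and store the result in xmm1
VPSUBW xmm1, xmm2, xmm3/m128      Subtract the packed 16-bit integers of xmm3/m128 from xmm2 and store the result in xmm1
VPMULLW xmm1, xmm2, xmm3/m128     Multiply the packed 16-bit integers of xmm2 and xmm3/m128 and store the low 16 bits in xmm1

SIMD

3.3.2 C C++ SIMD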

SIMD 3.5

    int i[4] = {1,2,3,4};
    int j[4] = {5,6,7,8};
    i[0] += j[0];
    i[1] += j[1];
    i[2] += j[2];
    i[3] += j[3];

    int i[4] = {1,2,3,4};
    int j[4] = {5,6,7,8};
    __asm {
        movdqu xmm0, i
        movdqu xmm1, j
        paddd  xmm0, xmm1
        movdqu i, xmm0
    }

Figure 3.5

4 SIMD

4.1 Sandy Bridge x86 SSE AVX x86 64 xmm 16
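The four-element addition of Figure 3.5 can also be expressed with compiler intrinsics instead of inline assembly, which is convenient because Visual C++ does not accept inline assembly in 64-bit builds. This is a sketch under that assumption, not the thesis's code; the function name `add4` is hypothetical, and each intrinsic is annotated with the instruction it normally maps to.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Adds four 32-bit integers at once: i[k] += j[k] for k = 0..3.
void add4(int i[4], const int j[4]) {
    assert(i != nullptr && j != nullptr);
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(i));  // movdqu
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(j));  // movdqu
    a = _mm_add_epi32(a, b);                                           // paddd
    _mm_storeu_si128(reinterpret_cast<__m128i*>(i), a);                // movdqu
}
```

With `i = {1,2,3,4}` and `j = {5,6,7,8}` as in the figure, `add4(i, j)` leaves the four sums in `i`.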

SSE x86 32 8 xmm x86 SIMD

4.2 filter

4.2.1 filter 4.6

    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;
    s[n0] = sum >> shift;

    sum (32bit), shift (32bit), s[n0] ... s[n7] (16bit), c0 ... c7 (16bit)

Figure 4.6: filter

s[n0] sum 16bit shift sum 16bit shift 0 6 12

SIMD pmaddwd pmaddwd 4.7

[Figure 4.7: pmaddwd. With xmm0 = (s[n0], s[n1], ..., s[n7]) and xmm1 = (c0, c1, ..., c7) as 16-bit elements, pmaddwd leaves four 32-bit results in xmm0: s[n0]c0 + s[n1]c1, s[n2]c2 + s[n3]c3, s[n4]c4 + s[n5]c5, s[n6]c6 + s[n7]c7.]

32bit 1 8 4 4.6 s[n0] s[n7] s SIMD SIMD punpcklwd punpckhwd 4.8
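The pmaddwd step of Figure 4.7 can be sketched with the `_mm_madd_epi16` intrinsic. The sketch below assumes the eight samples are contiguous in memory; the function name `fir8_madd` is hypothetical, and the final reduction of the four 32-bit lanes by shuffles is one possible choice that the surviving text does not describe.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// One filter output: sum = s[0]*c[0] + ... + s[7]*c[7], then sum >> shift.
int16_t fir8_madd(const int16_t* s, const int16_t* c, int shift) {
    assert(shift >= 0 && shift < 32);
    __m128i vs = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    // pmaddwd: four 32-bit lanes, lane k = s[2k]*c[2k] + s[2k+1]*c[2k+1]
    __m128i prod = _mm_madd_epi16(vs, vc);
    // Add the four 32-bit lanes together (assumed reduction step).
    prod = _mm_add_epi32(prod, _mm_shuffle_epi32(prod, _MM_SHUFFLE(1, 0, 3, 2)));
    prod = _mm_add_epi32(prod, _mm_shuffle_epi32(prod, _MM_SHUFFLE(2, 3, 0, 1)));
    int32_t sum = _mm_cvtsi128_si32(prod);
    return static_cast<int16_t>(sum >> shift);
}
```

Eight multiplications and four of the additions happen in the single `_mm_madd_epi16` call, which is the saving the figure illustrates.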

[Figure 4.8: punpcklwd and punpckhwd. With xmm0 = (a0, a2, a4, a6, b1, b3, b5, b7) and xmm1 = (a1, a3, a5, a7, b2, b4, b6, b8), punpcklwd gives xmm0 = (a0, a1, a2, a3, a4, a5, a6, a7) and punpckhwd gives xmm0 = (b1, b2, b3, b4, b5, b6, b7, b8).]

punpcklwd 64bit xmm0 64bit a0 a2 a4 a6 xmm1 a1 a3 a5 a7 punpckhwd 64bit punpcklwd punpckhwd 4.9
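The interleaving performed by punpcklwd and punpckhwd in Figure 4.8 corresponds to the intrinsics `_mm_unpacklo_epi16` and `_mm_unpackhi_epi16`. The following sketch only demonstrates the data movement; the helper name `interleave16` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Interleaves the 16-bit elements of x and y.
//   lo = x0 y0 x1 y1 x2 y2 x3 y3   (punpcklwd: lower 64 bits of each input)
//   hi = x4 y4 x5 y5 x6 y6 x7 y7   (punpckhwd: upper 64 bits of each input)
void interleave16(const int16_t x[8], const int16_t y[8],
                  int16_t lo[8], int16_t hi[8]) {
    assert(x && y && lo && hi);
    __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lo), _mm_unpacklo_epi16(vx, vy));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(hi), _mm_unpackhi_epi16(vx, vy));
}
```

Feeding it the two registers of the figure reproduces the contiguous sequences a0 ... a7 and b1 ... b8.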

[Figure 4.9: punpckhwd and punpcklwd. Eight registers holding the rows (a0 ... a7), (b0 ... b7), ..., (h0 ... h7) are rearranged by repeated punpcklwd (L) and punpckhwd (H) into the columns (a0 ... h0), (a1 ... h1), ..., (a7 ... h7). The L/H sequences shown are L L for columns 0 and 1, L H for columns 2 and 3, H L for columns 4 and 5, and H H for columns 6 and 7.]

punpcklwd (L) punpckhwd (H) a h xmm a b c d e f g h pmaddwd SIMD SIMD SIMD
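The rearrangement in Figure 4.9 amounts to transposing an 8x8 block of 16-bit values with three rounds of word interleaves. The sketch below is one way to do this that is consistent with the L/H pattern in the figure; the pairing of register k with register k+4 in every round and the function name `transpose8x8` are assumptions, not taken from the thesis.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Transposes an 8x8 block of 16-bit values using only
// punpcklwd (_mm_unpacklo_epi16) and punpckhwd (_mm_unpackhi_epi16).
void transpose8x8(const int16_t in[8][8], int16_t out[8][8]) {
    assert(in != nullptr && out != nullptr);
    __m128i a[8], b[8];
    for (int k = 0; k < 8; ++k) {
        a[k] = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in[k]));
    }
    // Three rounds; each round interleaves register k with register k+4.
    for (int round = 0; round < 3; ++round) {
        for (int k = 0; k < 4; ++k) {
            b[2 * k]     = _mm_unpacklo_epi16(a[k], a[k + 4]);  // L
            b[2 * k + 1] = _mm_unpackhi_epi16(a[k], a[k + 4]);  // H
        }
        for (int k = 0; k < 8; ++k) {
            a[k] = b[k];
        }
    }
    for (int k = 0; k < 8; ++k) {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out[k]), a[k]);
    }
}
```

After the third round, register k holds column k of the input, so samples that were spread across eight rows sit side by side and can be fed to pmaddwd.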

4.2.2

    sum  = s[n0] * c0;
    sum += s[n1] * c1;
    ...
    sum += s[n7] * c7;
    s[n0] = sum >> shift;

    sum (32bit), shift (32bit), s[n0] (16bit), c0 ... c7 (16bit)
    +=: paddw, *: pmullw

Figure 4.10: 2 filter SIMD

4.10 SIMD paddw (16bit ) pmullw (16bit ) SIMD 4.11 8

Figure 4.11: 8
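The pmullw/paddw variant of Figure 4.10 can be read as computing eight neighbouring outputs at once, one per 16-bit lane, with each coefficient broadcast across a register. The sketch below follows that reading, which is an assumption; the function name `fir8_columns16` and the 8x8 input layout are hypothetical. Because every lane is only 16 bits wide, the result is correct only when the true sum fits in 16 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Eight filter outputs at once in 16-bit lanes:
//   d[x] = src[0][x]*c[0] + src[1][x]*c[1] + ... + src[7][x]*c[7], x = 0..7
// Arithmetic wraps modulo 2^16, so the true sum must fit in int16_t.
void fir8_columns16(const int16_t src[8][8], const int16_t c[8], int16_t d[8]) {
    assert(src != nullptr && c != nullptr && d != nullptr);
    __m128i sum = _mm_setzero_si128();
    for (int k = 0; k < 8; ++k) {
        __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src[k]));
        __m128i ck = _mm_set1_epi16(c[k]);                  // broadcast one coefficient
        sum = _mm_add_epi16(sum, _mm_mullo_epi16(px, ck));  // pmullw, then paddw
    }
    _mm_storeu_si128(reinterpret_cast<__m128i*>(d), sum);
}
```

Compared with the pmaddwd version, no interleaving is needed, at the cost of losing the 32-bit accumulator.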

8 xmm0 4.12 4.13 4.14 SIMD 18

4.12: pmullw 1 4.13: 4.12 8 4.14: 8 SIMD 19

pmullw (Figure 4.12) 8 (Figure 4.13) paddw (Figure 4.14) 8 sum sum 16bit sum 32bit SIMD shift=0 sum 16bit

4.3 xgetsad8

4.3.1 SAD

SAD (Sum of Absolute Differences)

$$\mathrm{SAD} = \sum_{x,y} \left|\mathrm{Diff}(x, y)\right| \tag{2}$$

x, y Diff(x,y) (x,y) ( ) SAD

4.3.2

1 SAD

    sum += abs(X[0] - Y[0]);
    ...
    sum += abs(X[7] - Y[7]);

    sum: 32bit; X[0] ... X[7], Y[0] ... Y[7]: 16bit

Figure 4.15: SAD

xgetsad8 SAD 4.15 SAD 1 (8 ) sum SAD 1 SAD SIMD 4.16
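The row-wise SAD accumulation of xgetsad8 can be sketched with intrinsics as follows. This is not the thesis's implementation: the function name `sad8` and its stride parameters are hypothetical, and because pabsw belongs to SSSE3, the sketch substitutes an SSE2 equivalent, max(d, -d), so that it builds with baseline x86-64 settings. It assumes sample values are small enough (for example 8- to 10-bit pixels) that differences and per-lane sums fit in 16 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// SAD over an 8-wide block of `rows` rows.
uint32_t sad8(const int16_t* x, int strideX,
              const int16_t* y, int strideY, int rows) {
    assert(rows > 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // eight unsigned 16-bit partial sums
    for (int r = 0; r < rows; ++r) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + r * strideX));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + r * strideY));
        __m128i d  = _mm_sub_epi16(vx, vy);                      // psubw
        __m128i ad = _mm_max_epi16(d, _mm_sub_epi16(zero, d));   // |d|, SSE2 stand-in for pabsw
        acc = _mm_adds_epu16(acc, ad);                           // paddusw
    }
    // Widen the eight 16-bit sums to 32 bits and add them together.
    __m128i s = _mm_add_epi32(_mm_unpacklo_epi16(acc, zero),
                              _mm_unpackhi_epi16(acc, zero));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return static_cast<uint32_t>(_mm_cvtsi128_si32(s));
}
```

Each loop iteration replaces the eight scalar `abs` and `+=` operations of one row with three vector instructions; the horizontal reduction happens once per block.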

    sad += abs(X[0] - Y[0]);
    ...
    sad += abs(X[7] - Y[7]);

    sad: 32bit; X[0], Y[0]: 16bit
    +=: paddusw, abs(): pabsw, -: psubw

Figure 4.16: SAD SIMD

paddusw ( 16bit ) pabsw (16bit ) psubw (16bit ) SIMD SIMD xgetsad16 32

4.4 xcalchads8x8

4.4.1 SATD

SAD SATD (Hadamard transformed SAD)

$$\mathrm{SATD} = \frac{\sum_{x,y} \left|\mathrm{DiffT}(x, y)\right|}{2} \tag{3}$$

x, y

DiffT(x,y) Diff(x,y) 8x8 2x2 4x4 4x16 16x4 SIMD

4.4.2

xcalchads8x8 8x8 1 4.17

    Input   a    b    c
    0       a0   b0   c0
    1       a4   b2   c1
    2       a1   b4   c2
    3       a5   b6   c3
    4       a2   b1   c4
    5       a6   b3   c5
    6       a3   b5   c6
    7       a7   b7   c7

Figure 4.17: 1

0 7 1 8 0+1 0-1 a SIMD c 1 1 8x8 4.17 SIMD SIMD phaddw 4.18

[Figure 4.18: phaddw. With xmm0 = (a[0], a[1], ..., a[7]) and xmm1 = (b[0], b[1], ..., b[7]) as 16-bit elements, phaddw gives xmm0 = (a[0]+a[1], a[2]+a[3], a[4]+a[5], a[6]+a[7], b[0]+b[1], b[2]+b[3], b[4]+b[5], b[6]+b[7]).]

4.17 SIMD 4.18

    xmm0: 0 1 2 3 4 5 6 7
    xmm1: 0 -1 2 -3 4 -5 6 -7        (pmullw)
        phaddw
    xmm0: a0 a1 a2 a3 a4 a5 a6 a7
    xmm1: a0 -a1 a2 -a3 a4 -a5 a6 -a7    (pmullw)
        phaddw
    xmm0: b0 b1 b2 b3 b4 b5 b6 b7
    xmm1: b0 -b1 b2 -b3 b4 -b5 b6 -b7    (pmullw)
        phaddw
    xmm0: c0 c1 c2 c3 c4 c5 c6 c7

Figure 4.19: 4.17 SIMD

xmm1 phaddw xmm0 pmullw 4.17 a b c 4.19 xmm0
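The data flow of Figures 4.17 to 4.19 can be checked with a scalar emulation. In the sketch below, `butterflyStage` reproduces what one pmullw plus phaddw pair leaves in xmm0 (pairwise sums in the lower half, pairwise differences in the upper half), three stages give an 8-point Hadamard transform in permuted order, and `satd8x8` applies it to rows and columns and divides by 2 as in equation (3). All function names are hypothetical, the code deliberately uses plain C++ rather than SSSE3 intrinsics, and HM's own normalization may differ from equation (3).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// One stage in the layout that phaddw produces in Figure 4.19:
//   out[i] = in[2i] + in[2i+1],  out[4+i] = in[2i] - in[2i+1]
void butterflyStage(const int32_t in[8], int32_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i]     = in[2 * i] + in[2 * i + 1];
        out[4 + i] = in[2 * i] - in[2 * i + 1];
    }
}

// Three stages: 8-point Hadamard transform (output order is permuted).
void hadamard8(int32_t v[8]) {
    int32_t t[8];
    butterflyStage(v, t);  // inputs -> a
    butterflyStage(t, v);  // a -> b
    butterflyStage(v, t);  // b -> c
    for (int k = 0; k < 8; ++k) v[k] = t[k];
}

// SATD of an 8x8 difference block following equation (3).
int32_t satd8x8(const int16_t diff[8][8]) {
    assert(diff != nullptr);
    int32_t m[8][8];
    for (int y = 0; y < 8; ++y) {           // transform every row
        for (int x = 0; x < 8; ++x) m[y][x] = diff[y][x];
        hadamard8(m[y]);
    }
    int32_t sum = 0;
    for (int x = 0; x < 8; ++x) {           // transform every column
        int32_t col[8];
        for (int y = 0; y < 8; ++y) col[y] = m[y][x];
        hadamard8(col);
        for (int y = 0; y < 8; ++y) sum += std::abs(col[y]);
    }
    return sum / 2;
}
```

The permuted output order does not matter for SATD, because only the sum of absolute values is used.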

5

5.1

HM HD ( :speed bag 1080p.yuv) 30 SIMD 5.3

Table 5.3: Execution time (30 frames)
Function         Original (s)   SIMD (s)   Reduction (%)
Overall          2421.9         1601.2     33.9
filter           776.4          258.0      66.8
xcalchads8x8     369.9          179.5      51.5
xgetsad8         65.4           38.8       40.6
xgetsad16        58.0           20.3       64.9
xgetsad32        45.4           9.3        79.5

5.2

filter SIMD 33% filter 40
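The reduction column of Table 5.3 appears to be the relative decrease in execution time, (1 - SIMD / original) x 100. The short sketch below states that formula explicitly; the function name `reductionPercent` is hypothetical, and the formula is inferred from the table values rather than stated in the surviving text.

```cpp
#include <cassert>

// Relative reduction in execution time, in percent.
double reductionPercent(double originalSec, double simdSec) {
    assert(originalSec > 0.0);
    return (1.0 - simdSec / originalSec) * 100.0;
}
```

For the overall row, `reductionPercent(2421.9, 1601.2)` is about 33.9, matching the 33% figure quoted in the abstract; small differences in the last digit of other rows may come from rounding of the published times.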

xgetsad8 32 xgetsad32 xgetsad8 4 SIMD PSNR HM HM 2 SIMD HM 3 SIMD HM 6 SIMD HM Visual Studio SIMD 33% SIMD SSE xmm 8 AVX x86 64 xmm 16 27

[1] H.265/HEVC 2013 28

A Visual Studio URL pgomgr http://d.hatena.ne.jp/crest/20120108/1326049212