HTM RaR HTM 2. 2) 3) HTM 2 3 Yoo 4) HTM Adaptive Transaction Scheduling Akpinar 5) HTM Gaona 6) HTM 3. Read-after-Read HTM 3.1 Read-after-Read Read Wr

Similar documents
先進的計算基盤システムシンポジウム 2 : : TM TM 2.2 LogTM HTM LogTM TM LogTM LogTM LogTM read write read write LogTM Illinois 3 Read after Write (RaW): writ

IPSJ SIG Technical Report Vol.2016-ARC-221 No /8/9 GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC Dalv

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

FabHetero FabHetero FabHetero FabCache FabCache SPEC2000INT IPC FabCache 0.076%

GPGPU

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC

IPSJ SIG Technical Report Vol.2009-DPS-141 No.20 Vol.2009-GN-73 No.20 Vol.2009-EIP-46 No /11/27 1. MIERUKEN 1 2 MIERUKEN MIERUKEN MIERUKEN: Spe

258 5) GPS 1 GPS 6) GPS DP 7) 8) 10) GPS GPS ) GPS Global Positioning System

A Feasibility Study of Direct-Mapping-Type Parallel Processing Method to Solve Linear Equations in Load Flow Calculations Hiroaki Inayoshi, Non-member

(a) 1 (b) 3. Gilbert Pernicka[2] Treibitz Schechner[3] Narasimhan [4] Kim [5] Nayar [6] [7][8][9] 2. X X X [10] [11] L L t L s L = L t + L s

B HNS 7)8) HNS ( ( ) 7)8) (SOA) HNS HNS 4) HNS ( ) ( ) 1 TV power, channel, volume power true( ON) false( OFF) boolean channel volume int

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

17 Proposal of an Algorithm of Image Extraction and Research on Improvement of a Man-machine Interface of Food Intake Measuring System

IPSJ SIG Technical Report Vol.2012-CG-148 No /8/29 3DCG 1,a) On rigid body animation taking into account the 3D computer graphics came

2017 (413812)

IPSJ SIG Technical Report Vol.2011-EC-19 No /3/ ,.,., Peg-Scope Viewer,,.,,,,. Utilization of Watching Logs for Support of Multi-

2). 3) 4) 1.2 NICTNICT DCRA Dihedral Corner Reflector micro-arraysdcra DCRA DCRA DCRA 3D DCRA PC USB PC PC ON / OFF Velleman K8055 K8055 K8055

58 10

3_23.dvi

2) TA Hercules CAA 5 [6], [7] CAA BOSS [8] 2. C II C. ( 1 ) C. ( 2 ). ( 3 ) 100. ( 4 ) () HTML NFS Hercules ( )

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L


EQUIVALENT TRANSFORMATION TECHNIQUE FOR ISLANDING DETECTION METHODS OF SYNCHRONOUS GENERATOR -REACTIVE POWER PERTURBATION METHODS USING AVR OR SVC- Ju

& Vol.5 No (Oct. 2015) TV 1,2,a) , Augmented TV TV AR Augmented Reality 3DCG TV Estimation of TV Screen Position and Ro

The copyright of this material is retained by the Information Processing Society of Japan (IPSJ). The material has been made available on the website

Vol.54 No (July 2013) [9] [10] [11] [12], [13] 1 Fig. 1 Flowchart of the proposed system. c 2013 Information

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing

Lytro [11] The Franken Camera [12] 2.2 Creative Coding Community Creative Coding Community [13]-[19] Sketch Fork 2.3 [20]-[23] 3. ourcam 3.1 ou

情報処理学会研究報告 IPSJ SIG Technical Report Vol.2013-HPC-139 No /5/29 Gfarm/Pwrake NICT NICT 10TB 100TB CPU I/O HPC I/O NICT Gf

DPA,, ShareLog 3) 4) 2.2 Strino Strino STRain-based user Interface with tacticle of elastic Natural ObjectsStrino 1 Strino ) PC Log-Log (2007 6)

第62巻 第1号 平成24年4月/石こうを用いた木材ペレット

( ) [1] [4] ( ) 2. [5] [6] Piano Tutor[7] [1], [2], [8], [9] Radiobaton[10] Two Finger Piano[11] Coloring-in Piano[12] ism[13] MIDI MIDI 1 Fig. 1 Syst

Express5800/320Fa-L/320Fa-LR

Table 1. Reluctance equalization design. Fig. 2. Voltage vector of LSynRM. Fig. 4. Analytical model. Table 2. Specifications of analytical models. Fig

Express5800/R110a-1Hユーザーズガイド

28 Docker Design and Implementation of Program Evaluation System Using Docker Virtualized Environment

<95DB8C9288E397C389C88A E696E6462>

6 2. AUTOSAR 2.1 AUTOSAR AUTOSAR ECU OSEK/VDX 3) OSEK/VDX OS AUTOSAR AUTOSAR ECU AUTOSAR 1 AUTOSAR BSW (Basic Software) (Runtime Environment) Applicat

fiš„v8.dvi

THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE {s-kasihr, wakamiya,

28 Horizontal angle correction using straight line detection in an equirectangular image

2 ( ) i

Vol.55 No (Jan. 2014) saccess 6 saccess 7 saccess 2. [3] p.33 * B (A) (B) (C) (D) (E) (F) *1 [3], [4] Web PDF a m

Table 1. Assumed performance of a water electrol ysis plant. Fig. 1. Structure of a proposed power generation system utilizing waste heat from factori

IPSJ SIG Technical Report Vol.2013-ARC-203 No /2/1 SMYLE OpenCL (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 1

mobicom.dvi


Fig. 3 Flow diagram of image processing. Black rectangle in the photo indicates the processing area (128 x 32 pixels).

Transcription:

1 1, 1 1 1 1 Readafter-Read Read-after-Read 66.9% A Speed-Up Technique for Hardware Transactional Memories by Reducing Concurrency Considering Conflicting Addresses Koshiro Hashimoto, 1 Masamichi Eto, 1, 1 Shoichiro Horiba, 1 Tomoaki Tsumura 1 and Hiroshi Matsuo 1 Lock-based thread synchronization techniques are commonly used in parallel programming on multi-core processors. However, lock can cause deadlocks and poor scalabilities. Hence, Transactional Memory (TM) has been proposed and studied for lock-free synchronization. On TM, transactions are executed speculatively unless a memory access conflict is caused, hence the performance of TM is generally better than that of lock. However, if speculative execution is continued when a Read-after-Read (RaR) access occurs, following stalls can be wasted. In this paper, we propose a speed-up technique by reducing concurrency considering conflicting addresses. The result of the experiment shows that proposed method improves the performance 66.9% in maximum. 1. Transactional Memory: TM 1) TM 1 Nagoya Institute of Technology 1 Presently with Central Japan Railway Company TM Hardware Transactional Memory: HTM HTM Read Read Read-after-Read RaR c 2013 Information Processing Society of Japan 162

HTM RaR HTM 2. 2) 3) HTM 2 3 Yoo 4) HTM Adaptive Transaction Scheduling Akpinar 5) HTM Gaona 6) HTM 3. Read-after-Read HTM 3.1 Read-after-Read Read Write Test-and-Set Write Read Read Write 1 Abort NACK ACK NACK stall Futile Stall Read-after-Read Futile Stall 1 2 LogTM 7) eager conflict detection HTM NACK NACK Futile Stall HTM 3.2 Read-after-Read Futile Stall 3.2.1 Futile Stall Write Read Read Write Read RaR RaR Read Read 2 c 2013 Information Processing Society of Japan 163

t4 t5 t6 t7 t8 2 ing stall NACK Wakeup Wakeup NACK ACK Core3 Thread3 ing Abort RaR Futile Stall 3 3 Thread3 RaR t4 Thread3 A Write 1 Futile Stall 3.2.2 Futile Stall Thread3 Wakeup 2 t5 Thread3 Wakeup Thread3 t6 t7 2 t5 A Thread3 t8 Read Wakeup 2 Thread3 2 Thread3 Read Thread3 RaR 4. 4.1 3 Register for RaR addresses RaR-addr. : Read Write Queue for order of resumption O-que. : RaR RaR-addr. Read. Register for resumption R-res. : RaR Read Write RaR-addr. Read RaR-addr. RaR Read O-que. RaR O-que. R-res. c 2013 Information Processing Society of Japan 164

t4 stall Nack Nack A Nack Abort A Nack Core3 Thread3 Abort Cache () Registration Registration R W addr () Check 1 0 A read bit 0 0 (t4) Check read bit Cache (Core3) R W addr 1 0 A 0 0 3 RaR RaR-addr. R-res. 4.2 Read-after-Read RaR 4.2.1 RaR-addr. 3 3 3 RaR-addr. Write-after-Read WaR Thread3 NACK Thread3 WaR Thread3 A R t4 R HTM Read R Thread3 Write A Read A RaR-addr. 4.2.2 RaR-addr. 4.2.1 RaR-addr. RaR 4 3 3 Read RaR-addr. 4 RaR-addr. A A RaR-addr. A RaR-addr. A Read Read Write A Read RaR Wakeup Thread3 Thread3 RaR t4 4.2.3 RaR-addr. RaR-addr. 4.1 RaR-addr. Read Write 1 Read Wrire RaR-addr. N N 1 2 4 64bit 128bit 256bit 32 256byte 512byte 1Kbyte c 2013 Information Processing Society of Japan 165

t4 4 ing A Core3 Thread3 A ing RaR-addr. RaR FIFO RaRaddr. RaR 5 4.3 4.2 5 4 RaR Thread3 RaR Read O-que. RaR R-res. Thread3 RaR O-que. Thread3 Thread3 R-res. O-que. R-res. O- que. O-que. 1 3 O-que 1 Wakeup t5 Wakeup R-res. 2 ted t7 t8 (t6) R-res. 2 O-que. t4 t5 t6 5 ing (t5) 1 3 Wakeup ted (t7) Wakeup Core3 Thread3 (t8) 2 ted ing t6 ted O-que. 3 Thread3 Wakeup t7 Wakeup Thread3 Thread3 R-res. ted t8 ted O-que. O-que. O-que. R-res. 32 1 4bit O-que. 4bit 31 4bit 32 32 512bytes 5. 5.1 HTM LogTM 7) Simics 8) 3.0.31 GEMS 9) 2.1.1 Simics c 2013 Information Processing Society of Japan 166

1 Processor SPARC V9 #cores 32 cores clock 1 GHz issue width single issue order in-order non-memory IPC 1 D1 cache 32 KBytes ways 4 ways latency 1 cycle D2 cache 8 MBytes ways 8 ways latency 20 cycles Memory 8 GBytes latency 450 cycles Interconnect network latency 14 cycles 2 GEMS SPLASH-2 STAMP All (R 1) 29.2% 19.1% 4.9% 22.6% 66.9% 39.9% 9.3% 66.9% (R 2) 29.3% 19.9% 5.2% 23.0% 66.9% 41.5% 9.9% 66.9% (R 4) 29.5% 19.9% 5.0% 23.1% 66.9% 41.1% 9.3% 66.9% (R ) 29.8% 22.4% 4.7% 24.0% 66.9% 40.9% 8.8% 66.9% GEMS 32 SPARC V9 OS Solaris 10 1 GEMS microbench SPLASH-2 STAMP 10 STAMP 2 Gramoli 10) 5.2 6 2 6 Non trans Good trans Bad trans Aborting Backoff Stall Barrier Magicing 5 (B) LogTM (R 1) RaR-addr. 1 (R 2) RaR-addr. 2 (R 4) RaR-addr. 4 (R ) LogTM B 1 R 1 R RaRaddr. Read Write 11) 10 95% 3 5.4 Write Read Futile Stall Futile Stall Btree B RaR-addr. R 1 R 1 R 1 22.6% 66.9% 5.3 GEMS microbench GEMS microbench Deque Prioque Backoff Read Write R 1 Futile Stall Backoff Btree Btree 2 Tx.I Tx.J Tx.I Read Write Tx.J Write Read Tx.I Tx.I Tx.J c 2013 Information Processing Society of Japan 167

Ratio of cycles 1.2 1 0.8 0.6 0.4 0.2 0 (B)conventional LogTM (baseline) (R1)RaR-access Detection (N = 1) (R2)RaR-access Detection (N = 2) (R4)RaR-access Detection (N = 4) (R )RaR-access Detection Magicing Barrier Stall Backoff Aborting Bad_trans Good_trans Non_trans GEMS microbench SPLASH-2 STAMP 6 Tx.J Write Read Btree SPLASH-2 SPLASH-2 Raytrace Backoff Raytrace Read Write 3 Futile Stall Futile Stall Backoff Cholesky Barrier Futile Stall Radiosity Read Write RaR-addr. RaR Radiosity RaR-addr. STAMP STAMP Kmeans Kmeans Read Write Kmeans Futile Stall 5.4 3 4 5 RaR-addr. 1 c 2013 Information Processing Society of Japan 168

3 R 1 RaR-addr. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 876,235 Barnes 86,413 Kmeans 148,084 Contention 562,844 Cholesky 296,708 Vacation 684,826 Deque 7,152 Radiosity 115,865 - - Prioque 72,095 Raytrace 1,257,086 - - 4 R 1 O-que. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 21,137 Barnes 417 Kmeans 270 Contention 130 Cholesky 3,751 Vacation 7 Deque 3,210 Radiosity 2,991 - - Prioque 3,022 Raytrace 38,524 - - 5 R 1 R-res. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 22,113 Barnes 448 Kmeans 324 Contention 152 Cholesky 5,888 Vacation 10 Deque 3,303 Radiosity 4,052 - - Prioque 3,232 Raytrace 39,456 - - 1 cycle O-que. 1 1 4bit 1 2 cycle R-res. 1 RaR-addr. 1 cycle Raytrace 0.2% 6. Read Write Read-after-Read HTM Futile Stall GEMS microbench SPLASH-2 STAMP HTM 66.9% 1) Herlihy, M. et al.: Transactional Memory: Architectural Support for Lock-Free Data Structures, Proc. 20th Int l Symp. on Computer Architecture (ISCA 93), pp.289 300 (1993). 2) J.Moravan, M. et al.: Supporting Nested Transactional Memory in LogTM, Proc. 12th Int l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.1 12 (2006). 3) M, L., G, M. and A, G.: A Dynamically Adaptable Hardware Transactional Memory, Microarchitecture(MICRO), 2010 43rd Annual IEEE/ACM, pp.27 38 (2010). 4) Yoo, R.M. and Lee, H.-H.S.: Adaptive Transaction Scheduling for Transactional Memory Systems, Proc. 20th Annual Symp. on Parallelism in Algorithms and Architectures (SPAA 08), pp.169 178 (2008). 5) Akpinar, E. et al.: A Comprehensive Study of Conflict Resolution Policies in Hardware Transactional Memory, Proc. 6th ACM SIG- PLAN Workshop on Transactional Computing (TRANSACT 11) (2011). 6) Gaona, E. et al.: Dynamic Serialization Improving Energy Consumption in Eager- Eager Hardware Transactional Memory Systems, Proc. Parallel, Distributed and Network- Based Processing 2012 20th Euromicro International Conference (PDP 12), pp.221 228 (2012). 7) Moore, K.E. et al.: LogTM: Log-based Transactional Memory, Proc. 12th Int l Symp. on High-Performance Computer Architecture, pp. 254 265 (2006). 8) Magnusson, P.S. et al.: Simics: A Full System Simulation Platform, Computer, Vol.35, No.2, pp.50 58 (2002). 9) Martin, M. M. K. et al.: Multifacet s General Execution-driven Multiprocessor Simulator (GEMS) Toolset, ACM SIGARCH Computer Architecture News, Vol.33, No.4, pp.92 99 (2005). 10) Gramoli, V. and Guerraoui, R.: Transactions - stamp, http://lpdserver.epfl.ch/ transactions/wiki/doku.php?id=stamp (2011). 11) Alameldeen, A. R. and Wood, D. A.: Variability in Architectural Simulations of Multi- Threaded Workloads, Proc. 9th Int l Symp. on High-Performance Computer Architecture (HPCA 03), pp.7 18 (2003). c 2013 Information Processing Society of Japan 169