先進的計算基盤システムシンポジウム 2 : : TM TM 2.2 LogTM HTM LogTM TM LogTM LogTM LogTM read write read write LogTM Illinois 3 Read after Write (RaW): writ

Similar documents
HTM RaR HTM 2. 2) 3) HTM 2 3 Yoo 4) HTM Adaptive Transaction Scheduling Akpinar 5) HTM Gaona 6) HTM 3. Read-after-Read HTM 3.1 Read-after-Read Read Wr

FabHetero FabHetero FabHetero FabCache FabCache SPEC2000INT IPC FabCache 0.076%

IPSJ SIG Technical Report Vol.2016-ARC-221 No /8/9 GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC Dalv

58 10

A Feasibility Study of Direct-Mapping-Type Parallel Processing Method to Solve Linear Equations in Load Flow Calculations Hiroaki Inayoshi, Non-member

258 5) GPS 1 GPS 6) GPS DP 7) 8) 10) GPS GPS ) GPS Global Positioning System

GPGPU

Fig. 3 Flow diagram of image processing. Black rectangle in the photo indicates the processing area (128 x 32 pixels).

Table 1. Reluctance equalization design. Fig. 2. Voltage vector of LSynRM. Fig. 4. Analytical model. Table 2. Specifications of analytical models. Fig

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

B HNS 7)8) HNS ( ( ) 7)8) (SOA) HNS HNS 4) HNS ( ) ( ) 1 TV power, channel, volume power true( ON) false( OFF) boolean channel volume int

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

17 Proposal of an Algorithm of Image Extraction and Research on Improvement of a Man-machine Interface of Food Intake Measuring System

(a) 1 (b) 3. Gilbert Pernicka[2] Treibitz Schechner[3] Narasimhan [4] Kim [5] Nayar [6] [7][8][9] 2. X X X [10] [11] L L t L s L = L t + L s

& Vol.5 No (Oct. 2015) TV 1,2,a) , Augmented TV TV AR Augmented Reality 3DCG TV Estimation of TV Screen Position and Ro

EQUIVALENT TRANSFORMATION TECHNIQUE FOR ISLANDING DETECTION METHODS OF SYNCHRONOUS GENERATOR -REACTIVE POWER PERTURBATION METHODS USING AVR OR SVC- Ju

1 [1, 2, 3, 4, 5, 8, 9, 10, 12, 15] The Boston Public Schools system, BPS (Deferred Acceptance system, DA) (Top Trading Cycles system, TTC) cf. [13] [

The 15th Game Programming Workshop 2010 Magic Bitboard Magic Bitboard Bitboard Magic Bitboard Bitboard Magic Bitboard Magic Bitboard Magic Bitbo

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC

28 Docker Design and Implementation of Program Evaluation System Using Docker Virtualized Environment

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L

25 Removal of the fricative sounds that occur in the electronic stethoscope

MmUm+FopX m Mm+Mop F-Mm(Fop-Mopum)M m+mop MSuS+FX S M S+MOb Fs-Ms(Mobus-Fex)M s+mob Fig. 1 Particle model of single degree of freedom master/ slave sy

IPSJ SIG Technical Report Vol.2009-DPS-141 No.20 Vol.2009-GN-73 No.20 Vol.2009-EIP-46 No /11/27 1. MIERUKEN 1 2 MIERUKEN MIERUKEN MIERUKEN: Spe

Vol.54 No (July 2013) [9] [10] [11] [12], [13] 1 Fig. 1 Flowchart of the proposed system. c 2013 Information

Express5800/320Fa-L/320Fa-LR

Table 1. Assumed performance of a water electrol ysis plant. Fig. 1. Structure of a proposed power generation system utilizing waste heat from factori

~~~~~~~~~~~~~~~~~~ wait Call CPU time 1, latch: library cache 7, latch: library cache lock 4, job scheduler co


2). 3) 4) 1.2 NICTNICT DCRA Dihedral Corner Reflector micro-arraysdcra DCRA DCRA DCRA 3D DCRA PC USB PC PC ON / OFF Velleman K8055 K8055 K8055

第62巻 第1号 平成24年4月/石こうを用いた木材ペレット

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing

..,,,, , ( ) 3.,., 3.,., 500, 233.,, 3,,.,, i

1 Web [2] Web [3] [4] [5], [6] [7] [8] S.W. [9] 3. MeetingShelf Web MeetingShelf MeetingShelf (1) (2) (3) (4) (5) Web MeetingShelf

3_23.dvi

2017 (413812)

20 Method for Recognizing Expression Considering Fuzzy Based on Optical Flow

Web Web Web Web Web, i

IPSJ SIG Technical Report Vol.2013-ARC-203 No /2/1 SMYLE OpenCL (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 1

IPSJ SIG Technical Report Vol.2012-CG-148 No /8/29 3DCG 1,a) On rigid body animation taking into account the 3D computer graphics came

DPA,, ShareLog 3) 4) 2.2 Strino Strino STRain-based user Interface with tacticle of elastic Natural ObjectsStrino 1 Strino ) PC Log-Log (2007 6)

mobicom.dvi

第122号.indd

情報処理学会研究報告 IPSJ SIG Technical Report Vol.2013-HPC-139 No /5/29 Gfarm/Pwrake NICT NICT 10TB 100TB CPU I/O HPC I/O NICT Gf

塗装深み感の要因解析

Vol. 42 No MUC-6 6) 90% 2) MUC-6 MET-1 7),8) 7 90% 1 MUC IREX-NE 9) 10),11) 1) MUCMET 12) IREX-NE 13) ARPA 1987 MUC 1992 TREC IREX-N

2 122

6 2. AUTOSAR 2.1 AUTOSAR AUTOSAR ECU OSEK/VDX 3) OSEK/VDX OS AUTOSAR AUTOSAR ECU AUTOSAR 1 AUTOSAR BSW (Basic Software) (Runtime Environment) Applicat

A Construction of Hybrid Adaptive Control System Using a Fixed Compensator Shiro MASUDA*, Hiroshi OKAMOTO** and Akira INOUE* In this paper, we propose

fiš„v8.dvi

IPSJ SIG Technical Report Vol.2014-IOT-27 No.14 Vol.2014-SPT-11 No /10/10 1,a) 2 zabbix Consideration of a system to support understanding of f

28 Horizontal angle correction using straight line detection in an equirectangular image

Express5800/R110a-1Hユーザーズガイド

6_27.dvi

7,, i

Transcription:

先進的計算基盤システムシンポジウム LogTM 1 1 1, 1 1 1 LogTM LogTM possible cycle starving writer starving writer LogTM 18.7% 6.6% A Speed-Up Technique for LogTM by Preventing Recurrence of Conflicts Masamichi Eto, 1 Shoichiro Horiba, 1 Hiroki Asai, 1 Tomoaki Tsumura 1 and Hiroshi Matsuo 1 Lock-based thread synchronization techniques are commonly used in parallel programming on multi-core processors. However, lock can cause deadlocks and poor scalabilities. Hence, LogTM has been proposed and studied for lock-free synchronization. To solve conflicts in LogTM, a flag called possible cycle is used. However, the performance can be decrease because of some conflict patterns. This paper proposes a method for dynamically changing the priority of threads to solve the conflict patterns. Our model reduces the number of aborts and reccurence of aborts. The result of the experiment shows that proposing method improve the performance 18.7% in maximum and 6.6% in average. 1. 1) LogTM 2) 1 Nagoya Institute of Technology 1 Presently with DENSO Corporation LogTM possible cycle starving writer starving writer starving writer 2. Transactional Memory TM TM HTM 1 LogTM 2.1 TM TM, 116

先進的計算基盤システムシンポジウム 2 : : TM TM 2.2 LogTM HTM LogTM 2.2.1 TM LogTM LogTM 2.2.2 LogTM read write read write LogTM Illinois 3 Read after Write (RaW): write Write after Read (WaR): read Write after Write (WaW): write / read write ACK NACK NACK ACK 1 t3 t5 2 thr.1 thr.2 Tx.Y thr.1 thr.2 Tx.Y thr.1 ST A thr.2 ST B thr.1 LD B thr.1 t1 thr.2 NACK NACK thr.1 t3 thr.1 thr.2 t4 117

先進的計算基盤システムシンポジウム t1 t2 t3 t4 t5 t6 t7 time Core1 thr.1 ST A 1 stalled ACK B Core2 thr.2 Tx.Y ST B possible_cycle = 1 LogTM thr.2 thr.1 LogTM Transactional Lock Removal 3) 1 t2 NACK possible cycle NACK t5 Tx.Y thr.2 t6 thr.2 Tx.Y thr.1 B t7 2.2.3 LogTM exponential backoff magic waiting Exponential backoff magic waiting NACK 3. 4) 5) LogTM 6) ( CP) LogTM CP LogTM possible cycle flag Williullah 7) 1 LogTM 4. 3 4.1 LogTM 1 starving writer ST LD 2 3 (thr.1 3 ) thr.1 thr.3 thr.2 LD A ST Tx.Y thr.1 thr.2 ST A 118

先進的計算基盤システムシンポジウム possible_cycle t1 = 1 t2 t3 time Core1 Core2 Core3 thr.1 thr.2 thr.3 2 Tx.Y ST B stalled ST A Starving... ACK B starving writer possible_cycle = 1 Tx.Y t1 thr.1 thr.2 ST A thr.3 t2 2.2.2 thr.1 thr.3 2 thr.2 thr.1 thr.2 thr.1 t3 thr.3 A thr.2 thr.1 A LD ST starving writer LD ST 2.2.3 exponential backoff magic waiting starving writer starving writer starving writer 4.2 starving writer LD reader ST writer reader magic waiting writer starving writer WaR 2.2.2 2 starving writer 3 1 writer 2 starving writer magic waiting writer I LD ST WaR II writer 2 2.2.2 possible cycle 2 II 2 2 3 starving writer thr.2 reader thr.1 writer ST B t1 WaR I thr.2 Tx.2 thr.1 2 t2 t3 {B, A} II ST starving writer LD 2.2.3 magic waiting starving writer 2 writer writer Tx.W reader Tx.R starving Tx.W 3 Tx.W Tx.R possible cycle 2 1 II 119

先進的計算基盤システムシンポジウム t1 t2 t3 t4 time Core1 thr.1 ST A ST B (ST B) stalled (starving writer) ACK B Core2 thr.2 Tx.Y Magic Waiting 3 Starving Writer Fig. 3 Proposed model with a starving writer. II : 2 possible cycle II 1 I starving writer 3 Tx.Y A t2 t3 II thr.2 magic waiting 3 writer WaR writer 2 2 II II : 2 2 1 I 5. 3 5.1 4.2 LogTM 3 WaR flags: WaR LD ST n n bit Conflict Table (C-Tbl): M-W flags 1 2 2 C-Tbl1 C-Tbl2 2 C-Tbl 1 3 C-Tbl 1 1 Magic Waiting flags (M-W flags): Magic waiting 2n bit 2 C-Tbl1 1 C-Tbl2 2 2 magic waiting n n 1 WaR flags n bit M-W flags 2n bit 3n bit C-Tbl 64 bit n RAM n = 32 C-Tbl 16kB 2 4.2 1 C-Tbl 1 M-W flags 1 bit 1 3 4.2 C-Tbl 256B 5.2 4 starving writer thr.2 1 thr.2 thr.1 ST B WaR t1 thr.2 thr.1 WaR flags 1 C-Tbl 120

先進的計算基盤システムシンポジウム t1 t2 t3 t4 time Core1 thr.1 ST A ST B (ST B) stalled (starving writer) 4 Core2 thr.2 Tx.Y C-Tbl2[1] A WaR[1] 1 C-Tbl1[1] == B M-W[1] = 10 ACK B Magic Waiting WaR[1] 1 C-Tbl1[1] B C-Tbl2[1] == A M-W[1] = 11 B thr.2 C-Tbl1 C-Tbl1[1] B thr.2 t2 A thr.1 C-Tbl2 thr.1 C-Tbl2 A thr.2 Tx.Y WaR flags thr.2 Tx.Y B t3 t1 WaR flags B M- W flags 10 thr.2 t4 t3 C-Tbl2 A M-W flags M-W flags 11 magic waiting 2 1 C-Tbl 1 C-Tbl1 B M-W flags 1 t3 M-W flags 3 2 C-Tbl 2 1 Processor SPARC V9 number of cores 32 cores frequency 1 GHz issue width single-issue issue order in-order non-memory IPC 1 D1 cache 32 KBytes ways 4 ways latency 1 cycle D2 cache 8 MBytes ways 8 ways latency 20 cycles Memory 4 GBytes latency 450 cycles Interconnect network latency 14 cycles 2 6. 6.1 LogTM TM Simics 8) 3.0.31 GEMS 9) 2.1.1 Simics GEMS 32 SPARC V9 OS Solaris10 1 GEMS microbench SPLASH2 10) STAMP 11) 12 1 32 OS 1 31 STAMP 2 STAMP 16 3 GEMS partial rollback 6.2 5 2 3 5 (B) LogTM (S 1) 4.2 1 (S 2) 4.2 2 (S 3) 4.2 3 (B) 1 Non trans Good trans Bad trans /- 121

Ratio of execution cycles 先進的計算基盤システムシンポジウム 1.2 (B) traditional LogTM (baseline) (S1) prevents Starving Writer (condition #1) (S2) prevents Starving Writer (condition #2) (S3) prevents Starving Writer (condition #3) 1 Magic_Waiting 0.8 0.6 Barrier Stall Backoff 0.4 0.2 ing Bad_trans Good_trans 0 Btree Contention Deque Prioque Slist Barnes Cholesky Radiosity Raytrace Genome Kmeans Vacation Non_trans GEMS / 31threads SPLASH2 / 31threads STAMP / 16threads 5 GEMS SPLASH2 STAMP 2 GEMS SPLASH2 STAMP all (S 1) 3.9% 5.7% 1.3% 3.9% 8.4% 12.6% 1.9% 12.6% (S 2) 6.7% 10.2% 1.8% 6.7% 17.0% 18.6% 2.3% 18.6% (S 3) 6.6% 10.3% 1.7% 6.6% 17.3% 18.7% 1.9% 18.7% 3 GEMS SPLASH2 STAMP all (S 1) 37.1% 25.5% 40.0% 34.2% 76.2% 45.7% 67.7% 76.2% (S 2) 46.6% 44.7% 47.9% 46.3% 86.8% 67.1% 72.9% 86.8% (S 3) 46.1% 45.4% 47.6% 46.3% 86.6% 67.4% 72.9% 86.6% ing Stall MagicWaiting Barrier Backoff magic waiting exponential backoff 10 95% Slist 3 Contention Deque Genome Kmeans Vacation (B) 72.9% Kmeans 15.1% Deque magic waiting Kmeans 0.1% Btree Prioque Barnes Radiosity Bad trans ing Stall Backoff Btree Prioque starving writer starving writer Backoff Btree (S 2) (S 3) 86.8% 86.7% 1/4 Barnes (S 2) (S 3) (B) Barrier 25% 122

先進的計算基盤システムシンポジウム Radiosity magic waiting Prioque Raytrace Bad trans Bad trans Bad trans (S 2) (S 3) (S 1) possible cycle (S 2) (S 3) (S 3) 7. LogTM starving writer GEMS microbench SPLASH-2 STAMP backoff LogTM 86.6% 18.7% 6.6% 1 starving writer LogTM backoff magic waiting 1) Herlihy, M. et al.: Transactional Memory: Architectural Support for Lock-Free Data Structures, Proc. of 20th Int l Symp. on Computer Architecture (ISCA 93), pp.289 300 (1993). 2) Moore, K. E., Bobba, J., Moravan, M. J., Hill, M. D. and Wood, D. A.: LogTM: Logbased Transactional Memory, Proc. of 12th Int l Symp. on High-Performance Computer Architecture, pp.254 265 (2006). 3) Rajwar, R. and Goodman, J. R.: Transactional Lock-Free Execution of Lock-Based Programs, Proc of 10th Symp. on Architectural Support for Programming Languages and Operating Systems, pp.5 17 (2002). 4) J.Moravan, M. et al.: Supporting Nested Transactional Memory in LogTM, Proc. of the 12th Int l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.1 12 (2006). 5) Vol.108, No.ICD-28, pp.81 86 (2008). 6) SACSIS2011 pp.324 331 (2011). 7) Waliullah, M.M. and Stenstrom, P.: Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems, Proc. of Int l Symp. on Parallel and Distributed (IPDPS), pp.1 11 (2008). 8) Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., Larsson, F., Moestedt, A. and Werner, B.: Simics: A Full System Simulation Platform, Computer, Vol.35, No.2, pp.50 58 (2002). 9) Martin, M. M. K. et al.: Multifacet s General Execution-driven Multiprocessor Simulator (GEMS) Toolset, ACM SIGARCH Computer Architecture News, Vol.33, No.4, pp.92 99 (2005). 10) Woo, S. C. et al.: The SPLASH-2 Programs: Characterization and Methodological Considerations, Proc of 22nd Int l. Symp. on Computer Architecture (ISCA 95), pp.24 36 (1995). 11) Minh, C. C., Chung, J., Kozyrakis, C. and Olukotun, K.: STAMP: Stanford Transactional Applications for Multi-, Proc. of IEEE Int l Symp. on Workload Characterization (IISWC 08) (2008). 123