先進的計算基盤システムシンポジウム LogTM 1 1 1, 1 1 1 LogTM LogTM possible cycle starving writer starving writer LogTM 18.7% 6.6% A Speed-Up Technique for LogTM by Preventing Recurrence of Conflicts Masamichi Eto, 1 Shoichiro Horiba, 1 Hiroki Asai, 1 Tomoaki Tsumura 1 and Hiroshi Matsuo 1 Lock-based thread synchronization techniques are commonly used in parallel programming on multi-core processors. However, lock can cause deadlocks and poor scalabilities. Hence, LogTM has been proposed and studied for lock-free synchronization. To solve conflicts in LogTM, a flag called possible cycle is used. However, the performance can be decrease because of some conflict patterns. This paper proposes a method for dynamically changing the priority of threads to solve the conflict patterns. Our model reduces the number of aborts and reccurence of aborts. The result of the experiment shows that proposing method improve the performance 18.7% in maximum and 6.6% in average. 1. 1) LogTM 2) 1 Nagoya Institute of Technology 1 Presently with DENSO Corporation LogTM possible cycle starving writer starving writer starving writer 2. Transactional Memory TM TM HTM 1 LogTM 2.1 TM TM, 116
先進的計算基盤システムシンポジウム 2 : : TM TM 2.2 LogTM HTM LogTM 2.2.1 TM LogTM LogTM 2.2.2 LogTM read write read write LogTM Illinois 3 Read after Write (RaW): write Write after Read (WaR): read Write after Write (WaW): write / read write ACK NACK NACK ACK 1 t3 t5 2 thr.1 thr.2 Tx.Y thr.1 thr.2 Tx.Y thr.1 ST A thr.2 ST B thr.1 LD B thr.1 t1 thr.2 NACK NACK thr.1 t3 thr.1 thr.2 t4 117
先進的計算基盤システムシンポジウム t1 t2 t3 t4 t5 t6 t7 time Core1 thr.1 ST A 1 stalled ACK B Core2 thr.2 Tx.Y ST B possible_cycle = 1 LogTM thr.2 thr.1 LogTM Transactional Lock Removal 3) 1 t2 NACK possible cycle NACK t5 Tx.Y thr.2 t6 thr.2 Tx.Y thr.1 B t7 2.2.3 LogTM exponential backoff magic waiting Exponential backoff magic waiting NACK 3. 4) 5) LogTM 6) ( CP) LogTM CP LogTM possible cycle flag Williullah 7) 1 LogTM 4. 3 4.1 LogTM 1 starving writer ST LD 2 3 (thr.1 3 ) thr.1 thr.3 thr.2 LD A ST Tx.Y thr.1 thr.2 ST A 118
先進的計算基盤システムシンポジウム possible_cycle t1 = 1 t2 t3 time Core1 Core2 Core3 thr.1 thr.2 thr.3 2 Tx.Y ST B stalled ST A Starving... ACK B starving writer possible_cycle = 1 Tx.Y t1 thr.1 thr.2 ST A thr.3 t2 2.2.2 thr.1 thr.3 2 thr.2 thr.1 thr.2 thr.1 t3 thr.3 A thr.2 thr.1 A LD ST starving writer LD ST 2.2.3 exponential backoff magic waiting starving writer starving writer starving writer 4.2 starving writer LD reader ST writer reader magic waiting writer starving writer WaR 2.2.2 2 starving writer 3 1 writer 2 starving writer magic waiting writer I LD ST WaR II writer 2 2.2.2 possible cycle 2 II 2 2 3 starving writer thr.2 reader thr.1 writer ST B t1 WaR I thr.2 Tx.2 thr.1 2 t2 t3 {B, A} II ST starving writer LD 2.2.3 magic waiting starving writer 2 writer writer Tx.W reader Tx.R starving Tx.W 3 Tx.W Tx.R possible cycle 2 1 II 119
先進的計算基盤システムシンポジウム t1 t2 t3 t4 time Core1 thr.1 ST A ST B (ST B) stalled (starving writer) ACK B Core2 thr.2 Tx.Y Magic Waiting 3 Starving Writer Fig. 3 Proposed model with a starving writer. II : 2 possible cycle II 1 I starving writer 3 Tx.Y A t2 t3 II thr.2 magic waiting 3 writer WaR writer 2 2 II II : 2 2 1 I 5. 3 5.1 4.2 LogTM 3 WaR flags: WaR LD ST n n bit Conflict Table (C-Tbl): M-W flags 1 2 2 C-Tbl1 C-Tbl2 2 C-Tbl 1 3 C-Tbl 1 1 Magic Waiting flags (M-W flags): Magic waiting 2n bit 2 C-Tbl1 1 C-Tbl2 2 2 magic waiting n n 1 WaR flags n bit M-W flags 2n bit 3n bit C-Tbl 64 bit n RAM n = 32 C-Tbl 16kB 2 4.2 1 C-Tbl 1 M-W flags 1 bit 1 3 4.2 C-Tbl 256B 5.2 4 starving writer thr.2 1 thr.2 thr.1 ST B WaR t1 thr.2 thr.1 WaR flags 1 C-Tbl 120
先進的計算基盤システムシンポジウム t1 t2 t3 t4 time Core1 thr.1 ST A ST B (ST B) stalled (starving writer) 4 Core2 thr.2 Tx.Y C-Tbl2[1] A WaR[1] 1 C-Tbl1[1] == B M-W[1] = 10 ACK B Magic Waiting WaR[1] 1 C-Tbl1[1] B C-Tbl2[1] == A M-W[1] = 11 B thr.2 C-Tbl1 C-Tbl1[1] B thr.2 t2 A thr.1 C-Tbl2 thr.1 C-Tbl2 A thr.2 Tx.Y WaR flags thr.2 Tx.Y B t3 t1 WaR flags B M- W flags 10 thr.2 t4 t3 C-Tbl2 A M-W flags M-W flags 11 magic waiting 2 1 C-Tbl 1 C-Tbl1 B M-W flags 1 t3 M-W flags 3 2 C-Tbl 2 1 Processor SPARC V9 number of cores 32 cores frequency 1 GHz issue width single-issue issue order in-order non-memory IPC 1 D1 cache 32 KBytes ways 4 ways latency 1 cycle D2 cache 8 MBytes ways 8 ways latency 20 cycles Memory 4 GBytes latency 450 cycles Interconnect network latency 14 cycles 2 6. 6.1 LogTM TM Simics 8) 3.0.31 GEMS 9) 2.1.1 Simics GEMS 32 SPARC V9 OS Solaris10 1 GEMS microbench SPLASH2 10) STAMP 11) 12 1 32 OS 1 31 STAMP 2 STAMP 16 3 GEMS partial rollback 6.2 5 2 3 5 (B) LogTM (S 1) 4.2 1 (S 2) 4.2 2 (S 3) 4.2 3 (B) 1 Non trans Good trans Bad trans /- 121
Ratio of execution cycles 先進的計算基盤システムシンポジウム 1.2 (B) traditional LogTM (baseline) (S1) prevents Starving Writer (condition #1) (S2) prevents Starving Writer (condition #2) (S3) prevents Starving Writer (condition #3) 1 Magic_Waiting 0.8 0.6 Barrier Stall Backoff 0.4 0.2 ing Bad_trans Good_trans 0 Btree Contention Deque Prioque Slist Barnes Cholesky Radiosity Raytrace Genome Kmeans Vacation Non_trans GEMS / 31threads SPLASH2 / 31threads STAMP / 16threads 5 GEMS SPLASH2 STAMP 2 GEMS SPLASH2 STAMP all (S 1) 3.9% 5.7% 1.3% 3.9% 8.4% 12.6% 1.9% 12.6% (S 2) 6.7% 10.2% 1.8% 6.7% 17.0% 18.6% 2.3% 18.6% (S 3) 6.6% 10.3% 1.7% 6.6% 17.3% 18.7% 1.9% 18.7% 3 GEMS SPLASH2 STAMP all (S 1) 37.1% 25.5% 40.0% 34.2% 76.2% 45.7% 67.7% 76.2% (S 2) 46.6% 44.7% 47.9% 46.3% 86.8% 67.1% 72.9% 86.8% (S 3) 46.1% 45.4% 47.6% 46.3% 86.6% 67.4% 72.9% 86.6% ing Stall MagicWaiting Barrier Backoff magic waiting exponential backoff 10 95% Slist 3 Contention Deque Genome Kmeans Vacation (B) 72.9% Kmeans 15.1% Deque magic waiting Kmeans 0.1% Btree Prioque Barnes Radiosity Bad trans ing Stall Backoff Btree Prioque starving writer starving writer Backoff Btree (S 2) (S 3) 86.8% 86.7% 1/4 Barnes (S 2) (S 3) (B) Barrier 25% 122
先進的計算基盤システムシンポジウム Radiosity magic waiting Prioque Raytrace Bad trans Bad trans Bad trans (S 2) (S 3) (S 1) possible cycle (S 2) (S 3) (S 3) 7. LogTM starving writer GEMS microbench SPLASH-2 STAMP backoff LogTM 86.6% 18.7% 6.6% 1 starving writer LogTM backoff magic waiting 1) Herlihy, M. et al.: Transactional Memory: Architectural Support for Lock-Free Data Structures, Proc. of 20th Int l Symp. on Computer Architecture (ISCA 93), pp.289 300 (1993). 2) Moore, K. E., Bobba, J., Moravan, M. J., Hill, M. D. and Wood, D. A.: LogTM: Logbased Transactional Memory, Proc. of 12th Int l Symp. on High-Performance Computer Architecture, pp.254 265 (2006). 3) Rajwar, R. and Goodman, J. R.: Transactional Lock-Free Execution of Lock-Based Programs, Proc of 10th Symp. on Architectural Support for Programming Languages and Operating Systems, pp.5 17 (2002). 4) J.Moravan, M. et al.: Supporting Nested Transactional Memory in LogTM, Proc. of the 12th Int l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.1 12 (2006). 5) Vol.108, No.ICD-28, pp.81 86 (2008). 6) SACSIS2011 pp.324 331 (2011). 7) Waliullah, M.M. and Stenstrom, P.: Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems, Proc. of Int l Symp. on Parallel and Distributed (IPDPS), pp.1 11 (2008). 8) Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., Larsson, F., Moestedt, A. and Werner, B.: Simics: A Full System Simulation Platform, Computer, Vol.35, No.2, pp.50 58 (2002). 9) Martin, M. M. K. et al.: Multifacet s General Execution-driven Multiprocessor Simulator (GEMS) Toolset, ACM SIGARCH Computer Architecture News, Vol.33, No.4, pp.92 99 (2005). 10) Woo, S. C. et al.: The SPLASH-2 Programs: Characterization and Methodological Considerations, Proc of 22nd Int l. Symp. on Computer Architecture (ISCA 95), pp.24 36 (1995). 11) Minh, C. C., Chung, J., Kozyrakis, C. and Olukotun, K.: STAMP: Stanford Transactional Applications for Multi-, Proc. of IEEE Int l Symp. on Workload Characterization (IISWC 08) (2008). 123