1 1, 1 1 1 1 Readafter-Read Read-after-Read 66.9% A Speed-Up Technique for Hardware Transactional Memories by Reducing Concurrency Considering Conflicting Addresses Koshiro Hashimoto, 1 Masamichi Eto, 1, 1 Shoichiro Horiba, 1 Tomoaki Tsumura 1 and Hiroshi Matsuo 1 Lock-based thread synchronization techniques are commonly used in parallel programming on multi-core processors. However, lock can cause deadlocks and poor scalabilities. Hence, Transactional Memory (TM) has been proposed and studied for lock-free synchronization. On TM, transactions are executed speculatively unless a memory access conflict is caused, hence the performance of TM is generally better than that of lock. However, if speculative execution is continued when a Read-after-Read (RaR) access occurs, following stalls can be wasted. In this paper, we propose a speed-up technique by reducing concurrency considering conflicting addresses. The result of the experiment shows that proposed method improves the performance 66.9% in maximum. 1. Transactional Memory: TM 1) TM 1 Nagoya Institute of Technology 1 Presently with Central Japan Railway Company TM Hardware Transactional Memory: HTM HTM Read Read Read-after-Read RaR c 2013 Information Processing Society of Japan 162
HTM RaR HTM 2. 2) 3) HTM 2 3 Yoo 4) HTM Adaptive Transaction Scheduling Akpinar 5) HTM Gaona 6) HTM 3. Read-after-Read HTM 3.1 Read-after-Read Read Write Test-and-Set Write Read Read Write 1 Abort NACK ACK NACK stall Futile Stall Read-after-Read Futile Stall 1 2 LogTM 7) eager conflict detection HTM NACK NACK Futile Stall HTM 3.2 Read-after-Read Futile Stall 3.2.1 Futile Stall Write Read Read Write Read RaR RaR Read Read 2 c 2013 Information Processing Society of Japan 163
t4 t5 t6 t7 t8 2 ing stall NACK Wakeup Wakeup NACK ACK Core3 Thread3 ing Abort RaR Futile Stall 3 3 Thread3 RaR t4 Thread3 A Write 1 Futile Stall 3.2.2 Futile Stall Thread3 Wakeup 2 t5 Thread3 Wakeup Thread3 t6 t7 2 t5 A Thread3 t8 Read Wakeup 2 Thread3 2 Thread3 Read Thread3 RaR 4. 4.1 3 Register for RaR addresses RaR-addr. : Read Write Queue for order of resumption O-que. : RaR RaR-addr. Read. Register for resumption R-res. : RaR Read Write RaR-addr. Read RaR-addr. RaR Read O-que. RaR O-que. R-res. c 2013 Information Processing Society of Japan 164
t4 stall Nack Nack A Nack Abort A Nack Core3 Thread3 Abort Cache () Registration Registration R W addr () Check 1 0 A read bit 0 0 (t4) Check read bit Cache (Core3) R W addr 1 0 A 0 0 3 RaR RaR-addr. R-res. 4.2 Read-after-Read RaR 4.2.1 RaR-addr. 3 3 3 RaR-addr. Write-after-Read WaR Thread3 NACK Thread3 WaR Thread3 A R t4 R HTM Read R Thread3 Write A Read A RaR-addr. 4.2.2 RaR-addr. 4.2.1 RaR-addr. RaR 4 3 3 Read RaR-addr. 4 RaR-addr. A A RaR-addr. A RaR-addr. A Read Read Write A Read RaR Wakeup Thread3 Thread3 RaR t4 4.2.3 RaR-addr. RaR-addr. 4.1 RaR-addr. Read Write 1 Read Wrire RaR-addr. N N 1 2 4 64bit 128bit 256bit 32 256byte 512byte 1Kbyte c 2013 Information Processing Society of Japan 165
t4 4 ing A Core3 Thread3 A ing RaR-addr. RaR FIFO RaRaddr. RaR 5 4.3 4.2 5 4 RaR Thread3 RaR Read O-que. RaR R-res. Thread3 RaR O-que. Thread3 Thread3 R-res. O-que. R-res. O- que. O-que. 1 3 O-que 1 Wakeup t5 Wakeup R-res. 2 ted t7 t8 (t6) R-res. 2 O-que. t4 t5 t6 5 ing (t5) 1 3 Wakeup ted (t7) Wakeup Core3 Thread3 (t8) 2 ted ing t6 ted O-que. 3 Thread3 Wakeup t7 Wakeup Thread3 Thread3 R-res. ted t8 ted O-que. O-que. O-que. R-res. 32 1 4bit O-que. 4bit 31 4bit 32 32 512bytes 5. 5.1 HTM LogTM 7) Simics 8) 3.0.31 GEMS 9) 2.1.1 Simics c 2013 Information Processing Society of Japan 166
1 Processor SPARC V9 #cores 32 cores clock 1 GHz issue width single issue order in-order non-memory IPC 1 D1 cache 32 KBytes ways 4 ways latency 1 cycle D2 cache 8 MBytes ways 8 ways latency 20 cycles Memory 8 GBytes latency 450 cycles Interconnect network latency 14 cycles 2 GEMS SPLASH-2 STAMP All (R 1) 29.2% 19.1% 4.9% 22.6% 66.9% 39.9% 9.3% 66.9% (R 2) 29.3% 19.9% 5.2% 23.0% 66.9% 41.5% 9.9% 66.9% (R 4) 29.5% 19.9% 5.0% 23.1% 66.9% 41.1% 9.3% 66.9% (R ) 29.8% 22.4% 4.7% 24.0% 66.9% 40.9% 8.8% 66.9% GEMS 32 SPARC V9 OS Solaris 10 1 GEMS microbench SPLASH-2 STAMP 10 STAMP 2 Gramoli 10) 5.2 6 2 6 Non trans Good trans Bad trans Aborting Backoff Stall Barrier Magicing 5 (B) LogTM (R 1) RaR-addr. 1 (R 2) RaR-addr. 2 (R 4) RaR-addr. 4 (R ) LogTM B 1 R 1 R RaRaddr. Read Write 11) 10 95% 3 5.4 Write Read Futile Stall Futile Stall Btree B RaR-addr. R 1 R 1 R 1 22.6% 66.9% 5.3 GEMS microbench GEMS microbench Deque Prioque Backoff Read Write R 1 Futile Stall Backoff Btree Btree 2 Tx.I Tx.J Tx.I Read Write Tx.J Write Read Tx.I Tx.I Tx.J c 2013 Information Processing Society of Japan 167
Ratio of cycles 1.2 1 0.8 0.6 0.4 0.2 0 (B)conventional LogTM (baseline) (R1)RaR-access Detection (N = 1) (R2)RaR-access Detection (N = 2) (R4)RaR-access Detection (N = 4) (R )RaR-access Detection Magicing Barrier Stall Backoff Aborting Bad_trans Good_trans Non_trans GEMS microbench SPLASH-2 STAMP 6 Tx.J Write Read Btree SPLASH-2 SPLASH-2 Raytrace Backoff Raytrace Read Write 3 Futile Stall Futile Stall Backoff Cholesky Barrier Futile Stall Radiosity Read Write RaR-addr. RaR Radiosity RaR-addr. STAMP STAMP Kmeans Kmeans Read Write Kmeans Futile Stall 5.4 3 4 5 RaR-addr. 1 c 2013 Information Processing Society of Japan 168
3 R 1 RaR-addr. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 876,235 Barnes 86,413 Kmeans 148,084 Contention 562,844 Cholesky 296,708 Vacation 684,826 Deque 7,152 Radiosity 115,865 - - Prioque 72,095 Raytrace 1,257,086 - - 4 R 1 O-que. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 21,137 Barnes 417 Kmeans 270 Contention 130 Cholesky 3,751 Vacation 7 Deque 3,210 Radiosity 2,991 - - Prioque 3,022 Raytrace 38,524 - - 5 R 1 R-res. GEMS R 1 SPLASH-2 R 1 STAMP R 1 Btree 22,113 Barnes 448 Kmeans 324 Contention 152 Cholesky 5,888 Vacation 10 Deque 3,303 Radiosity 4,052 - - Prioque 3,232 Raytrace 39,456 - - 1 cycle O-que. 1 1 4bit 1 2 cycle R-res. 1 RaR-addr. 1 cycle Raytrace 0.2% 6. Read Write Read-after-Read HTM Futile Stall GEMS microbench SPLASH-2 STAMP HTM 66.9% 1) Herlihy, M. et al.: Transactional Memory: Architectural Support for Lock-Free Data Structures, Proc. 20th Int l Symp. on Computer Architecture (ISCA 93), pp.289 300 (1993). 2) J.Moravan, M. et al.: Supporting Nested Transactional Memory in LogTM, Proc. 12th Int l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.1 12 (2006). 3) M, L., G, M. and A, G.: A Dynamically Adaptable Hardware Transactional Memory, Microarchitecture(MICRO), 2010 43rd Annual IEEE/ACM, pp.27 38 (2010). 4) Yoo, R.M. and Lee, H.-H.S.: Adaptive Transaction Scheduling for Transactional Memory Systems, Proc. 20th Annual Symp. on Parallelism in Algorithms and Architectures (SPAA 08), pp.169 178 (2008). 5) Akpinar, E. et al.: A Comprehensive Study of Conflict Resolution Policies in Hardware Transactional Memory, Proc. 6th ACM SIG- PLAN Workshop on Transactional Computing (TRANSACT 11) (2011). 6) Gaona, E. et al.: Dynamic Serialization Improving Energy Consumption in Eager- Eager Hardware Transactional Memory Systems, Proc. Parallel, Distributed and Network- Based Processing 2012 20th Euromicro International Conference (PDP 12), pp.221 228 (2012). 7) Moore, K.E. et al.: LogTM: Log-based Transactional Memory, Proc. 12th Int l Symp. on High-Performance Computer Architecture, pp. 254 265 (2006). 8) Magnusson, P.S. et al.: Simics: A Full System Simulation Platform, Computer, Vol.35, No.2, pp.50 58 (2002). 9) Martin, M. M. K. et al.: Multifacet s General Execution-driven Multiprocessor Simulator (GEMS) Toolset, ACM SIGARCH Computer Architecture News, Vol.33, No.4, pp.92 99 (2005). 10) Gramoli, V. and Guerraoui, R.: Transactions - stamp, http://lpdserver.epfl.ch/ transactions/wiki/doku.php?id=stamp (2011). 11) Alameldeen, A. R. and Wood, D. A.: Variability in Architectural Simulations of Multi- Threaded Workloads, Proc. 9th Int l Symp. on High-Performance Computer Architecture (HPCA 03), pp.7 18 (2003). c 2013 Information Processing Society of Japan 169