2015 (413M505)
Abstract

Single-ISA heterogeneous multi-core architectures, which consist of diverse superscalar cores, are of increasing importance in processor architecture. Running each program on a superscalar core suited to its characteristics reduces energy consumption and improves performance. However, designing a heterogeneous multi-core processor requires a large design and verification effort. FabHetero has therefore been proposed: it automatically generates diverse heterogeneous multi-core processors using FabScalar, FabCache, and FabBus, which generate various designs of superscalar cores, cache systems, and flexible shared bus systems, respectively. This paper presents the details of FabCache and shows that the caches it generates work correctly for arbitrary parameter values such as cache capacity, line size, associativity, access latency, and line transmission width between cache hierarchy levels. This paper also covers performance estimation and the physical design of the caches. According to the estimation results, FabCache generates cache systems with almost the same area and power consumption as a hand-tuned cache, despite the overhead of automatic generation: the L1 instruction and data cache controllers, including the extra circuits, account for only 3.5% of the area, and power consumption increases by less than 0.1% compared with a hand-tuned cache.
Contents
1
2
  2.1
  2.2
3 FabHetero
  3.1
  3.2
  3.3
4
  4.1 FPGA
  4.2 LEON
5
  5.1 FabCache
  5.2
  5.3
6
  6.1
  6.2 L1
  6.3 L1
  6.4 L2
  6.5
    6.5.1
    6.5.2
  6.6 FabCache
7
  7.1
  7.2
  7.3
8
A
B
List of Figures
2.1 Homogeneous and Heterogeneous multi-core
2.2 Example of Cache System
3.3 FabHetero
6.4 Implementation of interleaved L1 instruction cache
6.5 L1 Data Cache
6.6 L2 cache design
6.7 Fetch image of superscalar
6.8 Interleaved memory
6.9 Interleaved memory
6.10 Interleaved memory
6.11 Miss status holding register
7.12 Cache hit rate
7.13 L1Icache Power Consumption
7.14 L1Dcache Power Consumption
7.15 Chip image of L1 instruction cache
7.16 Chip image of L1 data cache
List of Tables
5.1 Available designs in FabCache
7.2 EDA environment
7.3 Delay
1 [1, 2, 3, 4] RTL(Register Transfer Level) FabScalar [5, 6, 7, 8, 9, 10, 11] FabScalar FabScalar 1
FabHetero [12] FabHetero 3 FabScalar FabCache [13] FabBus [14, 15] 3 FabHetero 4 5 FabCache 6 7 FabCache 2
2
2.1
Figure 2.1: Homogeneous and Heterogeneous multi-core
CPU 1 ( 2.1 ) ( 2.1 )
2.2 50% 2.2 L1 L2 2 3 4
Figure 2.2: Example of Cache System (CPU, L1, L2).
2.1 1
Figure 3.3: FabHetero (FabScalar: Core0, Core1, Core2; FabCache: L1 instruction and L1 data caches, L2 cache as L2-I and L2-D; FabBus: interconnect to the last level cache or main memory).
3 FabHetero FabCache FabHetero FabHetero FabHetero
FabScalar FabCache FabBus 3 3.3 FabHetero 3 (Core 0, Core 1, Core 3) Core 0 L1 L1 Core 1 L1 L2 Core 2 L1 L2 L1 L2 FabHetero FabScalar, FabCache, FabBus FabScalar, FabBus 8
FabCache 3.1 FabScalar N.K.Choudhary RTL [16] FabScalar ILP 1 8 Load store unit (LSU) load store queue (LSQ) FabScalar 3.2 FabBus 9
FabBus FabBus ARM AMBA 3.3 [1, 2, 3, 5, 17] B. de Abreu Silva [17] 10
FabCache 11
4 [18, 19, 20, 21] 4.1 FPGA FPGA P. Yiannacouras [18] 2 3 12
4.2 LEON Leon4 [19] 2 n [20, 21] 13
5 5.1 FabCache 3.3 3.3 L1 3.3 L1 L2 3.3 L1L2 L2 L2 FabCache 1 FabScalar 14
FabCache FabCache ASIC 5.2 5.1 FabCache 1 2 1 2 n 3 15
Table 5.1: Available designs in FabCache
(L = line size, S = set size, W = associativity)

L1 instruction cache
  Dimensions:
    L = (fetch width to 2^n) × 4 (byte)
    S = 1 to 2^n
    W = 1, 2^n-way, full
  Specific microarchitectures:
    two banks interleaved vs. non-interleaved
    1 to 8 fetch width
    Interface with L2 cache: line size transmission vs. burst transmission
    enable vs. disable

L1 data cache
  Dimensions:
    L = (1 to 2^n) × 4 (byte)
    S = 1 to 2^n
    W = 1, 2^n-way, full
    MSHR = 1 to 8 entry
  Specific microarchitectures:
    Miss handling: blocking vs. non-blocking
    Writing approach: write-through vs. write-back
    Interface with L2 cache: line size transmission vs. burst transmission
    enable vs. disable

L2 cache
  Dimensions:
    L = wider than higher hierarchy
    S = 1 to 2^n
    W = 1, 2^n-way, full
  Specific microarchitectures:
    dedicated instruction and data vs. unified
    Cache coherency: MOESI vs. MOSI vs. MEI vs. dedicated for each processor core
    Interface with shared memory: processor num to/from one vs. processor num to/from multi-ported memory
    Cache replacement policy: LRU vs. Pseudo-LRU
    enable vs. disable

5.3 1
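The dimension ranges in Table 5.1 can be read as constraints on a cache configuration. The sketch below illustrates that reading for the L1 data cache; it is not FabCache's parameter file. The class `CacheConfig`, its field names, and the interpretation of "1 to 2^n" as "any power of two" are assumptions made for illustration. Capacity follows the usual relation line size × sets × ways.

```python
from dataclasses import dataclass


def is_power_of_two(x: int) -> bool:
    """True for 1, 2, 4, 8, ..."""
    return x > 0 and (x & (x - 1)) == 0


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical L1 data cache configuration (names are illustrative)."""
    line_words: int    # line size in 4-byte words, L = line_words * 4 bytes
    sets: int          # S
    ways: int          # W (ways == 1 is direct-mapped)
    mshr_entries: int  # Table 5.1 lists 1 to 8 entries

    def validate(self) -> None:
        # Assumed reading of Table 5.1: each dimension is a power of two.
        for name in ("line_words", "sets", "ways"):
            if not is_power_of_two(getattr(self, name)):
                raise ValueError(f"{name} must be a power of two")
        if not 1 <= self.mshr_entries <= 8:
            raise ValueError("mshr_entries must be between 1 and 8")

    @property
    def line_bytes(self) -> int:
        return self.line_words * 4

    @property
    def capacity_bytes(self) -> int:
        return self.line_bytes * self.sets * self.ways


# Example: 32-byte lines, 64 sets, 4 ways gives an 8 KB cache.
cfg = CacheConfig(line_words=8, sets=64, ways=4, mshr_entries=4)
cfg.validate()
print(cfg.capacity_bytes)
```

A fully associative cache corresponds to `sets=1` in this sketch, with all lines held in the ways of a single set.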
I/O I/O FabCache LRU 100 % 17
6 L1 L2 6.1 FabCache SystemVerilog 1 RTL P. Yiannacouras [18] RTL RTL 18
FabCache RTL 1 2 6.2 FabCache 19
6.2 L1
Figure 6.4: Implementation of interleaved L1 instruction cache (PC fields: Tag, Index, Bank select bit, Line select bit, byte offset; the even and odd banks are addressed with {Tag, Index} and {Tag, Index} + 1 according to the Bank select bit; the line size, set size and way size of each bank are defined in the parameter file; the two fetched lines (a, b, c, d) and (e, f, g, h) are swapped into (a, b, c, d, e, f, g, h) or (e, f, g, h, a, b, c, d) and then squeezed to N (fetch width) instructions).
6.4 L1 L1 2 swap squeeze
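The PC fields named in Figure 6.4 can be illustrated with a small address-splitting sketch. The field order (Tag, Index, Bank select bit, Line select bits, byte offset, from most to least significant) follows the figure; the function name `split_pc`, the bit widths, and the 2-bit byte offset for 4-byte instructions are assumptions chosen for the example, not values taken from FabCache.

```python
from typing import NamedTuple


class PcFields(NamedTuple):
    tag: int
    index: int
    bank_select: int
    line_select: int
    byte_offset: int


def split_pc(pc: int, index_bits: int, line_select_bits: int,
             byte_offset_bits: int = 2) -> PcFields:
    """Split a PC into the fields of Figure 6.4, least significant first."""
    byte_offset = pc & ((1 << byte_offset_bits) - 1)
    pc >>= byte_offset_bits
    line_select = pc & ((1 << line_select_bits) - 1)  # word within the line
    pc >>= line_select_bits
    bank_select = pc & 1                              # 0 = even bank, 1 = odd bank
    pc >>= 1
    index = pc & ((1 << index_bits) - 1)              # set within a bank
    tag = pc >> index_bits
    return PcFields(tag, index, bank_select, line_select, byte_offset)


# Four-word lines (2 line-select bits) and 16 sets per bank (4 index bits).
fields = split_pc(0x1234, index_bits=4, line_select_bits=2)
print(fields)
```

With these widths, `0x1234` splits into tag 9, index 1, bank select 1, line select 1 and byte offset 0. A bank select of 1 means the fetch starts in the odd bank, so the even bank is read at {Tag, Index} + 1.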
Figure 6.5: L1 Data Cache (requests from the CPU pass through stage 1 and stage 2 with load, store and memory buffers; a tag memory controller with tag SRAM produces the hit/miss signal and a data memory controller with data SRAM returns the requested data; missed addresses go to the miss status holding register, which exchanges store requests, store acks, replay requests and filled data with the L2 cache and raises the stall signal).
bank select bit Line select bit 2 ( 3 1 4 ) squeeze
6.3 L1
6.5 L1 L1 2 miss status holding register (MSHR)
2 LRU Holding register 2 MSHR MSHR 1 MSHR L2 22
Figure 6.6: L2 cache design (superset interface design; dedicated design: L1 inst. and L1 data connect to L2 inst. and L2 data; unified design: arbiter0 and arbiter1 select L2 bank0 when bank select bit == 0 and L2 bank1 when bank select bit == 1).
6.4 L2
L2 2 2 6.6 L2 L2 2 L2 1 L2 L2 L2 2 L2 L1 2 3
L2 L1-L2 L2 L1 L2 L2 2 (2 ) SRAM L2 24
Figure 6.7: Fetch image of superscalar (the core fetches FETCH_WIDTH instructions, 1 to 8, from the cache).
Figure 6.8: Interleaved memory (normal memory: Line0 = a b c d, Line1 = e f g h, Line2 = i j k l, Line3 = m n o p, up to Line7; even bank: Line0, Line2; odd bank: Line1, Line3; example fetch: c d e f).
6.5
6.5.1
FabScalar 1 6.7
Figure 6.9: Interleaved memory. Even bank: (a b c d), (i j k l), (q r s t). Odd bank: (e f g h), (m n o p), (u v w x). The banks output (a, b, c, d) and (e, f, g, h); swap gives (a, b, c, d, e, f, g, h); squeeze gives (b, c, d, e).
1 6.8 6.8 FabScalar 4 a p 1 6.8 1 4 ( a b c d ) 1 1 1
Figure 6.10: Interleaved memory. Even bank: (a b c d), (i j k l), (q r s t). Odd bank: (e f g h), (m n o p), (u v w x). The banks output (i, j, k, l) and (e, f, g, h); swap gives (e, f, g, h, i, j, k, l); squeeze gives (f, g, h, i).
c d e f 2 2 1 6.9 L1 2
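The swap and squeeze steps shown in Figures 6.9 and 6.10 can be sketched behaviourally as follows. This is a software model for illustration only, not the RTL that FabCache generates; the function names, the four-word line size, and the flat word addressing are assumptions. Consecutive lines alternate between the even and odd banks, both banks are read in the same access, the two lines are put in program order (swap), and the fetch-width window is cut out (squeeze).

```python
from string import ascii_lowercase

LINE_WORDS = 4  # words per cache line (assumed, as in the figures)


def build_banks(words):
    """Distribute consecutive lines alternately over the even and odd banks."""
    lines = [tuple(words[i:i + LINE_WORDS])
             for i in range(0, len(words), LINE_WORDS)]
    return lines[0::2], lines[1::2]


def fetch(even_bank, odd_bank, start_word, fetch_width):
    """Return fetch_width consecutive words starting at start_word."""
    assert 1 <= fetch_width <= LINE_WORDS
    line, offset = divmod(start_word, LINE_WORDS)
    bank_select = line & 1           # 0: fetch starts in the even bank
    row = line >> 1                  # row shared by both banks
    # When the fetch starts in the odd bank, the even bank supplies the
    # next row, corresponding to {Tag, Index} + 1 in Figure 6.4.
    even_line = even_bank[row + bank_select]
    odd_line = odd_bank[row]
    # swap: place the line holding start_word first
    pair = odd_line + even_line if bank_select else even_line + odd_line
    # squeeze: keep only the requested window
    return pair[offset:offset + fetch_width]


even, odd = build_banks(list(ascii_lowercase[:24]))  # words a .. x
print(fetch(even, odd, start_word=1, fetch_width=4))  # ('b', 'c', 'd', 'e')
print(fetch(even, odd, start_word=5, fetch_width=4))  # ('f', 'g', 'h', 'i')
```

The two calls reproduce the cases in Figures 6.9 and 6.10: a fetch starting at b needs no swap, while a fetch starting at f reads (i, j, k, l) from the even bank and (e, f, g, h) from the odd bank and swaps them before squeezing.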
c d e f 4 c, d e f squeeze 1 swap f, g, h, i i f, g, h swap 6.5.2 [22] FabCache CPU L1 AMBA4 16 MSHR 28
Figure 6.11: Miss status holding register (stage 1 and stage 2 send a missed request packet with the missed request address; the miss status holding register holds the status, and a comparator and status collection block drives the replay signal and the stall signal; a stall is also signalled when the MSHR is full; the L2 cache returns filled data with an ID, or an invalidate signal, through the fill buffer, which issues the fill signal with ID).
AMBA4 4 ID 16 MSHR 16
MSHR 6.11 MSHR 6.3 2 MSHR MSHR 2 1 MSHR ID fill buffer Fill buffer L2 fill buffer 1 fill buffer MSHR ID fill fill MSHR ID replay 30
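The signals named in Figure 6.11 (missed request address, comparator, stall when full, fill with ID, replay) suggest the following behavioural sketch of a miss status holding register file in the style of Kroft [22]. It is a toy software model under stated assumptions, not FabCache's implementation: the class and method names are hypothetical, and merging a second miss to the same line into an existing entry is assumed here rather than confirmed by the source.

```python
class MissStatusHoldingRegisters:
    """Toy MSHR file: tracks outstanding misses by line address."""

    def __init__(self, num_entries):
        # One slot per entry; the slot index doubles as the fill ID.
        self.entries = [None] * num_entries

    def miss(self, line_addr, request):
        """Record a miss. Returns (action, entry_id)."""
        # Comparator: a miss to a line already outstanding is merged (assumed).
        for entry_id, entry in enumerate(self.entries):
            if entry is not None and entry["line_addr"] == line_addr:
                entry["waiting"].append(request)
                return ("merged", entry_id)
        # Otherwise allocate a free entry and send the request to L2.
        for entry_id, entry in enumerate(self.entries):
            if entry is None:
                self.entries[entry_id] = {"line_addr": line_addr,
                                          "waiting": [request]}
                return ("allocated", entry_id)
        # MSHR is full: the pipeline has to stall.
        return ("stall", None)

    def fill(self, entry_id):
        """Fill data with this ID arrived; free the entry and return the
        requests that should be replayed."""
        entry = self.entries[entry_id]
        if entry is None:
            raise ValueError("fill for an unused MSHR entry")
        self.entries[entry_id] = None
        return entry["waiting"]


mshr = MissStatusHoldingRegisters(num_entries=2)
print(mshr.miss(0x100, "load A"))   # first miss to the line: allocate
print(mshr.miss(0x100, "load B"))   # same line: merge
print(mshr.miss(0x200, "load C"))   # second line: allocate
print(mshr.miss(0x300, "load D"))   # no free entry: stall
print(mshr.fill(0))                 # replay the requests waiting on entry 0
```

Using the entry index as the fill ID keeps the lookup on a fill trivial; a real design would also track per-request state such as the destination register or the word offset within the line.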
1 6.6 FabCache FabCache FabHetero AMBA AMBA System on Chip (SOC) FabCache AMBA AMBA FabCache 1 31
Table 7.2: EDA environment.
Phase                     EDA tool
functional verification   Cadence NC-Verilog 09.20-S038
synthesis                 Synopsys Design Compiler 2013.03-SP2
place & route             Synopsys IC Compiler G-2012.06
power estimation          Synopsys XA G-2012.06-SP2

7 6.1 RTL SPEC2000INT EDA 7.2 7.1 FabCache SPEC2000INT 1 7.12
Figure 7.12: Cache hit rate (panels: gap and mcf; x-axis: cache capacity (KB), 4096 to 32768; y-axis: hit rate; series: Direct, 2-way, 4-way, 8-way, 16-way).
16 FabCache
Figure 7.13: L1Icache Power Consumption (FabCache design vs. hand design; benchmarks: gzip, gcc, bzip, mcf; y-axis 0 to 1).
Figure 7.14: L1Dcache Power Consumption (FabCache design vs. hand design; benchmarks: gzip, gcc, bzip, mcf; y-axis 0 to 1).

Table 7.3: Delay.
Design       L1 instruction cache   L1 data cache
FabCache     2.39 ns                2.45 ns
Hand-tuned   2.27 ns                2.32 ns

7.2 L1 L1 RTL L1 LRU L1
MSHR SPEC2000 INT 5000 EDA Synopsys XA G-2012.06-SP2 7.13, 7.14 L1 FabCache design 7.13, 7.14 FabCache design FabCache 7.3 FabCache 0.1% 0.1ns RTL 35
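The delay overhead implied by Table 7.3 can be checked with a few lines of arithmetic. The sketch below uses only the four delay values from the table; the dictionary layout and variable names are illustrative.

```python
# Delays from Table 7.3, in nanoseconds.
delay_ns = {
    "L1 instruction cache": {"FabCache": 2.39, "Hand-tuned": 2.27},
    "L1 data cache": {"FabCache": 2.45, "Hand-tuned": 2.32},
}

overhead = {}
for cache, d in delay_ns.items():
    diff = d["FabCache"] - d["Hand-tuned"]      # absolute increase
    percent = 100.0 * diff / d["Hand-tuned"]    # relative increase
    overhead[cache] = (round(diff, 2), round(percent, 1))
    print(f"{cache}: +{diff:.2f} ns ({percent:.1f}%)")
```

The generated caches are slower by 0.12 ns and 0.13 ns respectively, which is about 0.1 ns, or roughly 5 to 6% of the hand-tuned delay.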
Figure 7.15: Chip image of L1 instruction cache (cache control logic and eight RAM macros).
7.3 7.15,7.16 L1 L1 8KB 4 1 LRU L1 1 MSHR FabCache Rohm 180nm [23] 7.15,7.16 Cache control logic
Figure 7.16: Chip image of L1 data cache (cache control logic and eight RAM macros).
RAM MACRO SRAM 58,496.25 µm² 60,232.16 µm² SRAM 1,668,016.628 µm² 1,669,752.538 µm² 3.5% 3.6% RTL
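The 3.5% and 3.6% figures can be reproduced from the area values above. The sketch below assumes that the two smaller values are the control-logic areas and the two larger values are the total areas including SRAM, listed in the order L1 instruction cache, then L1 data cache; that assignment is an inference from the order of the numbers, and the variable names are illustrative.

```python
# Areas in square micrometres (assumed order: L1 instruction, L1 data).
control_logic_um2 = {"L1 instruction cache": 58_496.25,
                     "L1 data cache": 60_232.16}
total_um2 = {"L1 instruction cache": 1_668_016.628,
             "L1 data cache": 1_669_752.538}

ratio_percent = {}
sram_um2 = {}
for cache in control_logic_um2:
    ratio_percent[cache] = 100.0 * control_logic_um2[cache] / total_um2[cache]
    sram_um2[cache] = total_um2[cache] - control_logic_um2[cache]
    print(f"{cache}: control logic {ratio_percent[cache]:.1f}% of total, "
          f"SRAM {sram_um2[cache]:.3f} um^2")
```

Under this reading the control logic is about 3.5% and 3.6% of the total area, and both caches come out with the same SRAM area of about 1,609,520.378 µm², which is consistent with both using the same set of RAM macros.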
8 FabCache FabCache L1 FabCache L1 3.5% 0.1ns 1% FabCache 38
. Synopsys CAD VDEC Rohm VDEC 39
[1] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. 31st International Symposium on Computer Architecture (ISCA31), pp. 64-75, June 2004.
[2] H. H. Najaf-abadi, E. Rotenberg. Configurational Workload Characterization. International Symposium on Performance Analysis of Systems and Software 2008 (ISPASS-2008), pp. 147-156, April 2008.
[3] P. Greenhalgh. Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM White Paper: http://www.arm.com/ja/files/downloads/big.little Final.pdf.
[4] P. Greenhalgh. Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM White Paper: http://www.arm.com/ja/files/downloads/big.little Final.pdf.
[5] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
[6] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. FabScalar: Automating Superscalar Core Design. IEEE Micro, vol. 32, no. 3, pp. 48-59, June 2012.
[7] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan and D. M. Tullsen. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction. Int'l Symposium on Microarchitecture, Dec. 2003.
[8] H. H. Najaf-abadi, N. K. Choudhary and E. Rotenberg. Core-Selectability in Chip Multiprocessors. 18th Int'l Conference on Parallel Architectures and Compilation Techniques, Sep. 2009.
[9] ,, Eric Rotenberg,,, FabScalar Alpha 21264, SACSIS2012.
[10] E. Rotenberg, B. H. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, Oct. 2013.
[11] N. K. Choudhary, B. H. Dwiel, E. Rotenberg. A Physical Design Study of FabScalar-generated Superscalar Cores. 2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC), pp. 165-170, Oct. 2012.
[12] T. Nakabayashi, T. Sasaki, E. Rotenberg, K. Ohno and T. Kondo. Research for Transporting Alpha ISA and Adopting Multi-processor to FabScalar. Symposium on Advanced Computing Systems and Infrastructures 2012 (SACSIS2012), pp. 374-381, May 2012. (in Japanese)
[13] T. Okamoto, T. Nakabayashi, T. Sasaki, T. Kondo. FabCache: Cache Design Automation for Heterogeneous Multi-core Processors. Proceedings of the 1st International Symposium on Computing and Networking, pp. 602-606, Dec. 2013.
[14] , AMBA, SWOPP2012.
[15] Y. Seto, T. Nakabayashi, T. Sasaki, and T. Kondo. FabBus: A Bus Framework for Heterogeneous Multi-core Processor. 28th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC2013), pp. 254-257, July 2013.
[16] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM Int'l Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
[17] B. de Abreu Silva, L. A. Cuminato and V. Bonato. Reducing the Overall Cache Miss Rate Using Different Cache Sizes for Heterogeneous Multi-core Processors. Reconfigurable Computing and FPGAs (ReConFig), pp. 1-6, Dec. 2012.
[18] P. Yiannacouras and J. Rose. A Parameterized Automatic Cache Generator for FPGAs. Field-Programmable Technology (FPT), pp. 324-327, Dec. 2003.
[19] Leon 4 and GRLIB. http://www.gaisler.com.
[20] T. D. Tessier. Designing, Verifying and Building an Advanced L2 Cache Sub-System using SystemC. ISCUG, April 2012.
[21] B. E. S. Akgul, V. J. Mooney. PARLAK: Parametrized Lock Cache Generator. Design, Automation and Test in Europe Conference and Exhibition, pp. 1138-1139, April 2003.
[22] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. Proceedings of the 8th Annual Symposium on Computer Architecture, pp. 81-87, May 1981.
[23] H. Onodera, A. Hirata, A. Kitamura, K. Kobayashi, and K. Tamaru. P2Lib: Process Portable Library and Its Generation System. Journal of Information Processing, vol. 40, no. 4, pp. 1660-1669, April 1999. (in Japanese)
A B 46