2015 (413M505)

1 FabScalar FabCache FabBus FabHetero FabCache FabCache FabCache FabCache FabCache 3.5 0.1ns 0.1

Abstract Single-ISA heterogeneous multi-core architectures, which consist of diverse superscalar cores, are of increasing importance in processor architecture. Running each program on a superscalar core suited to its characteristics helps reduce energy consumption and improve performance. However, designing a heterogeneous multi-core processor requires a large design and verification effort. FabHetero has therefore been proposed: it automatically generates diverse heterogeneous multi-core processors using FabScalar, FabCache, and FabBus, which generate various designs of superscalar cores, cache systems, and flexible shared bus systems, respectively. This paper presents the details of FabCache and shows that the caches it generates work correctly for arbitrary parameter values such as cache capacity, line size, associativity, access latency, and line transmission width between cache hierarchy levels. This paper also covers performance estimation and the physical design of the caches. According to the estimation results, FabCache generates cache systems with almost the same area and power consumption as a hand-tuned cache, despite the overhead of automatic generation: the L1 instruction and data cache controllers, including the extra circuits, account for only 3.5% of the cache area, and the increase in power consumption over the hand-tuned cache is less than 0.1%.


Figure 2.1: Homogeneous and Heterogeneous multi-core
Figure 2.2: Example of Cache System
Figure 3.3: FabHetero
Figure 6.4: Implementation of interleaved L1 instruction cache
Figure 6.5: L1 Data Cache
Figure 6.6: L2 cache design
Figure 6.7: Fetch image of superscalar
Figure 6.8: Interleaved memory
Figure 6.9: Interleaved memory
Figure 6.10: Interleaved memory
Figure 6.11: Miss status holding register
Figure 7.12: Cache hit rate
Figure 7.13: L1Icache Power Consumption
Figure 7.14: L1Dcache Power Consumption
Figure 7.15: Chip image of L1 instruction cache
Figure 7.16: Chip image of L1 data cache

Table 5.1: Available designs in FabCache
Table 7.2: EDA environment
Table 7.3: Delay

1 [1, 2, 3, 4] RTL(Register Transfer Level) FabScalar [5, 6, 7, 8, 9, 10, 11] FabScalar FabScalar 1

FabHetero [12] FabHetero 3 FabScalar FabCache [13] FabBus [14, 15] 3 FabHetero 4 5 FabCache 6 7 FabCache 2

2 2.1 Homogeneous Heterogeneous 2.1: Homogeneous and Heterogeneous multi-core CPU 1 ( 2.1 ) ( 2.1 ) 3

2.2 50% 2.2 L1 L2 2 3 4

CPU L1 L2 2.2: Example of Cache System. 2.1 1 5


Figure 3.3: FabHetero. (Block diagram labels: FabScalar with Core0, Core1, and Core2; FabCache with L1 Inst Cache and L1 Data Cache (L1-I, L1-D) and L2 Cache (L2-I, L2-D); FabBus with Inter Connect; Last level cache or main memory.)

3 FabHetero

FabCache FabHetero FabHetero FabHetero

FabScalar FabCache FabBus 3 3.3 FabHetero 3 (Core 0, Core 1, Core 2) Core 0 L1 L1 Core 1 L1 L2 Core 2 L1 L2 L1 L2 FabHetero FabScalar, FabCache, FabBus FabScalar, FabBus

FabCache 3.1 FabScalar N.K.Choudhary RTL [16] FabScalar ILP 1 8 Load store unit (LSU) load store queue (LSQ) FabScalar 3.2 FabBus 9

FabBus FabBus ARM AMBA 3.3 [1, 2, 3, 5, 17] B. de Abreu Silva [17] 10

FabCache 11

4 [18, 19, 20, 21] 4.1 FPGA FPGA P. Yiannacouras [18] 2 3 12

4.2 LEON Leon4 [19] 2 n [20, 21] 13

5 5.1 FabCache 3.3 3.3 L1 3.3 L1 L2 3.3 L1L2 L2 L2 FabCache 1 FabScalar 14

FabCache FabCache ASIC 5.2 5.1 FabCache 1 2 1 2 n 3 15

Table 5.1: Available designs in FabCache
(Dimensions: L = line size, S = set size, W = associativity)

L1 instruction cache
  Dimensions:
    L = (fetch width to 2^n) x 4 (byte)
    S = 1 to 2^n
    W = 1, 2^n-way, full
  Specific microarchitectures:
    two banks interleaved vs. non-interleaved
    1 to 8 fetch width
    Interface with L2 cache: line size transmission vs. burst transmission
    enable vs. disable

L1 data cache
  Dimensions:
    L = (1 to 2^n) x 4 (byte)
    S = 1 to 2^n
    W = 1, 2^n-way, full
  Specific microarchitectures:
    Miss handling: blocking vs. non-blocking
    MSHR = 1 to 8 entry
    Writing approach: write-through vs. write-back
    Interface with L2 cache: line size transmission vs. burst transmission
    enable vs. disable

L2 cache
  Dimensions:
    L = wider than higher hierarchy
    S = 1 to 2^n
    W = 1, 2^n-way, full
  Specific microarchitectures:
    dedicated instruction and data vs. unified
    Cache coherency: MOESI vs. MOSI vs. MEI vs. dedicated for each processor core
    Interface with shared memory: processor num to/from one vs. processor num to/from multi-ported memory
    Cache replacement policy: LRU vs. Pseudo-LRU
    enable vs. disable

5.3 1
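The dimension rules in Table 5.1 can be illustrated with a small parameter check. This is only a sketch under stated assumptions: the class name `CacheConfig`, its field names, and the choice to require power-of-two values for L, S, and W are illustrative and do not come from FabCache itself, which is a hardware generator and not a Python library. Full associativity is modelled here simply as a single set.

```python
from dataclasses import dataclass

WORD_BYTES = 4  # Table 5.1 gives the line size as a multiple of 4 bytes


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (1 counts as 2^0)."""
    return n > 0 and (n & (n - 1)) == 0


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical container for the L, S, W dimensions of Table 5.1."""
    line_words: int  # L divided by 4 bytes
    sets: int        # S; sets == 1 models a fully associative cache
    ways: int        # W; ways == 1 models a direct-mapped cache

    def validate(self) -> None:
        # Assumption: all three dimensions are restricted to powers of two.
        for name in ("line_words", "sets", "ways"):
            value = getattr(self, name)
            if not is_power_of_two(value):
                raise ValueError(f"{name}={value} is not a power of two")

    @property
    def capacity_bytes(self) -> int:
        # capacity = line size x number of sets x associativity
        return self.line_words * WORD_BYTES * self.sets * self.ways


if __name__ == "__main__":
    cfg = CacheConfig(line_words=4, sets=128, ways=4)
    cfg.validate()
    print(cfg.capacity_bytes)  # 16-byte lines x 128 sets x 4 ways
```

With 16-byte lines, 128 sets, and 4 ways, the capacity works out to 8 KB, the size of the 4-way L1 caches used in the physical design evaluation.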

I/O I/O FabCache LRU 100 % 17

6 L1 L2 6.1 FabCache SystemVerilog 1 RTL P. Yiannacouras [18] RTL RTL 18

FabCache RTL 1 2 6.2 FabCache 19

6.2 L1

Figure 6.4: Implementation of interleaved L1 instruction cache. (Diagram labels: the PC is divided into Tag, Index, Bank select bit, Line select bit, and byte offset. The banks are addressed with {Tag, Index} and {Tag, Index} + 1 according to the Bank select bit. Control inputs shown: Bank select bit, Line select bit, Fetch width. For both the Even Bank and the Odd Bank, line size, set size, and way size are defined in the parameter file. The swap stage takes the lines (a, b, c, d) and (e, f, g, h) and outputs (a, b, c, d, e, f, g, h) or (e, f, g, h, a, b, c, d); the squeeze stage outputs N (fetch width) instructions.)

6.4 L1 L1 2 swap squeeze
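The PC fields named in Figure 6.4 can be illustrated by a simple bit-slicing routine. This is a sketch, not the FabCache RTL: the function name `split_pc`, the field widths, and the assumption that the fields are packed from most to least significant in the order Tag, Index, Bank select bit, Line select bit, byte offset are illustrative. The 2-bit byte offset corresponds to the 4-byte instruction size used in Table 5.1.

```python
from typing import NamedTuple


class PcFields(NamedTuple):
    tag: int
    index: int
    bank_select: int
    line_select: int
    byte_offset: int


def split_pc(pc: int, index_bits: int, line_select_bits: int,
             byte_offset_bits: int = 2) -> PcFields:
    """Slice a PC into the fields labelled in Figure 6.4 (assumed layout)."""
    byte_offset = pc & ((1 << byte_offset_bits) - 1)
    pc >>= byte_offset_bits
    # Line select: position of the first instruction inside its cache line.
    line_select = pc & ((1 << line_select_bits) - 1)
    pc >>= line_select_bits
    # Bank select: 0 for the even bank, 1 for the odd bank.
    bank_select = pc & 1
    pc >>= 1
    index = pc & ((1 << index_bits) - 1)
    tag = pc >> index_bits
    return PcFields(tag, index, bank_select, line_select, byte_offset)


if __name__ == "__main__":
    print(split_pc(0x1234, index_bits=4, line_select_bits=2))
```

Because the bank select bit sits directly above the line select bits in this layout, consecutive cache lines alternate between the even and the odd bank, which is what allows two adjacent lines to be read in the same cycle.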

Figure 6.5: L1 Data Cache. (Diagram labels: Request from cpu; Stage 1 and Stage 2; Stall; Store buffer, Load buffer, Memory buffer; Tag memory controller with Tag address, Tag, and Hit/Miss signal, connected to the Tag SRAM memory; Data memory controller with Data address, connected to the Data SRAM memory; Requested data returned to the CPU; Miss status holding register with Missed address, Filled data, and Replay request; Store request and Store Ack exchanged with the L2 cache.)

bank select bit Line select bit 2 ( 3 1 4 ) squeeze

6.3 L1

6.5 L1 L1 2 miss status holding register (MSHR)

2 LRU Holding register 2 MSHR MSHR 1 MSHR L2 22

Figure 6.6: L2 cache design. (Diagram labels: L1 inst. and L1 data; Superset interface design; dedicated design; unified design; the two L2 arrays are labelled "L2 inst. or L2 bank0" and "L2 data or L2 bank1"; arbiter0 and arbiter1 are associated with bank select bit == 0 and bank select bit == 1.)

6.4 L2

L2 2 2 6.6 L2 L2 2 L2 1 L2 L2 L2 2 L2 L1 2 3

L2 L1-L2 L2 L1 L2 L2 2 (2 ) SRAM L2 24

Figure 6.7: Fetch image of superscalar. (Diagram labels: Core, Cache, FETCH_WIDTH 1 to 8.)

Figure 6.8: Interleaved memory. (Diagram labels: Normal Memory holding the instructions a to p in consecutive lines; Even Bank holding Line0 (a b c d) and Line2 (i j k l); Odd Bank holding Line1 (e f g h) and Line3 (m n o p); the fetch group c d e f spans Line0 and Line1.)

6.5

6.5.1

FabScalar 1 6.7

Figure 6.9: Interleaved memory. (Even Bank: a b c d, i j k l, q r s t. Odd Bank: e f g h, m n o p, u v w x. The lines (a, b, c, d) and (e, f, g, h) are read; swap outputs (a, b, c, d, e, f, g, h); squeeze outputs (b, c, d, e).)

1 6.8 6.8 FabScalar 4 a p 1 6.8 1 4 ( a b c d ) 1 1 1

Figure 6.10: Interleaved memory. (Even Bank: a b c d, i j k l, q r s t. Odd Bank: e f g h, m n o p, u v w x. The lines (i, j, k, l) and (e, f, g, h) are read; swap outputs (e, f, g, h, i, j, k, l); squeeze outputs (f, g, h, i).)

c d e f 2 2 1 6.9 L1 2
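The two fetch examples in Figures 6.9 and 6.10 can be reproduced with a short software model. This is a behavioural sketch only, written under assumptions: the function names `build_banks` and `fetch`, the use of instruction-granular addresses, and the four-instruction line are illustrative choices matching the figures, not the FabCache implementation.

```python
LINE_WORDS = 4  # instructions per cache line, as drawn in Figures 6.9 and 6.10


def build_banks(memory):
    """Split a flat instruction list into even-bank and odd-bank lines."""
    lines = [memory[i:i + LINE_WORDS] for i in range(0, len(memory), LINE_WORDS)]
    return lines[0::2], lines[1::2]


def fetch(even, odd, pc, fetch_width):
    """Return fetch_width instructions starting at instruction address pc.

    One line is read from each bank. 'swap' puts the line containing pc
    first, and 'squeeze' drops the instructions in front of pc and behind
    the fetch group.
    """
    if fetch_width > LINE_WORDS:
        raise ValueError("fetch width must not exceed the line size")
    line_no, offset = divmod(pc, LINE_WORDS)
    row, bank_select = divmod(line_no, 2)
    if bank_select == 0:
        # Fetch starts in the even bank: no swap needed.
        first, second = even[row], odd[row]
    else:
        # Fetch starts in the odd bank: the even bank supplies the next
        # row ({Tag, Index} + 1), and the two lines are swapped.
        nxt = even[row + 1] if row + 1 < len(even) else []
        first, second = odd[row], nxt
    merged = first + second                      # swap stage output
    return merged[offset:offset + fetch_width]   # squeeze stage output


if __name__ == "__main__":
    even, odd = build_banks(list("abcdefghijklmnopqrstuvwx"))
    print(fetch(even, odd, pc=1, fetch_width=4))  # group starting at b
    print(fetch(even, odd, pc=5, fetch_width=4))  # group starting at f
```

The first call reads (a, b, c, d) and (e, f, g, h) without swapping, as in Figure 6.9; the second reads (i, j, k, l) from the even bank and (e, f, g, h) from the odd bank and swaps them, as in Figure 6.10.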

c d e f 4 c, d e f squeeze 1 swap f, g, h, i i f, g, h swap 6.5.2 [22] FabCache CPU L1 AMBA4 16 MSHR 28

Figure 6.11: Miss status holding register. (Diagram labels: L2 Cache; Fill buffer; Filled data with ID; Missed request address; Filled data or Invalidate signal; Stall signal; Fill signal with ID; MSHR is full; stage 1 and stage 2; Missed request packet; Status; Miss status holding register; Comparator and status collection; Replay signal.)

AMBA4 4 ID 16 MSHR 16

MSHR 6.11 MSHR 6.3 2 MSHR MSHR 2 1 MSHR ID fill buffer Fill buffer L2 fill buffer 1 fill buffer MSHR ID fill fill MSHR ID replay 30
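The bookkeeping suggested by the labels of Figure 6.11 can be sketched in software as follows. The explanatory text of this section did not survive extraction, so everything here is an assumption built from the figure labels (missed request address, fill with ID, "MSHR is full", stall, replay) and from the surviving fragment that pairs a 4-bit AMBA4 ID with 16 MSHR entries. The class `Mshr` and its methods are hypothetical names, and the real design may differ in detail.

```python
class Mshr:
    """Hypothetical model of a miss status holding register file."""

    def __init__(self, entries=16):
        # Assumption: a 4-bit transaction ID allows up to 16 outstanding misses.
        self.entries = entries
        self.pending = {}  # transaction ID -> missed request address

    def is_full(self):
        return len(self.pending) >= self.entries

    def allocate(self, address):
        """Record a miss; return its ID, or None to signal a stall."""
        if self.is_full():
            return None  # corresponds to the 'MSHR is full' stall condition
        free_id = next(i for i in range(self.entries) if i not in self.pending)
        self.pending[free_id] = address
        return free_id

    def fill(self, transaction_id):
        """Handle filled data arriving with an ID; return the address to replay."""
        return self.pending.pop(transaction_id)


if __name__ == "__main__":
    mshr = Mshr(entries=2)
    first = mshr.allocate(0x100)
    second = mshr.allocate(0x200)
    print(first, second, mshr.allocate(0x300))  # third miss must stall
    print(hex(mshr.fill(first)))                # fill frees the entry
```

Tagging each outstanding miss with an ID lets fills return out of order: the entry is located by ID, and the matching request is replayed without the cache having to block on earlier misses.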

1 6.6 FabCache FabCache FabHetero AMBA AMBA System on Chip (SOC) FabCache AMBA AMBA FabCache 1 31

Table 7.2: EDA environment.
Phase                     EDA tool
functional verification   Cadence NC-Verilog 09.20-S038
synthesis                 Synopsys Design Compiler 2013.03-SP2
place & route             Synopsys IC Compiler G-2012.06
power estimation          Synopsys XA G-2012.06-SP2

7 6.1 RTL SPEC2000INT EDA 7.2

7.1 FabCache SPEC2000INT 1 7.12

Figure 7.12: Cache hit rate. (Two panels, gap and mcf. Vertical axis: Hit rate, with ticks at 0.8, 0.9, and 1. Horizontal axis: Cache capacity (KB), with ticks at 4096, 8192, 16384, and 32768. Series: Direct, 2-way, 4-way, 8-way, 16-way.)

16 FabCache

Figure 7.13: L1Icache Power Consumption. (Bar chart comparing FabCache design and Hand design for the benchmarks gzip, gcc, bzip, and mcf; vertical axis from 0 to 1.)

Figure 7.14: L1Dcache Power Consumption. (Bar chart comparing FabCache design and Hand design for the benchmarks gzip, gcc, bzip, and mcf; vertical axis from 0 to 1.)

Table 7.3: Delay.
Design       L1 instruction cache   L1 data cache
FabCache     2.39 ns                2.45 ns
Hand-tuned   2.27 ns                2.32 ns

7.2 L1 L1 RTL L1 LRU L1

MSHR SPEC2000 INT 5000 EDA Synopsys XA G-2012.06-SP2 7.13, 7.14 L1 FabCache design 7.13, 7.14 FabCache design FabCache 7.3 FabCache 0.1% 0.1ns RTL 35
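The delay figures of Table 7.3 can be compared directly with a few lines of arithmetic. This is only a worked calculation on the four published values; the dictionary layout and the function name `delay_overhead` are illustrative.

```python
# Delay values in nanoseconds, copied from Table 7.3.
DELAY_NS = {
    "L1 instruction cache": {"FabCache": 2.39, "Hand-tuned": 2.27},
    "L1 data cache": {"FabCache": 2.45, "Hand-tuned": 2.32},
}


def delay_overhead(fabcache_ns, hand_ns):
    """Return (absolute overhead in ns, relative overhead in percent)."""
    delta = fabcache_ns - hand_ns
    return delta, 100.0 * delta / hand_ns


if __name__ == "__main__":
    for cache, d in DELAY_NS.items():
        delta, percent = delay_overhead(d["FabCache"], d["Hand-tuned"])
        print(f"{cache}: +{delta:.2f} ns ({percent:.1f}%)")
```

The differences come to 0.12 ns for the L1 instruction cache and 0.13 ns for the L1 data cache, which is in line with the roughly 0.1 ns figure quoted in the text.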

Figure 7.15: Chip image of L1 instruction cache. (Layout labels: Cache Control Logic and eight RAM MACRO blocks.)

7.3 7.15,7.16 L1 L1 8KB 4 1 LRU L1 1 MSHR FabCache Rohm 180nm [23] 7.15,7.16 Cache control logic

Figure 7.16: Chip image of L1 data cache. (Layout labels: Cache Control Logic and eight RAM MACRO blocks.)

RAM MACRO SRAM 58,496.25 µm² 60,232.16 µm² SRAM 1,668,016.628 µm² 1,669,752.538 µm² 3.5% 3.6% RTL
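The 3.5% and 3.6% figures can be checked against the four area values. The sketch below assumes that the two smaller values are the control logic areas and the two larger values are the total cache areas including the SRAM macros, listed in the order L1 instruction cache, then L1 data cache. That reading is an inference, supported by the fact that subtracting the smaller value from the larger one gives the same remainder in both cases. The function name `controller_share` is illustrative.

```python
# Areas in square micrometres, copied from the text.
# Assumption: (control logic area, total cache area) per cache.
AREAS_UM2 = {
    "L1 instruction cache": (58_496.25, 1_668_016.628),
    "L1 data cache": (60_232.16, 1_669_752.538),
}


def controller_share(control_um2, total_um2):
    """Percentage of the total cache area occupied by the control logic."""
    return 100.0 * control_um2 / total_um2


if __name__ == "__main__":
    for cache, (control, total) in AREAS_UM2.items():
        share = controller_share(control, total)
        print(f"{cache}: {share:.1f}% control, {total - control:.3f} um^2 remaining")
```

Both caches leave the same remaining area once the control logic is subtracted, which is consistent with the two designs using identical RAM macros and differing only in their controllers.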

8 FabCache FabCache L1 FabCache L1 3.5% 0.1ns 1% FabCache 38

. Synopsys CAD VDEC Rohm VDEC 39

[1] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. 31st International Symposium on Computer Architecture (ISCA31), pp. 64-75, June 2004.

[2] H. H. Najaf-abadi, E. Rotenberg. Configurational Workload Characterization. International Symposium on Performance Analysis of Systems and Software 2008 (ISPASS-2008), pp. 147-156, April 2008.

[3] P. Greenhalgh. Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM WHITE PAPER: http://www.arm.com/ja/files/downloads/big.little Final.pdf.

[4] P. Greenhalgh. Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM WHITE PAPER: http://www.arm.com/ja/files/downloads/big.little Final.pdf.

[5] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.

[6] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. FabScalar: Automating Superscalar Core Design. Micro, IEEE (Volume: 32, Issue: 3), pp. 48-59, June 2012.

[7] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan and D. M. Tullsen. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction. Int'l Symposium on Microarchitecture, Dec. 2003.

[8] H. H. Najaf-abadi, N. K. Choudhary and E. Rotenberg. Core-Selectability in Chip Multiprocessors. 18th Int'l Conference on Parallel Architectures and Compilation Techniques, Sep. 2009.

[9],, Eric Rotenberg,,, FabScalar Alpha 21264, SACSIS2012.

[10] E. Rotenberg, B. H. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, Oct. 2013.

[11] N. K. Choudhary, B. H. Dwiel, E. Rotenberg. A physical design study of FabScalar-generated superscalar cores. VLSI and System-on-Chip (VLSI-SoC), 2012 IEEE/IFIP 20th International Conference on, pp. 165-170, Oct. 2012.

[12] T. Nakabayashi, T. Sasaki, E. Rotenberg, K. Ohno and T. Kondo. Research for Transporting Alpha ISA and Adopting Multi-processor to FabScalar. Symposium on Advanced Computing Systems and Infrastructures 2012 (SACSIS2012), pp. 374-381, May 2012. (in Japanese)

[13] T. Okamoto, T. Nakabayashi, T. Sasaki, T. Kondo. FabCache: Cache Design Automation for Heterogeneous Multi-core Processors. Proceedings of the 1st International Symposium on Computing and Networking, pp. 602-606, Dec. 2013.

[14], AMBA,SWOPP2012.

[15] Y. Seto, T. Nakabayashi, T. Sasaki, and T. Kondo. FabBus: A Bus Framework for Heterogeneous Multi-core processor. 28th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC2013), pp. 254-257, July 2013.

[16] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM Int'l Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.

[17] B. de Abreu Silva, L. A. Cuminato and V. Bonato. Reducing the overall cache miss rate using different cache sizes for Heterogeneous Multi-core Processors. Reconfigurable Computing and FPGAs (ReConFig), pp. 1-6, Dec. 2012.

[18] P. Yiannacouras and J. Rose. A Parameterized Automatic Cache Generator for FPGAs. Field-Programmable Technology (FPT), pp. 324-327, Dec. 2003.

[19] Leon 4 and GRLIB. http://www.gaisler.com.

[20] Thomas D. Tessier. Designing, Verifying and Building an Advanced L2 Cache Sub-System using SystemC. ISCUG, April 2012.

[21] Akgul, B.E.S., Mooney, V.J. PARLAK: Parametrized Lock Cache Generator. Design, Automation and Test in Europe Conference and Exhibition, pp. 1138-1139, April 2003.

[22] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. International Symposium on Computer Architecture, Proceedings of the 8th annual symposium on Computer Architecture, pp. 81-87, May 1981.

[23] H. Onodera, A. Hirata, A. Kitamura, K. Kobayashi, and K. Tamaru. P2Lib: Process Portable Library and Its Generation System. Journal of Information Processing, vol. 40, no. 4, pp. 1660-1669, April 1999. (in Japanese)

A B 46