Simulation of a Many-Core Architecture with 16 Million Processing Cores

Hisanobu Tomari 1,a) Kei Hiraki 1,b)

1 The University of Tokyo
a) tomari@is.s.u-tokyo.ac.jp
b) hiraki@is.s.u-tokyo.ac.jp

Abstract: The 8080 and SH-2 processors are evaluated as building blocks for a many-core architecture. Many-core architectures often use processor cores simpler than conventional ones, because the number of processing elements that can be integrated on a chip is limited by the size of each core. A many-core system design aims to maximize instruction throughput by balancing the number of processor cores against the performance of each core. We place the 8080, one of the simplest processors, and the pipelined SH-2 in our many-core design to examine the optimal balance of simplicity and performance for the processor core.

1.

[...] [3] [...] [4] [...] Existing many-core processors include the Intel Single-chip Cloud Computer [2] (48 cores), Xeon Phi (50+ cores), IBM Cyclops [9] (64 cores), Cavium Octeon II (32 cores), and Tilera Tile-GX (100 cores). [...] PE [11][7] [...] Nsim [10] [...] 8080 [1] [...] SH-2 [5] [...] SH-2 [...] 16 million [...] 70 [...] [8] [...]

2.

2.1

The architecture is built from Processing Elements (PEs) connected by a Shuffle Exchange network. PEs are arranged along two axes, Pipe and Rank, and [...] Rank [...] Shuffle Exchange (Figure 1). [...] I/O [...] PE [...] Shuffle Exchange [...] hop [...]

PEs communicate through Reflective Memory [6]. [...] a PE in rank n [...] PEs in rank n - 1 (Figure 2) [...] rank n [...] PEs in rank n + 1 [...]

The 8080 is an 8-bit processor [...] 16 [...]. The SH-2 is a 32-bit processor with 16-bit instructions, whereas MIPS, ARM, and PowerPC use 32-bit instructions. [...] SH [...] 68000 [...] 16 [...] 80 [...] 8080 [...] SH-2 [...] GNU [...] ROM [...] GNU Binutils [...]

[Figure 1: PE network. Axis labels: Pipe 0 to Pipe 7 and rank 0 to rank 4.]

[Figure 2: PE address space. Address marks: 0000, 0300, 0380, 0400, 4000, 4080, 6000, 6080, 8000. Region labels: Local Memory, RM_IN1, RM_IN2, RM_P1, RM_P2, Config. Annotations: RM_P1 and RM_P2 are each mapped to the local memory of another PE in the next rank; "Mapped to local memory"; "Other PE"; "Previous rank".]
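The rank-to-rank wiring along the Pipe axis can be illustrated with a short Python sketch. It assumes the textbook shuffle-exchange (omega) wiring, in which a PE in pipe p forwards to the two pipes obtained by rotating the bits of p left by one and then choosing the lowest bit; the exact wiring used in this work is not stated in the surviving text, and the names `shuffle`, `next_rank_neighbors`, and `route` are hypothetical. Under that assumption, any destination pipe is reached after log2 N_pipe hops.

```python
def shuffle(pipe: int, n_bits: int) -> int:
    """Perfect shuffle: rotate the n_bits-wide pipe ID left by one bit."""
    mask = (1 << n_bits) - 1
    return ((pipe << 1) | (pipe >> (n_bits - 1))) & mask


def next_rank_neighbors(pipe: int, n_bits: int) -> tuple[int, int]:
    """Two pipes in the next rank reachable from `pipe` (assumed wiring)."""
    s = shuffle(pipe, n_bits)
    return (s & ~1, s | 1)


def route(src: int, dst: int, n_bits: int) -> list[int]:
    """Pipe IDs visited from src to dst, one hop per rank.

    At every hop the address is shuffled and its lowest bit is replaced
    by the next destination bit, most significant bit first.
    """
    path = [src]
    cur = src
    for hop in range(n_bits):
        bit = (dst >> (n_bits - 1 - hop)) & 1
        cur = (shuffle(cur, n_bits) & ~1) | bit
        path.append(cur)
    return path


if __name__ == "__main__":
    # N_pipe = 8, so n_bits = log2(8) = 3 hops are needed.
    print(route(5, 2, 3))  # prints [5, 2, 5, 2]
```

Each step of `route` lands on one of the two `next_rank_neighbors` of the current pipe, so the path uses only links that exist in the assumed network.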

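[...] C [...] Each 8080 PE has 1 KiB of local memory and each SH-2 PE has 2 KiB. [...] 8080 [...] 128 bytes [...] Shuffle Exchange [...] 2 [...] 1 KiB [...] 256 bytes [...] SH-2 [...]

2.2

[...] The simulator uses the CPU emulation cores of MAME for the 8080 and the SH-2. [...]

[Figure 3: Fields S (Synchronization bit) and N (Number of packets); N packets follow.]

[Figure 4: Packet fields: Payload, Destination Pipe ID, Distance to the destination.]

[...] PE [...] SH-2 [...] 7 [...] 8080 [...] 25 [...] With 1 KiB per 8080 PE, 32 million PEs [...] 36 GiB. With 2 KiB per SH-2 PE [...] 512 [...] 16 million PEs [...] 50 GiB. [...] OS [...] PE [...]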
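The fields named in the port and packet layouts above (synchronization bit, number of packets, payload, destination pipe ID, distance to the destination) can be sketched as a byte encoding. Every width below is an assumption made for illustration, since the surviving text does not give field sizes: a one-byte header with the synchronization bit in bit 7 and the packet count in the low 7 bits, a 16-bit destination pipe ID, an 8-bit distance, and a fixed 4-byte payload. The names `Packet`, `encode_port`, and `decode_port` are hypothetical.

```python
import struct
from dataclasses import dataclass

PAYLOAD_BYTES = 4                      # assumed payload width
PACKET_FMT = f"<HB{PAYLOAD_BYTES}s"    # dest pipe (u16), distance (u8), payload
PACKET_SIZE = struct.calcsize(PACKET_FMT)


@dataclass
class Packet:
    dest_pipe: int   # Destination Pipe ID
    distance: int    # Distance to the destination, in hops
    payload: bytes


def encode_port(sync: bool, packets: list[Packet]) -> bytes:
    """Serialize a port: header byte (S bit + count N), then N packets."""
    if len(packets) > 0x7F:
        raise ValueError("packet count does not fit in the assumed 7-bit field")
    header = bytes([(0x80 if sync else 0x00) | len(packets)])
    body = b"".join(
        struct.pack(PACKET_FMT, p.dest_pipe, p.distance, p.payload)
        for p in packets
    )
    return header + body


def decode_port(buf: bytes) -> tuple[bool, list[Packet]]:
    """Inverse of encode_port."""
    sync = bool(buf[0] & 0x80)
    count = buf[0] & 0x7F
    packets = []
    for i in range(count):
        dest, dist, payload = struct.unpack_from(PACKET_FMT, buf, 1 + i * PACKET_SIZE)
        packets.append(Packet(dest, dist, payload))
    return sync, packets
```

With these assumed widths a packet occupies 7 bytes, so a 128-byte port region would hold the header plus 18 packets.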

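[Figure 5: Clock count versus N_pipe for the 8080 and the SH-2.]

Table 1: 8080 [...]

N_rank   N_pipe = 4        8       16       32
2             3,269    6,010   12,059   22,824
4                 -    5,594   12,314   23,897
8                 -        -   12,578   24,146
16                -        -        -   24,270

$$\frac{\partial}{\partial t} r(n, t) = \frac{\partial^2}{\partial n^2} r(n, t) \tag{1}$$

where r(n, t) = (x, y, z, w) [...] 4 [...]

3.

3.1

[...] 1 [...] (8080 [...] SH-2 [...]) (Figure 3) [...] PE-PE [...] 0 [...] 1 [...] PE [...] 2 [...] log_2 N_pipe [...] (Figure 5) [...] 8080 [...] SH-2 [...] Shuffle Exchange [...] log_2 N_pipe [...] SH-2 [...] 1/5 of the 8080 [...] 8080 [...] 5 [...]

3.2

$$x_{t+1}(n) = \frac{x_t(n-1) - 2x_t(n) + x_t(n+1)}{4} \tag{2}$$

[...] log_2 N_pipe [...] PE [...] log_2 N_pipe [...] PE [...] Algorithm 1 [...] hop [...] ID [...] (Table 1) [...] log_2 N_pipe [...] 8080 [...] SH-2 (Figure 6) [...] Routing [...] Calc [...] Send [...]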
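The update in Equation (2) reads each point together with its two neighbors, which is why every step needs data from adjacent PEs. A minimal Python sketch follows, transcribing the update literally as x_{t+1}(n) = (x_t(n-1) - 2 x_t(n) + x_t(n+1)) / 4 for a single scalar component. The function name `stencil_step` and the fixed boundary values `left` and `right` are assumptions; the surviving text does not say how the ends of the domain are handled.

```python
def stencil_step(x, left=0.0, right=0.0):
    """Apply one update of Equation (2) to every point of the list x.

    Points outside the list are taken to be `left` and `right`
    (an assumed boundary condition).
    """
    n = len(x)
    new = []
    for i in range(n):
        west = x[i - 1] if i > 0 else left
        east = x[i + 1] if i < n - 1 else right
        new.append((west - 2 * x[i] + east) / 4)
    return new


if __name__ == "__main__":
    print(stencil_step([0, 0, 4, 0, 0]))  # prints [0.0, 1.0, -2.0, 1.0, 0.0]
```

The division by 4 is a two-bit right shift in integer arithmetic, which keeps the per-point work small on an 8-bit core.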

Algorithm 1 PE [...]
loop
    wait(output port 0 sync bit = 0)
    output port 0 number of packets ← 0
    wait(output port 1 sync bit = 0)
    output port 1 number of packets ← 0
    for p ∈ input port 0 and input port 1 do
        wait(p sync bit = 1)
        for n = 0 to p number of packets do
            q ← pointer to the head of nth packet
            if distance to destination in q > 0 then
                route this packet to output port
            else
                copy payload to static region
            end if
        end for
        done(input port p, sync bit ← 0)
    end for
    do calculation
    output port 0, sync bit ← 1
    output port 1, sync bit ← 1
end loop

[Figure 6: Cycles, broken down into Route/SH, Calc/SH, Route/80, and Calc/80.]

[...] Send [...] Calc [...] Routing [...] SH-2 [...] 1/5 [...] 8080 [...] SH-2 [...] 8080 [...] 8 [...] SH-2 [...] 32 [...] 4 [...] SH-2 [...] 1 [...] 16 [...] SH-2 [...] 4 [...] 8080 [...] SH [...] 468 for the 8080 and 624 for the SH-2; the SH-2 figure is 33% larger. [...] 8080 [...] 1 KiB [...] SH-2 [...]

3.3

[...] Intel Westmere (2.93 GHz) [...] 12 [...] 8080 [...] 25 [...] SH-2 [...] 7 [...] SH-2 [...] 1/7 of the 8080 (Figure 7) [...] SH-2 [...] 8080 [...] 805 MHz for the 8080 and 112 MHz for the SH-2 [...]

[Figure 7: Cycles/s versus PE count (1e+06 to 1e+08).]

3.4

[...] 8080 [...] FPGA [...] 8080 [...]
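One pass through the loop body of Algorithm 1 can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the simulator's code: the `wait` calls become readiness checks that raise instead of spinning, the distance field is assumed to decrease by one per hop, and the choice of output port is delegated to a caller-supplied `choose_output` function because Algorithm 1 does not say how the port is selected. `Packet`, `Port`, and `pe_iteration` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    dest_pipe: int
    distance: int
    payload: bytes


@dataclass
class Port:
    sync: bool = False                       # synchronization bit
    packets: list = field(default_factory=list)


def pe_iteration(inputs, outputs, static_region, choose_output, calculate):
    """Run one iteration of the Algorithm 1 loop body for a single PE."""
    # wait(output port sync bit = 0), then reset the packet count.
    for out in outputs:
        if out.sync:
            raise RuntimeError("output port has not been consumed yet")
        out.packets.clear()

    for port in inputs:
        # wait(p sync bit = 1)
        if not port.sync:
            raise RuntimeError("input port has not been filled yet")
        for pkt in port.packets:
            if pkt.distance > 0:
                # Still in transit: forward (assumed: one hop closer).
                k = choose_output(pkt)
                outputs[k].packets.append(
                    Packet(pkt.dest_pipe, pkt.distance - 1, pkt.payload)
                )
            else:
                # Arrived: copy payload to the static region.
                static_region.append(pkt.payload)
        port.sync = False                    # hand the port back to the sender

    calculate(static_region, outputs)        # "do calculation"

    for out in outputs:
        out.sync = True                      # publish both output ports
```

A driver would call `pe_iteration` for every PE of a rank, then copy each output port into the matching input port of the next rank, which plays the role of the reflective memory write.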

[...] 64 [...] FPGA: Xilinx Virtex-6 XC6VLX240T-1FF1156 [...] Shuffle Exchange [...] PE [...] Shuffle Exchange [...] I/O [...] Shuffle Exchange [...]

4.

[...] SH-2 [...] 8080 [...] 5 [...] 8080 [...] 64 [...] SH-2 [...] 8080 [...] 5 [...] SH-2 [...] FPGA [...] MIPS [...] 8080 [...]

References

[1] Intel Corporation. Intel 8080 Microcomputer Systems User's Manual. September 1975.
[2] Jim Held. Single-chip Cloud Computer: an experimental many-core processor from Intel Labs. Intel Labs Single-chip Cloud Computer Symposium, 2010.
[3] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. POWER7: IBM's next-generation server processor. IEEE Micro, 30(2):7-15, March-April 2010.
[4] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded SPARC processor. IEEE Micro, 25(2):21-29, March-April 2005.
[5] Hitachi America Ltd. SuperH RISC Engine SH-1/SH-2 Programming Manual. September 1996.
[6] S. Lucci, I. Gertner, A. Gupta, and U. Hegde. Reflective-memory multiprocessor. In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, volume 1, pages 85-94, January 1995.
[7] Hisanobu Tomari. Design and evaluation of sea-of-core array architecture with 32 million processor cores. Master's thesis, Dept. of Computer Science, the University of Tokyo, March 2012.
[8] M. Yokokawa, F. Shoji, A. Uno, M. Kurokawa, and T. Watanabe. The K computer: Japanese next-generation supercomputer development project. In 2011 International Symposium on Low Power Electronics and Design (ISLPED), pages 371-372, August 2011.
[9] Ying Ping Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, R. Nitzsche, and G. R. Gao. A study of the on-chip interconnection network for the IBM Cyclops64 multicore architecture. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 10 pp., April 2006.
[10] [...]. PSI-NSIM: [...]. IEICE Technical Report, Computer Systems, 107(276):45-50, 2007.
[11] [...]. ARC 2010-ARC-190(3), July 2010.