
A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

GAKUHO TAGUCHI 1, YOUICHI ABE 1, KEIJI KIMURA 1, HIRONORI KASAHARA 1
1 WASEDA UNIVERSITY

A parallelizing-compiler-cooperative multicore architecture simulation framework is proposed that reduces simulation time through a flexible simulation-mode changeover mechanism. The multicore architecture simulator in this framework has two modes: a fast functional simulation mode and a slow cycle-accurate simulation mode. By cooperating with a parallelizing compiler, the framework generates, for each parallelized application, appropriate sampling points for the cycle-accurate mode and the runtime code that switches the simulator between modes. The proposed framework is evaluated with EQUAKE from SPEC CPU 2000. The evaluation shows that speedups of 50 to 500 times can be achieved with an error within 1.6%.

1.
[...] 500010000 [...] SimFlex 1) [...] SimPoint 2) [...] 3,4) [...]

2.
2.1
[...] 3,4) [...] P% [...] P%P=2.5 5% [...] (1) [...] (2)
2.2
[...] 100130 [...]
2.3
[...] FE [...] MP [...] BE [...] OSCAR 5) [...]
Figure 1 A compilation flow of the proposed sampling-based architecture simulation

3.
3.1
3.2
[...] OSCAR 6) [...] sim_count [...] 12, 238, 3 [...] sim_change [...] PE [...]

    int sim_count[] = {12, 238, 3};

    MAIN_PE0 {    /* PE0 */
        for (...) {
            sim_change(0, sim_count);
        }
    }

    MAIN_PE1 {    /* PE1 */
        for (...) {
            sim_change(1, sim_count);
        }
    }

Figure 2 An image of simulation-mode changeover
Figure 3 An image of code for changeover of simulation modes
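The changeover call in Figure 3 can be illustrated with a small runtime sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names sim_change_sketch, sim_reset_sketch, and count_cycle_accurate are hypothetical, the extra num_phases parameter is not in the paper, and the reading of sim_count as a list of per-phase iteration counts that alternate between functional and cycle-accurate mode is a guess about the interface.

```c
#include <assert.h>

#define NUM_PE 2  /* number of simulated processor elements in this sketch */

/* The paper names two modes; these identifiers are hypothetical. */
typedef enum { MODE_FUNCTIONAL, MODE_CYCLE_ACCURATE } sim_mode_t;

/* Per-PE bookkeeping (assumed layout, not taken from the paper). */
typedef struct {
    int phase;  /* index of the current entry of sim_count */
    int calls;  /* iterations already spent in the current phase */
} pe_state_t;

static pe_state_t pe_state[NUM_PE];

/* Reset the bookkeeping of one PE before a new run. */
void sim_reset_sketch(int pe)
{
    assert(pe >= 0 && pe < NUM_PE);
    pe_state[pe].phase = 0;
    pe_state[pe].calls = 0;
}

/*
 * Called once at the top of every loop iteration; returns the mode to use
 * for that iteration.
 * Assumption: sim_count[i] is the number of iterations spent in phase i,
 * even phases run in functional mode, odd phases in cycle-accurate mode,
 * and the last phase lasts until the loop ends.
 */
sim_mode_t sim_change_sketch(int pe, const int *sim_count, int num_phases)
{
    assert(pe >= 0 && pe < NUM_PE);
    assert(num_phases > 0);
    pe_state_t *s = &pe_state[pe];

    /* Move on once the current phase has used up its iterations. */
    while (s->phase < num_phases - 1 && s->calls >= sim_count[s->phase]) {
        s->phase++;
        s->calls = 0;
    }
    s->calls++;
    return (s->phase % 2 == 0) ? MODE_FUNCTIONAL : MODE_CYCLE_ACCURATE;
}

/* Driver: count how many of `iterations` loop iterations run cycle-accurately. */
int count_cycle_accurate(int pe, const int *sim_count, int num_phases,
                         int iterations)
{
    int n = 0;
    sim_reset_sketch(pe);
    for (int i = 0; i < iterations; i++) {
        if (sim_change_sketch(pe, sim_count, num_phases) == MODE_CYCLE_ACCURATE)
            n++;
    }
    return n;
}
```

Keeping the counters per PE mirrors the way each MAIN_PE block in Figure 3 passes its own PE number, so each core can switch modes without consulting the others. Whether the actual runtime synchronizes the changeover across PEs is not something this sketch addresses.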

4.
4.1
4.1.1
4.1.2
[...] L1 [...] Figure 4 [...] L2 [...] Figure 5 [...]

Table 1 (L1)
L1 cache size: 32kB, 16kB
L2 cache size: 512kB

Table 2 (L2)
L1 cache size: 32kB
L2 cache size: 64kB, 256kB, 512kB

Table 3
SPARC V9
8
L1 cache latency: 1
L2 cache latency: 4
memory latency: 60

Figure 4 L1 Cache-miss rate with and without runtime overhead (L1 cache sizes 32kB and 16kB; vertical axis 0.0% to 20.0%)

Figure 5 L2 Cache-miss rate with and without runtime overhead (L2 cache sizes 512kB, 256kB and 64kB; vertical axis 5.0% to 16.0%)

4.2
Table 4 Intel Xeon E5506
CPU: Xeon
CPU [...]: 8
CPU Clock: 2.83GHz
L1 Cache (I/D): 32KB/32KB
L2 Cache: 6.0MB
Main Memory: 7.8GB

[...] Figure 7 [...] 250 [...] 125 [...] (1) [...] Table 5 [...]

Figure 6 Program structure of EQUAKE
Figure 7 Execution cost of each iteration in a main loop of EQUAKE on a real server (horizontal axis 0 to 4000; vertical axis 0.00E+00 to 6.00E+08)

Table 5 EQUAKE
Before iteration 250: 250; 1.72E+07; 3.49E+08; 4
After iteration 250: 3605; 7.22E+05; 3.19E+08; 1

[...] = [...] × 100 (3)

Figure 8 The number of presumed execution cycles and error rate of a portion before 250 iterations (settings 4, 15, 25, 45 and all, for each of 1PE, 2PE, 4PE and 8PE; cycles 0.0E+00 to 3.0E+11; error rate 0.00% to 1.60%)

Table 6
SPARC V9
1, 2, 4, 8
L1 cache size: 32kB
L1 cache latency: 1
L2 cache size: 512kB
L2 cache latency: 4
memory latency: 60

Figure 9 The number of presumed execution cycles and error rate of a portion after 250 iterations (settings 1, 5, 30, 50 and all, for each of 1PE, 2PE, 4PE and 8PE; cycles 0 to 2.5E+12; error rate 0.00% to 0.35%)

[...] Figure 10 [...] 4: 54, 15: 16, 25: 10, 45: 5 [...] Figure 11 [...] 1: 558, 5: 345, 30: 102, 50: 65 [...]
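Figures 8 and 9 report a presumed number of execution cycles and an error rate. The following is a minimal sketch of how such figures could be computed, under two stated assumptions that the surviving text does not confirm: that the presumed total is the mean cycle count of the cycle-accurately simulated iterations scaled to the full iteration count, and that the error rate of equation (3) is the relative deviation from a full cycle-accurate run, in percent. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Presumed total cycles of a loop (assumed estimator): the mean cycle count
 * of the sampled iterations multiplied by the total number of iterations.
 */
double estimate_total_cycles(const double *sampled_cycles, size_t num_samples,
                             size_t total_iterations)
{
    assert(num_samples > 0);
    double sum = 0.0;
    for (size_t i = 0; i < num_samples; i++)
        sum += sampled_cycles[i];
    return (sum / (double)num_samples) * (double)total_iterations;
}

/*
 * Error rate in percent (assumed form of equation (3)): absolute difference
 * between the estimate and the reference, relative to the reference.
 */
double error_rate_percent(double estimated, double reference)
{
    assert(reference > 0.0);
    double diff = estimated - reference;
    if (diff < 0.0)
        diff = -diff;
    return diff / reference * 100.0;
}
```

For example, four sampled iterations costing 100, 110, 90 and 100 cycles in a 250-iteration loop give an estimate of 25000 cycles; against a hypothetical reference of 25400 cycles, that is an error of about 1.57%. These numbers are made up for illustration and are not measurements from the paper.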

Figure 10 The speedup rate of a portion before 250 iterations (settings 4, 15, 25, 45 and all; vertical axis 0 to 60)
Figure 11 The speedup rate of a portion after 250 iterations (settings 1, 5, 30, 50 and all; vertical axis 0 to 600)

5.
[...] B23700064 [...]

1) Thomas F. Wenisch, Roland E. Wunderlich, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and James C. Hoe, SimFlex: Statistical Sampling of Computer System Simulation, IEEE Micro, Vol.26, No.4, pp.32-42, July-Aug., 2006
2) Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder, Using SimPoint for Accurate and Efficient Simulation, SIGMETRICS '03, San Diego, California, USA, ACM 1-58113-664-1/03/0006, June 10-14, 2003
3) [...] 2011-ARC-196(14), 1-11, 2011-07-20
4) [...] 191 [...] Vol. 2012-ARC-199, No.3, 2011-07-20
5) Hironori Kasahara, Motoki Obata, and Kazuhisa Ishizaka, Automatic Coarse Grain Task Parallel Processing on SMP using OpenMP, Proc. of 13th International Workshop on Languages and Compilers for Parallel Computing (LCPC '00), Aug., 2000
6) Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako, and Hironori Kasahara, OSCAR API for Real-time Low-Power Multicores and Its Performance on Multicores and SMP Servers, Lecture Notes in Computer Science, Springer, Vol.5898, pp.188-202, 2010