
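[...] RTL [...]

[...] 1,2,a) [...] 1,b)
a) takamaeda@arch.cs.titech.ac.jp
b) kise@cs.titech.ac.jp

[...] CPU [...] Verilog HDL [...] RTL [...]

1. [...] CPU [...] GPU [...] Verilog HDL [...] VHDL [...] RTL [...] HDL [...] Vivado HLS [...] Impulse C [...] CPU [...] RTL [...] Verilog HDL [...] RTL [...]

2. [...] HDL [...]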

[Figure: block diagram with the labels User-defined Logic, Application, Memory Manager (Replacement), Application Specific Prefetcher (ASP), Application Kernel, On-chip RAM (BRAM), and I/O]

[...] User-defined Logic [...] Application [...] I/O [...] On-chip RAM [...] (ASP: Application Specific Prefetcher) [...]

[Figure: processing flow from the input source code to the source code with prefetcher]
Source Codes
-> Preprocess (Resolving macros)
-> Lexical Analysis (Separating into tokens)
-> Parse (AST generation)
-> Module Analysis (Module / Input / Output / Inout / Parameter)
-> Signal Analysis (Reg / Wire / Localparam)
-> Bind Analysis (dataflow generation from =/<= assignments)
-> Definition Tree
-> Control Flow Analysis (Constructing FSM)
-> Memory Access Timing Analysis
-> Memory Address Analysis (Data Flow Analysis)
-> Generating Definition Tree of Prefetcher
-> Combining Trees of Application and Prefetcher
-> Generating RTL in Verilog HDL
-> Source Code with Prefetcher

3. [...] RTL [...]

3.1 [...] RTL [...] Verilog HDL [...] (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11), (12) [...] RTL [...] Verilog HDL [...] (7) (8) (9) [...] Python [...] 9000 [...]

3.2 [...] cnt [...] CPU [...]

4. [...] Verilog HDL [...]
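As a rough illustration of the Bind Analysis and Memory Address Analysis stages named in the processing flow above, the following Python sketch builds a dataflow map from `=`/`<=` assignments and then traces which signals feed an address. It is not the tool described in this report: the regular expressions, the function names `bind_analysis` and `trace_sources`, and the Verilog snippet (signals `addr`, `base`, `data`, `mem_q`, and the counter `cnt`) are assumptions made only for illustration. A real flow would presumably operate on the parsed AST rather than on raw text.

```python
import re

# Hypothetical, simplified Verilog input: one plain assignment per statement,
# no bit-selects, concatenations, sized literals, or '<=' comparisons.
VERILOG_SNIPPET = """
assign addr = base + cnt;
always @(posedge clk) begin
  cnt <= cnt + 1;
  data = mem_q;
end
"""

KEYWORDS = {"assign", "always", "posedge", "negedge", "begin", "end", "if", "else"}

# Optional 'assign', a left-hand identifier, '<=' or '=', then everything up to ';'.
ASSIGN_RE = re.compile(r"(?:assign\s+)?([A-Za-z_]\w*)\s*(<=|=)\s*([^;]+);")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def bind_analysis(source):
    """Map each assigned signal to the set of signals read on its right-hand side."""
    binds = {}
    for lhs, _op, rhs in ASSIGN_RE.findall(source):
        sources = {name for name in IDENT_RE.findall(rhs) if name not in KEYWORDS}
        binds.setdefault(lhs, set()).update(sources)
    return binds


def trace_sources(binds, signal):
    """Return every signal that directly or transitively drives `signal`."""
    seen = set()
    stack = [signal]
    while stack:
        current = stack.pop()
        for src in binds.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


binds = bind_analysis(VERILOG_SNIPPET)
addr_sources = trace_sources(binds, "addr")
print(binds)
print(sorted(addr_sources))
```

In this sketch the address `addr` depends on `base` and on the counter `cnt`, which is the kind of information an address analysis needs in order to predict a later access. The text-based matching is deliberately naive and would misread a comparison such as `a <= b` inside a condition as an assignment.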

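IPSJ SIG Technical Report
2013 Information Processing Society of Japan

[...] a simple benchmark is used to evaluate how much the proposed method improves performance. Performance and cache hit rate are evaluated by simulation using Icarus Verilog [1]. Vector addition is used as the benchmark. For the cache, a cycle-level timing simulator written in C++ is integrated into the HDL simulation through VPI (Verilog Procedural Interface). The cache is configured with a 64-byte line size, 4 ways, a 16 KB capacity, and an access latency of 1 cycle. Main memory is a simple model with a fixed access latency of 16 cycles. The memory footprint of the data handled by the vector addition is 96 KB. Each vector addition is assumed to take 8 cycles of latency, and the computation is not pipelined.

Figure 8(a) shows the execution cycles of the baseline application and of the application with the prefetcher, and Figure 8(b) shows the cache hit rate of both. Introducing the prefetcher achieved a 2.1% performance improvement and improved the cache hit rate by 3.1%. Two factors limited the performance gain. First, the number of outstanding misses allowed by the cache was set to 1, so prefetch requests blocked the reads that followed them. Second, the prefetch target was the next access made in the same state of the loop, so data for the requests that come next in time order could not be read ahead. Avoiding the first problem requires measures such as giving priority to requests from the application kernel and aborting the prefetcher's processing whenever the kernel issues a request. Avoiding the second requires considering a prefetcher organization that targets the next access in time order.

5. Related Work

Research on memory-system optimization for FPGAs includes the method of Bayliss et al. [2], which analyzes the source code of a kernel written in a high-level synthesis language and reorders accesses to off-chip SDRAM to use memory bandwidth effectively, and CoRAM by Chung et al. [3], a framework in which the application is described with a high-level memory model and a cache and data-transfer mechanism are inserted automatically between external memory and the kernel. The former targets high-level synthesis, and [...] in loops [...]

Figure 6: Example of state-transition code, written in Verilog HDL, that controls memory accesses. [Code not recoverable; surviving annotations: Read, Write, Source of Address]

Figure 7: Example of generated prefetch code. [Code not recoverable]

Figure 8: Execution cycles and cache hit rate.
(a) Execution cycles: Base 195414, Prefetch 191318
(b) Cache hit rate: Base 93.7%, Prefetch 96.9%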
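The evaluation parameters and the cycle counts of Figure 8(a) can be cross-checked with a few lines of arithmetic. The Python sketch below derives the cache geometry implied by the stated configuration and recomputes the performance improvement from the two reported cycle counts. The variable and function names are illustrative assumptions, and "K" is taken to mean 1024.

```python
# Cache and workload parameters as stated in the evaluation setup.
LINE_SIZE_BYTES = 64
NUM_WAYS = 4
CACHE_CAPACITY_BYTES = 16 * 1024   # assumes 16K bytes = 16 * 1024
FOOTPRINT_BYTES = 96 * 1024        # assumes 96K bytes = 96 * 1024

# Execution cycles as reported in Figure 8(a).
BASE_CYCLES = 195414
PREFETCH_CYCLES = 191318


def cache_geometry(capacity, line_size, ways):
    """Return (number of lines, number of sets) for a set-associative cache."""
    num_lines = capacity // line_size
    num_sets = num_lines // ways
    return num_lines, num_sets


def improvement_percent(base_cycles, new_cycles):
    """Performance improvement of new over base, in percent (speedup minus one)."""
    return (base_cycles / new_cycles - 1.0) * 100.0


num_lines, num_sets = cache_geometry(CACHE_CAPACITY_BYTES, LINE_SIZE_BYTES, NUM_WAYS)
footprint_ratio = FOOTPRINT_BYTES / CACHE_CAPACITY_BYTES
saved_cycles = BASE_CYCLES - PREFETCH_CYCLES
improvement = improvement_percent(BASE_CYCLES, PREFETCH_CYCLES)

print(f"lines={num_lines}, sets={num_sets}")
print(f"footprint is {footprint_ratio:.0f}x the cache capacity")
print(f"cycles saved={saved_cycles}, improvement={improvement:.1f}%")
```

Under these assumptions the cache holds 256 lines in 64 sets, the working set is six times the cache capacity, and the recomputed improvement rounds to the 2.1% stated in the text.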

[...] SMT [...] [4], [5] [...]

6. [...] Verilog HDL [...] RTL [...]

[...] (CREST) [...]

[1] Williams, S. and Baxter, M.: Icarus verilog: open-source verilog more than a year later, Linux J., Vol. 2002, No. 99, p. 3 (online), available from http://dl.acm.org/citation.cfm?id=513581.513584 (2002).
[2] Bayliss, S. and Constantinides, G. A.: Optimizing SDRAM bandwidth for custom loop accelerators, Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays, FPGA '12, New York, NY, USA, ACM, pp. 195-204 (online), DOI: 10.1145/2145694.2145727 (2012).
[3] Chung, E. S., Hoe, J. C. and Mai, K.: CoRAM: an in-fabric memory architecture for FPGA-based computing, Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays, FPGA '11, New York, NY, USA, ACM, pp. 97-106 (online), DOI: 10.1145/1950413.1950435 (2011).
[4] Lu, J., Das, A., Hsu, W.-C., Nguyen, K. and Abraham, S. G.: Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, Washington, DC, USA, IEEE Computer Society, pp. 93-104 (online), DOI: 10.1109/MICRO.2005.18 (2005).
[5] Kamruzzaman, M., Swanson, S. and Tullsen, D. M.: Inter-core prefetching for multicore processors using migrating helper threads, Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS '11, New York, NY, USA, ACM, pp. 393-404 (online), DOI: 10.1145/1950365.1950411 (2011).