2015 8 28 14:30-15:50 (s5c) SWEST17@ ( 38 SIGEMB ) Python PyCoRAM FPGA E-mail: shinya_at_is_naist_jp
SWEST2015 Shinya T-Y, NAIST n l : Python : FPGA n l l PyCoRAM: Python IP Pyverilog: Verilog HDL Veriloggen: Python Verilog HDL
All software is available!
- GitHub
  - PyCoRAM: https://github.com/pyhdi/pycoram
  - Pyverilog: https://github.com/pyhdi/pyverilog
  - Veriloggen: https://github.com/pyhdi/veriloggen
    $ git clone https://github.com/pyhdi/pyverilog.git
    $ git clone https://github.com/pyhdi/pycoram.git
    $ git clone https://github.com/pyhdi/veriloggen.git
- PIP (Python)
  - GitHub
    $ pip install pyverilog
    $ pip install pycoram
    $ pip install veriloggen
n FPGA l FPGA l FPGA n l l n Python PyCoRAM l Python Verilog HDL IP l : Pyverilog, Veriloggen n l n SWEST2015 Shinya T-Y, NAIST
FPGA
Heterogeneous computing
[Figure: four platforms side by side. Components labeled in the figure: Multicore (Intel Core i7) with four OoO cores, four L2 caches and an L3 cache; GPU (NVIDIA GeForce) with an L2 cache; Manycore (Intel Xeon Phi); FPGA (Xilinx Virtex-7); memories labeled DDR3 DRAM and GDDR5 DRAM]
FPGA (Field Programmable Gate Array) n LSI (PLD: Programmable Logic Device) l l CPU GPU FLD (Fixed Logic Device) n CPU l CPU CPU l FPGA Digital circuits
FPGA board basics
Digilent NetFPGA SUME
- FPGA: Xilinx Virtex-7 XC7V690T
- PCI Express
- SATA-3 x2
- DDR3 SODIMM (4GB x2)
- 10Gbps Ethernet x4
- Price: $24,500
Digilent Nexys3
- FPGA: Xilinx Spartan-6 LX16
- Size: Pipelined CPU 2
- Price: 15,000yen (Academic)
Digilent ZedBoard
- FPGA: Xilinx Zynq 7020
- Size: Pipelined CPU 8 (+ ARM DualCore)
- Price: 60,000yen (Academic)
SSD FPGA (Xilinx Zynq 7020, ARM Dualcore) +DDR3 DRAM 512MB SSD Interface Dual Camera
Xilinx ZC706 FPGA: Xilinx Zynq 7045 Size: Pipelined CPU 16 Price: 300,000yen
Tokyo Electron Device TB-6V-LX760-LSI FPGA: Xilinx Virtex-6 LX760 Size: Pipelined CPU x100? Price: 4,000,000yen?
ScalableCore System FPGA: Xilinx Spartan-6 100 Size: Pipelined CPU x200? Price: about 1,000,000yen
FPGA
An LB has logic circuit components for both combinational circuits and sequential circuits. LBs are connected via interconnection components (SB, CB, and wires).
[Figure: FPGA fabric drawn as a grid of LBs surrounded by IOBs. LB: Logic Block, SB: Switch Block, CB: Connection Block, IOB: I/O Block, plus wires]
Logic Block
- Two basic elements in a logic block
  - LUT (Look Up Table): for combinational circuits
  - Flip-flop: for memory (sequential circuits)
[Figure: one Logic Block in the FPGA grid, containing a LUT and a D flip-flop (D, Q)]
LUT: Look Up Table
- LUTs realize combinational logic
- A LUT returns a 1-bit value corresponding to the input bit-vector (= a Boolean function)
  - An N-input LUT has 2^N combinations of results: a 4-input LUT has 16
[Figure: N-input LUT with inputs a[0], a[1], a[2], ..., a[n-1] and output b; a table maps each input pattern (000...0, 000...1, ..., 111...1) to one output bit]
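The LUT idea above can be sketched in software as a plain table indexed by the input bits. This is only an illustrative model: the class name `LUT`, the helper `make_lut`, and the choice of a[0] as the least significant index bit are assumptions made for this sketch, not part of any FPGA vendor tool.

```python
class LUT:
    """N-input look-up table: one stored output bit per input combination."""

    def __init__(self, n_inputs, table):
        if len(table) != 2 ** n_inputs:
            raise ValueError("an N-input LUT needs exactly 2**N entries")
        self.n_inputs = n_inputs
        self.table = list(table)

    def __call__(self, *bits):
        # Pack the input bits a[0], a[1], ... into a table index
        # (assumption of this sketch: a[0] is the least significant bit).
        index = sum(bit << i for i, bit in enumerate(bits))
        return self.table[index]


def make_lut(n_inputs, func):
    """Fill the table by evaluating a Boolean function on every input pattern."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(int(func(*bits)))
    return LUT(n_inputs, table)


# A 4-input LUT configured as a 4-input AND gate.
and4 = make_lut(4, lambda a, b, c, d: a & b & c & d)
# The same structure configured as a 4-input XOR (parity) instead.
xor4 = make_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

Both `and4` and `xor4` hold 16 entries; only the table contents differ, which is what "programming" a LUT amounts to in this model.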
FPGA in Anywhere n LSI l LSI l HW SW l n LSI l l n l l l l Convolution Pooling Max Out Convolution Full Connection Input Layer Hidden Layers Output Layer
ASIC vs. FPGA n ASIC (Application Specific Integrated Circuit) l n FPGA (Field Programmable Gate Array) l FPGA OK l FPGA is cheaper ASIC is cheaper FPGA Cost ASIC The number of units
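The cost-versus-volume comparison can be made concrete with a small calculation. The sketch below assumes a simple linear cost model (a one-time development cost plus a per-unit cost), and every number in it is hypothetical and chosen only so the arithmetic is easy to follow; none of them come from the slides.

```python
def total_cost(one_time_cost, unit_cost, units):
    """Linear cost model (an assumption): fixed cost plus per-unit cost."""
    return one_time_cost + unit_cost * units


def break_even_units(asic_one_time, asic_unit, fpga_one_time, fpga_unit):
    """Number of units at which ASIC and FPGA total costs are equal."""
    if fpga_unit <= asic_unit:
        raise ValueError("this model expects the FPGA to cost more per unit")
    return (asic_one_time - fpga_one_time) / (fpga_unit - asic_unit)


# Hypothetical figures: the ASIC has a large one-time cost but cheap units,
# the FPGA has no one-time cost but expensive units.
ASIC_ONE_TIME, ASIC_UNIT = 1_000_000, 10
FPGA_ONE_TIME, FPGA_UNIT = 0, 110

crossover = break_even_units(ASIC_ONE_TIME, ASIC_UNIT, FPGA_ONE_TIME, FPGA_UNIT)
```

Below the crossover volume the FPGA is cheaper in this model, and above it the ASIC is cheaper.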
From Xilinx UG872 FPGA n FPG l 5 6 l n l l RTL
How to Develop Software?
Write the software in a programming language:

    #include <stdio.h>

    int main(){
      int a = 1 + 2;
      printf("Hello %d\n", a);
      return 0;
    }

Compiler flow: Preprocess -> Compile -> Assemble -> Link

Assembly code:

    add $t0, $t1, $t2
    li $v0, 1
    syscall

Executable binary: ELF01ABF00F1...
Execution on a CPU
How to Develop (FPGA) Hardware?
Write the hardware design in an HDL (Hardware Description Language):

    module top (input CLK, RST,
                output reg [7:0] LED);
      always @(posedge CLK) begin
        LED <= LED + 1;
      end
    endmodule

EDA flow: Synthesis -> Technology Mapping -> Place and Route -> Bitstream Generation

Bitstream: 1A0C021E...
Configure the FPGA with the bitstream
Original HW on an FPGA
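To see what the small `top` module does before going through the EDA flow, its behavior can be modeled cycle by cycle in software. This is a rough sketch, not a replacement for an HDL simulator: the function name `simulate_led_counter` is made up here, and the initial register value of 0 is an assumption (the Verilog code never resets `LED`).

```python
def simulate_led_counter(cycles, initial=0, width=8):
    """Model `LED <= LED + 1` on every rising clock edge.

    The register is `width` bits wide, so the value wraps around
    instead of growing without bound.
    """
    mask = (1 << width) - 1
    led = initial & mask
    trace = []
    for _ in range(cycles):
        led = (led + 1) & mask  # non-blocking increment, truncated to 8 bits
        trace.append(led)
    return trace


# Value of LED after each of the first 300 clock edges.
trace = simulate_led_counter(300)
```

The trace climbs to 255 and then wraps to 0, which is the pattern the LEDs would show on a board.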
FPGA
The arrival of ARM-equipped FPGAs (1)
- ARM processor + FPGA (Xilinx Zynq, Altera SoC)
  - Tightly coupled through a dedicated interconnect: cache and DRAM are shared
  - Ordinary Linux runs on it: a large body of software resources is available
Altera's ARM-based SoC: https://www.altera.com/ja_jp/pdfs/literature/br/br-soc-fpga_j.pdf
Zynq-7000 All Programmable SoC: http://japan.xilinx.com/products/silicon-devices/soc/zynq-7000.html
ARM FPGA (2) n FPGA l FPGA CPU l 10 MicroBlaze: 100MHz 200MHz, In-order, Single issue ARM: 600MHz 1GHz, OoO, Super scalar n HW/SW SoC l CPU l FPGA l CPU-FPGA DRAM CPU
Example: Zynq 7000 architecture
- ARM Cortex-A9 (dual-core, OoO, 8-stage)
- Three kinds of CPU-PL connections (all AXI interfaces)
  - GP (AXI GP Port): low speed, for access to control registers
  - HP (AXI HP Port): high bandwidth, for burst transfers to DRAM
  - ACP (AXI ACP Port): low latency, cache coherent, for sharing data with the CPU
[Figure labels: Cache, DRAM]
http://www.ioe.nchu.edu.tw/pic/courseitem/4468_20_zynq_architecture.pdf
http://japan.xilinx.com/support/documentation/data_sheets/j_ds190-zynq-7000-overview.pdf
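FPGAs with floating-point units
- The DSP (multiplier) units of conventional FPGAs support integers only
  - Floating-point operations are built by combining them with conversion logic (soft macros): large circuits and power overhead
  - For that reason, GPUs have had the advantage in floating-point computation
- Altera's next-generation models include hard-macro floating-point DSPs
  - Growing use of FPGAs as computing devices
Altera Expands Floating-Point Hardware Support Across Its Product Lines
http://www.bdti.com/insidedsp/2014/07/22/altera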
IP n IP HW J l IP l EDA l IP FPGA CPU Ether HW Acc Interconnect DRAM I/F HW Acc PCI-E Xilinx Vivado IP ARM DRAM
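Changes in applications
- Previously, image processing and networking were the mainstream
- Toward "big data" orientation: moving away from the von Neumann model
  - Memcached on an Ethernet NIC [Fukuda+, FPL'14]
  - Microsoft Bing search engine (Catapult) [Putnam+, ISCA'14]: a cluster system that connects FPGAs with dedicated links
[Figures: Fig. 2 from [Fukuda+, FPL'14]; Fig. 1 from [Putnam+, ISCA'14]]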
n RTL l : C, C++, OpenCL, Java, Python,... l : RTL HDL Verilog HDL, VHDL n SW l : (AST) (CDFG) l : RTL
FPGA (HW/SW ) HW/SW SW HW/SW HDL HW RTL HW SW HW/SW
FPGA SW HW RTL (HW/SW ) ( ) HW HW SW HW HW/SW
RTL n RTL (Register Transfer Level) l l Timed l n High Level Synthesis l l Untimed (Directive) l
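Example: multiply-accumulate over two arrays (c += a * b)
RTL design (Verilog HDL): 105 lines, 2098 characters, 15 minutes
[Figure: source listings of the multiply-accumulate unit, the top level, and the multiplier]
The designer decides when, what, and how things are done.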
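Example: multiply-accumulate over two arrays (c += a * b)
Written in C: 11 lines, 163 characters, 1 minute (about 1/10 the lines and 1/15 the time)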
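Many commercial high-level synthesis tools for FPGAs have appeared
- Xilinx Vivado HLS (+ SDSoC)
  - Behavior is defined in C/C++; performance is tuned with directives
  - With SDSoC, parts of the SW code are turned into HW; the interfaces are generated automatically
  - Other C-based tools: Impulse C, CWB, eXCite, etc.
- OpenCL family: Altera OpenCL and Xilinx SDAccel
  - Assumes a host PC; the conventions for the SW on the host PC are also defined
https://www.youtube.com/watch?v=uruvkq6zqhq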
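- LegUp: C
  - C MIPS CPU SW HW
- Synthesijer: Java
  - Java
What is Synthesijer?
- Converts Java programs into hardware on an FPGA
- Makes hardware implementation of complex algorithms easier
- Improves reusability through object-oriented design
- Open-source
Quick start 5/8: (5) Write a program that switches the variable led between true and false at intervals
- No special notation or additional syntax
- Executable as software, so checking and verifying the behavior is easy
- Some restrictions apply to the programs that can be written (no dynamic new, no recursion, etc.)
- Complex state transitions can be designed easily with Java control constructs
- The same Java program can run both as software and as hardware on an FPGA
- Automatic compilation runs in the background, so correctness as Java code is checked immediately
[Figure labels: Java compiler front end, Synthesijer engine, Java compiler back end, synthesis, place and route; code annotations: variable corresponding to LED blinking, blink, suitable wait, while(){ if(...){ }else{ } }]
http://www.sigemb.jp/ess/2014/files/ipsj-ess2014003-1.pdf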
n J l l RTL 1/10 l n L l RTL RTL l RTL l I/O
PyCoRAM Python IP
n CPU with IP-cores l CPU (ARM) FPGA : Xilinx Zynq, etc l HW IP AXI4 Avalon CPU n IP l HDL HDL l : HDL OSS n
CoRAM [Chung+,FPGA 11] n FPGA l Read/Write Communication FIFOs (Registers) CoRAM Channel Abstracted On-chip Memories Read/Write Read Write HW Kernels (Computing Logics) CoRAM Memory Manage Control Threads (Memory Access Pattern) Off-chip Memory
PyCoRAM [Takamaeda+,CARL 13] n IP l AMBA AXI4, Altera Avalon l l CPU IP Portable application with PyCoRAM Cooperation with standard IP-cores Accelerator logic PyCoRAM Abstraction Standard IP-core CPU On-chip Interconnect (AXI4, Avalon) Device-dependent Interfaces (DRAM, etc)
PyCoRAM n 2 l Verilog HDL l Python n IP : (Verilog HDL) (Python) l DMA PyCoRAM HW RTL RTL IP : + IP IP Memory/Stream Channel/ Register DMA (DRAM) IO Channel/ IO Register
PyCoRAM Channel/ Register IO Channel/ IO Register Memory/Stream DMA
PyCoRAM Computing Logic Modeled in Verilog HDL Control Thread Modeled in Python Channel/ Register IO Channel/ IO Register Memory/Stream DMA
PyCoRAM Channel: - FIFO Register: - Channel/ Register IO Channel/ IO Register Memory/Stream DMA Memory: Stream: FIFO IO Channel: FIFO IO Register:
PyCoRAM Channel/ Register IO Channel/ IO Register Memory/Stream DMA (DRAM)
PyCoRAM Channel/ Register IO Channel/ IO Register Memory/Stream DMA Master Interface Slave Interface CPU (DRAM)
PyCoRAM IP
- Verilog HDL:

    CoramMemory1P
    #(
      .CORAM_THREAD_NAME("thread_name"),
      .CORAM_ID(0),
      .CORAM_ADDR_LEN(ADDR_LEN),
      .CORAM_DATA_WIDTH(DATA_WIDTH)
    )
    inst_memory
    (
      .CLK(CLK),
      .ADDR(mem_addr),
      .D(mem_d),
      .WE(mem_we),
      .Q(mem_q)
    );

- Python:

    def calc_sum(times):
        ram = CoramMemory(idx=0, datawidth=32, size=1024)
        channel = CoramChannel(idx=0, datawidth=32)
        addr = 0
        sum = 0
        for i in range(times):
            ram.write(0, addr, 128)
            channel.write(addr)
            sum += channel.read()
            addr += 128 * (32/8)
        print('sum=', sum)

    calc_sum(8)

- PyCoRAM IP
  - Python-Verilog RTL
PyCoRAM
- RAM FIFO
  - ID

(a) Memory

    CoramMemory1P
    #(
      .CORAM_THREAD_NAME("thread_name"),
      .CORAM_ID(0),
      .CORAM_ADDR_LEN(ADDR_LEN),
      .CORAM_DATA_WIDTH(DATA_WIDTH)
    )
    inst_memory
    (
      .CLK(CLK),
      .ADDR(mem_addr),
      .D(mem_d),
      .WE(mem_we),
      .Q(mem_q)
    );

(b) Channel

    CoramChannel
    #(
      .CORAM_THREAD_NAME("thread_name"),
      .CORAM_ID(0),
      .CORAM_ADDR_LEN(CHANNEL_ADDR_LEN),
      .CORAM_DATA_WIDTH(CHANNEL_DATA_WIDTH)
    )
    inst_channel
    (
      .CLK(CLK),
      .RST(RST),
      .D(comm_d),
      .ENQ(comm_enq),
      .FULL(comm_full),
      .Q(comm_q),
      .DEQ(comm_deq),
      .EMPTY(comm_empty)
    );
Python
- PyCoRAM
  - CoramMemory: read(), write() (Memory-DRAM DMA)
  - CoramChannel: read(), write()

    def calc_sum(times):
        ram = CoramMemory(idx=0, datawidth=32, size=1024)
        channel = CoramChannel(idx=0, datawidth=32)
        addr = 0
        sum = 0
        for i in range(times):
            ram.write(0, addr, 128)  # Transfer (off-chip DRAM to BRAM)
            channel.write(addr)      # Notification to User-logic
            sum += channel.read()    # Wait for Notification from User-logic
            addr += 128 * (32/8)
        print('sum=', sum)           # $display Verilog system task

    calc_sum(8)
: n 1 +1 l CoramMemory DRAM l CoramMemory-DRAM Computing Logic (Verilog HDL) Coram Memory 0 A + sum Control Thread (Python) Control Logic Coram Channel 0
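The interplay between the control thread and the computing logic can be tried out in plain Python with software stand-ins. The sketch below is not PyCoRAM itself: `MockCoramMemory`, `MockCoramChannel`, and `computing_logic` are hypothetical classes and functions written for this illustration, DRAM is modeled as a list of 32-bit words, and the logic that would run concurrently in hardware is called inline.

```python
from collections import deque

WORD_BYTES = 4  # 32-bit words, matching datawidth=32 in the slide code


class MockCoramMemory:
    """Software stand-in for an on-chip memory that is filled by DMA."""

    def __init__(self, dram, size):
        self.dram = dram        # off-chip DRAM modeled as a list of words
        self.data = [0] * size  # on-chip memory contents

    def write(self, ram_addr, ext_addr, size):
        # DMA transfer: copy `size` words from DRAM (byte address ext_addr)
        # into the on-chip memory starting at word address ram_addr.
        base = ext_addr // WORD_BYTES
        self.data[ram_addr:ram_addr + size] = self.dram[base:base + size]


class MockCoramChannel:
    """Software stand-in for the FIFO pair between thread and logic."""

    def __init__(self):
        self.to_logic = deque()
        self.to_thread = deque()

    def write(self, value):
        self.to_logic.append(value)     # notify the computing logic

    def read(self):
        return self.to_thread.popleft()  # result from the computing logic


def computing_logic(ram, channel, size):
    """Stand-in for the Verilog side: on notification, sum the memory words."""
    channel.to_logic.popleft()
    channel.to_thread.append(sum(ram.data[:size]))


def calc_sum(times, dram, size=128):
    ram = MockCoramMemory(dram, 1024)
    channel = MockCoramChannel()
    addr = 0
    total = 0
    for _ in range(times):
        ram.write(0, addr, size)             # DRAM -> on-chip memory
        channel.write(addr)                  # notify the logic
        computing_logic(ram, channel, size)  # concurrent in real hardware
        total += channel.read()              # wait for the partial sum
        addr += size * WORD_BYTES
    return total


dram = list(range(8 * 128))
result = calc_sum(8, dram)
```

Each loop iteration moves one 128-word block, so eight iterations cover the whole 1024-word array and `result` equals the sum of all of `dram`.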
Computing logic (1): I/O ports
- No dedicated I/O is needed other than the clock (CLK) and reset (RST)
- Signals for CoramMemory (same interface as BRAM)
- Signals for CoramChannel (same interface as FIFO)
- Variables for the state machine
Computing logic (2): pipeline/FSM
- Read from CoramChannel: receive from the control thread
- Write to CoramChannel: notify the control thread
Computing logic (3): child instances
- CoramMemory (same interface as BRAM)
- CoramChannel (same interface as FIFO)
(Python) ram (CoramMemory) channel (CoramChannel) CoramMemory DMA CoramChannel
Compile
Simulation results
IP n A B C CoRAM l DRAM l l B l SIMD CoRAM Memory 0 A Computing Logic (Verilog HDL) 8-stage Multiply Pipeline + B + check sum sum CoRAM Memory 2 C Control Thread (Python) CoRAM Memory 1 Control Logic CoRAM Channel 0 I/O Channel
IP n PyCoRAM Verilog HDL Python Control Threads (Modeled in Python) User Definition (Modeled in Verilog HDL and Python) Mark Visited Cthread Update Node OutStream Next Node Addr Next Node Cost Mark Visited Cthread Mark Visited OutStream Node Addr Priority Queue Cthread Node Addr Priority Queue OutStream InStream Cost Next Node Cost + Read Node Cthread Read Node InStream Next Node Addr Edge Page Addr Read Edge Cthread Read Edge InStream Main CThread FSM Dijkstra Logic (Modeled in Verilog HDL) Generated by PyCoRAM DMAC DMAC DMAC DMAC DMAC DMAC Slave I/F AXI4 Master Interfaces AXI4-lite Slave Interfaces
FPGA n CPU IP l IP AXI4, Avalon PyCoRAM IP n l RAM DRAM PyCoRAM DRAM- Python n CPU l OS = volatile OK l OS CPU: HW: MB
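Introduction to Zynq + PyCoRAM (+Debian)
- Tutorial slides are published on SlideShare
  - http://www.slideshare.net/shtaxxx/zynqpycoram
- They summarize one way to build a present-day FPGA accelerator
  - How to create a PyCoRAM IP core
  - How to run Debian Linux on Zynq (an ARM-equipped FPGA)
  - What SW is needed to use a PyCoRAM IP core on top of it
- The information is as of March 2015, so an update is planned soon
  - Migration from Debian 7.0 to 8.0
  - Updated method for reserving memory regions for HW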
Pyverilog & Veriloggen
Pyverilog: Verilog HDL Parser module TOP (input CLK, input RST, output rslt, Verilog HDL Code Lexical Analyzer Syntax Analyzer AST AST Code Generator module TOP (input CLK, input RST, output rslt, Verilog HDL Code Dataflow Analyzer Module Analyzer Signal Analyzer Bind Analyzer Dataflow Visualizer Graphical Output Optimizer Control-flow Analyzer State Machine Pattern Matcher Active Condition Analyzer Control-flow Input Output
Abstract Syntax Tree (AST)

Verilog HDL:

    module stopwatch
    (
      input CLK,
      input RST,
      input start,
      input stop,
      input init,
      output reg busy,
      output reg [31:0] timecount
    );
      localparam IDLE = 0;
      localparam COUNTING = 1;
      localparam WAITINIT = 2;
      reg [3:0] state;
      always @(posedge CLK) begin
        if(RST) begin
          state <= 0;
          timecount <= 0;
        end else begin
          if(state == IDLE) begin
            if(start) begin
              state <= COUNTING;
              timecount <= 0;
              busy <= 1;
            end
          end else if(state == COUNTING) begin
            timecount <= timecount + 1;
            if(stop) begin
              state <= WAITINIT;
              busy <= 0;
            end
          end else if(state == WAITINIT) begin
            if(init) begin
              timecount <= 0;
              state <= IDLE;
            end else if(start) begin
              timecount <= 0;
              state <= COUNTING;
            end
          end
        end
      end
    endmodule

AST (port list part):

    Source:
      Description:
        ModuleDef: stopwatch
          Paramlist:
          Portlist:
            Ioport:
              Input: CLK, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Input: RST, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Input: start, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Input: stop, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Input: init, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Output: busy, False
                Width:
                  IntConst: 0
                  IntConst: 0
              Reg: busy, False
                Width:
                  IntConst: 0
                  IntConst: 0
            Ioport:
              Output: timecount, False
                Width:
                  IntConst: 31
                  IntConst: 0
              Reg: timecount, False
                Width:
                  IntConst: 31
                  IntConst: 0
[Figure: dataflow graph of stopwatch.timecount, drawn as a tree of Branch nodes (COND / TRUE / FALSE) with Eq and Plus operators over stopwatch_rst, stopwatch_state, stopwatch_start, stopwatch_init, stopwatch_timecount and the constants d0, d1, d2]
(a) Command Line Output

    # SIGNAL NAME: stopwatch.state
    # DELAY CNT: 0
    0 --(stopwatch_start>'d0)--> 1
    1 --(stopwatch_stop>'d0)--> 2
    2 --(stopwatch_init>'d0)--> 0
    2 --((!(stopwatch_init>'d0))&&(stopwatch_start>'d0))--> 1
    Loop
    (0, 1, 2)
    (1, 2)

(b) Graphical Output
[Figure: state graph over states 0, 1, 2 with edges labeled GreaterThan, GreaterThan, and Land / GreaterThan]

    Active Conditions: stopwatch.busy
    [((stopwatch_start 1:None) && (stopwatch_state 0:0))]
    Changed Conditions
    [((stopwatch_start 1:None) && (stopwatch_state 0:0)),
     ((stopwatch_state 1:1) && (stopwatch_stop 1:None))]
    Changed Condition Dict
    [(((stopwatch_start 1:None) && (stopwatch_state 0:0)), 'd1),
     (((stopwatch_state 1:1) && (stopwatch_stop 1:None)), 'd0)]

[Figure annotations: Condition Signal, Condition's Min/Max Values, AND Condition, Target Signal, Target's Assigned Value]
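The state transitions reported for stopwatch.state can be cross-checked against a small software model of the stopwatch module. The sketch below is a hand-written Python mirror of the always block with reset not asserted; the function names `step` and `reachable_transitions` are made up for this illustration, and nothing here uses Pyverilog.

```python
from itertools import product

IDLE, COUNTING, WAITINIT = 0, 1, 2


def step(state, timecount, busy, start=0, stop=0, init=0):
    """One rising clock edge of the stopwatch (reset not asserted)."""
    if state == IDLE:
        if start:
            return COUNTING, 0, 1
    elif state == COUNTING:
        timecount = (timecount + 1) & 0xFFFFFFFF  # 32-bit register
        if stop:
            return WAITINIT, timecount, 0
    elif state == WAITINIT:
        if init:
            return IDLE, 0, busy
        if start:
            return COUNTING, 0, busy
    return state, timecount, busy


def reachable_transitions():
    """Try every input combination in every state and collect state changes."""
    found = set()
    for state in (IDLE, COUNTING, WAITINIT):
        for start, stop, init in product((0, 1), repeat=3):
            next_state, _, _ = step(state, 0, 0, start, stop, init)
            if next_state != state:
                found.add((state, next_state))
    return found


transitions = reachable_transitions()

# A short run: start, count for three idle cycles, then stop.
run = step(IDLE, 0, 0, start=1)
for _ in range(3):
    run = step(*run)
run = step(*run, stop=1)
```

The collected set contains the same four edges as the command line output: 0 to 1, 1 to 2, 2 to 0, and 2 to 1.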
Generating Verilog HDL code from Python

Python:

    import pyverilog.vparser.ast as vast
    from pyverilog.ast_code_generator.codegen import ASTCodeGenerator

    # Build the AST of a module by hand
    params = vast.Paramlist(())
    clk = vast.Ioport( vast.Input('CLK') )
    rst = vast.Ioport( vast.Input('RST') )
    width = vast.Width( vast.IntConst('7'), vast.IntConst('0') )
    led = vast.Ioport( vast.Output('led', width=width) )
    ports = vast.Portlist( (clk, rst, led) )
    items = ( vast.Assign( vast.Identifier('led'), vast.IntConst('8') ) ,)
    ast = vast.ModuleDef("top", params, ports, items)

    # Convert the AST into Verilog HDL source text
    codegen = ASTCodeGenerator()
    rslt = codegen.visit(ast)
    print(rslt)

Execute

Verilog HDL:

    module top
    (
      input [0:0] CLK,
      input [0:0] RST,
      output [7:0] led
    );
      assign led = 8;
    endmodule

HDL Veriloggen
Veriloggen: a library for assembling Verilog HDL in Python
- It is not high-level synthesis that converts behavior written in Python into HDL
- Verilog signals and assignments are built up as Python objects
- Calling to_verilog() on the object converts it into Verilog source code text
Execute
Example: adding many LEDs
Execute
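The appeal of building HDL from Python objects is that ordinary loops can stamp out repetitive ports and assignments. The sketch below shows the idea with a tiny toy builder; `ToyModule` and its methods are hypothetical and written only for this illustration, so they should not be read as Veriloggen's actual API.

```python
class ToyModule:
    """Tiny stand-in for an HDL-construction library (not Veriloggen's API)."""

    def __init__(self, name):
        self.name = name
        self.ports = []    # (direction, name, width)
        self.assigns = []  # (left-hand side, right-hand side)

    def input(self, name, width=1):
        self.ports.append(("input", name, width))
        return name

    def output(self, name, width=1):
        self.ports.append(("output", name, width))
        return name

    def assign(self, lhs, rhs):
        self.assigns.append((lhs, rhs))

    def to_verilog(self):
        """Render the collected objects as Verilog HDL source text."""
        def decl(direction, name, width):
            rng = "" if width == 1 else "[%d:0] " % (width - 1)
            return "  %s %s%s" % (direction, rng, name)

        lines = ["module %s" % self.name, "("]
        lines.append(",\n".join(decl(*port) for port in self.ports))
        lines.append(");")
        lines.extend("  assign %s = %s;" % pair for pair in self.assigns)
        lines.append("endmodule")
        return "\n".join(lines)


m = ToyModule("top")
m.input("CLK")
m.input("RST")
# A plain Python loop adds as many LED outputs as needed.
for i in range(4):
    led = m.output("led%d" % i, width=8)
    m.assign(led, i)

verilog = m.to_verilog()
```

Changing `range(4)` to a larger number adds more LED ports without touching any HDL text by hand, which is the kind of convenience a construction library aims for.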
Veriloggen n Verilog HDL l Python+Veriloggen Verilog HDL generate n l PyCoRAM Python l Post PyCoRAM Verilog HDL
n l GCC LLVM.NET l FPGA 2 5 Zybo n l l GPU CUDA FPGA OpenACC OpenMP MPI Xilinx SDSoC
n LLVM l LLVM: LLVM-IR DFG l
n l : Python : FPGA n l l PyCoRAM: Python IP Pyverilog: Verilog HDL Veriloggen: Python Verilog HDL
n l C, C++, C#, Java, Python, Ruby, Perl, JavaScript, Scala, Go, Haskell l RTL: Verilog HDL, VHDL HDL: Chisel (Scala DSL), PyMTL (Python DSL), Veriloggen : C, C++, OpenCL, Java (Synthesijer), Python (PyCoRAM) n C C Ruby Go Python Python
n l l l n l l
n FPGA Python n l n GitHub l PyCoRAM: https://github.com/pyhdi/pycoram l Pyverilog: https://github.com/pyhdi/pyverilog l Veriloggen: https://github.com/pyhdi/veriloggen