IPSJ SIG Technical Report Vol.2013-HPC-138 No /2/21 GPU CRS 1,a) 2,b) SpMV GPU CRS SpMV GPU NVIDIA Kepler CUDA5.0 Fermi GPU Kepler Kepler Tesla

Size: px
Start display at page:

Download "IPSJ SIG Technical Report Vol.2013-HPC-138 No /2/21 GPU CRS 1,a) 2,b) SpMV GPU CRS SpMV GPU NVIDIA Kepler CUDA5.0 Fermi GPU Kepler Kepler Tesla"

Transcription

1 GPU CRS 1,a),b) SpMV GPU CRS SpMV GPU NVIDIA Kepler CUDA5.0 Fermi GPU Kepler Kepler Tesla K0 CUDA5.0 cusparse CRS SpMV SpMV SpMV CRS Compressed Row Storage *1 SpMV GPU GPU NVIDIA Kepler GPU NVIDIA GPGPU CUDA CRS 1 CRS CPU GPU 008 Bell [1] SpMV CUDA ELL COO HYB CRS 1 a) mukunoki@hpcs.cs.tsukuba.ac.jp b) daisuke@cs.tsukuba.ac.jp *1 NVIDIA cusparse GPU CSR Compressed Sparse Row CRS val = [8, 9, 4, 5, 7, 5, 6,, 6, 7, 9, 6,,, 7,, 8] ind = [1,, 5, 6,, 3, 4, 5,, 4, 1, 4, 6, 3, 3, 4, 5] ptr = [1, 5, 9, 11, 14, 15, 18] 1 CRS Bell [] Segmented Scan CUDA Weizhi [3] CRS Xiaowen [4] SIC Matam [5] CRS ELL CRS SpMV GPU SpMV Kubota [6] CRS CRS GPU CPU CRS SpMV CRS SpMV NVIDIA GPU cusparse[7] CRS SpMV 013 Information Processing Society of Japan 1

2 Fermi GPU Kepler Kepler Tesla K0 CUDA5.0 cusparse CRS SpMV GPU CRS SpMV Bell [1] Bell 1 y = Ax y 1 1 CRS-scalar 1 CRS-vector CRS-scalar CPU GPU CRS-vector Bell 1 3 CRS-vector 3 1 Baskaran [8] Guo [9] ElZein [10] CRS-vector CRS-scalar Reguly [11] CRS-vector 1 1,, 4, 8, 16, CRS-scalar Reguly Yoshizawa [1] 1,, 4, 8, 16, 3 3. Kepler GPU Kepler NVIDIA 01 GPU Kepler Tesla K0 Kepler NVIDIA [13] Fermi Fermi SM SMX Fermi SM 3 CUDA SMX 19 Max X Grid Dimension 1 x 65,535,147,483,647 ID 3 65,536 Kepler Fermi Kepler CRS SpMV shfl n shfl up n shfl down XOR shfl xor 4 48KB Fermi Kepler const restrict 4. Kepler CRS SpMV cusparse cusparse y = αax + βy 013 Information Processing Society of Japan

3 thread thread , , , , ,9 10, thread iteration NT=1 NT= NT=4 CRS-vector 1 global void SpmvKernel_NT (int m, double alpha, double* a_val, int* a_ptr, int* a_idx, const double* restrict x, double beta, double* y) { unsigned int t; unsigned int tx = threadidx.x; unsigned int tid = blockdim.x * blockidx.x + tx; unsigned int rowid = tid / NT; unsigned int lane = tid % NT; double val; int val_hi, val_lo; if (rowid < m) { val = 0.0; for (i = a_ptr[rowid] + lane; i < a_ptr[rowid + 1]; i += NT) { val += a_val[i] * x[a_idx[i]]; for (i = NT / ; i > 0; i = i >> 1) { val_hi = doublehiint(val); val_lo = doubleloint(val); val += hilointdouble( shfl_xor(val_hi, i, 3), shfl_xor(val_lo, i, 3)); if (lane == 0) { y[rowid] = alpha * val + beta * y[rowid]; 3 NT 1,, 4, 8, 16, 3 for Reguly CRS-vector 1 NT NT=1,, 4, 8, 16, 3 cusparse SpMV 1 CRS SpMV CRS-vector NT=1,, 4 int Spmv (char trans, int m, int n, double alpha, double* a_val, int* a_ptr, int* a_idx, double* x, double beta, double* y, int nonzeros) { int NT, ntx, nbx; float nnzrow = (float)nonzeros/(float)m; NT = max(1, min(3, (int)pow(.,ceil(log(nnzrow))))); ntx = NTX; nbx = m / (ntx / NT) + ((m % (ntx / NT))!= 0); dim3 threads (ntx); dim3 grid (nbx); if (trans == N ) { if (NT == 3) { cudafuncsetcacheconfig (SpmvKernel_3, cudafunccachepreferl1); SpmvKernel_3 <<< grid, threads >>> (m, alpha, a_val, a_ptr, a_idx, x, beta, y); else if (NT == 16) { else if (NT == ) { else { 4 iteration 1 NT NT = max(1, min(3, (int)pow(,ceil(log (nnzrow)))) NT 3 Kepler (1) () (3) 3 SpMV GPU 4 4 for NT 013 Information Processing Society of Japan 3

4 Kepler Kepler 48KB GPU const restrict L1 L1 SpMV x 4. Kepler GPU x MaxGridDimX x 65,535,147,483,647 ID CRS-vector RowMax RowMax=MaxGridDimX BlockDim.x/NT BlockDim.x x Block- Dim.x=18 RowMax NT=3 Fermi GPU RowMax=65,535 18/3=6,140 6,140 ID * Kepler GPU RowMax=,147,483,647 18/3=8,589,934, GB GPU GB SpMV 4.3 CRS-vector 1 NT 3 Kepler XOR shfl xor 3bit 64bit double int int double GPU Kepler NVIDIA Tesla K0 CPU Intel Xeon E GHz OS CentOS 6.3 kernel: el6.x86 64 CUDA5.0 (Driver Version: ) nvcc 5.0 -O3 -arch sm 35 gcc O3 -arch sm 35 Kepler GPU Flops CPU PCI-Express GPU 3 1 CUDA5.0 NVIDIA cusparse The University of Florida Sparse Matrix Collection[14] 00 1,813 5,558,36 4,57 117,406,044 6 y = αax + βy A * volatile 5. NVIDIA cus- PARSE CUDA Information Processing Society of Japan 4

5 GFlops Performance (Tesla K0, Double Precision) 11 Speedup 0 cusparse Our Implementation Speedup [Our implementation / cusparse5.0] 5 cusparse E+07 Matrix Size (Rows) 1.0E+01 Percentage of Non-zero Elements [%] 1.0E E E E E-0 1.0E E E E E E E E E E Non-zero Elements (NNZ) 1.0E E E E+0 1.0E E Non-zero Elements per Row (NNZ/Row) 1.0E E cusparse Flops cusparse Flops Matrix Number cusparse cusparse 3 cusparse cusparse Threads per Row (NT) NT 6 00 Rows NNZ NNZ/Row 013 Information Processing Society of Japan 5

6 Ver.1: Read Only Cache Ver.: Avoid Outer Loop Ver.3: Shuffle Instruction Ver.4: All Speedup (Tesla K0, Double Precision) Ver. 0 Matrix Number 5 Flops NNZ/Row NNZ/Row 7 1 NT NT NNZ/Row NNZ/Row NT NT cusparse cusparse NNZ NNZ/Row NT Kepler 5 Ver. 0 Ver. 1 3 Fermi Ver. 1 Ver. Ver. 3 Ver. 4 Ver. 1 3 Kepler Ver. 0 Fermi Ver. 1 3 Ver. 0 Kepler Ver. 1 x Fermi Ver. 1 Ver. 4 5 Kepler Ver. 0 Ver. 1 4 Kepler Ver. 4 Fermi Ver Ver NT 3 CRS SpMV Fermi 7. Kepler GPU CRS SpMV Fermi GPU Kepler x 013 Information Processing Society of Japan 6

7 Kepler Tesla K0 CUDA5.0 cusparse CRS SpMV Kepler GPU JST CREST GPU, Manycore, and Heterogeneous Systems (InPar 01), pp. 1 1 (01). [1] Yoshizawa, H. and Takahashi, D.: Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS format on GPUs, Proc. 15th IEEE International Conference on Computational Science and Engineering (CSE 01), pp (01). [13] Corporation, N.: Whitepaper NVIDIA s Next Generation CUDA Compute Architecture: Kepler GK110 Kepler-GK110-Architecture-Whitepaper.pdf (01). [14] Davis, T. and Hu, Y.: The University of Florida Sparse Matrix Collection, [1] Bell, N. and Garland, M.: Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report, No. NVR (008). [] Segmented Scan CUDA Vol. 010-HPC-16, No. 1, pp. 1 7 (010). [3] Xu, W., Zhang, H., Jiao, S., Wang, D., Song, F. and Liu, Z.: Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU, Proc. 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 01), pp (01). [4] Feng, X., Jin, H., Zheng, R., Hu, K., Zeng, J. and Shao, Z.: Optimization of Sparse Matrix-Vector Multiplication with Variant CSR on GPUs, Proc. IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS 011), pp (011). [5] Matam, K. and Kothapalli, K.: Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU, Proc. International Conference on Parallel Processing (ICPP 011), pp (011). [6] Kubota, Y. and Takahashi, D.: Optimization of Sparse Matrix-Vector Multiplication by Auto Selecting Storage Schemes on GPU, Proc. 11th International Conference on Computational Science and Its Applications (ICCSA 011), Part II, Lecture Notes in Computer Science, No. 6783, pp (011). [7] NVIDIA Corporation: cusparse Library (included in CUDA Toolkit), [8] Baskaran, M. M. and Bordawekar, R.: Optimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Report, Vol. RC4704 (009). [9] Guo, P. and Wang, L.: Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs, Proc. International Conference on Computational and Information Sciences (ICCIS 010), pp (010). [10] El Zein, A. H. and Rendell, A. P.: Generating Optimal CUDA Sparse Matrix Vector Product Implementations for Evolving GPU Hardware, Concurrency and Computation: Practice and Experience, Vol. 4, pp (01). [11] Reguly, I. and Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs, Proc. Innovative Parallel Computing: Foundations and Applications of 013 Information Processing Society of Japan 7

1 OpenCL OpenCL 1 OpenCL GPU ( ) 1 OpenCL Compute Units Elements OpenCL OpenCL SPMD (Single-Program, Multiple-Data) SPMD OpenCL work-item work-group N

1 OpenCL OpenCL 1 OpenCL GPU ( ) 1 OpenCL Compute Units Elements OpenCL OpenCL SPMD (Single-Program, Multiple-Data) SPMD OpenCL work-item work-group N GPU 1 1 2 1, 3 2, 3 (Graphics Unit: GPU) GPU GPU GPU Evaluation of GPU Computing Based on An Automatic Program Generation Technology Makoto Sugawara, 1 Katsuto Sato, 1 Kazuhiko Komatsu, 2 Hiroyuki Takizawa

More information

07-二村幸孝・出口大輔.indd

07-二村幸孝・出口大輔.indd GPU Graphics Processing Units HPC High Performance Computing GPU GPGPU General-Purpose computation on GPU CPU GPU GPU *1 Intel Quad-Core Xeon E5472 3.0 GHz 2 6 MB L2 cache 1600 MHz FSB 80 GFlops 1 nvidia

More information

1 GPU GPGPU GPU CPU 2 GPU 2007 NVIDIA GPGPU CUDA[3] GPGPU CUDA GPGPU CUDA GPGPU GPU GPU GPU Graphics Processing Unit LSI LSI CPU ( ) DRAM GPU LSI GPU

1 GPU GPGPU GPU CPU 2 GPU 2007 NVIDIA GPGPU CUDA[3] GPGPU CUDA GPGPU CUDA GPGPU GPU GPU GPU Graphics Processing Unit LSI LSI CPU ( ) DRAM GPU LSI GPU GPGPU (I) GPU GPGPU 1 GPU(Graphics Processing Unit) GPU GPGPU(General-Purpose computing on GPUs) GPU GPGPU GPU ( PC ) PC PC GPU PC PC GPU GPU 2008 TSUBAME NVIDIA GPU(Tesla S1070) TOP500 29 [1] 2009 AMD

More information

Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] c

Vol.214-HPC-145 No /7/3 C #pragma acc directive-name [clause [[,] clause] ] new-line structured block Fortran!$acc directive-name [clause [[,] c Vol.214-HPC-145 No.45 214/7/3 OpenACC 1 3,1,2 1,2 GPU CUDA OpenCL OpenACC OpenACC High-level OpenACC CPU Intex Xeon Phi K2X GPU Intel Xeon Phi 27% K2X GPU 24% 1. TSUBAME2.5 CPU GPU CUDA OpenCL CPU OpenMP

More information

211 年ハイパフォーマンスコンピューティングと計算科学シンポジウム Computing Symposium 211 HPCS /1/18 a a 1 a 2 a 3 a a GPU Graphics Processing Unit GPU CPU GPU GPGPU G

211 年ハイパフォーマンスコンピューティングと計算科学シンポジウム Computing Symposium 211 HPCS /1/18 a a 1 a 2 a 3 a a GPU Graphics Processing Unit GPU CPU GPU GPGPU G 211 年ハイパフォーマンスコンピューティングと計算科学シンポジウム Computing Symposium 211 HPCS211 211/1/18 GPU 4 8 BLAS 4 8 BLAS Basic Linear Algebra Subprograms GPU Graphics Processing Unit 4 8 double 2 4 double-double DD 4 4 8 quad-double

More information

Slides: TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments

Slides: TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments 計算機アーキテクチャ第 11 回 マルチプロセッサ 本資料は授業用です 無断で転載することを禁じます 名古屋大学 大学院情報科学研究科 准教授加藤真平 デスクトップ ジョブレベル並列性 スーパーコンピュータ 並列処理プログラム プログラムの並列化 for (i = 0; i < N; i++) { x[i] = a[i] + b[i]; } プログラムの並列化 x[0] = a[0] + b[0];

More information

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h

23 Fig. 2: hwmodulev2 3. Reconfigurable HPC 3.1 hw/sw hw/sw hw/sw FPGA PC FPGA PC FPGA HPC FPGA FPGA hw/sw hw/sw hw- Module FPGA hwmodule hw/sw FPGA h 23 FPGA CUDA Performance Comparison of FPGA Array with CUDA on Poisson Equation (lijiang@sekine-lab.ei.tuat.ac.jp), (kazuki@sekine-lab.ei.tuat.ac.jp), (takahashi@sekine-lab.ei.tuat.ac.jp), (tamukoh@cc.tuat.ac.jp),

More information

GPGPU

GPGPU GPGPU 2013 1008 2015 1 23 Abstract In recent years, with the advance of microscope technology, the alive cells have been able to observe. On the other hand, from the standpoint of image processing, the

More information

HPC (pay-as-you-go) HPC Web 2

HPC (pay-as-you-go) HPC Web 2 ,, 1 HPC (pay-as-you-go) HPC Web 2 HPC Amazon EC2 OpenFOAM GPU EC2 3 HPC MPI MPI Courant 1 GPGPU MPI 4 AMAZON EC2 GPU CLUSTER COMPUTE INSTANCE EC2 GPU (cg1.4xlarge) ( N. Virgina ) Quadcore Intel Xeon 5570

More information

GPU GPU CPU CPU CPU GPU GPU N N CPU ( ) 1 GPU CPU GPU 2D 3D CPU GPU GPU GPGPU GPGPU 2 nvidia GPU CUDA 3 GPU 3.1 GPU Core 1

GPU GPU CPU CPU CPU GPU GPU N N CPU ( ) 1 GPU CPU GPU 2D 3D CPU GPU GPU GPGPU GPGPU 2 nvidia GPU CUDA 3 GPU 3.1 GPU Core 1 GPU 4 2010 8 28 1 GPU CPU CPU CPU GPU GPU N N CPU ( ) 1 GPU CPU GPU 2D 3D CPU GPU GPU GPGPU GPGPU 2 nvidia GPU CUDA 3 GPU 3.1 GPU Core 1 Register & Shared Memory ( ) CPU CPU(Intel Core i7 965) GPU(Tesla

More information

( CUDA CUDA CUDA CUDA ( NVIDIA CUDA I

(    CUDA CUDA CUDA CUDA (  NVIDIA CUDA I GPGPU (II) GPGPU CUDA 1 GPGPU CUDA(CUDA Unified Device Architecture) CUDA NVIDIA GPU *1 C/C++ (nvcc) CUDA NVIDIA GPU GPU CUDA CUDA 1 CUDA CUDA 2 CUDA NVIDIA GPU PC Windows Linux MaxOSX CUDA GPU CUDA NVIDIA

More information

GPU CUDA CUDA 2010/06/28 1

GPU CUDA CUDA 2010/06/28 1 GPU CUDA CUDA 2010/06/28 1 GPU NVIDIA Mark Harris, Optimizing Parallel Reduction in CUDA http://developer.download.nvidia.com/ compute/cuda/1_1/website/data- Parallel_Algorithms.html#reduction CUDA SDK

More information

,4) 1 P% P%P=2.5 5%!%! (1) = (2) l l Figure 1 A compilation flow of the proposing sampling based architecture simulation

,4) 1 P% P%P=2.5 5%!%! (1) = (2) l l Figure 1 A compilation flow of the proposing sampling based architecture simulation 1 1 1 1 SPEC CPU 2000 EQUAKE 1.6 50 500 A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes GAKUHO TAGUCHI 1 YOUICHI ABE 1 KEIJI KIMURA 1

More information

KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(dou

KBLAS[7] *1., CUBLAS.,,, Byte/flop., [13] 1 2. (AT). GPU AT,, GPU SYMV., SYMV CUDABLAS., (double, float) (cu- FloatComplex, cudoublecomplex).,, DD(dou Vol.214-HPC-146 No.14 214/1/3 CUDA-xSYMV 1,3,a) 1 2,3 2,3 (SYMV)., (GEMV) 2.,, mutex., CUBLAS., 1 2,. (AT). 2, SYMV GPU., SSYMV( SYMV), GeForce GTXTitan Black 211GFLOPS( 62.8%)., ( ) (, ) DD(double-double),

More information

untitled

untitled A = QΛQ T A n n Λ Q A = XΛX 1 A n n Λ X GPGPU A 3 T Q T AQ = T (Q: ) T u i = λ i u i T {λ i } {u i } QR MR 3 v i = Q u i A {v i } A n = 9000 Quad Core Xeon 2 LAPACK (4/3) n 3 O(n 2 ) O(n 3 ) A {v i }

More information

IPSJ SIG Technical Report Vol.2014-ARC-213 No.24 Vol.2014-HPC-147 No /12/10 GPU 1,a) 1,b) 1,c) 1,d) GPU GPU Structure Of Array Array Of

IPSJ SIG Technical Report Vol.2014-ARC-213 No.24 Vol.2014-HPC-147 No /12/10 GPU 1,a) 1,b) 1,c) 1,d) GPU GPU Structure Of Array Array Of GPU 1,a) 1,b) 1,c) 1,d) GPU 1 GPU Structure Of Array Array Of Structure 1. MPS(Moving Particle Semi-Implicit) [1] SPH(Smoothed Particle Hydrodynamics) [] DEM(Distinct Element Method)[] [] 1 Tokyo Institute

More information

倍々精度RgemmのnVidia C2050上への実装と応用

倍々精度RgemmのnVidia C2050上への実装と応用 .. maho@riken.jp http://accc.riken.jp/maho/,,, 2011/2/16 1 - : GPU : SDPA-DD 10 1 - Rgemm : 4 (32 ) nvidia C2050, GPU CPU 150, 24GFlops 25 20 GFLOPS 15 10 QuadAdd Cray, QuadMul Sloppy Kernel QuadAdd Cray,

More information

1 4 1.1........................................... 4 1.2.................................. 4 1.3................................... 4 2 5 2.1 GPU.....

1 4 1.1........................................... 4 1.2.................................. 4 1.3................................... 4 2 5 2.1 GPU..... CPU GPU N Q07-065 2011 2 17 1 1 4 1.1........................................... 4 1.2.................................. 4 1.3................................... 4 2 5 2.1 GPU...........................................

More information

DO 時間積分 START 反変速度の計算 contravariant_velocity 移流項の計算 advection_adams_bashforth_2nd DO implicit loop( 陰解法 ) 速度勾配, 温度勾配の計算 gradient_cell_center_surface 速

DO 時間積分 START 反変速度の計算 contravariant_velocity 移流項の計算 advection_adams_bashforth_2nd DO implicit loop( 陰解法 ) 速度勾配, 温度勾配の計算 gradient_cell_center_surface 速 1 1, 2 1, 2 3 2, 3 4 GP LES ASUCA LES NVIDIA CUDA LES 1. Graphics Processing Unit GP General-Purpose SIMT Single Instruction Multiple Threads 1 2 3 4 1),2) LES Large Eddy Simulation 3) ASUCA 4) LES LES

More information

HPC pdf

HPC pdf GPU 1 1 2 2 1 1024 3 GPUGraphics Unit1024 3 GPU GPU GPU GPU 1024 3 Tesla S1070-400 1 GPU 2.6 Accelerating Out-of-core Cone Beam Reconstruction Using GPU Yusuke Okitsu, 1 Fumihiko Ino, 1 Taketo Kishi, 2

More information

Run-Based Trieから構成される 決定木の枝刈り法

Run-Based Trieから構成される  決定木の枝刈り法 Run-Based Trie 2 2 25 6 Run-Based Trie Simple Search Run-Based Trie Network A Network B Packet Router Packet Filtering Policy Rule Network A, K Network B Network C, D Action Permit Deny Permit Network

More information

Microsoft PowerPoint - GPU_computing_2013_01.pptx

Microsoft PowerPoint - GPU_computing_2013_01.pptx GPU コンピューティン No.1 導入 東京工業大学 学術国際情報センター 青木尊之 1 GPU とは 2 GPGPU (General-purpose computing on graphics processing units) GPU を画像処理以外の一般的計算に使う GPU の魅力 高性能 : ハイエンド GPU はピーク 4 TFLOPS 超 手軽さ : 普通の PC にも装着できる 低価格

More information

CUDA を用いた画像処理 画像処理を CUDA で並列化 基本的な並列化の考え方 目標 : 妥当な Naïve コードが書ける 最適化の初歩がわかる ブロックサイズ メモリアクセスパターン

CUDA を用いた画像処理 画像処理を CUDA で並列化 基本的な並列化の考え方 目標 : 妥当な Naïve コードが書ける 最適化の初歩がわかる ブロックサイズ メモリアクセスパターン CUDA 画像処理入門 エヌビディアジャパン CUDA エンジニア森野慎也 GTC Japan 2014 CUDA を用いた画像処理 画像処理を CUDA で並列化 基本的な並列化の考え方 目標 : 妥当な Naïve コードが書ける 最適化の初歩がわかる ブロックサイズ メモリアクセスパターン RGB Y( 輝度 ) 変換 カラー画像から グレイスケールへの変換 Y = 0.299 R + 0.587

More information

! 行行 CPUDSP PPESPECell/B.E. CPUGPU 行行 SIMD [SSE, AltiVec] 用 HPC CPUDSP PPESPE (Cell/B.E.) SPE CPUGPU GPU CPU DSP DSP PPE SPE SPE CPU DSP SPE 2

! 行行 CPUDSP PPESPECell/B.E. CPUGPU 行行 SIMD [SSE, AltiVec] 用 HPC CPUDSP PPESPE (Cell/B.E.) SPE CPUGPU GPU CPU DSP DSP PPE SPE SPE CPU DSP SPE 2 ! OpenCL [Open Computing Language] 言 [OpenCL C 言 ] CPU, GPU, Cell/B.E.,DSP 言 行行 [OpenCL Runtime] OpenCL C 言 API Khronos OpenCL Working Group AMD Broadcom Blizzard Apple ARM Codeplay Electronic Arts Freescale

More information

tabaicho3mukunoki.pptx

tabaicho3mukunoki.pptx 1 2 はじめに n 目的 4倍精度演算より高速な3倍精度演算を実現する l 倍精度では足りないが4倍精度は必要ないケースに欲しい l 4倍精度に比べてデータサイズが小さい Ø 少なくともメモリ律速な計算では4倍精度よりデータ 転送時間を減らすことが可能 Ø PCIeやノード間通信がボトルネックとなりやすい GPUクラスタ環境に有効か n 研究概要 l DD型4倍精度演算 DD演算 に基づく3倍精度演算

More information

1. GPU コンピューティング GPU コンピューティング GPUによる 汎用コンピューティング GPU = Graphics Processing Unit CUDA Compute Unified Device Architecture NVIDIA の GPU コンピューティング環境 Lin

1. GPU コンピューティング GPU コンピューティング GPUによる 汎用コンピューティング GPU = Graphics Processing Unit CUDA Compute Unified Device Architecture NVIDIA の GPU コンピューティング環境 Lin Windows で始める CUDA 入門 GTC 2013 チュートリアル エヌビディアジャパン CUDA エンジニア森野慎也 1. GPU コンピューティング GPU コンピューティング GPUによる 汎用コンピューティング GPU = Graphics Processing Unit CUDA Compute Unified Device Architecture NVIDIA の GPU コンピューティング環境

More information

main.dvi

main.dvi PC 1 1 [1][2] [3][4] ( ) GPU(Graphics Processing Unit) GPU PC GPU PC ( 2 GPU ) GPU Harris Corner Detector[5] CPU ( ) ( ) CPU GPU 2 3 GPU 4 5 6 7 1 toyohiro@isc.kyutech.ac.jp 45 2 ( ) CPU ( ) ( ) () 2.1

More information

untitled

untitled A = QΛQ T A n n Λ Q A = XΛX 1 A n n Λ X GPGPU A 3 T Q T AQ = T (Q: ) T u i = λ i u i T {λ i } {u i } QR MR 3 v i = Q u i A {v i } A n = 9000 Quad Core Xeon 2 LAPACK (4/3) n 3 O(n 2 ) O(n 3 ) A {v i }

More information

2). 3) 4) 1.2 NICTNICT DCRA Dihedral Corner Reflector micro-arraysdcra DCRA DCRA DCRA 3D DCRA PC USB PC PC ON / OFF Velleman K8055 K8055 K8055

2). 3) 4) 1.2 NICTNICT DCRA Dihedral Corner Reflector micro-arraysdcra DCRA DCRA DCRA 3D DCRA PC USB PC PC ON / OFF Velleman K8055 K8055 K8055 1 1 1 2 DCRA 1. 1.1 1) 1 Tactile Interface with Air Jets for Floating Images Aya Higuchi, 1 Nomin, 1 Sandor Markon 1 and Satoshi Maekawa 2 The new optical device DCRA can display floating images in free

More information

xx/xx Vol. Jxx A No. xx 1 Fig. 1 PAL(Panoramic Annular Lens) PAL(Panoramic Annular Lens) PAL (2) PAL PAL 2 PAL 3 2 PAL 1 PAL 3 PAL PAL 2. 1 PAL

xx/xx Vol. Jxx A No. xx 1 Fig. 1 PAL(Panoramic Annular Lens) PAL(Panoramic Annular Lens) PAL (2) PAL PAL 2 PAL 3 2 PAL 1 PAL 3 PAL PAL 2. 1 PAL PAL On the Precision of 3D Measurement by Stereo PAL Images Hiroyuki HASE,HirofumiKAWAI,FrankEKPAR, Masaaki YONEDA,andJien KATO PAL 3 PAL Panoramic Annular Lens 1985 Greguss PAL 1 PAL PAL 2 3 2 PAL DP

More information

GPU n Graphics Processing Unit CG CAD

GPU n Graphics Processing Unit CG CAD GPU 2016/06/27 第 20 回 GPU コンピューティング講習会 ( 東京工業大学 ) 1 GPU n Graphics Processing Unit CG CAD www.nvidia.co.jp www.autodesk.co.jp www.pixar.com GPU n GPU ü n NVIDIA CUDA ü NVIDIA GPU ü OS Linux, Windows, Mac

More information

マルチコアPCクラスタ環境におけるBDD法のハイブリッド並列実装

マルチコアPCクラスタ環境におけるBDD法のハイブリッド並列実装 2010 GPGPU 2010 9 29 MPI/Pthread (DDM) DDM CPU CPU CPU CPU FEM GPU FEM CPU Mult - NUMA Multprocessng Cell GPU Accelerator, GPU CPU Heterogeneous computng L3 cache L3 cache CPU CPU + GPU GPU L3 cache 4

More information

IPSJ SIG Technical Report Vol.2013-ARC-203 No /2/1 SMYLE OpenCL (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 1

IPSJ SIG Technical Report Vol.2013-ARC-203 No /2/1 SMYLE OpenCL (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 1 SMYLE OpenCL 128 1 1 1 1 1 2 2 3 3 3 (NEDO) IT FPGA SMYLEref SMYLE OpenCL SMYLE OpenCL FPGA 128 SMYLEref SMYLE OpenCL SMYLE OpenCL Implementation and Evaluations on 128 Cores Takuji Hieda 1 Noriko Etani

More information

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved (Version: 2013/5/16) Intel CPU (kashi@waseda.jp) 1 Intel CPU( AMD CPU) 64bit SIMD Inline Assemler Windows Visual C++ Linux gcc 2 FPU SSE2 Intel CPU double 8087 FPU (floating point number processing unit)

More information

10D16.dvi

10D16.dvi D IEEJ Transactions on Industry Applications Vol.136 No.10 pp.686 691 DOI: 10.1541/ieejias.136.686 NW Accelerating Techniques for Sequence Alignment based on an Extended NW Algorithm Jin Okaze, Non-member,

More information

rank ”«‘‚“™z‡Ì GPU ‡É‡æ‡éŁÀŠñ›»

rank ”«‘‚“™z‡Ì GPU ‡É‡æ‡éŁÀŠñ›» rank GPU ERATO 2011 11 1 1 / 26 GPU rank/select wavelet tree balanced parenthesis GPU rank 2 / 26 GPU rank/select wavelet tree balanced parenthesis GPU rank 2 / 26 GPU rank/select wavelet tree balanced

More information

NUMAの構成

NUMAの構成 GPU のプログラム 天野 アクセラレータとは? 特定の性質のプログラムを高速化するプロセッサ 典型的なアクセラレータ GPU(Graphic Processing Unit) Xeon Phi FPGA(Field Programmable Gate Array) 最近出て来た Deep Learning 用ニューロチップなど Domain Specific Architecture 1GPGPU:General

More information

PC Development of Distributed PC Grid System,,,, Junji Umemoto, Hiroyuki Ebara, Katsumi Onishi, Hiroaki Morikawa, and Bunryu U PC WAN PC PC WAN PC 1 P

PC Development of Distributed PC Grid System,,,, Junji Umemoto, Hiroyuki Ebara, Katsumi Onishi, Hiroaki Morikawa, and Bunryu U PC WAN PC PC WAN PC 1 P PC Development of Distributed PC Grid System,,,, Junji Umemoto, Hiroyuki Ebara, Katsumi Onishi, Hiroaki Morikawa, and Bunryu U PC WAN PC PC WAN PC 1 PC PC PC PC PC Key Words:Grid, PC Cluster, Distributed

More information

IPSJ SIG Technical Report iphone iphone,,., OpenGl ES 2.0 GLSL(OpenGL Shading Language), iphone GPGPU(General-Purpose Computing on Graphics Proc

IPSJ SIG Technical Report iphone iphone,,., OpenGl ES 2.0 GLSL(OpenGL Shading Language), iphone GPGPU(General-Purpose Computing on Graphics Proc iphone 1 1 1 iphone,,., OpenGl ES 2.0 GLSL(OpenGL Shading Language), iphone GPGPU(General-Purpose Computing on Graphics Processing Unit)., AR Realtime Natural Feature Tracking Library for iphone Makoto

More information

IPSJ SIG Technical Report Vol.2012-ARC-202 No.13 Vol.2012-HPC-137 No /12/13 Tightly Coupled Accelerators 1,a) 1,b) 1,c) 1,d) GPU HA-PACS

IPSJ SIG Technical Report Vol.2012-ARC-202 No.13 Vol.2012-HPC-137 No /12/13 Tightly Coupled Accelerators 1,a) 1,b) 1,c) 1,d) GPU HA-PACS Tightly Coupled Accelerators 1,a) 1,b) 1,c) 1,d) HA-PACS 2012 2 HA-PACS TCA (Tightly Coupled Accelerators) TCA PEACH2 1. (Graphics Processing Unit) HPC GP(General Purpose ) TOP500 [1] CPU PCI Express (PCIe)

More information

IPSJ SIG Technical Report Vol.2014-DBS-159 No.6 Vol.2014-IFAT-115 No /8/1 1,a) 1 1 1,, 1. ([1]) ([2], [3]) A B 1 ([4]) 1 Graduate School of Info

IPSJ SIG Technical Report Vol.2014-DBS-159 No.6 Vol.2014-IFAT-115 No /8/1 1,a) 1 1 1,, 1. ([1]) ([2], [3]) A B 1 ([4]) 1 Graduate School of Info 1,a) 1 1 1,, 1. ([1]) ([2], [3]) A B 1 ([4]) 1 Graduate School of Information Science and Technology, Osaka University a) kawasumi.ryo@ist.osaka-u.ac.jp 1 1 Bucket R*-tree[5] [4] 2 3 4 5 6 2. 2.1 2.2 2.3

More information

SQUFOF NTT Shanks SQUFOF SQUFOF Pentium III Pentium 4 SQUFOF 2.03 (Pentium 4 2.0GHz Willamette) N UBASIC 50 / 200 [

SQUFOF NTT Shanks SQUFOF SQUFOF Pentium III Pentium 4 SQUFOF 2.03 (Pentium 4 2.0GHz Willamette) N UBASIC 50 / 200 [ SQUFOF SQUFOF NTT 2003 2 17 16 60 Shanks SQUFOF SQUFOF Pentium III Pentium 4 SQUFOF 2.03 (Pentium 4 2.0GHz Willamette) 60 1 1.1 N 62 16 24 UBASIC 50 / 200 [ 01] 4 large prime 943 2 1 (%) 57 146 146 15

More information

it-ken_open.key

it-ken_open.key 深層学習技術の進展 ImageNet Classification 画像認識 音声認識 自然言語処理 機械翻訳 深層学習技術は これらの分野において 特に圧倒的な強みを見せている Figure (Left) Eight ILSVRC-2010 test Deep images and the cited4: from: ``ImageNet Classification with Networks et

More information

,,, 2 ( ), $[2, 4]$, $[21, 25]$, $V$,, 31, 2, $V$, $V$ $V$, 2, (b) $-$,,, (1) : (2) : (3) : $r$ $R$ $r/r$, (4) : 3

,,, 2 ( ), $[2, 4]$, $[21, 25]$, $V$,, 31, 2, $V$, $V$ $V$, 2, (b) $-$,,, (1) : (2) : (3) : $r$ $R$ $r/r$, (4) : 3 1084 1999 124-134 124 3 1 (SUGIHARA Kokichi),,,,, 1, [5, 11, 12, 13], (2, 3 ), -,,,, 2 [5], 3,, 3, 2 2, -, 3,, 1,, 3 2,,, 3 $R$ ( ), $R$ $R$ $V$, $V$ $R$,,,, 3 2 125 1 3,,, 2 ( ), $[2, 4]$, $[21, 25]$,

More information

MATLAB® における並列・分散コンピューティング ~ Parallel Computing Toolbox™ & MATLAB Distributed Computing Server™ ~

MATLAB® における並列・分散コンピューティング ~ Parallel Computing Toolbox™ & MATLAB Distributed Computing Server™ ~ MATLAB における並列 分散コンピューティング ~ Parallel Computing Toolbox & MATLAB Distributed Computing Server ~ MathWorks Japan Application Engineering Group Takashi Yoshida 2016 The MathWorks, Inc. 1 System Configuration

More information

untitled

untitled GPGPU NVIDACUDA Learn More about CUDA - NVIDIA http://www.nvidia.co.jp/object/cuda_education_jp.html NVIDIA CUDA programming Guide CUDA http://www.sintef.no/upload/ikt/9011/simoslo/evita/2008/seland.pdf

More information

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2

1 Fig. 1 Extraction of motion,.,,, 4,,, 3., 1, 2. 2.,. CHLAC,. 2.1,. (256 ).,., CHLAC. CHLAC, HLAC. 2.3 (HLAC ) r,.,. HLAC. N. 2 HLAC Fig. 2 CHLAC 1 2 3 3,. (CHLAC), 1).,.,, CHLAC,.,. Suspicious Behavior Detection based on CHLAC Method Hideaki Imanishi, 1 Toyohiro Hayashi, 2 Shuichi Enokida 3 and Toshiaki Ejima 3 We have proposed a method for

More information

[4] ACP (Advanced Communication Primitives) [1] ACP ACP [2] ACP Tofu UDP [3] HPC InfiniBand InfiniBand ACP 2 ACP, 3 InfiniBand ACP 4 5 ACP 2. ACP ACP

[4] ACP (Advanced Communication Primitives) [1] ACP ACP [2] ACP Tofu UDP [3] HPC InfiniBand InfiniBand ACP 2 ACP, 3 InfiniBand ACP 4 5 ACP 2. ACP ACP InfiniBand ACP 1,5,a) 1,5,b) 2,5 1,5 4,5 3,5 2,5 ACE (Advanced Communication for Exa) ACP (Advanced Communication Primitives) HPC InfiniBand ACP InfiniBand ACP ACP InfiniBand Open MPI 20% InfiniBand Implementation

More information

1 3DCG [2] 3DCG CG 3DCG [3] 3DCG 3 3 API 2 3DCG 3 (1) Saito [4] (a) 1920x1080 (b) 1280x720 (c) 640x360 (d) 320x G-Buffer Decaudin[5] G-Buffer D

1 3DCG [2] 3DCG CG 3DCG [3] 3DCG 3 3 API 2 3DCG 3 (1) Saito [4] (a) 1920x1080 (b) 1280x720 (c) 640x360 (d) 320x G-Buffer Decaudin[5] G-Buffer D 3DCG 1) ( ) 2) 2) 1) 2) Real-Time Line Drawing Using Image Processing and Deforming Process Together in 3DCG Takeshi Okuya 1) Katsuaki Tanaka 2) Shigekazu Sakai 2) 1) Department of Intermedia Art and Science,

More information

IPSJ SIG Technical Report Vol.2013-ARC-206 No /8/1 Android Dominic Hillenbrand ODROID-X2 GPIO Android OSCAR WFI 500[us] GPIO GP

IPSJ SIG Technical Report Vol.2013-ARC-206 No /8/1 Android Dominic Hillenbrand ODROID-X2 GPIO Android OSCAR WFI 500[us] GPIO GP Android 1 1 1 1 1 Dominic Hillenbrand 1 1 1 ODROID-X2 GPIO Android OSCAR WFI 500[us] GPIO GPIO API GPIO API GPIO MPEG2 Optical Flow MPEG2 1PE 0.97[W] 0.63[W] 2PE 1.88[w] 0.46[W] 3PE 2.79[W] 0.37[W] Optical

More information

B 2 Thin Q=3 0 0 P= N ( )P Q = 2 3 ( )6 N N TSUB- Hub PCI-Express (PCIe) Gen 2 x8 AME1 5) 3 GPU Socket 0 High-performance Linpack 1

B 2 Thin Q=3 0 0 P= N ( )P Q = 2 3 ( )6 N N TSUB- Hub PCI-Express (PCIe) Gen 2 x8 AME1 5) 3 GPU Socket 0 High-performance Linpack 1 TSUBAME 2.0 Linpack 1,,,, Intel NVIDIA GPU 2010 11 TSUBAME 2.0 Linpack 2CPU 3GPU 1400 Dual-Rail QDR InfiniBand TSUBAME 1.0 30 2.4PFlops TSUBAME 1.0 Linpack GPU 1.192PFlops PFlops Top500 4 Achievement of

More information

IPSJ SIG Technical Report Vol.2017-MUS-116 No /8/24 MachineDancing: 1,a) 1,b) 3 MachineDancing MachineDancing MachineDancing 1 MachineDan

IPSJ SIG Technical Report Vol.2017-MUS-116 No /8/24 MachineDancing: 1,a) 1,b) 3 MachineDancing MachineDancing MachineDancing 1 MachineDan MachineDancing: 1,a) 1,b) 3 MachineDancing 2 1. 3 MachineDancing MachineDancing 1 MachineDancing MachineDancing [1] 1 305 0058 1-1-1 a) s.fukayama@aist.go.jp b) m.goto@aist.go.jp 1 MachineDancing 3 CG

More information

Microsoft PowerPoint - GPGPU実践基礎工学(web).pptx

Microsoft PowerPoint - GPGPU実践基礎工学(web).pptx GPU のプログラム構造 長岡技術科学大学電気電子情報工学専攻出川智啓 今回の内容 GPU プログラミング環境 (CUDA) GPU プログラムの実行の流れ CUDA によるプログラムの記述 カーネル (GPU で処理する関数 ) の構造 記述方法とその理由 GPU 固有のパラメータの確認 405 GPU(Graphics Processing Unit) とは 画像処理専用のハードウェア 具体的には画像処理用のチップ

More information

CPU Levels in the memory hierarchy Level 1 Level 2... Increasing distance from the CPU in access time Level n Size of the memory at each level 1: 2.2

CPU Levels in the memory hierarchy Level 1 Level 2... Increasing distance from the CPU in access time Level n Size of the memory at each level 1: 2.2 FFT 1 Fourier fast Fourier transform FFT FFT FFT 1 FFT FFT 2 Fourier 2.1 Fourier FFT Fourier discrete Fourier transform DFT DFT n 1 y k = j=0 x j ω jk n, 0 k n 1 (1) x j y k ω n = e 2πi/n i = 1 (1) n DFT

More information

Fuzzy Multiple Discrimminant Analysis (FMDA) 5) (SOM) 6) SOM 3 6) SOM SOM SOM SOM SOM SOM 7) 8) SOM SOM SOM GPU 2. n k f(x) m g(x) (1) 12) { min(max)

Fuzzy Multiple Discrimminant Analysis (FMDA) 5) (SOM) 6) SOM 3 6) SOM SOM SOM SOM SOM SOM 7) 8) SOM SOM SOM GPU 2. n k f(x) m g(x) (1) 12) { min(max) SOM 1 2 2 3 1 (SOM: Self-Organizing Maps) 3 SOM SOM SOM SOM GPU A Study on Visualization of Pareto Solutions by Spherical Self-Organizing Maps MASATO YOSHIMI, 1 KANAME NISHIMOTO, 2 LUYI WANG, 2 TOMOYUKI

More information

untitled

untitled c NUMA 1. 18 (Moore s law) 1Hz CPU 2. 1 (Register) (RAM) Level 1 (L1) L2 L3 L4 TLB (translation look-aside buffer) (OS) TLB TLB 3. NUMA NUMA (Non-uniform memory access) 819 0395 744 1 2014 10 Copyright

More information

4. C i k = 2 k-means C 1 i, C 2 i 5. C i x i p [ f(θ i ; x) = (2π) p 2 Vi 1 2 exp (x µ ] i) t V 1 i (x µ i ) 2 BIC BIC = 2 log L( ˆθ i ; x i C i ) + q

4. C i k = 2 k-means C 1 i, C 2 i 5. C i x i p [ f(θ i ; x) = (2π) p 2 Vi 1 2 exp (x µ ] i) t V 1 i (x µ i ) 2 BIC BIC = 2 log L( ˆθ i ; x i C i ) + q x-means 1 2 2 x-means, x-means k-means Bayesian Information Criterion BIC Watershed x-means Moving Object Extraction Using the Number of Clusters Determined by X-means Clustering Naoki Kubo, 1 Kousuke

More information

Slide 1

Slide 1 CUDA プログラミングの基本 パート II - カーネル CUDA の基本の概要 パート I CUDAのソフトウェアスタックとコンパイル GPUのメモリ管理 パート II カーネルの起動 GPUコードの具体像 注 : 取り上げているのは基本事項のみです そのほか多数の API 関数についてはプログラミングガイドを ご覧ください GPU 上でのコードの実行 カーネルは C 関数 + 多少の制約 ホストメモリはアクセスできない戻り値型は

More information

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L

Input image Initialize variables Loop for period of oscillation Update height map Make shade image Change property of image Output image Change time L 1,a) 1,b) 1/f β Generation Method of Animation from Pictures with Natural Flicker Abstract: Some methods to create animation automatically from one picture have been proposed. There is a method that gives

More information

IDRstab(s, L) GBiCGSTAB(s, L) 2. AC-GBiCGSTAB(s, L) Ax = b (1) A R n n x R n b R n 2.1 IDR s L r k+1 r k+1 = b Ax k+1 IDR(s) r k+1 = (I ω k A)(r k dr

IDRstab(s, L) GBiCGSTAB(s, L) 2. AC-GBiCGSTAB(s, L) Ax = b (1) A R n n x R n b R n 2.1 IDR s L r k+1 r k+1 = b Ax k+1 IDR(s) r k+1 = (I ω k A)(r k dr 1 2 IDR(s) GBiCGSTAB(s, L) IDR(s) IDRstab(s, L) GBiCGSTAB(s, L) Verification of effectiveness of Auto-Correction technique applied to preconditioned iterative methods Keiichi Murakami 1 Seiji Fujino 2

More information

IEEE HDD RAID MPI MPU/CPU GPGPU GPU cm I m cm /g I I n/ cm 2 s X n/ cm s cm g/cm

IEEE HDD RAID MPI MPU/CPU GPGPU GPU cm I m cm /g I I n/ cm 2 s X n/ cm s cm g/cm Neutron Visual Sensing Techniques Making Good Use of Computer Science J-PARC CT CT-PET TB IEEE HDD RAID MPI MPU/CPU GPGPU GPU cm I m cm /g I I n/ cm 2 s X n/ cm s cm g/cm cm cm barn cm thn/ cm s n/ cm

More information

~~~~~~~~~~~~~~~~~~ wait Call CPU time 1, latch: library cache 7, latch: library cache lock 4, job scheduler co

~~~~~~~~~~~~~~~~~~ wait Call CPU time 1, latch: library cache 7, latch: library cache lock 4, job scheduler co 072 DB Magazine 2007 September ~~~~~~~~~~~~~~~~~~ wait Call CPU time 1,055 34.7 latch: library cache 7,278 750 103 24.7 latch: library cache lock 4,194 465 111 15.3 job scheduler coordinator slave wait

More information

スライド 1

スライド 1 GPU クラスタによる格子 QCD 計算 広大理尾崎裕介 石川健一 1.1 Introduction Graphic Processing Units 1 チップに数百個の演算器 多数の演算器による並列計算 ~TFLOPS ( 単精度 ) CPU 数十 GFLOPS バンド幅 ~100GB/s コストパフォーマンス ~$400 GPU の開発環境 NVIDIA CUDA http://www.nvidia.co.jp/object/cuda_home_new_jp.html

More information

iphone GPGPU GPU OpenCL Mac OS X Snow LeopardOpenCL iphone OpenCL OpenCL NVIDIA GPU CUDA GPU GPU GPU 15 GPU GPU CPU GPU iii OpenMP MPI CPU OpenCL CUDA OpenCL CPU OpenCL GPU NVIDIA Fermi GPU Fermi GPU GPU

More information

2. Amazon GPU Cluster Compute Instance Amazon CCI Amazon EC2 CCI GPU Cluster GPU Quadruple Extra Large Instance (cg1.4xlarge) [6] On Demand Inhouse In

2. Amazon GPU Cluster Compute Instance Amazon CCI Amazon EC2 CCI GPU Cluster GPU Quadruple Extra Large Instance (cg1.4xlarge) [6] On Demand Inhouse In Amazon EC2 GPU OpenFOAM 1 1,2 1,3 VM HPC HPC Amazo EC2 GPGPU OpeFOAM GPU OpenFOAM MPI GPGPU 8 EC2 GPU, Cloud, CFD Akihiko Saijo 1 Yasushi Inoguchi 1,2 Teruo Matsuzawa 1,3 1. HPC (Inhouse) IaaS (Infrastructre

More information

(MIRU2010) NTT Graphic Processor Unit GPU graphi

(MIRU2010) NTT Graphic Processor Unit GPU graphi (MIRU2010) 2010 7 889 2192 1-1 905 2171 905 NTT 243 0124 3-1 E-mail: ac094608@edu.okinawa-ct.ac.jp, akisato@ieee.org Graphic Processor Unit GPU graphic processor unit CUDA Fully automatic extraction of

More information

2. Eades 1) Kamada-Kawai 7) Fruchterman 2) 6) ACE 8) HDE 9) Kruskal MDS 13) 11) Kruskal AGI Active Graph Interface 3) Kruskal 5) Kruskal 4) 3. Kruskal

2. Eades 1) Kamada-Kawai 7) Fruchterman 2) 6) ACE 8) HDE 9) Kruskal MDS 13) 11) Kruskal AGI Active Graph Interface 3) Kruskal 5) Kruskal 4) 3. Kruskal 1 2 3 A projection-based method for interactive 3D visualization of complex graphs Masanori Takami, 1 Hiroshi Hosobe 2 and Ken Wakita 3 Proposed is a new interaction technique to manipulate graph layouts

More information

Shonan Institute of Technology MEMOIRS OF SHONAN INSTITUTE OF TECHNOLOGY Vol. 41, No. 1, 2007 Ships1 * ** ** ** Development of a Small-Mid Range Paral

Shonan Institute of Technology MEMOIRS OF SHONAN INSTITUTE OF TECHNOLOGY Vol. 41, No. 1, 2007 Ships1 * ** ** ** Development of a Small-Mid Range Paral MEMOIRS OF SHONAN INSTITUTE OF TECHNOLOGY Vol. 41, No. 1, 2007 Ships1 * ** ** ** Development of a Small-Mid Range Parallel Computer Ships1 Makoto OYA*, Hiroto MATSUBARA**, Kazuyoshi SAKURAI** and Yu KATO**

More information

SWoPP BOF BOF-1 8/3 19:10 BoF SWoPP : BOF-2 8/5 17:00 19:00 HW/SW 15 x5 SimMips/MieruPC M-Core/SimMc FPGA S

SWoPP BOF BOF-1 8/3 19:10 BoF SWoPP :   BOF-2 8/5 17:00 19:00 HW/SW 15 x5 SimMips/MieruPC M-Core/SimMc FPGA S FINAL PROGRAM 23rd Annual Workshop SWoPP 2010 2010 / / 2010 Kanazawa Summer United Workshops on Parallel, Distributed, and Cooperative Processing 2010 8 3 ( ) 8 5 ( ) 920-0864 15 1 http://www.bunka-h.gr.jp/

More information

FINAL PROGRAM 22th Annual Workshop SWoPP / / 2009 Sendai Summer United Workshops on Parallel, Distributed, and Cooperative Processing

FINAL PROGRAM 22th Annual Workshop SWoPP / / 2009 Sendai Summer United Workshops on Parallel, Distributed, and Cooperative Processing FINAL PROGRAM 22th Annual Workshop SWoPP 2009 2009 / / 2009 Sendai Summer United Workshops on Parallel, Distributed, and Cooperative Processing 2009 8 4 ( ) 8 6 ( ) 981-0933 1-2-45 http://www.forestsendai.jp

More information

HPEハイパフォーマンスコンピューティング ソリューション

HPEハイパフォーマンスコンピューティング ソリューション HPE HPC / AI Page 2 No.1 * 24.8% No.1 * HPE HPC / AI HPC AI SGIHPE HPC / AI GPU TOP500 50th edition Nov. 2017 HPE No.1 124 www.top500.org HPE HPC / AI TSUBAME 3.0 2017 7 AI TSUBAME 3.0 HPE SGI 8600 System

More information

Vol.57 No (Mar. 2016) 1,a) , L3 CG VDI VDI A Migration to a Cloud-based Information Infrastructure to Support

Vol.57 No (Mar. 2016) 1,a) , L3 CG VDI VDI A Migration to a Cloud-based Information Infrastructure to Support 1,a) 1 1 2015 6 22, 2015 12 7 L3 CG 50 600 VDI VDI A Migration to a Cloud-based Information Infrastructure to Support University Education and It s Analysis Kaori Maeda 1,a) Nobuo Suematsu 1 Toshiaki Kitamura

More information

FINAL PROGRAM 25th Annual Workshop SWoPP / / 2012 Tottori Summer United Workshops on Parallel, Distributed, and Cooperative Processing 2012

FINAL PROGRAM 25th Annual Workshop SWoPP / / 2012 Tottori Summer United Workshops on Parallel, Distributed, and Cooperative Processing 2012 FINAL PROGRAM 25th Annual Workshop SWoPP 2012 2012 / / 2012 Tottori Summer United Workshops on Parallel, Distributed, and Cooperative Processing 2012 8 1 ( ) 8 3 ( ) 680-0017 101-5 http://www.torikenmin.jp/kenbun/

More information

スライド 1

スライド 1 swk(at)ic.is.tohoku.ac.jp 2 Outline 3 ? 4 S/N CCD 5 Q Q V 6 CMOS 1 7 1 2 N 1 2 N 8 CCD: CMOS: 9 : / 10 A-D A D C A D C A D C A D C A D C A D C ADC 11 A-D ADC ADC ADC ADC ADC ADC ADC ADC ADC A-D 12 ADC

More information

IPSJ SIG Technical Report Vol.2016-ARC-221 No /8/9 GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC Dalv

IPSJ SIG Technical Report Vol.2016-ARC-221 No /8/9 GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC Dalv GC 1 1 GC GC GC GC DalvikVM GC 12.4% 5.7% 1. Garbage Collection: GC GC Java GC GC GC GC DalvikVM[1] GC 1 Nagoya Institute of Technology GC GC 2. GC GC 2.1 GC 1 c 2016 Information Processing Society of

More information

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved

64bit SSE2 SSE2 FPU Visual C++ 64bit Inline Assembler 4 FPU SSE2 4.1 FPU Control Word FPU 16bit R R R IC RC(2) PC(2) R R PM UM OM ZM DM IM R: reserved (Version: 2013/7/10) Intel CPU (kashi@waseda.jp) 1 Intel CPU( AMD CPU) 64bit SIMD Inline Assemler Windows Visual C++ Linux gcc 2 FPU SSE2 Intel CPU double 8087 FPU (floating point number processing unit)

More information

(Version: 2017/4/18) Intel CPU 1 Intel CPU( AMD CPU) 64bit SIMD Inline Assemler Windows Visual C++ Linux gcc 2 FPU SSE2 Intel CPU do

(Version: 2017/4/18) Intel CPU 1 Intel CPU( AMD CPU) 64bit SIMD Inline Assemler Windows Visual C++ Linux gcc 2 FPU SSE2 Intel CPU do (Version: 2017/4/18) Intel CPU (kashi@waseda.jp) 1 Intel CPU( AMD CPU) 64bit SIMD Inline Assemler Windows Visual C++ Linux gcc 2 FPU SSE2 Intel CPU double 8087 FPU (floating point number processing unit)

More information

imai@eng.kagawa-u.ac.jp No1 No2 OS Wintel Intel x86 CPU No3 No4 8bit=2 8 =256(Byte) 16bit=2 16 =65,536(Byte)=64KB= 6 5 32bit=2 32 =4,294,967,296(Byte)=4GB= 43 64bit=2 64 =18,446,744,073,709,551,615(Byte)=16EB

More information

2013 M

2013 M 2013 M0110453 2013 : M0110453 20 1 1 1.1............................ 1 1.2.............................. 4 2 5 2.1................................. 6 2.2................................. 8 2.3.................................

More information

XACCの概要

XACCの概要 2 global void kernel(int a[max], int llimit, int ulimit) {... } : int main(int argc, char *argv[]){ MPI_Int(&argc, &argc); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); dx

More information

AMD/ATI Radeon HD 5870 GPU DEGIMA LINPACK HD 5870 GPU DEGIMA LINPACK GFlops/Watt GFlops/Watt Abstract GPU Computing has lately attracted

AMD/ATI Radeon HD 5870 GPU DEGIMA LINPACK HD 5870 GPU DEGIMA LINPACK GFlops/Watt GFlops/Watt Abstract GPU Computing has lately attracted DEGIMA LINPACK Energy Performance for LINPACK Benchmark on DEGIMA 1 AMD/ATI Radeon HD 5870 GPU DEGIMA LINPACK HD 5870 GPU DEGIMA LINPACK 1.4698 GFlops/Watt 1.9658 GFlops/Watt Abstract GPU Computing has

More information

IPSJ SIG Technical Report 1,a) 1,b) 1,c) 1,d) 2,e) 2,f) 2,g) 1. [1] [2] 2 [3] Osaka Prefecture University 1 1, Gakuencho, Naka, Sakai,

IPSJ SIG Technical Report 1,a) 1,b) 1,c) 1,d) 2,e) 2,f) 2,g) 1. [1] [2] 2 [3] Osaka Prefecture University 1 1, Gakuencho, Naka, Sakai, 1,a) 1,b) 1,c) 1,d) 2,e) 2,f) 2,g) 1. [1] [2] 2 [3] 1 599 8531 1 1 Osaka Prefecture University 1 1, Gakuencho, Naka, Sakai, Osaka 599 8531, Japan 2 565 0871 Osaka University 1 1, Yamadaoka, Suita, Osaka

More information

IPSJ SIG Technical Report Vol.2015-MUS-106 No.10 Vol.2015-EC-35 No /3/2 BGM 1,4,a) ,4 BGM. BGM. BGM BGM. BGM. BGM. BGM. 1.,. YouTube 201

IPSJ SIG Technical Report Vol.2015-MUS-106 No.10 Vol.2015-EC-35 No /3/2 BGM 1,4,a) ,4 BGM. BGM. BGM BGM. BGM. BGM. BGM. 1.,. YouTube 201 BGM 1,4,a) 1 2 2 3,4 BGM. BGM. BGM BGM. BGM. BGM. BGM. 1.,. YouTube 2015 1 100.. Web.. BGM.BGM [1]. BGM BGM 1 Waseda University, Shinjuku, Tokyo 169-8555, Japan 2 3 4 JST CREST a) ha-ru-ki@asagi.waseda.jp.

More information

Cell/B.E. BlockLib

Cell/B.E. BlockLib Cell/B.E. BlockLib 17 17115080 21 2 10 i Cell/B.E. BlockLib SIMD CELL SIMD Cell Cell BlockLib BlockLib NestStep libspe1 Cell SDK 3.1 libspe2 BlockLib Cell SDK 3.1 NestStep libspe2 BlockLib BlockLib libspe1

More information

名称 : 日本 GPU コンピューティングパートナーシップ (G-DEP) 所在 : 東京都文京区本郷 7 丁目 3 番 1 号東京大学アントレプレナープラザ, 他工場 URL アライアンスパートナー コアテクノロジーパートナー NVIDIA JAPAN ソリュ

名称 : 日本 GPU コンピューティングパートナーシップ (G-DEP) 所在 : 東京都文京区本郷 7 丁目 3 番 1 号東京大学アントレプレナープラザ, 他工場 URL   アライアンスパートナー コアテクノロジーパートナー NVIDIA JAPAN ソリュ GPUDirect の現状整理 multi-gpu に取組むために G-DEP チーフエンジニア河井博紀 (kawai@gdep.jp) 名称 : 日本 GPU コンピューティングパートナーシップ (G-DEP) 所在 : 東京都文京区本郷 7 丁目 3 番 1 号東京大学アントレプレナープラザ, 他工場 URL http://www.gdep.jp アライアンスパートナー コアテクノロジーパートナー

More information

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC

2. CABAC CABAC CABAC 1 1 CABAC Figure 1 Overview of CABAC 2 DCT 2 0/ /1 CABAC [3] 3. 2 値化部 コンテキスト計算部 2 値算術符号化部 CABAC CABAC H.264 CABAC 1 1 1 1 1 2, CABAC(Context-based Adaptive Binary Arithmetic Coding) H.264, CABAC, A Parallelization Technology of H.264 CABAC For Real Time Encoder of Moving Picture YUSUKE YATABE 1 HIRONORI

More information

IPSJ SIG Technical Report 1 1, Nested Transactional Memory Selecting the Optimal Rollback Point Yuji Ito, 1 Ryota Shioya, 1, 2 Masahiro Goshima

IPSJ SIG Technical Report 1 1, Nested Transactional Memory Selecting the Optimal Rollback Point Yuji Ito, 1 Ryota Shioya, 1, 2 Masahiro Goshima 1 1, 2 1 1 Nested Transactional Memory Selecting the Optimal Rollback Point Yuji Ito, 1 Ryota Shioya, 1, 2 Masahiro Goshima 1 and Shuichi Sakai 1 Lock-based synchronization is common in parallel programming.

More information

連載講座 : 高生産並列言語を使いこなす (4) ゲーム木探索の並列化 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 準備 問題の定義 αβ 法 16 2 αβ 法の並列化 概要 Young Brothers Wa

連載講座 : 高生産並列言語を使いこなす (4) ゲーム木探索の並列化 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 準備 問題の定義 αβ 法 16 2 αβ 法の並列化 概要 Young Brothers Wa 連載講座 : 高生産並列言語を使いこなす (4) ゲーム木探索の並列化 田浦健次朗 東京大学大学院情報理工学系研究科, 情報基盤センター 目次 1 準備 16 1.1 問題の定義 16 1.2 αβ 法 16 2 αβ 法の並列化 17 2.1 概要 17 2.2 Young Brothers Wait Concept 17 2.3 段数による逐次化 18 2.4 適応的な待機 18 2. 強制終了

More information

1 Table 1: Identification by color of voxel Voxel Mode of expression Nothing Other 1 Orange 2 Blue 3 Yellow 4 SSL Humanoid SSL-Vision 3 3 [, 21] 8 325

1 Table 1: Identification by color of voxel Voxel Mode of expression Nothing Other 1 Orange 2 Blue 3 Yellow 4 SSL Humanoid SSL-Vision 3 3 [, 21] 8 325 社団法人人工知能学会 Japanese Society for Artificial Intelligence 人工知能学会研究会資料 JSAI Technical Report SIG-Challenge-B3 (5/5) RoboCup SSL Humanoid A Proposal and its Application of Color Voxel Server for RoboCup SSL

More information

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing

[2] OCR [3], [4] [5] [6] [4], [7] [8], [9] 1 [10] Fig. 1 Current arrangement and size of ruby. 2 Fig. 2 Typography combined with printing 1,a) 1,b) 1,c) 2012 11 8 2012 12 18, 2013 1 27 WEB Ruby Removal Filters Using Genetic Programming for Early-modern Japanese Printed Books Taeka Awazu 1,a) Masami Takata 1,b) Kazuki Joe 1,c) Received: November

More information

AutoTuned-RB

AutoTuned-RB ABCLib Working Notes No.10 AutoTuned-RB Version 1.00 AutoTuned-RB AutoTuned-RB RB_DGEMM RB_DGEMM ( TransA, TransB, M, N, K, a, A, lda, B, ldb, b, C, ldc ) L3BLAS DGEMM (C a Trans(A) Trans(B) b C) (1) TransA:

More information

1_26.dvi

1_26.dvi C3PV 1,a) 2,b) 2,c) 3,d) 1,e) 2012 4 20, 2012 10 10 C3PV C3PV C3PV 1 Java C3PV 45 38 84% Programming Process Visualization for Supporting Students in Programming Exercise Hiroshi Igaki 1,a) Shun Saito

More information

2012年度HPCサマーセミナー_多田野.pptx

2012年度HPCサマーセミナー_多田野.pptx ! CCS HPC! I " tadano@cs.tsukuba.ac.jp" " 1 " " " " " " " 2 3 " " Ax = b" " " 4 Ax = b" A = a 11 a 12... a 1n a 21 a 22... a 2n...... a n1 a n2... a nn, x = x 1 x 2. x n, b = b 1 b 2. b n " " 5 Gauss LU

More information

TSUBAME2.0 における GPU の 活用方法 東京工業大学学術国際情報センター丸山直也第 10 回 GPU コンピューティング講習会 2011 年 9 月 28 日

TSUBAME2.0 における GPU の 活用方法 東京工業大学学術国際情報センター丸山直也第 10 回 GPU コンピューティング講習会 2011 年 9 月 28 日 TSUBAME2.0 における GPU の 活用方法 東京工業大学学術国際情報センター丸山直也第 10 回 GPU コンピューティング講習会 2011 年 9 月 28 日 目次 1. TSUBAMEのGPU 環境 2. プログラム作成 3. プログラム実行 4. 性能解析 デバッグ サンプルコードは /work0/gsic/seminars/gpu- 2011-09- 28 からコピー可能です 1.

More information

三石貴志.indd

三石貴志.indd 流通科学大学論集 - 経済 情報 政策編 - 第 21 巻第 1 号,23-33(2012) SIRMs SIRMs Fuzzy fuzzyapproximate approximatereasoning reasoningusing using Lukasiewicz Łukasiewicz logical Logical operations Operations Takashi Mitsuishi

More information