GPGPU (VI)

1 Introduction

The previous installments of this series covered GPGPU programming with CUDA. This installment looks beyond CUDA: it first introduces OpenCL, comparing it with CUDA along the way, and then surveys other GPGPU programming environments.

2 OpenCL

OpenCL is an open standard for GPGPU and, more broadly, for parallel programming of heterogeneous systems. Originally proposed by Apple, it is standardized by the OpenCL Working Group of the Khronos Group [1]. Whereas CUDA targets NVIDIA GPUs, OpenCL targets NVIDIA GPUs, AMD GPUs, multi-core CPUs, DSPs (Digital Signal Processors), and other processors, so the same OpenCL program can, in principle, run on any of them. This section outlines OpenCL while contrasting it with CUDA.

2.1 The OpenCL platform model

Figure 1 shows the OpenCL platform model. A Host (typically a CPU) controls one or more Compute Devices (for example, GPUs); each Compute Device consists of Compute Units, and each Compute Unit consists of Processing Elements. In CUDA terms, a Compute Unit corresponds to a multiprocessor (MP) and a Processing Element to an SP (CUDA Core); likewise, a CUDA Thread corresponds to an OpenCL Work-Item and a CUDA Block to an OpenCL Workgroup. The table below summarizes the correspondence between OpenCL and CUDA terminology, including the memory hierarchy.
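To make the Work-Item/Thread correspondence concrete, the following small OpenCL C fragment (an illustrative addition, not one of this article's listings) shows how the OpenCL work-item functions line up with the CUDA built-in variables; the kernel body is only a placeholder.

// Hedged sketch: OpenCL work-item IDs and their CUDA counterparts.
__kernel void id_mapping_example(__global int *out)
{
    int gid   = get_global_id(0);   // CUDA: blockIdx.x * blockDim.x + threadIdx.x
    int lid   = get_local_id(0);    // CUDA: threadIdx.x
    int group = get_group_id(0);    // CUDA: blockIdx.x
    int lsize = get_local_size(0);  // CUDA: blockDim.x
    out[gid] = group * lsize + lid; // equal to gid; stored only so the kernel has an effect
}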

Figure 1: The OpenCL platform and memory model. Each Work-Item has Private Memory, each Workgroup has Local Memory, the Compute Device provides Global/Constant Memory, and the Host has Host Memory.

    OpenCL                     CUDA
    Compute Unit               MP (multiprocessor)
    Processing Element         SP (CUDA Core)
    Private Memory             Register
    Local Memory               Shared Memory
    Global/Constant Memory     Global Memory
    Host Memory                Host Memory

As the table shows, the CUDA and OpenCL hardware and memory models are essentially the same; for the most part only the names differ, so experience with CUDA carries over directly. One point to keep in mind is that OpenCL is not a GPU-only standard: the specification also defines the OpenCL Embedded Profile, a subset intended for embedded devices.

2.2 Writing an OpenCL program

An OpenCL program consists of host code, written in C/C++ and calling the OpenCL API, and kernel code executed on the Compute Device. The host code selects a platform and a device, builds the kernels, manages buffers, and launches the kernels; device properties such as the number of Compute Units can be queried at run time through the API. Figure 2 outlines this structure.
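As an illustration of such a query, here is a minimal sketch (an addition, not part of the original sample program). It assumes a valid cl_device_id named device, obtained as in the sample program of Section 2.3.

#include <CL/cl.h>
#include <stdio.h>

/* Print a few properties of an already-obtained OpenCL device. */
void print_device_summary(cl_device_id device)
{
    cl_uint computeUnits = 0;
    size_t maxWorkGroupSize = 0;
    char name[256] = "";

    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(computeUnits), &computeUnits, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

    printf("%s: %u compute units, max work-group size %zu\n",
           name, computeUnits, maxWorkGroupSize);
}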

Figure 2: Structure of an OpenCL program: host code written in C/C++ plus kernel code executed on the device (GPU).

The kernel code is written in OpenCL C, a C-based language: a function to be executed on the GPU (the device) is marked with the __kernel qualifier, and pointer arguments that refer to data in Global Memory are marked with the __global qualifier. This division into host code and kernel code is essentially the same as in CUDA; how kernels are built and launched differs in detail, as the sample program in the next subsection shows.
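The sample kernels in this article use only __global memory. As a hedged illustration of the Local Memory level of Figure 1 (an addition to the original text, not one of its listings), the following OpenCL C kernel stages data in __local memory shared by a work-group; names such as tile are arbitrary.

// Each work-group copies its portion of 'in' into Local Memory, then
// every work-item reads a neighbour's value from the local tile.
__kernel void local_memory_example(__global const float *in,
                                   __global float *out,
                                   __local float *tile)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    tile[lid] = in[gid];                 // stage into Local Memory (CUDA: Shared Memory)
    barrier(CLK_LOCAL_MEM_FENCE);        // wait until the whole work-group has written

    out[gid] = tile[(lid + 1) % lsz];    // read a value written by another work-item
}

On the host side, the size of such a __local argument is supplied with clSetKernelArg(kernel, 2, lsz * sizeof(float), NULL), i.e. a byte size together with a NULL pointer.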

2.3 An OpenCL sample program

This subsection walks through a small OpenCL program. The environment used here is CentOS 5.5 (x86_64) with CUDA 3.2 RC and a Tesla C2050; the same code should also run on other OpenCL implementations, such as a PC with an AMD GPU, with only the build settings changed. Listing 1 shows the host program arraytest.cpp (run on the CPU), which multiplies each element of an array by a constant on the GPU. Its structure is as follows (line numbers refer to Listing 1):

1. Lines 16-24: obtain the platform and the GPU devices, and create a context.
2. Lines 25-28: create a command queue.
3. Lines 29-42: load the kernel source, build the program, and create the kernel.
4. Lines 43-55: prepare the data, create the CPU/GPU buffers, and set the kernel arguments.
5. Lines 56-63: launch the kernel on the GPU and wait for completion.
6. Lines 64-65: read the results back from the GPU to the CPU.
7. Lines 66-68: check the results.
8. Line 69 onward: release the memory objects and other resources.

Listing 1: arraytest.cpp

1 #include <oclUtils.h>
2
3 #define DATA_LENGTH 16
4 cl_context cxGPUContext;
5 cl_kernel kernel;
6 cl_command_queue commandQueue;
7
8 #define CHK_DO(name,o) ciErrNum=o;if(ciErrNum!=CL_SUCCESS){printf(name);printf(" failed\n");return(-1);}
9 #define CHK_ERR(name) if(ciErrNum!=CL_SUCCESS){printf(name);printf(" failed\n");return(-1);}
10
11 int main(int argc, char **argv){
12   cl_platform_id cpPlatform = NULL;
13   cl_uint ciDeviceCount = 0;
14   cl_device_id *cdDevices = NULL;
15   cl_int ciErrNum = CL_SUCCESS;
16   // get platform
17   CHK_DO("oclGetPlatformID", oclGetPlatformID(&cpPlatform));
18   // get devices
19   CHK_DO("clGetDeviceIDs1", clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &ciDeviceCount));
20   cdDevices = (cl_device_id *)malloc(ciDeviceCount * sizeof(cl_device_id));
21   CHK_DO("clGetDeviceIDs2", clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, ciDeviceCount, cdDevices, NULL));
22   // get context
23   cxGPUContext = clCreateContext(0, ciDeviceCount, cdDevices, NULL, NULL, &ciErrNum);
24   CHK_ERR("clCreateContext");
25   // create command queue
26   cl_device_id device = oclGetDev(cxGPUContext, 0);
27   commandQueue = clCreateCommandQueue(cxGPUContext, device, CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
28   CHK_ERR("clCreateCommandQueue");
29   // program setup
30   size_t program_length;
31   const char *source_path = "gpu.cl";
32   char *source = oclLoadProgSource(source_path, "", &program_length);
33   if(!source){printf("oclLoadProgSource failed(%s)\n", source_path);return 2000;}
34   // create the program
35   cl_program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&source, &program_length, &ciErrNum);
36   CHK_ERR("clCreateProgramWithSource");
37   free(source);
38   // build the program
39   CHK_DO("clBuildProgram", clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL));
40   // Create Kernel
41   kernel = clCreateKernel(cpProgram, "arraytest", &ciErrNum);
42   CHK_ERR("clCreateKernel");
43   // setup data
44   cl_mem d_A;
45   cl_mem d_R;
46   float *h_A_data = (float *)malloc(sizeof(float) * DATA_LENGTH);
47   for(int i=0; i<DATA_LENGTH; i++)h_A_data[i] = (float)(i+1);
48   float *h_R_data = (float *)malloc(sizeof(float) * DATA_LENGTH);
49   d_A = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float) * DATA_LENGTH, h_A_data, NULL);
50   d_R = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, sizeof(float) * DATA_LENGTH, NULL, NULL);
51   float value = 2.0f;
52   // set args
53   clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_R);
54   clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&d_A);
55   clSetKernelArg(kernel, 2, sizeof(cl_float), (void *)&value);
56   // run kernel
57   cl_event GPUExecution;
58   size_t localWorkSize[] = {4};
59   size_t globalWorkSize[] = {DATA_LENGTH};
60   clEnqueueNDRangeKernel(commandQueue, kernel, 1, 0, globalWorkSize, localWorkSize, 0, NULL, &GPUExecution);
61   clFlush(commandQueue);
62   // sync
63   clFinish(commandQueue);
64   // blocking readback
65   clEnqueueReadBuffer(commandQueue, d_R, CL_TRUE, 0, sizeof(float) * DATA_LENGTH, h_R_data, 0, NULL, NULL);
66   // check result
67   printf("before: "); for(int i=0; i<DATA_LENGTH; i++){printf(" %.2f", h_A_data[i]);}printf("\n");
68   printf("after : "); for(int i=0; i<DATA_LENGTH; i++){printf(" %.2f", h_R_data[i]);}printf("\n");
69   // release mem and event
70   clReleaseMemObject(d_A);
71   clReleaseMemObject(d_R);
72   clReleaseEvent(GPUExecution);
73   // cleanup
74   ciErrNum = clReleaseKernel(kernel);
75   ciErrNum = clReleaseCommandQueue(commandQueue);
76   ciErrNum = clReleaseProgram(cpProgram);
77   ciErrNum = clReleaseContext(cxGPUContext);
78   CHK_ERR("release");
79   free(h_A_data);
80   free(h_R_data);
81   return 0;
82 }

To use a GPU with OpenCL, the program first obtains a platform ID, then the device IDs of the GPUs, and then creates a context from those devices; the command queue, program, and kernel are all created from this context. Listing 1 simply uses the first device. Listing 2 shows a variation that enumerates all GPUs in the context and prints their names; it is meant to be inserted between the "get context" and "create command queue" parts of Listing 1.

Listing 2: arraytest.cpp (modified excerpt)

1 // inserted between "get context" and "create command queue" in Listing 1
2 size_t nDeviceBytes;
3 CHK_DO("clGetContextInfo", clGetContextInfo(cxGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &nDeviceBytes));
4 ciDeviceCount = (cl_uint)nDeviceBytes/sizeof(cl_device_id);
5 if(ciDeviceCount == 0){printf("no devices (return code %i)\n", ciErrNum);return -1;}
6 // check all devices (get device and print the device name)
7 for(unsigned int i = 0; i < ciDeviceCount; ++i){
8   // get the i-th device (GPU) in the context
9   cl_device_id device = oclGetDev(cxGPUContext, i);
10   printf("Device %d: ", i);
11   oclPrintDevName(LOGBOTH, device);  // print the GPU name
12   printf("\n");
13 }

Kernel arguments are passed with the clSetKernelArg API: each call binds one argument, identified by its index, to a buffer on the GPU or to a scalar value. The kernel itself is launched with clEnqueueNDRangeKernel (lines 56-60 of Listing 1), whose localWorkSize and globalWorkSize arguments (lines 58-59) determine how the work is decomposed: globalWorkSize is the total number of Work-Items and localWorkSize is the number of Work-Items per Workgroup. This corresponds to CUDA, where a grid of Blocks each containing Threads is specified at launch; an OpenCL Work-Item plays the role of a CUDA Thread and a Workgroup that of a Block, although CUDA specifies the number of Blocks while OpenCL specifies the two work sizes.

In programming style, OpenCL resembles the CUDA Driver API more than the higher-level CUDA C runtime: devices, contexts, and kernels are all handled explicitly through API calls. The command queue also plays a role similar to a CUDA Stream: kernel launches and memory transfers are enqueued and executed asynchronously, which makes it possible to overlap GPU computation with CPU work and CPU-GPU transfers, just as Streams do in CUDA C.
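To make this correspondence concrete, here is a hedged sketch (not part of the original article) of how the same computation as Listings 1 and 3 might look in CUDA C; the kernel and variable names simply mirror the OpenCL version.

// CUDA C version of the array-scaling example, for comparison with the OpenCL code.
#include <cstdio>
#include <cuda_runtime.h>

#define DATA_LENGTH 16

__global__ void arraytest(float *R, const float *A, float value){
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // OpenCL: get_global_id(0)
    if(i < DATA_LENGTH) R[i] = A[i] * value;
}

int main(){
    float h_A[DATA_LENGTH], h_R[DATA_LENGTH];
    for(int i = 0; i < DATA_LENGTH; i++) h_A[i] = (float)(i + 1);

    float *d_A, *d_R;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMalloc(&d_R, sizeof(h_R));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    // 4 threads per block (localWorkSize), DATA_LENGTH threads in total (globalWorkSize).
    arraytest<<<DATA_LENGTH / 4, 4>>>(d_R, d_A, 2.0f);

    cudaMemcpy(h_R, d_R, sizeof(h_R), cudaMemcpyDeviceToHost);
    for(int i = 0; i < DATA_LENGTH; i++) printf(" %.2f", h_R[i]);
    printf("\n");

    cudaFree(d_A); cudaFree(d_R);
    return 0;
}

Whereas the OpenCL host code builds the kernel at run time and binds arguments one by one with clSetKernelArg, this CUDA C version is compiled offline by nvcc and launches the kernel with the <<<grid, block>>> syntax.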

In OpenCL the kernel is built at run time. In Listing 1 this happens in four steps:

1. Load the kernel source from a file (here gpu.cl): oclLoadProgSource.
2. Create a program object from the source: clCreateProgramWithSource.
3. Build the program for the devices in the context: clBuildProgram (a sketch of retrieving the build log when this step fails appears after Listing 3).
4. Extract a kernel object from the built program: clCreateKernel.

Listing 3 shows the kernel source, gpu.cl. Just as each CUDA thread obtains its own index from built-in variables, each Work-Item obtains its ID with get_global_id and processes one element of the array:

Listing 3: gpu.cl

1 #define DATA_LENGTH 16
2 __kernel void arraytest(__global float *R, __global float *A, float value){
3   int i;
4   i = get_global_id(0);
5   R[i] = A[i] * value;
6 }

Apart from the qualifiers and the ID function, the kernel is essentially what one would write in CUDA.
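The fragment below is the sketch referred to in step 3 above (an addition, not part of the original listings): if clBuildProgram reports an error, clGetProgramBuildInfo can retrieve the compiler messages for gpu.cl. It assumes the variables cpProgram, device, and ciErrNum from Listing 1; in Listing 1 the CHK_DO macro simply aborts, so this shows what one might do instead.

// After clBuildProgram: print the build log if compilation of gpu.cl failed.
if(ciErrNum != CL_SUCCESS){
    size_t logSize = 0;
    clGetProgramBuildInfo(cpProgram, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char *)malloc(logSize + 1);
    clGetProgramBuildInfo(cpProgram, device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    log[logSize] = '\0';
    printf("clBuildProgram failed:\n%s\n", log);
    free(log);
}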

Compiling an OpenCL program is also simpler than compiling CUDA: while CUDA C requires the dedicated nvcc compiler, an OpenCL host program is plain C/C++ and, much like a program using OpenGL or GLUT, only needs the OpenCL headers and library (plus, here, the utility headers and libraries from the NVIDIA GPU Computing SDK), so it can be built with gcc/g++. Figure 3 shows a compile-and-run session.

$ ls
arraytest.cpp gpu.cl
$ g++ -O3 -m64 -o arraytest arraytest.cpp -lOpenCL \
  -I/path_to_cudasdk_3.1/OpenCL/common/inc -I/path_to_cudasdk_3.1/shared/inc \
  -L/path_to_cudasdk_3.1/OpenCL/common/lib -L/path_to_cudasdk_3.1/shared/lib \
  -loclUtil_x86_64 -lshrutil_x86_64
$ ls
arraytest arraytest.cpp gpu.cl
$ ./arraytest
before: 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 ...
after : 2.00 4.00 6.00 8.00 10.00 12.00 14.00 16.00 18.00 20.00 ...

Figure 3: Compiling and running the OpenCL sample program.

3 Other GPGPU programming environments

3.1 GPGPU from existing parallel programming environments

Parallel programs for CPUs have long been written in C/C++ with pthreads, OpenMP, or MPI, and several projects aim to reuse these familiar environments for GPGPU. One example is OMPCUDA [3], which translates OpenMP programs into CUDA so that they run on the GPU: the work-sharing of a parallel for construct (a for loop) is mapped onto the GPU, and the necessary CPU-GPU data transfers are generated automatically, so an existing OpenMP program can use the GPU essentially as is. Figure 4 illustrates how the OpenMP execution model is mapped onto the GPU and its SPs by OMPCUDA. OMPCUDA is implemented on top of the Omni OpenMP compiler [2]; Figure 5 shows the structure of Omni and of OMPCUDA, which reuses Omni's infrastructure and adds a path that generates CUDA code to be compiled for and executed on the GPU. OpenMP itself is defined for C/C++ and Fortran, while OMPCUDA's support is currently more limited.

Figure 4: The OpenMP execution model and its mapping onto the GPU by OMPCUDA.
Figure 5: Structure of the Omni OpenMP compiler and of OMPCUDA.
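As a reminder of what such a translator consumes (an illustrative fragment added here, not taken from the OMPCUDA paper [3]), the following is the kind of OpenMP parallel for loop whose iterations a tool like OMPCUDA could map onto GPU threads:

#include <omp.h>

#define N 1024

/* Scale an array in parallel; each iteration is independent, so the
   work-shared loop can be distributed across CPU cores or, with an
   OpenMP-to-CUDA translator, across GPU threads. */
void scale(float *r, const float *a, float value)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        r[i] = a[i] * value;
    }
}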

Work in the same direction includes the OpenMP-to-GPGPU compiler framework of Lee et al. [5], which also translates OpenMP into CUDA, and directive-based compilers such as the PGI Accelerator compilers [4] and CAPS HMPP [6].

3.2 GPGPU from other programming languages

CUDA and OpenCL are designed around C/C++. C/C++ (together with Fortran) dominates HPC, but many potential users of GPUs work in other languages such as Java, Perl, Python, Ruby, or .NET languages, and bindings that make CUDA and OpenCL available from these languages have appeared. PyCUDA [7] and PyOpenCL [8] let Python programs use CUDA and OpenCL, respectively: the GPU kernels are still written in CUDA C or OpenCL C, while buffer management and CPU-GPU transfers are driven from Python, and their core layers are implemented in C++. Ruby-OpenCL [9], along the same lines as PyOpenCL, exposes OpenCL to Ruby. JCuda [10] provides Java bindings for CUDA; GPU kernels are supplied as compiled modules (PTX), and the CUDA libraries (CUBLAS, CUFFT, and so on) are also accessible through its APIs. CUDA Fortran [11] takes a different approach: it is a PGI Fortran compiler that supports CUDA extensions directly in Fortran, so that both host and kernel code can be written in Fortran in much the same style as CUDA C. These environments make it possible to use the GPU without writing the host code in C/C++.

Even with these bindings, however, the GPU kernels themselves are still written and executed through CUDA or OpenCL.

3.3 Stream programming environments

GPGPU did not begin with CUDA: around 2005, before CUDA appeared, general-purpose computation on GPUs was commonly expressed with stream programming environments, representative examples being BrookGPU [12] and RapidMind [14]. BrookGPU was developed at Stanford University. Figure 6 shows a BrookGPU example: a kernel function (kfunc) operates on streams, and data is moved between ordinary arrays and streams with the two transfer functions streamRead and streamWrite, with the kernel executed on the GPU. BrookGPU supports several back ends, including DirectX, OpenGL, a CPU back end, and CTM (Close To the Metal, AMD's low-level GPU interface), and it was adopted by AMD as the basis of Brook+ [13] for AMD GPUs. RapidMind originated at the University of Waterloo as the metaprogramming system Sh [15] and was commercialized by RapidMind Inc. Figure 7 shows a RapidMind example: a kernel is defined as a Program object embedded in the host C++ code and applied directly to Array objects. RapidMind targeted not only GPUs but also other many-core processors such as the Cell B.E., positioning itself for the PetaFLOPS era and the path toward ExaFLOPS. Compared with these early environments, CUDA made GPGPU considerably more approachable.

Figure 6: A BrookGPU example.

// kernel function
kernel void kfunc(float x<>, float y<>, out float z<>) {
  z = x + y;
}

int main() {
  float a<100>;
  float b<100>;
  float c<100>;
  // transfer input data into the streams
  streamRead(a, data1);
  streamRead(b, data2);
  // run the kernel
  kfunc(a, b, c);
  // read the result back
  streamWrite(c, result);
  return 0;
}

Figure 7: A RapidMind example.

int main() {
  // define the kernel as a Program object
  Program kfunc = BEGIN {
    In<Value3f> x, y;
    Out<Value3f> z;
    z = x + y;
  } END;
  // define the arrays
  Array<1, Value3f> a(512);
  Array<1, Value3f> b(512);
  Array<1, Value3f> c(512);
  // apply the kernel
  c = kfunc(a, b);
  return 0;
}

4 Closing remarks

This installment introduced OpenCL, which allows GPGPU programs to be written in C/C++ without being tied to NVIDIA GPUs, as well as a range of other GPGPU programming environments. GPGPU itself is advancing rapidly. In November 2010 the Chinese supercomputer Tianhe-1A, whose predecessor had entered the TOP500 in November 2009 in 5th place using Radeon HD GPUs, combined 14,336 CPUs with 7,168 Tesla M2050 GPUs and achieved over 2 PFLOPS on LINPACK, taking first place in the November 2010 TOP500. In Japan, the TSUBAME 2.0 system at the Tokyo Institute of Technology, equipped with 4,224 Tesla M2050 GPUs, also placed high in both the TOP500 and the Green500. GPGPU is thus moving from individual research use into large production systems, and familiarity with GPU programming environments such as those described in this article should become increasingly valuable.

References

[1] OpenCL - The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl/
[2] M. Sato, S. Satoh, K. Kusano, and Y. Tanaka. Design of OpenMP Compiler for an SMP Cluster. In EWOMP '99, pp. 32-39, 1999.
[3] OMPCUDA: an OpenMP implementation for GPUs (in Japanese). In HPCS2009, pp. 131-138, 2009.
[4] PGI. PGI Accelerator Compilers, http://www.pgroup.com/resources/accel.htm
[5] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 101-110, 2009.
[6] CAPS. HMPP Workbench, http://www.caps-entreprise.com/fr/page/index.php?id=49&p_p=36
[7] PyCUDA, http://mathema.tician.de/software/pycuda
[8] PyOpenCL, http://mathema.tician.de/software/pyopencl
[9] Ruby-OpenCL, http://ruby-opencl.rubyforge.org/
[10] JCuda, http://www.jcuda.org/
[11] CUDA Fortran, http://www.pgroup.com/resources/cudafortran.htm
[12] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. SIGGRAPH 2004, 2004.
[13] AMD. Brook+. SC07 BOF Session presentation, November 2007.
[14] Michael D. McCool. Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform. In GSPx Multicore Applications Conference, 2006.
[15] Michael McCool and Stefanus Du Toit. Metaprogramming GPUs with Sh. A K Peters Ltd, 2004.