SAXPY/DAXPY Using Streaming SIMD Extensions 2 (SSE2)


Version 2.0, July 2000
Document number: 248600J-001
01/12/06

305-8603 115
Fax: 0120-47-8832
Copyright Intel Corporation 1999, 2000

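Contents

1 Introduction
2 SAXPY and DAXPY
2.1 Optimizing SAXPY and DAXPY
2.1.1 SIMD C++ classes
2.1.2 C/C++ compiler vectorization
3 Performance
4 Conclusion
5 C/C++ code example
6 SSE2 C++ class code example
7 SSE2 vectorized code example
Appendix A - Performance data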

Revision history

2.0  Pentium 4 processor, July 2000
1.0  September 1999

References

Lawson, Hanson, Kincaid, Krogh. "Basic linear algebra subprograms for Fortran usage." ACM Transactions on Mathematical Software, Vol. 5, No. 3, 308-371.
Dongarra, Moler, Bunch, Stewart. LINPACK User's Guide. SIAM, 1979.
C++ SIMD classes documentation, document number 693500J, 1999.
C/C++ compiler documentation, document number 741901J, 1999.

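1 Introduction

Streaming SIMD Extensions 2 (SSE2) is a set of SIMD (Single Instruction, Multiple Data) instructions that extends the SIMD capabilities the IA-32 architecture already offers through Streaming SIMD Extensions (SSE). SSE2 adds 128-bit SIMD integer operations and SIMD operations on 64-bit double-precision floating-point values, which benefit applications such as 3D graphics.

This document uses the SAXPY/DAXPY routines to show how SSE2 and SSE can be applied. SSE2 is available on the Pentium 4 processor; SSE is available on both the Pentium 4 and Pentium III processors. SAXPY works on single-precision data and DAXPY on double-precision data. Where a statement applies to both, the DAXPY variant is given in parentheses, as in SAXPY (DAXPY).

The SAXPY (DAXPY) routines are optimized for Intel architecture (IA) processors in two ways: with the SIMD C++ classes, and with the vectorizer of the C/C++ compiler. Both versions assume that the vectors are aligned on 16-byte boundaries.

2 SAXPY and DAXPY

SAXPY (DAXPY) was defined by Lawson, Hanson, Kincaid, and Krogh as part of BLAS (Basic Linear Algebra Subprograms) [Lawson, 1979]. SAXPY multiplies the vector X by a scalar (SA) and adds the result to the vector Y. DAXPY is the double-precision equivalent.

Y = a * X + Y

X and Y are vectors (elements 1 through n) and a is a scalar.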

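In FORTRAN, the routine is called as follows:

CALL SAXPY (N, A, X, INCX, Y, INCY)

N is the vector length, and INCX and INCY are the strides of X and Y. The most common case is INCX=INCY=1, which is also how LINPACK uses the BLAS [Dongarra, 1979]. For that reason, this document covers only the INCX=INCY=1 case of SAXPY and DAXPY, written in C++.

2.1 Optimizing SAXPY and DAXPY

SAXPY and DAXPY can be optimized for SSE2 with the C/C++ compiler in two ways:

1. Use the SIMD C++ classes, which map onto SSE2 operations.
2. Let the C/C++ compiler vectorize the loop.

2.1.1 SIMD C++ classes

The FVEC (DVEC) classes let 16-byte-aligned float (double) arrays be treated as arrays of F32vec4 (F64vec2). The pointers x and y are cast to F32vec4 (F64vec2) pointers, and the loop body y[i] = scalar * x[i] + y[i] then operates on one 128-bit value, that is 4 (2) elements, per iteration. The loop count n is divided by 4 (2) accordingly, and the scalar sa (da) is expanded once into a vector holding 4 (2) copies of it.

This version makes two assumptions:

1. The float (double) arrays are aligned on 16-byte boundaries, so that x and y can be accessed as F32vec4 (F64vec2).
2. The length of x and y is a multiple of 4 (2).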
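The two assumptions behind the class-based version (16-byte alignment, length a multiple of the vector width) can be checked at run time. The following is a minimal sketch in portable C++, not code from the original note: the helper names `is_aligned16`, `vector_part`, and `saxpy_checked` are hypothetical, and the inner block of four scalar operations only stands in for one F32vec4 operation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: largest multiple of `lanes` that is <= n.
// lanes is 4 for float (SAXPY) and 2 for double (DAXPY); it must be a
// power of two, and n is assumed to be non-negative.
inline int vector_part(int n, int lanes) {
    return n & ~(lanes - 1);
}

// Hypothetical wrapper: run the blocked body only on the part of the data
// that satisfies both assumptions, and a scalar loop on everything else.
void saxpy_checked(int n, float a, const float* x, float* y) {
    int nv = (is_aligned16(x) && is_aligned16(y)) ? vector_part(n, 4) : 0;
    for (int i = 0; i < nv; i += 4)            // stands in for the F32vec4 loop
        for (int k = 0; k < 4; k++)
            y[i + k] = a * x[i + k] + y[i + k];
    for (int i = nv; i < n; i++)               // scalar tail
        y[i] = a * x[i] + y[i];
}
```

Splitting the work this way means a caller does not have to guarantee either assumption; data that violates them simply takes the scalar path.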

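Both assumptions are easy to satisfy. For assumption 1 (16-byte alignment), x and y can be declared with the declspec alignment specifier of the C/C++ compiler. For assumption 2 (length a multiple of 4 (2)), an if statement can check the length before the optimized SAXPY (DAXPY) is used.

2.1.2 C/C++ compiler vectorization

The C/C++ compiler can vectorize the SAXPY (DAXPY) loop itself. The /QxK and /QxW switches generate SSE and SSE2 code respectively, and /Qvec_verbose3 makes the vectorizer report what it did with each loop. The SAXPY (DAXPY) loop is written in C/C++ much as it would be in FORTRAN.

Two pragmas help the vectorizer:

- The ivdep pragma tells the compiler to ignore assumed data dependences in the for loop that follows it.
- The vector aligned pragma tells the compiler that the data is aligned, so that 4 (2) elements can be moved with aligned accesses.

The vectorized SAXPY and DAXPY use both pragmas. In the plain C/C++ version, the novector pragma keeps the compiler from vectorizing the loop, so that the version can serve as the scalar baseline.

Note 1: /QxW and /QaxW differ. /QxW generates code that requires SSE2, so the result targets the Pentium 4 processor and does not run on a Pentium III processor. /QaxW generates SSE2 code together with generic IA code, giving 2 code paths, whereas /QxW gives 1.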
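The idea of keeping two code paths and choosing one at run time, as described for /QaxW above, can be sketched by hand. This is an illustration only, not a description of how the compiler implements the switch. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` check and falls back to the generic path elsewhere; the function names `saxpy_generic`, `saxpy_sse`, and `select_saxpy` are hypothetical.

```cpp
#include <cassert>

// Generic path: plain scalar loop that runs on any processor.
static void saxpy_generic(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <xmmintrin.h>
#define HAVE_SSE_PATH 1
// SSE path: four floats per iteration, unaligned accesses, scalar tail.
__attribute__((target("sse")))
static void saxpy_sse(int n, float a, const float* x, float* y) {
    const __m128 va = _mm_set1_ps(a);          // broadcast a to all 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; i++) y[i] = a * x[i] + y[i];
}
#endif

using saxpy_fn = void (*)(int, float, const float*, float*);

// Pick the SSE path only when the running processor reports SSE support.
saxpy_fn select_saxpy() {
#ifdef HAVE_SSE_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse")) return saxpy_sse;
#endif
    return saxpy_generic;
}
```

A caller would store the result of `select_saxpy()` once and reuse it, so the processor check is not repeated on every call.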

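3 Performance

Both the FVEC (DVEC) SIMD C++ class version and the compiler-vectorized C/C++ version use SSE and SSE2 instructions, which process several elements per instruction (4 with SSE, 2 with SSE2). SAXPY with SSE can therefore run at most 4 times faster than the scalar code on the Pentium 4 and Pentium III processors, and DAXPY with SSE2 at most 2 times faster.

4 Conclusion

SSE and SSE2 SIMD instructions can be applied to SAXPY and DAXPY in 2 ways: (1) with the SIMD C++ classes (provided with the C/C++ compiler), and (2) with C/C++ compiler vectorization guided by pragmas. Because method (2) leaves the source as ordinary C/C++ with pragmas, it is the approach that carries over to the Itanium(TM) processor.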

5 C/C++ code example

    /*
     * Saxpy from BLAS, Lawson, Hanson, Kincaid, and Krogh (1979)
     *
     * this compilation from _Linpack Users' Guide_, Dongarra, Moler,
     * Bunch, & Stewart, SIAM 1979, appendix A.
     *
     * These versions are "unit" saxpy (daxpy), meaning they assume stride = 1.
     * (note that this is what BLAS requires, and is the most common form)
     *
     * Assume all vectors are aligned.
     */
    void usaxpy (int n, float sa, float *sx, float *sy)
    {
        // Nothing to add when the scalar is zero.
        if (sa == 0.0) return;

        // The latest Intel compilers can now vectorize this loop.
        // Use the novector pragma to prevent vectorization.
    #pragma novector
        for (int i = 0; i < n; i++)
            sy[i] = sa * sx[i] + sy[i];
    }

    void udaxpy (int n, double da, double *dx, double *dy)
    {
        // Nothing to add when the scalar is zero.
        if (da == 0.0) return;

        // The latest Intel compilers can now vectorize this loop.
        // Use the novector pragma to prevent vectorization.
    #pragma novector
        for (int i = 0; i < n; i++)
            dy[i] = da * dx[i] + dy[i];
    }
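The listing above covers only unit stride. For comparison, the general BLAS-style interface with INCX and INCY can be sketched as follows. This is an illustrative sketch, not code from the original note: the template name `axpy_strided` is hypothetical, and the handling of negative increments (starting from the far end of the vector) follows the convention of the reference BLAS as far as I know it.

```cpp
#include <cassert>

// Hypothetical strided version: y[iy] = a * x[ix] + y[iy] for n elements,
// where ix and iy advance by incx and incy. T is float for SAXPY and
// double for DAXPY.
template <typename T>
void axpy_strided(int n, T a, const T* x, int incx, T* y, int incy) {
    if (n <= 0 || a == T(0)) return;
    // With a negative increment, start at the far end so that the same
    // n elements are visited in reverse order (0-based indexing).
    int ix = incx >= 0 ? 0 : (1 - n) * incx;
    int iy = incy >= 0 ? 0 : (1 - n) * incy;
    for (int i = 0; i < n; i++, ix += incx, iy += incy)
        y[iy] = a * x[ix] + y[iy];
}
```

With incx = incy = 1 this reduces to the unit-stride loop, which is the only case where consecutive elements sit next to each other in memory and can be loaded as one 128-bit value.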

6 SSE2 C++ class code example

    /* Assumes vectors are aligned, and that the vector length n is divisible
     * by 4 for saxpy and 2 for daxpy.
     */
    #include <fvec.h>
    #include <dvec.h>

    void usaxpy_fvec (int n, float sa, float *sx, float *sy)
    {
        // View the float arrays as arrays of 4-element vectors.
        F32vec4 *x = (F32vec4 *)sx, *y = (F32vec4 *)sy;
        F32vec4 a(sa);      // sa copied into all 4 elements

        if (sa == 0.0) return;

        n >>= 2;            // 4 floats per iteration
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    void udaxpy_dvec (int n, double da, double *dx, double *dy)
    {
        // View the double arrays as arrays of 2-element vectors.
        F64vec2 *x = (F64vec2 *) dx, *y = (F64vec2 *) dy;
        F64vec2 a(da);      // da copied into both elements

        if (da == 0.0) return;

        n >>= 1;            // 2 doubles per iteration
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
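For readers without the fvec.h and dvec.h class headers, roughly the same loops can be written with the SSE and SSE2 intrinsics from `<emmintrin.h>`. This is a sketch under stated assumptions, not the note's own code: the function names `usaxpy_sse` and `udaxpy_sse2` are hypothetical, unaligned loads and a scalar tail are used so that neither alignment nor a particular length is required, and the code falls back to a scalar loop when SSE2 is not enabled at compile time.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics (also pulls in the SSE ones)
#define USE_SSE2 1
#endif

// Single precision: 4 floats per 128-bit register.
void usaxpy_sse(int n, float sa, const float* sx, float* sy) {
    if (sa == 0.0f) return;
    int i = 0;
#ifdef USE_SSE2
    const __m128 a = _mm_set1_ps(sa);          // broadcast sa to 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(sx + i);
        __m128 y = _mm_loadu_ps(sy + i);
        _mm_storeu_ps(sy + i, _mm_add_ps(_mm_mul_ps(a, x), y));
    }
#endif
    for (; i < n; i++) sy[i] = sa * sx[i] + sy[i];   // scalar tail
}

// Double precision: 2 doubles per 128-bit register.
void udaxpy_sse2(int n, double da, const double* dx, double* dy) {
    if (da == 0.0) return;
    int i = 0;
#ifdef USE_SSE2
    const __m128d a = _mm_set1_pd(da);         // broadcast da to 2 lanes
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(dx + i);
        __m128d y = _mm_loadu_pd(dy + i);
        _mm_storeu_pd(dy + i, _mm_add_pd(_mm_mul_pd(a, x), y));
    }
#endif
    for (; i < n; i++) dy[i] = da * dx[i] + dy[i];   // scalar tail
}
```

When the data is known to be 16-byte aligned, as the listing above assumes, the aligned forms `_mm_load_ps` / `_mm_store_ps` and `_mm_load_pd` / `_mm_store_pd` could replace the unaligned ones.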

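7 SSE2 vectorized code example

    /* Assumes vectors are aligned */
    void usaxpy_vec (int n, float sa, float *sx, float *sy)
    {
        if (sa == 0.0) return;

        // ivdep: ignore assumed dependences between sx and sy.
        // vector aligned: use aligned accesses for sx and sy.
    #pragma ivdep
    #pragma vector aligned
        for (int i = 0; i < n; i++)
            sy[i] = sa * sx[i] + sy[i];
    }

    void udaxpy_vec (int n, double da, double *dx, double *dy)
    {
        if (da == 0.0) return;

        // ivdep: ignore assumed dependences between dx and dy.
        // vector aligned: use aligned accesses for dx and dy.
    #pragma ivdep
    #pragma vector aligned
        for (int i = 0; i < n; i++)
            dy[i] = da * dx[i] + dy[i];
    }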
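The ivdep pragma in the listing above is a promise that the two vectors do not overlap in a way that creates a dependence between iterations. The following sketch shows what goes wrong if that promise is false. It is an illustration in portable C++, not code from the original note: `saxpy_seq` and `saxpy_blocked4` are hypothetical names, and the second function only emulates a 4-wide SIMD loop by reading four elements before writing any result.

```cpp
#include <cassert>

// Sequential semantics: each iteration sees the writes of earlier ones.
void saxpy_seq(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
}

// Emulated 4-wide SIMD semantics: all four inputs are read before any of
// the four results is written. n is assumed to be a multiple of 4.
void saxpy_blocked4(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i += 4) {
        float t[4];
        for (int k = 0; k < 4; k++) t[k] = a * x[i + k] + y[i + k];
        for (int k = 0; k < 4; k++) y[i + k] = t[k];
    }
}
```

With a five-element buffer of ones, passing the buffer as x and the buffer shifted by one element as y makes each iteration read the value written by the previous one. The sequential loop then produces 1, 2, 3, 4, 5, while the blocked loop produces 1, 2, 2, 2, 2. For non-overlapping vectors the two functions agree, which is the situation the pragma asserts.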

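Appendix A - Performance data

Revision history

2.0  1.20 GHz Pentium 4 processor, July 2000
1.0  September 1999

Table 1: SAXPY/DAXPY performance (MFLOPS)

                     Pentium III (733 MHz)   Pentium 4 (1.20 GHz)
SAXPY: C                     360                     523
SAXPY: FVEC                  837                    1932
SAXPY: vectorized            865                    1476
DAXPY: C                     353                     640
DAXPY: DVEC                  N/A                     951
DAXPY: vectorized            N/A                     672

Table 1 shows the performance measured on a 733 MHz Pentium III processor and a 1.20 GHz Pentium 4 processor. Tables 2 and 3 list the system configurations.

Each element costs 2 floating-point operations (one multiply and one add), which is the basis of the FLOPS figures. SAXPY with SIMD (SSE) instructions processes 4 elements at a time. The Pentium III processor reaches roughly 2 times the scalar rate, and the Pentium 4 processor reaches 3.7 times the scalar rate, about 1.9 GFLOPS with SSE. DAXPY with SSE2 reaches 1.5 times the scalar rate, 951 MFLOPS. The DAXPY gain is smaller because 1 SIMD instruction processes 2 elements, against 4 for SAXPY.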
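The arithmetic behind these figures can be reproduced with a few lines of code. This is a minimal sketch, not the measurement harness used for Table 1, whose details are not given in this transcript. It assumes 2 floating-point operations per element, and the names `mflops`, `speedup`, and `measure_saxpy_mflops` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// MFLOPS for `reps` passes over n elements, at 2 flops per element.
inline double mflops(long long n, long long reps, double seconds) {
    return 2.0 * n * reps / seconds / 1.0e6;
}

// Ratio of an optimized rate to a baseline rate.
inline double speedup(double optimized, double baseline) {
    return optimized / baseline;
}

// Time a plain scalar SAXPY loop on this machine and report its MFLOPS.
double measure_saxpy_mflops(int n, int reps) {
    std::vector<float> x(n, 1.0f), y(n, 0.0f);
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++) y[i] = 0.5f * x[i] + y[i];
    auto t1 = std::chrono::steady_clock::now();
    volatile float sink = y[0];   // keep the loop from being optimized away
    (void)sink;
    return mflops(n, reps, std::chrono::duration<double>(t1 - t0).count());
}
```

Applied to Table 1, `speedup(1932, 523)` is about 3.7 for SAXPY with FVEC on the Pentium 4 processor, `speedup(951, 640)` is about 1.5 for DAXPY with DVEC, and `speedup(837, 360)` is about 2.3 for SAXPY with FVEC on the Pentium III processor.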

Table 2: Pentium III processor system configuration

Processor            Pentium III (733 MHz)
Motherboard          Desktop Board VC820
BIOS                 VC82010A.86A.0028.P10
L2 cache             256 KB
Memory               128 MB RDRAM PC800-45
Ultra ATA driver     6.00.012
Hard disk            IBM DJNA-371800 ATA-66
Video card           Creative Labs 3D Blaster Annihilator Pro AGP, nvidia GeForce256 DDR 32MB
Video driver         Nvidia Reference Driver 5.22
Operating system     Windows 2000 build 2195

Table 3: Pentium 4 processor system configuration

Processor            Pentium 4 (1.20 GHz)
Motherboard          Desktop Board D850GB
BIOS                 GB85010A.86A.0014.D.0007201756
L2 cache             256 KB
Memory               128 MB RDRAM PC800-45
Ultra ATA driver     6.00.012
Hard disk            IBM DJNA-371800 ATA-66
Video card           Creative Labs 3D Blaster Annihilator Pro AGP, nvidia GeForce256 DDR 32MB
Video driver         NVidia Reference Driver 5.22
Operating system     Windows 2000 build 2195