AutoTuned-RB


ABCLib Working Notes No.10 AutoTuned-RB Version 1.00

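AutoTuned-RB provides the routine RB_DGEMM.

RB_DGEMM ( TransA, TransB, M, N, K, a, A, lda, B, ldb, b, C, ldc )

Like the Level 3 BLAS routine DGEMM, it computes

    C := a * Trans(A) * Trans(B) + b * C        (1)

Arguments:
    TransA: whether A is transposed, 'n' or 't' (char)
    TransB: whether B is transposed, 'n' or 't' (char)
    M:      number of rows of A and C (int)
    N:      number of columns of B and C (int)
    K:      number of columns of A and rows of B (int)
    a:      scalar a in (1) (double)
    A:      matrix A (double)
    lda:    leading dimension of A (int)
    B:      matrix B (double)
    ldb:    leading dimension of B (int)
    b:      scalar b in (1) (double)
    C:      matrix C (double)
    ldc:    leading dimension of C (int)

Return value: RB_DGEMM returns void.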
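As a cross-check for results returned by RB_DGEMM, operation (1) can be written as a plain triple loop. The routine below is a minimal sketch, not part of AutoTuned-RB: the name ref_dgemm is hypothetical, and it assumes row-major storage as used by the indexing in the test program (element (i, j) of an untransposed matrix X is X[i*ldx + j]). Whether the library itself expects row-major or column-major data is not stated in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of AutoTuned-RB).
 * Computes C := a * op(A) * op(B) + b * C with plain loops.
 * Assumption: row-major storage, X(i,j) == X[i*ldx + j]. */
void ref_dgemm(char TransA, char TransB, int M, int N, int K,
               double a, const double *A, int lda,
               const double *B, int ldb,
               double b, double *C, int ldc)
{
    int ta = (TransA == 't' || TransA == 'T');
    int tb = (TransB == 't' || TransB == 'T');

    assert(M >= 0 && N >= 0 && K >= 0);

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int p = 0; p < K; p++) {
                /* op(A)(i,p) and op(B)(p,j), honouring the transpose flags */
                double aip = ta ? A[(size_t)p * lda + i] : A[(size_t)i * lda + p];
                double bpj = tb ? B[(size_t)j * ldb + p] : B[(size_t)p * ldb + j];
                sum += aip * bpj;
            }
            double *c = &C[(size_t)i * ldc + j];
            /* With b == 0 the old value of C is ignored entirely. */
            *c = (b == 0.0) ? a * sum : a * sum + b * (*c);
        }
    }
}
```

Calling ref_dgemm with the same arguments as RB_DGEMM on a small matrix and comparing the two C arrays element by element gives a quick sanity check of an installation.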

BLAS 1CPU SMP
    (1) L1
    (2) L3
    (3) L3 SMP SMP (1) (2) 1CPU (1)

AutoTuned-RB (Automatically Tuned Recursive BLAS)

AutoTuned-RB BLAS ATLAS BLAS
    (1) 1CPU SMP
    (2) BLAS OS
    (3) C
    (4) Posix Thread

L3 L1 L2
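The name "Recursive BLAS" points to a recursive formulation of matrix multiplication, in which the operands are halved until the blocks are small enough to stay in cache. The source does not show the library's kernel, so the following is only a generic sketch of that idea under stated assumptions: rec_mm and base_mm are hypothetical names, storage is row-major, and the recursion count is passed explicitly as depth. A tuned library would call an optimised BLAS kernel in the base case instead of plain loops.

```c
#include <assert.h>
#include <stddef.h>

/* Base case: C += A * B on an m x k by k x n block (row-major, plain loops). */
static void base_mm(int m, int n, int k,
                    const double *A, int lda,
                    const double *B, int ldb,
                    double *C, int ldc)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++) {
            double aip = A[(size_t)i * lda + p];
            for (int j = 0; j < n; j++)
                C[(size_t)i * ldc + j] += aip * B[(size_t)p * ldb + j];
        }
}

/* Hypothetical recursive driver: split the largest dimension in half
 * `depth` times, then fall back to base_mm. Computes C += A * B. */
void rec_mm(int depth, int m, int n, int k,
            const double *A, int lda,
            const double *B, int ldb,
            double *C, int ldc)
{
    assert(depth >= 0);

    if (depth == 0 || (m <= 1 && n <= 1 && k <= 1)) {
        base_mm(m, n, k, A, lda, B, ldb, C, ldc);
        return;
    }
    if (m >= n && m >= k) {            /* split the rows of A and C */
        int h = m / 2;
        rec_mm(depth - 1, h, n, k, A, lda, B, ldb, C, ldc);
        rec_mm(depth - 1, m - h, n, k, A + (size_t)h * lda, lda,
               B, ldb, C + (size_t)h * ldc, ldc);
    } else if (n >= k) {               /* split the columns of B and C */
        int h = n / 2;
        rec_mm(depth - 1, m, h, k, A, lda, B, ldb, C, ldc);
        rec_mm(depth - 1, m, n - h, k, A, lda, B + h, ldb, C + h, ldc);
    } else {                           /* split the shared dimension k */
        int h = k / 2;
        rec_mm(depth - 1, m, n, h, A, lda, B, ldb, C, ldc);
        rec_mm(depth - 1, m, n, k - h, A + h, lda,
               B + (size_t)h * ldb, ldb, C, ldc);
    }
}
```

With depth 0 the call degenerates to a single base-case multiplication; larger depths produce smaller blocks, and the best depth depends on the cache sizes of the machine.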

Installing AutoTuned-RB (SMP / 1CPU)

(1) Edit the Makefile.
(2) Run make config.
(3) Once the Config file has been generated, run make install.

######################################################################
Makefile

########################################
#
# AutoTuned-RB Install Makefile
#
#
#
#########################################
SHELL = /bin/sh
ARCH = Linux_P4SSE2

# BLAS library directories (ATLAS)
#########################################
TOPdir = /home/kinoshita
ATLdir = $(TOPdir)/ATLAS
ATLlibdir = $(ATLdir)/lib/$(ARCH)
########################################
# AutoTuned-RB directories
ATRBdir = $(TOPdir)/ATRB
ATRBobjdir = $(ATRBdir)/source/$(ARCH)
ATRBsrcdir = $(ATRBdir)/source
ATRBincdir = $(ATRBdir)/include/$(ARCH)
ATRBlibdir = $(ATRBdir)/lib/$(ARCH)
#ATRbindir = $(TOPdir)/bin/$(ARCH)
#########################################
CC = cc
CCFLAGS = -O3
NM = -o
OJ = -c
#########################################
ARCHIVER = ar
ARFLAGS = r
#########################################
LATL = -latlas
LPTH = -lpthread
#Math = -lm
#########################################
all:install

config:
	rm -f config
	$(CC) -lm config.c $(NM) config
	./config
	mkdir -p $(ATRBincdir)
	mv Confg.h $(ATRBincdir)/
	rm -f config

install:$(ATRBsrcdir)/ABCLib_BLAS_Src.c $(ATRBsrcdir)/ABCLib_BLAS_Recursive.c
	rm -f $(ATRBlibdir)/*.a
	$(CC) $(ATRBsrcdir)/ABCLib_BLAS_Src.c $(NM) ABCLib_BLAS_Src $(CCFLAGS) -I$(ATRBincdir) -L$(ATLlibdir) $(LATL) $(LPTH)
	./ABCLib_BLAS_Src
	mv Mtdev.h $(ATRBincdir)/
	rm -f ABCLib_BLAS_Src
	$(CC) $(OJ) -I$(ATRBincdir) $(ATRBsrcdir)/ABCLib_BLAS_Recursive.c
	mkdir -p $(ATRBobjdir)
	mv $(ATRBdir)/ABCLib_BLAS_Recursive.o $(ATRBobjdir)/
	mkdir -p $(ATRBlibdir)
	$(ARCHIVER) $(ARFLAGS) $(ATRBlibdir)/libatr.a $(ATRBobjdir)/ABCLib_BLAS_Recursive.o

.PHONY : cleanall cleanbin cleanobj cleanlib cleanhead

cleanall: cleanbin cleanobj cleanlib cleanhead

cleanbin:
	rm -f $(ATRBdir)/ABCLib_BLAS_Src
	rm -f $(ATRBdir)/config

cleanobj:
	rm -f $(ATRBobjdir)/*.o
	rmdir $(ATRBobjdir)

cleanlib:
	rm -f $(ATRBlibdir)/*.a
	rmdir $(ATRBlibdir)

cleanhead:
	rm -f $(ATRBincdir)/*.h
	rmdir $(ATRBincdir)
#####################################################################

After editing the Makefile, run make config. It asks for the following and then writes the Configure file:

(1) Whether the machine is an SMP
(2) Number of CPUs
(3) Whether the machine has an L3 cache
(4) L1 cache size
(5) L3 cache size (L2 cache size if there is no L3 cache)
(6) The Configure file is generated

Once the Configure file has been generated, run make install.

AutoTuned-RB SMP 10 1CPU 2

[Figure: flow of make config. Makefile, make config, SMP? (No / Yes), L1, L3? (Yes: L3 / No: L2), Configure file generated.]

ATRB/                  Makefile, config.c
ATRB/source            ABCLib_BLAS_Recursive.c, ABCLib_BLAS_Src.c
ATRB/source/<arch>     ABCLib_BLAS_Recursive.o, Recursive_p.o
ATRB/include/<arch>    Confg.h, Mtdev.h
ATRB/lib/<arch>        libatr.a
ATRB/test              Makefile, ABCLib_BLAS_Test.c

ATRB
    source/<arch>
    include/<arch>
    lib/<arch>
    test

Figure: AutoTuned-RB directory structure

test/ABCLib_BLAS_Test.c (the matrix size is set by MATRIXSIZE)

/******************************************/
/*                                        */
/*  Posix Pthread version                 */
/*  Recursive BLAS Matrix Multiplication  */
/*                                        */
/*  Yasuo Kinoshita                       */
/*                                        */
/******************************************/
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>
#include <time.h>

#define MATRIXSIZE 4000

/* Prototype written from the RB_DGEMM interface description;
   the matrix arguments are assumed to be pointers to double. */
void RB_DGEMM(char TransA, char TransB, int M, int N, int K,
              double a, double *A, int lda, double *B, int ldb,
              double b, double *C, int ldc);

int main(int argc, char **argv)
{
    int M, N, K;
    double *C, *A, *B;
    char TransA = 'n', TransB = 'n';
    double a = 1.0, b = 0.0;
    int lda, ldb, ldc;
    int i, j, x;
    struct timeval t1, t2;
    double soltime, sec, usec;
    time_t seed;

    time(&seed);
    srand((unsigned int)seed);

    M = MATRIXSIZE; N = M; K = M;
    lda = K; ldb = N; ldc = M;

    printf("MatrixC:%d*%d ", M, N);
    printf("MatrixA:%d*%d ", M, K);
    printf("MatrixB:%d*%d \n", K, N);

    C = (double *)malloc(sizeof(double) * (M * N));
    A = (double *)malloc(sizeof(double) * (M * K));
    B = (double *)malloc(sizeof(double) * (K * N));
    if (A == NULL || B == NULL || C == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    /* C is cleared; A and B are filled with random integers in [0, x). */
    x = 10;
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            *(C + j + i * N) = 0.0;
        }
    }
    for (i = 0; i < M; i++) {
        for (j = 0; j < K; j++) {
            *(A + j + i * K) = (double)(rand() % x);
        }
    }
    for (i = 0; i < K; i++) {
        for (j = 0; j < N; j++) {
            *(B + j + i * N) = (double)(rand() % x);
        }
    }

    /* Time a single call to RB_DGEMM. */
    gettimeofday(&t1, NULL);
    RB_DGEMM(TransA, TransB, M, N, K, a, A, lda, B, ldb, b, C, ldc);
    gettimeofday(&t2, NULL);

    sec = t2.tv_sec - t1.tv_sec;
    usec = t2.tv_usec - t1.tv_usec;
    soltime = (sec + usec / 1000000.0);

    printf("Solve time = %0.3lf \n", soltime);
    /* 2*M^3 floating-point operations for a square product. */
    printf("Mflops = %0.3lf \n", 2.0 * M * M * (M / soltime) / 1000000);

    free(C);
    free(A);
    free(B);
    return 0;
}
#####################################################################
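The Mflops figure printed by the test program follows from the operation count of a matrix product: multiplying an M x K matrix by a K x N matrix takes 2*M*N*K floating-point operations, so the rate is 2*M*N*K / (seconds * 10^6). The helpers below are a small sketch of that arithmetic; the names elapsed_sec and dgemm_mflops are hypothetical and not part of AutoTuned-RB. Doing the multiplication in double avoids the int overflow that an expression such as 2*M*M would hit for M of 32768 or more.

```c
#include <assert.h>
#include <sys/time.h>

/* Elapsed wall-clock seconds between two gettimeofday() samples. */
double elapsed_sec(const struct timeval *t1, const struct timeval *t2)
{
    return (double)(t2->tv_sec - t1->tv_sec)
         + (double)(t2->tv_usec - t1->tv_usec) / 1.0e6;
}

/* Mflops achieved by an (M x K) * (K x N) product that took `seconds`.
 * The product costs 2*M*N*K floating-point operations. */
double dgemm_mflops(int M, int N, int K, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)M * (double)N * (double)K / seconds / 1.0e6;
}
```

For example, a 1000 x 1000 x 1000 product that takes half a second corresponds to 4000 Mflops.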

Example: installing AutoTuned-RB on an SMP machine

########################################################################
kinoshita@opt01:~/ATRB> make config
rm -f config
cc -lm config.c -o config
./config
############################################
ABCLib-BLAS version ver.1.0
composed by Yasuo Kinoshita
Graduate School of Information Systems,
The University of Electro-Communications
/JAPAN SCIENCE AND TECHNOLOGY AGENCY
2004/01/16
AutoTuned-RB Configure
############################################
============== make Confg.h ================
SMP SUPPORT[y/n]: y
Input Number of CPU : 2
======== Sampling point Setting ===========
Input L1Cache Size[KByte]: 64
L3Cache Machine?[y/n]: n
Input L2CacheSize[KByte]: 1024
Configration Completed!!
Type "make install" if you continue install
mkdir -p /home/kinoshita/ATRB/include/Linux_UNKNOWNSSE2_2
mv Confg.h /home/kinoshita/ATRB/include/Linux_UNKNOWNSSE2_2/
rm -f config
kinoshita@opt01:~/ATRB> make install
rm -f /home/kinoshita/ATRB/lib/Linux_UNKNOWNSSE2_2/*.a
cc /home/kinoshita/ATRB/source/ABCLib_BLAS_Src.c -o ABCLib_BLAS_Src -O3 -I/home/kinoshita/ATRB/include/Linux_UNKNOWNSSE2_2 -L/home/kinoshita/AutoTuned-RB/ATLAS/lib/Linux_UNKNOWNSSE2_2 -latlas -lpthread
./ABCLib_BLAS_Src
############################################
ABCLib-BLAS version ver1.0
composed by Yasuo Kinoshita
Graduate School of Information Systems,
The University of Electro-Communications
/JAPAN SCIENCE AND TECHNOLOGY AGENCY
2005/01/16
AutoTuned-RB Install-time Optimization
############################################
########## SMP Machine Tuning ##########
########## Near L1 Cache size #########
MATRIX SIZE 123 * 123
RNUM    TIME     MFLOPS
=============================
0       0.002    1542.6
1       0.002    1602.1
2       0.003    1487.1
3       0.003    1400.2
########## In L3 Cache size #########
MATRIX SIZE 395 * 395
RNUM    TIME     MFLOPS
=============================
0       0.065    1906.9
1       0.039    3197.5
2       0.041    3021.7
3       0.043    2899.4
4       0.050    2448.9
5       0.065    1901.1

Matrixsize    OptiNum
==========================
123           1
395           1
Tuning Completed!
mv Mtdev.h /home/kinoshita/ATRB/include/Linux_UNKNOWNSSE2_2/
rm -f ABCLib_BLAS_Src
cc -c -I/home/kinoshita/ATRB/include/Linux_UNKNOWNSSE2_2 /home/kinoshita/ATRB/source/ABCLib_BLAS_Recursive.c
mkdir -p /home/kinoshita/ATRB/source/Linux_UNKNOWNSSE2_2
mv /home/kinoshita/ATRB/ABCLib_BLAS_Recursive.o /home/kinoshita/ATRB/source/Linux_UNKNOWNSSE2_2/
mkdir -p /home/kinoshita/ATRB/lib/Linux_UNKNOWNSSE2_2
ar r /home/kinoshita/ATRB/lib/Linux_UNKNOWNSSE2_2/libatr.a /home/kinoshita/ATRB/source/Linux_UNKNOWNSSE2_2/ABCLib_BLAS_Recursive.o
kinoshita@opt01:~/ATRB>
#####################################################################

Example: running the test program (test/ABCLib_BLAS_Test.c) with 1000 x 1000 matrices

kinoshita@opt01:~/ATRB/test> make dgemm
cc ABCLib_BLAS_Test.c -o ABCLib_BLAS_Test -O3 -L/home/kinoshita/AutoTuned-RB/ATLAS/lib/Linux_UNKNOWNSSE2_2 -L/home/kinoshita/ATRB/lib/Linux_UNKNOWNSSE2_2 -latr -latlas -lpthread -lm
kinoshita@opt01:~/ATRB/test> ./ABCLib_BLAS_Test
MatrixC:1000*1000 MatrixA:1000*1000 MatrixB:1000*1000
s = 2
Solve time = 0.509
Mflops = 3926.897
kinoshita@opt01:~/ATRB/test>

(See test/Makefile.)
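In the install log, the OptiNum reported for each matrix size coincides with the RNUM row that reached the highest MFLOPS (RNUM 1 for both 123 and 395). Assuming that this is the selection rule, which the source does not state explicitly, the choice amounts to an argmax over the measured rates. The function name best_rnum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index (RNUM) of the largest entry in mflops[0..count-1].
 * Ties are resolved in favour of the smaller RNUM. */
int best_rnum(const double *mflops, int count)
{
    assert(mflops != NULL && count > 0);

    int best = 0;
    for (int r = 1; r < count; r++) {
        if (mflops[r] > mflops[best])
            best = r;
    }
    return best;
}
```

Applied to the two tables in the log, the function returns 1 in both cases, which agrees with the OptiNum column.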

BLAS3 BLAS AutoTuned-RB http://www.abc-lib.org/online/abclib.htm