1 ARM 2017 6 6 ARM Raspberry Pi 3 ARM 64 C ARM 1 2 1.1............................................ 2 1.2............................................. 3 1.3................................ 3 1.4.......................................... 4 1.5.................................... 4 2 5 2.1...................................... 5 2.2......................................... 7 3 ARMv8 8 3.1 Hello, world!............................................ 8 3.2 hello.s.......................................... 9 3.3 Tips............................................ 10 3.4 LEGv8.................................... 10 3.5 (2.3 )........................................ 12 3.6 (2.1 )......................................... 13 3.7 (2.4 )................................. 15 3.8 (2.5 )........................................ 15 3.9 (2.6 )......................................... 15 3.10 (2.7 )......................................... 15 3.11 (2.8 )........................................... 18 3.12 (2.9 )................................... 24 3.13 (2.10 )........................ 24 : contact17@numericalbrain.org, : http://numericalbrain.org/
2017 ARM 2 3.14 : (2.11 ).................................. 24 3.15 (2.12 )................................ 24 3.16 C (2.13 )................................ 24 3.17.............................................. 28 3.18 ( 3 )........................... 28 3.19 Introduction (3.1 )....................................... 28 3.20 (3.2 )...................................... 28 3.21 (3.3 )........................................... 28 3.22 (3.4 )........................................... 28 3.23 (3.5 )........................................ 29 3.24 (3.6 )................................... 31 A io.s 32 1 ARM 10 2020 *1 CPU SPARC ARM ARM *2 CPU C ARM 1.1 ARM 64 (aarch64) MIPS ( ) (N-body) *1 https://brain-hpc.jp/ *2 iphone Apple Ax Android Qualcomm Snapdragon ARM
ARM ( SIMD ) MPI

1.2
Raspberry Pi 3 ( ) 1 1
OS: 64 bit Arch Linux ARM
64bit CED ARM MPI

1. David A. Patterson, John L. Hennessy. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan Kaufmann, 2016.
2. ARM Cortex-A Series Programmer's Guide for ARMv8-A, Version 1.0. https://developer.arm.com/docs/den0024/a/1-introduction
3. ARM Cortex-A53 Processor. https://developer.arm.com/products/processors/cortex-a/cortex-a53
4. 2016 12, CQ.
5. Ananth Grama, et al. Introduction to Parallel Computing. 2nd Edition, Addison Wesley, 2003.

1. Linux Arm64. http://www.mztn.org/dragon/arm6400idx.html
2. Exploring AArch64 assembler, Chapter 1. http://thinkingeek.com/2016/10/08/exploring-aarch64-assembler-chapter1/
3. The GNU Assembler. http://www-ug.eecg.toronto.edu/msl/assembler.html
4. Using as. http://www.delorie.com/gnu/docs/binutils/as.html

GNU as
ARM Architecture Reference Manual: ARMv8, for ARMv8-A architecture profile
https://silver.arm.com/download/arm and AMBA Architecture/AR150-DA-70000-r0p0-02eac0/DDI0487B a a 6354

1.3 ARM
2017 ARM 4 * 3 1.4 *4 SIMD MPI 1.5 1 5/19 2 6/2 3 4 5 6 7 8 7/21? SIMD *3 LLVM *4
2

2.1
64bit linux
Arch Linux ARM: https://archlinuxarm.org/platforms/armv8/broadcom/raspberry-pi-3
64bit URL Installation

1. 1. # fdisk /dev/sdx linux
2. 5. OS AArch64 Installation http://os.archlinuxarm.org/os/archlinuxarm-rpi-3-latest.tar.gz 32bit
3. sync

Ubuntu 16
1. sudo su (HHK Pro Ctrl+Command+Fn+1) root sudo bash passwd root export LC_ALL=C
2. Ubuntu ps kill
3. sync sync SD
4. umount root mount mount ext4

OS micro SD AC

    Arch Linux 4.10.1-1-ARCH (tty1)
    alarm login:

root su su Arch Linux pacman

    [alarm@alarm ~]$ su
    Password: <root >
    [alarm@alarm ~]# pacman -Syu
    < -Syu> : < >
    [alarm@alarm ~]#

1. /etc/pacman.d/mirrorlist us pacman -Syyu y 2 (yy)
2. linux-aarch64
3. PC USB

    [alarm@alarm ~]# reboot

OS

    [alarm@alarm ~]$ uname -a
    Linux alarm 4.10.13-1-ARCH #1 SMP Fri Apr 28 20:02:39 MDT 2017 aarch64 GNU/Linux

gcc emacs emacs

    [alarm@alarm ~]$ su
    Password:
    [alarm@alarm ~]# pacman -S gcc
    : < >
    [alarm@alarm ~]# pacman -S emacs
    : < >
    [alarm@alarm ~]#

alarm kobo users kobo

    [alarm@alarm ~]# useradd -m -g users -s /bin/bash kobo
    [alarm@alarm ~]# passwd kobo

Hello, world

Listing 1: hello.c

    #include<stdio.h>
    int main(void)
    {
        printf("Hello, world!\n");
        return 0;
    }

    [kobo@alarm ~]$ gcc -Wall -o hello hello.c
    [kobo@alarm ~]$ ./hello
    Hello, world!
    [kobo@alarm ~]$

2.2

CPU: ARM Cortex-A53, 4 cores.
Raspberry Pi 3 is out now! Specs, benchmarks & more
https://www.raspberrypi.org/magpi/raspberry-pi-3-specs-benchmarks/

Kernel Panic OS
3 ARMv8

3.1 Hello, world!

The Hello world program consists of hello/hello.s and hello/io.s. io.s holds helper routines (*5); hello.s is the program itself.

Listing 2: hello.s

     1  .include "io.s" // include some helper functions
     2
     3  .text
     4
     5  .global _start
     6  _start:
     7          adr x0, mystring
     8          bl print0s
     9
    10          adr x20, mydword
    11          ldur x1, [x20]
    12          bl print1d
    13
    14          mov x21, #124
    15          stur x21, [x20]
    16          ldur x1, [x20]
    17          bl print1d
    18
    19          adr x22, mydwords
    20          ldur x1, [x22, #0]
    21          bl print1d
    22          ldur x1, [x22, #8]
    23          bl print1d
    24
    25          exit
    26
    27  .data
    28
    29  mydword:
    30          .dword 123
    31  mydwords:
    32          .dword 125, 126, 127
    33  mystring:
    34          .string "Hello, world!"

    [alarm@alarm hello]$ as -o hello.o hello.s
    [alarm@alarm hello]$ ld -o hello hello.o -lc
    [alarm@alarm hello]$

*5 io.s is listed in Appendix A.
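    [alarm@alarm hello]$ ./hello
    Hello, world!
    123
    124
    125
    126
    [alarm@alarm hello]$

Here as assembles and ld links. If the label _start is renamed to main, gcc can do both steps at once (*6):

    [alarm@alarm hello]$ gcc -o hello hello.s
    [alarm@alarm hello]$

3.2 hello.s

hello.s is written in the GNU as syntax for ARM (*7). The line numbers below refer to Listing 2.

Line 1 includes io.s. Text after // is a comment (*8).
Lines 3 and 27: .text and .data start the text (code) section and the data section.
Line 5: .global (*9) exports the symbol _start.
Line 6: _start is the entry point, corresponding to main in C. A label ends with a colon (:).
Lines 7, 8 print Hello, world!. Line 7 puts the address of the string labelled mystring into x0. Line 8 calls print0s with bl (branch with link). print0s prints the string whose address is in x0; it is defined in io.s.
Lines 10 to 12 print 123. Line 10 puts the address of mydword into x20. Line 11 loads the value at that address into x1 (ldur). Line 12 calls print1d, which prints the value in x1.

*6 start main
*7 ARM GNU as
*8
*9 .globl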
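Lines 14 to 17 print 124. Line 14 sets x21 to 124; # marks an immediate value (*10). Line 15 stores x21 to mydword (stur). Line 16 loads mydword again, and line 17 prints it: the value has changed from 123 to 124.
Lines 19 to 23 print 125 and 126. Line 19 puts the address of mydwords into x22. Line 20 loads the value at offset 0 from mydwords into x1; this is the same form as line 11, where the offset 0 was omitted. Line 21 prints it. Line 22 loads the value 8 bytes after mydwords, that is, the element following 125, into x1, and line 23 prints it.
Line 25: exit ends the program.
Line 27 starts the data section. Lines 29, 30 define a 64-bit value 123 labelled mydword; .dword (double word) declares 64-bit data. Lines 31, 32 define an array of 64-bit values [125, 126, 127] labelled mydwords; elements are separated by commas (,). Lines 33, 34 define the string Hello, world! labelled mystring with .string (*11).

3.3 Tips

To see the assembly that gcc generates from C code:

    [alarm@alarm hello]$ gcc -S foo.c

This compiles foo.c into foo.s.

3.4 LEGv8

ARM (*12) LEGv8

*10
*11 .asciz .ascii
*12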
LEGv8 (*13) has 32 registers and 2^62 memory words (*14) (Table 1).

Table 1: LEGv8 operands

    Name               Example                                    Comments
    32 registers       X0 ... X30, XZR                            XZR always reads 0
    2^62 memory words  Memory[0], Memory[4], ...,                 LEGv8 uses byte addresses, so sequential
                       Memory[4,611,686,018,427,387,904]          doubleword addresses differ by 8

The adr and mov instructions used so far are not among the LEGv8 instructions of Table 2.

Table 2: LEGv8 instructions

    Instruction                        Example                  Meaning                      Comments
    add                                ADD X1, X2, X3           X1 = X2 + X3                 3 register operands
    subtract                           SUB X1, X2, X3           X1 = X2 - X3                 3 register operands
    add immediate                      ADDI X1, X2, #20 (*15)   X1 = X2 + 20
    subtract immediate                 SUBI X1, X2, #20         X1 = X2 - 20
    add and set flags                  ADDS X1, X2, X3          X1 = X2 + X3
    subtract and set flags             SUBS X1, X2, X3          X1 = X2 - X3
    add immediate and set flags        ADDIS X1, X2, #20        X1 = X2 + 20
    subtract immediate and set flags   SUBIS X1, X2, #20        X1 = X2 - 20
    load register                      LDUR X1, [X2, #40]       X1 = Memory[X2 + 40]
    store register                     STUR X1, [X2, #40]       Memory[X2 + 40] = X1
    load signed word                   LDURSW X1, [X2, #40]     X1 = Memory[X2 + 40]
    store word                         STURW X1, [X2, #40]      Memory[X2 + 40] = X1
    load half                          LDURH X1, [X2, #40]      X1 = Memory[X2 + 40]
    store half                         STURH X1, [X2, #40]      Memory[X2 + 40] = X1
    load byte                          LDURB X1, [X2, #40]      X1 = Memory[X2 + 40]
    store byte                         STURB X1, [X2, #40]      Memory[X2 + 40] = X1
    load exclusive register            LDXR X1, [X2, #0]        X1 = Memory[X2 + 0]
    store exclusive register           STXR X1, [X2, #0]        Memory[X2 + 0] = X1
    move wide with zero                MOVZ X1, #20, LSL 0      X1 = 20                      16-bit constant
    move wide with keep                MOVK X1, #20, LSL 0      X1 = 20                      16-bit constant
    and                                AND X1, X2, X3           X1 = X2 & X3                 AND
    inclusive or                       ORR X1, X2, X3           X1 = X2 | X3                 OR
    exclusive or                       EOR X1, X2, X3           X1 = X2 ^ X3                 XOR
    and immediate                      ANDI X1, X2, #20         X1 = X2 & 20                 AND
    inclusive or immediate             ORRI X1, X2, #20         X1 = X2 | 20                 OR
    exclusive or immediate             EORI X1, X2, #20         X1 = X2 ^ 20                 XOR
    logical shift left                 LSL X1, X2, #10          X1 = X2 << 10
    logical shift right                LSR X1, X2, #10          X1 = X2 >> 10
    compare and branch on equal 0      CBZ X1, #25 (*16)        if X1 == 0 goto PC+100       test against 0
    compare and branch on not equal 0  CBNZ X1, #25             if X1 != 0 goto PC+100       test against 0
    branch conditionally               B.cond #25               if condition goto PC+100
    branch                             B #2500                  goto PC+10000
    branch to register                 BR X30                   goto X30
    branch with link                   BL #2500                 X30 = PC+4; PC += 10000

*13 LEGv8 stands for Lessen Extrinsic Garrulity.
*14 1 word = 32 bits.
3.5 (textbook Section 2.3)

Memory is read with LDUR (*17) and written with STUR (*18).

Example: the array A = [0, 1, 2, 3, 4] has its base address in X22, and g is in X9.

    g = A[3];

    LDUR X9, [X22, #24]

Each element is 8 bytes, so element 3 is at byte offset 8 × 3 = 24.

Listing 3: load.s

    .include "io.s"
    .text
    .global _start
    _start:
            adr x22, A              // get the address of A
            ldur x9, [x22, #24]     // g = A[3], where 24 = 8 * 3
            mov x1, x9              // X1 = g for print
            bl print1d
            exit

    .data
    A:
            .dword 0, 1, 2, 3, 4

The program prints 3:

    [alarm@alarm ~]$ as -o load.o load.s
    [alarm@alarm ~]$ ld -o load load.o -lc
    [alarm@alarm ~]$ ./load
    3
    [alarm@alarm ~]$

Example: the array A = [0, 1, 2, 3, 4] has its base address in X22, and g is in X9.

    A[3] = g;

    STUR X9, [X22, #24]

With g = 33:

*17 (LoaD Register) + (Unscaled); the U stands for unscaled. 2.19
*18 (STore Register) + (Unscaled); the U stands for unscaled.
Listing 4: store.s

    .include "io.s"
    .text
    .global _start
    _start:
            adr x22, A              // get the address of A
            mov x9, #33             // g = 33
            stur x9, [x22, #24]     // A[3] = g
            ldur x1, [x22, #24]     // X1 = A[3]
            bl print1d
            exit

    .data
    A:
            .dword 0, 1, 2, 3, 4

    [alarm@alarm ~]$ as -o store.o store.s
    [alarm@alarm ~]$ ld -o store store.o -lc
    [alarm@alarm ~]$ ./store
    33
    [alarm@alarm ~]$

3.6 (textbook Section 2.1)

2 (immediate)
ADD and SUB take 3 operands.

Example: f, g, h, i, j are in X19, ..., X23.

    f = (g + h) - (i + j);

    ADD X9, X20, X21        // X9 = g + h
    ADD X10, X22, X23       // X10 = i + j
    SUB X19, X9, X10        // X19 = X9 - X10 = (g + h) - (i + j)

With g = 5, h = 4, i = 3, j = 2:

Listing 5: addsub.s

    .include "io.s"
    .text
    .global _start
    _start:
            mov x20, #5             // g = 5
            mov x21, #4             // h = 4
            mov x22, #3             // i = 3
            mov x23, #2             // j = 2
            add x9, x20, x21        // X9 = g + h
            add x10, x22, x23       // X10 = i + j
            sub x19, x9, x10        // X19 = X9 - X10 = (g + h) - (i + j) = f
            mov x1, x19             // X1 = X19 for print
            bl print1d
            exit

(5 + 4) - (3 + 2) = 4

    [alarm@alarm ~]$ as -o addsub.o addsub.s
    [alarm@alarm ~]$ ld -o addsub addsub.o -lc
    [alarm@alarm ~]$ ./addsub
    4
    [alarm@alarm ~]$

Example: h is in X21 and the base address of A is in X22.

    A[4] = h + A[3];

    LDUR X9, [X22, #24]     // X9 = A[3]
    ADD X9, X21, X9         // X9 = h + A[3]
    STUR X9, [X22, #32]     // A[4] = h + A[3], where 32 = 8 * 4

With h = 33, A[4] = 36.

Listing 6: addstore.s

    .include "io.s"
    .text
    .global _start
    _start:
            adr x22, A              // get the address of A
            mov x21, #33            // h = 33
            ldur x9, [x22, #24]     // X9 = A[3]
            add x9, x21, x9         // X9 = h + A[3]
            stur x9, [x22, #32]     // A[4] = h + A[3]
            ldur x1, [x22, #32]     // X1 = A[4]
            bl print1d
            exit

    .data
    A:
            .dword 0, 1, 2, 3, 4

    [alarm@alarm ~]$ as -o addstore.o addstore.s
    [alarm@alarm ~]$ ld -o addstore addstore.o -lc
    [alarm@alarm ~]$ ./addstore
    36
    [alarm@alarm ~]$

In aarch64, the LEGv8 instruction ADDI is written as ADD:
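    ADDI X0, X0, #123       (LEGv8)
    ADD X0, X0, #123        (aarch64)

3.7 (textbook Section 2.4)

3.8 (textbook Section 2.5)

A LEGv8 instruction is 32 bits (4 bytes) long, as in aarch64.

3.9 (textbook Section 2.6)

LEGv8 has no NOT instruction. To keep the 3-operand format, NOT is obtained by XOR (EOR) with an operand whose bits are all 1. aarch64 provides MVN for this.

3.10 (textbook Section 2.7)

Conditional branches: CBZ, CBNZ

    CBZ register, L1        branch to L1 if register is 0
    CBNZ register, L1       branch to L1 if register is not 0

B is the unconditional branch.

Example: if-then-else

    if (i == j) f = g + h; else f = g - h;

f, g, h, i, j are in X19, ..., X23.

            SUB X9, X22, X23        // X9 = i - j
            CBNZ X9, Else           // Go to Else if i != j (X9 != 0)
            ADD X19, X20, X21       // f = g + h (skipped if i != j)
            B Exit                  // Go to Exit
    Else:   SUB X19, X20, X21       // f = g - h (skipped if i == j)
    Exit:

Listing 7: if.s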
    .include "io.s"
    .text
    .global _start
    _start:
            mov x20, #5             // g = 5
            mov x21, #4             // h = 4
            mov x22, #3             // i = 3
            mov x23, #2             // j = 2
            sub x9, x22, x23        // X9 = i - j
            cbnz x9, Else           // go to Else if i != j
            add x19, x20, x21       // f = g + h
            b Exit
    Else:
            sub x19, x20, x21       // f = g - h
    Exit:
            mov x1, x19             // X1 = f for print
            bl print1d
            exit

Here i and j differ, so the program prints g - h = 1:

    [alarm@alarm ~]$ as -o if.o if.s
    [alarm@alarm ~]$ ld -o if if.o -lc
    [alarm@alarm ~]$ ./if
    1
    [alarm@alarm ~]$

Setting X22 (i) to 2 makes i equal to j, and the program prints g + h = 9:

    [alarm@alarm ~]$ as -o if.o if.s
    [alarm@alarm ~]$ ld -o if if.o -lc
    [alarm@alarm ~]$ ./if
    9
    [alarm@alarm ~]$

Example: while loop

    while (save[i] == k) i += 1;

save is an array and k is a value; i and k are in X22 and X24, and the base address of save is in X25.

    Loop:   LSL X10, X22, #3        // Temp reg X10 = i << 3 = i * 8
            ADD X10, X10, X25       // X10 = address of save[i]
            LDUR X9, [X10, #0]      // Temp reg X9 = save[i]
            SUB X11, X9, X24        // Temp reg X11 = save[i] - k
            CBNZ X11, Exit          // goto Exit if save[i] != k
            ADDI X22, X22, #1       // i += 1;
            B Loop                  // goto Loop
    Exit:
With k = 5 and save[] = {5, 5, 5, 5, 1, 5, 5, 5, 5}:

Listing 8: loop.s

    .include "io.s"
    .text
    .global _start
    _start:
            adr x25, save
            mov x22, #0             // i = 0
            mov x24, #5             // k = 5
    Loop:
            lsl x10, x22, #3
            add x10, x10, x25
            ldur x9, [x10, #0]
            sub x11, x9, x24
            cbnz x11, Exit
            add x22, x22, #1
            b Loop
    Exit:
            mov x1, x22
            bl print1d
            exit

    .data
    save:
            .dword 5, 5, 5, 5, 1, 5, 5, 5, 5

The program prints 4:

    [alarm@alarm ~]$ as -o loop.o loop.s
    [alarm@alarm ~]$ ld -o loop loop.o -lc
    [alarm@alarm ~]$ ./loop
    4
    [alarm@alarm ~]$

ADDS, ADDIS, ANDS, ANDIS, SUBS, and SUBIS set the condition flags; the trailing S means set flags. Table 3 lists the conditional branches B.?? that test the flags.

Table 3: Conditional branches

    Comparison   Signed   Unsigned
    =            B.EQ     B.EQ
    ≠            B.NE     B.NE
    <            B.LT     B.LO (LOwer)
    ≤            B.LE     B.LS (Lower or Same)
    >            B.GT     B.HI (HIgher)
    ≥            B.GE     B.HS (Higher or Same)
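3.11 (textbook Section 2.8)

Executing a procedure takes six steps (*19):

1. Put the parameters in a place where the procedure can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5. Put the result value in a place where the calling program can access it.
6. Return control to the point of origin.

X0 to X7: registers for parameters and return values.
LR (X30): the link register, which holds the return address.

BL (branch with link) (*20) calls a procedure:

    BL ProcedureAddress

It branches to ProcedureAddress and saves the return address in LR.

BR (branch register) returns from a procedure:

    BR LR

Registers are saved on and restored from the stack with push and pop. In LEGv8 the stack pointer SP is X28 (*21).

Example: a leaf procedure in C

    long long int leaf_example (long long int g, long long int h,
                                long long int i, long long int j)
    {
        long long int f;
        f = (g + h) - (i + j);
        return f;
    }

*19
*20 branch-and-link instruction ARM
*21 In aarch64, SP shares register number 31 (X31) with XZR; the instruction determines whether it means XZR or SP.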
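g, h, i, j are in X0, X1, X2, X3, and f is in X19.

    leaf_example:

f uses X19, and g + h and i + j use the two temporaries X9 and X10, so these 3 registers are pushed first:

    SUBI SP, SP, #24        // adjust stack to make room for 3 items
    STUR X10, [SP, #16]     // push X10
    STUR X9, [SP, #8]       // push X9
    STUR X19, [SP, #0]      // push X19

3 registers of 8 bytes each take 24 bytes. The body is:

    ADD X9, X0, X1          // X9 = g + h
    ADD X10, X2, X3         // X10 = i + j
    SUB X19, X9, X10        // f = X9 - X10, which is (g + h) - (i + j)

The return value goes to X0:

    ADD X0, X19, XZR        // returns f (X0 = X19 + 0)

X9, X10, X19 are then popped and control returns to the caller:

    LDUR X19, [SP, #0]
    LDUR X9, [SP, #8]
    LDUR X10, [SP, #16]
    ADDI SP, SP, #24
    BR LR

X9 to X17: temporary registers, not preserved by the callee.
X19 to X28: saved registers, which the callee must preserve (save and restore if used).

To turn the LEGv8 code into aarch64, write ADDI and SUBI as ADD and SUB, and LR as X30. SP must also be kept 16-byte aligned (*22).

Listing 9: leaf1.s

    .include "io.s"

*22 Programmer's Guide 5-6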
    .text

    .global leaf_example
    leaf_example:
            sub sp, sp, #48
            stur x10, [sp, #32]
            stur x9, [sp, #16]
            stur x19, [sp, #0]
            add x9, x0, x1
            add x10, x2, x3
            sub x19, x9, x10
            add x0, x19, xzr
            ldur x19, [sp, #0]
            ldur x9, [sp, #16]
            ldur x10, [sp, #32]
            add sp, sp, #48
            br x30

    .global _start
    _start:
            mov x0, #5              // g = 5
            mov x1, #4              // h = 4
            mov x2, #3              // i = 3
            mov x3, #2              // j = 2
            bl leaf_example
            mov x1, x0
            bl print1d
            exit

In the first four instructions of leaf_example,

            sub sp, sp, #48
            stur x10, [sp, #32]
            stur x9, [sp, #16]
            stur x19, [sp, #0]

48 bytes are reserved instead of 24, which keeps SP 16-byte aligned.

    [alarm@alarm src]$ as -o leaf1.o leaf1.s
    [alarm@alarm src]$ ld -o leaf1 leaf1.o -lc
    [alarm@alarm src]$ ./leaf1
    4
    [alarm@alarm src]$

Instead of LD(U)R and ST(U)R, LDP and STP can push and pop 2 registers at a time, as in leaf2.s.

Listing 10: leaf2.s

    .include "io.s"
    .text

    .global leaf_example
    leaf_example:
            stp x10, x9, [sp, #-16]!
            stp x19, xzr, [sp, #-16]!
            add x9, x0, x1
            add x10, x2, x3
            sub x19, x9, x10
            add x0, x19, xzr
            ldp x19, xzr, [sp], #16
            ldp x10, x9, [sp], #16
            ret

    .global _start
    _start:
            mov x0, #5              // g = 5
            mov x1, #4              // h = 4
            mov x2, #3              // i = 3
            mov x3, #2              // j = 2
            bl leaf_example
            mov x1, x0
            bl print1d
            exit

            stp x10, x9, [sp, #-16]!

pushes X10 and X9: the ! makes the instruction first subtract 16 from SP and then store the pair there (*23). There are 3 registers to push, so X19 is paired with XZR:

            stp x19, xzr, [sp, #-16]!

The matching pops add 16 to SP after loading:

            ldp x19, xzr, [sp], #16
            ldp x10, x9, [sp], #16

    [alarm@alarm src]$ as -o leaf2.o leaf2.s
    [alarm@alarm src]$ ld -o leaf2 leaf2.o -lc
    [alarm@alarm src]$ ./leaf2
    4
    [alarm@alarm src]$

bl stores the return address in x30, and ret returns to the address in LR (x30).

*23 Programmer's Guide 6-9
Example: a recursive procedure

    long long int fact (long long int n)
    {
        if (n < 1) return (1);
        else return (n * fact(n - 1));
    }

n is in X0.

    fact:
            SUBI SP, SP, #16        // adjust stack for 2 items
            STUR LR, [SP, #8]       // save the return address
            STUR X0, [SP, #0]       // save the argument n
            SUBIS XZR, X0, #1       // test for n < 1
            B.GE L1                 // if n >= 1 go to L1
            ADDI X1, XZR, #1        // return 1
            ADDI SP, SP, #16        // pop 2 items off stack
            BR LR                   // return to caller
    L1:     SUBI X0, X0, #1         // n >= 1: argument gets (n - 1)
            BL fact                 // call fact with (n - 1)
            LDUR X0, [SP, #0]       // return from BL: restore argument n
            LDUR LR, [SP, #8]       // restore the return address
            ADDI SP, SP, #16        // adjust stack pointer to pop 2 items
            MUL X1, X0, X1          // return n * fact(n - 1)
            BR LR                   // return to the caller

Listing 11: fact.s

    .include "io.s"
    .text

    .global fact
    fact:
            stp x30, x0, [sp, #-16]!
            subs xzr, x0, #1
            b.ge L1
            mov x1, #1
            ldp x30, x0, [sp], #16
            ret
    L1:
            sub x0, x0, #1
            bl fact
            ldp x30, x0, [sp], #16
            mul x1, x0, x1
            ret

    .global _start
    _start:
            mov x0, #10             // n = 10
            bl fact
            bl print1d
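            exit

Compiling the C code with gcc -O3 -S gives:

Listing 12: fact.s

            :
    fact:
            cmp x0, 0
            mov x1, 1
            ble .L1
            .p2align 2
    .L3:
            mul x1, x1, x0
            subs x0, x0, #1
            bne .L3
    .L1:
            mov x0, x1
            ret
            :

gcc has turned the recursion into a loop.

Table 4 lists which registers are preserved across a procedure call.

Table 4: Preserved / not preserved

    Preserved                       Not preserved
    Saved registers: X19 to X27     Temporary registers: X9 to X15
    Stack pointer: X28 (SP)         Argument/result registers: X0 to X7
    Frame pointer: X29 (FP)
    Link register: X30 (LR)

Elaboration: (omitted)

    long long int fact_iter(long long int n, long long int acc)
    {
        if (n > 0) return fact_iter(n - 1, acc * n);
        else return acc;
    }

    long long int fact(long long int n)
    {
        return fact_iter(n, 1);
    }

Example: fact_iter(10, 1);
Note: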
gcc -O3 -S

3.12 (textbook Section 2.9)

3.13 (textbook Section 2.10)

LEGv8 32 64

3.14 : (textbook Section 2.11)

3.15 (textbook Section 2.12)

3.16 C (textbook Section 2.13) (*24)

Example: swap

    void swap(long long int v[], size_t k)
    {
        long long int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

X0 holds v, X1 holds k, and temp is in X9.

    swap:
            lsl x10, x1, #3         // X10 = k * 8
            add x10, x0, x10        // X10 = v + (k * 8)
                                    // X10 has the address of v[k]
            ldur x9, [x10, #0]      // X9 (temp) = v[k]
            ldur x11, [x10, #8]     // X11 = v[k + 1]
                                    // refers to next element of v
            stur x11, [x10, #0]     // v[k] = X11
            stur x9, [x10, #8]      // v[k + 1] = X9 (temp)
            ret                     // return to calling routine

Example: sort

    void sort (long long int v[], size_t n)
    {
        size_t i, j;

*24
        for (i = 0; i < n; i += 1) {
            for (j = i-1; j >= 0 && v[j] > v[j+1]; j -= 1) {
                swap(v, j);
            }
        }
    }

The outer loop:

            mov x19, xzr            // i = 0
    for1tst:
            cmp x19, x1             // compare X19 to X1 (i to n)
            b.ge exit1              // go to exit1 if X19 >= X1 (i >= n)
            :
            (body of the first for loop)
            :
            add x19, x19, #1        // i += 1
            b for1tst               // branch to test of outer loop
    exit1:

The inner loop:

            sub x20, x19, #1        // j = i - 1
    for2tst:
            cmp x20, xzr            // compare X20 to 0 (j to 0)
            b.lt exit2              // go to exit2 if X20 < 0 (j < 0)
            lsl x10, x20, #3        // X10 = j * 8
            add x11, x0, x10        // X11 = v + (j * 8)
            ldur x12, [x11, #0]     // X12 = v[j]
            ldur x13, [x11, #8]     // X13 = v[j + 1]
            cmp x12, x13
            b.le exit2              // go to exit2 if X12 <= X13 (v[j] <= v[j + 1])
            :
            (body of the second for loop)
            :
            sub x20, x20, #1        // j -= 1
            b for2tst
    exit2:

The call swap(v, j) is:

            bl swap

Both swap and sort take their parameters in X0 and X1. sort therefore copies its own X0 and X1 to X21 and X22 before setting X0 and X1 for swap. sort also has to save on the stack (SP) the registers it uses: LR and X19 ... X21 (*25).

*25 2.28 LDUR STUR
Listing 13: sort.c (*26)

    #include<stdio.h>

    void swap(long long int v[], size_t k)
    {
        long long int temp;
        temp = v[k];
        v[k] = v[k-1];
        v[k-1] = temp;
    }

    void sort(long long int v[], size_t n)
    {
        size_t i, j;
        for(i = 0; i < n; i+=1){
            for(j = 1; j < n-i; j+=1){
                if (v[j] < v[j-1]){
                    swap(v, j);
                }
            }
        }
    }

    int main(void)
    {
        int i;
        long long int v[] = {1, 3, 5, 7, 9, 8, 6, 4, 2, 0};
        size_t n = 10;
        sort(v, n);
        for(i = 0; i < n; i++){
            printf("%lld\n", v[i]);   /* %lld: v[i] is long long int */
        }
        return 0;
    }

Listing 14: sort.s

    .include "io.s"
    .text

    swap:
            lsl x10, x1, #3
            add x10, x0, x10

            ldur x9, [x10, #0]
            ldur x11, [x10, #-8]

*26 swap for swap i v n

            stur x11, [x10, #0]
            stur x9, [x10, #-8]
            ret

    .global sort
    sort:
            stp xzr, x30, [sp, #-16]!
            stp x22, x21, [sp, #-16]!
            stp x20, x19, [sp, #-16]!

            mov x21, x0             // preserve x0 to x21
            mov x22, x1             // preserve x1 to x22

            // outer loop
            mov x19, xzr
    for1tst:
            cmp x19, x22
            b.ge exit1
            // : inner loop
            mov x20, #1
    for2tst:
            sub x9, x22, x19        // X9 = n - i
            cmp x20, x9             // j < n-i?
            b.ge exit2
            // : body of the inner loop
    iftst:
            lsl x10, x20, #3
            add x11, x21, x10
            ldur x12, [x11, #0]
            ldur x13, [x11, #-8]
            cmp x12, x13            // v[j] >= v[j-1]?
            b.ge exitif
            mov x0, x21             // first swap parameter is v
            mov x1, x20             // second swap parameter is j
            bl swap
    exitif:
            // :
            add x20, x20, #1
            b for2tst
    exit2:
            // :
            add x19, x19, #1
            b for1tst
    exit1:
            ldp x20, x19, [sp], 16
            ldp x22, x21, [sp], 16
            ldp xzr, x30, [sp], 16
            ret

    .global _start
    _start:
            adr x0, v
            adr x20, k
            ldur x1, [x20, #0]
            bl sort

            // print sorted numbers
            adr x19, v
            mov x20, xzr
            adr x21, k
            ldur x21, [x21, #0]
            lsl x21, x21, #3
    for3tst:
            cmp x20, x21
            b.ge exit3
            add x22, x19, x20
            ldur x1, [x22, #0]
            bl print1d
            add x20, x20, #8
            b for3tst
    exit3:
            exit

    .data
    v:      .dword 1, 3, 5, 7, 9, 8, 6, 4, 2, 0
    k:      .dword 10

3.17

3.18 (textbook Chapter 3)

3.19 Introduction (textbook Section 3.1)

3.20 (textbook Section 3.2)

2 ALU

3.21 (textbook Section 3.3)

The multiply instructions are MUL, SMULH (signed multiply high), and UMULH (unsigned multiply high). The product of two 64-bit values has 128 bits: MUL gives its lower 64 bits, and SMULH and UMULH give its upper 64 bits (Table 5).

3.22 (textbook Section 3.4)

There are 2 divide instructions, SDIV (signed) and UDIV (unsigned) (Table 5).
3.23 (textbook Section 3.5)

Floating point follows IEEE754. There are 32 floating-point registers (Table 6), and the instruction names begin with F (Table 7).

Table 5: LEGv8 multiply and divide instructions (3.12)

    Example             Meaning          Comments
    MUL X1, X2, X3      X1 = X2 × X3     lower 64 bits of the 128-bit product
    SMULH X1, X2, X3    X1 = X2 × X3     upper 64 bits of the 128-bit signed product
    UMULH X1, X2, X3    X1 = X2 × X3     upper 64 bits of the 128-bit unsigned product
    SDIV X1, X2, X3     X1 = X2 / X3     signed
    UDIV X1, X2, X3     X1 = X2 / X3     unsigned

Table 6: LEGv8 floating-point operands (p221)

    Name                         Example                              Comments
    32 floating-point registers  S0 ... S31, D0 ... D31               S? is the lower 32 bits of D?
    2^62 memory words            Memory[0], Memory[4], ...,           LEGv8 uses byte addresses, so sequential
                                 Memory[4,611,686,018,427,387,904]    doubleword addresses differ by 8

Table 7: LEGv8 floating-point instructions (p221)

    Instruction           Example                   Meaning                     Comments
    FP add single         FADDS S2, S4, S6          S2 = S4 + S6                FP add (single precision)
    FP subtract single    FSUBS S2, S4, S6          S2 = S4 - S6                FP sub (single precision)
    FP multiply single    FMULS S2, S4, S6          S2 = S4 × S6                FP multiply (single precision)
    FP divide single      FDIVS S2, S4, S6          S2 = S4 / S6                FP divide (single precision)
    FP add double         FADDD D2, D4, D6          D2 = D4 + D6                FP add (double precision)
    FP subtract double    FSUBD D2, D4, D6          D2 = D4 - D6                FP sub (double precision)
    FP multiply double    FMULD D2, D4, D6          D2 = D4 × D6                FP multiply (double precision)
    FP divide double      FDIVD D2, D4, D6          D2 = D4 / D6                FP divide (double precision)
    FP compare single     FCMPS S4, S6              Test S4 vs S6               FP compare single precision
    FP compare double     FCMPD D4, D6              Test D4 vs D6               FP compare double precision
    load single FP        LDURS S1, [X23, #100]     S1 = Memory[X23 + 100]      32-bit data to FP register
    load double FP        LDURD D1, [X23, #100]     D1 = Memory[X23 + 100]      64-bit data to FP register
    store single FP       STURS S1, [X23, #100]     Memory[X23 + 100] = S1      32-bit data to memory
    store double FP       STURD D1, [X23, #100]     Memory[X23 + 100] = D1      64-bit data to memory

Example: floating-point C code

    float f2c (float fahr)
    {
        return ((5.0/9.0) * (fahr - 32.0));
    }

fahr is passed in S12 and the result is returned in S0.

Listing 15: f2c.s

    .include "io.s"
    .text

    f2c:
            adr x27, const5
            ldr s16, [x27]          // S16 = 5.0
            adr x27, const9
            ldr s18, [x27]          // S18 = 9.0
            fdiv s16, s16, s18      // S16 = 5.0 / 9.0
            adr x27, const32
            ldr s18, [x27]          // S18 = 32.0
            fsub s18, s12, s18      // S18 = fahr - 32.0
            fmul s0, s16, s18       // S0 = (5.0/9.0) * (fahr - 32.0)
            ret

    .global _start
    _start:
            adr x27, input
            ldr s12, [x27]          // S12 = 86.0
            bl f2c
            bl print0fs
            exit

    .data
    input:
            .float 86.0
    const5:
            .float 5.0
    const9:
            .float 9.0
    const32:
            .float 32.0

In aarch64, both fadds and faddd are written fadd, and loads and stores use ldr and str (Programmer's Guide 6-13). The constants are loaded from memory here instead of being set as immediates with mov. io.s provides print0f and print0fs, which print D0 and S0 respectively.

Example: 2-dimensional arrays

For matrices A, B, C, compute C = C + A × B (DGEMM) with 32 × 32 matrices.

    void mm (double c[][32], double a[][32], double b[][32])
    {
        size_t i, j, k;
        for(i = 0; i < 32; i++) {
            for(j = 0; j < 32; j++){
                for(k = 0; k < 32; k++){
                    c[i][j] = c[i][j] + a[i][k]*b[k][j];
                }
            }
        }
    }

X1, X2, X3 X19, X20, X21 (*27)

TODO:
TY:

*27
3.24 (textbook Section 3.6)

SIMD

TODO:
A io.s

Listing 16: io.s

    // Learned and modified from:
    // https://stackoverflow.com/questions/39845288/cant-print-sum-in-armv8-assembly
    .text

    // macro
    // exit
    .macro exit
    .exit\@:
            mov x8, #93             // exit see /usr/include/asm-generic/unistd.h
            svc #0
    .endm

    .text
    .global print1d
    .type print1d, %function
    print1d:
            stp x29, x30, [sp, #-16]!
            adr x0, print1d_fmt
            bl printf
            ldp x29, x30, [sp], #16
            ret
    .data
    print1d_fmt: .string "%ld\n"    // %ld: x1 holds a 64-bit value

    .text
    .global print0f
    .type print0f, %function
    print0f:
            stp x29, x30, [sp, #-16]!
            adr x0, print0f_fmt
            bl printf
            ldp x29, x30, [sp], #16
            ret
    .global print0fs
    .type print0fs, %function
    print0fs:
            stp x29, x30, [sp, #-16]!
            adr x0, print0f_fmt
            fcvt d0, s0             // widen S0 to double for %f
            bl printf
            ldp x29, x30, [sp], #16
            ret
    .data
    print0f_fmt: .string "%f\n"

    .text
    .global print0s
    .type print0s, %function
    print0s:
            stp x29, x30, [sp, #-16]!
            bl puts
            ldp x29, x30, [sp], #16
            ret