[SPO600] Optimizing Python for AArch64 1 – Project Selection

This posting is about the selection and first stage of SPO600 Project. I Chose Python.

Why Python?

My first profiling target was php. For the php, it was hard to find the right way for profiling. For example, gprof profiling was not easy. I tried to find typical way for performance profiling, but it was hard to find. Actually I couldn’t find. In the mailing list, I couldn’t get answer.

Python is widely used hi-level language that are used for both client and server. And the language’s intention, highly readable language was attractive. It is called as one of the easiest programming language to learn. I wanted to look at how this language was designed.

In terms of contributor support, python has tons of resource from this site. Also, for the performance benchmarking, it has The Grand Unified Python Benchmark Suite, which is a collection of benchmarks for all Python implementations. Analysing this suit itself will be helpful to start benchmarking. Using this Suite and gprof will be helpful.

gprof profiling

To do gprof profiling, configuration with profiling option is needed.

./configure  --disable-shared --enable-profiling
make

This option automatically put -pg option to Makefile. At the end of the make it produces gmon.out. I guess it performs a test. Following is the generated image from the test profiing.

pythonprof

The Grand Unified Python Benchmark Suite

For standard profiling, what I found was “The Grand Unified Python Benchmark Suite”. I didn’t fully understand how to use it. I tried to run one of the the script using python executable that is already installed. Performance output is :

benchmarks-1bd1437ea49b/performance/pybench$ python pybench.py
-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using CPython 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time

Calibrating tests. Please wait...
done.

Running 10 round(s) of the suite at warp factor 10:

* Round 1 done in 2.390 seconds.
* Round 2 done in 2.340 seconds.
* Round 3 done in 2.280 seconds.
* Round 4 done in 2.317 seconds.
* Round 5 done in 2.277 seconds.
* Round 6 done in 2.300 seconds.
* Round 7 done in 2.436 seconds.
* Round 8 done in 2.364 seconds.
* Round 9 done in 2.355 seconds.
* Round 10 done in 2.702 seconds.

-------------------------------------------------------------------------------
Benchmark: 2015-03-17 00:38:58
-------------------------------------------------------------------------------

    Rounds: 10
    Warp:   10
    Timer:  time.time

    Machine Details:
       Platform ID:    Linux-3.13.0-46-generic-x86_64-with-Ubuntu-14.04-trusty
       Processor:      x86_64

    Python:
       Implementation: CPython
       Executable:     /usr/bin/python
       Version:        2.7.6
       Compiler:       GCC 4.8.2
       Bits:           64bit
       Build:          Mar 22 2014 22:59:56 (#default)
       Unicode:        UCS4


Test                             minimum  average  operation  overhead
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:     47ms     50ms    0.10us    0.066ms
           BuiltinMethodLookup:     33ms     35ms    0.03us    0.077ms
                 CompareFloats:     44ms     47ms    0.04us    0.088ms
         CompareFloatsIntegers:     56ms     58ms    0.06us    0.067ms
               CompareIntegers:     44ms     47ms    0.03us    0.133ms
        CompareInternedStrings:     34ms     35ms    0.02us    0.335ms
                  CompareLongs:     34ms     35ms    0.03us    0.077ms
                CompareStrings:     31ms     32ms    0.03us    0.224ms
                CompareUnicode:     40ms     41ms    0.05us    0.170ms
    ComplexPythonFunctionCalls:     42ms     43ms    0.22us    0.111ms
                 ConcatStrings:     47ms     53ms    0.11us    0.133ms
                 ConcatUnicode:     43ms     50ms    0.17us    0.093ms
               CreateInstances:     39ms     44ms    0.40us    0.094ms
            CreateNewInstances:     31ms     34ms    0.40us    0.091ms
       CreateStringsWithConcat:     42ms     44ms    0.04us    0.221ms
       CreateUnicodeWithConcat:     29ms     30ms    0.08us    0.088ms
                  DictCreation:     32ms     34ms    0.08us    0.088ms
             DictWithFloatKeys:     43ms     45ms    0.05us    0.167ms
           DictWithIntegerKeys:     43ms     45ms    0.04us    0.223ms
            DictWithStringKeys:     35ms     39ms    0.03us    0.221ms
                      ForLoops:     24ms     28ms    1.11us    0.019ms
                    IfThenElse:     32ms     33ms    0.02us    0.168ms
                   ListSlicing:     46ms     47ms    3.33us    0.025ms
                NestedForLoops:     34ms     36ms    0.02us    0.010ms
      NestedListComprehensions:     39ms     40ms    3.33us    0.023ms
          NormalClassAttribute:     37ms     41ms    0.03us    0.113ms
       NormalInstanceAttribute:     28ms     31ms    0.03us    0.111ms
           PythonFunctionCalls:     37ms     38ms    0.11us    0.066ms
             PythonMethodCalls:     40ms     42ms    0.19us    0.035ms
                     Recursion:     49ms     50ms    1.00us    0.113ms
                  SecondImport:     44ms     45ms    0.45us    0.044ms
           SecondPackageImport:     40ms     42ms    0.42us    0.044ms
         SecondSubmoduleImport:     55ms     57ms    0.57us    0.044ms
       SimpleComplexArithmetic:     35ms     36ms    0.04us    0.088ms
        SimpleDictManipulation:     41ms     42ms    0.04us    0.120ms
         SimpleFloatArithmetic:     41ms     42ms    0.03us    0.135ms
      SimpleIntFloatArithmetic:     34ms     34ms    0.03us    0.149ms
       SimpleIntegerArithmetic:     34ms     34ms    0.03us    0.132ms
      SimpleListComprehensions:     34ms     34ms    2.84us    0.023ms
        SimpleListManipulation:     30ms     31ms    0.03us    0.144ms
          SimpleLongArithmetic:     29ms     30ms    0.05us    0.074ms
                    SmallLists:     37ms     38ms    0.06us    0.088ms
                   SmallTuples:     37ms     38ms    0.07us    0.100ms
         SpecialClassAttribute:     38ms     39ms    0.03us    0.110ms
      SpecialInstanceAttribute:     60ms     61ms    0.05us    0.113ms
                StringMappings:     52ms     56ms    0.22us    0.111ms
              StringPredicates:     46ms     50ms    0.07us    0.648ms
                 StringSlicing:     23ms     25ms    0.04us    0.188ms
                     TryExcept:     28ms     31ms    0.01us    0.166ms
                    TryFinally:     31ms     34ms    0.21us    0.089ms
                TryRaiseExcept:     28ms     31ms    0.48us    0.088ms
                  TupleSlicing:     41ms     45ms    0.17us    0.016ms
               UnicodeMappings:     31ms     33ms    0.92us    0.143ms
             UnicodePredicates:     34ms     37ms    0.07us    0.781ms
             UnicodeProperties:     43ms     47ms    0.12us    0.651ms
                UnicodeSlicing:     35ms     41ms    0.08us    0.166ms
                   WithFinally:     48ms     53ms    0.33us    0.090ms
               WithRaiseExcept:     60ms     65ms    0.81us    0.111ms
-------------------------------------------------------------------------------
Totals:                           2245ms   2376ms

This output seems to be usefull. However, this benchmark directory(pybench) is said that “unreliable, unrepresentative benchmark” by README.txt.

I tried to use the python executable I built with profiling option. However, this pybench.py script didn’t work with the latest version of python. I tried to fix the script, but, there were to many syntax errors. I guess many syntax that used in the script was deprecated. Python executable could have option for use of deprecated syntax.

The difference of version is :

$ python --version
Python 2.7.6
$ /home/hosung/python/cpython/python --version
Python 3.5.0a2+

Conclusion

So far, I chose the project: Python, built it, and built with profiling. I didn’t decide which part I am going to optimize. next step will be :

  • Figuring out which version of source I need to use
  • Figuring out Which profiling script I need to use
  • Getting profiling result from AArch64 machine.
  • Picking up the part I am going to optimize.

As a profiling tool, gprof seems to be appropriate because it is built in on the configuration and I could get easily the output.

Advertisements

[SPO600] Assembler Lab 2 – Aarch64

In the previous posting, I wrote a simple assembly language for x86 machine. In this posting, I will write ARM(Aarch64) assembly language to meet the same requirement.

ARM Assembly Language Summary

This summary is my understanding of aarch64 Register and Instruction Quick Start.

Registers

  • r0 ~ r30 : to refer generally to the registers
  • x0 ~ x30 : 64-bit-wide access, w0 ~ w30 : 32-bit-wide access
  • r31 : rsp when stack operation, otherwise rzr(zero register)
    rzr returns 0 when read and discards data when written

syscall/function call

  • r0 ~ r7 : arguments and return values
  • r8 : syscall number
  • r9 ~ r15 : temporary values
  • r16 ~ r18 : used for intra-procedure-call and platform values (avoid)
  • r19 ~ r28 : called routine preserve
  • r29, r30 : used for the frame register and link register (avoid)

Instructions

  • add r0,r1,r2 : r0 = r1 + r2
  • add r0,r1,99 : r0 = r1 + 99
  • adr r0,label : r0 = address label : move the address of the message. 64bit absolute address -> 32bit relative address calculated by pc + offset
  • beq label : if equal branch label
  • bne label : if not equal branch label
  • blt label : if less branch label
  • bgt label : if grater branch label
  • cmp r0,r1 : compares r0 and r1 and set flags
  • cmp r0,99 : compares r0 and 99 and set flags
  • ldr r0,[r1,0] : r1 + (0 * size) = r0 , size : 8bytes(64bit), 4bytes(32bit)
    basically calculate the offset from r1 by pointer size
  • ldr w0,[r1,0] : same as above, read 32 bits only
  • ldrb w0,[r1,0] : same as above, read 8 bits only
  • ldur r0,[r1,0] : (r1 + 0) = r0, (u : unscaled)
  • mov r0,r1 : r0 = r1
  • mov r0,99 : r0 = 99
  • str r0,[r1,0] : *(r1 + (0 * size)) = r0, size : 8bytes(64bit), 4bytes(32bit)
  • strb w0,[r1,0] : same as above, read 8 bits only
  • stur r0,[r1,0] : *(r1 + 0) = r0, (u : unscaled)
  • svc 0 : syscall
  • msub r0,r1,r2,r3 : r0 = r3 – (r1 * r2), useful for calculating remainders
  • madd r0,r1,r2,r3 : r0 = r3 + (r1 * r2)
  • mul r0,r1,r2 : r0 = r1 * r2
  • push r0 : push r0
  • pop r0 : pop r0
  • udiv r0,r1,r2 : r0 = r1 / r2, no remainder, use msub

Aarch64 assembly coding

Requirement

Print following output using loop

Loop: 00
Loop: 01
Loop: 02
Loop: 03
 ~
Loop: 29
Loop: 30

Full source code

.text
.globl _start
    start = 0           /* starting value for the loop */
    max = 31            /* loop exits */
    num = 48            /* 48 : 0 in ASCII */

_start:
    mov     x3,0        /* x3 : counter */

loop:
    /* put number */
    mov     x5,10
    udiv    x7,x3,x5    /* x4 = x3 / 10 */
    msub    x6,x5,x7,x3 /* x6 = x3 - (x5 * x4) */

    adr     x1,msg      /* message location (memory address) */
    add     x7,x7,num   /* atoi */
    add     x6,x6,num   /* atoi */
    strb    w7,[x1,6]   /* msg + 6byte = 10 */
    strb    w6,[x1,7]   /* msg + 8byte = 1 */

    /* put \n */
    mov     w7,10       /* ascii 10 is new line */
    strb    w7,[x1,8]   /* msg + 8 = '\n' */

    mov     x0, 1       /* file descriptor: 1 is stdout */
    mov     x2, len     /* message length (bytes) */

    mov     x8, 64      /* write is syscall #64 */
    svc     0           /* invoke syscall */

    add     x3,x3,1     /* increment index */
    cmp     x3,max      /* compare if it is done */
    bne     loop        /* loop if max != r8 */

    mov     x0, 0       /* status -> 0 */
    mov     x8, 93      /* exit is syscall #93 */
    svc     0           /* invoke syscall */

.data
msg:    .ascii      "Loop:    "
len=    . - msg

Explanation

Most of the lines except instructions are the same as x86 version.

    mov     x5,10
    udiv    x7,x3,x5    /* x4 = x3 / 10 */
    msub    x6,x5,x7,x3 /* x6 = x3 - (x5 * x4) */

These lines calculate quotient and remainder. The differences with x86 version is that it needs additional instruction to get remainder : msub. Whereas, in x86 div instruction stores quotient in %rax and remainder in %rdx.

    adr     x1,msg      /* message location (memory address) */
    add     x7,x7,num   /* itoa */
    add     x6,x6,num   /* itoa */
    strb    w7,[x1,6]   /* msg + 6byte = 10 */
    strb    w6,[x1,7]   /* msg + 8byte = 1 */

These lines change each digits to ascii characters by adding 48. And put them in the string buffer [6] and [7]. I copied only 8 bytes by using strb instruction.

    mov     w7,10       /* ascii 10 is new line */
    strb    w7,[x1,8]   /* msg + 8 = '\n' */

In this line I reused w7 for new line character. And copied it [8] in the buffer.

    mov     x0, 1       /* file descriptor: 1 is stdout */
    mov     x2, len     /* message length (bytes) */

    mov     x8, 64      /* write is syscall #64 */
    svc     0           /* invoke syscall */

In ARM assembly, writing is system call 64, and system call is svc 0 instead of syscall

Conclusion

  • ARM assembly syntax is easier to write because of similar argument order with c syntax.
  • In terms of udiv and msub instruction, the fact that I doesn’t have to think about loading a value to a particular register is convenient. Also, the fact that the return value itself is a part of instruction is easier to follow the flow.
  • In this case, ARM assembly code line was 3 lines shorter than x86 assembly code.

[SPO600] Assembler Lab 1 – x86

This posting deals with x86 assembly language summary and explanation of simple assembly language code that prints from 1 to 30. Summary of x86 assembly language relies on X86 64 Register and Instruction Quick Start page and some internet sources.

x86 Assembly Language Summary

Registers

  • rax, rbx, rcx, rdx : register extended
  • rbp : base pointer
  • rsp : stack pointer
  • rsi : source index (source for data copies)
  • rdi : destination index (destination for data copies)
  • r8, r9, r10, r11, r12, r13, r14, r15 : new registers
  • r0d~r15d are the lowermost 32 bits of each register
  • r0w~r15w are the lowermost 16 bits of each register
  • r0l~r15l are the lowermost 8 bits of each register

syscall/function call

  • First six arguments are in rdi, rsi, rdx, rcx, r8d, r9d; remaining arguments are on the stack
  • syscall number is in rax
  • return value is in rax

Instructions

  • add %r10,%r11 : r11 += r10
  • sub %r10,%r11 : r11 -= r10
  • cmp %r10,%r11 : r11 – r10 -> EFLAGS
  • cmp $99,%r11 : r11 – 99 -> EFLAGS
  • EFLAGS : Carry flag (CF) – bit 0 (lease significant bit)
    Overflow flag (OF) – bit 11
    Parity flag (PF) – bit 2
    Sign flag (SF) – bit 7
    Zero flag (ZF) – bit 6
  • div %r10 : rax = rax / r10, rdx = remainder (rdx must be 0)
  • inc %r10 : r10++
  • jmp label : goto label
  • je label : if equal(ZF==1) goto label
  • jne label : if not equal(ZF==0) goto label
  • jl label : if less (SF!=OF), goto label
  • jg label : if greater (ZF==0&&SF==OF), goto label
  • mov %r10,%r11 : r11 = r10
  • mov $99,%r10 : r10 = 99
  • mov %r10,(%r11) : *r11 = r10
  • mov (%r10),%r11 : r11 = *r10
  • mul %r10 : rax = r10 * rax, rdx = overflow
  • push %r10 : push r10
  • pop %r10 : pop r10
  • syscall : syscall. 32-bit mode : int $0x80

x86 assembly coding

Requirement

Print following output using loop.

Loop: 00
Loop: 01
Loop: 02
Loop: 03
 ~
Loop: 29
Loop: 30

Full source code

Following is full source code

.text
.globl    _start
    start = 0                   /* starting value for the loop index */
    max = 31                    /* loop exits when the index hits this number*/
    num = 48                    /* 48 : 0 in ASCII */

_start:
    mov     $start,%r15         /* loop index : i = 0 */

loop:
    /*put number */
    mov %r15,%r11               /*r11 = r15*/

    mov $0,%rdx                 /*init rdx*/
    mov %r11,%rax               /*rax = r11, to div*/
    mov $10,%r13                /*r13 = 10*/
    div %r13                    /*rax = rax/r10*/

    mov $msg,%r12               /*r12 = msg*/
    add $num,%rax               /*bin->ascii*/
    add $num,%rdx               /*bin->ascii*/
    mov %rax,6(%r12)            /*msg + 6 = */
    mov %rdx,7(%r12)            /*msg + 7 = */

    /*put \n*/
    mov $10,%r11
    mov %r11,8(%r12)            /*msg + 8 = '\n'*/

    /*print loop :*/
    mov $len,%rdx               /* msg length */
    mov $msg,%rsi               /* msg location */

    mov $1,%rdi                 /* file descriptor stdout */
    mov $1,%rax                 /* sys_write 1 */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii "loop:    "      /* message and enough space for 2 digits and \n*/
    len = . -msg

Explanation

The very first line .text says that following code is going to .text section, which will be instruction code section. And normal starting label of the application.
start is the value for loop index; it starts from 0 and increments up to 30.
max is the value for loop index limitation.
num is 48, which is '0' in ASCII code. It is used for converting from binary number to ASCII character.

.section .data

msg:    .ascii "loop:    "      /* message and enough space for 2 digits and \n*/
    len = . -msg

At line 47, there is msg string : "loop: ". It contains "loop:" string followed by 4 spaces. The spaces are used to set 2 bytes of number and 1 byte of newline character (0x0a).

    mov %r15,%r11               /*r11 = r15*/

    mov $0,%rdx                 /*init rdx*/
    mov %r11,%rax               /*rax = r11, to div*/
    mov $10,%r13                /*r13 = 10*/
    div %r13                    /*rax = rax/r10*/

At line 12 %r15 has incremented loop index, and it is set to %r11.
Line 14 set %rdx with 0 for div instruction.
When "div %13" is executed, the index is divided by 10, as a result, %rax has the quotient and %rdx has the remainder. If the index is 12, %rax is 1 and %rdx is 2.

    mov $msg,%r12               /*r12 = msg*/
    add $num,%rax               /*bin->ascii*/
    add $num,%rdx               /*bin->ascii*/
    mov %rax,6(%r12)            /*msg + 6 = */
    mov %rdx,7(%r12)            /*msg + 7 = */

At line 20 and 21, %rax and %rdx is converted to ASCII value by adding 48. And then, at line 22 and line 23, string value 6th and 7th position is set by the ASCII codes.

    mov $10,%r11
    mov %r11,8(%r12)            /*msg + 8 = '\n'*/

Lastly, new line character(0xa) is set at the last buffer of msg string.

    mov $len,%rdx               /* msg length */
    mov $msg,%rsi               /* msg location */

    mov $1,%rdi                 /* file descriptor stdout */
    mov $1,%rax                 /* sys_write 1 */
    syscall

Since all message including digits and newline character is set in the msg buffer, the buffer pointer is set to %rsi for printing.

Rest of the code is for incrementing index and checking loop escape condition.

Conclusion

  • In writing this kind of simple loop, assembly language is not very difficult.
  • Instead of using system call 1, “call printf” can be used. In this case, the string data will be “loop: %0d\n”. And the code will be shorter since we do not need to devide digits. However, printf function itself is very heavy. I guess, in terms of speed, this code will be faster.

Refrernce

[SPO600] Executable and Linkable Format (ELF)

As a preparation for SPO600 Assembler Lab, this posting deals with structure of Unix executable format.

Executable file used in this posting is built by as(GNU assembler) and ld(GNU linker). Sample source code is :

.text
.globl  _start

_start:
    movq    $len,%rdx           /* message length */
    movq    $msg,%rsi           /* message location */
    movq    $1,%rdi             /* file descriptor stdout */
    movq    $1,%rax             /* syscall sys_write */
    syscall

    movq    $0,%rdi             /* exit status */
    movq    $60,%rax            /* syscall sys_exit */
    syscall

.section .rodata

msg:    .ascii      "Hello, world!n"
    len = . - msg

ELF Format

Executable file format in Windows is called PE (Portable Executable). Since I have experience of dealing with Microsoft PE file, I could find basic structure of Unix executable file is very similar with Windows PE format. Meanwhile, standard executable binary format for Unix-like system is ELF format, and this file contains data to be loaded into the memory by the loader.

File Layout

ELF Header

Each ELF file has only one ELF Header at the very begging of file
In my linux system, the header file for elf is in /usr/include/linux/elf.h. The struct that refer to ELF header in the file is elf32_hdr or elf64_hdr.

#define EI_NIDENT 16

typedef struct elf64_hdr {
  unsigned char e_ident[EI_NIDENT]; /* ELF "magic number" */
  Elf64_Half e_type;
  Elf64_Half e_machine;
  Elf64_Word e_version;
  Elf64_Addr e_entry; /* Entry point virtual address */
  Elf64_Off e_phoff; /* Program header table file offset */
  Elf64_Off e_shoff; /* Section header table file offset */
  Elf64_Word e_flags;
  Elf64_Half e_ehsize;
  Elf64_Half e_phentsize;
  Elf64_Half e_phnum;
  Elf64_Half e_shentsize;
  Elf64_Half e_shnum;
  Elf64_Half e_shstrndx;
} Elf64_Ehdr;

e_ident array contains magic number of ELF format. If the file contains this data, it is ELF format. The very first 32bytes of executable file looks like this :

00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0200 3e00 0100 0000 7800 4000 0000 0000 ..>.....x.@.....

The magic number is 0x7f, ‘E’, ‘L’, ‘F’. ‘readelf’ utility shows parsed ELF header:

[hshwang2@australia x86_64]$  readelf -h hello-gas
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400078
  Start of program headers:          64 (bytes into file)
  Start of section headers:          544 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         1
  Size of section headers:           64 (bytes)
  Number of section headers:         10
  Section header string table index: 7

Entry point address (0x400078) is from where the process starts executing. Start of program headers (64) is the offset of program header table and Start of section headers (544) is the offset of section header table. Because ‘Size of this header’ is 64, program header is right after the ELF header.

Program Header

Program header shows the segments used at run-time. Program header is an array of structure that describe every segment.

Following is the struct for section header in elf.h.

typedef struct elf64_phdr {
  Elf64_Word p_type;
  Elf64_Word p_flags;
  Elf64_Off p_offset; /* Segment file offset */
  Elf64_Addr p_vaddr; /* Segment virtual address */
  Elf64_Addr p_paddr; /* Segment physical address */
  Elf64_Xword p_filesz; /* Segment size in file */
  Elf64_Xword p_memsz; /* Segment size in memory */
  Elf64_Xword p_align; /* Segment alignment, file & memory */
} Elf64_Phdr;

readelf shows following output.

[hshwang2@australia x86_64]$ readelf -l hello-gas
Elf file type is EXEC (Executable file)
Entry point 0x400078
There are 1 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x00000000000000b4 0x00000000000000b4  R E    200000

 Section to Segment mapping:
  Segment Sections...
   00     .text .rodata 

This program contains only one program headers. Program header says which data is to be loaded in which memory address. In this case, 0xb4(180) bytes of data will be loaded at virtual address 0x400000.

Section Header

e_shoff member of ELF header points the section header. Section header lists the set of sections of the binary. That is array of el64_shdr structure.

typedef struct elf64_shdr {
  Elf64_Word sh_name; /* Section name, index in string tbl */
  Elf64_Word sh_type; /* Type of section */
  Elf64_Xword sh_flags; /* Miscellaneous section attributes */
  Elf64_Addr sh_addr; /* Section virtual addr at execution */
  Elf64_Off sh_offset; /* Section file offset */
  Elf64_Xword sh_size; /* Size of section in bytes */
  Elf64_Word sh_link; /* Index of another section */
  Elf64_Word sh_info; /* Additional section information */
  Elf64_Xword sh_addralign; /* Section alignment */
  Elf64_Xword sh_entsize; /* Entry size if section holds table */
} Elf64_Shdr;

Actual section header contains following data :

[hshwang2@australia x86_64]$ objdump -h hello-gas
hello-gas:     file format elf64-x86-64
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         0000002e  0000000000400078  0000000000400078  00000078  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .rodata       0000000e  00000000004000a6  00000000004000a6  000000a6  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .debug_aranges 00000030  0000000000000000  0000000000000000  000000c0  2**4
                  CONTENTS, READONLY, DEBUGGING
  3 .debug_info   00000074  0000000000000000  0000000000000000  000000f0  2**0
                  CONTENTS, READONLY, DEBUGGING
  4 .debug_abbrev 00000014  0000000000000000  0000000000000000  00000164  2**0
                  CONTENTS, READONLY, DEBUGGING
  5 .debug_line   00000046  0000000000000000  0000000000000000  00000178  2**0
                  CONTENTS, READONLY, DEBUGGING

.text section is where program code is. Using objdump utility, disassembled section can be seen :

[hshwang2@australia x86_64]$ objdump -S hello-gas
hello-gas:     file format elf64-x86-64
Disassembly of section .text:
0000000000400078 <_start>:
.text
.globl  _start

_start:
    movq    $len,%rdx           /* message length */
  400078:   48 c7 c2 0e 00 00 00    mov    $0xe,%rdx
    movq    $msg,%rsi           /* message location */
  40007f:   48 c7 c6 a6 00 40 00    mov    $0x4000a6,%rsi
    movq    $1,%rdi             /* file descriptor stdout */
  400086:   48 c7 c7 01 00 00 00    mov    $0x1,%rdi
    movq    $1,%rax             /* syscall sys_write */
  40008d:   48 c7 c0 01 00 00 00    mov    $0x1,%rax
    syscall
  400094:   0f 05                   syscall 

    movq    $0,%rdi             /* exit status */
  400096:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
    movq    $60,%rax            /* syscall sys_exit */
  40009d:   48 c7 c0 3c 00 00 00    mov    $0x3c,%rax
    syscall
  4000a4:   0f 05                   syscall 

When I compare this output from hello-gas.s source code, every line is the same except that some variables are changed to address.
$len is changed to the length of “Hello, world!\n”, 0xe (14).
$msg is changed to the address($0x4000a6) that will contain “Hello, world!\n” string data, which will be in .rodata section.

.rodata section contains strings that are contained in the code. objdump -s -j .rodata file command shows the hexdump of .rodata section.

[hshwang2@australia x86_64]$ objdump -s -j .rodata hello-gas
hello-gas:     file format elf64-x86-64

Contents of section .rodata:
 4000a6 48656c6c 6f2c2077 6f726c64 210a      Hello, world!.

As the code in the .text says (mov $0x4000a6,%rsi), “Hello, world!” string is in 0x4000a6.

When the executable file is actually loaded in the memory, the address can be different. That is done by the process called relocation. The example file doesn’t contain relocation information. Therefore, this file will be loaded at the address written in the ELF file.

Following image shows how ELF file is loaded into memory.

ELF diagram
Source : OSDev.org

Reference

[SPO600] Compiled C Lab – Compiler option and assembly

This posting investigates how a simple C code is translated to an assembly language relating to Compiled C Lab in SPO600 course in Seneca College. I especially focused on “(5) Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.” part in the lab.

This lab is performed in a X86_64 machine : australia.proximity.on.ca

First Sample : hello0.c

#include <stdio.h>

void output(char *c)
{
  printf(c);
}

int main()
{
  printf("Print in main\n");
  output("First Param\n");
}

After printing “Print in main” in main function called output() function with one char array. I compiled with -O0 optimization option : No Optimization. For easy analysis, assembly code is generated using -S option. And generated object dump file using objdump.

  • compile : g++ -O0 -o hello0 hello0.c
  • generate assembly : g++ -O0 -S -o hello0.asm hello0.c
  • object dump : objdump -d hello0 > hello0.dump

following is output and main main function part of hello0.dump

0000000000400646 <_Z6outputPc>:
  400646:   55                      push   %rbp
  400647:   48 89 e5                mov    %rsp,%rbp
  40064a:   48 83 ec 10             sub    $0x10,%rsp
  40064e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
  400652:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  400656:   48 89 c7                mov    %rax,%rdi
  400659:   b8 00 00 00 00          mov    $0x0,%eax
  40065e:   e8 ad fe ff ff          callq  400510 <printf@plt>
  400663:   c9                      leaveq 
  400664:   c3                      retq   

0000000000400665 <main>:
  400665:   55                      push   %rbp
  400666:   48 89 e5                mov    %rsp,%rbp
  400669:   bf 20 07 40 00          mov    $0x400720,%edi
  40066e:   e8 bd fe ff ff          callq  400530 <puts@plt>
  400673:   bf 2e 07 40 00          mov    $0x40072e,%edi
  400678:   e8 c9 ff ff ff          callq  400646 <_Z6outputPc>
  40067d:   b8 00 00 00 00          mov    $0x0,%eax
  400682:   5d                      pop    %rbp
  400683:   c3                      retq   
  400684:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  40068b:   00 00 00 
  40068e:   66 90                   xchg   %ax,%ax

The same part of hello0.asm is :

_Z6outputPc:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    %rax, %rdi
    movl    $0, %eax
    call    printf
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   _Z6outputPc, .-_Z6outputPc
    .section    .rodata
.LC0:
    .string "Print in main"
.LC1:
    .string "First Paramn"
    .text
    .globl  main
    .type   main, @function
main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $.LC0, %edi
    call    puts
    movl    $.LC1, %edi
    call    _Z6outputPc
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

push %rbp and mov %rsp,%rbp two lines in every function starting point is not significant: typical code for stack pointer saving to manage function calling.

In main(), “Print in main” string address is moved to %edi register (movl $.LC0, %edi) and puts function is called(call puts). This is probably because only one parameter is used for printf function. Then “First Param” string address is moved to %edi(movl $.LC1, %edi), and putput() function is called(call _Z6outputPc). Both cases, %edi register is used to send a parameter.

in output(), these two line : mov %rdi,-0x8(%rbp) and mov -0x8(%rbp),%rax looks like for managing an argument. and then rax is used to send an argument to printf().

Second sample : hello2.c

#include <stdio.h>

void output(char *c, char *d, char *e, char *f, char *g)
{
  printf("text %s, %s, %s, %s, %s", c, d, e, f, g);
}

int main()
{
  printf("Print in main\n");
  output("First Param\n", "Second param\n", "Third param\n", "Forth param\n", "Fifth param\n");
}

This sample send 5 arguments to output function, and output function calls printf function with 6 arguments.

Following is object dump file :

0000000000400646 <_Z6outputPcS_S_S_S_>:
  400646:   55                      push   %rbp
  400647:   48 89 e5                mov    %rsp,%rbp
  40064a:   48 83 ec 30             sub    $0x30,%rsp
  40064e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
  400652:   48 89 75 f0             mov    %rsi,-0x10(%rbp)
  400656:   48 89 55 e8             mov    %rdx,-0x18(%rbp)
  40065a:   48 89 4d e0             mov    %rcx,-0x20(%rbp)
  40065e:   4c 89 45 d8             mov    %r8,-0x28(%rbp)
  400662:   48 8b 7d d8             mov    -0x28(%rbp),%rdi
  400666:   48 8b 75 e0             mov    -0x20(%rbp),%rsi
  40066a:   48 8b 4d e8             mov    -0x18(%rbp),%rcx
  40066e:   48 8b 55 f0             mov    -0x10(%rbp),%rdx
  400672:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  400676:   49 89 f9                mov    %rdi,%r9
  400679:   49 89 f0                mov    %rsi,%r8
  40067c:   48 89 c6                mov    %rax,%rsi
  40067f:   bf 60 07 40 00          mov    $0x400760,%edi
  400684:   b8 00 00 00 00          mov    $0x0,%eax
  400689:   e8 82 fe ff ff          callq  400510 <printf@plt>
  40068e:   c9                      leaveq 
  40068f:   c3                      retq   

0000000000400690 <main>:
  400690:   55                      push   %rbp
  400691:   48 89 e5                mov    %rsp,%rbp
  400694:   bf 78 07 40 00          mov    $0x400778,%edi
  400699:   e8 92 fe ff ff          callq  400530 <puts@plt>
  40069e:   41 b8 86 07 40 00       mov    $0x400786,%r8d
  4006a4:   b9 93 07 40 00          mov    $0x400793,%ecx
  4006a9:   ba a0 07 40 00          mov    $0x4007a0,%edx
  4006ae:   be ad 07 40 00          mov    $0x4007ad,%esi
  4006b3:   bf bb 07 40 00          mov    $0x4007bb,%edi
  4006b8:   e8 89 ff ff ff          callq  400646 <_Z6outputPcS_S_S_S_>
  4006bd:   b8 00 00 00 00          mov    $0x0,%eax
  4006c2:   5d                      pop    %rbp
  4006c3:   c3                      retq   
  4006c4:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  4006cb:   00 00 00 
  4006ce:   66 90                   xchg   %ax,%ax

Following is generated assembly code :

.LC0:
    .string "text %s, %s, %s, %s, %s"
_Z6outputPcS_S_S_S_:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $48, %rsp
    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    %rdx, -24(%rbp)
    movq    %rcx, -32(%rbp)
    movq    %r8, -40(%rbp)
    movq    -40(%rbp), %rdi
    movq    -32(%rbp), %rsi
    movq    -24(%rbp), %rcx
    movq    -16(%rbp), %rdx
    movq    -8(%rbp), %rax
    movq    %rdi, %r9
    movq    %rsi, %r8
    movq    %rax, %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   _Z6outputPcS_S_S_S_, .-_Z6outputPcS_S_S_S_
    .section    .rodata
.LC1:
    .string "Print in main"
.LC2:
    .string "Fifth paramn"
.LC3:
    .string "Forth paramn"
.LC4:
    .string "Third paramn"
.LC5:
    .string "Second paramn"
.LC6:
    .string "First Paramn"
    .text
    .globl  main
    .type   main, @function
main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $.LC1, %edi
    call    puts
    movl    $.LC2, %r8d
    movl    $.LC3, %ecx
    movl    $.LC4, %edx
    movl    $.LC5, %esi
    movl    $.LC6, %edi
    call    _Z6outputPcS_S_S_S_
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endpro

Important part is following :

main:
    movl    $.LC2, %r8d     //Fifth param
    movl    $.LC3, %ecx     //Forth param
    movl    $.LC4, %edx     //Third param
    movl    $.LC5, %esi     //Second param
    movl    $.LC6, %edi //First Param
    call    _Z6outputPcS_S_S_S_
output:
    movq    %rdi, -8(%rbp)  //First Param
    movq    %rsi, -16(%rbp) //Second param
    movq    %rdx, -24(%rbp) //Third param
    movq    %rcx, -32(%rbp) //Forth param
    movq    %r8, -40(%rbp)  //Fifth param
    movq    -40(%rbp), %rdi //First Param
    movq    -32(%rbp), %rsi //Second param
    movq    -24(%rbp), %rcx //Forth param
    movq    -16(%rbp), %rdx //Third param
    movq    -8(%rbp), %rax  //Fifth param
    movq    %rdi, %r9       
    movq    %rsi, %r8       
    movq    %rax, %rsi      
    movl    $.LC0, %edi     //"text %s, %s, %s, %s, %s"
    movl    $0, %eax
    call    printf

In main, before calling output, 5th, 4th, 3rd, 2nd, and 1st arguments set to registers, r8d, cs, edx, esi, and edi, respectively. Then in output(), now rdi, rsi, rdx, rcx, and %r8 is used to store them in stack local memory. And then again these values are loaded to registers. After loading first argument for formatting (“text %s, %s, %s, %s, %s”), printf function is called.

Debugging, call stack analysis

Followings are debugging screenshot:
Screenshot from 2015-02-22 17:42:01

Screenshot from 2015-02-22 18:02:25

While executing main() function’s code starting from 0x4005d0, base pointer(rbp) was 0x7fffffffdc70, and stack pointer(rsp) was also 0x7fffffffdc70; the same address probably because there is no command line argument and local variable. Then, when output() function was executing, base pointer was 0x7fffffffdc60 and stack pointer was 0x7ffffffffdc30.

Stack looks like this :

Screenshot from 2015-02-22 20:39:21

[SPO600] Platform Specific Code : Array/Pointer Access Analysis

In the last posting, I compaired the time between big[i] = i; and *p++ = i;. Interesting point was the gap between two method in AArch64 was much smaller than the gap in X86. In this posting, I looked at assembly code of it.

Array element calculation

    for(int i = 0; i < ARR_SIZE; i++){
        big[i] = i;
    }

This code is translated to following assembly code in AArch64 :

.L3:
        ldrsw   x2, [x29,60]
        adrp    x0, big
        add     x0, x0, :lo12:big
        ldrsw   x1, [x29,60]
        str     x2, [x0,x1,lsl 3]
        ldr     w0, [x29,60]
        add     w0, w0, 1
        str     w0, [x29,60]

big[i] = i; part is line 6 : str x2, [x0,x1,lsl 3].
“lsl” is “logical shift left”; I it means << 3.
For example, binary 00000001(1) becomes 00001000(8).
8 bits are 1 byte. big[] is 1 byte char array, thus it make sense.
I am not sure how rest of the code works. But it seems clear that shift operation is used to calculate array element position.

Whereas, in X86 64 :

.L3:
        movl    -4(%rbp), %eax
        movslq  %eax, %rdx
        movl    -4(%rbp), %eax
        cltq
        movq    %rdx, big(,%rax,8)
        addl    $1, -4(%rbp)

big[i] = i; part is line 6 : movq %rdx, big(,%rax,8)
According to googling, big(,%rax,8) means something like M[0x402680 + (8 * RAX)]. Which means it uses multiplication to calculate the position.

Generally said that shift operation is faster than multiplication in CPU level. I guess the speed difference between AArch64 and X86 comes from this.

Pointer enumeration

    for(int i = 0; i < ARR_SIZE; i++){
        *p++ = i;
    }

This piece of code is translated to assembly code in AArch64 like this :

.L5:
        ldr     x0, [x29,48]
        add     x1, x0, 8
        str     x1, [x29,48]
        ldrsw   x1, [x29,44]
        str     x1, [x0]
        ldr     w0, [x29,44]
        add     w0, w0, 1
        str     w0, [x29,44]

And in X86 :

.L5:
        movq    -16(%rbp), %rax
        leaq    8(%rax), %rdx
        movq    %rdx, -16(%rbp)
        movl    -20(%rbp), %edx
        movslq  %edx, %rdx
        movq    %rdx, (%rax)
        addl    $1, -20(%rbp)

In both machines, there were no significant difference : most of them were moving and adding.

Conclusion

  • Code like *pointer++ = i is generated the same kind of assembly code in both AArch64 and X86.
  • Code like array[i++] = j is generated using shift operation and multiplication in AArch64 and X84, repectively. This cause speed difference.

Reference

Tips for compiling/gdb for assembly

  • Generating assembly code : $ g++ -S -o ar2.a array2.c
  • Building for debugging : g++ -g -o ar2 array2.c
  • Assembly mode in gdb : (gdb)layout asm
  • View registers : (gdb)info registers
  • View a register : (gdb)i r x1

[SPO600] Platform Specific Code : Array Access Optimization

In the last posting, I examined how array is arranged in memory. In this posting, I will test about the speed depending on array access method.

Sample Code

I wrote simple test code :

#include <stdio.h>
#include <time.h>

#define ARR_SIZE 400000000

long big[ARR_SIZE];

int main()
{   
    time_t start, end;

    start = clock();
    for(int i = 0; i < ARR_SIZE; i++){
        big[i] = i;
    }
    end = clock();
    printf("Delay time : %fn", (double)(end - start)/CLOCKS_PER_SEC);

    start = clock();
    long *p = big;
    for(int i = 0; i < ARR_SIZE; i++){
        *p++ = i;
    }
    end = clock();
    printf("Delay time : %fn", (double)(end - start)/CLOCKS_PER_SEC);
}

The source code makes 400 million long array; long in 64bit machine is 8 bytes.
First loop set array from big[0] to big[399999999] in order. Second loop do the same thing using pointer increment.

Result in AArch64 and X86

I tested 10 times in both machines. Followings are the result :

ARM64 Array ARM64 Pointer X86 Array X86 Pointer
2.540624 1.835648 3.344561 1.466308
2.541102 1.835669 3.383403 1.470134
2.541190 1.835670 3.323564 1.490613
2.541096 1.835970 3.339740 1.500834
2.541325 1.837497 3.362360 1.315499
2.540733 1.835697 3.400443 1.483274
2.541273 1.835647 3.382420 1.490314
2.541166 1.835666 3.286187 1.285057
2.540958 1.835688 3.274380 1.490684
2.541311 1.835644 3.331944 1.455013

When I draw a graph :

AArch64 :
ARMgraph

X86 :
X86graph

Conclusion

  • In this case, pointer access is faster and array access.
  • In AArch64 machine, pointer access was around 50% faster than array access.
  • In X86 machine, pointer access was more than 100% faster than array access.
  • Interestingly, AArch64 machine’s speed gap was smaller than X86 : pointer access was slower than x86 and array access was faster than x86.
  • In this case changing from array access to pointer access can be a solution for optimization.

Presentation for this topic