Calling Conventions

Before we go any further…

It is important to understand that this section is not a general-purpose description of the calling conventions presented here. It only explains them for the parameter and return types supported by dyncall; unsupported types, such as the SIMD data types __m64, __m128, __m128i and __m128d, are not covered.
We strongly advise the reader not to use this document as a general-purpose calling convention reference.

x86 Calling Conventions

Overview

On this processor, a word is defined to be 16 bits in size, a dword 32 bits and a qword 64 bits.

There are numerous different calling conventions on the x86 processor architecture, like cdecl [8], MS fastcall [10], GNU fastcall [11], Borland fastcall [12], Watcom fastcall [13], Win32 stdcall [9], MS thiscall [14], GNU thiscall [15], the pascal calling convention [16] and a cdecl-like version for Plan9 [17] (dubbed plan9call by us), etc.

Name	# of regs for params	# of regs to preserve	push order	cleanup by	64bit args via regs?

cdecl	0	4	←	caller	-
MS fastcall	2	4	←	callee	Y
GNU fastcall	2	4	←	callee	N
Borland fastcall	3	4	→	callee	N
Watcom fastcall	4	2-6	←	callee	N
win32 stdcall	0	4	←	callee	-
MS thiscall	1	4	←	callee	N
GNU thiscall	0	4	←	caller	-
pascal	0	4	→	callee	-
plan9call	0	0	←	caller	-

Table 10: short x86 calling convention comparison

dyncall support

Currently cdecl, stdcall, fastcall (MS and GNU), thiscall (MS and GNU) and plan9call are supported.
Dyncall can also be used to issue syscalls on Linux and *BSD by using the syscall number as target parameter and selecting the correct mode.

cdecl

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	preserve
ecx	scratch
edx	scratch, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 11: Register usage on x86 cdecl calling convention

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
all arguments are pushed onto the stack (as dwords)
arguments > 64 bits are pushed as a sequence of dwords
aggregates (structs, unions) are pushed as a sequence of dwords
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
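
The rules above can be sketched in a few lines of Python. The helper below is purely illustrative: its name and the (name, size) input format are assumptions, not part of dyncall. It computes each argument's offset from esp at function entry and the number of bytes the caller removes after the call, for scalar arguments only.

```python
def cdecl_stack_offsets(args):
    """args: list of (name, size_in_bytes) in declaration order (hypothetical format).

    Returns ([(name, offset_from_esp_at_entry)], bytes_the_caller_cleans_up).
    """
    offset = 4  # [esp+0] holds the return address pushed by the call instruction
    layout = []
    for name, size in args:
        # Arguments are pushed right-to-left, so arg 0 sits at the lowest address.
        layout.append((name, offset))
        offset += (size + 3) // 4 * 4  # every argument occupies whole dwords
    return layout, offset - 4


layout, cleanup = cdecl_stack_offsets([("a", 4), ("b", 8), ("c", 1)])
print(layout)   # [('a', 4), ('b', 8), ('c', 16)]
print(cleanup)  # 16
```

Note that the 1-byte argument still takes a full dword, which is why the caller removes 16 bytes rather than 13.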

Return values

for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning via the stack), and callee writes return value to this space; the ptr to the aggregate is returned in eax
return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers and aggregates (structs, unions) > 32 and <= 64 bits are returned via the eax and edx registers
return values > 64 bits (e.g. aggregates) are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit ﬁrst parameter (this means, on the stack)
ﬂoating point types are returned via the st0 register (except on Minix, where they are returned as integers are)

Stack layout

Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | arg n-1                      |   |
  stack parameters  | ...                          |   |  caller’s frame
                    | arg 0                        |  /
                    |------------------------------|
                    | return address               |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 1: Stack layout on x86 cdecl calling convention

MS fastcall

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	preserve
ecx	scratch, parameter 0
edx	scratch, parameter 1, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 12: Register usage on x86 fastcall (MS) calling convention

Parameter passing

stack parameter order: right-to-left
called function cleans up the stack
ﬁrst two integers/pointers (<= 32bit) are passed via ecx and edx (even if preceded by other arguments)
if ﬁrst argument is a 64 bit integer, it is passed via ecx and edx
all other parameters are pushed onto the stack (as dwords)
arguments > 64 bits are pushed as a sequence of dwords
aggregates (structs, unions) are pushed as a sequence of dwords, but are never split between registers and stack (if registers are still available and aggregate doesn’t ﬁt entirely into ecx and edx, it is passed via the stack and remaining registers are free for subsequent arguments)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
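
As a rough illustration of the register rules above, the following Python sketch assigns scalar arguments to ecx, edx or the stack. The function name and the (name, kind, bits) tuples are made up for this example; aggregates and non-trivial C++ types are not modelled.

```python
def ms_fastcall_assign(args):
    """args: list of (name, kind, bits) in declaration order, with kind one of
    'int', 'pointer' or 'float'. Returns (register map, stack push order)."""
    free = ["ecx", "edx"]
    in_regs, on_stack = {}, []
    for index, (name, kind, bits) in enumerate(args):
        is_int = kind in ("int", "pointer")
        if index == 0 and is_int and bits == 64:
            # A 64-bit first argument takes both registers; which half goes
            # where is not stated in the list above, so it is left open here.
            in_regs[name] = tuple(free)
            free = []
        elif is_int and bits <= 32 and free:
            in_regs[name] = (free.pop(0),)
        else:
            on_stack.append(name)
    return in_regs, on_stack[::-1]  # right-to-left: last stack argument first


regs, pushes = ms_fastcall_assign(
    [("x", "float", 32), ("a", "int", 32), ("b", "int", 64), ("p", "pointer", 32)]
)
print(regs)    # {'a': ('ecx',), 'p': ('edx',)}
print(pushes)  # ['b', 'x']
```

The example shows that a and p still land in registers although they are preceded by arguments that go to the stack.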

Return values

return values of pointer or integral type, as well as aggregates (structs, unions) <= 64 bits are returned via the eax and edx registers
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning via ecx), and callee writes return value to this space; the ptr to the aggregate is returned in eax
return values > 64 bits (e.g. aggregates) are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit ﬁrst parameter (always via the stack, never via a register)
ﬂoating point types are returned via the st0 register

Stack layout

Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | last arg                     |   |
  stack parameters  | ...                          |   |  caller’s frame
                    | first arg passed via stack   |  /
                    |------------------------------|
                    | return address               |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 2: Stack layout on x86 fastcall (MS) calling convention

GNU fastcall

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	preserve
ecx	scratch, parameter 0
edx	scratch, parameter 1, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 13: Register usage on x86 fastcall (GNU) calling convention

Parameter passing

stack parameter order: right-to-left
called function cleans up the stack
ﬁrst two integers/pointers (<= 32bit) are passed via ecx and edx (even if preceded by other arguments)
arguments > 32 bits are pushed onto the stack as a sequence of dwords (never passed via registers, any respective register is skipped and not used for subsequent args)
all other parameters are pushed onto the stack (as dwords)
aggregates (structs, unions) are pushed as a sequence of dwords, and never passed via registers (no matter their size, any respective register is skipped and not used for subsequent args)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
varargs are always passed via the stack

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers > 32 and <= 64 bits are returned via the eax and edx registers
aggregates (structs, unions) of any size are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit ﬁrst parameter (always via ecx), that same pointer is returned in eax
floating point types are returned via the st0 register

Stack layout

Stack directly after function prolog:

Figure 3: Stack layout on x86 fastcall (GNU) calling convention

Borland fastcall

Also called register convention by Borland.

Registers and register usage

Name	Brief description

eax	scratch, parameter 0, return value
ebx	preserve
ecx	scratch, parameter 2
edx	scratch, parameter 1, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 14: Register usage on x86 fastcall (Borland) calling convention

Parameter passing

stack parameter order: left-to-right
called function cleans up the stack
first three integers/pointers (with exception of method pointers) (<= 32bit) are passed via eax, edx and ecx, in that order (preceding or interleaved arguments that are not passed via registers are pushed onto the stack)
arguments > 32 bits are passed as a pointer to the value
aggregates (structs, unions) are pushed as a sequence of dwords, and never passed via registers (no matter their size)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
varargs are always passed via the stack
all other parameters are pushed onto the stack
the direction ﬂag is clear on entry and must be returned clear

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning via ecx), and callee writes return value to this space; the ptr to the aggregate is returned in eax
integers and aggregates (structs, unions) > 32 and <= 64 bits are returned via the eax and edx registers
ﬂoating point types are returned via the st0 register
return values > 32 bits (e.g. aggregates, long long, ...) are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit last parameter

Stack layout

Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | first arg passed via stack   |   |
  stack parameters  | ...                          |   |  caller’s frame
                    | last arg                     |  /
                    |------------------------------|
                    | return address               |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 4: Stack layout on x86 fastcall (Borland) calling convention

Watcom fastcall

Registers and register usage

Name	Brief description

eax	scratch, parameter 0, return value
ebx	scratch when used for parameter, otherwise preserve, parameter 2
ecx	scratch when used for parameter, otherwise preserve, parameter 3
edx	scratch when used for parameter, otherwise preserve, parameter 1, return value
esi	scratch when used for return pointer, otherwise preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 15: Register usage on x86 fastcall (Watcom) calling convention

Parameter passing

stack parameter order: right-to-left
called function cleans up the stack
ﬁrst four integers/pointers (<= 32bit) are passed via eax, edx, ebx and ecx (even if preceded by other arguments)
arguments > 32 bits, as well as all subsequent arguments, are passed via the stack
aggregates (structs, unions) are passed as a pointer to the aggregate (a copy, if needed, to guarantee by-value semantics)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
all other parameters are pushed onto the stack

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers > 32 bits and <= 64 bits are returned via the eax and edx registers
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee via esi, and callee writes return value to this space; the ptr to the aggregate is returned in eax
aggregates (structs, unions) <= 32 bits are returned in eax
aggregates (structs, unions) > 32 bits are returned by the caller allocating the space and passing a pointer to the callee via esi, that same pointer is returned in eax

Stack layout

Stack directly after function prolog:

Figure 5: Stack layout on x86 fastcall (Watcom) calling convention

win32 stdcall

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	preserve
ecx	scratch
edx	scratch, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 16: Register usage on x86 stdcall calling convention

Parameter passing

stack parameter order: right-to-left
called function cleans up the stack
all parameters are pushed onto the stack (as dwords)
arguments > 64 bits are pushed as a sequence of dwords
aggregates (structs, unions) are pushed as a sequence of dwords
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
stack is usually 4 byte aligned (GCC >= 3.x seems to use a 16byte alignment)
the direction ﬂag is clear on entry and must be returned clear

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers > 32 and <= 64 bits are returned via the eax and edx registers
for aggregates and integer return values > 64 bits, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning via stack), and callee writes return value to this space; the ptr to the aggregate is returned in eax
ﬂoating point types are returned via the st0 register

Stack layout

Stack directly after function prolog:

Figure 6: Stack layout on x86 stdcall calling convention

MS thiscall

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	preserve
ecx	scratch, parameter 0
edx	scratch, return value
esi	preserve
edi	preserve
ebp	preserve
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 17: Register usage on x86 thiscall (MS) calling convention

Parameter passing

stack parameter order: right-to-left
called function cleans up the stack (except for variadic functions where the caller cleans up)
ﬁrst parameter (this pointer) is passed via ecx
all other parameters are pushed onto the stack
arguments > 64 bits are pushed as a sequence of dwords
aggregates (structs, unions) are pushed as a sequence of dwords
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers > 32 bits and <= 64 bits are returned via the eax and edx registers
aggregates (structs, unions) of any size are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit ﬁrst parameter, that same pointer is returned in eax
ﬂoating point types are returned via the st0 register

Stack layout

Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | arg n-1                      |   |
  stack parameters  | ...                          |   |  caller’s frame
                    | arg 1                        |  /
                    |------------------------------|
                    | return address               |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 7: Stack layout on x86 thiscall (MS) calling convention

GNU thiscall

This is equivalent to the cdecl calling convention, with the ﬁrst parameter being the this pointer.

pascal

The best known uses of the pascal calling convention are the 16 bit OS/2 APIs, Microsoft Windows 3.x and Borland Delphi 1.x. It is a variation of stdcall, however, arguments are passed from left-to-right. Since this calling convention is for 16-bit APIs, it is not discussed in further detail, here.

plan9call

Registers and register usage

Name	Brief description

eax	scratch, return value
ebx	scratch
ecx	scratch
edx	scratch
esi	scratch
edi	scratch
ebp	scratch
esp	stack pointer
st0	scratch, ﬂoating point return value
st1-st7	scratch

Table 18: Register usage on x86 plan9call calling convention

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
all parameters are pushed onto the stack (as dwords)
arguments > 64 bits are pushed as a sequence of dwords
aggregates (structs, unions) are pushed as a sequence of dwords
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate

Return values

return values of pointer or integral type (<= 32 bits) are returned via the eax register
integers > 32 bits and aggregates (structs, unions) of any size are returned by the caller allocating the space and passing a pointer to the callee as a new, implicit ﬁrst parameter, that same pointer is returned in eax
ﬂoating point types are returned via the st0 register (called F0 in plan9 8a’s terms)

Stack layout

Note there is no register save area at all. Stack directly after function prolog:

                    | ...                          |
local data          |------------------------------|  \
parameter area:     | arg n-1                      |   |
  stack parameters  | ...                          |   |  caller’s frame
                    | arg 0                        |  /
                    |------------------------------|
                    | return address               |  \
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 8: Stack layout on x86 plan9call calling convention

Linux syscalls

Parameter passing

syscall is issued by triggering interrupt 80h
syscall number is set in eax
params are passed in the following registers in this order: ebx, ecx, edx, esi, edi, ebp
for more than six arguments, ebx points to the list of further arguments (not used in practice, as Linux syscalls take at most six arguments)
register eax holds the return value
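
The register mapping above can be written down as a small Python sketch. The helper name is hypothetical and the function only builds the register image; it does not issue the interrupt.

```python
I386_LINUX_ARG_REGS = ("ebx", "ecx", "edx", "esi", "edi", "ebp")


def i386_linux_syscall_regs(number, *args):
    """Map a syscall number and up to six arguments to registers (int 80h)."""
    if len(args) > len(I386_LINUX_ARG_REGS):
        raise ValueError("more than six arguments are not modelled here")
    regs = {"eax": number}  # syscall number goes into eax
    regs.update(zip(I386_LINUX_ARG_REGS, args))
    return regs


# write(1, buf, 12): 4 is the i386 Linux syscall number of write,
# 0x1000 is a made-up buffer address used only for illustration.
print(i386_linux_syscall_regs(4, 1, 0x1000, 12))
# {'eax': 4, 'ebx': 1, 'ecx': 4096, 'edx': 12}
```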

*BSD syscalls

Parameter passing

syscall is issued by triggering interrupt 80h
syscall number is set in eax
params are passed on the stack as with the cdecl calling convention

x64 Calling Conventions

Overview

The x64 (64bit) architecture designed by AMD is based on Intel’s x86 (32bit) architecture, supporting it natively. It is sometimes referred to as x86-64, AMD64, or, cloned by Intel, EM64T or Intel64.
On this processor, a word is deﬁned to be 16 bits in size, a dword 32 bits and a qword 64 bits. Note that this is due to historical reasons (terminology didn’t change with the introduction of 32 and 64 bit processors).
The x64 calling convention for MS Windows [25] diﬀers from the SystemV x64 calling convention [26] used by Linux/*BSD/... Note that this is not the only diﬀerence between these operating systems. The 64 bit programming model in use by 64 bit windows is LLP64, meaning that the C types int and long remain 32 bits in size, whereas long long becomes 64 bits. Under Linux/*BSD/... it’s LP64.
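
To make the difference between the two programming models concrete, here is a small Python table of type sizes in bits. It is a sketch limited to the types named above plus pointers; the dictionary and function names are made up for this example.

```python
# Sizes in bits for the two 64-bit data models named above.
DATA_MODELS = {
    "LLP64": {"int": 32, "long": 32, "long long": 64, "pointer": 64},  # 64 bit Windows
    "LP64": {"int": 32, "long": 64, "long long": 64, "pointer": 64},   # Linux, *BSD, ...
}


def differing_types(model_a, model_b):
    """Return the C types whose size differs between two data models."""
    a, b = DATA_MODELS[model_a], DATA_MODELS[model_b]
    return sorted(t for t in a if a[t] != b[t])


print(differing_types("LLP64", "LP64"))  # ['long']
```

In practice this means that only long changes size when the same C declaration is compiled for both platforms.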

Compared to the x86 architecture, the 64 bit versions of the registers are called rax, rbx, etc. Furthermore, there are eight new general purpose registers r8-r15.

dyncall support

Currently, the MS Windows and System V calling conventions are supported.
Dyncall can also be used to issue syscalls on System V platforms by using the syscall number as target parameter and selecting the correct mode.

MS Windows

Registers and register usage

Name	Brief description

rax	scratch, return value
rbx	permanent
rcx	scratch, parameter 0 if integer or pointer
rdx	scratch, parameter 1 if integer or pointer
rdi	permanent
rsi	permanent
rbp	permanent, may be used as frame pointer
rsp	stack pointer
r8-r9	scratch, parameter 2 and 3 if integer or pointer
r10-r11	scratch, permanent if required by caller (used for syscall/sysret)
r12-r15	permanent
xmm0	scratch, ﬂoating point parameter 0, ﬂoating point return value
xmm1-xmm3	scratch, ﬂoating point parameters 1-3
xmm4-xmm5	scratch, permanent if required by caller
xmm6-xmm15	permanent

Table 19: Register usage on x64 MS Windows platform

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst 4 integer/pointer parameters are passed via rcx, rdx, r8, r9 (from left to right), others are pushed on stack (there is a spill area for the ﬁrst 4)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
aggregates (structs and unions) <= 64 bits are passed like equal-sized integers
ﬂoat and double parameters are passed via xmm0l-xmm3l
ﬁrst 4 parameters are passed via the correct register depending on the parameter type - with mixed ﬂoat and int parameters, some registers are left out (e.g. ﬁrst parameter ends up in rcx or xmm0, second in rdx or xmm1, etc.)
parameters in registers are right justiﬁed
parameters < 64bits are not zero extended - zero the upper bits containing garbage if needed (but they are always passed as a qword)
parameters > 64 bits are passed via a pointer to a copy (for aggregate types, that caller-allocated memory must be 16-byte aligned)
if callee takes address of a parameter, ﬁrst 4 parameters must be dumped (to the reserved space on the stack) - for ﬂoating point parameters, value must be stored in integer AND ﬂoating point register
caller cleans up the stack, not the callee (like cdecl)
stack is always 16byte aligned - since return address is 64 bits in size, stacks with an odd number of parameters are already aligned
ellipsis calls take ﬂoating point values in int and ﬂoat registers (single precision ﬂoats are promoted to double precision as required by ellipsis calls)
if size of parameters > 1 page of memory (usually between 4k and 64k), chkstk must be called
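
The positional nature of the register assignment is easiest to see in code. The Python sketch below is an illustration under simplifying assumptions: all parameters are scalars of at most 64 bits, the function name is invented, and for ellipsis calls every floating point parameter in a register position is duplicated.

```python
INT_REGS = ("rcx", "rdx", "r8", "r9")
FLT_REGS = ("xmm0", "xmm1", "xmm2", "xmm3")


def win64_assign(kinds, variadic=False):
    """kinds: sequence of 'int' (integer/pointer) or 'float' (float/double).

    Returns one location per parameter; stack locations are relative to rsp
    at function entry, where [rsp] holds the return address.
    """
    locations = []
    for position, kind in enumerate(kinds):
        if position >= 4:
            # 8 bytes return address plus one qword slot per earlier parameter
            # (the first four slots form the spill area).
            locations.append(f"[rsp+{8 + 8 * position}]")
        elif kind == "float":
            reg = FLT_REGS[position]
            if variadic:
                reg += " and " + INT_REGS[position]
            locations.append(reg)
        else:
            locations.append(INT_REGS[position])
    return locations


print(win64_assign(["int", "float", "int", "float", "int"]))
# ['rcx', 'xmm1', 'r8', 'xmm3', '[rsp+40]']
```

The register is chosen by parameter position, not by counting parameters per type, so xmm0 and rdx stay unused in the example.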

Return values

return values of pointer, integral or aggregate (structs and unions) type (<= 64 bits) are returned via the rax register
ﬂoating point types are returned via the xmm0 register
for any other type > 64 bits (or for non-trivial C++ aggregates of any size), a hidden ﬁrst parameter, with an address to the return value is passed (for C++ thiscalls it is passed as second parameter, after the this pointer)

Stack layout

Stack frame is always 16-byte aligned. Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | arg n-1                      |   |
  stack parameters  | ...                          |   |
                    | arg 4                        |   |  caller’s frame
  spill area        | r9 or xmm3                   |   |
                    | r8 or xmm2                   |   |
                    | rdx or xmm1                  |   |
                    | rcx or xmm0                  |  /
                    |------------------------------|
                    | return address               |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|  /
                    | ...                          |

Figure 9: Stack layout on x64 Microsoft platform

System V (Linux / *BSD / MacOS X)

Registers and register usage

Name	Brief description

rax	scratch, return value, special use for varargs (in al, see below)
rbx	permanent
rcx	scratch, parameter 3 if integer or pointer
rdx	scratch, parameter 2 if integer or pointer, return value
rdi	scratch, parameter 0 if integer or pointer
rsi	scratch, parameter 1 if integer or pointer
rbp	permanent, may be used as frame pointer
rsp	stack pointer
r8-r9	scratch, parameter 4 and 5 if integer or pointer
r10-r11	scratch
r12-r15	permanent
xmm0-xmm1	scratch, ﬂoating point parameters 0-1, ﬂoating point return value
xmm2-xmm7	scratch, ﬂoating point parameters 2-7
xmm8-xmm15	scratch
st0-st1	scratch, 16 byte ﬂoating point return value
st2-st7	scratch

Table 20: Register usage on x64 System V (Linux/*BSD)

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst 6 integer/pointer parameters are passed via rdi, rsi, rdx, rcx, r8, r9
ﬁrst 8 ﬂoating point parameters <= 64 bits are passed via xmm0l-xmm7l
parameters in registers are right justiﬁed
parameters that are not passed via registers are pushed onto the stack (with their sizes rounded up to qwords)
parameters < 64bits are not zero extended - zero the upper bits containing garbage if needed (but they are always passed as a qword)
integer/pointer parameters > 64 bit are passed via 2 registers
for calls to variadic functions, the number of used xmm registers is passed silently in al (passed number doesn’t need to be exact but an upper bound on the number of used xmm registers)
aggregates (structs, unions (and arrays within those)) follow a more complicated logic (the following only considers ﬁeld types supported by dyncall):
- non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
- aggregates > 16 bytes are always passed entirely via the stack
- all other aggregates are classiﬁed per qword, by looking at all ﬁelds occupying all or part of that qword, recursively
  - if any ﬁeld would be passed via the stack, the entire qword will
  - otherwise, if any ﬁeld would be passed like an integer/pointer value, the entire qword will
  - otherwise the qword is passed like a ﬂoating point value
- after qword classiﬁcation, the logic is:
  - if any qword is classiﬁed to be passed via the stack, the entire aggregate will
  - if the size of the aggregate is > 2 qwords, it is passed via the stack (except for single ﬂoating point values > 128bits)
  - all others are passed qword by qword according to their classiﬁcation, like individual arguments
  - however, an aggregate is never split between registers and the stack, if it doesn’t ﬁt into available registers it is entirely passed via the stack (freeing such registers for subsequent arguments)
stack is always 16byte aligned - since return address is 64 bits in size, stacks with an odd number of parameters are already aligned
no spill area is used on stack, iterating over varargs requires a speciﬁc va_list implementation
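
The per-qword classification of aggregates described above can be sketched as follows. This is a simplified Python model, not dyncall's implementation: the function names and the (offset, size, kind) field format are assumptions, only integer/pointer and float/double leaf fields are handled, and every qword is assumed to be touched by at least one field.

```python
def classify_aggregate(size, fields):
    """size: aggregate size in bytes; fields: (offset, size, kind) per scalar
    leaf field, kind 'int' (integers and pointers) or 'float'.
    Returns 'stack' or a list with one class per qword."""
    if size > 16:
        return "stack"
    classes = [None] * ((size + 7) // 8)
    for offset, fsize, kind in fields:
        for q in range(offset // 8, (offset + fsize - 1) // 8 + 1):
            if kind == "int":
                classes[q] = "int"    # an integer field decides the whole qword
            elif classes[q] is None:
                classes[q] = "float"
    return classes


def assign_aggregate(classes, free_int, free_sse):
    """Hand out registers qword by qword, never splitting the aggregate."""
    if classes == "stack":
        return "stack"
    if classes.count("int") > len(free_int) or classes.count("float") > len(free_sse):
        return "stack"  # registers stay free for subsequent arguments
    return [free_int.pop(0) if c == "int" else free_sse.pop(0) for c in classes]


# struct { double d; int i; }: 16 bytes including tail padding
c = classify_aggregate(16, [(0, 8, "float"), (8, 4, "int")])
print(c)                                              # ['float', 'int']
print(assign_aggregate(c, ["rdi", "rsi"], ["xmm0"]))  # ['xmm0', 'rdi']
# struct { float f; int i; }: both fields share one qword
print(classify_aggregate(8, [(0, 4, "float"), (4, 4, "int")]))  # ['int']
```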

Return values

return values of pointer or integral type are returned via the rax register (and rdx if needed)
ﬂoating point types are returned via the xmm0 register (and xmm1 if needed)
aggregates are ﬁrst classiﬁed in the same way as when passing them by value, then:
- for aggregates that would be passed via the stack (or for non-trivial C++ aggregates of any size), a hidden pointer to a non-shared, caller provided space is passed as hidden, ﬁrst argument; this pointer will be returned via rax
- otherwise, qword by qword is passed, using rax and rdx for integer/pointer qwords, and xmm0 and xmm1 for ﬂoating point ones
ﬂoating point values > 64 bits are returned via st0 and st1

Stack layout

Stack frame is always 16-byte aligned. A 128 byte large zone beyond the location pointed to by the stack pointer is referred to as ”red zone”, considered to be reserved and not be modiﬁed by signal or interrupt handlers (useful for temporary data not needed to be preserved across calls, and for optimizations for leaf functions). Stack directly after function prolog:

                           | ...                          |
register save area         |------------------------------|  \
local data (with padding)  |------------------------------|   |
parameter area:            | arg n-1                      |   |
  stack parameters         | ...                          |   |  caller’s frame
                           | arg 6                        |  /
                           |------------------------------|
                           | return address               |  \
register save area         |------------------------------|   |
local data                 |------------------------------|   |  current frame
parameter area             |------------------------------|  /
                           | ...                          |

Figure 10: Stack layout on x64 System V (Linux/*BSD)

System V syscalls

Parameter passing

syscall is issued via the syscall instruction
kernel destroys registers rcx and r11
syscall number is set in rax
params are passed in the following registers in this order: rdi, rsi, rdx, r10, r8, r9 (r10 takes the place of rcx, which the syscall instruction overwrites)
no stack in use, meaning syscalls are in theory limited to six arguments
register rax holds the return value (values in between -4095 and -1 indicate errors)
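
The error convention for the return value can be illustrated with a short Python sketch. The function name is hypothetical; the input is the raw, unsigned 64-bit content of rax after the syscall instruction.

```python
def decode_syscall_result(rax):
    """Return ('error', errno) for values in [-4095, -1], else ('ok', rax)."""
    # Reinterpret the unsigned 64-bit register value as a signed integer.
    signed = rax - (1 << 64) if rax & (1 << 63) else rax
    if -4095 <= signed <= -1:
        return ("error", -signed)
    return ("ok", rax)


print(decode_syscall_result(0xFFFFFFFFFFFFFFFE))  # ('error', 2)
print(decode_syscall_result(42))                  # ('ok', 42)
```

Restricting errors to this narrow range lets a syscall return large unsigned values, such as addresses, without them being mistaken for error codes.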

PowerPC (32bit) Calling Conventions

Overview

Word size is 32 bits
Big endian (MSB) and little endian (LSB) operating modes.
Processor operates on floats in double precision floating point arithmetic (IEEE-754) values directly (single precision is converted on the fly)
Apple macos/Mac OS X/Darwin PPC is speciﬁed in ”Mac OS X ABI Function Call Guide”[32]. It uses Big Endian (MSB)
Linux PPC 32-bit ABI is speciﬁed in ”LSB for PPC”[33] which is based on ”System V ABI”. It uses Big Endian (MSB)
PowerPC EABI is deﬁned in the ”PowerPC Embedded Application Binary Interface 32-Bit Implementation”[34]
There is also the ”PowerOpen ABI”[36], a nearly identical version of it is used in AIX

dyncall support

Dyncall and dyncallback are supported for PowerPC (32bit) Big Endian (MSB), for Darwin’s and System V’s calling convention.
Dyncall can also be used to issue syscalls by using the syscall number as target parameter and selecting the correct mode.

Mac OS X/Darwin

Registers and register usage

Name	Brief description

gpr0	scratch
gpr1	stack pointer
gpr2	scratch
gpr3,gpr4	return value, parameter 0 and 1 for integer or pointer, scratch
gpr5-gpr10	parameter 2-7 for integer or pointer parameters, scratch
gpr11	preserve
gpr12	branch target for dynamic code generation
gpr13-31	preserve
fpr0	scratch
fpr1	ﬂoating point return value, ﬂoating point parameter 0 (always double precision)
fpr2-fpr13	ﬂoating point parameters 1-12 (always double precision)
fpr14-fpr31	preserve
v0-v1	scratch
v2-v13	vector parameters
v14-v19	scratch
v20-v31	preserve
lr	link-register, scratch
ctr	count-register, scratch
cr0-cr7	conditional register ﬁelds, each 4-bit wide (cr0-cr1 and cr5-cr7 are scratch)

Table 21: Register usage on Darwin PowerPC 32-Bit

Parameter passing

stack grows down
stack parameter order: right-to-left
caller cleans up the stack
the ﬁrst 8 integer parameters are passed in registers gpr3-gpr10
the ﬁrst 13 ﬂoating point parameters are passed in registers fpr1-fpr13
64 bit arguments are passed as if they were two 32 bit arguments, without skipping registers for alignment (this means passing half via a register and half via the stack is allowed)
if a ﬂoat parameter is passed via a register, gpr registers are skipped for subsequent integer parameters (based on the size of the ﬂoat - 1 register for single precision and 2 for double precision ﬂoating point values)
the caller pushes subsequent parameters onto the stack
for every parameter passed via a register, space is reserved in the stack parameter area (in order to spill the parameters if needed - e.g. varargs)
ellipsis calls take ﬂoating point values in int and ﬂoat registers (single precision ﬂoats are promoted to double precision as required by ellipsis calls)
all nonvector parameters are aligned on 4-byte boundaries
vector parameters are aligned on 16-byte boundaries
composite parameters with a size of 1 or 2 bytes occupy the low-order bytes of their 4-byte area; this is inconsistent with other 32-bit PPC binary interfaces (in AIX and Mac OS 9, padding bytes always follow the data structure)
composite parameters 3 bytes or larger in size occupy high-order bytes
integer parameters < 32 bit are right-justiﬁed (meaning occupy higher-address bytes) in their 4-byte slot on the stack, requiring extra-care for big-endian targets
aggregates (struct, union) with only one (non-aggregate / non-array) ﬁeld are passed as if the ﬁeld itself would be passed
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
all other aggregates are passed as a sequence of words (like integer parameters)
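
The interplay between gpr and fpr registers described above can be sketched in Python. This is a simplified model with invented names and type tags: it covers scalar, non-variadic parameters only and ignores vectors and aggregates. Each parameter reserves words in the parameter area, and the first eight words are shadowed by gpr3-gpr10.

```python
def darwin_ppc32_assign(kinds):
    """kinds: sequence of 'i32', 'i64', 'f32', 'f64' (hypothetical tags).
    Returns one location string per parameter."""
    word = 0       # index into the parameter area, in 4-byte words
    next_fpr = 1   # floating point parameters use fpr1-fpr13
    out = []

    def word_location(w):
        return f"gpr{3 + w}" if w < 8 else f"stack word {w}"

    for kind in kinds:
        words = 2 if kind in ("i64", "f64") else 1
        if kind in ("f32", "f64") and next_fpr <= 13:
            out.append(f"fpr{next_fpr}")
            next_fpr += 1
        else:
            # 64 bit integers are two independent words and may be split
            # between the last gpr and the stack.
            out.append("+".join(word_location(word + i) for i in range(words)))
        word += words  # floats passed in an fpr still skip the matching gprs
    return out


print(darwin_ppc32_assign(["i32", "f64", "i32", "i64"]))
# ['gpr3', 'fpr1', 'gpr6', 'gpr7+gpr8']
print(darwin_ppc32_assign(["i32"] * 7 + ["i64"])[-1])
# gpr10+stack word 8
```

The first call shows gpr4 and gpr5 being skipped for the double; the second shows a 64 bit integer split between gpr10 and the stack.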

Return values

return values of integer <= 32bit or pointer type use gpr3
64 bit integers use gpr3 and gpr4 (hiword in gpr3, loword in gpr4)
ﬂoating point values are returned via fpr1
for all aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in gpr3), and callee writes return value to this space; the ptr to the aggregate is returned in gpr3

Stack layout

Stack frame is always 16-byte aligned. Prolog opens frame with additional, ﬁxed space for a linkage area, to hold a number of values (not all of them are required to be saved, though). Stack directly after function prolog:

                    | ...                          |
register save area  |------------------------------|  \
local data          |------------------------------|   |
parameter area:     | last arg                     |   |
  stack parameters  | ...                          |   |
                    | 9th word of arg data         |   |
  spill area        | gpr10                        |   |
  (as needed)       | ...                          |   |  caller’s frame
                    | gpr3                         |   |
linkage area        | reserved                     |   |
                    | reserved                     |   |
                    | reserved                     |   |
                    | return address (callee saved)|   |
                    | condition reg (callee saved) |  /
                    | parent stack frame pointer   |  \
register save area  |------------------------------|   |
local data          |------------------------------|   |  current frame
parameter area      |------------------------------|   |
linkage area        | ...                          |  /

Figure 11: Stack layout on ppc32 Darwin

System V PPC 32-bit

Registers and register usage

Name	Brief description

r0	scratch
r1	stack pointer, preserve
r2	system-reserved
r3-r4	parameter passing and return value, scratch
r5-r10	parameter passing, scratch
r11-r12	scratch
r13	small data area pointer register
r14-r30	local variables, preserve
r31	used for local variables or environment pointer, preserve
f0	scratch
f1	parameter passing and return value, scratch
f2-f8	parameter passing, scratch
f9-f13	scratch
f14-f31	local variables, preserve
cr0-cr7	condition register fields, each 4-bit wide (cr0-cr1 and cr5-cr7 are scratch)
lr	link register, scratch
ctr	count register, scratch
xer	ﬁxed-point exception register, scratch
fpscr	ﬂoating-point Status and Control Register

Table 22: Register usage on System V ABI PowerPC Processor

Parameter passing

Stack pointer (r1) is always 16-byte aligned. The EABI differs here: it requires only 8-byte alignment
8 general-purpose registers (r3-r10) for integer and pointer types
8 floating-point registers (f1-f8) for float (promoted to double) and double types
Additional arguments are passed on the stack, directly after the back-chain and saved return address (an 8-byte structure) in the caller’s stack frame
64-bit integer data types are passed as a whole, either in two consecutive 32-bit general-purpose registers (an odd and an even one, e.g. r3 and r4, skipping an even-numbered register if necessary) or on the stack; they are never split into a register part and a stack part
Ellipsis calls set CR bit 6
integer parameters < 32 bit are right-justified (meaning they occupy the higher-address bytes) in their 4-byte area, requiring extra care on big-endian targets
no spill area is used on stack, iterating over varargs requires a speciﬁc va_list implementation
aggregates (struct, union) and types > 64 bits are passed indirectly, as a pointer to the data (or a copy of it, if necessary to avoid modiﬁcation)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
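
The pairing rule for 64-bit integers is easier to follow with a concrete allocation. The Python sketch below is not dyncall code: the function name assign_gprs and the tuple layout of its result are made up for this example, and it only models 32-bit and 64-bit integer or pointer arguments. It walks an argument list and assigns r3-r10 as described above. What happens to a leftover register once a 64-bit value has gone to the stack is not stated above; the sketch leaves it available for later 32-bit arguments, which is an assumption.

```python
def assign_gprs(arg_sizes):
    """Assign System V PPC32 integer arguments to r3-r10 or the stack.

    arg_sizes: list of argument sizes in bytes (4 or 8).
    Returns one tuple per argument: ("reg", name), ("regpair", hi, lo)
    or ("stack",). Hypothetical helper for illustration only.
    """
    next_reg = 3   # r3 is the first parameter register
    last_reg = 10  # r10 is the last one
    locations = []
    for size in arg_sizes:
        if size == 4:
            if next_reg <= last_reg:
                locations.append(("reg", f"r{next_reg}"))
                next_reg += 1
            else:
                locations.append(("stack",))
        elif size == 8:
            # A pair must start at an odd register (r3, r5, r7 or r9),
            # so an even-numbered register is skipped if necessary.
            start = next_reg if next_reg % 2 == 1 else next_reg + 1
            if start + 1 <= last_reg:
                locations.append(("regpair", f"r{start}", f"r{start + 1}"))
                next_reg = start + 2
            else:
                # Never split: the whole value goes to the stack.
                # Assumption: next_reg is left untouched here.
                locations.append(("stack",))
        else:
            raise ValueError("this sketch only handles 4- and 8-byte integers")
    return locations
```

For a call like f(int, long long, int), the sketch yields r3, the pair r5/r6 (r4 is skipped) and r7.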

Return values

32-bit integers use register r3, 64-bit use registers r3 and r4 (hiword in r3, loword in r4)
ﬂoating-point values are returned using register f1
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in gpr3), and callee writes return value to this space; the ptr to the aggregate is returned in gpr3
aggregates (struct, union) <= 64 bits use gpr3 and gpr4
for all other aggregates and types > 64 bits, a secret ﬁrst parameter with an address to a caller allocated space is passed to the function (in gpr3), which is written to by the callee

Stack layout

Stack frame is always 16-byte aligned. Stack directly after function prolog:

                       |              ...              |
caller’s frame:
  register save area   |                               |
  local data           |                               |
  parameter area       | last arg                      |  \
                       | ...                           |   > stack parameters
                       | first arg passed via stack    |  /
                       | return address (callee saved) |
                       | parent stack frame pointer    |
current frame:
  register save area   |                               |
  local data           |                               |
  parameter area       |                               |
                       |              ...              |

Figure 12: Stack layout on System V ABI for PowerPC 32-bit calling convention

System V PPC 32-bit / Linux Standard Base version

This is in essence the same as the System V PPC 32-bit calling convention, but diﬀers for aggregate return values:

for all aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in gpr3), and callee writes return value to this space; the ptr to the aggregate is returned in gpr3

System V syscalls

Parameter passing

syscall is issued via the sc instruction
kernel destroys register r13
syscall number is set in r0
params are passed in registers r3 through r10
no stack in use, meaning syscalls are in theory limited to eight arguments
register r3 holds the return value, the overflow flag in condition register field cr0 signals errors in the syscall

PowerPC (64bit) Calling Conventions

Overview

Word size is 32 bits for historical reasons.
Doubleword size is 64 bits.
Big endian (MSB) and little endian (LSB) operating modes.
Apple Mac OS X/Darwin PPC is specified in "Mac OS X ABI Function Call Guide" [32]. It uses Big Endian (MSB).
Linux PPC 64-bit ABI is specified in "64-bit PowerPC ELF Application Binary Interface Supplement" [37], which is based on the "System V ABI".

dyncall support

Dyncall and dyncallback are supported for PowerPC (64bit) Big Endian and Little Endian ELF ABIs on System V systems. Mac OS X is not supported.
Dyncall can also be used to issue syscalls by using the syscall number as target parameter and selecting the correct mode.

PPC64 ELF ABI

Registers and register usage

Name	Brief description

gpr0	scratch
gpr1	stack pointer
gpr2	TOC base ptr (oﬀset table and data for position independent code), scratch
gpr3	return value, parameter 0 for integer or pointer, scratch
gpr4-gpr10	parameter 1-7 for integer or pointer parameters, scratch
gpr11	env pointer if needed, scratch
gpr12	used for exception handling and glink code, scratch
gpr13	used for system thread ID, preserve
gpr14-31	preserve
fpr0	scratch
fpr1-fpr4	ﬂoating point return value, ﬂoating point parameter 0-3 (always double precision)
fpr5-fpr13	ﬂoating point parameters 4-12 (always double precision)
fpr14-fpr31	preserve
v0-v1	scratch
v2-v13	vector parameters
v14-v19	scratch
v20-v31	preserve
lr	link-register, scratch
ctr	count-register, scratch
xer	ﬁxed point exception register, scratch
fpscr	ﬂoating point status and control register, scratch
cr0-cr7	condition register fields, each 4-bit wide (cr0-cr1 and cr5-cr7 are scratch)

Table 23: Register usage on PowerPC 64-Bit ELF ABI

Parameter passing

stack grows down
stack parameter order: right-to-left
caller cleans up the stack
stack is always 16 byte aligned
the stack pointer must be atomically updated (to avoid any timing window in which an interrupt can occur with a partially updated stack), usually with the stdu (store doubleword with update) instruction
the ﬁrst 8 integer parameters are passed in registers gpr3-gpr10
the ﬁrst 13 ﬂoating point parameters are passed in registers fpr1-fpr13
preserved registers are saved using a deﬁned order (from high to low addresses): fpr* (64bit aligned), gpr*, VRSAVE save word (32 bits), padding for alignment (4 or 12 bytes), v* (128bit aligned)
if a floating point parameter is passed via a register, a gpr register is skipped for subsequent integer parameters
the caller pushes subsequent parameters onto the stack
single precision ﬂoating point values use the second word in a doubleword
a quad precision ﬂoating point argument is passed as two consecutive double precision ones
integer types < 64 bit are sign or zero extended and use a doubleword
ellipsis calls take ﬂoating point values in int and ﬂoat registers (single precision ﬂoats are promoted to double precision as required by ellipsis calls)
space for all potential gpr* register passed arguments is reserved in the stack parameter area (in order to spill the parameters if needed - e.g. varargs), meaning a minimum of 64 bytes to hold gpr3-gpr10
all nonvector parameters are aligned on 8-byte boundaries
vector parameters are aligned on 16-byte boundaries
integer parameters < 64 bit are right-justified (meaning they occupy the higher-address bytes) in their 8-byte slot on the stack, requiring extra care on big-endian targets
aggregates (struct, union) are passed as a sequence of doublewords (following above rules for doublewords)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
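
The interplay between gpr and fpr assignment can be sketched as follows. This Python example is not dyncall code: the function name assign_ppc64_elf and the "param_area+N" notation (byte offset from the start of the stack parameter area) are made up for illustration, and it only models non-variadic calls with plain integer/pointer and floating point arguments. Every argument consumes one doubleword slot; a floating point argument passed in an fpr still consumes its slot, which is why the matching gpr is skipped.

```python
def assign_ppc64_elf(kinds):
    """Assign PPC64 ELF arguments to gpr3-gpr10, fpr1-fpr13 or the stack.

    kinds: sequence of "i" (integer or pointer) and "f" (float or double).
    Returns one location string per argument.
    Hypothetical helper for illustration only.
    """
    slot = 0      # doubleword index in the parameter area
    next_fpr = 1  # fpr1 is the first floating point parameter register
    locations = []
    for kind in kinds:
        if kind not in ("i", "f"):
            raise ValueError("kinds must only contain 'i' and 'f'")
        if kind == "f" and next_fpr <= 13:
            locations.append(f"fpr{next_fpr}")
            next_fpr += 1
        elif kind == "i" and slot < 8:
            locations.append(f"gpr{3 + slot}")
        else:
            # Registers exhausted: the value lives in its doubleword slot.
            locations.append(f"param_area+{8 * slot}")
        slot += 1  # each argument uses one doubleword, register or not
    return locations
```

For an (int, double, int) signature this gives gpr3, fpr1 and gpr5, with gpr4 skipped.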

Return values

return values of integer <= 32bit or pointer type use gpr3 and are zero or sign extended depending on their type
64 bit integers use gpr3
ﬂoating point values are returned via fpr1
for any aggregate (struct, union), the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in gpr3), and callee writes return value to this space; the ptr to the aggregate is returned in gpr3

Stack layout

Stack frame is always 16-byte aligned. Stack directly after function prolog:

                       |              ...              |
caller’s frame:
  register save area   |                               |
  local data           |                               |
  parameter area       | last arg                      |  \
                       | ...                           |   > stack parameters
                       | arg 8                         |  /
                       | gpr10                         |  \
                       | ...                           |   > spill area (as needed)
                       | gpr3                          |  /
  linkage area         | TOC ptr reg                   |
                       | reserved                      |
                       | reserved                      |
                       | return address (callee saved) |
                       | condition reg (callee saved)  |
                       | parent stack frame pointer    |
current frame:
  register save area   |                               |
  local data           |                               |
  parameter area       |                               |
  linkage area         |              ...              |

Figure 13: Stack layout on ppc64 ELF ABI

System V syscalls

Parameter passing

syscall is issued via the sc instruction
kernel destroys register r13
syscall number is set in r0
params are passed in registers r3 through r10
no stack in use, meaning syscalls are in theory limited to eight arguments
register r3 holds the return value, the overflow flag in condition register field cr0 signals errors in the syscall

ARM32 Calling Conventions

Overview

The ARM32 family of processors is based on the Advanced RISC Machines (ARM) processor architecture (32 bit RISC). The word size is 32 bits (and the programming model is ILP32).
Basically, this family of microprocessors can be run in 2 major modes:

Mode	Description

ARM	32bit instruction set
THUMB	compressed instruction set using 16bit wide instruction encoding

For more details, take a look at the ARM-THUMB Procedure Call Standard (ATPCS) [18], the Procedure Call Standard for the ARM Architecture (AAPCS) [19], as well as Debian’s ARM EABI port [23] and hard-ﬂoat [24] wiki pages.

dyncall support

Currently, the dyncall library supports the ARM and THUMB modes of the ARM32 family (ATPCS [18], EABI [23], the ARM hard-float (armhf) [24] variant, as well as Apple’s calling convention based on the ATPCS), excluding manually triggered ARM-THUMB interworking calls.
Also supported is armhf, a calling convention with register support to pass floating point numbers. FPA and the VFP (scalar mode) procedure call standards, as well as some instruction sets accelerating DSP and multimedia applications like the ARM Jazelle Technology (direct Java bytecode execution, providing acceleration for some bytecodes while calling software code for others), etc., are not supported by the dyncall library.

ATPCS ARM mode

Registers and register usage

In ARM mode, the ARM32 processor has sixteen 32 bit general purpose registers, namely r0-r15:

Name	Alias	Brief description

r0	a1	parameter 0, scratch, return value
r1	a2	parameter 1, scratch, return value
r2,r3	a3,a4	parameters 2 and 3, scratch
r4-r9	v1-v6	permanent
r10	sl	permanent
r11	fp	frame pointer, permanent
r12	ip	scratch
r13	sp	stack pointer, permanent
r14	lr	link register, permanent
r15	pc	program counter (note: due to pipeline, r15 points to 2 instructions ahead)

Table 24: Register usage on arm32

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst four words are passed using r0-r3
subsequent parameters are pushed onto the stack (in right to left order, such that the stack pointer points to the ﬁrst of the remaining parameters)
if the callee takes the address of one of the parameters and uses it to address other parameters (e.g. varargs) it has to copy - in its prolog - the ﬁrst four words to a reserved stack area adjacent to the other parameters on the stack
parameters <= 32 bits are passed as 32 bit words
64 bit parameters are passed as two 32 bit parts (even partly via the register and partly via the stack, although this doesn’t seem to be speciﬁed in the ATPCS)
aggregates (struct, union) are passed by value (after rounding up the size to the nearest multiple of 4), as a sequence of words (splitting across registers and stack is allowed)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
keeping the stack eight-byte aligned can improve memory access performance and is required by LDRD and STRD on ARMv5TE processors which are part of the ARM32 family, so, in order to avoid problems one should always align the stack (tests have shown, that GCC does care about the alignment when using the ellipsis)
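
The word-based assignment described above can be sketched in a few lines of Python. This is not dyncall code: the function name assign_atpcs and the "sp+N" notation (byte offset from the stack pointer at the call) are made up for this example. Every argument is treated as a sequence of 32-bit words; the first four words of the whole argument list go to r0-r3 and the rest to the stack, so a 64-bit value can end up split between r3 and the stack.

```python
def assign_atpcs(arg_sizes):
    """Assign ATPCS arguments, word by word, to r0-r3 or the stack.

    arg_sizes: list of argument sizes in bytes.
    Returns, per argument, the list of places its 32-bit words go to.
    Hypothetical helper for illustration only.
    """
    word = 0  # running index over all argument words
    locations = []
    for size in arg_sizes:
        if size <= 0:
            raise ValueError("argument sizes must be positive")
        # Sizes are rounded up to a multiple of 4 bytes (whole words).
        num_words = (size + 3) // 4
        places = []
        for _ in range(num_words):
            if word < 4:
                places.append(f"r{word}")
            else:
                places.append(f"sp+{4 * (word - 4)}")
            word += 1
        locations.append(places)
    return locations
```

With three 32-bit arguments followed by a 64-bit one, the last argument is assigned ["r3", "sp+0"], i.e. half in a register and half on the stack.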

Return values

return values <= 32 bits use r0
64 bit return values use r0 and r1
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0
aggregates (struct, union) <= 32 bits are returned like an integer (in r0)
for aggregates (struct, union) > 32 bits, the caller allocates space for the return value on the stack in its frame and passes a pointer to it in r0
for all other aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0

Stack layout

Stack directly after function prolog:

                                             |           ...            |
caller’s frame:
  register save area                         |                          |
  local data                                 |                          |
  parameter area                             | last arg                 |  \
                                             | ...                      |   > stack parameters
                                             | 5th word of arg data     |  /
current frame:
  parameter area (continued)                 | r3                       |  \
                                             | r2                       |   > spill area (if needed)
                                             | r1                       |   |
                                             | r0                       |  /
  register save area (with return address)   |                          |
  local data                                 |                          |
  parameter area                             |           ...            |

Figure 14: Stack layout on arm32

ATPCS THUMB mode

Registers and register usage

In THUMB mode, the ARM32 processor family supports eight 32 bit general purpose registers r0-r7 and access to high order registers r8-r15:

Name	Alias	Brief description

r0	a1	parameter 0, scratch, return value
r1	a2	parameter 1, scratch, return value
r2,r3	a3,a4	parameters 2 and 3, scratch
r4-r6	v1-v3	permanent
r7	v4	frame pointer, permanent
r8-r11	v5-v8	permanent
r12	ip	scratch
r13	sp	stack pointer, permanent
r14	lr	link register, permanent
r15	pc	program counter (note: due to pipeline, r15 points to 2 instructions ahead)

Table 25: Register usage on arm32 thumb mode

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst four words are passed using r0-r3
subsequent parameters are pushed onto the stack (in right to left order, such that the stack pointer points to the ﬁrst of the remaining parameters)
if the callee takes the address of one of the parameters and uses it to address other parameters (e.g. varargs) it has to copy - in its prolog - the ﬁrst four words to a reserved stack area adjacent to the other parameters on the stack
parameters <= 32 bits are passed as 32 bit words
64 bit parameters are passed as two 32 bit parts (even partly via the register and partly via the stack, although this doesn’t seem to be speciﬁed in the ATPCS)
aggregates (struct, union) are passed by value (after rounding up the size to the nearest multiple of 4), as a sequence of words (splitting across registers and stack is allowed)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
keeping the stack eight-byte aligned can improve memory access performance and is required by LDRD and STRD on ARMv5TE processors which are part of the ARM32 family, so, in order to avoid problems one should always align the stack (tests have shown, that GCC does care about the alignment when using the ellipsis)

Return values

return values <= 32 bits use r0
64 bit return values use r0 and r1
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0
aggregates (struct, union) <= 32 bits are returned like an integer (in r0)
for aggregates (struct, union) > 32 bits, the caller allocates space for the return value on the stack in its frame and passes a pointer to it in r0
for all other aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0

Stack layout

Stack directly after function prolog:

Figure 15: Stack layout on arm32 thumb mode

EABI (ARM and THUMB mode)

The ARM EABI is very similar to the ABI outlined in the ARM-THUMB procedure call standard (ATPCS) [18]. However, the EABI requires the stack to be 8-byte aligned at function entries, and it imposes alignment on 64 bit parameters: those are aligned on 8-byte boundaries on the stack, and on 2-register boundaries (meaning they start at an even-numbered register) when passed via registers. In order to achieve such an alignment, a register might have to be skipped for parameters passed via registers, or 4 bytes on the stack for parameters passed via the stack. Refer to the Debian ARM EABI port wiki for more information [23].
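
The effect of the 64 bit alignment rule can be sketched in Python. This is not dyncall code: the function name assign_eabi and the "sp+N" notation (byte offset from the stack pointer at the call) are made up for this example, and only 32-bit and 64-bit scalar arguments are modelled. Since r0-r3 hold an even number of words and the stack is 8-byte aligned at function entry, skipping to an even word index covers both the register and the stack case.

```python
def assign_eabi(arg_sizes):
    """Assign EABI integer arguments to r0-r3 or the stack.

    arg_sizes: list of argument sizes in bytes (4 or 8).
    Returns, per argument, the list of places its 32-bit words go to.
    Hypothetical helper for illustration only.
    """
    word = 0  # running word index: 0-3 are r0-r3, 4 and up are stack words
    locations = []
    for size in arg_sizes:
        if size not in (4, 8):
            raise ValueError("this sketch only handles 4- and 8-byte arguments")
        if size == 8 and word % 2 == 1:
            word += 1  # skip one register, or 4 bytes of stack
        places = []
        for _ in range(size // 4):
            if word < 4:
                places.append(f"r{word}")
            else:
                places.append(f"sp+{4 * (word - 4)}")
            word += 1
        locations.append(places)
    return locations
```

For (int, long long) this yields r0 and the pair r2/r3, with r1 left unused. For three ints followed by a long long, r3 is skipped and the 64 bit value is passed entirely on the stack, in contrast to the split described for the ATPCS.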

ARM on Apple’s iOS (Darwin) Platform (ARM and THUMB mode)

iOS runs on ARMv6 (iOS 2.0) and ARMv7 (iOS 3.0) architectures. Both ARM and THUMB modes are available; code is usually compiled in THUMB mode.

Register usage

Name	Alias	Brief description

r0		parameter 0, scratch, return value
r1		parameter 1, scratch, return value
r2,r3		parameters 2 and 3, scratch
r4-r6		permanent
r7		frame pointer, permanent
r8		permanent
r9		permanent (iOS 2.0) / scratch (since iOS 3.0)
r10-r11		permanent
r12		scratch, intra-procedure scratch register (IP) used by dynamic linker
r13	sp	stack pointer, permanent
r14	lr	link register, permanent
r15	pc	program counter (note: due to pipeline, r15 points to 2 instructions ahead)
cpsr		program status register
d0-d7		scratch, aliases s0-s15, on ARMv7 also as q0-q3; not accessible from Thumb mode on ARMv6
d8-d15		permanent, aliases s16-s31, on ARMv7 also as q4-q7; not accessible from Thumb mode on ARMv6
d16-d31		only available in ARMv7, aliases q8-q15
fpscr		VFP status register

Table 26: Register usage on ARM Apple iOS

Parameter passing and Return values

The ABI is based on the AAPCS but with the following important diﬀerences:

in ARM mode, r7 is used as frame pointer instead of r11 (so both, ARM and THUMB mode use the same convention)
r9 does not need to be preserved on iOS 3.0 and greater

Stack layout

Stack directly after function prolog:

                                             |           ...            |
caller’s frame:
  register save area                         |                          |
  local data                                 |                          |
  parameter area                             | last arg                 |  \
                                             | ...                      |   > stack parameters
                                             | 5th word of arg data     |  /
current frame:
  parameter area (continued)                 | r3                       |  \
                                             | r2                       |   > spill area (if needed)
                                             | r1                       |   |
                                             | r0                       |  /
  register save area (with return address)   |                          |
  local data                                 |                          |
  parameter area                             |           ...            |

Figure 16: Stack layout on arm32 (Apple)

ARM hard ﬂoat (armhf)

Most Debian-based Linux systems on ARMv7 (or ARMv6 with FPU) platforms use a calling convention referred to as armhf, using 16 32-bit floating point registers of the FPU of the VFPv3-D16 extension to the ARM architecture. Refer to the Debian wiki for more information [24].

Code is little-endian, the rest is similar to EABI, with an 8-byte aligned stack, etc.

Register usage

Name	Alias	Brief description

r0	a1	parameter 0, scratch, non ﬂoating point return value
r1	a2	parameter 1, scratch, non ﬂoating point return value
r2,r3	a3,a4	parameters 2 and 3, scratch
r4-r9	v1-v6	permanent
r10	sl	permanent
r11	fp	frame pointer, permanent
r12	ip	scratch, intra-procedure scratch register (IP) used by dynamic linker
r13	sp	stack pointer, permanent
r14	lr	link register, permanent
r15	pc	program counter (note: due to pipeline, r15 points to 2 instructions ahead)
cpsr		program status register
s0		ﬂoating point argument, ﬂoating point return value, single precision
d0		ﬂoating point argument, ﬂoating point return value, double precision, aliases s0-s1
s1-s15		ﬂoating point arguments, single precision
d1-d7		aliases s2-s15, ﬂoating point arguments, double precision
fpscr		VFP status register

Table 27: Register usage on armhf

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst four non-ﬂoating-point words are passed using r0-r3
out of those, 64bit parameters use 2 registers, either r0,r1 or r2,r3 (skipped registers are left unused)
ﬁrst 16 single-precision, or 8 double-precision arguments are passed via s0-s15 or d0-d7, respectively (note that since s and d registers are aliased, already used ones are skipped)
subsequent parameters are pushed onto the stack (in right to left order, such that the stack pointer points to the ﬁrst of the remaining parameters)
note that as soon as one floating point parameter is passed via the stack, subsequent single precision floating point parameters are also pushed onto the stack even if there are still free S* registers
ﬂoat and double vararg function parameters (no matter if in ellipsis part of function, or not) are passed like int or long long parameters, vfp registers aren’t used
if the callee takes the address of one of the parameters and uses it to address other parameters (e.g. varargs) it has to copy - in its prolog - the ﬁrst four words (for ﬁrst 4 integer arguments) to a reserved stack area adjacent to the other parameters on the stack
parameters <= 32 bits are passed as 32 bit words
aggregates (struct, union) with 1 to 4 identical ﬂoating-point members (either ﬂoat or double) are passed ﬁeld-by-ﬁeld, except if passed as a vararg
aggregates that could be passed via ﬂoating point register are never split across those and the stack, so if not enough registers are available an aggregate is passed entirely via the stack (implying above rule that any still unused ﬂoat registers will be skipped for any subsequent arg)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
all other aggregates (struct, union), after rounding up the size to the nearest multiple of 4, are passed as a sequence of dwords, like integers (splitting across registers and stack is allowed)
callee spills, caller reserves spill area space, though
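
The aliasing of the s and d registers makes the floating point assignment less obvious than the integer one, so here is a Python sketch of it. This is not dyncall code: the function name assign_vfp is made up for this example, and it only models non-variadic calls with plain float ("s") and double ("d") arguments, ignoring aggregates. A double dN occupies s(2N) and s(2N+1); a later float may fill a single s register that was skipped earlier, and once any floating point argument has gone to the stack, all later ones follow.

```python
def assign_vfp(kinds):
    """Assign armhf floating point arguments to s0-s15 / d0-d7 or the stack.

    kinds: sequence of "s" (float) and "d" (double).
    Returns one location string per argument.
    Hypothetical helper for illustration only.
    """
    free = [True] * 16  # availability of s0..s15
    on_stack = False    # set once any float argument went to the stack
    locations = []
    for kind in kinds:
        if kind not in ("s", "d"):
            raise ValueError("kinds must only contain 's' and 'd'")
        location = None
        if not on_stack:
            if kind == "s":
                # First free single precision register, possibly one
                # that was skipped by an earlier double.
                for i in range(16):
                    if free[i]:
                        free[i] = False
                        location = f"s{i}"
                        break
            else:
                # First even/odd pair of s registers that is still free.
                for i in range(0, 16, 2):
                    if free[i] and free[i + 1]:
                        free[i] = free[i + 1] = False
                        location = f"d{i // 2}"
                        break
        if location is None:
            on_stack = True
            location = "stack"
        locations.append(location)
    return locations
```

For (float, double, float) the result is s0, d1 and s1: the double skips s1 to use s2/s3 (d1), and the second float then fills s1.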

Return values

non ﬂoating point return values <= 32 bits use r0
non ﬂoating point 64-bit return values use r0 and r1
ﬂoating point return value uses s0 (for ﬂoat) or d0 (for double), respectively
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0
aggregates (struct, union) with 1 to 4 identical ﬂoating-point members are returned in s0-s3 (for ﬂoat) or d0-d3 (for double), respectively
all other aggregates <= 32 bits are returned via r0
for all other aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in r0), and callee writes return value to this space; the ptr to the aggregate is returned in r0

Stack layout

Stack directly after function prolog:

                                             |            ...             |
caller’s frame:
  register save area                         |                            |
  local data                                 |                            |
  parameter area                             | last arg                   |  \
                                             | ...                        |   > stack parameters
                                             | first arg passed via stack |  /
current frame:
  parameter area (continued)                 | r3                         |  \
                                             | r2                         |   > spill area (if needed)
                                             | r1                         |   |
                                             | r0                         |  /
  register save area (with return address)   |                            |
  local data                                 |                            |
  parameter area                             |            ...             |

Figure 17: Stack layout on arm32 armhf

Architectures

The ARM architecture family contains several revisions with different capabilities and extensions (such as thumb-interworking, more vector registers, ...). The following table sums up the most important properties of the various architecture standards, from a calling convention perspective.

Arch	Platforms	Details

ARMv4
ARMv4T	ARM 7, ARM 9, Neo FreeRunner (OpenMoko)
ARMv5	ARM 9E	BLX instruction available
ARMv6		No vector registers available in thumb
ARMv7	iPod touch, iPhone 3GS/4, Raspberry Pi 2	VFP, armhf convention on some platforms
ARMv8	iPhone 6 and higher	64bit support

Table 28: Overview of ARM Architecture, Platforms and Details

ARM64 Calling Conventions

Overview

ARMv8 introduced the AArch64 calling convention. ARM64 chips can be run in 64 or 32bit mode, but not by the same process. Interworking is only inter-process.
The word size is deﬁned to be 32 bits, a dword 64 bits. Note that this is due to historical reasons (terminology didn’t change from ARM32).
For more details, take a look at the Procedure Call Standard for the ARM 64-bit Architecture [20].

dyncall support

The dyncall library supports the ARM 64-bit AArch64 PCS ABI, as well as Apple’s and Microsoft’s conventions which are derived from it, for both, calls and callbacks.

AAPCS64 Calling Convention

Registers and register usage

ARM64 features thirty-one 64 bit general purpose registers, namely r0-r30, which are referred to as either x0-x30 for 64bit access, or w0-w30 for 32bit access (with upper bits either cleared or sign extended on load).
Also, there is sp/xzr/wzr, a register with restricted use, which serves as the stack pointer (sp) in instructions dealing with the stack, or as a hardware zero register (xzr/wzr) in all other instructions, and pc, the program counter. Additionally, there are thirty-two 128 bit registers v0-v31, to be used as SIMD and floating point registers, referred to as q0-q31, d0-d31 and s0-s31, respectively (in contrast to AArch32, those do not overlap multiple narrower registers), depending on their use:

Name	Brief description

x0-x7	parameters, scratch, return value
x8	indirect result location pointer
x9-x15	scratch
x16	permanent in some cases, can have special function (IP0), see doc
x17	permanent in some cases, can have special function (IP1), see doc
x18	reserved as platform register, advised not to be used for handwritten, portable asm, see doc
x19-x28	permanent
x29	permanent, frame pointer
x30	permanent, link register
sp	permanent, stack pointer
pc	program counter
v0-v7	scratch, ﬂoat parameters, return value
v8-v15	lower 64 bits are permanent, upper 64 bits are scratch
v16-v31	scratch
xzr	zero register, always zero

Table 29: Register usage on arm64

Parameter passing

stack parameter order: right-to-left
caller cleans up the stack
ﬁrst 8 integer arguments are passed using x0-x7
ﬁrst 8 ﬂoating point arguments are passed using d0-d7
subsequent parameters are pushed onto the stack
if the callee takes the address of one of the parameters and uses it to address other parameters (e.g. varargs) it has to copy - in its prolog - the ﬁrst 8 integer and 8 ﬂoating-point registers to a reserved stack area adjacent to the other parameters on the stack (only the unnamed integer parameters require saving, though)
aggregates (struct, union) with 1 to 4 identical ﬂoating-point members (either ﬂoat or double) are passed ﬁeld-by-ﬁeld (8-byte aligned if passed via stack), except if passed as a vararg
other aggregates (struct, union) > 16 bytes in size are passed indirectly, as a pointer to a copy (if needed)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
all other aggregates (struct, union), after rounding up the size to the nearest multiple of 8, are passed as a sequence of dwords, like integers
aggregates are never split across registers and stack, so if not enough registers are available an aggregate is passed via the stack (for aggregates that would’ve been passed as floating point values, any still unused float registers will be skipped for any subsequent arg)
stack is required throughout to be 16-byte aligned
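
The aggregate rules above amount to a small decision procedure, sketched here in Python. This is not dyncall code: the function name classify_aggregate, its parameters and the returned strings are made up for this example, and the caller is expected to describe the aggregate (size, floating point members, C++ triviality) itself. The sketch only classifies; it does not model running out of registers.

```python
def classify_aggregate(size, fp_members=None, nontrivial_cxx=False, vararg=False):
    """Classify how an aggregate parameter is passed under AAPCS64.

    size:           size of the aggregate in bytes
    fp_members:     list of "float"/"double" member types if the aggregate
                    consists solely of floating point members, else None
    nontrivial_cxx: True for non-trivial C++ aggregates
    vararg:         True if passed as a variadic argument
    Hypothetical helper for illustration only.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if nontrivial_cxx:
        return "indirect (pointer to copy)"
    is_homogeneous_fp = (
        fp_members is not None
        and 1 <= len(fp_members) <= 4
        and len(set(fp_members)) == 1
        and fp_members[0] in ("float", "double")
    )
    if is_homogeneous_fp and not vararg:
        return "field-by-field (floating point)"
    if size > 16:
        return "indirect (pointer to copy)"
    # Round the size up to a multiple of 8 bytes and pass as dwords.
    num_dwords = (size + 7) // 8
    return f"{num_dwords} dword(s), like integers"
```

A struct of four doubles (32 bytes) is still passed field-by-field, while a 24-byte struct of integers is passed indirectly and a 12-byte one as two dwords.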

Return values

integer return values use x0
ﬂoating-point return values use d0
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee via x8, and callee writes return value to this space; the ptr to the aggregate is returned in x0
aggregates (struct, union) that would be passed via registers if passed as a ﬁrst param, are returned via those registers
for aggregates not returnable via registers (e.g. if regs exhausted, or > 16 bytes, ...), the caller allocates space, passes pointer to it to the callee through x8, and callee writes return value to this space (note that this is not a hidden first param, as x8 is not used for passing params); the ptr to the aggregate is returned in x0

Stack layout

Stack directly after function prolog:

                                             |           ...            |
caller’s frame:
  register save area                         |                          |
  local data                                 |                          |
  parameter area                             | arg n-1                  |  \
                                             | ...                      |   > stack parameters
                                             | arg 8                    |  /
current frame:
  parameter area (continued)                 | x7                       |  \
                                             | ...                      |   |
                                             | x? (first unnamed reg)   |   > spill area (if needed)
                                             | q7                       |   |
                                             | ...                      |   |
                                             | q0                       |  /
  register save area (with return address)   |                          |
  local data                                 |                          |
  parameter area                             |           ...            |

Figure 18: Stack layout on arm64

Apple’s ARM64 Function Calling Convention

Overview

Apple’s ARM64 calling convention is based on the AAPCS64 standard, however, diverges in some ways. Only the diﬀerences are listed here, for more details, take a look at Apple’s oﬃcial documentation [21].

arguments passed via stack use only the space they need, but are subject to type alignment requirements (which is 1 byte for char and bool, 2 for short, 4 for int and 8 for every other type)
caller is required to sign- or zero-extend arguments smaller than 32 bits
empty aggregates (allowed in C++, but non-standard in C, however compiler extensions exist) as parameters:
- allowed to be ignored in C
- allowed to be ignored in C++, if aggregate is trivial, otherwise it’s treated as an aggregate with one byte ﬁeld
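
The packing of stack arguments can be illustrated with a short Python sketch. This is not dyncall code: the function name stack_offsets and the small type table are made up for this example, and only a handful of scalar types are covered. Each argument is placed at the next offset that satisfies its alignment, using only the space it needs.

```python
# (size, alignment) in bytes, following the alignment rule given above;
# only types with an unambiguous size are listed in this sketch.
TYPE_LAYOUT = {
    "char": (1, 1),
    "bool": (1, 1),
    "short": (2, 2),
    "int": (4, 4),
    "long": (8, 8),
    "pointer": (8, 8),
    "double": (8, 8),
}


def stack_offsets(type_names):
    """Return the stack offset of each stack-passed argument, in order.

    type_names: names from TYPE_LAYOUT for the arguments that did not
                fit into registers.
    Hypothetical helper for illustration only.
    """
    offset = 0
    offsets = []
    for name in type_names:
        if name not in TYPE_LAYOUT:
            raise ValueError(f"unsupported type in this sketch: {name}")
        size, align = TYPE_LAYOUT[name]
        # Round the running offset up to the type's alignment.
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += size
    return offsets
```

For stack arguments of type char, short, int, char and long, the offsets come out as 0, 2, 4, 8 and 16, so the five arguments fit into 24 bytes.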

Microsoft’s ARM64 Function Calling Convention

Overview

Microsoft’s ARM64 calling convention is based on the AAPCS64 standard, however, diverges for variadic functions. Only the diﬀerences are listed here, for more details, take a look at Microsoft’s oﬃcial documentation [22].

variadic function calls do not use any SIMD or ﬂoating point registers (for ﬁxed and variable args), meaning ﬁrst 8 params are passed via x0-x7, the rest via the stack
a function that returns an aggregate indirectly via a pointer passed via x8 does not seem to be required to put that address in x0 on return (but it should be safe to do so)

MIPS32 Calling Conventions

Overview

Multiple revisions of the MIPS Instruction set exist, namely MIPS I, MIPS II, MIPS III, MIPS IV, MIPS32 and MIPS64. Nowadays, MIPS32 and MIPS64 are the main ones used for 32-bit and 64-bit instruction sets, respectively.
Given MIPS processors are often used for embedded devices, several add-on extensions exist for the MIPS family, for example:

MIPS-3D: simple ﬂoating-point SIMD instructions dedicated to common 3D tasks.
MDMX: (MaDMaX) more extensive integer SIMD instruction set using 64 bit ﬂoating-point registers.
MIPS16e: adds compression to the instruction stream to make programs take up less room (allegedly a response to the THUMB instruction set of the ARM architecture).
MIPS MT: multithreading additions to the system similar to HyperThreading.

Unfortunately, there is actually no such thing as “The MIPS Calling Convention”. Many possible conventions are used by many different environments such as O32[38], O64[39], N32[40], N64[40], EABI[41] and NUBI[42].

dyncall support

Currently, for MIPS 32-bit architectures, dyncall supports the widely used O32 calling convention (for all four combinations of big/little-endian and soft/hard-float targets), as well as EABI (little-endian/hard-float, which is used by the Homebrew SDK for the Playstation Portable). dyncall currently does not support MIPS16e (contrary to the like-minded ARM-THUMB, which is supported). Both calls and callbacks are supported.

MIPS EABI 32-bit Calling Convention

Register usage

Name	Alias	Brief description

$0	$zero	hardware zero, scratch
$1	$at	assembler temporary, scratch
$2-$3	$v0-$v1	integer results, scratch
$4-$11	$a0-$a7	integer arguments, or double precision ﬂoat arguments, scratch
$12-$15,$24	$t4-$t7,$t8	integer temporaries, scratch
$25	$t9	integer temporary, address of callee for PIC calls (by convention), scratch
$16-$23	$s0-$s7	preserve
$26,$27	$k0,$k1	reserved for kernel
$28	$gp	global pointer, preserve
$29	$sp	stack pointer, preserve
$30	$s8/$fp	frame pointer (some assemblers name it $fp), preserve
$31	$ra	return address, preserve
hi, lo		multiply/divide special registers
$f0,$f2		ﬂoat results, scratch
$f1,$f3,$f4-$f11,$f20-$f23		ﬂoat temporaries, scratch
$f12-$f19		single precision ﬂoat arguments, scratch

Table 30: Register usage on MIPS32 EABI calling convention

Parameter passing

Stack grows down
Stack parameter order: right-to-left
Caller cleans up the stack
ﬁrst 8 integers (<= 32bit) are passed in registers $a0-$a7
ﬁrst 8 single precision ﬂoating point arguments are passed in registers $f12-$f19
64-bit stack arguments are always aligned to 8 bytes
64-bit integers or double precision ﬂoats are passed in two general purpose registers starting at an even register number, skipping one odd register
if either integer or ﬂoat registers are used up, the stack is used
if the callee takes the address of one of the parameters and uses it to address other unnamed parameters (e.g. varargs) it has to copy - in its prolog - the argument registers to a reserved stack area adjacent to the other parameters on the stack (only the unnamed integer parameters require saving, though)
float registers don’t seem to ever need to be saved that way, because floats passed to an ellipsis function are promoted to doubles, which in turn are passed in $a? register pairs, so only $a0-$a7 need to be spilled
aggregates (struct, union) <= 32bit are passed like an integer
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
all other aggregates (struct, union) are passed indirectly, as a pointer to a copy (if needed, and for vararg arguments required to be copied by the caller) of the struct
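
The register assignment described above can be sketched as a small simulation. This is an illustration, not dyncall’s implementation: the names eabi_state, eabi_loc and eabi_assign are hypothetical, only scalar arguments are modeled, and it assumes that an odd register skipped for a 64-bit argument stays unused.

```c
#include <assert.h>

enum eabi_kind { EABI_INT32, EABI_FLOAT32, EABI_64BIT };

/* Number of integer ($a0-$a7) and float ($f12-$f19) argument registers
 * consumed so far; both start at 0. */
struct eabi_state { int next_gpr; int next_fpr; };

/* bank 'a': $a<reg> (for 64-bit values the first register of the pair),
 * bank 'f': $f<reg>, bank 's': passed via the stack (reg is -1). */
struct eabi_loc { char bank; int reg; };

static struct eabi_loc eabi_assign(struct eabi_state *st, enum eabi_kind k)
{
    struct eabi_loc loc = { 's', -1 };
    assert(st != 0);
    switch (k) {
    case EABI_INT32:
        if (st->next_gpr < 8) { loc.bank = 'a'; loc.reg = st->next_gpr++; }
        break;
    case EABI_FLOAT32:
        if (st->next_fpr < 8) { loc.bank = 'f'; loc.reg = 12 + st->next_fpr++; }
        break;
    case EABI_64BIT:
        /* 64-bit integers and doubles start at an even register number */
        st->next_gpr = (st->next_gpr + 1) & ~1;
        if (st->next_gpr < 8) {
            loc.bank = 'a';
            loc.reg = st->next_gpr;
            st->next_gpr += 2;
        }
        break;
    }
    return loc;
}
```

For a call like f(int, double, float, long long), this sketch assigns $a0, the pair $a2/$a3 (skipping $a1), $f12 and the pair $a4/$a5, in that order.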

Return values

results are returned in $v0 (32-bit), $v0 and $v1 (64-bit), $f0 or $f0 and $f2 (2 × 32 bit ﬂoat e.g. complex)
for non-trivial C++ aggregates, the caller allocates space, passes a pointer to it to the callee as a hidden first param (meaning in $a0), and the callee writes the return value to this space; the ptr to the aggregate is returned in $v0
aggregates (struct, union) <= 64bit are returned like an integer (aligned within the register according to endianness)
all other aggregates (struct, union) are returned in a space allocated by the caller, with a pointer to it passed as first parameter to the function called (meaning in $a0); the ptr to the aggregate is returned in $v0

Stack layout

Stack directly after function prolog:

      ...
  ---- caller’s frame ----
  register save area
  local data
  parameter area:
      stack parameters:
          last arg
          ...
          first arg passed via stack
  ---- current frame ----
  parameter area (continued):
      spill area (if needed):
          $a7
          ...
          $a? (first unnamed reg)
  register save area (with return address)
  local data
  parameter area
      ...

Figure 19: Stack layout on MIPS EABI 32-bit calling convention

MIPS O32 32-bit Calling Convention

Register usage

Name	Alias	Brief description

$0	$zero	hardware zero
$1	$at	assembler temporary
$2-$3	$v0-$v1	return value (only integer on hard-ﬂoat targets), scratch
$4-$7	$a0-$a3	ﬁrst arguments (only integer on hard-ﬂoat targets), scratch
$8-$15,$24	$t0-$t7,$t8	temporaries, scratch
$25	$t9	temporary, holds address of called function for PIC calls (by convention)
$16-$23	$s0-$s7	preserved
$26,$27	$k0,$k1	reserved for kernel
$28	$gp	global pointer, preserved by caller
$29	$sp	stack pointer, preserve
$30	$s8/$fp	frame pointer (some assemblers name it $fp), preserve
$31	$ra	return address, preserve
hi, lo		multiply/divide special registers
$f0-$f3		only on hard-ﬂoat targets: ﬂoat return value, scratch
$f4-$f11,$f16-$f19		only on hard-ﬂoat targets: ﬂoat temporaries, scratch
$f12-$f15		only on hard-ﬂoat targets: ﬁrst ﬂoating point arguments, scratch
$f20-$f31		only on hard-ﬂoat targets: preserved

Table 31: Register usage on MIPS O32 calling convention

Parameter passing

Stack grows down
Stack parameter order: right-to-left
Caller cleans up the stack
Caller is required to always leave a 16-byte spill area for $a0-$a3 at the end of its frame, to be used and spilled to by the callee, if needed
The diﬀerent stack areas (local data, register save area, parameter area) are each aligned to 8 bytes
generally, ﬁrst four 32bit arguments are passed in registers $a0-$a3, respectively (only on hard-ﬂoat targets: see below for exceptions if ﬁrst arg is a ﬂoat)
subsequent parameters are passed via the stack
64-bit params passed via registers are passed using either two registers (starting at an even register number, skipping an odd one if necessary), or via the stack using an 8-byte alignment
only on hard-ﬂoat targets: if the very ﬁrst call argument is a ﬂoat, up to 2 ﬂoats or doubles can be passed via $f12 and $f14, respectively, for ﬁrst and second argument
only on hard-ﬂoat targets: if any arguments are passed via ﬂoat registers, skip $a0-$a3 for subsequent arguments as if the values were passed via them
only on hard-float targets: note that if the first argument is not a float but the second one is, the second one gets passed via the $a? registers
single precision ﬂoat parameters (32 bit) are right-justiﬁed in their 8-byte slot on the stack on big endian targets, as they aren’t promoted
aggregates (struct, union) are passed as a sequence of words like integers, no matter the ﬁelds or if hard-ﬂoat target (splitting across registers and stack is allowed)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
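
The interplay of the integer and float register rules above can be sketched as follows. This is an illustration under stated assumptions, not dyncall’s implementation: the names o32_state, o32_loc and o32_assign are hypothetical, only scalar arguments are modeled, and stack positions are given as byte offsets from the start of the $a0-$a3 spill area. For soft-float targets, the state would be initialized with fp_ok set to false.

```c
#include <assert.h>
#include <stdbool.h>

enum o32_kind { O32_INT32, O32_INT64, O32_FLOAT, O32_DOUBLE };

/* word:    index of the next 32-bit argument word (starts at 0)
 * fp_used: number of float argument registers used so far (starts at 0)
 * fp_ok:   true while only float arguments have been seen (starts as true
 *          on hard-float targets) */
struct o32_state { int word; int fp_used; bool fp_ok; };

/* bank 'a': $a<idx> (first register for 64-bit values),
 * bank 'f': $f<idx>,
 * bank 's': stack, idx = byte offset from the start of the spill area */
struct o32_loc { char bank; int idx; };

static struct o32_loc o32_assign(struct o32_state *st, enum o32_kind k)
{
    bool is_fp = (k == O32_FLOAT || k == O32_DOUBLE);
    int words = (k == O32_INT64 || k == O32_DOUBLE) ? 2 : 1;
    struct o32_loc loc;

    assert(st != 0);
    if (words == 2)
        st->word = (st->word + 1) & ~1; /* even register / 8-byte alignment */

    if (is_fp && st->fp_ok && st->fp_used < 2) {
        loc.bank = 'f';
        loc.idx = 12 + 2 * st->fp_used; /* $f12, then $f14 */
        st->fp_used++;
    } else {
        st->fp_ok = false; /* float registers are no longer an option */
        if (st->word < 4) { loc.bank = 'a'; loc.idx = st->word; }
        else              { loc.bank = 's'; loc.idx = 4 * st->word; }
    }
    st->word += words; /* $a registers are skipped even if $f ones were used */
    return loc;
}
```

With this sketch, f(float, double, int) uses $f12, $f14 and the stack (offset 16), whereas f(int, float) uses $a0 and $a1.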

Return values

results are returned in $v0 and $v1, with $v0 for all values < 64bit (only integer on hard-ﬂoat targets)
only on hard-float targets: floating point results are returned in $f0 (32-bit float), or $f0 and $f1 (64bit float)
aggregates (struct, union) are returned in a space allocated by the caller, with a pointer to it passed as first parameter to the function called (meaning in $a0); the ptr to the aggregate is returned in $v0

Stack layout

Stack directly after function prolog:

      ...
  ---- caller’s frame ----
  register save area (with return address)
  local data (and padding)
  parameter area:
      padding (if needed)
      stack parameters:
          last arg
          ...
          first arg passed via stack
      spill area:
          $a3
          $a2
          $a1
          $a0
  ---- current frame ----
  register save area
  local data
  parameter area
      ...

Figure 20: Stack layout on MIPS O32 calling convention

MIPS64 Calling Conventions

Overview

There are two main ABIs in use for MIPS64 chips, N64[40] and N32[40]. Both are basically the same, except that N32 uses ILP32 as programming model (32-bit pointers and long integers), whereas N64 uses LP64 (64-bit pointers and long integers). All registers of a MIPS64 chip are considered to be 64-bit wide, even for the N32 calling convention.
The word size is deﬁned to be 32 bits, a dword 64 bits. Note that this is due to historical reasons (terminology didn’t change from MIPS32).
Other than that, there are corresponding 64-bit versions of other MIPS32 ABIs, e.g. the EABI[41] and O64[39].

dyncall support

For MIPS 64-bit machines, dyncall supports the N64 calling convention for calls and callbacks (for all four combinations of big/little-endian and soft/hard-float targets). The N32 calling convention might work - it used to, but hasn’t been tested recently.

MIPS N64 Calling Convention

Register usage

Name	Alias	Brief description

$0	$zero	hardware zero
$1	$at	assembler temporary, scratch
$2-$3	$v0-$v1	return value (only integers on hard-ﬂoat targets), scratch
$4-$11	$a0-$a7	ﬁrst arguments (only integers on hard-ﬂoat targets), scratch
$12-$15,$24	$t4-$t7,$t8	temporaries, scratch
$25	$t9	temporary, address of callee for all PIC calls (by convention), scratch
$16-$23	$s0-$s7	preserve
$26,$27	$k0,$k1	reserved for kernel
$28	$gp	global pointer, preserve
$29	$sp	stack pointer, preserve
$30	$s8	frame pointer, preserve
$31	$ra	return address, preserve
hi, lo		multiply/divide special registers
$f0,$f2		only on hard-ﬂoat targets: ﬂoat return values, scratch
$f1,$f3,$f4-$f11		only on hard-ﬂoat targets: ﬂoat temporaries, scratch
$f12-$f19		only on hard-ﬂoat targets: ﬂoat arguments, scratch
$f20-$f23		only on hard-ﬂoat targets: ﬂoat temporaries, scratch
$f24-$f31		only on hard-ﬂoat targets: preserved

Table 32: Register usage on MIPS N64 calling convention

Parameter passing

Stack grows down
Stack parameter order: right-to-left
Caller cleans up the stack
generally, first 8 params <= 64-bit are passed via registers
for hard-ﬂoat targets: register arguments are passed via $a0-$a7 for integers and $f12-$f19 for ﬂoats - with mixed ﬂoat and int parameters, some registers are left out (e.g. ﬁrst parameter ends up in $a0 or $f12, second in $a1 or $f13, etc.)
for soft-ﬂoat targets: register arguments are passed via $a0-$a7
subsequent arguments are pushed onto the stack
all stack entries are 64-bit aligned
all stack regions are 16-byte aligned
if the callee takes the address of one of the parameters and uses it to address other unnamed parameters (e.g. varargs) it has to copy - in its prolog - the argument registers to a reserved stack area adjacent to the other parameters on the stack (only the unnamed integer parameters require saving, though)
float arguments passed in the variable part of a vararg call are passed like integers, meaning float registers never need to be saved that way, so only $a0-$a7 need to be spilled
quad precision ﬂoat arguments are passed in even-odd register pairs, skipping one register if needed
integer parameters < 64 bit are right-justiﬁed (meaning occupy higher-address bytes) in their 8-byte slot on the stack, requiring extra-care for big-endian targets
single precision float parameters (32 bit) are left-justified in their 8-byte slot on the stack, but are right-justified in fp-registers on big endian targets, as they aren’t promoted (the official docs actually say “undecided”, but real world implementations seem to use what is described here)
aggregates (struct, union) are passed as a sequence of dwords (in integer registers and on the stack), with the following particularities:
- if a dword happens to be a double precision floating point struct field, it is passed in a floating point register
- array and union fields are always passed like integers (even if their type is float or double)
- splitting an argument across registers and the stack is fine
non-trivial C++ aggregates (as defined by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
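
The positional register assignment and the placement of small integers in stack slots can be sketched as follows. This is an illustration, not dyncall’s implementation: the names n64_loc, n64_assign and n64_int_slot_offset are hypothetical, and the sketch assumes that every argument is a scalar occupying exactly one 8-byte slot (no quad precision values or aggregates). That small integers start at offset 0 of their slot on little-endian targets is an assumption of this sketch, based on them being stored as full 64-bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* bank 'a': $a<idx>, bank 'f': $f<idx>,
 * bank 's': stack, idx = index of the 8-byte slot, counted from the first
 * argument passed via the stack */
struct n64_loc { char bank; int idx; };

/* position: zero-based index of the argument in the call */
static struct n64_loc n64_assign(int position, bool is_float, bool hard_float)
{
    struct n64_loc loc;
    assert(position >= 0);
    if (position >= 8) {
        loc.bank = 's'; loc.idx = position - 8;
    } else if (is_float && hard_float) {
        loc.bank = 'f'; loc.idx = 12 + position; /* $a<position> is left out */
    } else {
        loc.bank = 'a'; loc.idx = position;      /* $f<12+position> is left out */
    }
    return loc;
}

/* Byte offset of an integer of `size` bytes (< 8) within its 8-byte stack
 * slot; right-justified means it occupies the higher-address bytes on
 * big-endian targets. */
static size_t n64_int_slot_offset(size_t size, bool big_endian)
{
    assert(size == 1 || size == 2 || size == 4);
    return big_endian ? 8 - size : 0;
}
```

For example, on a hard-float target the second argument of f(int, float) ends up in $f13, and a 32-bit integer passed via the stack on a big-endian target starts at byte 4 of its slot.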

Return values

results are returned in $v0, and for a second one $v1 is used
only on hard-ﬂoat targets: ﬂoating point results are returned in $f0 (and $f2 if needed)
only on hard-ﬂoat targets: structs with only one or two ﬂoating point ﬁelds are returned in $f0 (and $f2 if necessary), ﬁeld-by-ﬁeld
for non-trivial C++ aggregates, the caller allocates space, passes a pointer to it to the callee as a hidden first param (meaning in $a0), and the callee writes the return value to this space; the ptr to the aggregate is returned in $v0
any other aggregates (struct, union) <= 16 bytes are returned via registers $v0 (and $v1 if necessary), dword-by-dword
all other aggregates (struct, union) >16 bytes are returned in a space allocated by the caller, with a pointer to it passed as first parameter to the function called (meaning in $a0); the ptr to the aggregate is returned in $v0
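
The decision order for aggregate return values listed above can be summarized in a small classifier. This is a sketch only: the type and function names (n64_aggr, n64_ret, n64_classify_return) are hypothetical, and the caller of this helper is assumed to have determined already whether the aggregate is non-trivial and whether it consists of only one or two floating point fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum n64_ret {
    N64_RET_FP_REGS,  /* $f0 (and $f2), field-by-field */
    N64_RET_INT_REGS, /* $v0 (and $v1), dword-by-dword */
    N64_RET_MEMORY    /* caller-allocated space, pointer passed in $a0 */
};

struct n64_aggr {
    size_t size;         /* size in bytes */
    bool non_trivial;    /* non-trivial C++ aggregate */
    bool only_fp_fields; /* struct with only one or two floating point fields */
};

static enum n64_ret n64_classify_return(const struct n64_aggr *a,
                                        bool hard_float)
{
    assert(a != NULL);
    if (a->non_trivial)
        return N64_RET_MEMORY;
    if (hard_float && a->only_fp_fields)
        return N64_RET_FP_REGS;
    if (a->size <= 16)
        return N64_RET_INT_REGS;
    return N64_RET_MEMORY;
}
```

A struct of two doubles is thus classified as returned via $f0 and $f2 on hard-float targets, but via $v0 and $v1 on soft-float targets.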

Stack layout

Stack directly after function prolog:

      ...
  ---- caller’s frame ----
  register save area
  local data
  parameter area:
      stack parameters:
          arg n-1
          ...
          arg 8
  ---- current frame ----
  parameter area (continued):
      spill area (if needed):
          $a7
          ...
          $a? (first unnamed reg)
  register save area (with return address)
  local data
  parameter area
      ...

Figure 21: Stack layout on MIPS N64 calling convention

MIPS N32 Calling Convention

Despite what one might think given the name, this is a MIPS 64-bit calling convention. As mentioned in the overview of this chapter, it is nearly identical to the N64 one, the diﬀerences being:

uses ILP32 as programming model instead of LP64
ﬂoating point registers $f20-$f23 are to be preserved

SPARC Calling Conventions

Overview

The SPARC family of processors is based on the SPARC instruction set architecture, which comes in basically three revisions, V7, V8[29][27] and V9[30][28]. The former two are 32-bit whereas the latter refers to the 64-bit SPARC architecture (see next chapter). SPARC uses big endian byte order.
The word size is defined to be 32 bits.

dyncall support

dyncall fully supports the SPARC 32-bit instruction set (V7 and V8), for calls and callbacks.

SPARC (32-bit) Calling Convention

Register usage

32 single ﬂoating point registers (f0-f31, usable as 8 quad precision q0,q4,q8,...,q28, 16 double precision d0,d2,d4,...,d30)
32 32-bit integer/pointer registers out of a bigger (vendor/model dependent) number that are accessible at a time (8 are global ones (g*), whereas the remaining 24 form a register window with 8 input (i*), 8 output (o*) and 8 local (l*) ones)
calling a function shifts the register window, the old output registers become the new input registers (old local and input ones are not accessible anymore)

Name	Alias	Brief description

%g0	%r0	Read-only, hardwired to 0
%g1-%g7	%r1-%r7	Global
%o0,%o1 and %i0,%i1	%r8,%r9 and %r24,%r25	Output and input argument registers, return value
%o2-%o5 and %i2-%i5	%r10-%r13 and %r26-%r29	Output and input argument registers
%o6 and %i6	%r14 and %r30, %sp and %fp	Stack and frame pointer
%o7 and %i7	%r15 and %r31	Return address (caller writes to o7, callee uses i7)
%l0-%l7	%r16-%r23	preserve
%f0,%f1		Floating point return value
%f2-%f31		scratch

Table 33: Register usage on sparc calling convention

Parameter passing

stack grows down
stack parameter order: right-to-left
caller cleans up the stack
stack always aligned to 8 bytes
ﬁrst 6 integers/pointers and ﬂoats are passed independently in registers using %o0-%o5
for every other argument the stack is used
all arguments <= 32 bit are passed as 32 bit values
64 bit arguments are passed like two consecutive <= 32 bit values (which allows for an argument to be split between the stack and %i5)
aggregates (struct, union) of any size, as well as quad precision values are passed indirectly as a pointer to a copy of the aggregate (like: struct s2 = s; callee(&s2);)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
minimum stack size is 64 bytes, because the stack pointer must always point at enough space to store all %i* and %l* registers, used when running out of register windows
if needed, register spill area is adjacent to parameters
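
Because arguments are assigned word by word, a 64-bit argument can straddle the last register and the stack. The following sketch illustrates this; it is not dyncall’s implementation, and the names sparc32_state, sparc32_loc and sparc32_assign are hypothetical. Aggregates and quad precision values count as one word here, since they are passed as a pointer.

```c
#include <assert.h>

/* Index of the next 32-bit argument word; starts at 0. */
struct sparc32_state { int word; };

/* first_reg:   n of the first %o<n> used by the argument, or -1 if none
 * reg_words:   number of the argument's words passed in %o registers
 * stack_words: number of the argument's words passed via the stack */
struct sparc32_loc { int first_reg; int reg_words; int stack_words; };

/* words: 1 for arguments <= 32 bit, 2 for 64-bit arguments */
static struct sparc32_loc sparc32_assign(struct sparc32_state *st, int words)
{
    struct sparc32_loc loc = { -1, 0, 0 };
    assert(st != 0);
    assert(words == 1 || words == 2);
    for (int i = 0; i < words; ++i, ++st->word) {
        if (st->word < 6) {          /* %o0-%o5 */
            if (loc.first_reg < 0)
                loc.first_reg = st->word;
            loc.reg_words++;
        } else {                     /* 7th word of arg data and beyond */
            loc.stack_words++;
        }
    }
    return loc;
}
```

If five words have already been assigned, a subsequent double gets one word in %o5 (seen as %i5 by the callee) and one word on the stack.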

Return values

results are expected by caller to be returned in %o0/%o1 (after reg window restore, meaning callee writes to %i0/%i1) for integers
%f0/%f1 are used for ﬂoating point values
aggregates (struct, union) and quad precision values are returned in a space allocated by the caller, with a pointer to it passed as an additional, hidden stack parameter (always at %sp+64 for the caller, see below); that pointer is returned in %o0

Stack layout

Stack directly after function prolog:

      ...
  ---- caller’s frame ----
  local data (and padding)
  parameter area:
      stack parameters:
          arg n-1
          ...
          7th word of arg data
      spill area:
          %o5
          ...
          %o0
  struct/union return pointer
  ---- current frame ----
  register save area (%i* and %l*)
  local data (and padding)
  parameter area
      ...

Figure 22: Stack layout on sparc32 calling convention

SPARC64 Calling Conventions

Overview

The SPARC family of processors is based on the SPARC instruction set architecture, which comes in basically three revisions, V7, V8[29][27][31] and V9[30][28][31]. The former two are 32-bit (see previous chapter) whereas the latter refers to the 64-bit SPARC architecture. SPARC uses big endian byte order; V9 also supports little endian byte order, but for data access only, not instruction access.

There are two proposals, one from Sun and one from Hal, which disagree on how to handle some aspects of this calling convention.

dyncall support

dyncall fully supports the SPARC 64-bit instruction set (V9), for calls and callbacks.

SPARC (64-bit) Calling Convention

32 double precision ﬂoating point registers (d0,d2,d4,...,d62, usable as 16 quad precision ones q0,q4,q8,...q60, and also ﬁrst half of them are usable as 32 single precision registers f0-f31)
32 64-bit integer/pointer registers out of a bigger (vendor/model dependent) number that are accessible at a time (8 are global ones (g*), whereas the remaining 24 form a register window with 8 input (i*), 8 output (o*) and 8 local (l*) ones)
calling a function shifts the register window, the old output registers become the new input registers (old local and input ones are not accessible anymore)
stack and frame pointer are oﬀset by a BIAS of 2047 (see oﬃcial doc for reasons)
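
As a small illustration of the BIAS, the following sketch converts a biased register value into the address it refers to. It assumes, following the SPARC V9 ABI, that the real address is the register value plus 2047, and that the 16-byte frame alignment applies to the unbiased address; the names SPARC64_STACK_BIAS, sparc64_unbias and sparc64_frame_is_aligned are hypothetical and not part of dyncall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPARC64_STACK_BIAS 2047u /* assumption: address = %sp + BIAS */

/* Converts the value held in %sp or %fp into the address it refers to. */
static uint64_t sparc64_unbias(uint64_t reg_value)
{
    assert(reg_value <= UINT64_MAX - SPARC64_STACK_BIAS); /* no wrap-around */
    return reg_value + SPARC64_STACK_BIAS;
}

/* Checks the 16-byte frame alignment on the unbiased address. */
static bool sparc64_frame_is_aligned(uint64_t reg_value)
{
    return sparc64_unbias(reg_value) % 16 == 0;
}
```

Under these assumptions, a correctly aligned frame always shows up as an odd value in %sp, e.g. 0x7fe801 for a frame at 0x7ff000.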

Name	Alias	Brief description

%g0	%r0	Read-only, hardwired to 0
%g1-%g7	%r1-%r7	Global
%o0-%o3 and %i0-%i3	%r8-%r11 and %r24-%r27	Output and input argument registers, return value
%o4,%o5 and %i4,%i5	%r12,%r13 and %r28,%r29	Output and input argument registers
%o6 and %i6	%r14 and %r30, %sp and %fp	Stack and frame pointer (NOTE, oﬀset with a BIAS of 2047)
%o7 and %i7	%r15 and %r31	Return address (caller writes to o7, callee uses i7)
%l0-%l7	%r16-%r23	preserve
%d0,%d2,%d4,%d6		scratch, Floating point arguments, return value
%d8,%d10,...,%d14		scratch, Floating point arguments
%d16,%d18,...,%d30		scratch (preserve for Hal), Floating point arguments
%d32,%d34,...,%d62		scratch (preserve for Hal)

Table 34: Register usage on sparc64 calling convention

Parameter passing

stack grows down
stack parameter order: right-to-left
caller cleans up the stack
stack frame is always aligned to 16 bytes
ﬁrst 6 integers are passed in registers using %o0-%o5
ﬁrst 8 quad precision ﬂoating point args (or 16 double precision, or 32 single precision) are passed in ﬂoating point registers (%q0,%q4,...,%q28 or %d0,%d2,...,%d30 or %f0-%f31, respectively)
for every other argument the stack is used
single precision floating point args are passed in odd %f* registers, and are “right aligned” in their 8-byte space on the stack
for every argument passed, corresponding %o*, %f* register or stack space is skipped (e.g. passing a double as 3rd call argument, %d4 is used and %o2 is skipped)
all arguments <= 64 bit are passed as 64 bit values
minimum stack size is 128 bytes, because the stack pointer must always point at enough space to store all %i* and %l* registers, used when running out of register windows
if needed, register spill area (both integer and float arguments are spilled in order) is adjacent to parameters
structs with only one field are passed as if the param were the field itself
structs <= 16 bytes (which have more than one ﬁeld) are passed ﬁeld-by-ﬁeld, however evaluated as a sequence of 8-byte parameter slots
- note that due to aggregate alignment rules, any ﬂoating point value is either the entire slot (for double precision) or exactly one half
- ﬁelds are left justiﬁed in register or stack slots
- integers in a slot are passed as such (either via %o* registers or the stack)
- single precision ﬂoats (using half of the slot) use even numbered %f* registers when they occupy the left half, odd numbered ones otherwise (no register skipping logic applied within a slot)
- splitting struct ﬁelds between registers and stack is allowed
unions <= 16 bytes passed by-value are passed like integers in left-justiﬁed 8-byte slots (either via %o* registers or the stack)
aggregates (struct, union) and types > 16 bytes are passed indirectly, as a pointer to a correctly aligned copy of the data (that copy can be avoided under certain conditions)
non-trivial C++ aggregates (as deﬁned by the language) of any size, are passed indirectly via a pointer to a copy of the aggregate
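
The slot-based assignment for scalar arguments can be sketched as follows. This is an illustration, not dyncall’s implementation: the names sparc64_loc and sparc64_assign are hypothetical, each scalar argument is assumed to occupy exactly one 8-byte parameter slot (quad precision values and aggregates are not modeled), and single precision floats are assumed to use the odd %f register of their slot, as described above.

```c
#include <assert.h>

enum sparc64_kind { SPARC64_INT, SPARC64_FLOAT, SPARC64_DOUBLE };

/* bank 'o': %o<idx>, bank 'd': %d<idx>, bank 'f': %f<idx>,
 * bank 's': stack, idx = parameter slot number */
struct sparc64_loc { char bank; int idx; };

/* slot: zero-based index of the argument's 8-byte parameter slot */
static struct sparc64_loc sparc64_assign(int slot, enum sparc64_kind k)
{
    struct sparc64_loc loc = { 's', slot };
    assert(slot >= 0);
    if (k == SPARC64_INT && slot < 6) {
        loc.bank = 'o'; loc.idx = slot;
    } else if (k == SPARC64_DOUBLE && slot < 16) {
        loc.bank = 'd'; loc.idx = 2 * slot;     /* %d0, %d2, ..., %d30 */
    } else if (k == SPARC64_FLOAT && slot < 16) {
        loc.bank = 'f'; loc.idx = 2 * slot + 1; /* odd %f* register */
    }
    return loc;
}
```

This reproduces the example given above: a double passed as 3rd call argument (slot 2) is assigned to %d4, while %o2 stays unused.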

Return values

results are expected by caller to be returned in %o0-%o3 (after reg window restore, meaning callee writes to %i0-%i3) for integers
%d0,%d2,%d4,%d6 are used for ﬂoating point values
for non-trivial C++ aggregates, the caller allocates space, passes pointer to it to the callee as a hidden ﬁrst param (meaning in %o0), and callee writes return value to this space; the ptr to the aggregate is returned in the same register (after reg window restore)
the ﬁelds of aggregates (struct, union) <= 32 bytes are returned via registers mentioned above (which are assigned following the same logic as when passing the aggregate as a ﬁrst argument to a function)
aggregates (struct, union) >32 bytes are returned in a space allocated by the caller, with a pointer to it passed as ﬁrst parameter to the function called (meaning in %o0)

Stack layout

Stack directly after function prolog:

      ...
  ---- caller’s frame ----
  local data (and padding)
  parameter area:
      stack parameters:
          arg n-1
          ...
          arg 6
      spill area:
          %o5
          ...
          %o0
  ---- current frame ----
  register save area (%i* and %l*)
  local data (and padding)
  parameter area
      ...

Figure 23: Stack layout on sparc64 calling convention
