\documentclass[11pt]{article}
\usepackage[round]{natbib}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{fancyvrb}
\usepackage{verbatim}
\usepackage{alltt,graphicx}
\usepackage{fullpage}
\bibliographystyle{abbrvnat}
\newcommand{\file}[1]{{`\normalfont\textsf{#1}'}}
\newcommand{\strong}[1]{\texorpdfstring%
{{\normalfont\fontseries{b}\selectfont #1}}%
{#1}}
\let\pkg=\strong
\makeatletter
\newcommand\code{\bgroup\@codex}
\def\@codex#1{\texorpdfstring%
{{\normalfont\ttfamily\hyphenchar\font=-1 #1}}%
{#1}\egroup}
\makeatother
\newenvironment{smallverbatim}{\small\verbatim}{\endverbatim}
\newenvironment{example}{\begin{alltt}}{\end{alltt}}
\newenvironment{smallexample}{\begin{alltt}\small}{\end{alltt}}
\begin{document}
\title{Foreign Library Interface}
%\VignetteIndexEntry{Foreign Library Interface}
\author{by Daniel Adler}
\maketitle
\abstract{
We present an improved Foreign Function Interface (FFI) for R to call arbitrary native functions without the need for C wrapper code.
Further, we discuss a dynamic linkage framework for binding standard C libraries to R across platforms using a universal type information format.
The package \pkg{rdyncall} comprises the framework and an initial repository of cross-platform bindings for standard libraries such as (legacy and modern) \emph{OpenGL}, the family of \emph{SDL} libraries and \emph{Expat}.
The package enables system-level programming using the R language; sample applications are given in the article.
We outline the underlying automation tool-chain that extracts cross-platform bindings from C headers, making the repository extendable and open for library developers.
}

\section{Introduction}

\begin{table*}
\centering
\begin{tabular}{l|l|c|c|c}
Lib/DynPort & Description & Functions & Constants & Aggregate types \\
\hline
\code{gl} & OpenGL & 337 & 3253 & - \\
\code{glu} & OpenGL utility & 59 & 154 & - \\
\code{r} & R library & 238 & 700 & 27 \\
\code{sdl} & Audio/video/UI abstraction & 203 & 465 & 51 \\
\code{sdl\_image} & Pixel format loaders & 29 & - & - \\
\code{sdl\_mixer} & Music format loaders and playing & 63 & 12 & - \\
\code{sdl\_ttf} & Font format loaders & 35 & 9 & - \\
\code{cuda} & GPU programming & 387 & 665 & 84 \\
\code{expat} & XML parsing framework & 65 & 70 & - \\
\code{glew} & GL extensions & 1465 & - & - \\
\code{gl3} & OpenGL 3 (strict) & 324 & 838 & 1 \\
\code{opencl} & GPU programming & 78 & 260 & 10 \\
\code{stdio} & Standard I/O & 76 & 3 & - \\
\end{tabular}
\caption{\label{tab:libs} Overview of available DynPorts for portable C libraries}
\end{table*}

We present an improved Foreign Function Interface (FFI) for R that significantly reduces the amount of C wrapper code needed to interface with C.
We also introduce a \emph{dynamic} linkage that binds the C interface of a pre-compiled library (\emph{as a whole}) to an interpreted programming environment \citep{Oust97a} such as R, hence the name \emph{Foreign Library Interface}.
Table \ref{tab:libs} lists the C libraries currently supported across major R platforms.
For each supported library, abstract interface specifications are declared in a compact, platform-neutral, text-based format stored in so-called \emph{DynPort} files in a local repository.
%between high-level interpreted programming environments
%and native pre-compiled C libraries that uses a compact text-based
%interface and type information format that makes this method work across platforms.
R \citep{R:Ihaka+Gentleman:1996} was chosen as the first language for a proof-of-concept implementation of this approach.
This article describes the \pkg{rdyncall} package, which implements a complete toolkit of low-level facilities that can be used as an alternative FFI to interface with the C programming language.
It further enables direct and quick access to common C libraries from R without compilation.

The project was motivated by the fact that high-quality software solutions implemented in portable C are often not available in interpreter-based languages such as R.
The pool of freely available C libraries is quite large and represents an invaluable resource for software development.
For example, OpenGL \citep{Board05} is the most portable and standard interface to accelerated graphics hardware for developing real-time graphics software.
The combination of OpenGL with the \emph{Simple DirectMedia Layer} (SDL) \citep{SDL} core and extension libraries offers a foundation framework for developing interactive multimedia applications that can run on a multitude of platforms.
Other libraries such as the Expat XML Parser \citep{www:expat} provide a parser framework for processing very large XML documents.
Even the C library of R contains high-quality statistical functions that are useful in the context of other languages as well.

To make use of these libraries within high-level languages, \emph{language bindings} to the library must be written as an extension to the language, a task that requires deep familiarity with the internals of both the library and the interpreter.
Depending on the complexity of the library, the amount of work needed to wrap the interface can be very large (Table \ref{tab:libs} gives the counts of functions, constants and types that need to be wrapped).
Rather than writing a separate binding for each \emph{library and language} combination, we investigate a dynamic binding approach that is adaptable to interpreters and works across platforms without additional compilation of wrapper layers.
Once the binding specification for a library has been written, that library becomes automatically accessible to all interpreters that implement the framework outlined here.

Extension techniques offered by the language interpreter, such as a \emph{Foreign Function Interface} (FFI), are the fundamental technology for bridging the dynamic interpreter with statically pre-compiled code.
In the case of R, the built-in FFI function \code{.C} provides a fairly basic call gate to C code with strong limitations; additional wrapper code has to be written to interface with standard C libraries.
\pkg{rdyncall} contributes an improved FFI for R that offers a \emph{flexible} and \emph{type-safe} interface with support for almost all C types without requiring additional C wrappers.
Based on this FFI, the package contains a proof-of-concept implementation of a \emph{Foreign Library Interface} that enables \emph{direct} and \emph{dynamic} interoperability with foreign C libraries (including shared library code and the Application Programming Interface specified in C headers) from within the R interpreter.

For each C library supported, abstract interface specifications are declared in a compact, platform-neutral, text-based format stored in a so-called \emph{DynPort} file located in a local repository within the package.
Table \ref{tab:libs} gives a sample list of available bindings that come with the package.
Users gain access to C libraries from R using the front-end function \code{dynport(}\emph{portname}\code{)}, which processes a \emph{DynPort} file to load the C library\footnote{Pre-compiled libraries need to be installed; OS-specific installation notes are given in the documentation of the package.} and wrap the C interface as a newly attached R environment\footnote{Note that \pkg{rdyncall} version 0.7.4 and below uses R name space objects \citep{RNameSpace} as dynport containers.
This has changed starting with version 0.7.5 because packages hosted on CRAN are not allowed to use internal functions.
Since R currently has no public interface for the creation of name space objects, \pkg{rdyncall} uses ordinary environment objects for now.
This disables the use of the double colon operator (\code{::}) to refer to dynport objects; unloading is done using \code{detach(dynport:<PORTNAME>)}.} that uses the same symbolic names as the C API.
R code that uses C interfaces via \emph{DynPort}s looks very similar to the corresponding C user code.

This article motivates the topic with a comparison of the built-in and contributed FFI by means of a simple use case.
This leads to a detailed description of the improved FFI.
Then follows an overview of the package and a brief tour through the framework with details on the handling of foreign C data types and wrapping R functions as callbacks.
Two sample applications are given using OpenGL, SDL and Expat.
The article ends with a brief description of the implementation based on C libraries from the \emph{DynCall} project \citep{dyncall} and the tool-chain that was used to create the repository of \emph{DynPort} files.

\section{Foreign Function Interfaces}

FFIs provide the backbone of a language to interface with foreign code.
Depending on the design of this service, it can largely unburden developers from writing additional wrapper code.
In this section, we compare the built-in FFI with the improved FFI provided by \pkg{rdyncall} using a simple example that sketches the different workflows for making an R binding to a function from a foreign C library.

\subsection{FFI of base R}

Suppose that we wish to invoke the C function \code{sqrt} of the C Standard Math library.
The function is declared as follows in C:
\begin{verbatim}
double sqrt(double x);
\end{verbatim}
R offers a number of functions to call pre-compiled code from within the R interpreter.
While \code{.Call} and \code{.External} are designed for interoperability with \emph{extension} code, \code{.C} and \code{.Fortran} seem to offer the most low-level interoperability with \emph{foreign} code.
But \code{.C} also has very strict conversion rules and strong limitations regarding argument and return types: \code{.C} passes R arguments as C pointers, and C return types are not supported, so only C \code{void} functions, which are procedures, can be called.
Given these limitations, we are not able to invoke the foreign \code{sqrt} function directly and need some intermediate wrapper code written in C that obeys the rules of the \code{.C} interface:
\begin{smallverbatim}
#include <math.h>

/* .C passes every argument as a pointer; the result is written back in place. */
void R_C_sqrt(double * ptr_to_x)
{
  double x = ptr_to_x[0], ans;
  ans = sqrt(x);
  ptr_to_x[0] = ans;
}
\end{smallverbatim}
We assume that the wrapper code is deployed as a shared library in a package named \emph{testsqrt} which links to the C math library\footnote{We omit details such as registering C functions, which are described in the R manual \emph{Writing R Extensions} \citep{RExt}.}.
Then we load the \emph{testsqrt} package and call the C wrapper function directly via \code{.C}.
\begin{example}
> library(testsqrt)
> .C("R_C_sqrt", 144, PACKAGE="testsqrt")
[[1]]
[1] 12
\end{example}
To make \code{sqrt} available as a public function, an additional R wrapper layer is added that performs type-safety checks before issuing the \code{.C} call.
\begin{smallverbatim}
sqrtViaC <- function(x)
{
  x <- as.numeric(x) # type(x) should be C double.
  # make sure length > 0:
  length(x) <- max(1, length(x))
  # .C returns a list of its (possibly modified) arguments:
  .C("R_C_sqrt", x, PACKAGE="testsqrt")[[1]]
}
\end{smallverbatim}
As an alternative, R also provides high-level C extension interfaces such as \code{.Call} and \code{.External}, which give access to R internals at the C level and make it possible to perform type-safety checks in C:
\begin{smallverbatim}
#include <R.h>
#include <Rinternals.h>
#include <math.h>

SEXP R_Call_sqrt(SEXP x)
{
  SEXP ans = R_NilValue, tmp;
  /* coerce to a double vector and protect it from garbage collection */
  PROTECT( tmp = coerceVector(x, REALSXP) );
  if (LENGTH(tmp) > 0) {
    double y = REAL(tmp)[0], result;
    result = sqrt(y);
    ans = ScalarReal(result);
  }
  UNPROTECT(1);
  return ans;
}
\end{smallverbatim}
Now the corresponding R wrapper shrinks into a simple delegate:
\begin{example}
> sqrtViaCall <- function(x)
+   .Call("R_Call_sqrt", x, PACKAGE="testsqrt")
\end{example}
The third alternative, via \code{.External}, is omitted here; it has a different argument passing scheme, but the C and R wrapper implementations would look very similar.

We can conclude that, in realistic settings, the built-in FFI of R almost always needs support from a wrapper layer written in C.
The ``foreign'' in FFI is in fact relegated to the C wrapper layer.
Moreover, the R FFI can be viewed as an \emph{extension} interface for calling pre-compiled code written in a \emph{foreign} language within the context of the R implementation, rather than a direct invocation interface for code from a \emph{foreign} context such as an ordinary C library.

\subsection{FFI of rdyncall}

\begin{table*}
\begin{center}
\begin{tabular}{ll|ll}
\hline \hline
Type & Sign. & Type & Sign. \\
\hline
\verb@void@ & \verb@v@ & \verb@bool@ & \verb@B@ \\
\verb@char@ & \verb@c@ & \verb@unsigned char@ & \verb@C@ \\
\verb@short@ & \verb@s@ & \verb@unsigned short@ & \verb@S@ \\
\verb@int@ & \verb@i@ & \verb@unsigned int@ & \verb@I@ \\
\verb@long@ & \verb@j@ & \verb@unsigned long@ & \verb@J@ \\
\verb@long long@ & \verb@l@ & \verb@unsigned long long@ & \verb@L@ \\
\verb@float@ & \verb@f@ & \verb@double@ & \verb@d@ \\
\verb@void*@ & \verb@p@ & \verb@struct@ \emph{name} \verb@*@ & \verb@*<@\emph{name}\verb@>@ \\
\emph{type}\verb@*@ & \verb@*@... & \verb@const char*@ & \verb@Z@ \\
\hline \hline
\end{tabular}
\end{center}
\caption{\label{tab:signature} C/C++ Types and Signatures}
\end{table*}

\pkg{rdyncall} provides an improved FFI for R that is accessible via the function \code{.dyncall}.
In contrast to the built-in R FFI, which relies on a C wrapper layer, the \code{sqrt} function is invoked dynamically and directly by the interpreter at run-time.
Whereas the C math library was loaded implicitly via the \emph{testsqrt} package, it now has to be loaded explicitly.
R offers functions to deal with shared libraries at run-time, but the location has to be specified as an absolute pathname, which is platform-specific.
For now, let us assume that the example is run on Mac OS X, where the C math library is located at \file{/usr/lib/libm.dylib}.
A platform-portable solution is discussed below in the section \emph{Portable loading of shared libraries}.
\begin{example}
> libm <- dyn.load("/usr/lib/libm.dylib")
> sqrtAddr <- libm$sqrt$address
\end{example}
We first need to load the R package \pkg{rdyncall}:
\begin{example}
> library(rdyncall)
\end{example}
Finally, we invoke the foreign C function \code{sqrt} \emph{directly} via \code{.dyncall}:
\begin{example}
> .dyncall(sqrtAddr, "d)d", 144)
[1] 12
\end{example}
Let us review the last call, as it pinpoints the core solution for a direct invocation of foreign code within R:
The first argument specifies the address of the foreign code, given as an external pointer.
The second argument is a \emph{call signature} that specifies the argument and return types of the target C function.
The string \verb@"d)d"@ specifies that the foreign function expects a \code{double} scalar argument and returns a \code{double} scalar value, in correspondence with the C declaration of \code{sqrt}.
Arguments following the call signature are passed to the foreign function, using the call signature for type-safe conversion to C types.
In this case we pass \code{144} as the first argument, converted to a C \code{double}, and receive a C \code{double} value converted to an R \code{numeric}.

\subsection{Call Signatures}

The introduction of a type descriptor for foreign functions is a key component that makes the FFI flexible and type-safe.
The format of the call signature has the following pattern:
\begin{center}
\emph{argument-types} \verb@')'@ \emph{return-type}
\end{center}
The signature can be derived from the C function declaration:
Argument types are specified first, in left-to-right order, and are terminated by the \verb@')'@ symbol followed by a single return type signature.
Almost all fundamental C types are supported, and there is no real restriction on the number of arguments in a call.
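
As a rough illustration of this pattern, the following base R sketch splits a call signature at the \verb@')'@ separator and tokenizes the argument part.
The helper names \code{splitSignature} and \code{tokenizeTypes} are hypothetical and not part of \pkg{rdyncall}; the sketch only mirrors the textual format described above and performs no type checking.
\begin{smallverbatim}
# Hypothetical helper (not an rdyncall function): split a call
# signature into argument types and return type at the ')' separator.
splitSignature <- function(signature)
{
  pos <- regexpr(")", signature, fixed = TRUE)
  if (pos < 0) stop("missing ')' in call signature: ", signature)
  list(args = substr(signature, 1, pos - 1),
       ret  = substr(signature, pos + 1, nchar(signature)))
}

# Tokenize a type string: any number of '*' followed by either a
# '<name>' struct reference or a single type character.
tokenizeTypes <- function(types)
  regmatches(types, gregexpr("\\**(<[^>]+>|[A-Za-z])", types))[[1]]

sig <- splitSignature("*d*ii)v")
tokenizeTypes(sig$args)   # "*d" "*i" "i"
sig$ret                   # "v"
\end{smallverbatim}
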
Table \ref{tab:signature} gives an overview of supported C types and the corresponding text encoding; Table \ref{tab:signature_examples} provides some examples of C functions and call signatures.

\begin{table*}
\centering
\begin{tabular}{l|l}
C function declaration & dyncall type signature \\
\hline
\verb@void rsort_with_index(double*,int*,int n)@ & \verb@*d*ii)v@ \\
\verb@SDL_Surface * SDL_SetVideoMode(int,int,int,Uint32)@ & \verb@iiiI)*<SDL_Surface>@ \\
\verb@void glClearColor(GLfloat,GLfloat,GLfloat,GLfloat)@ & \verb@ffff)v@ \\
\end{tabular}
\caption{\label{tab:signature_examples} Some examples of C functions and corresponding signatures}
\end{table*}

Now, let us define a public and type-safe R wrapper function that hides the details of the foreign function call by passing the formal argument placeholder ``\code{...}'' as the third argument to \code{.dyncall}:
\begin{example}
> sqrtViaDyncall <- function(...)
+   .dyncall(sqrtAddr, "d)d", ...)
\end{example}
Although there is no further guard code, this interface is type-safe: thanks to the built-in type checks, the user can do no harm by inadvertently passing the wrong number or type of arguments.
Compared to the R wrapper code using \code{.C}, no explicit cast of the arguments via \code{as.numeric} is required, because automatic coercion rules for fundamental types are implemented as dictated by the call signature.
For example, \code{integer} R values are implicitly cast to \code{double}:
\begin{smallverbatim}
> sqrtViaDyncall(144L)
[1] 12
\end{smallverbatim}
All arguments to be passed to C are first checked against the call signature.
If any incompatibility is detected, such as a wrong number of arguments, empty atomic vectors or incompatible type mappings, the invocation is aborted and an error is reported before risking an application crash:
\begin{smallverbatim}
> sqrtViaDyncall(1,2)
Error in .dyncall(sqrtAddr, "d)d", ...) :
  Too many arguments for signature 'd)d'.
> sqrtViaDyncall()
Error in .dyncall(sqrtAddr, "d)d", ...) :
  Not enough arguments for function-call signature 'd)d'.
> sqrtViaDyncall(NULL)
Error in .dyncall(sqrtAddr, "d)d", ...) :
  Argument type mismatch at position 1: expected double convertible value
> sqrtViaDyncall("144")
Error in .dyncall(sqrtAddr, "d)d", ...) :
  Argument type mismatch at position 1: expected double convertible value
\end{smallverbatim}
In contrast to the R FFI, where the argument conversion is dictated solely by the R argument type at call-time in a one-way fashion, the additional specification provided by a call signature gives several advantages.
\begin{itemize}
\item Almost all possible C functions can be invoked by a single interface; no additional C wrapper is required.
\item The built-in type-safety checks of passed arguments enhance stability and significantly reduce assertion code in R wrappers.
\item A single call signature can work across platforms, given that the C function type remains constant across platforms.
\item Given that our FFI is implemented in multiple languages, call signatures represent a portable type description for C libraries.
\end{itemize}

\section{Package Overview}

Besides dynamic calling of foreign code, the package provides essential facilities for interoperability between the R and C programming languages.
A high-level overview of the components that make up the package is given in Figure \ref{fig:pkg_overview}.

\begin{figure}[h]
\centering
\includegraphics[scale=0.44]{img_overview.pdf}
\caption{\label{fig:pkg_overview} Package Overview}
\end{figure}

We already described the \code{.dyncall} FFI.
What follows is a brief description of portable loading of shared libraries using \code{dynfind}, installation of wrappers via \code{dynbind}, handling of foreign data types via \code{new.struct} and wrapping of R functions as C callbacks via \code{new.callback}.
Finally, the high-level \code{dynport} interface for accessing \emph{whole} C libraries is briefly discussed.
Low-level technical details of some components are described briefly in the section \emph{Architecture}.

\subsection{Portable loading of shared libraries}

The \emph{portable} loading of shared libraries across platforms is not trivial because library file paths differ between operating systems (OSs).
Referring back to the previous example, to load a particular library in a portable fashion, one would have to check the platform to locate the C library.\footnote{Possible C math library names are \file{libm.so}, \file{libm.so.6} and \file{MSVCRT.DLL} in locations such as \file{/lib}, \file{/usr/lib}, \file{/lib64}, \file{/lib/sparcv9}, \file{/usr/lib64}, \file{C:\textbackslash WINDOWS\textbackslash SYSTEM32} etc.}

Although there is variation among the OSs, library file paths and search patterns have common structures.
For example, among all the different locations, prefixes and suffixes, there is a part within a full library filename that can be taken as a \emph{short library name} or label.
The function \code{dynfind} takes a list of short library names to locate a library using common search heuristics.
For example, to load the Standard C Math library, one would either use the Microsoft Visual C Run-Time library labeled \file{msvcrt} on Windows or the C Math library labeled \file{m} or \file{m.so.6} otherwise.
\begin{example}
> mLib <- dynfind(c("msvcrt","m","m.so.6"))
\end{example}
\code{dynfind} also supports more exotic schemes, such as the Mac OS X Framework folders.
Depending on the library, a single short filename is sometimes enough, e.g. \code{"expat"} for the \emph{Expat} library.
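
To give an idea of what such a search heuristic can look like, the following base R sketch expands short library names into candidate file names and probes a directory for them.
It is only an illustration under stated assumptions: the helpers \code{candidateNames} and \code{findInDir} are hypothetical, and the actual rules used by \code{dynfind} may differ.
\begin{smallverbatim}
# Hypothetical helper: combine common prefixes and suffixes with
# each short library name to obtain candidate file names.
candidateNames <- function(shortnames,
                           prefixes = c("", "lib"),
                           suffixes = c("", ".so", ".dylib", ".dll"))
{
  grid <- expand.grid(suffix = suffixes, prefix = prefixes,
                      name = shortnames, stringsAsFactors = FALSE)
  paste0(grid$prefix, grid$name, grid$suffix)
}

# Hypothetical helper: return the first candidate that exists in
# the given directory, or NA if none of them does.
findInDir <- function(dir, shortnames)
{
  paths <- file.path(dir, candidateNames(shortnames))
  paths[file.exists(paths)][1]
}

findInDir("/usr/lib", c("msvcrt", "m", "m.so.6"))
\end{smallverbatim}
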
Internally, the dynamic linker interface of the OS is used via \code{.dynload}, and symbols get resolved via \code{.dynsym}:
\begin{example}
> sqrtAddr <- .dynsym(mLib, "sqrt")
\end{example}
Although R already contains support for loading shared libraries and resolving symbols, several issues have led to a reimplementation of this part:
\begin{itemize}
\item System paths are not considered when loading libraries via \code{dyn.load} of the package \pkg{base}, although searching them is one part of the search heuristics.
\item Automatic life-cycle management for loading and unloading of libraries is a desired goal.
Unloading of libraries should be done automatically via finalizer code when no symbols are in use anymore.
External pointers resolved via \code{.dynsym} hold a reference to the loaded library.
When all external pointers have been garbage collected, the library handle is no longer referenced and the finalizer can unload the library.
\end{itemize}

\subsection{Wrapping C libraries}

Functional R interfaces to foreign code can be defined with small R wrapper functions, which effectively delegate to \code{.dyncall}.
Each function interface is parameterized by a target address and a matching call signature.
Since APIs often consist of hundreds of functions (see Table \ref{tab:libs}), \code{dynbind} can create and install a batch of function wrappers for a library with a single call by using a \emph{library signature} that consists of concatenated function names and signatures separated by semicolons.
For example, to install wrappers to the C functions \code{sqrt}, \code{sin} and \code{cos} from the math library, one could use:
\begin{example}
> dynbind( c("msvcrt","m","m.so.6"),
+   "sqrt(d)d;sin(d)d;cos(d)d;" )
\end{example}
The function call has the side-effect that three R wrapper functions are created and stored in an environment, which defaults to the global environment.
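
The structure of a library signature can be made explicit with a few lines of base R.
The sketch below uses a hypothetical helper \code{parseLibrarySignature}, which is not part of \pkg{rdyncall}, to split a library signature into function names and their call signatures; it assumes entries of the form \emph{name}\verb@(@\emph{call-signature} separated by semicolons, as in the example above.
\begin{smallverbatim}
# Hypothetical helper: map each function name in a library
# signature to its call signature.
parseLibrarySignature <- function(libsig)
{
  entries <- strsplit(libsig, ";", fixed = TRUE)[[1]]
  entries <- entries[nzchar(trimws(entries))]
  open    <- regexpr("(", entries, fixed = TRUE)
  setNames(substring(entries, open + 1),
           trimws(substring(entries, 1, open - 1)))
}

# Yields a character vector named sqrt, sin, cos,
# with each element equal to "d)d".
parseLibrarySignature("sqrt(d)d;sin(d)d;cos(d)d;")
\end{smallverbatim}
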
Let us review the \code{sin} wrapper (on the 64-bit version of R running on Mac OS X 10.6):
\begin{example}
> sin
function (...)
.dyncall.default(<pointer: 0x7fff81fd13f0>, "d)d", ...)
\end{example}
The wrapper directly uses the address of the resolved \code{sin} symbol.
In addition, the wrapper uses \code{.dyncall.default}, which is a concrete selector of a particular calling convention, as outlined below.

\subsection{Calling Conventions}

Calling conventions specify how arguments and return values are passed across sub-routines and functions at machine level.
This information is vital for interfacing with the binary interface of C libraries.
The package has support for multiple calling conventions.
A non-default calling convention is selected in \code{.dyncall} via the named argument \code{callmode}.
Most current OSs and platforms only support a single \code{"default"} calling convention at run-time.
An important exception is the Microsoft Windows platform on the 32-bit \emph{i386} processor architecture:
While the default C calling convention on \emph{i386} is \code{"cdecl"} (which is the \code{"default"} on \emph{i386}), system shared libraries from Microsoft such as \file{KERNEL32.DLL}, \file{USER32.DLL} and the OpenGL library \file{OPENGL32.DLL} use the \code{"stdcall"} calling convention.
Only on 32-bit Microsoft Windows does the \code{callmode} argument have an effect and select the calling convention to be used.
All other platforms currently ignore this argument.

\subsection{Handling of C Types in R}

C APIs often make use of high-level C \verb@struct@ and \verb@union@ types for exchanging information.
Thus, to make interoperability work at that level, the package addresses the handling of C type information.
Let us consider the following hypothetical example:
A user-interface library has a function to set the 2D coordinates and dimension of a graphical output window.
The coordinates are specified using a C \code{struct Rect} data type, and the C function receives a pointer to that object:
\begin{smallverbatim}
void setWindowRect(struct Rect *pRect);
\end{smallverbatim}
The structure type is defined as follows:
\begin{smallverbatim}
struct Rect {
  short x, y;
  unsigned short w, h;
};
\end{smallverbatim}
Before we can issue a call, we have to allocate an object of that size and initialize the fields with values encoded in C types, which are not part of the R data types.
The framework provides helper functions and objects to deal with C data types in R.
Type information objects can be created from a description of the C aggregate structure.
First, we create a type information object in R for the \code{struct Rect} C data type via \code{parseStructInfos} using a \emph{structure type signature}.
\begin{smallverbatim}
> parseStructInfos("Rect{ssSS}x y w h;")
\end{smallverbatim}
After registration, an R object named \code{Rect} is installed, which contains C type information that corresponds to \code{struct Rect}.
The format of a \emph{structure type signature} has the following pattern:
\begin{center}
\emph{Struct-name} \verb@'{'@ \emph{Field-types} \verb@'}'@ \emph{Field-names} \verb@';'@
\end{center}
\emph{Field-types} use the same type signature encoding as that of \emph{call signatures} for argument and return types (Table \ref{tab:signature}).
\emph{Field-names} consist of a list of white-space separated names, labeling each field component.

An instance of a C type can be allocated via \code{new.struct}:
\begin{smallverbatim}
> r <- new.struct(Rect)
\end{smallverbatim}
Finally, the extraction (\verb@'$'@, \verb@'['@) and replacement (\verb@'$<-'@, \verb@'[<-'@) operators can be used to access structure fields symbolically.
During value transfer between R and C, values are converted automatically according to the underlying C field type.
\begin{smallverbatim}
> r$x <- -10 ; r$y <- -20 ; r$w <- 40 ; r$h <- 30
\end{smallverbatim}
In this example, R \code{numeric} values are converted on the fly to \code{signed short} and \code{unsigned short} integers (usually 16-bit values).
When the object is printed at the prompt, a detailed picture of the data object is given:
\begin{smallverbatim}
> r
struct Rect {
  x: -10
  y: -20
  w: 40
  h: 30
}
\end{smallverbatim}
At the low level, one can see that \code{r} is stored as an R \code{raw} vector object:
\begin{smallverbatim}
> r[]
[1] f6 ff ec ff 28 00 1e 00
attr(,"struct")
[1] "Rect"
\end{smallverbatim}
To continue the example, we issue a foreign function call to \code{setWindowRect} via \code{.dyncall} and pass in the \code{r} object, assuming the library is loaded and the symbol is resolved and stored in an external pointer object named \code{setWindowRectAddr}:
\begin{smallverbatim}
> .dyncall( setWindowRectAddr, "*<Rect>)v", r)
\end{smallverbatim}
We make use of a typed pointer expression \code{'*<Rect>'} instead of the untyped pointer signature \code{'p'}, which would also work but does not prevent users from passing other objects that do not reference a \code{struct Rect} data object.
Typed pointer expressions increase type-safety and use the pattern \verb@'*<@\emph{Type-Name}\verb@>'@.
The invocation will be rejected if the argument passed in is not of C type \code{Rect}.
As \code{r} is tagged with an attribute \code{struct} that refers to \code{Rect}, the call will be issued.
Typed pointers can also occur as return types; once the type information is available, they permit the manipulation of returned objects in the same symbolic manner as above.
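
The raw bytes shown above can be reproduced with base R alone, which may help to see how the four fields are laid out in memory.
The sketch below assumes 16-bit \code{short} fields without padding and a little-endian byte order, as in the printed example; it uses only \code{writeBin} and \code{readBin} and does not involve \pkg{rdyncall}.
\begin{smallverbatim}
# Pack x, y, w, h as 16-bit little-endian integers.
bytes <- writeBin(c(-10L, -20L, 40L, 30L), raw(),
                  size = 2, endian = "little")
bytes
# [1] f6 ff ec ff 28 00 1e 00

# Read the fields back: x and y are signed, w and h are unsigned.
readBin(bytes[1:4], "integer", n = 2, size = 2,
        signed = TRUE, endian = "little")
# [1] -10 -20
readBin(bytes[5:8], "integer", n = 2, size = 2,
        signed = FALSE, endian = "little")
# [1] 40 30
\end{smallverbatim}
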
C \verb@union@ types are supported as well, but they are registered with the \code{parseUnionInfos} function and use a slightly different signature format:
\begin{center}
\emph{Union-name} \verb@'|'@ \emph{Field-types} \verb@'}'@ \emph{Field-names} \verb@';'@
\end{center}
The underlying low-level C type read and write operations and conversions from R data types are performed by the functions \code{.pack} and \code{.unpack}.
These can be used for various low-level operations as well, such as dereferencing pointers to pointers.

R objects such as external pointers and atomic raw, integer and numeric vectors can be used as aggregate C types via the attribute \code{struct}.
To \emph{cast} a type in the style of C, one can use \code{as.struct}.

\subsection{Wrapping R functions as C callbacks}

Some C libraries, such as user-interface toolkits and I/O processing frameworks, use \emph{callbacks} as part of their interface to enable registration and activation of user-supplied event handlers.
A callback is a user-defined function that has a library-defined function type.
Callbacks are usually registered via a registration function offered by the library interface and are activated later from within a library run-time context.

\pkg{rdyncall} has support for wrapping ordinary R functions as C callbacks via the function \code{new.callback}.
Callback wrappers are defined by a \emph{callback signature} and the user-supplied R function to be wrapped.
\emph{Callback signatures} look very similar to \emph{call signatures} and should match the function type of the underlying C callback.
\code{new.callback} returns an external pointer that can be used as a low-level function pointer for registration as a C callback.
See the section \emph{Parsing XML using Expat} below for an application of callbacks.
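
A minimal sketch of this mechanism is given below.
It assumes that \code{new.callback} takes the callback signature as its first argument and the R function as its second, and that the resulting external pointer can be passed as a target address to \code{.dyncall}; the function \code{add} is a made-up example.
\begin{smallverbatim}
library(rdyncall)

# An ordinary R function to be exposed as 'int f(int, int)'.
add <- function(x, y) x + y

# Assumption: signature first, R function second.
cb <- new.callback("ii)i", add)

# The external pointer behaves like a C function pointer, so it
# can be called back through .dyncall with a matching signature.
.dyncall(cb, "ii)i", 20, 3)
\end{smallverbatim}
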
\subsection{Foreign Library Interface}

At the highest level, \pkg{rdyncall} provides the front-end function \code{dynport} to dynamically set up an interface to a C Application Programming Interface.
This includes loading the corresponding shared C library and resolving symbols.
During the binding process, a new R environment (this was a name space \citep{RNameSpace} up to version 0.7.4) is populated with thin R wrapper objects that represent abstractions of their C counterparts, such as functions, pointers to functions, type-information objects for C struct and union types, and symbolic constant equivalents of C enums and macro defines.
The mechanism aims to work across platforms, given that the corresponding shared libraries of a \emph{DynPort} have been installed in a system standard location on the host.

An initial repository of \emph{DynPorts} is available in the package that provides bindings for several popular C APIs; see Table \ref{tab:libs} for examples of available bindings.

\section{Sample Applications}

We give two examples with different application contexts that demonstrate the direct usage of C APIs from within R through the \pkg{rdyncall} package.
The R interface to C libraries looks very similar to the actual C API.
For details on the usage of a particular C library, the programming manuals and documentation of the libraries should be consulted.

Before loading R bindings via \code{dynport}, the shared library must be installed on the system.
Currently this has to be done manually, and the installation method depends on the target OS (see the manual page on `rdyncall-demos' for details).
While \emph{OpenGL} is most often pre-installed on typical desktop systems, \emph{SDL} and \emph{Expat} sometimes have to be installed explicitly.
\subsection{OpenGL Programming in R}
In the first example, we make use of the Simple DirectMedia Layer library (SDL) \citep{SDL,Pendleton:2003:GPS,www:sdl-alternative} and the Open Graphics Library (OpenGL) \citep{Board05} to implement a portable multimedia application skeleton in R. We first need to load bindings to SDL and OpenGL via dynports:
\begin{example}
> dynport(SDL)
> dynport(GL)
\end{example}
Now we initialize the SDL library, in particular the video subsystem, and open a window surface with a dimension of $640 \times 480$ pixels in 32-bit color depth that has support for OpenGL rendering:
\begin{smallverbatim}
> SDL_Init(SDL_INIT_VIDEO)
> surface <- SDL_SetVideoMode(640,480,32,SDL_OPENGL)
\end{smallverbatim}
Next, we implement the application loop which updates the display repeatedly and processes the event queue until a \emph{quit} request is issued by the user via the window close button.
\begin{smallverbatim}
> mainloop <- function()
{
  ev <- new.struct(SDL_Event)   # user-allocated event buffer
  quit <- FALSE
  while(!quit) {
    draw()                      # render one frame
    while(SDL_PollEvent(ev)) {  # drain the event queue
      if (ev$type == SDL_QUIT) {
        quit <- TRUE
      }
    }
  }
}
\end{smallverbatim}
SDL event processing is implemented by collecting events that occur in a queue. Once per update frame, typical SDL applications poll the queue by calling \code{SDL\_PollEvent} with a pointer to a user-allocated buffer of C type \code{union SDL\_Event}. Event records have a common type identifier which is set to \code{SDL\_QUIT} when a quit event has occurred, e.g. when users press a close button on a window. Next, we implement our \code{draw} function making use of the OpenGL 1.1 API. We clear the background with a blue color and draw a light-green rectangle.
\begin{smallverbatim}
> draw <- function()
{
  glClearColor(0,0,1,0)         # blue background
  glClear(GL_COLOR_BUFFER_BIT)
  glColor3f(0.5,1,0.5)          # light green
  glRectf(-0.5,-0.5,0.5,0.5)
  SDL_GL_SwapBuffers()
}
\end{smallverbatim}
Now we can run the application main loop:
\begin{smallverbatim}
> mainloop()
\end{smallverbatim}
To stop the application, we hit the close button of the window. A similar example is also available via \code{demo(SDL)}. Here the \code{draw} function displays a rotating 3D cube, depicted in Figure \ref{fig:demo_SDL}.
\begin{figure} \centering \includegraphics[scale=0.35]{img_SDL.png} \caption{\label{fig:demo_SDL} \code{demo(SDL)}} \end{figure}
\code{demo(randomfield)} gives a slightly more scientific application of OpenGL and R: Random fields of size $512 \times 512$ are generated via blending of 5000 texture-mapped 2D Gaussian kernels. The \emph{frames per second} counter in the window title gives the number of matrices generated per second (see Figure \ref{fig:demo_randomfield}). When clicking on the animation window, the current frame and matrix are passed to R and plotted. While several dozen matrices are computed per second using OpenGL, it takes several seconds to plot a single matrix in R using \code{image()}.
\begin{figure} \centering \includegraphics[scale=0.35]{img_randomfield.png} \caption{\label{fig:demo_randomfield} \code{demo(randomfield)}} \end{figure}

\subsection{Parsing XML using Expat}
In the second example, we use the Expat XML Parser library \citep{www:expat,Kim:2001:TSJ} to implement a stream-oriented XML parser suitable for very large documents. Being very popular, the library is likely to be already installed on many OS distributions; otherwise it is available from package repositories or can be built as a shared library from source. In Expat, custom XML parsers are implemented by defining functions that are registered as callbacks to be invoked on events that occur during parsing, such as the start and end of XML tags. Here we create a simple parser skeleton that prints the start and end tag names. First we load R bindings for Expat via \code{dynport}:
\begin{smallverbatim}
> dynport(expat)
\end{smallverbatim}
Next we create an abstract parser object via the C function \code{XML\_ParserCreate} that receives one argument of type C string to specify a desired character encoding that overrides the document encoding declaration. We want to pass a null pointer (\code{NULL}) here. In the \code{.dyncall} FFI, C null pointer values for pointer types are expressed via the R \code{NULL} value:
\begin{smallverbatim}
> p <- XML_ParserCreate(NULL)
\end{smallverbatim}
The C interface for registration of start and end-tag event handler callbacks is given below:
\begin{smallverbatim}
/* Language C, from file expat.h: */
typedef void (*XML_StartElementHandler)
  (void *userData, const XML_Char *name, const XML_Char **atts);
typedef void (*XML_EndElementHandler)
  (void *userData, const XML_Char *name);
void XML_SetElementHandler(XML_Parser parser,
  XML_StartElementHandler start, XML_EndElementHandler end);
\end{smallverbatim}
We implement the callbacks as R functions which print the event and tag name. They are wrapped as C callback pointers via \code{new.callback} using a matching \emph{callback signature}. The second argument \code{name} of type C string in both callbacks, \code{XML\_StartElementHandler} and \code{XML\_EndElementHandler}, is of primary interest; this argument passes over the XML tag name. C strings are handled in a special way by the \code{.dyncall} FFI, because they have to be copied as R \code{character} objects. The special type signature \code{'Z'} is used to denote a C string type.
The other arguments are simply denoted as untyped pointers using \code{'p'}:
\begin{smallverbatim}
> start <- new.callback("pZp)v",
    function(ignored1,tag,ignored2)
      cat("Start tag:", tag, "\n")
  )
> end <- new.callback("pZ)v",
    function(ignored,tag)
      cat("End tag:", tag, "\n")
  )
> XML_SetElementHandler(p, start, end)
\end{smallverbatim}
To test the parser, we create a sample document stored in a \code{character} object named \code{text} and pass it to the parse function \code{XML\_Parse}:
\begin{smallverbatim}
> text <- "<hello> <world> </world> </hello>"
> XML_Parse( p, text, nchar(text), 1)
\end{smallverbatim}
The resulting output is given below:
\begin{smallverbatim}
Start tag: hello
Start tag: world
End tag: world
End tag: hello
\end{smallverbatim}
Expat supports processing of very large XML documents in a chunk-based manner by calling \code{XML\_Parse} several times, where the last argument is used as an indicator for the final chunk of the document.

\section{Architecture}
The core implementation of the FFI, callbacks and loading of code are mainly based on the suite of libraries of the \emph{DynCall} project \citep{dyncall}.

\subsection{Dynamic calls}
The FFI offered by \pkg{rdyncall} is based on the \pkg{dyncall} library, which provides an abstraction for making arbitrary machine-level calls with support for multiple calling conventions and most C argument- and return-types.\footnote{\emph{Inline} structure types are currently not fully supported.} For each processor architecture, the supported calling conventions are abstracted in a \emph{Call Virtual Machine} (CallVM) object. The \pkg{dyncall} library offers a universal C interface that can be used from within scripting language interpreter contexts to build up a machine-level call in a structured manner. A CallVM comprises a state machine and a call kernel.
The state machine is implemented in C and keeps track of internal buffers for pre-loading argument values that get arranged for specific storage locations, such as the stack or special register sets, according to the processor architecture and the chosen calling convention. The actual invocation of a foreign function call is conducted by the call kernel, a small piece of code that is implemented in Assembly and that provides a generic call facility for a particular calling convention. It prepares machine-level calls by copying data to registers and to the call stack according to the relevant calling convention, and finally executes the machine call to a target address. From a scripting language interpreter perspective, the invocation of a foreign function call through the CallVM is conducted in three consecutive phases using the \pkg{dyncall} C API:
\begin{enumerate}
\item \emph{Setup Phase:} The desired calling convention has to be chosen which, in most cases, is just the \emph{default C} calling convention. However, more specialized and platform-specific calling conventions are available as well, in particular for the 32-bit Windows OS.
\item \emph{Argument Loading Phase:} Arguments are passed in a \emph{left-to-right} order according to the C/C++ function or method type declaration. Argument values are stored in buffers according to the processor architecture and selected calling convention.
\item \emph{Call and Return-Value Receive Phase:} A return-type specific call function is chosen and the target address of the foreign code is passed, which gets called via the call kernel.
\end{enumerate}
The architecture makes it straightforward to implement an FFI for a dynamic language interpreter using a text parser for call signatures to drive the conversion of arguments and results. Similar FFIs with a text-based interface have been implemented for other language interpreters such as Ruby, Python and Lua; see the DynCall source repository \citep{dyncall}.
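
The three phases, and the way a signature text can drive them, can be sketched in portable C. The following toy model is \emph{not} the \pkg{dyncall} API: all \verb@toy_@ names are hypothetical, only \code{double} arguments and results are handled, and the call phase dispatches through ordinary function-pointer casts, whereas a real call kernel performs this step generically in Assembly.
\begin{smallverbatim}
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ARGS 2

typedef void (*toy_target)(void);  /* generic function pointer */

/* Toy stand-in for a CallVM: buffers double arguments only. */
typedef struct {
    double args[TOY_MAX_ARGS];
    size_t count;
} toy_vm;

/* Setup phase: start a new call with an empty buffer. */
void toy_reset(toy_vm *vm) { vm->count = 0; }

/* Argument loading phase: append one value, left to right. */
void toy_arg_double(toy_vm *vm, double value)
{
    assert(vm->count < TOY_MAX_ARGS);
    vm->args[vm->count++] = value;
}

/* Call phase: cast the target to the matching type, call it. */
double toy_call_double(const toy_vm *vm, toy_target target)
{
    switch (vm->count) {
    case 0:
        return ((double (*)(void))target)();
    case 1:
        return ((double (*)(double))target)(vm->args[0]);
    default:
        return ((double (*)(double, double))target)
                   (vm->args[0], vm->args[1]);
    }
}

/* Signature-driven front end: one 'd' per argument, then ")d". */
double toy_call(toy_target target, const char *signature,
                const double *values)
{
    toy_vm vm;
    toy_reset(&vm);
    for (; *signature == 'd'; ++signature)
        toy_arg_double(&vm, *values++);
    assert(signature[0] == ')' && signature[1] == 'd'
           && signature[2] == '\0');
    return toy_call_double(&vm, target);
}

/* Example target function. */
double toy_add(double a, double b) { return a + b; }
\end{smallverbatim}
With \verb@double values[2] = {2.0, 3.0};@ the call \verb@toy_call((toy_target)toy_add, "dd)d", values)@ walks the signature once, loads both arguments and invokes \verb@toy_add@. The sequential loop over the signature mirrors the left-to-right argument loading described above.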
Both the C interface of \pkg{dyncall} and the signature format use the abstract C/C++ type system and give no indication about the effective size of a particular type. In experiments with several C APIs bound via \pkg{rdyncall} it turned out that the signatures do work cross-platform, if the fundamental type definitions of the C API do not change across platforms. In our tests and the presented examples, a wide range of C APIs have this property and type signatures are valid across platforms even when switching between 32- and 64-bit platforms.

\subsection{Dynamic callbacks}
The \pkg{dyncallback} library provides a framework to implement dynamic callbacks for language interpreters to wrap scripting functions as C function pointers. The framework offers a universal C interface for callback handlers that is implemented once for a particular interpreter. The handler receives callback calls from C and forwards the call, including conversion of arguments, to a scripting function. Handlers need to access machine-level arguments whose location can be on the stack or in registers, depending on the processor architecture and calling convention. For that reason, the handler interface receives an abstract argument iterator that gives structured access to the arguments for passing over to the high-level language. Callbacks are created via an interface that pools a handler, language context, scripting function reference, callback type-information and other user data into a \emph{single} native C function pointer, such that even very low-level C callbacks without user-supplied user-data can be addressed with the underlying technique.
\footnote{This includes callbacks for sort routines of the Standard C library which lack user-data.}

\subsection{Portability and Stability}
The requirements for porting the \emph{DynCall} libraries to a new processor and/or platform are high: The calling conventions of a target processor platform have to be studied in detail, state machines have to be implemented in C and a small amount of code has to be written in Assembly, which may not even be portable across build tools on the same platform. Nevertheless \pkg{dyncall} (as of version 0.7) has support for many processor architectures such as Intel i386 (x86), AMD 64 (x64), PowerPC 32-bit, ARM (including Thumb extension), MIPS 32/64-bit and SPARC 32/64-bit including support for several platform-, processor- and compiler-specific calling conventions. \pkg{dyncallback} also supports major processor architectures such as Intel i386 (x86), AMD 64 (x64) and ARM and offers partial support for PowerPC 32-bit (support for Mac OS X/Darwin). Besides the processor architecture, the libraries are also explicitly ported and tested on various operating systems such as Linux, Mac OS X, Windows, the BSD family, Solaris, Haiku, Minix and Plan9. Support for embedded platforms such as Playstation Portable, Nintendo DS and iPhone OS is available as well. \emph{DynCall} contains a suite of testing tools for quality assurance. Included are test-case generators written in Lua and Python. Extreme call and callback scenarios are tested here to ensure correct passing of arguments and results. Before a release, the libraries and tests are built for a large set of architectures on \pkg{DynOS} \citep{dynos} - a batch-build system using full system emulators such as \pkg{QEmu} \citep{qemu} and \pkg{GXEmul} \citep{gxemul} and various operating-system images to test release candidates and create pre-built binary releases of the library.
\subsection{Text-based Signature Interfaces} A common property of the service interface presented here is the use of signature text formats. Signatures are used as descriptors for types, such as foreign function calls, callbacks and aggregate data types. The reasons that lead to the use of signatures as a high-level user-interface to interact with such services are given next: \begin{enumerate} \item Cross-language interface: Text format interfaces are available across high-level languages. Examples for cross-language text-based interfaces include regular expressions or \code{printf}-style formatted output descriptions. \item Developer-friendly: The simplicity and compactness of the text-format enables developers to bridge with foreign code in interactive and rapid development sessions. C type signatures can be derived by hand with minimum effort: Fundamental types are encoded with a single character and the upper-case encodes an \code{unsigned} type. \item Machine-neutral: In contrast to binary encoded type libraries, the data format is not affected by the endian model of the underlying platform. \item Parser-friendly: The signature format can be used as driver code to perform foreign function calls. Implementations of parsers match the sequential design of \pkg{dyncall}'s CallVM and \pkg{dyncallback}'s argument iterator interface. \end{enumerate} \subsection{Creation of DynPort files} In this section we describe the tool-chain that creates the universal bindings called \emph{DynPort}. The process described here is applied once on a build machine, the generated output is used later at run-time across platforms to drive the dynamic linkage and binding procedure. \emph{DynPort} files can be created automatically from C header files using a tool-chain as depicted in Figure \ref{fig:gen_dynport}. 
\begin{figure} \centering \includegraphics[scale=0.45]{img_gen_dynport.pdf} \caption{\label{fig:gen_dynport} Tool-chain to create \emph{DynPort} files from C headers} \end{figure}
The tool-chain comprises several freely available components that are briefly described next: \pkg{GCC-XML} \citep{gccxml} is a modified version of the GCC compiler which translates C sources to XML documents. \pkg{xsltproc}, distributed as part of the \pkg{libxslt} library \citep{libxslt}, is an XSLT processor that transforms XML documents to XML, text or binary formats according to style-sheets written in the \emph{XSL Transformations} \citep{Clark:01:XTV} language. To extract library binding specifications, a main C source file is created that consists of one or more \code{\#include} statements that reference library and/or system header files to process. The header files should have been previously installed on the build machine. In a preprocessing phase, the GNU C Macro Processor is used to process all \code{\#include} statements using standard system search paths to create a concatenated \emph{All-In-One} source file free of any \code{\#include} statements. GCC-XML transforms C header declarations to XML. An XSL style-sheet implements the transformation of XML to type signature formats using an XSLT processor. C Macro \code{\#define} statements are handled separately by a custom C Preprocessor implemented in C++ using the boost wave library \citep{boostwave}. An optional filter stage is used to include only elements that match a certain pattern, such as the common prefix found in many libraries, e.g. '\code{SDL\_}'. In a last step, the various fragments are assembled into a single text-file which represents the \emph{DynPort} file. The overall build process is managed by \emph{make} files and a repository of recipes has been set up to extend support for additional dynports and libraries in a structured and coordinated way.
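
The core of the transformation from declarations to type signatures is a mapping from C types to signature characters. The tool-chain performs this step with an XSL style-sheet; the C sketch below only illustrates the idea. The function names \verb@sig_char@ and \verb@make_signature@ are hypothetical, and the table is a small subset that is assumed to match the signature format: \code{'p'}, \code{'Z'} and \code{'v'} appear in the examples above, the remaining characters are assumptions.
\begin{smallverbatim}
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a C type name to a signature character; 0 if unknown.
   Illustrative subset, characters partly assumed. */
char sig_char(const char *ctype)
{
    static const struct { const char *name; char code; } table[] = {
        { "void", 'v' }, { "int", 'i' }, { "unsigned int", 'I' },
        { "double", 'd' }, { "void *", 'p' },
        { "const char *", 'Z' }
    };
    size_t i;
    assert(ctype != NULL);
    for (i = 0; i < sizeof table / sizeof table[0]; ++i)
        if (strcmp(ctype, table[i].name) == 0)
            return table[i].code;
    return 0;
}

/* Write "<argument codes>)<return code>" to out. Returns 0 on
   success, -1 for an unknown type or a too small buffer. */
int make_signature(const char *ret, const char *const *args,
                   size_t nargs, char *out, size_t outsize)
{
    size_t i;
    if (outsize < nargs + 3)
        return -1;
    for (i = 0; i < nargs; ++i)
        if ((out[i] = sig_char(args[i])) == 0)
            return -1;
    out[nargs] = ')';
    if ((out[nargs + 1] = sig_char(ret)) == 0)
        return -1;
    out[nargs + 2] = '\0';
    return 0;
}
\end{smallverbatim}
For a function type that returns \code{void} and takes the argument types \verb@"void *"@ and \verb@"const char *"@, the sketch produces the text \verb@pZ)v@, which is the callback signature used for the Expat end-tag handler in the sample application.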
\section{Summary and Outlook}
This paper introduces the \pkg{rdyncall} package (Version 0.7.3 on CRAN as of this writing) that contributes an improved Foreign Function Interface for R. The FFI facilitates \emph{direct} invocation of foreign functions \emph{without} the need to compile additional wrapper code in C. Based on the FFI, a dynamic cross-platform linkage framework to wrap and access \emph{whole} C interfaces of native libraries from R is discussed. Instead of \emph{compiling} bindings for every library-and-language combination, R bindings of a library are created dynamically at run-time in a data-driven manner via \emph{DynPort} files - a cross-platform universal type information format. C libraries are made accessible in R as though they were extension packages and the R interface looks very similar to that of C. This enables system-level programming in R and opens up new possibilities for R developers, such as using OpenGL directly in R across platforms as described in the example. An initial repository of \emph{DynPort}s for standard cross-platform portable C libraries comes with the package. The implementation is based on libraries from the \emph{DynCall} project that implement non-trivial facilities such as an abstraction to machine-level function calls supporting multiple calling conventions and the handling of C callbacks from within scripting language interpreter environments. The libraries have been ported across major R platforms. Work is in progress to support missing architectures in \pkg{dyncallback} such as PowerPC System V 32-bit, PowerPC 64-bit, and 32/64-bit MIPS and SPARC architectures. The handling of foreign aggregate data types, which is currently implemented in R and C, is planned to be reimplemented in portable C as part of \emph{DynCall}, in cooperation with the developers of \emph{BridJ} \citep{bridj}. Currently, \emph{DynPort} files are written as R scripts with inline text chunks created from the \emph{DynPort} tool chain.
For the Lua Programming Language \citep{SPE::IerusalimschyFF1996}, a similar framework named \pkg{luadyncall} is in development using a language-neutral format for \emph{DynPort} files. The need to install additional shared libraries still represents a hurdle for ordinary R users. We plan to find a common abstraction layer for installation systems, package managers and software distribution services across OS distributions, and to integrate meta installation information into the \emph{DynPort} file format. The \emph{DynPort} facility in \pkg{rdyncall} constitutes an initial step in building up an infrastructure between scripting languages and C libraries. Analogous to the way in which R users enjoy quick access to the large pool of R software managed by CRAN, we envision an archive network in which C library developers can distribute their work across languages, and users get quick access to the pool of C libraries from within scripting languages via automatic installation of precompiled components and using universal type information for cross-platform and cross-language dynamic bindings.
\bibliography{FLI}
\end{document}