The goal of nanoarrow is to provide minimal useful bindings to the Arrow C Data and Arrow C Stream interfaces using the nanoarrow C library.
You can install the released version of nanoarrow from CRAN with:
install.packages("nanoarrow")You can install the development version of nanoarrow from GitHub with:
# install.packages("remotes")
remotes::install_github("apache/arrow-nanoarrow/r")If you can load the package, you’re good to go!
library(nanoarrow)The Arrow C Data and Arrow C Stream interfaces are comprised of three
structures: the ArrowSchema which represents a data type of
an array, the ArrowArray which represents the values of an
array, and an ArrowArrayStream, which represents zero or
more ArrowArrays with a common ArrowSchema.
All three can be wrapped by R objects using the nanoarrow R package.
Use infer_nanoarrow_schema() to get the ArrowSchema
object that corresponds to a given R vector type; use
as_nanoarrow_schema() to convert an object from some other
data type representation (e.g., an arrow R package DataType
like arrow::int32()); or use na_XXX()
functions to construct them.
infer_nanoarrow_schema(1:5)
#> <nanoarrow_schema int32>
#> $ format : chr "i"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULL
as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULL
na_int64()
#> <nanoarrow_schema int64>
#> $ format : chr "l"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 2
#> $ children : list()
#> $ dictionary: NULLUse as_nanoarrow_array() to convert an object to an
ArrowArray object:
as_nanoarrow_array(1:5)
#> <nanoarrow_array int32[5]>
#> $ length : int 5
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 2
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#> $ dictionary: NULL
#> $ children : list()
as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULLYou can use as.vector() or as.data.frame()
to get the R representation of the object back:
array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
as.data.frame(array)
#> col1
#> 1 1.1
#> 2 2.2Even though at the C level the ArrowArray is distinct from the
ArrowSchema, at the R level we attach a schema wherever possible. You
can access the attached schema using
infer_nanoarrow_schema():
infer_nanoarrow_schema(array)
#> <nanoarrow_schema struct>
#> $ format : chr "+s"
#> $ name : chr ""
#> $ metadata : list()
#> $ flags : int 0
#> $ children :List of 1
#> ..$ col1:<nanoarrow_schema double>
#> .. ..$ format : chr "g"
#> .. ..$ name : chr "col1"
#> .. ..$ metadata : list()
#> .. ..$ flags : int 2
#> .. ..$ children : list()
#> .. ..$ dictionary: NULL
#> $ dictionary: NULLThe easiest way to create an ArrowArrayStream is from a list of
arrays or objects that can be converted to an array using
as_nanoarrow_array():
stream <- basic_array_stream(
list(
data.frame(col1 = c(1.1, 2.2)),
data.frame(col1 = c(3.3, 4.4))
)
)You can pull batches from the stream using the
$get_next() method. The last batch will return
NULL.
stream$get_next()
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
stream$get_next()
#> <nanoarrow_array struct[2]>
#> $ length : int 2
#> $ null_count: int 0
#> $ offset : int 0
#> $ buffers :List of 1
#> ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> $ children :List of 1
#> ..$ col1:<nanoarrow_array double[2]>
#> .. ..$ length : int 2
#> .. ..$ null_count: int 0
#> .. ..$ offset : int 0
#> .. ..$ buffers :List of 2
#> .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#> .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
#> .. ..$ dictionary: NULL
#> .. ..$ children : list()
#> $ dictionary: NULL
stream$get_next()
#> NULLYou can pull all the batches into a data.frame() by
calling as.data.frame() or as.vector():
stream <- basic_array_stream(
list(
data.frame(col1 = c(1.1, 2.2)),
data.frame(col1 = c(3.3, 4.4))
)
)
as.data.frame(stream)
#> col1
#> 1 1.1
#> 2 2.2
#> 3 3.3
#> 4 4.4After consuming a stream, you should call the release method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.
The nanoarrow package implements as_nanoarrow_schema(),
as_nanoarrow_array(), and
as_nanoarrow_array_stream() for most arrow package types.
Similarly, it implements arrow::as_arrow_array(),
arrow::as_record_batch(),
arrow::as_arrow_table(),
arrow::as_record_batch_reader(),
arrow::infer_type(), arrow::as_data_type(),
and arrow::as_schema() for nanoarrow objects such that you
can pass equivalent nanoarrow objects into many arrow functions and vice
versa.