% \VignetteIndexEntry{Introduction to rsolr}
% \VignetteDepends{nycflights13}
% \VignetteKeywords{Solr}
% \VignettePackage{rsolr}
% \VignetteEngine{knitr::knitr}
\documentclass[11pt]{article}
\author{Michael Lawrence}
\date{\today}
\title{Introduction to rsolr}
\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
\label{sec-1}

The \texttt{rsolr} package provides an idiomatic (R-like) and extensible
interface between R and Solr, a search engine and database. Like an onion,
the interface consists of several layers, along a gradient of abstraction,
so that simple problems are solved simply, while more complex problems may
require some peeling and perhaps tears.

The interface is idiomatic, syntactically but also in terms of
\emph{intent}. While Solr provides a search-oriented interface, we
recognize it as a document-oriented database. While not entirely
schemaless, its schema is extremely flexible, which makes Solr an
effective database for prototyping and ad hoc analysis. R is designed for
manipulating data, so \texttt{rsolr} maps common R data manipulation verbs
to the Solr database and its (limited) support for analytics. In other
words, \texttt{rsolr} is for analysis, not search, which has presented
some fun challenges in design. Hopefully it is useful --- we had not tried
it until writing this document.

We have interfaced with all of the Solr features that are relevant to data
analysis, with the aim of implementing many of the fundamental data
munging operations. Those operations are listed in the table below, along
with how we have mapped them to existing and well-known functions in the
base R API, with some important extensions. When called on \texttt{rsolr}
data structures, those functions should behave analogously to the existing
implementations for \texttt{data.frame}.
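As a reminder of the base-R semantics that these methods are meant to mirror, the chunk below applies the same verbs to an ordinary \texttt{data.frame}. The toy dataset here is purely illustrative and is not part of the flights example; note also that base R sorts a \texttt{data.frame} via \texttt{order()}, whereas \texttt{rsolr} extends \texttt{sort()} with a formula interface.

<<>>=
df <- data.frame(g = c("a", "a", "b"), x = c(3, 1, 2))
subset(df, x > 1)             # filtering
transform(df, y = x * 10)     # transformation
df[order(df$x), ]             # sorting (base R uses order() for data.frame)
aggregate(x ~ g, df, sum)     # aggregation by group
@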
Note that more complex operations, such as joining and reshaping tables,
are best left to more sophisticated frameworks, and we encourage others to
implement our extended base R API on top of such systems. After all, Solr
is a search engine. Give it a break.

\begin{center}
\begin{tabular}{ll}
Operation & R function\\
\hline
Filtering & \texttt{subset}\\
Transformation & \texttt{transform}\\
Sorting & \texttt{sort}\\
Aggregation & \texttt{aggregate}\\
\end{tabular}
\end{center}

\section{Demonstration: nycflights13}
\label{sec-2}

\subsection{The Dataset}
\label{sec-2-1}

As part demonstration and part proof of concept, we will attempt to follow
the introductory workflow from the \texttt{dplyr} vignette. The dataset
describes all of the airline flights departing New York City in 2013. It
is provided by the \texttt{nycflights13} package, so please see its
documentation for more details.

<<>>=
if (identical(suppressWarnings(packageDescription("nycflights13")), NA)) {
    knitr::opts_chunk$set(eval = FALSE)
    message("Package nycflights13 not installed: not evaluating code chunks.")
}
@

<<>>=
library(nycflights13)
dim(flights)
head(flights)
@

\subsection{Populating a Solr core}
\label{sec-2-2}

The first step is getting the data into a Solr \emph{core}, which is what
Solr calls a database. This involves writing a schema in XML, installing
and configuring Solr, launching the server, and populating the core with
the actual data. We expect that most use cases of \texttt{rsolr} will
involve accessing an existing, centrally deployed, usually read-only Solr
instance, so these steps are typically not major concerns. However, to
conveniently demonstrate the software, we need to violate all of those
assumptions. Luckily, we have managed to embed an example Solr
installation within \texttt{rsolr}. We also provide a mechanism for
autogenerating a Solr schema from a \texttt{data.frame}.
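To give a rough idea of what schema autogeneration involves, the sketch below maps R column classes to plausible Solr field types. This toy \texttt{guessSolrType()} function is purely illustrative --- it is not how \texttt{deriveSolrSchema()} is actually implemented, and the type names are assumptions loosely based on Solr's stock field types:

<<>>=
## Illustrative only: guess a Solr field type from a column's R class.
## The mapping and type names are assumptions, not rsolr internals.
guessSolrType <- function(col) {
    switch(class(col)[1L],
           integer   = "int",
           numeric   = "double",
           logical   = "boolean",
           factor    = ,
           character = "string",
           "string")  # fallback for other classes
}
vapply(data.frame(n = 1:3, x = c(1.5, 2, 3), ok = TRUE,
                  id = c("a", "b", "c")),
       guessSolrType, character(1L))
@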
This could be useful in practice for producing a template schema that can
be tweaked and deployed in shared Solr installations. Taken together, the
process turns out not to be very intimidating.

We begin by generating the schema and starting the demo Solr instance.
Note that this instance is really only meant for demonstrations. You
should not abuse it like the people abused the poor built-in R HTTP
daemon.

<<>>=
library(rsolr)
schema <- deriveSolrSchema(flights)
solr <- TestSolr(schema)
@

Next, we need to populate the core with our data. This requires a way to
interact with the core from R. \texttt{rsolr} provides direct access to
cores, as well as two high-level interfaces that represent a dataset
derived from a core (rather than the core itself). The two interfaces each
correspond to a particular shape of data: \emph{SolrList} behaves like a
list, while \emph{SolrFrame} behaves like a table (data frame).
\emph{SolrList} is useful when the data are ragged, as is often the case
for data stored in Solr. The Solr schema is so dynamic that we could
trivially define a schema with a virtually infinite number of fields, and
each document could have its own unique set of fields. However, since our
data are tabular, we will use \emph{SolrFrame} for this exercise.

<<>>=
sr <- SolrFrame(solr$uri)
@

Finally, we load our data into the Solr dataset:

<<>>=
sr[] <- flights
@

This takes a while, since Solr has to generate all sorts of indices, etc.

As \emph{SolrFrame} behaves much like a base R data frame, we can retrieve
the dimensions and look at the head of the dataset:

<<>>=
dim(sr)
head(sr)
@

Comparing the output above to that of the earlier call to
\texttt{head(flights)} reveals that the data are virtually identical. As
Solr is just a search engine (on steroids), a significant amount of
engineering was required to achieve that result.

\subsection{Restricting by row}
\label{sec-2-3}

The simplest operation is filtering the data, i.e., restricting it to a
subset of interest.
Even a search engine should be good at that. Below, we use
\texttt{subset} to restrict the flights to those departing on January 1
(2013).

<<>>=
subset(sr, month == 1 & day == 1)
@

Note how the records at the bottom contain missing values. Solr does not
provide any facilities for representing missing values, but we mimic them
by excluding those fields from those documents.

We can also extract ranges of data using the canonical \texttt{window()}
function:

<<>>=
window(sr, start=1L, end=10L)
@

Or, as we have already seen, the more convenient:

<<>>=
head(sr, 10L)
@

We could also use \texttt{:} to generate a contiguous sequence:

<<>>=
sr[1:10,]
@

Unfortunately, it is generally infeasible to randomly access Solr records
by index, because numeric indexing is a foreign concept to a search
engine. Solr does, however, support retrieval by a key that has a unique
value for each document. These data lack such a key, but it is easy to
add one and declare it as such to \texttt{deriveSolrSchema()}.

\subsection{Sorting}
\label{sec-2-4}

To sort the data, we just call \texttt{sort()} and describe the order by
passing a formula via the \texttt{by} argument. For example, we sort by
year, breaking ties with month, then day:

<<>>=
sort(sr, by = ~ year + month + day)
@

To sort in decreasing order, just pass \texttt{decreasing=TRUE} as usual:

<<>>=
sort(sr, by = ~ arr_delay, decreasing=TRUE)
@

\subsection{Restricting by field}
\label{sec-2-5}

Just as we can use \texttt{subset} to restrict by row, we can also use it
to restrict by column:

<