%\VignetteIndexEntry{Algorithms and Benchmarks for the coop Package}
\documentclass[]{article}
\input{./include/settings}

\mytitle{Algorithms and Benchmarks for the coop Package}
\mysubtitle{}
\myversion{0.6-2}
\myauthor{
\centering
Drew Schmidt \\
\texttt{wrathematics@gmail.com}
}

\begin{document}
\makefirstfew


\section{Introduction}\label{introduction}

In this document, we introduce the algorithms underlying the \textbf{coop} package~\cite{coop} and offer some benchmarks. To recreate the benchmarks here, one needs a compiler that supports \textbf{OpenMP}~\cite{openmp} and a high-performance \textbf{BLAS} library~\cite{lawson1979basic}. See the other \textbf{coop} package vignette, \emph{Introducing coop: Fast Covariance, Correlation, and Cosine Operations}~\cite{coop2016guide}, for details.

We do not go into detail on the covariance and correlation algorithms, because they are obvious and uninteresting.


\subsection{A Note on Sparse Operations}\label{a-note-on-sparse-operations}

Of the three operations, only cosine similarity currently has a sparse implementation. The short reason why is that the other two operations require centering and/or scaling. To better understand the problem, consider the \(20\times 5\) matrix whose first row is a row of ones, and whose other rows consist entirely of zeros:

\begin{lstlisting}[language=rr]
x <- matrix(0, 20, 5)
x[1, ] <- 1
\end{lstlisting}

The original matrix is, obviously, 95\% sparse:

\begin{lstlisting}[language=rr]
coop::sparsity(x)
\end{lstlisting}

But if we center the data, it becomes 100\% dense:

\begin{lstlisting}[language=rr]
coop::sparsity(scale(x, TRUE, FALSE))
\end{lstlisting}


\section{The Algorithms with Notes on Implementation}\label{the-algorithms-with-notes-on-implementation}

The dense implementations should scale well, and their non-BLAS components will use multiple threads (if your compiler supports OpenMP) when the matrix has more than 1000 columns. Additionally, we try to use vector operations (via OpenMP's \texttt{simd} construct) for additional performance, though this requires a compiler that supports a relatively modern OpenMP standard.


\subsection{Dense Matrix Input}\label{dense-matrix-input}

Given an \(m\times n\) matrix \(A\) (input) and an \(n\times n\) matrix \(C\) (preallocated output):

\begin{enumerate}
\item Compute the upper triangle of the crossproduct \texttt{C = t(A) \%*\% A} using a symmetric rank-k update (the \texttt{\_syrk} BLAS function).
\item Iterate over the upper triangle of \(C\):
\begin{enumerate}
\item Divide each off-diagonal value by the square root of the product of the \(i\)'th and \(j\)'th diagonal entries.
\item Replace each diagonal value with 1.
\end{enumerate}
\item Copy the upper triangle of \(C\) onto its lower triangle.
\end{enumerate}

The total number of floating point operations is:

\begin{enumerate}
\item \(mn(n+1)\) for the symmetric rank-k update.
\item \(\frac{3n(n+1)}{2}\) for the rescaling operation.
\end{enumerate}

The algorithmic complexity is \(O(mn^2)\), dominated by the symmetric rank-k update. The storage complexity, ignoring the required allocation of outputs (namely the \(C\) matrix), is \(O(1)\).
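To make the procedure concrete, here is a minimal pure-R sketch of the same computation. It mirrors the compiled routine in spirit only: \texttt{crossprod()} stands in for the \texttt{\_syrk} rank-k update, and the function name is ours for illustration, not part of \textbf{coop}'s API.

\begin{lstlisting}[language=rr]
# Minimal pure-R sketch of the dense algorithm above; illustrative only
cosine_dense_sketch <- function(A) {
  C <- crossprod(A)       # t(A) %*% A; stands in for the _syrk update
  d <- sqrt(diag(C))      # column norms of A
  C <- C / tcrossprod(d)  # divide entry (i, j) by d[i] * d[j]
  diag(C) <- 1            # exact ones on the diagonal
  C
}
\end{lstlisting}

Up to floating point roundoff, its output should match \texttt{coop::cosine()} on a dense matrix.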
\subsection{Dense Vector-Vector Input}\label{dense-vector-vector-input}

Given two \(n\)-length vectors \(x\) and \(y\) (inputs):

\begin{enumerate}
\item Compute \texttt{crossprod = t(x) \%*\% y} (using the \texttt{\_gemm} BLAS function).
\item Compute the squares of the Euclidean norms of \(x\) and \(y\) (using the \texttt{\_syrk} BLAS function).
\item Divide the crossproduct from step 1 by the square root of the product of the squared norms from step 2.
\end{enumerate}

The total number of floating point operations is:

\begin{enumerate}
\item \(2n-1\) for the crossproduct.
\item \(4n-2\) for the two squared norms.
\item \(3\) for the product, square root, and division.
\end{enumerate}

The algorithmic complexity is \(O(n)\). The storage complexity is \(O(1)\).


\subsection{Sparse Matrix Input}\label{sparse-matrix-input}

Given an \(m\times n\) sparse matrix \(A\) stored in COO format with row/column indices \(i\) and \(j\) \textbf{sorted by column first, then row} and a corresponding data vector \(a\) (inputs), and given a preallocated \(n\times n\) dense matrix \(C\) (output):

\begin{enumerate}
\item Initialize \(C\) to 0.
\item For each column \(j\) of the conceptually dense matrix \(A\) (call it \(x\)), find its first and final position in the COO storage.
\begin{enumerate}
\item If \(x\) is missing (its entries are all 0), set the \(j\)'th row and column of the lower triangle of \(C\) to \texttt{NaN} (for compatibility with dense routines) and proceed to the next column.
\item Otherwise, for each column \(i>j\) of \(A\) (call it \(y\)), find its first and final position in the COO storage.
\item Compute the dot product of \(x\) and \(y\), \(x\cdot y\).
\item If \(x\cdot y > \epsilon\) (\(\epsilon=\)\texttt{1e-10} for us):
\begin{itemize}
\item Compute the dot products of \(x\) with itself, \(x\cdot x\), and of \(y\) with itself, \(y\cdot y\).
\item Set the \((i, j)\)'th entry of \(C\) to \(\frac{x\cdot y}{\sqrt{x\cdot x}\sqrt{y\cdot y}}\).
\end{itemize}
\end{enumerate}
\item Copy the lower triangle to the upper and set the diagonal to 1.
\end{enumerate}

The worst case runtime complexity occurs when the matrix is dense but stored as a sparse matrix; it is \(O(mn^2)\), the same as in the dense case. However, the access pattern will cause serious cache thrashing, and the performance will be abysmal.

The function stores the \(j\)'th column's data and row indices in temporary storage for better cache access patterns. In the best case, this requires 12~KiB of additional storage, with 8~KiB for the data and 4~KiB for the indices. In the worst case (an all-dense column), this balloons to \(12m\) bytes. The storage complexity is therefore \(O(1)\) in the best case and \(O(m)\) in the worst.
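As with the dense case, a small pure-R sketch may help fix ideas. It follows the steps above, but replaces the sorted-storage bookkeeping with naive \texttt{which()} scans (so it does not rely on the column-sorted ordering the compiled code exploits); the function and argument names are ours for illustration, not \textbf{coop}'s internals.

\begin{lstlisting}[language=rr]
# Minimal pure-R sketch of the COO algorithm above; illustrative only.
# i, j: row/column indices of the nonzeros; a: their values;
# n: number of columns of the conceptually dense matrix.
cosine_coo_sketch <- function(i, j, a, n, eps=1e-10) {
  C <- matrix(0, n, n)
  for (jj in seq_len(n)) {
    xi <- which(j == jj)                 # COO positions of column jj
    if (length(xi) == 0L) {              # all-zero column: NaN its row and
      C[jj:n, jj] <- C[jj, 1:jj] <- NaN  # column, as the dense routine would
      next
    }
    for (ii in seq_len(n - jj) + jj) {   # columns ii > jj
      yi <- which(j == ii)
      if (length(yi) == 0L) next
      common <- intersect(i[xi], i[yi])  # rows where both columns are nonzero
      xy <- sum(a[xi][match(common, i[xi])] * a[yi][match(common, i[yi])])
      if (xy > eps)
        C[ii, jj] <- xy / (sqrt(sum(a[xi]^2)) * sqrt(sum(a[yi]^2)))
    }
  }
  C[upper.tri(C)] <- t(C)[upper.tri(C)]  # copy lower triangle to upper
  diag(C) <- 1
  C
}
\end{lstlisting}

On a \texttt{slam::simple\_triplet\_matrix} \texttt{s}, this could be called as \texttt{cosine\_coo\_sketch(s\$i, s\$j, s\$v, ncol(s))}.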
\section{Benchmarks}\label{benchmarks}

The source code for all benchmarks presented here can be found in the source tree of this package under \texttt{inst/benchmarks/}, or in the binary installation under \texttt{benchmarks/}.

All benchmarks were performed using:

\begin{itemize}
\item R 3.2.2
\item OpenBLAS
\item gcc 5.2.1
\item 4 cores of a Core i5-2500K CPU @ 3.30GHz
\end{itemize}

Throughout the benchmarks, we will use the following package and settings:

\begin{lstlisting}[language=rr]
library(rbenchmark)
reps <- 100
cols <- c("test", "replications", "elapsed", "relative")
\end{lstlisting}


\subsection{Dense Matrix Input}\label{dense-matrix-input-1}

Compared to the version in the \textbf{lsa} package (as of 27-Oct-2015), this implementation performs quite well:

\begin{lstlisting}[language=rr]
m <- 2000
n <- 200
x <- matrix(rnorm(m*n), m, n)

benchmark(coop::cosine(x), lsa::cosine(x), columns=cols, replications=reps)
##              test replications elapsed relative
## 1 coop::cosine(x)          100   0.177    1.000
## 2  lsa::cosine(x)          100 113.543  641.486
\end{lstlisting}


\subsection{Dense Vector-Vector Input}\label{dense-vector-vector-input-1}

Here the two perform essentially identically:

\begin{lstlisting}[language=rr]
n <- 1000000
x <- rnorm(n)
y <- rnorm(n)

benchmark(coop::cosine(x, y), lsa::cosine(x, y), columns=cols, replications=reps)
##                 test replications elapsed relative
## 1 coop::cosine(x, y)          100   0.757    1.000
## 2  lsa::cosine(x, y)          100   0.768    1.015
\end{lstlisting}


\subsection{Sparse Matrix Input}\label{sparse-matrix-input-1}

Benchmarking sparse matrix methods can be more challenging than benchmarking dense ones for a variety of reasons, chief among them that the level of sparsity has an enormous impact on performance. We present two cases of differing sparsity here.

First, we generate a 0.1\% dense / 99.9\% sparse matrix:

\begin{lstlisting}[language=rr]
m <- 6000
n <- 250
dense <- coop:::dense_stored_sparse_mat(m, n, .001)
sparse <- slam::as.simple_triplet_matrix(dense)
\end{lstlisting}

This gives us a fairly dramatic difference in storage:

\begin{lstlisting}[language=rr]
memuse::memuse(dense)
## 11.444 MiB

memuse::memuse(sparse)
## 24.445 KiB
\end{lstlisting}

So the dense matrix needs roughly 479 times as much storage for the exact same data. In such very sparse cases, the sparse implementation performs quite nicely:

\begin{lstlisting}[language=rr]
benchmark(dense=coop::cosine(dense), sparse=coop::cosine(sparse),
          columns=cols, replications=reps)
##     test replications elapsed relative
## 1  dense          100   0.712    3.082
## 2 sparse          100   0.231    1.000
\end{lstlisting}

Note that this is a 3-fold speedup over our already highly optimized dense implementation. This is quite nice, especially considering that the sparse implementation uses only one thread and limited vectorization, while the dense one uses 4 threads and vectorization.

However, as the matrix becomes more dense (and it doesn't take much), dense methods begin to perform better:

\begin{lstlisting}[language=rr]
dense <- coop:::dense_stored_sparse_mat(m, n, .01)
sparse <- slam::as.simple_triplet_matrix(dense)

memuse::memuse(dense)
## 11.444 MiB

memuse::memuse(sparse)
## 235.383 KiB

benchmark(dense=coop::cosine(dense), sparse=coop::cosine(sparse),
          columns=cols, replications=reps)
##     test replications elapsed relative
## 1  dense          100   0.707    1.000
## 2 sparse          100   2.076    2.936
\end{lstlisting}

While the sparse implementation performs significantly worse than the dense one at this level of sparsity and data size, note that the dense representation still uses roughly 50 times more memory than the sparse one.
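To put the memory side of this tradeoff in rough numbers: a dense double precision matrix needs \(8mn\) bytes, while a COO triplet like \textbf{slam}'s \texttt{simple\_triplet\_matrix} needs roughly 16 bytes per stored nonzero (8 for the double value and 4 for each of the two integer indices), ignoring object overhead. The following back-of-the-envelope helper is ours for illustration, not part of any package:

\begin{lstlisting}[language=rr]
# Rough dense-vs-COO storage ratio; ignores R object overhead
storage_ratio <- function(m, n, density) {
  dense  <- 8 * m * n                       # 8-byte doubles
  sparse <- (8 + 4 + 4) * density * m * n   # value plus two int indices
  dense / sparse
}

storage_ratio(6000, 250, .001)  # ~500, close to the ~479x measured above
\end{lstlisting}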
It is hard to give perfect advice for when to use a dense or sparse method, but a general rule of thumb is that if you have more than 5\% non-zero data, definitely use dense methods. For 1--5\%, there is a memory/runtime tradeoff worth considering; if you can comfortably store the matrix densely, then by all means use dense methods. For data \textless{}1\% dense, sparse methods will generally have better runtime performance than dense methods.

\addcontentsline{toc}{section}{References}
\bibliography{./include/coop}
\bibliographystyle{plain}

\end{document}