\documentclass[nojss]{jss} %% -- LaTeX packages and custom commands --------------------------------------- %% recommended packages \usepackage{orcidlink,thumbpdf,lmodern} %% another package (only for this demo article) \usepackage{tikz,bbm,amsmath,amssymb} %% new custom commands \newcommand{\class}[1]{`\code{#1}'} \newcommand{\fct}[1]{\code{#1()}} %% -- Article metainformation (author, title, ...) ----------------------------- %% - \author{} with primary affiliation (and optionally ORCID link) %% - \Plainauthor{} without affiliations %% - Separate authors by \And or \AND (in \author) or by comma (in \Plainauthor). %% - \AND starts a new line, \And does not. \author{ Laura Vana-G\"ur~\orcidlink{0000-0002-9613-7604}\\TU Wien \And Roman Parzer~\orcidlink{0000-0003-0893-3190}\\TU Wien \And Peter Filzmoser~\orcidlink{0000-0002-8014-4682}\\TU Wien } \Plainauthor{Laura Vana G\"ur, Roman Parzer, Peter Filzmoser} %% - \title{} in title case %% - \Plaintitle{} without LaTeX markup (if any) %% - \Shorttitle{} with LaTeX markup (if any), used as running title \title{\pkg{spareg}: Sparse Projected Averaged Regression in \proglang{R}} \Plaintitle{spareg: Sparse Projected Averaged Regression in R} \Shorttitle{Sparse Projected Averaged Regression in \proglang{R}} %% - \Abstract{} almost as usual \Abstract{ The \pkg{spareg} package for \proglang{R} builds ensembles of predictive generalized linear models for high-dimensional data. It employs an algorithm that combines variable screening and random projection techniques in each model of the ensemble to address the computational challenges of large predictor sets. By implementing this modeling approach in an accessible framework, \pkg{spareg} enables users to apply methods that have shown competitive predictive performance against state-of-the-art alternatives, while at the same time keeping computational costs low. Designed with extensibility in mind, \pkg{spareg} implements the screening and random projection techniques, as well as the generalized linear models employed in the ensemble as \proglang{S}3 classes with user-friendly constructor functions. This modular design allows users to seamlessly integrate and develop new procedures, making the package a versatile tool for high-dimensional applications. 
} \Keywords{random projection, variable screening, ensemble learning, \proglang{R}} \Plainkeywords{random projection, variable screening, ensemble learning, R} \Address{ Laura Vana-G\"ur\\ Computational Statistics (CSTAT)\\ Institute of Statistics and Mathematical Methods in Economics\\ TU Wien\\ Wiedner Hauptstra{\ss}e 8-10\\ 1040 Vienna, Austria\\ E-mail: \email{laura.vana.guer@tuwien.ac.at} Roman Parzer\\ Computational Statistics (CSTAT)\\ Institute of Statistics and Mathematical Methods in Economics\\ TU Wien\\ Wiedner Hauptstra{\ss}e 8-10\\ 1040 Vienna, Austria\\ E-mail: \email{romanparzer1@gmail.com} Peter Filzmoser\\ Computational Statistics (CSTAT)\\ Institute of Statistics and Mathematical Methods in Economics\\ TU Wien\\ Wiedner Hauptstra{\ss}e 8-10\\ 1040 Vienna, Austria\\ E-mail: \email{peter.filzmoser@tuwien.ac.at} } <>= options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE) library("ggplot2") library("spareg") is_paper <- FALSE tmpdir <- tempdir() @ \newif\ifPositive <>= if (is_paper) { cat("\\Positivetrue") } else { cat("\\Positivefalse") } @ % \VignetteIndexEntry{spareg: Sparse Projected Averaged Regression in R} % \VignetteDepends{robustbase,cellWise,VariableScreening,R.matlab,doParallel,doRNG,ggpubr} % \VignetteKeyword{random projection} % \VignetteKeyword{screening} % \VignetteKeyword{ensemble learning} \begin{document} \section{Introduction} \label{sec-intro} The \pkg{spareg} package for \proglang{R} \citep{RLanguage} offers functionality for estimating generalized linear models (GLMs) in high-dimensional settings, where the number of predictors $p$ can significantly exceed the number of observations $n$, i.e.,~$p>n$ or even $p\gg n$. To address the challenges of high dimensionality, the package implements a general algorithm which integrates variable screening methods with random projection techniques to effectively reduce the predictor space. More specifically, \pkg{spareg} builds an ensemble of GLMs where, in each model of the ensemble, i)~variables are first screened based on a measure of the utility of each predictor for the response, ii)~the selected variables are then projected to a lower dimensional space (smaller than $n$) using a random projection matrix, iii)~a GLM is estimated using the projected predictors. Finally, additional sparsity in the coefficients of the original variables can be introduced through a thresholding parameter, which, together with the number of models in the ensemble, can be chosen using a validation set or via cross-validation. The final coefficients are then obtained by averaging over the marginal models in the ensemble. The motivation of such an algorithm, which performs what we call a \emph{sparse projected averaged regression} (SPAR) for both discrete and continuous data in the GLM framework, lies in its computational efficiency. Random projection is a computationally-efficient method which linearly maps a set of points in high dimensions into a much lower-dimensional space and random projection matrices have been traditionally generated from suitable distributions in a data-agnostic fashion. By projecting the original predictors to a dimension lower than $n$, estimation of the models on the reduced predictors can be done using standard methods. However, random projection can suffer from noise accumulation for very large $p$, as too many irrelevant predictors are being considered for prediction purposes \citep{Dunson2020TargRandProj}. 
Therefore, the screening step is advisable in order to eliminate the influence of irrelevant variables before performing the random projection, while also reducing computational costs. The ensemble approach is motivated by the fact that, although combining variable screening with random projection effectively reduces the predictor set and the computational costs, it introduces variability through random sampling, which can be mitigated by averaging the results from multiple iterations \citep{Thanei2017RPforHDR}. Different variants of the algorithm have been shown to perform well in terms of predictive power on a variety of data sets. For example, \cite{Dunson2020TargRandProj} employ the algorithm with the data-agnostic, sparse random projection of \cite{ACHLIOPTAS2003JL} combined with probabilistic marginal correlation screening. Furthermore, \cite{parzer2024glms} show that, when paired with a carefully constructed data-driven random projection, the algorithm delivers superior predictions and variable rankings in settings with different degrees of sparsity in the coefficients and with correlated predictors.
Given the variety of screening coefficients and random projection matrices that can be employed in the algorithm -- each with distinct advantages across different data settings -- package \pkg{spareg} \citep{pkg:spareg} offers a diverse selection of screening coefficients and multiple procedures for constructing random projection matrices, and is available from the Comprehensive \proglang{R} Archive Network (CRAN) at \url{https://CRAN.R-project.org/package=spareg}. Designed for flexibility, the package provides a versatile framework that allows users to extend its screening and projection techniques, as well as the GLMs employed in the ensemble, with custom procedures. This is achieved through \proglang{R}'s \proglang{S}3 classes and user-friendly constructor functions. The package provides methods such as \code{plot}, \code{predict}, \code{coef}, \code{print}, which allow users to more easily interact with the model output and analyze the results. The GLM framework, especially when combined with random projections which preserve information on the original coefficients \citep[such as the one in][]{parzer2024glms}, facilitates interpretability of the model output, allowing users to understand variable effects.
While, to the best of our knowledge, \pkg{spareg} offers the first implementation of the described algorithm and no other package offers the same functionality for GLMs, a few other \proglang{R} packages focus on building ensembles where the dimensionality of the predictors is reduced. Most notably, package \pkg{RPEnsemble} \citep{RPEnsembleR} implements the procedure in \cite{cannings2017random}, where ``carefully-selected'' random projections are used for projecting the predictors before they are employed in a classifier such as $k$~nearest neighbors, linear or quadratic discriminant analysis. On the other hand, package \pkg{RaSEn} \citep{pkg:RaSEn} implements an algorithm for ensemble classification and regression problems, where random subspaces are generated and the optimal one is chosen to train a weak learner on the basis of some criterion.
Similarly, package \pkg{randomGLM} \citep{pkg:randomGLM} implements an ensemble predictor based on bootstrap aggregation (bagging) of GLMs whose covariates are first randomly selected and then further reduced using forward regression according to AIC criterion. For \proglang{Python} \citep{Python}, a similar workflow could be set up by building ensembles using the package \pkg{pycobra} \citep{pkg:pycobra} by integrating \pkg{scikit-learn} models \citep{pedregosa2011scikit}. Random projection tools are implemented in \pkg{sklearn.random\_projection} while GLM estimation on the projected predictors (both classical and penalized) can be done using \pkg{sklearn.linear\_model} module. Alternatively, ensembles can be built by using native stacking. Since the recent integration of the \proglang{H2O} machine learning models \citep{h2o} in \proglang{Stata} \citep{Stata}, it is also possible to build ensembles of machine learning models (including GLMs) via stacking, boosting or bagging, but random subspace methods or similar approaches to dimensionality reduction in GLMs are not implemented at the time of writing. % The rest of the paper is organized as follows: Section~\ref{sec-models} provides an overview of the methods and the methodological details of the implemented algorithm. The package is described in Section~\ref{sec-software} while Section~\ref{sec-extensibility} exemplifies how a new screening coefficient, a new random projection and a new marginal model can be integrated in the package. \ifPositive Section~\ref{sec-illustrations} contains two examples of employing the package on real data sets. \fi Section~\ref{sec-conclusion} concludes. \section{Methods} \label{sec-models} % % The package implements a procedure for building an ensemble % of GLMs where we employ screening and random projection % to the predictor matrix pre-model estimation for the purpose of % dimensionality reduction. Throughout the section we assume to observe high-dimensional data $\{(\boldsymbol{x}_i,y_i)\}_{i=1}^n$, where $\boldsymbol{x}_i\in\mathbb{R}^p$ is a predictor vector and $y_i\in\mathbb{R}$ is the response, with $p\gg n$. The predictor vectors are collected in the rows of the predictor matrix $X\in \mathbb R^{n\times p}$. \subsection{Variable screening} \label{sec-varscreen} % In high-dimensional modeling, the goal of variable screening is to reduce the predictor set by selecting a small subset of variables with a strong \emph{utility} to the response variable. This initial selection enables more efficient downstream analyses by discarding less relevant predictors early in the modeling process, thus reducing computational costs and potential noise accumulation stemming from irrelevant variables \citep[see e.g.,][]{Dunson2020TargRandProj}. The field of variable screening is an active area of research, with numerous methods developed to address different data settings. While a comprehensive review is beyond the scope of this paper, this section provides a concise and selective overview of key approaches for variable screening in high-dimensional settings. 
% A classic approach is sure independence screening (SIS), proposed by \cite{Fan2007SISforUHD}, which uses the vector of marginal empirical correlations $\hat{\boldsymbol\omega}=(\hat\omega_1,\ldots ,\hat\omega_p)^\top\in\mathbb{R}^p$, $\hat\omega_j=\text{Cor}(\boldsymbol X_{j},\boldsymbol y)$, where $\boldsymbol y$ is the $(n\times 1)$ vector of responses and $\boldsymbol X_{j}$ is the $j$-th column of the matrix of predictors, to screen predictors in a linear regression setting by selecting the variable set $\mathcal{A}_\delta$ which contains all variables for which $|\hat\omega_j|>\delta$, where $\delta$ is a suitable threshold.
%
Under certain technical conditions, this screening coefficient has the \emph{sure screening property}, i.e.,~the set of truly active variables is included in $\mathcal{A}_{\delta_n}$, for an appropriately chosen sequence of thresholds $\delta_n$, with probability converging to one as $n\to \infty$. Extensions to SIS include modifications for GLMs \citep{Fan2010sisglms}, where screening is performed based on the log-likelihood $\ell(.)$ or the slope coefficient of the GLM containing only $\boldsymbol X_{j}$ as a predictor: $\hat\omega_j:= \text{argmin}_{\beta_j\in\mathbb{R}}\text{min}_{{\beta_0}\in \mathbb{R}}\sum_{i=1}^n -\ell(\beta_j,\beta_0;y_i,x_{ij})$, where $x_{ij}$ is the $j$-th entry of the vector $\boldsymbol x_i$. However, both mentioned approaches face limitations related to the required technical conditions, which can rule out practically relevant scenarios where an important variable is marginally uncorrelated with the response due to multicollinearity among the predictors. To tackle these issues, \cite{fan2009ultrahigh} propose an iterative procedure in which SIS is applied repeatedly to the residuals of the model estimated in the previous step. Additionally, in a linear regression setting, \cite{cho2012high} propose using the tilted correlation, i.e.,~the correlation of a tilted version of $\boldsymbol X_{j}$ with $\boldsymbol y$ where the effect of other variables is reduced. \cite{Wang2015HOLP} propose to take into account the correlation among the predictors by employing the high-dimensional ordinary least squares projection (HOLP) $\hat{\boldsymbol \omega}=X^\top (XX^\top)^{-1}\boldsymbol y$, which is a ridge estimator where the penalty term converges to zero. For GLMs, joint feature screening via a sparsity-restricted maximum likelihood estimator -- under an $\ell_0$ norm -- has been proposed in \cite{SMLE2014}.
In order to tackle potential model misspecification, a rich stream of literature focuses on developing semi- or non-parametric alternatives to SIS. For linear regression, approaches include using the ranked correlation \citep{zhu2011model}, (conditional) distance correlation \citep{li2012feature,wang2015conditional} or quantile correlation \citep{ma2016robust}. For GLMs, \cite{fan2011nonparametric} extend \cite{Fan2010sisglms} by fitting a generalized additive model with B-splines. Further extensions for discrete (or categorical) outcomes include the fused Kolmogorov filter \citep{mai2013kolmogorov} and the mean conditional variance, i.e.,~the expectation over $\boldsymbol X_{j}$ of the variance over the response of the conditional empirical distribution function of each predictor given the response \citep{cui2015model}. \cite{ke2023sufficient} propose a model-free method where the contribution of each individual predictor is quantified marginally and conditionally -- in the presence of the control variables as well as the other candidates -- by reproducing-kernel-based $R^2$ and partial $R^2$ statistics.
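To make the basic idea concrete, the following self-contained sketch (plain \proglang{R}, not relying on \pkg{spareg}, with all objects hypothetical) computes SIS-type marginal correlation coefficients for simulated data and retains the \code{p0} variables with the largest absolute coefficients; it also shows the probabilistic variant in which variables are sampled with probabilities proportional to $|\hat\omega_j|$.
<<eval=FALSE>>=
## Illustrative sketch of marginal correlation screening (not package code)
set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1:5] %*% rep(2, 5) + rnorm(n)
omega <- as.vector(cor(x, y))     # marginal screening coefficients
p0 <- 2 * n                       # number of variables to retain
## fixed screening: keep the p0 largest |omega_j|
keep_fixed <- order(abs(omega), decreasing = TRUE)[seq_len(p0)]
## probabilistic screening: sample p0 variables with probability prop. to |omega_j|
keep_prob <- sample(seq_len(p), size = p0, prob = abs(omega))
@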
The \proglang{R} landscape for variable screening techniques is very rich. Some notable packages on CRAN include the following. Package \pkg{SIS} \citep{SISR} implements the (iterative) sure independence screening procedure and its extensions, as detailed in \cite{Fan2007SISforUHD,Fan2010sisglms,fan2010high}. This package also provides functionality for estimating a penalized generalized linear model or a Cox regression model for the variables selected by the screening procedure. Package \pkg{VariableScreening} \citep{pkg:VariableScreening} offers screening methods for independent and identically distributed (iid) data, varying-coefficient models, and longitudinal data, and includes techniques such as sure independent ranking and screening (SIRS), which ranks the predictors by their correlation with the rank-ordered response, or distance correlation sure independence screening (DC-SIS), a non-parametric extension of the correlation coefficient. Package \pkg{MFSIS} \citep{pkg:MFSIS} provides a collection of model-free screening techniques including SIRS, DC-SIS, the fused Kolmogorov filter \citep{mai2015fusedkolmogorov}, and the projection correlation method using knock-off features \citep{liu2020knockoff}, among others. Additional packages implement specific procedures, but their review is beyond the scope of the current paper.
Package \pkg{spareg} allows the integration of such (advanced) screening techniques using a flexible framework, which in turn enables users to apply various screening methods tailored to their data characteristics in the algorithm generating the ensemble. This flexibility allows users to evaluate different strategies, ensuring that the most effective approach is chosen for the specific application at hand. Moreover, it incorporates probabilistic screening strategies, which can be particularly useful in ensembles, as they enhance the diversity of predictors across ensemble models. Instead of relying on a fixed threshold $\delta$, predictors are sampled with probabilities proportional to their screening coefficients \citep[see][]{Dunson2020TargRandProj,parzer2024glms}. Note that this is different from the random subspace sampling employed in, e.g.,~random forests.
%
\subsection{Random projection tools} \label{sec-rps}
%
Package \pkg{spareg} has been designed to allow the incorporation of various random projection techniques, enabling users to tailor the procedure to their specific data needs. Below, we provide background information on different random projection techniques and an overview of existing software implementing random projections.
The random projection method relies on the Johnson-Lindenstrauss (JL) lemma \citep{JohnsonLindenstrauss1984}, which asserts that for each set of points in $p$~dimensional Euclidean space collected in the rows of $X\in \mathbb{R}^{n\times p}$, there exists a linear map $\Phi\in \mathbb{R}^{m \times p}$ such that all pairwise distances are approximately preserved within a factor of $(1\pm\epsilon)$ for $m\geq m_l=\mathcal O(\epsilon^{-2}\log(n))$. Computationally, an attractive feature of the method for high-dimensional settings is that the bound does not depend on $p$. The goal is therefore to choose a random map $\Phi$ which, provided it fulfills certain technical conditions, satisfies the JL lemma with high probability.
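The distance-preservation property can be illustrated directly: the short sketch below (plain \proglang{R}, independent of \pkg{spareg}) projects $n=20$ points from $p=5000$ to $m=200$ dimensions with a standard normal random matrix, scaled by the customary factor $1/\sqrt{m}$ (an assumption of this sketch), and inspects the ratios of pairwise distances before and after the projection, which concentrate around one.
<<eval=FALSE>>=
## Illustrative sketch of the JL property for a Gaussian random projection
set.seed(1)
n <- 20; p <- 5000; m <- 200
x <- matrix(rnorm(n * p), n, p)
phi <- matrix(rnorm(m * p), m, p) / sqrt(m)  # scaled Gaussian random projection
z <- x %*% t(phi)                            # projected points (n x m)
summary(as.vector(dist(z) / dist(x)))        # pairwise distance ratios near 1
@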
The literature focuses on constructing such matrices by sampling them from some ``appropriate'' distribution, by inducing sparsity in the matrix, and/or by employing specific fast constructions which lead to efficient matrix-vector multiplications. It turns out that the conditions are generally satisfied by nearly all sub-Gaussian distributions \citep{matouvsek2008variants}. A common choice is the standard normal distribution $\Phi_{ij} \overset{iid}{\sim} N(0,1)$ \citep{FRANKL1988JLSphere} or a sparser version where $\Phi_{ij}\overset{iid}{\sim} N(0,1/\sqrt{\psi})$ with probability $\psi$ and $0$ otherwise \citep{matouvsek2008variants}. Another computationally simpler option is the Rademacher distribution where $\Phi_{ij} = \pm 1/\sqrt{\psi}$ with probability $\psi/2$ and zero otherwise for $0<\psi\leq 1$, where \cite{ACHLIOPTAS2003JL} shows results for $\psi=1$ and $\psi=1/3$ while \cite{LiHastie2006VerySparseRP} recommend using $\psi=1/\sqrt{p}$ to obtain very sparse matrices.
%
Further approaches include using the Haar measure to generate random orthogonal matrices \citep{cannings2017random} or a non-sub-Gaussian distribution like the standard Cauchy, proposed by \cite{LiHastie2006VerySparseRP} for preserving approximate $\ell_1$ distances in settings where the data is high-dimensional, non-sparse, and heavy-tailed. Structured matrices, which allow for more efficient multiplication, have also been proposed \citep[see e.g.,][]{ailon2009fast,Clarkson2013LowRankApprox}.
The conventional random projections mentioned above are data-agnostic. However, recent work has proposed incorporating information from the data either to select the ``best'' data-agnostic random projection or to directly inform the random projection procedure. \cite{cannings2017random} rely on the former approach and build an ensemble classifier where the random projection matrix is chosen by selecting the one that minimizes the test error of the classification problem among a set of data-agnostic random projections.
%
Another data-driven approach to random projection for regression has been proposed by \cite{ryder2019asymmetric}, who introduced a data-informed random projection using an asymmetric transformation of the predictor matrix without using information on the response.
%
On the other hand, \cite{parzer2024glms} propose to use a random projection matrix for GLMs which directly incorporates information about the relationship between the predictors and the response, rather than a projection matrix which satisfies the JL lemma. In the linear regression case, \cite{parzer2024sparse} provide a theoretical bound on the expected gain in prediction error when using a projection which incorporates information about the true regression coefficients compared to a conventional random projection. In particular, they rely on the sparse embedding matrix of \cite{Clarkson2013LowRankApprox}, constructed as $\Phi=BD\in \mathbb{R}^{m\times p}$, where $B$ is an $(m\times p)$ binary matrix in which, for each column~$j$, an index is uniformly sampled from $\{1,\ldots,m\}$ and the corresponding entry is set to one, and $D$ is a $(p\times p)$ diagonal matrix with entries $d_j \sim \text{Unif}(\{-1, 1\})$. They provide theoretical results when replacing the diagonal entries of $D$ with the true regression coefficients. Motivated by this result, they propose to construct such a projection matrix where the random diagonal elements are replaced in practice by the HOLP coefficients, i.e.,~ridge coefficients with a penalty converging to zero.
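For concreteness, the sparse embedding $\Phi = BD$ described above can be generated in a few lines. The following sketch is illustrative only and does not reproduce the package's internal code; it uses the \pkg{Matrix} package and samples, for each column of $B$, one row index in which the entry is set to one, together with a random sign on the diagonal of $D$. In the data-driven variant of \cite{parzer2024glms}, the random signs would be replaced by ridge (HOLP-type) coefficients.
<<eval=FALSE>>=
## Illustrative sketch of the sparse embedding Phi = B D (not package code)
library("Matrix")
p <- 1000; m <- 50
rows  <- sample.int(m, p, replace = TRUE)     # sampled row index for each column of B
signs <- sample(c(-1, 1), p, replace = TRUE)  # diagonal of D; the data-driven variant
                                              # replaces these by ridge coefficients
phi <- sparseMatrix(i = rows, j = seq_len(p), x = signs, dims = c(m, p))
@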
The constructed random projection has the advantage of approximately capturing the true regression coefficients in the span of the random projection, i.e.,~it ensures that the true regression coefficients can be recovered approximately after the projection. In \cite{parzer2024glms}, the authors extend this idea to the GLM framework by investigating how a similar ridge coefficient can be used to construct a suitable data-informed projection matrix. Through simulations, they demonstrate that for non-Gaussian families, allowing the penalty to converge to zero is not always optimal. Instead, they propose selecting the smallest penalty value such that the deviance ratio -- the proportion of the null deviance explained -- remains below a specified threshold. Specifically, they suggest using a threshold of 0.8 for non-Gaussian families and 0.99 for Gaussian families.
%
Several packages in \proglang{R} provide functionality for random projections. For instance, package \pkg{RandPro} \citep{RandProR,SIDDHARTH2020100629} allows a Gaussian random matrix (i.e.,~with entries simulated from a standard normal distribution), a sparse matrix \citep{ACHLIOPTAS2003JL,LiHastie2006VerySparseRP} or a matrix generated using the equal probability distribution on the elements $\{-1,1\}$ to be applied to the predictor matrix before employing a $k$~nearest neighbors, support vector machine or naive Bayes classifier on the projected features. Package \pkg{SPCAvRP} \citep{SPCAvRPR} implements sparse principal component analysis, based on the aggregation of eigenvector information from ``carefully-selected'' axis-aligned random projections of the sample covariance matrix. Additionally, package \pkg{RPEnsemble} \citep{RPEnsembleR} implements a similar idea when building the ensemble of classifiers: for each classifier in the ensemble, a collection of (Gaussian, axis-aligned, or Haar) random projection matrices is generated, and the one that minimizes a risk measure for classification on a test set is selected. For \proglang{Python} \citep{Python}, the \pkg{sklearn.random\_projection} module \citep{pedregosa2011scikit} implements two types of unstructured random matrices, namely Gaussian and sparse random matrices.
%
\subsection{Generalized linear models}
%
After we perform an initial screening step followed by a projection step in each marginal model, we assume that the reduced and projected predictors $\boldsymbol{z}_i=\Phi\boldsymbol{x}_i \in \mathbb{R}^m$ together with the response arise from a GLM, with the response having a conditional density from a (reproductive) exponential dispersion family of the form
\begin{align*}\label{eqn:y_density}
f(y_i|\theta_i,\phi) = \exp\Bigl\{\frac{y_i\theta_i- b(\theta_i)}{a(\phi)} + c(y_i,\phi)\Bigr\}, \quad g(\E[y_i|\boldsymbol{z}_i]) = \gamma_0 + (\Phi\boldsymbol{x}_i)^\top\boldsymbol{\gamma}=:\eta_i,
\end{align*}
where $\theta_i$ is the natural parameter, $a(.)>0$ and $c(.)$ are specific real-valued functions determining different families, $\phi$ is a dispersion parameter, and $b(.)$ is the log-partition function normalizing the density to integrate to one. If $\phi$ is known, we obtain densities in the natural exponential family for our responses.
The responses are related to the $m$~dimensional reduced and projected predictors through the conditional mean, i.e.,~the conditional mean of $y_i$ given ${\boldsymbol{z}}_i$ depends on a linear combination of the predictors through an (invertible) link function $g(.)$, where $\gamma_0\in\mathbb{R}$ is the intercept and $\boldsymbol{\gamma}\in\mathbb{R}^m$ is a vector of regression coefficients for the $m$ projected predictors. The implied coefficients of the original predictors $\boldsymbol{x}_i$ can be obtained as $\boldsymbol{\beta}=\Phi^\top\boldsymbol{\gamma}$. Estimates for $\gamma_0\in\mathbb{R}$ and $\boldsymbol{\gamma}\in\mathbb{R}^m$ can be obtained by directly maximizing the log-likelihood below, as generally one chooses $m < n$:
\begin{align*}
\ell(\gamma_0,\boldsymbol{\gamma}) = \sum_{i=1}^n \frac{y_i\theta_i(\gamma_0,\boldsymbol{\gamma},\boldsymbol{z}_i)- b(\theta_i(\gamma_0,\boldsymbol{\gamma},\boldsymbol{z}_i))}{a(\phi)} + c(y_i,\phi),
\end{align*}
where $\theta_i(\gamma_0,\boldsymbol{\gamma},\boldsymbol{z}_i) = (b')^{-1}(g^{-1}(\eta_i))$. However, even if $m < n$, it might still be desirable to add a (e.g.,~small $\ell_2$) penalty to the likelihood of the marginal models -- for the binomial family this can alleviate problems related to separation while keeping the bias low. Moreover, if outlier influence should be reduced, $M$- or trimmed estimators could be employed for obtaining robust estimates of the regression coefficients \citep[see e.g.,][]{cantoni2001robust}.
%
\subsection{SPAR algorithm} \label{sec-algo}
We present the general algorithm for sparse projected averaged regression (SPAR) implemented in package \pkg{spareg}.
\begin{enumerate}
\item Choose family with corresponding log-likelihood $\ell(.)$ and link $g(.)$.
\item Standardize the $(n\times p)$ matrix of predictors $X$ for all families and the vector of responses $\boldsymbol y$ for the Gaussian family by subtracting the sample mean and dividing by the sample standard deviation of each variable.
\item Calculate screening coefficients $\hat{\boldsymbol \omega}$.
\item For $k=1,\dots,M$ models:
\begin{enumerate}
\item If $p>p_0$, where $p_0$ is the number of variables to be screened, screen $p_0$ predictors based on the screening coefficient $\hat{\boldsymbol\omega}$, which yields for model $k$ the screening index set $I_k=\{j_1^k,\dots,j_{p_0}^k\}\subset \{1,\dots,p\}$; if probabilistic screening should be employed, draw the predictors sequentially without replacement using an initial vector of probabilities $p_j\propto |\hat\omega_j|$. Otherwise, select the $p_0$ variables with the highest $|\hat\omega_j|$. If $p \leq p_0$, perform no screening and set $I_k=\{1,\dots,p\}$.
\item Project the screened variables to a randomly drawn dimension $m_k\sim \text{Unif}\{m_l,\dots,m_u\}$ using \emph{projection matrix} $\Phi_k$ to obtain $Z_k=X_{\cdot I_k}\Phi_k^\top \in \mathbb{R}^{n\times m_k}$, where $X_{\cdot I_k}$ contains the columns in $X$ having a column index in $I_k$.
\item Fit a \emph{GLM}-type model of $\boldsymbol y$ on $Z_k$ to obtain estimated coefficients $\widehat{\boldsymbol\gamma}^k\in\mathbb{R}^{m_k}$ and $\hat {\boldsymbol\beta}_{I_k}^k=\Phi_k^\top\widehat {\boldsymbol\gamma}^k$, $\hat {\boldsymbol\beta}_{\bar I_k}^k=0$.
\end{enumerate}
\item For a given threshold $\nu\geq 0$, set all $\hat\beta_j^k$ with $|\hat\beta_j^k|\leq\nu$ to $0$ for all $j,k$.
\item Choose $M$ and $\nu$ via a validation set or cross-validation evaluating a two-dimensional grid of values.
For each training fold, repeat Steps 2 to 5 using fixed index sets $I_k$ and projections $\Phi_k$, and assess the performance on the test fold by a loss function $K(M, \nu)$:
\begin{align*}
(M_{\text{best}},\nu_{\text{best}}) = \text{argmin}_{M,\nu}K(M,\nu).
\end{align*}
\item Combine the models of the ensemble either via the coefficients, by using a \emph{simple average} of the regression coefficients $\hat{\boldsymbol\beta} = \sum_{k=1}^M\hat {\boldsymbol\beta}^k / M$ (this results in averaging of the models on the link level), or via the fitted values, by taking a \emph{simple average} $\hat y_i = \sum_{k=1}^M g^{-1}(\boldsymbol x_i^\top\hat{\boldsymbol\beta}^k) / M$ (here we omit the intercept for the sake of notational simplicity).
\item Output the averaged estimated coefficients and predictions for the chosen $M$ and $\nu$.
\end{enumerate}
%
The proposed algorithm is flexible, allowing for various choices of screening coefficients $\hat{\boldsymbol \omega}$, random projection matrices $\Phi$ and marginal models. Several variants have been introduced in the literature. For example, \cite{Dunson2020TargRandProj} propose a version that uses probabilistic marginal correlation screening along with the sparse random projection method from \cite{ACHLIOPTAS2003JL}, without applying thresholding. More recent work by \cite{parzer2024sparse,parzer2024glms} extends this approach by employing a data-informed random projection that leverages information from the ridge coefficients. In their method, the ridge penalty is chosen as the smallest value such that the deviance ratio (the proportion of null deviance explained) remains below a family-specific threshold. They complement the projection step with probabilistic screening based on the same ridge regression coefficients (see details in Section~\ref{sec-rps}).
%
In order to choose default values for $p_0$, $m_l$, $m_u$ as well as the maximal number of considered models $M$, we rely on previous literature. \cite{parzer2024sparse} find that, when using their proposed data-driven random projection and HOLP screening coefficient in a linear regression setting, the gain in predictive performance when using more than $M=20$ models is marginal. We therefore use $M=20$ as a default value in the algorithm. Regarding the number of screened variables, \cite{parzer2024sparse} propose choosing $p_0=c\cdot n$ as a multiple of $n$ (thus independent of $p$) and find that the best results are achieved for $2\leq c\leq 4$. We thus set $p_0=2n$ as the default value in the algorithm. The dimension $m_k$ is random for the purpose of introducing more variability in the ensemble and reducing the reliance on a fixed (possibly arbitrarily chosen) goal dimension. \cite{Dunson2020TargRandProj} propose using $m_l=2\log(p)$ and $m_u=3n/4$, while \cite{parzer2024glms} use slightly smaller goal dimensions ($m_l=\log(p)$, $m_u=n/2$) to reduce the dimension of the marginal models to be estimated. On the other hand, for random projection matrices satisfying the JL lemma, $m_l$ can be derived from the theoretical bounds for preserving the distances between all pairs of points approximately. We employ as default values the ones proposed in \cite{parzer2024sparse}.
A further modification of the algorithm above is to employ part of the data for estimating the screening coefficients (Step~3) and the remaining data to estimate the ensemble models (Step~4). Such a strategy could avoid issues related to overfitting.
Even though \cite{parzer2024sparse} find that in the linear regression case such a splitting approach does not improve performance, the implementation in \pkg{spareg} allows for data splitting. Finally, when choosing the optimal $\nu$ and $M$ in Step~6 given a grid of values for each parameter, one can pick either the combination delivering the lowest loss or the combination delivering the sparsest model with a loss within one standard deviation of the minimum (this corresponds to the commonly employed ``one-standard-error (1se)'' rule in penalized regression). We note that the cross-validation over the grid of the number of models is not computationally intensive, as we only need to train the maximum number of models considered in the grid on each training fold and evaluate performance by aggregating the first $M_1$, then the first $M_2$, and so on.
\section{Software} \label{sec-software}
%
Package \pkg{spareg} can be installed from CRAN:
<<eval=FALSE>>=
install.packages("spareg")
@
and loaded by
<<>>=
library("spareg")
@
%
In this section, we rely for illustration purposes on a simulated example data set which contains $n=200$ observations of a continuous response \code{y} and $p=2000$ predictors \code{x} to be used as training data, together with a further 100 observations serving as a test set.
<<>>=
set.seed(1234)
example_data <- simulate_spareg_data(n = 200, p = 2000, ntest = 100)
str(example_data)
@
The function \code{simulate_spareg_data()} simulates data from a linear regression model. Using the default values, the data are generated with $\sigma^2=83$, an intercept $\mu=1$ and $\boldsymbol\beta$ coefficients with 100 non-zero entries, where the non-zero entries are uniformly sampled from $\{-3,-2,-1,1,2,3\}$.
\subsection{Main functions and their arguments}
%
The two main functions for fitting the SPAR algorithm are:
%
\begin{Code}
spar(x, y, family = gaussian("identity"), model = NULL, rp = NULL,
  screencoef = NULL, xval = NULL, yval = NULL, nnu = 20, nus = NULL,
  nummods = c(20), measure = c("deviance", "mse", "mae", "class", "1-auc"),
  avg_type = c("link", "response"), parallel = FALSE, inds = NULL,
  RPMs = NULL, seed = NULL, ...)
\end{Code}
%
which implements the algorithm in Section~\ref{sec-algo} without cross-validation and returns an object of class \class{spar}, and
%
\begin{Code}
spar.cv(x, y, family = gaussian("identity"), model = NULL, rp = NULL,
  screencoef = NULL, nfolds = 10, nnu = 20, nus = NULL, nummods = c(20),
  measure = c("deviance", "mse", "mae", "class", "1-auc"),
  avg_type = c("link", "response"), parallel = FALSE, seed = NULL, ...)
\end{Code}
%
which implements the cross-validated procedure and returns an object of class `\code{spar.cv}'. Additionally, the aliases \fct{spareg} and \fct{spareg.cv} are available. The common arguments of these functions are:
\begin{itemize}
\item \code{x} an $n \times p$ numeric matrix of predictor variables,
\item \code{y} numeric response vector of length $n$,
%
\item \code{family} object from \fct{stats::family}; defaults to \fct{gaussian};
%
\item \code{model} an object of class `\code{sparmodel}' which specifies the model employed for each element of the ensemble. Defaults to \code{spar_glm()} for the Gaussian family with identity link and to \code{spar_glmnet()} for all other family-link combinations. The latter uses the \pkg{glmnet} package to penalize the marginal models with a small ridge penalty in order to increase stability in the coefficients. Class `\code{sparmodel}' will further be described in Section~\ref{sec:soft:sparmodel}.
%
\item \code{rp} an object of class \class{randomprojection}. Defaults to \code{NULL}, in which case \code{rp_cw(data = TRUE)} is used. Class `\code{randomprojection}' will further be described in Section~\ref{sec:soft:rps}.
%
\item \code{screencoef} an object of class \class{screencoef}. Defaults to \code{NULL}, in which case no screening is employed. Class `\code{screencoef}' will further be described in Section~\ref{sec:soft:screen}.
%
\item \code{nnu} the number of threshold values $\nu$ which should be considered for thresholding; defaults to 20.
%
\item \code{nus} an optional vector of $\nu$ values to be considered for thresholding. If it is not provided, it defaults to a grid of \code{nnu} values. This grid is generated by including zero and \code{nnu}$-1$ quantiles of the absolute values of the estimated non-zero coefficients from the marginal models, chosen to be equally spaced on the probability scale.
%
\item \code{nummods} the number of models to be considered in the ensemble; defaults to 20. If a vector is provided, all combinations of \code{nus} and \code{nummods} are considered when choosing the optimal $\nu_\text{best}$ and $M_\text{best}$.
%
\item \code{measure} specifies the measure $K(\nu, M)$ based on which the thresholding value $\nu_\text{best}$ and the number of models $M_\text{best}$ should be chosen on the validation set (for \code{spar()}) or in each of the folds (in \code{spar.cv()}). The default value for \code{measure} is \code{"deviance"}, which is available for all families. Other options are mean squared error \code{"mse"} or mean absolute error \code{"mae"} (between responses and predicted conditional means, for all families), as well as \code{"class"} (misclassification error) and \code{"1-auc"} (one minus area under the ROC curve), which are both available only for the binomial family.
%
\item \code{avg_type} specifies the averaging type for computing the validation measure $K(\nu, M)$. It can be either on \code{"link"} or on \code{"response"} level.
%
\item \code{parallel} a logical indicating whether the function should parallelize the estimation of the marginal models, assuming a parallel backend is loaded and available. Defaults to \code{FALSE}.
\end{itemize}
%
Furthermore, \code{spar()} has the specific arguments:
\begin{itemize}
\item \code{xval} and \code{yval} which are used as validation sets for choosing $\nu_\text{best}$ and $M_\text{best}$. If not provided, \code{x} and \code{y} will be employed.
%
\item \code{inds} an optional list of length \code{max(nummods)} containing the column index vectors $I_k$ corresponding to the variables that should be kept after screening for each marginal model; their lengths need to match the dimensions of the corresponding matrices provided in \code{RPMs}.
%
\item \code{RPMs} an optional list of length \code{max(nummods)} which contains the projection matrices to be used in each marginal model.
\end{itemize}
Function \code{spar.cv()} has the specific argument \code{nfolds}, which is the number of folds to be used for cross-validation. It relies on a lightweight version of \code{spar()} as a workhorse, which is called for each fold. The random projections for each model are held fixed throughout the cross-validation to reduce the computational burden. This is achieved by calling \code{spar()} on the whole data set once before starting the cross-validation. The random projections and the screening indicators generated by this function call are then held fixed throughout the cross-validation procedure.
If data-driven random projections are employed, only the data to be used in the random projection will be updated in each fold iteration with the corresponding training data. More details will be provided in Section~\ref{sec-extensibility}.
%
\subsection{Screening coefficients}\label{sec:soft:screen}
%
The objects for creating screening coefficients are implemented as \proglang{S}3 classes \class{screencoef}. These objects are created by several implemented \code{screen_*()} functions,
%
\begin{Code}
screen_*(..., control = list())
\end{Code}
%
which take as arguments \code{...} (to be saved as attributes of the object) and \code{control} (a list of controls to be used in the main function for computing the screening coefficients).
%
The following screening coefficients are implemented in \pkg{spareg}:
\begin{itemize}
\item \code{screen_marglik()} -- computes the screening coefficients as the coefficient of $\boldsymbol X_{j}$ for $j =1,\dots,p$ in a univariate GLM estimated with the \fct{stats::glm} function:
$$
\hat\omega_j:=\text{argmin}_{\beta_j\in \mathbb{R}}\text{min}_{{\beta_0} \in\mathbb{R}}\sum_{i=1}^n -\ell(\beta_0,\beta_j;y_i,x_{ij}).
$$
It allows passing a list of controls through the \code{control} argument to \fct{stats::glm} -- such as \code{weights}, \code{family}, \code{offset} -- e.g.,~\code{screen_marglik(control = list(family = binomial(probit)))}. Note that if \code{family} is not provided in \code{control}, the \code{family} used in \fct{spar} or \fct{spar.cv} will be used.
%
\item \code{screen_cor()} -- computes the screening coefficients as the correlation between $\boldsymbol y$ and $\boldsymbol X_{j}$ using the function \fct{stats::cor}. It allows passing a list of controls through the \code{control} argument to \fct{stats::cor} -- e.g.,~\code{screen_cor(control = list(method = "spearman"))}.
%
\item \code{screen_glmnet()} -- computes by default a ridge coefficient with a small penalty:
$$
\hat{\boldsymbol\omega}:= \text{argmin}_{\boldsymbol\beta\in \mathbb{R}^p}\text{min}_{{\beta_0} \in\mathbb{R}}\sum_{i=1}^n -\ell(\beta_0,\boldsymbol\beta;y_i,\boldsymbol x_i) + \frac{\varepsilon}{2} \sum_{j=1}^p{\beta}_j^2, \, \varepsilon > 0.
$$
The function relies on \code{glmnet::glmnet()}. It uses by default $\alpha = 0$ and a small \code{lambda.min.ratio}. The optimal penalty is chosen using the recommendations in \cite{parzer2024glms}, namely choosing the smallest penalty for which the deviance ratio (the fraction of null deviance explained) is less than 0.99 for the Gaussian family and 0.8 for other families. It allows, however, passing a list of controls through the \code{control} argument to \code{glmnet::glmnet()} -- e.g.,~\code{screen_glmnet(control = list(alpha = 0.5))}.
\end{itemize}
%
As mentioned above, arguments related to the screening procedure can be passed to the \code{screen_*()} function through \code{...}, and will be saved as attributes of the \class{screencoef} object. More specifically, the following attributes are relevant for function \code{spar()}:
\begin{itemize}
\item \code{nscreen} integer giving the number of variables to be retained after screening; if not specified, defaults to $2n$. If $2n > p$, no screening is performed.
%
\item \code{split_data_prop} double between 0 and 1 which indicates the proportion of the data that should be used for computing the screening coefficient.
The remaining data will be used for estimating the marginal models in the SPAR algorithm; if not specified, the whole data will be used for estimating both the screening coefficient and the marginal models. % \item \code{type} character -- either \code{"prob"} (indicating that probabilistic screening should be employed) or \code{"fixed"} (indicating that a fixed set of \code{nscreen} variables should be employed across the ensemble); defaults to \code{type = "prob"}. % \item \code{reuse_in_rp} logical -- indicates whether the screening coefficient should be reused at a later stage in the construction of the random projection. Defaults to \code{FALSE}. \end{itemize} % All implemented \code{screen_*()} functions return an object of class \class{screencoef} which in turn is a list with three elements: \begin{itemize} \item a character \code{name}, % \item \code{generate_fun()} -- an \proglang{R} function for generating the screening coefficient. This function should have the following arguments: \code{x} -- the matrix of standardized predictors -- and \code{y} -- the vector of (standardized in the Gaussian case) responses, and the argument \code{object}, which is a \class{screencoef} object itself. It returns a vector of screening coefficients of length $p$. % \item \code{control}, which is the control list passed by the user in \code{screen_*()}. These controls are arguments which are needed in \code{generate_fun()} in order to generate the desired screening coefficients. \end{itemize} For illustration purposes, consider the object created by calling \code{screen_marglik()}: <<>>= obj <- screen_marglik() @ A user-friendly \code{print} of the \class{screencoef} object is provided: <<>>= obj @ The structure of the object is the following: <<>>= unclass(obj) @ Function \code{generate_fun()} defines the generation of the screening coefficient. Note that it considers the controls in \code{object$control} when calling the \code{stats::glm()} function (unless it is provided, the \code{family} argument in \code{stats::glm()} will be set to the ``global'' family of the SPAR algorithm which is assigned inside the \code{spar()} function as an attribute for the \class{screencoef} object). For convenience, a constructor function \code{constructor_screencoef()} is provided, which can be used to create new \code{screen_*} functions. An example is presented in Section~\ref{sec-extensscrcoef}. % % \subsection{Random projections}\label{sec:soft:rps} % Similar to the screening procedure, the objects for creating random projections are implemented as \proglang{S}3 classes \class{randomprojection} and are created by functions which take \code{...} and a list of controls \code{control} as arguments: % \begin{Code} rp_*(..., control = list()) \end{Code} % The following random projections are implemented in \pkg{spareg}: \begin{itemize} \item \code{rp_gaussian()} -- random projection object where the generated matrix will have iid entries from a normal distribution (defaults to standard normal entries). % \item \code{rp_sparse()} -- random projection object where the generated matrix will be the one in \cite{ACHLIOPTAS2003JL}. The value of $\psi$ can be passed through the \code{control} argument e.g.,~\code{rp_sparse(control = list(psi = 1/3))}. Defaults to \code{psi = 1}. % \item \code{rp_cw()} -- sparse embedding random projection in \cite{Clarkson2013LowRankApprox}. 
This matrix is constructed as $\Phi=BD\in \mathbb{R}^{m\times p}$, where $B$ is an $(m\times p)$ binary matrix in which, for each column $j$, an index is uniformly sampled from $\{1,\ldots,m\}$ and the corresponding entry is set to one, and $D$ is a $(p\times p)$ diagonal matrix with entries $d_j \sim \text{Unif}(\{-1, 1\})$. If specified as \code{rp_cw(data = TRUE)}, the random elements on the diagonal of $D$ are replaced by the ridge coefficients with a small penalty, as introduced in \cite{parzer2024glms}.
\end{itemize}
%
Arguments related to the random projection can be passed through \code{...}, which will then be saved as attributes of the \class{randomprojection} object. More specifically, the following attributes are relevant in the SPAR algorithm and are present in all \class{randomprojection} objects:
\begin{itemize}
\item \code{mslow}: integer giving the minimum dimension to which the predictors should be projected; defaults to $\log(p)$.
%
\item \code{msup}: integer giving the maximum dimension to which the predictors should be projected; defaults to $n/2$, except in the case that the user provides a predictor matrix where the number of variables $p < n/2$, in which case it is set to $p$.
%
\end{itemize}
%
Note that for random projection matrices which satisfy the JL lemma, \code{mslow} can be determined by employing existing results which give a lower bound on the goal dimension in order to preserve the distances between all pairs of points within a factor $(1 \pm \epsilon)$. For example, \cite{ACHLIOPTAS2003JL} shows $m_0 = (4 + 2\tau)\log n/(\epsilon^2/2 - \epsilon^3/3)$ for probability $1 - n^{-\tau}$.
%
The \code{rp_*()} functions return an object of class \class{randomprojection}, which is a list of five elements. The three most important elements are:
\begin{itemize}
\item a character \code{name},
%
\item \code{generate_fun()} -- a function for generating the random projection matrix. This function should have the arguments
\begin{itemize}
\item \code{rp}, which is itself a \class{randomprojection} object;
\item \code{m}, the target dimension;
\item a vector of indices \code{included_vector}, which indicates the column indices of the original variables in the \code{x} matrix to be projected using the random projection. This is needed because screening can be employed before the projection.
\item \code{x}, the matrix of standardized predictors -- can be \code{NULL} if the random projection to be generated is data-agnostic;
\item \code{y}, the vector of (standardized) responses -- can be \code{NULL} if the random projection to be generated is data-agnostic.
\end{itemize}
It returns a matrix or a sparse matrix of class \class{dgCMatrix} of the \pkg{Matrix} package \citep{pkg:Matrix} with \code{m} rows and \code{length(included_vector)} columns.
%
\item \code{control}, which is the control list in \code{rp_*()}. These controls are arguments needed in \code{generate_fun()} in order to generate the desired random projection.
\end{itemize}
For the case where the random projection should incorporate some information related to the data, two more elements can be relevant.
\begin{itemize}
\item Function \code{update_fun()} updates the \class{randomprojection} object with relevant information from the arguments of the \code{spar()} or \code{spar.cv()} function call.
If it is not provided, it defaults to updating the \class{randomprojection} object with a further attribute \code{family_string} which is character version of the \code{family} argument (e.g.,~in the default case it will be \code{"gaussian(identity)"}). In certain constructions, such as the one in \cite{parzer2024glms}, certain data dependent quantities need to be computed only once, not every time a random projection is generated. For example, in \cite{parzer2024glms}, the ridge coefficients are estimated once at the beginning of the algorithm and reused in each projection. In such cases, function \code{update_fun()} can be employed to add data information as attributes of the \class{randomprojection} object to be subsequently used in the \code{generate_fun()} function. If specified by the user, this function should take only \code{...} as an argument. Internally in the \code{spar()} function, all arguments of \code{spar()} are passed to this function. It should return a \class{randomprojection} object. % \item In the case where a list of predefined \code{RPMs} is provided in \code{spar()}, for the data driven random projections we allow the user to specify whether some parts of the given \code{RPMs} should be updated with the provided data. This is possible through another optional function \code{update_rpm_w_data()}. This is particularly relevant for the cross-validation procedure, which employs the random projection matrices generated by calling the \code{spar()} function on the whole data set before starting the cross-validation exercise. For example, in our implementation of the data-driven \code{rp_cw(data = TRUE)}, in each fold, we only update the list of \code{RPMs} by adjusting the diagonal elements to the vector of screening coefficients computed on the training data for the current fold, but do not modify the random elements of the projection matrices in each fold, to reduce the computational burden. Defaults to \code{NULL}. If not provided, the values of the provided \code{RPMs} do not change. \end{itemize} % For illustration purposes, consider the implemented function \code{rp_gaussian()}, which generates a random projection with entries drawn from the normal distribution. The \code{print} method returns key information about the random projection procedure. <<>>= obj <- rp_gaussian() obj @ We turn to looking at the structure of the object: <<>>= unclass(obj) @ The \code{generate_fun()} function returns a matrix with \code{m} rows and \code{length(included_vector)} columns. Note that \code{included_vector} gives the indices of the variables which have been selected by the screening procedure. In this case, the random projection does not use any data information and we are only interested in the length of this vector. The function \code{update_fun()} only converts the \class{family} object to a string and adds it as an attribute. Function \code{update_rpm_w_data()} is \code{NULL} as this random projection is data-agnostic. An example for implementing a new \class{randomprojection} is presented in Section~\ref{sec-ext-rp}. % \subsection{Marginal models}\label{sec:soft:sparmodel} % The package provides a class `\code{sparmodel}' for the marginal model to be fitted for each element of the ensemble. The framework assumes that the model employs a linear predictor, i.e.,~a linear combination of the projected variables. 
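Conceptually, each marginal model is simply a GLM of the response on the reduced predictors $Z_k = X_{\cdot I_k}\Phi_k^\top$. The following sketch is illustrative only (it uses hypothetical objects \code{x}, \code{y}, a projection matrix \code{phi} and a screening index set \code{ind}, rather than the package's internal code) and shows the idea for the Gaussian family, including the back-transformation of the fitted coefficients to the original predictors.
<<eval=FALSE>>=
## Illustrative sketch of one marginal model (not package code)
z <- x[, ind] %*% t(phi)                   # n x m matrix of projected predictors
fit <- glm(y ~ z, family = gaussian())     # GLM on the projected predictors
gammas <- coef(fit)[-1]                    # coefficients of the projected predictors
beta <- numeric(ncol(x))                   # implied coefficients of original predictors
beta[ind] <- as.vector(t(phi) %*% gammas)  # beta_{I_k} = Phi_k' gamma
@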
Similar to the objects for random projection and screening coefficients, the functions which create these objects have arguments \code{...} (to be saved as attributes) and \code{control} (to be used in the main function for building the model). The two functions implemented are \code{spar_glmnet()}, which estimates regularized GLMs as marginal models using the function \code{glmnet::glmnet()} (where the default is to estimate a ridge regression with a small penalty value\footnote{In particular, we use by default the penalty value $\lambda_\text{min}=\lambda_\text{max}/100$, where $\lambda_\text{max}$ is the smallest value that shrinks all coefficients to zero, given by $\lambda_\text{max} = \frac{1}{n\alpha}\max_j|\langle z_j, y \rangle| (g^{-1})'(g(\bar y)) \frac{1}{\text{Var}_y(\mu)|_{\mu = \bar y}}$ with $z_j$ denoting predictors standardized by the square root of their variance (using $n$ in the denominator), and $\text{Var}_y(\mu)$ the variance function of the GLM family evaluated at the mean response. For ridge regression $\alpha=0$ and $\lambda_{\max}=\infty$, but for computational purposes $\alpha=0.001$ is used.}), and \code{spar_glm()}, which estimates unregularized GLMs using \code{stats::glm()}. An object of class `\code{sparmodel}' is a list with elements:
\begin{itemize}
\item a character \code{name},
%
\item \code{model_fun()} -- a function which takes \code{y} (the vector of standardized responses), \code{z} (the matrix of reduced predictors) and a further argument which is the object of class `\code{sparmodel}' itself. It returns a list of two elements: \code{gammas}, which contains the vector of coefficients, and the value of the \code{intercept}.
%
\item \code{update_fun()} -- an optional function, called at the beginning of the SPAR algorithm, which can add further attributes to the `\code{sparmodel}' object. This function returns the `\code{sparmodel}' object after modifying it. In the case of function \code{spar_glmnet()}, this function manipulates the \class{family} object in a way which is convenient for function \code{glmnet::glmnet()}.\footnote{In the case of the Gaussian, binomial and Poisson families with canonical link, the family object is replaced by a string containing the name of the family. This leads to \pkg{glmnet} using the faster specialized algorithms rather than the general algorithm implemented for all \class{family} objects.}
\end{itemize}
%
The default is to use \code{spar_glm()} for the Gaussian family with identity link and \code{spar_glmnet()} for the other families. An example for implementing a new marginal model class is presented in Section~\ref{sec-ext-model}.
%
\subsection{Methods}
%
Methods \code{print}, \code{plot}, \code{coef}, \code{predict} are available for both `\code{spar}' and `\code{spar.cv}' classes.
%
\subsubsection[print]{\code{print}}
The \code{print} method returns information on $\nu_\text{best}$ and $M_\text{best}$, the number of active predictors (i.e.,~predictors which have at least one non-zero coefficient across the marginal models) and a summary of the non-zero non-standardized coefficients. We fit the SPAR algorithm with no screening and the default random projection \code{rp_cw(data = TRUE)}.
<<>>=
set.seed(12)
spar_res <- spar(example_data$x, example_data$y,
  xval = example_data$xtest, yval = example_data$ytest,
  nummods = c(5, 10, 15, 20, 25, 30))
spar_res
@
%
For `\code{spar.cv}' it also provides the same information for the $(\nu, M)$ combination chosen by the one-standard-error rule.
<>=
set.seed(12)
spar_cv <- spar.cv(example_data$x, example_data$y,
  nummods = c(5, 10, 15, 20, 25, 30))
spar_cv
@
%
\subsubsection[coef]{\code{coef}}
Method \code{coef} takes as inputs a `\code{spar}' or `\code{spar.cv}' object, together with further arguments:
\begin{itemize}
\item \code{nummod} -- number of models used to compute the averaged coefficients; the value of \code{nummod} with minimal \code{measure} is used if not provided.
%
\item \code{nu} -- threshold level used to compute the averaged coefficients; the value with minimal \code{measure} is used if not provided.
\item \code{aggregate} -- one of \code{c("mean", "median", "none")}. If set to \code{"none"}, the coefficients of the \code{nummod} marginal models are not aggregated; otherwise they are aggregated over the \code{nummod} marginal models using the specified method (mean or median). Defaults to mean aggregation.
\end{itemize}
Additionally, for `\code{spar.cv}', the \code{coef} method has the argument \code{opt_par}, which is one of \code{c("best", "1se")} and chooses whether to select the best pair from \code{nus} and \code{nummods} according to the cross-validation \code{measure}, or the solution yielding the sparsest vector of coefficients within one standard deviation of that optimal cross-validation \code{measure}. This argument is ignored when \code{nummod} and \code{nu} are given.
It returns an object of class \class{coefspar}, which is a list with the following elements: \code{intercept} (a numeric value if \code{aggregate} is set to \code{"mean"} or \code{"median"}, or a vector of length \code{nummod} if \code{aggregate = "none"}) and \code{beta} (a vector of length $p$ or a matrix of dimension $p\times$\code{nummod} of coefficients), together with the \code{nummod} and \code{nu} employed in the calculation.
For class \class{coefspar}, user-friendly \code{print} and \code{summary} methods are provided:
<<>>=
coef(spar_res)
@
<<>>=
coef(spar_res, aggregate = "none")
@
<<>>=
summary(coef(spar_res, aggregate = "median"))
@
as well as extractor functions \code{get_intercept} and \code{get_coef}, which return the values of the intercept and coefficients, respectively:
<<>>=
get_intercept(coef(spar_res, nummod = 10, aggregate = "none"))
@
%
\subsubsection[predict]{\code{predict}}
Functionality for computing predictions is provided through the \code{predict} method, which takes a `\code{spar}' or `\code{spar.cv}' object, together with
\begin{itemize}
\item \code{xnew} -- matrix of new predictor variables; must have the same number of columns as \code{x}. Defaults to \code{NULL}.
%
\item \code{type} -- the type of required predictions; either on \code{"response"} level (default) or on \code{"link"} level.
%
\item \code{avg_type} -- type of averaging used across the marginal models; either on \code{"link"} (default) or on \code{"response"} level.
%
\item \code{nummod} -- number of models used to compute the averaged coefficients; the value of \code{nummod} with minimal \code{measure} is used if not provided.
%
\item \code{nu} -- threshold level used to compute the averaged coefficients; the value with minimal \code{measure} is used if not provided.
%
\item \code{aggregate} -- one of \code{c("mean", "median")} to be used in aggregating over the \code{nummod} marginal models. Defaults to mean aggregation.
%
\end{itemize}
Additionally, for class `\code{spar.cv}', argument \code{opt_par} is available and used in the computation of the coefficients to be used for prediction (see the description of the \code{coef} method above).
%
<>=
pred <- predict(spar_res, xnew = example_data$xtest)
pred_cv <- predict(spar_cv, xnew = example_data$xtest, opt_par = "1se")
@
%
\subsubsection[plot]{\code{plot}}
%
Plotting functionality is provided through the \code{plot} method, which takes a `\code{spar}' or `\code{spar.cv}' object, together with further arguments:
\begin{itemize}
\item \code{plot_type} -- one of:
\begin{itemize}
\item \code{"val_measure"} plots the (cross-)validation \code{measure} for either a grid of \code{nu} values with a fixed number of models \code{nummod}, or vice versa.
\item \code{"val_numactive"} plots the number of active variables for either a grid of \code{nu} values with a fixed number of models \code{nummod}, or vice versa.
\item \code{"res_vs_fitted"} produces a residuals-vs-fitted plot. The residuals are computed as $r_i=y_i- \widehat y_i$, where $\widehat y_i$ is the prediction computed on the response level.
\item \code{"coefs"} produces a plot of the value of the standardized coefficients for each predictor in each marginal model (before thresholding). For each predictor, the values of the coefficients are sorted from largest to smallest across the marginal models and represented in the plot using a color scale.
\end{itemize}
\item \code{plot_along} -- one of \code{c("nu","nummod")}; for \code{plot_type = "val_measure"} as well as for \code{plot_type = "val_numactive"} it indicates whether the values of the cross-validation measure or the number of active variables, respectively, should be shown for a grid of $\nu$ values while keeping the number of models \code{nummod} fixed, or vice versa. This argument is ignored when \code{plot_type = "res_vs_fitted"} or \code{plot_type = "coefs"}.
\item \code{nummod} -- fixed value for the number of models when \code{plot_along = "nu"} for \code{plot_type = "val_measure"} or \code{"val_numactive"}; if \code{plot_type = "res_vs_fitted"}, it is used in the \code{predict} method, as described above.
\item \code{nu} -- fixed value for $\nu$ when \code{plot_along = "nummod"} for \code{plot_type = "val_measure"} or \code{"val_numactive"}; if \code{plot_type = "res_vs_fitted"}, it is used in the \code{predict} method, as described above.
\item \code{xfit} -- if \code{plot_type = "res_vs_fitted"}, it is the matrix of predictors used in computing the fitted values. This argument must be provided for the plot of residuals and fitted values, as the `\code{spar}' or `\code{spar.cv}' objects do not store the original data.
\item \code{yfit} -- if \code{plot_type = "res_vs_fitted"}, the vector of responses used in computing the residuals. This argument must be provided for the plot of residuals and fitted values, as the `\code{spar}' or `\code{spar.cv}' objects do not store the original data.
\item \code{prange} -- optional vector of length 2 in case \code{plot_type = "coefs"}, which gives the limits of the predictors' plot range; defaults to \code{c(1, p)}.
\item \code{coef_order} -- optional index vector of length $p$ in case \code{plot_type = "coefs"} to give the order of the predictors; defaults to \code{seq_len(p)}.
\end{itemize}
%
The four plots for the `\code{spar}' object are produced with the code below and shown in Figure~\ref{fig:plot_example}.
%
<>=
plot(spar_res)
plot(spar_res, plot_type = "val_numactive")
plot(spar_res, plot_type = "coefs")
plot(spar_res, plot_type = "res_vs_fitted", xfit = example_data$xtest,
  yfit = example_data$ytest)
@
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=\textwidth}
<>=
library(ggplot2)
p1 <- plot(spar_res) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
p2 <- plot(spar_res, plot_type = "val_numactive") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
p4 <- plot(spar_res, plot_type = "res_vs_fitted",
  xfit = example_data$xtest, yfit = example_data$ytest)
p3 <- plot(spar_res, plot_type = "coefs")
p <- ggpubr::ggarrange(p1, p2, p3, p4, ncol = 4, nrow = 1)
p
@
\caption{\code{plot} methods for \class{spar} object. The red dots in the first two panels show the best $\nu$ chosen on the validation set for the optimal number of models $M=5$. \label{fig:plot_example}}
\end{figure}
%
For class `\code{spar.cv}' there is the extra argument \code{opt_par = c("best", "1se")}, which is only used for \code{plot_type = "res_vs_fitted"} and indicates whether the predictions should be based on coefficients using the best $(\nu, M)$ combination or using the combination which delivers the sparsest $\boldsymbol\beta$ having a validation measure within one standard deviation from the minimum. Moreover, the \code{plot} method for object `\code{spar.cv}' with \code{plot_type = "coefs"} will display the coefficients obtained from first running \fct{spar} on the whole data set, before cross-validation is performed.
The \code{plot} methods return objects of class \class{ggplot} \citep{ggplotR}.
%
\subsection{Parallelization}
%
The package supports parallelization in the estimation of the marginal models in the ensemble via package \pkg{foreach} \citep{pkg:foreach}. This is possible by setting the parameter \code{parallel = TRUE} in \code{spar()} or \code{spar.cv()}, assuming that a parallel backend for \pkg{foreach} is registered using the \code{registerDoParallel()} function of package \pkg{doParallel} \citep{pkg:doParallel}. To ensure reproducibility, the \pkg{doRNG} package \citep{pkg:doRNG} can be used.
%
Our experiments suggest that parallelization pays off especially for data sets with a larger number of observations~$n$. A minimal example with correlation-based probabilistic screening and a Gaussian random projection matrix is provided below:
<>=
example_data4 <- simulate_spareg_data(n = 1000, p = 2000, ntest = 1000,
  seed = 123)
library(doParallel)
library(doRNG)
cl <- makeCluster(2, type = "PSOCK")
registerDoParallel(cl)
registerDoRNG(seed = 123)
spar_res_par <- spar(example_data4$x, example_data4$y,
  screencoef = screen_cor(), rp = rp_gaussian(), nummods = 50,
  parallel = TRUE)
stopCluster(cl)
@
%
\section{Extensibility} \label{sec-extensibility}
\subsection{Screening coefficients} \label{sec-extensscrcoef}
%
We exemplify how screening coefficients implemented in package \pkg{VariableScreening} can easily be incorporated into the framework of \pkg{spareg}. We start by defining the function for generating the screening coefficients using the function \code{screenIID()} from \pkg{VariableScreening}.
<<>>=
generate_scr_sirs <- function(y, x, object) {
  res_screen <- do.call(function(...)
    VariableScreening::screenIID(x, y, ...), object$control)
  coefs <- res_screen$measurement
  coefs
}
@
%
Note that \code{screenIID()} also takes \code{method} as an argument. To allow for flexibility, we do not fix the screening method in \code{generate_scr_sirs()} but rather allow the user to pass a method through the \code{control} argument in the \code{screen_*()} function.
The new screening function \code{screen_sirs()} is created using the helper \code{constructor_screencoef()}:
<>=
screen_sirs <- constructor_screencoef(
  "screen_sirs",
  generate_fun = generate_scr_sirs)
@
%
We now call \code{spar()} with the newly created screening procedure. We consider the SIRS method of \cite{zhu2011model}, which ranks the predictors by their correlation with the rank-ordered response. We do not perform probabilistic variable screening but instead employ the top $2n$ variables in each marginal model. The employed random projection is \code{rp_sparse()}, where we set $\psi=1/\sqrt{p}$, as proposed by \cite{LiHastie2006VerySparseRP}.
%
<>=
set.seed(123)
spar_example <- spar(example_data$x, example_data$y,
  screencoef = screen_sirs(type = "fixed",
    control = list(method = "SIRS")),
  rp = rp_sparse(psi = 1/sqrt(ncol(example_data$x))),
  measure = "mse")
spar_example
@
%
\subsection{Random projections}\label{sec-ext-rp}
%
In the following, we exemplify how new random projections can be implemented in the framework of \pkg{spareg}. We implement the random projection of \cite{cannings2017random}, who propose using the Haar measure for generating the random projections. They simulate matrices from the Haar measure by independently drawing each entry of a matrix $Q$ from a standard normal distribution, and then take the projection matrix to be the transpose of the matrix of left singular vectors in the singular value decomposition of $Q$. The helper function below simulates matrices of size $m\times p$ from the Haar measure:
%
<<>>=
simulate_haar <- function(m, p) {
  R0 <- matrix(1/sqrt(p) * rnorm(p * m), nrow = p, ncol = m)
  RM <- qr.Q(qr(R0), complete = FALSE)
  t(RM)
}
@
%
Moreover, \cite{cannings2017random} suggest using ``good'' random projections, in the sense that they deliver the best out-of-sample prediction. The proposed approach employs $B_1$ models in an ensemble of classifiers. For each model $k$, $B_2$ Haar random projections are generated and the one with the lowest error on a test set is chosen to project the variables in model $k$. (Note that, while the $B_2$ random projections are data-agnostic, the whole data set is needed to find the best random projection.)
We will generate the random projections in the following way. For each model $k$, we use a proportion $\xi$ of the data as a test set and for $b = 1,\ldots,B_2$ we generate a Haar random projection. We then estimate a ridge regression on the training data and compute the misclassification error for the binomial family and the MSE for all other families on the test set. Finally, the best out of the $B_2$ projections in terms of minimizing the loss on the test set is chosen.
We can implement such a random projection in \pkg{spareg} by the following steps. First, there are no data-dependent quantities to be shared across all random projections, so we would not strictly need to specify a function \code{update_fun()}. However, we can use \code{update_fun()} to update the \class{randomprojection} object with further information related to the family. Given that we will rely again on \code{glmnet::glmnet()} in each iteration~$b$, the family object is modified to a string containing the name of the family for families Gaussian, binomial and Poisson with canonical link (this will lead to using faster specialized algorithms). This modified family is added to the list of controls as \code{fit_family}. We also set by default $B_2=50$, $\xi=0.25$ and $\alpha = 0$ (in the \pkg{glmnet} model to be estimated inside the function generating the random projection).
<<>>=
update_rp_cannings <- function(...) {
  args <- rlang::list2(...)
  if (is.null(args$rp$control$family)) {
    family_string <- paste0(args$family$family, "(", args$family$link, ")")
    args$rp$control$family_string <- family_string
    family <- args$family
    fit_family <- switch(family$family,
      "gaussian" = if (family$link == "identity") "gaussian" else family,
      "binomial" = if (family$link == "logit") "binomial" else family,
      "poisson" = if (family$link == "log") "poisson" else family,
      family)
    args$rp$control$fit_family <- fit_family
  }
  # ridge penalty (alpha = 0) by default, as described above
  if (is.null(args$rp$control$alpha)) args$rp$control$alpha <- 0
  if (is.null(args$rp$control$B2)) args$rp$control$B2 <- 50
  if (is.null(args$rp$control$xi)) {
    args$rp$control$xi <- 0.25
  } else {
    stopifnot("xi must be between 0 and 1." =
      args$rp$control$xi >= 0 & args$rp$control$xi <= 1)
  }
  args$rp
}
@
As we can see, values for \code{xi} and \code{B2} can be passed by the user through the \code{control} argument, and if they are not specified, they are set to the default values. Next, we specify the function for generating the random projection.
<>=
generate_cannings <- function(rp, m, included_vector, x, y) {
  xs <- x[, included_vector]
  n <- nrow(x); p <- ncol(xs)
  B2 <- rp$control$B2; xi <- rp$control$xi
  id_test <- sample(n, size = n * xi)
  xtrain <- xs[-id_test, ]; xtest <- xs[id_test, ]
  ytrain <- y[-id_test]; ytest <- y[id_test]
  control_glmnet <- rp$control[names(rp$control) %in%
    names(formals(glmnet::glmnet))]
  best_val <- Inf
  family <- eval(parse(text = rp$control$family_string))
  for (b in seq_len(B2)) {
    RM <- simulate_haar(m, p)
    xrp <- tcrossprod(xtrain, RM)
    mod <- do.call(function(...)
      glmnet::glmnet(x = xrp, y = ytrain,
        family = rp$control$fit_family, ...), control_glmnet)
    coefs <- coef(mod, s = min(mod$lambda))
    eta_test <- (cbind(1, tcrossprod(xtest, RM)) %*% coefs)
    pred <- family$linkinv(as.vector(eta_test))
    out_perf <- ifelse(family$family == "binomial",
      mean(((pred > 0.5) + 0) != ytest),
      mean((pred - ytest)^2))
    if (out_perf < best_val) {
      best_val <- out_perf; best_RM <- RM
    }
    rm(RM)
  }
  return(best_RM)
}
@
For the cross-validation procedure implemented in \fct{spar.cv}, we choose not to generate new matrices for each training fold in order to keep computational costs low, so we do not specify a function \code{update_rpm_w_data()}. Putting it all together, we get:
<>=
rp_cannings <- constructor_randomprojection(
  "rp_cannings",
  generate_fun = generate_cannings,
  update_fun = update_rp_cannings
)
@
%
We can now estimate SPAR for a binomial model. We again simulate data from a linear model and then transform the response to a binary scale:
<>=
set.seed(1234)
example_data2 <- simulate_spareg_data(n = 100, p = 1000, ntest = 100)
ystar <- (example_data2$y > 0) + 0
ystarval <- (example_data2$ytest > 0) + 0
@
We use only one value for the number of models, $M=50$ \citep[in line with the recommendations in][]{cannings2017random}, and no screening procedure.
Moreover, we do not perform any thresholding:
<>=
file <- "cannings_example.rda"
if (file.exists(file)) {
  load(file)
} else {
  set.seed(12345)
  Sys.time()
  spar_example_1 <- spar(
    x = example_data2$x, y = ystar,
    xval = example_data2$xtest, yval = ystarval,
    family = binomial(), nus = 0, nummods = 50,
    rp = rp_cannings(control = list(lambda.min.ratio = 0.01)),
    measure = "class"
  )
  Sys.time()
  set.seed(12345)
  spar_example_2 <- spar(
    x = example_data2$x, y = ystar,
    xval = example_data2$xtest, yval = ystarval,
    family = binomial(), rp = rp_cw(data = TRUE),
    nus = 0, nummods = 50, measure = "class"
  )
  save(spar_example_1, spar_example_2, file = file)
}
@
<>=
set.seed(12345)
spar_example_1 <- spar(x = example_data2$x, y = ystar,
  family = binomial(),
  rp = rp_cannings(control = list(lambda.min.ratio = 0.01)),
  nus = 0, nummods = 50,
  xval = example_data2$xtest, yval = ystarval, measure = "class")
@
We can compare this with the results obtained by using the data-driven \code{rp_cw(data = TRUE)}:
<>=
set.seed(12345)
spar_example_2 <- spar(x = example_data2$x, y = ystar,
  family = binomial(), rp = rp_cw(data = TRUE),
  nus = 0, nummods = 50,
  xval = example_data2$xtest, yval = ystarval, measure = "class")
@
Given that we only consider the case of $\nu=0$ and $M=50$, we can now compare the two approaches by directly looking at the validation measure achieved on the provided validation set. The extractor function \code{get_measure()} can be used to extract the value of the measure for the \class{spar} model object:
<>=
get_measure(spar_example_1)
get_measure(spar_example_2)
@
%
\subsection{Marginal models}\label{sec-ext-model}
%
Finally, we illustrate how to implement a new marginal model in \pkg{spareg}. We consider estimating a robust GLM as a marginal model using the package \pkg{robustbase} \citep{pkg:robustbase}. We start by defining the \code{model_fun()} function, where \code{y} is the vector of responses, \code{z} is the matrix of reduced predictors and \code{object} is a `\code{sparmodel}' object. In the case of a linear model, i.e.,~the Gaussian family with identity link, we rely on the function \code{robustbase::lmrob()}; for other family-link combinations we use \code{robustbase::glmrob()}.
<<>>=
model_glmrob <- function(y, z, object) {
  requireNamespace("robustbase")
  fam <- object$control$family
  if (fam$family == "gaussian" & fam$link == "identity") {
    glmrob_res <- do.call(function(...)
      robustbase::lmrob(y ~ as.matrix(z), ...), object$control)
  } else {
    glmrob_res <- do.call(function(...)
      robustbase::glmrob(y ~ as.matrix(z), ...), object$control)
  }
  intercept <- coef(glmrob_res)[1]
  gammas <- coef(glmrob_res)[-1]
  list(gammas = gammas, intercept = intercept)
}
@
%
We then construct a \code{spar_glmrob()} function, which builds the `\code{sparmodel}' object. We can do this easily with the available constructor function:
%
<<>>=
spar_glmrob <- constructor_sparmodel(name = "glmrob",
  model_fun = model_glmrob)
@
%
For illustration purposes, we generate another data set, \code{example_data3}, and consider a count version of the response variable \code{y}.
%
<>=
set.seed(123)
example_data3 <- simulate_spareg_data(n = 100, p = 1000, ntest = 100,
  snr = 10, beta_vals = c(-1, 1)/10)
ypois <- round(exp(example_data3$y)) + 1
@
%
We further contaminate $25$\% of the values in the predictor matrix with outliers.
%
<>=
perc_cont <- 0.25
x <- example_data3$x; np <- ncol(x) * nrow(x)
id_outliers_x <- sample(seq_len(np), perc_cont * np)
x[id_outliers_x] <- x[id_outliers_x] + 50
@
%
We can now estimate the SPAR algorithm with robust GLMs as marginal models and compare it to a version with marginal GLMs using the \code{poisson()} family. We use no screening and \code{rp_gaussian()}. We set \code{measure = "mae"}, i.e.,~we find the threshold $\nu$ by choosing the value which delivers the lowest mean absolute error on the training sample. To improve stability in the estimates from \code{robustbase::glmrob}, we set the upper bound on the goal dimension of the random projection to 25 (the default would be $n/2=\Sexpr{round(nrow(x)/2)}$).
<>=
set.seed(1234)
spar_rob_res <- spar(x, ypois, family = poisson(), model = spar_glmrob(),
  rp = rp_gaussian(msup = 25), measure = "mae")
set.seed(1234)
spar_res <- spar(x, ypois, family = poisson(), model = spar_glm(),
  rp = rp_gaussian(msup = 25), measure = "mae")
@
%
The extractor function \code{get_model(object, opt_par = c("best", "1se"))} can be employed to extract the model corresponding to the best $\nu$ for $M=20$ (which is the default for the number of models considered). It extracts the best model for \class{spar} objects and the best model or the one chosen by the one-standard-error rule for \class{spar.cv} objects.
<>=
best_rob <- get_model(spar_rob_res, opt_par = "best")
best_glm <- get_model(spar_res, opt_par = "best")
@
Then the extractor function \code{get_measure()} can be used to extract the value of the measure for the best model:
<>=
get_measure(best_rob)
get_measure(best_glm)
@
We observe that the attained \code{mae} is indeed lower for the robust version. We also observe that all predictors are active: even though the optimal threshold parameter was greater than zero, no predictor had its coefficient set to zero in all $M=20$ marginal models.
%
\ifPositive
\section{Illustrations} \label{sec-illustrations}
%
\subsection{Face image data}
%
We illustrate the functionality of \pkg{spareg} on the Isomap data set containing $n = 698$ black-and-white face images of size $64 \times 64$, i.e.,~$p = 4096$ pixels, together with the faces' horizontal looking direction angle as the response variable. The data set is publicly available for download at \url{https://web.archive.org/web/20150922051706/http://isomap.stanford.edu/face_data.mat.Z}.
<>=
if (is_paper) {
  url1 <- "https://web.archive.org/web/20150922051706/"
  url2 <- "http://isomap.stanford.edu/face_data.mat.Z"
  ## On Ubuntu or MacOS use code below
  ## On Windows, download locally and unzip
  if (!file.exists(file.path(tmpdir, "face_data.mat"))) {
    # system('uncompress face_data.mat.Z')
    library("R.matlab")
    download.file(paste0(url1, url2), file.path(tmpdir, "face_data.mat.Z"))
    system(sprintf('uncompress %s', file.path(tmpdir, "face_data.mat.Z")))
  }
}
@
<>=
if (is_paper) {
  library("R.matlab")
  facedata <- readMat(file.path(tmpdir, "face_data.mat"))
  x <- t(facedata$images)
  y <- facedata$poses[1, ]
}
@
%
After downloading and uncompressing the \code{.mat} file, we import it into \proglang{R} using package \pkg{R.matlab} \citep{pkg:rmatlab}:
<>=
library("R.matlab")
facedata <- readMat("face_data.mat")
x <- t(facedata$images)
y <- facedata$poses[1, ]
@
%
We can visualize one observation in this data set in the left panel of Figure~\ref{fig:faces_predictions} (code for reproducing the figure is provided in the supplementary materials).
%
This data set has the issue of many columns being almost constant.
Functions \code{spar()} and \code{spar.cv()} exclude constant columns from the estimation, but we can further alleviate the issue by setting all columns which have a low standard deviation to zero.
%
<>=
if (is_paper) x[, apply(x, 2, sd) < 0.01] <- 0
@
<>=
x[, apply(x, 2, sd) < 0.01] <- 0
@
%
We split the data into a training and a test sample:
<>=
if (is_paper) {
  set.seed(123)
  ntot <- length(y); ntest <- ntot * 0.25
  testind <- sample(ntot, ntest, replace = FALSE)
  xtrain <- as.matrix(x[-testind, ]); ytrain <- y[-testind]
  xtest <- as.matrix(x[testind, ]); ytest <- y[testind]
}
@
<>=
set.seed(123)
ntot <- length(y); ntest <- ntot * 0.25
testind <- sample(ntot, ntest, replace = FALSE)
xtrain <- as.matrix(x[-testind, ]); ytrain <- y[-testind]
xtest <- as.matrix(x[testind, ]); ytest <- y[testind]
@
Versions of the SPAR algorithm have been shown to achieve a high prediction performance on this data set \citep[see][]{parzer2024sparse}. We illustrate here the SPAR algorithm with cross-validation, where the data-driven random projection in \cite{parzer2024sparse} is used together with screening based on the ridge coefficients with the penalty approaching zero \citep{Wang2015HOLP}. To ensure convergence of the \pkg{glmnet} algorithm in computing the screening coefficient (which is also employed in the data-driven random projection), we set the \code{lambda.min.ratio} parameter to $0.001$. Moreover, each marginal model is a linear regression with the regression coefficients estimated by maximum likelihood. For the number of models, we consider a grid of $M=5, 10, 20, 50$.
<>=
if (is_paper) {
  file <- file.path(tmpdir, "faces_res.rda")
  if (!file.exists(file)) {
    set.seed(123)
    control_glmnet <- list(lambda.min.ratio = 0.001)
    library("spareg")
    Sys.time()
    spar_faces <- spar.cv(
      xtrain, ytrain, model = spar_glm(),
      screencoef = screen_glmnet(control = control_glmnet,
        reuse_in_rp = TRUE),
      rp = rp_cw(data = TRUE),
      nummods = c(5, 10, 20, 50), measure = "mse")
    Sys.time()
    save(spar_faces, file = file)
  } else {
    load(file)
  }
}
@
<>=
library("spareg")
set.seed(123)
control_glmnet <- list(lambda.min.ratio = 0.001)
spar_faces <- spar.cv(xtrain, ytrain, model = spar_glm(),
  screencoef = screen_glmnet(control = control_glmnet,
    reuse_in_rp = TRUE),
  rp = rp_cw(data = TRUE),
  nummods = c(5, 10, 20, 50), measure = "mse")
@
The \code{print} method returns:
%
<>=
spar_faces
@
<>=
if (is_paper) spar_faces
@
%
The \code{plot} method for `\code{spar.cv}' objects displays by default the measure employed in the cross-validation (in this case the MSE) for a grid of thresholding values $\nu$, where the number of models is fixed to the value found to perform best in the cross-validation exercise (see Figure~\ref{fig:facesplot_valmeasure}, top panel).
<>=
plot(spar_faces)
@
If we wish to visualize the results for $M=5$, which is the value chosen by the one-standard-error rule, we can specify this through the \code{nummod} argument (see Figure~\ref{fig:facesplot_valmeasure}, bottom panel).
<>=
plot(spar_faces, nummod = 5)
@
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.95\textwidth}
<>=
if (is_paper) {
  p <- plot(spar_faces, digits = 3L) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
  p
}
@
<>=
if (is_paper) {
  p2 <- plot(spar_faces, nummod = 5, digits = 3L) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
  p2
}
@
\caption{Plot of the mean squared error over a grid of threshold values $\nu$ for the optimal number of models (top) and for the number of models chosen by the one-standard-error rule (bottom) for the \emph{Isomap faces} data set. The left-most red point corresponds to the threshold value in the grid which achieves the lowest cross-validation measure. The right-most red point corresponds to the threshold which leads to the sparsest vector of coefficients while achieving, on average, a cross-validation measure within one standard deviation of the best. Confidence bands represent one standard deviation in the measures across the folds. \label{fig:facesplot_valmeasure}}
\end{figure}
%
We observe that the combination $(M=20, \nu=0)$ delivers the lowest MSE, while the one-standard-error rule chooses $(M=5, \nu=0.00424)$, with small differences in performance. The coefficients of the different variables (in this example, pixels), obtained by averaging the coefficients over the marginal models (for the optimal $\nu$ and number of models), are given by:
%
<>=
face_coef <- coef(spar_faces, opt_par = "best")
summary(face_coef)
@
<>=
if (is_paper) {
  face_coef <- coef(spar_faces, opt_par = "best")
  summary(face_coef)
}
@
%
For a sparser solution, we can compute the coefficients using \code{opt_par = "1se"}, which also selects a smaller number of models.
%
<>=
face_coef_1se <- coef(spar_faces, opt_par = "1se")
summary(face_coef_1se)
@
<>=
if (is_paper) {
  face_coef_1se <- coef(spar_faces, opt_par = "1se")
  summary(face_coef_1se)
}
@
%
% The \code{predict()} function can be applied to the `\code{spar.cv}' object.
% We will employ the sparser solution chosen by the \code{opt_par = "1se"} rule:
% <>=
% if (is_paper) {
%   ynew <- predict(spar_faces, xnew = xtest, opt_par = "1se")
% }
% @
% <>=
% ynew <- predict(spar_faces, xnew = xtest, opt_par = "1se")
% @
We can easily compare the performance of the estimated version of the SPAR algorithm with another version proposed in \cite{Dunson2020TargRandProj}, where the sparse random projection in \cite{ACHLIOPTAS2003JL} is combined with probabilistic marginal correlation screening.
%
<>=
set.seed(123)
tarp_faces <- spar.cv(xtrain, ytrain, model = spar_glm(),
  screencoef = screen_cor(), rp = rp_sparse(psi = 1/3),
  nummods = c(5, 10, 20, 50), measure = "mse")
@
<>=
if (is_paper) {
  set.seed(123)
  tarp_faces <- spar.cv(xtrain, ytrain, model = spar_glm(),
    screencoef = screen_cor(), rp = rp_sparse(psi = 1/3),
    nummods = c(5, 10, 20, 50), measure = "mse")
  ynew_tarp <- predict(tarp_faces, xnew = xtest, opt_par = "1se")
}
@
We can assess the predictive performance of the two approaches on the training data by first extracting the model corresponding to the $(\nu, M)$ combination selected by the one-standard-error rule and then looking at the average mean squared error across the folds:
<>=
spar_faces_1se <- get_model(spar_faces, opt_par = "1se")
tarp_faces_1se <- get_model(tarp_faces, opt_par = "1se")
get_measure(spar_faces_1se)
@
<>=
if (is_paper) {
  spar_faces_1se <- get_model(spar_faces, opt_par = "1se")
  get_measure(spar_faces_1se)
}
@
<>=
get_measure(tarp_faces_1se)
@
<>=
if (is_paper) {
  tarp_faces_1se <- get_model(tarp_faces, opt_par = "1se")
  get_measure(tarp_faces_1se)
}
@
The predictive performance on the test data can be assessed by employing the \code{predict()} method on the extracted models, i.e.,~on the sparser solutions chosen by the \code{opt_par = "1se"} rule:
<>=
ynew_spar <- predict(spar_faces_1se, xnew = xtest)
ynew_tarp <- predict(tarp_faces_1se, xnew = xtest)
c("SPAR" = mean((ytest - ynew_spar)^2),
  "TARP" = mean((ytest - ynew_tarp)^2))
@
<>=
if (is_paper) {
  ynew_spar <- predict(spar_faces_1se, xnew = xtest)
  ynew_tarp <- predict(tarp_faces_1se, xnew = xtest)
  c("SPAR" = mean((ytest - ynew_spar)^2),
    "TARP" = mean((ytest - ynew_tarp)^2))
}
@
Finally, for this data set, one can visualize $\hat\beta^\text{1se}_j x^\text{new}_{i,j}$, the effect of each pixel in predicting the face orientation in a given image. For the image shown in the left panel of Figure~\ref{fig:faces_predictions}, these pixel contributions are displayed in the right panel of the same figure. We observe that the face contour and the hairline are most informative for the prediction.
%%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.45\textwidth}
<>=
if (is_paper) {
  i <- 179
  p <- ggplot(data.frame(X = rep(1:64, each = 64), Y = rep(64:1, 64),
      Z = facedata$images[, i]), aes(X, Y, fill = Z)) +
    geom_tile() + theme_void() +
    ggtitle(paste0("y = ", round(facedata$poses[1, i], 1))) +
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
  p
}
@
<>=
if (is_paper) {
  id <- 3
  p4 <- ggplot(data.frame(X = rep(1:64, each = 64), Y = rep(64:1, 64),
      effect = xtest[id, ] * face_coef_1se$beta),
      aes(X, Y, fill = effect)) +
    geom_tile() + theme_void() + scale_fill_gradient2() +
    ggtitle(bquote(hat(y) == .(round(ynew_spar[id])))) +
    theme(plot.title = element_text(hjust = 0.5))
  p4
}
@
\caption{Left: Image corresponding to one observation in the \emph{Isomap faces} data set. Right: The effect of each pixel $\hat\beta^\text{1se}_j x^\text{new}_{i,j}$ in predicting the face orientation in the left panel. \label{fig:faces_predictions}}
\end{figure}
%
\subsection{Darwin data set}
%
The Darwin data set \citep{CILIA2022darwin} contains a binary response for Alzheimer's disease (AD) together with extracted features from 25 handwriting tasks (18 features per task) for 89 AD patients and 85 healthy people ($n=174$). The data set can be downloaded from \url{https://archive.ics.uci.edu/dataset/732/darwin} as a \code{.zip} file.
<>=
tmpdir <- tempdir()
download.file("https://archive.ics.uci.edu/static/public/732/darwin.zip",
  file.path(tmpdir, "darwin.zip"))
@
<>=
if (is_paper) {
  if (!file.exists(file.path(tmpdir, "darwin.zip"))) {
    download.file("https://archive.ics.uci.edu/static/public/732/darwin.zip",
      file.path(tmpdir, "darwin.zip"))
    unzip(zipfile = file.path(tmpdir, "darwin.zip"), files = "data.csv",
      exdir = tmpdir)
  }
  darwin_tmp <- read.csv(file.path(tmpdir, "data.csv"),
    stringsAsFactors = TRUE)
}
@
<>=
unzip(zipfile = file.path(tmpdir, "darwin.zip"), files = "data.csv",
  exdir = file.path(tmpdir))
darwin_tmp <- read.csv(file.path(tmpdir, "data.csv"),
  stringsAsFactors = TRUE)
@
Before proceeding with the analysis, the data is screened for outlying cells using the DDC algorithm in package \pkg{cellWise} \citep{rcellwise}, and the flagged cells are imputed.
<>=
darwin_orig <- list(
  x = darwin_tmp[, !(colnames(darwin_tmp) %in% c("ID", "class"))],
  y = as.numeric(darwin_tmp$class) - 1)
tmp <- cellWise::DDC(
  as.matrix(darwin_orig$x),
  list(returnBigXimp = TRUE, tolProb = 0.999, silent = TRUE))
darwin <- list(x = tmp$Ximp, y = darwin_orig$y)
@
<>=
if (is_paper) {
  darwin_orig <- list(
    x = darwin_tmp[, !(colnames(darwin_tmp) %in% c("ID", "class"))],
    y = as.numeric(darwin_tmp$class) - 1)
  tmp <- cellWise::DDC(
    as.matrix(darwin_orig$x),
    list(returnBigXimp = TRUE, tolProb = 0.999, silent = TRUE))
  darwin <- list(x = tmp$Ximp, y = darwin_orig$y)
}
@
The structure of the data is:
<>=
str(darwin)
@
<>=
if (is_paper) str(darwin)
@
%
We estimate the SPAR algorithm with the screening and random projection introduced in \cite{parzer2024glms} for the binomial family with logit link, using one minus the area under the ROC curve as the cross-validation measure:
<>=
set.seed(1234)
spar_darwin <- spar.cv(darwin$x, darwin$y, family = binomial(logit),
  screencoef = screen_glmnet(reuse_in_rp = TRUE),
  nummods = c(10, 20, 30, 50), measure = "1-auc")
spar_darwin
@
<>=
if (is_paper) {
  file <- file.path(tmpdir, "darwin_res.rda")
  if (!file.exists(file)) {
    set.seed(1234)
    spar_darwin <- spareg::spar.cv(darwin$x, darwin$y,
      family = binomial(logit),
      screencoef = screen_glmnet(reuse_in_rp = TRUE),
      nummods = c(10, 20, 30, 50), measure = "1-auc")
    save(spar_darwin, file = file)
  } else {
    load(file = file)
  }
  spar_darwin
}
@
We can look at the average number of active variables for a grid of threshold values $\nu$, where the number of models is fixed to the value found to perform best in the cross-validation exercise (see Figure~\ref{fig:darwin_activevars}).
<>=
plot(spar_darwin, plot_type = "val_numactive")
@
We observe again that no thresholding achieves the best measure, which translates to almost all variables being active (some variables can be inactive at $\nu=0$, as they may never be selected in the screening step). The one-standard-error rule would, however, indicate that more sparsity can be introduced without increasing the cross-validation measure too much.
%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<>=
if (is_paper) {
  p <- plot(spar_darwin, plot_type = "val_numactive", digits = 3L)
  p
}
@
\caption{Average number of active variables for the grid of thresholding values $\nu$ and $M=20$ models for the \emph{Darwin} data set. The red points correspond to the average number of active variables for the model with the lowest cross-validation measure and for the one chosen by the one-standard-error rule.
\label{fig:darwin_activevars}}
\end{figure}
%
Finally, the standardized coefficients of the predictors across the maximum number of considered marginal models (in this case $M=50$) can be visualized with \code{plot(spar_darwin, plot_type = "coefs")}. In this data set, the predictors are ordered by task, where the first 18 covariates represent different features measured for the first task. Given that there is a clear grouping of the variables in this example, we can reorder the coefficients for plotting by grouping them by feature, rather than by task. This allows us to assess how the different features (e.g.,~the time it takes to complete a certain task) relate to the likelihood of having AD and how stable the sign and magnitude of the coefficients are across the models in the ensemble. We can achieve this by using the reordering argument \code{coef_order} in the \code{plot} method with \code{plot_type = "coefs"} (see Figure~\ref{fig:darwin_coefs}, which can be reproduced with the code in the supplementary materials).
%%
\begin{figure}[t!]
\centering
\setkeys{Gin}{width=0.8\textwidth}
<>=
if (is_paper) {
  ntasks <- 25
  nfeat <- 18
  reorder_ind <- c(outer(
    (seq_len(ntasks) - 1) * nfeat, seq_len(nfeat), "+"))
  feat_names <- sapply(colnames(darwin$x)[seq_len(nfeat)],
    function(name) substr(name, 1, nchar(name) - 1))
  p <- plot(spar_darwin, "coefs", coef_order = reorder_ind) +
    geom_vline(xintercept = 0.5 + seq_len(ntasks - 1) * ntasks,
      alpha = 0.2, linetype = 2) +
    annotate("text", x = (seq_len(nfeat) - 1) * ntasks + 12, y = 40,
      label = feat_names, angle = 90, size = 3)
  p
}
@
\caption{Coefficient plot for all variables and all considered $M=50$ models in the SPAR ensemble for the \emph{Darwin} data set. The coefficients are standardized and shown before thresholding. \label{fig:darwin_coefs}}
\end{figure}
In general, we observe that the regression coefficients for the same features tend to have the same sign across different tasks, indicating a consistent positive or negative influence on the probability of Alzheimer's disease. This is visually reflected in the recurring blocks of blue or red lines.
\fi
\section{Conclusion} \label{sec-conclusion}
Package \pkg{spareg} can be employed for modeling high-dimensional data in a GLM framework, especially in settings where the number of predictors is much higher than the number of observations. The package provides an implementation of a computationally efficient algorithm for sparse projected averaged regression (SPAR), which combines variable screening and random projection in an ensemble of GLMs to reduce the dimensionality of the predictors.
Package \pkg{spareg} provides flexible classes for i)~specifying the coefficient based on which screening should be performed (either in a classical fashion, where the predictors with the highest screening coefficients are selected for subsequent analysis, or in a probabilistic fashion, where variables are sampled for inclusion with probabilities proportional to their screening coefficients), ii)~generating the random projection to be employed in each marginal model, and iii)~specifying the marginal models to be used in the ensemble. Screening coefficients based on the marginal correlation between the predictors and the response, on marginal coefficients from a GLM, or on ridge coefficients are provided in the package.
Moreover, several random projections are implemented: the Gaussian and sparse matrices, which are data-agnostic and satisfy the JL lemma, and the data-driven projection proposed in \cite{parzer2024sparse} for linear regression and extended to GLMs in \cite{parzer2024glms}. This data-driven approach, where information about the relationship between the response and the predictors is incorporated into the random projection by replacing the non-zero elements of the random projection of \cite{Clarkson2013LowRankApprox} with the ridge coefficients obtained with a small penalty, has the advantage of ensuring that the true regression coefficients can be approximately recovered after the projection.
Methodologically, the SPAR algorithm with ridge screening coefficients and the data-driven random projection \citep[as proposed in][]{parzer2024glms} has been demonstrated to perform effectively in terms of predictive power and variable ranking in settings with correlated predictors and across different degrees of sparsity of the coefficient vector, making it a suitable method for high-dimensional problems where the degree of sparsity is unknown. This is the motivation for setting the screening coefficient and random projection of \cite{parzer2024glms} as the defaults in \pkg{spareg}.
Moreover, the flexibility and adaptability of the \pkg{spareg} package in dealing with different types of screening procedures, random projections and GLMs make it an attractive choice for practitioners and researchers in a wide variety of data settings. It encourages the exploration of new methods for variable screening and random projection, or the combination of existing approaches, to tailor solutions to specific data requirements.
\section*{Computational details}
The results in this paper were obtained using \proglang{R} \Sexpr{paste(R.Version()[6:7], collapse = ".")}. \proglang{R} itself and all packages used are available from CRAN at \url{https://CRAN.R-project.org/}.
\ifPositive
\else
<>=
sessionInfo()
@
\fi
\section*{Acknowledgments}
Roman Parzer and Laura Vana-G\"ur acknowledge funding from the Austrian Science Fund (FWF) for the project ``High-dimensional statistical learning: New methods to advance economic and sustainability policies'' (ZK 35), jointly carried out by WU Vienna University of Economics and Business, Paris Lodron University Salzburg, TU Wien, and the Austrian Institute of Economic Research (WIFO).
%% -- Bibliography -------------------------------------------------------------
%% - References need to be provided in a .bib BibTeX database.
%% - All references should be made with \cite, \citet, \citep, \citealp etc.
%%   (and never hard-coded). See the FAQ for details.
%% - JSS-specific markup (\proglang, \pkg, \code) should be used in the .bib.
%% - Titles in the .bib should be in title case.
%% - DOIs should be included where available.
\bibliography{spareg}
\end{document}