--- title: "Graph-constrained Regression with Enhanced Regulariazation Parameters Selection: intro and usage examples" author: "Marta Karas " date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Intro and usage examples} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # The two-formulas intro ## `riPEER` estimator `mdpeer` provides penalized regression method `riPEER()` to estimate a linear model: $$y = X\beta + Zb + \varepsilon$$ where: - $y$ - response - $X$ - input data matrix - $Z$ - input data matrix - $\beta$ - regression coefficients, *not penalized* in estimation process - $b$ - regression coefficients, *penalized* in estimation process and for whom there is, *possibly*^[`riPEER` might be also used as ridge regression estimator by supplying a diagonal matrix as `Q` argument. see: Examples. Example 3.], a **prior graph of similarity / graph of connections** available `riPEER()` estimation method uses a penalty being a linear combination of a graph-based and ridge penalty terms: $$\widehat{\beta}, \widehat{b}= \underset{\beta,b}{arg \; min}\left \{ (y - X\beta - Zb)^T(y - X\beta - Zb) + \lambda_Qb^TQb + \lambda_Rb^Tb\right \},$$ where: - $Q$ - a graph-originated penalty matrix; typically: a graph Laplacian matrix, - $\lambda_Q$ - regularization parameter for a graph-based penalty term - $\lambda_R$ - regularization parameter for ridge penalty term ## A word about the `riPEER` penalty properties * A graph-originated penalty matrix $Q$ allows imposing similarity between coefficients of variables which are *similar* (or *connected*), based on some graph given. * Adding ridge penalty term, $\lambda_Rb^Tb$, even with very small $\lambda_R$, eliminates computational issues arising from singularity in a graph-originated penalty matrix. * Also, in cases when the graph information given is only partially informative / not informative about regression coefficients, ridge penalty provides partial / full regularization in the estimation. ## A cool thing about regularization parameters selection They are **estimated automatically** as ML estimators of equivalent Linear Mixed Models optimization problem (see: Karas et al. (2017)). # Examples ## example 1: `riPEER` used with informative graph information ```{r, echo = FALSE, results='hide', include=FALSE} library(mdpeer) library(glmnet) library(ggplot2) ``` ```{r, fig.width = 3.3, fig.height = 2.8} library(mdpeer) n <- 100 p1 <- 10 p2 <- 40 p <- p1 + p2 # Define graph adjacency matrix A <- matrix(rep(0, p*p), nrow = p, ncol = p) A[1:p1, 1:p1] <- 1 A[(p1+1):p, (p1+1):p] <- 1 diag(A) <- rep(0,p) # Compute graph Laplacian matrix L <- Adj2Lap(A) vizu.mat(A, "graph adjacency matrix", 9); vizu.mat(L, "graph Laplacian matrix", 9) ``` - graph adjacency matrix represents connections between p=100 nodes on a graph (speaking graph-constrained regression language: *represents connections / similarity between p=100 regression coefficients $b$*) - 1 value of $[ij]$ adjacency matrix entry denotes $i$-th and $j$-th nodes are *connected* on a graph; 0 value means they are not - generally, adjacency matrices with continuous (both positive and negative) values might be used ```{r} # simulate data objects set.seed(1) Z <- scale(matrix(rnorm(n*p), nrow = n, ncol = p)) X <- cbind(1, scale(matrix(rnorm(n*3), nrow = n, ncol = 3))) # cbind with 1s col. 
```{r}
# simulate data objects
set.seed(1)
Z <- scale(matrix(rnorm(n*p), nrow = n, ncol = p))
X <- cbind(1, scale(matrix(rnorm(n*3), nrow = n, ncol = 3)))  # cbind with a column of 1s for the intercept
b.true    <- c(rep(1, p1), rep(0, p2))  # coefficients of interest (penalized)
beta.true <- c(0, runif(3))             # intercept = 0 and 3 coefficients (non-penalized)
eta    <- Z %*% b.true + X %*% beta.true
R2     <- 0.5                           # assumed variance explained
sd.eps <- sqrt(var(eta) * (1 - R2) / R2)
error  <- rnorm(n, sd = sd.eps)
y      <- eta + error
```

$b$ estimation

- `riPEER`: graph Laplacian matrix `L` used as the penalty matrix $Q$
- `riPEER`: the graph is highly informative about the regression coefficients -> relatively high contribution of the graph-based penalty compared to the ridge penalty (`lambda.Q` >> `lambda.R`)

```{r}
# estimate with riPEER
riPEER.out <- riPEER(Q = L, y = y, Z = Z, X = X, verbose = FALSE)
b.est.riPEER <- riPEER.out$b.est

# (graph penalty regularization parameter, ridge penalty regularization parameter)
c(riPEER.out$lambda.Q, riPEER.out$lambda.R)

# estimate with cv.glmnet
library(glmnet)
cv.out.glmnet <- cv.glmnet(x = cbind(X, Z), y = y, alpha = 0, intercept = FALSE)
b.est.cv.glmnet <- unlist(coef(cv.out.glmnet))[6:55]  # Z coefficients: drop the intercept and the 4 columns of X
```

$b$ estimates plot

```{r, echo = FALSE, fig.width=7, fig.height=3}
plt.df <- data.frame(x = rep(1:p, 2),
                     b.coef = c(b.est.riPEER, b.est.cv.glmnet),
                     est.method = c(rep("riPEER", p), rep("cv.glmnet", p)))
ggplot(plt.df, aes(x = x, y = b.coef, group = est.method, color = est.method)) +
  geom_line(data = data.frame(x = 1:p, y = b.true), aes(x = x, y = y), inherit.aes = FALSE) +
  geom_line() + geom_point() +
  theme_bw(base_size = 9) +
  labs(title = "black line: b.true", color = "estimation\nmethod")
```

$b$ estimation error

```{r}
MSEr <- function(b.true, b.est) sum((b.true - b.est)^2) / sum(b.true^2)

# (riPEER error, cv.glmnet error)
c(MSEr(b.true, b.est.riPEER), MSEr(b.true, b.est.cv.glmnet))
```

## Example 2: `riPEER` used with non-informative graph information

- use the same graph information as above
- simulate true $b$ coefficients about which the given graph is not informative

```{r}
# simulate data objects
set.seed(1)
Z <- scale(matrix(rnorm(n*p), nrow = n, ncol = p))
X <- cbind(1, scale(matrix(rnorm(n*3), nrow = n, ncol = 3)))
b.true    <- rep(c(-1, 1), p/2)  # coefficients of interest (penalized)
beta.true <- c(0, runif(3))      # intercept = 0 and 3 coefficients (non-penalized)
eta    <- Z %*% b.true + X %*% beta.true
R2     <- 0.5                    # assumed variance explained
sd.eps <- sqrt(var(eta) * (1 - R2) / R2)
error  <- rnorm(n, sd = sd.eps)
y      <- eta + error
```

$b$ estimation

- `riPEER`: the graph is non-informative about the regression coefficients -> relatively high contribution of the ridge penalty compared to the graph-based penalty (`lambda.Q` << `lambda.R`)

```{r}
# estimate with riPEER
riPEER.out <- riPEER(Q = L, y = y, Z = Z, X = X, verbose = FALSE)
b.est.riPEER <- riPEER.out$b.est
c(riPEER.out$lambda.Q, riPEER.out$lambda.R)

# estimate with cv.glmnet
cv.out.glmnet <- cv.glmnet(x = cbind(X, Z), y = y, alpha = 0, intercept = FALSE)
b.est.cv.glmnet <- unlist(coef(cv.out.glmnet))[6:55]  # Z coefficients: drop the intercept and the 4 columns of X
```

$b$ estimates plot

```{r, echo = FALSE, fig.width=7, fig.height=3}
plt.df <- data.frame(x = rep(1:p, 2),
                     b.coef = c(b.est.riPEER, b.est.cv.glmnet),
                     est.method = c(rep("riPEER", p), rep("cv.glmnet", p)))
ggplot(plt.df, aes(x = x, y = b.coef, group = est.method, color = est.method)) +
  geom_line(data = data.frame(x = 1:p, y = b.true), aes(x = x, y = y), inherit.aes = FALSE) +
  geom_line() + geom_point() +
  theme_bw(base_size = 9) +
  labs(title = "black line: b.true", color = "estimation\nmethod")
```
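As in Example 1, estimation accuracy is summarized with the relative (normalized) squared error implemented by `MSEr` below:
$$\mathrm{MSEr}(b, \widehat{b}) = \frac{\sum_{i=1}^{p}\left(b_i - \widehat{b}_i\right)^2}{\sum_{i=1}^{p} b_i^2},$$
i.e., the squared estimation error scaled by the squared norm of the true coefficient vector; values below 1 indicate an estimate closer to $b$ than the zero vector.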
$b$ estimation error

```{r}
MSEr <- function(b.true, b.est) sum((b.true - b.est)^2) / sum(b.true^2)

# (riPEER error, cv.glmnet error)
c(MSEr(b.true, b.est.riPEER), MSEr(b.true, b.est.cv.glmnet))
```

## Example 3: `riPEERc` used as an ordinary ridge estimator

- `riPEERc` is a method for *graph-constrained regression with the addition of a small ridge term to handle the non-invertibility of a graph Laplacian matrix*
- one might provide a diagonal matrix as its `Q` argument to use it as an ordinary ridge estimator

```{r}
# example based on the `glmnet::cv.glmnet` CRAN documentation
set.seed(1010)
n <- 1000; p <- 100
nzc <- trunc(p/10)
x <- matrix(rnorm(n*p), n, p)
beta <- rnorm(nzc)
fx <- x[, seq(nzc)] %*% beta
eps <- rnorm(n) * 5
y <- drop(fx + eps)
b.true <- c(beta, rep(0, p - nzc))  # full true coefficient vector (zeros for the null features)

set.seed(1011)
cvob1 <- cv.glmnet(x, y, alpha = 0)
est.cv.glmnet <- coef(cvob1)[-c(1)]  # exclude intercept

# use riPEERc (X has a column of 1s to represent the intercept in the model)
riPEERc.out <- riPEERc(Q = diag(p), y = matrix(y), Z = x, X = matrix(rep(1, n)))
est.riPEERc <- riPEERc.out$b.est
```

$b$ estimates plot

```{r, echo = FALSE, fig.width=7, fig.height=3}
plt.df <- data.frame(x = rep(1:p, 2),
                     b.coef = c(est.riPEERc, est.cv.glmnet),
                     est.method = c(rep("riPEERc", p), rep("cv.glmnet", p)))
ggplot(plt.df, aes(x = x, y = b.coef, group = est.method, color = est.method)) +
  geom_line(data = data.frame(x = 1:p, y = b.true), aes(x = x, y = y), inherit.aes = FALSE) +
  geom_line() + geom_point() +
  theme_bw(base_size = 9) +
  labs(title = "black line: true", color = "estimation\nmethod")
```

$b$ estimation error

```{r}
MSEr <- function(b.true, b.est) sum((b.true - b.est)^2) / sum(b.true^2)

# (riPEERc error, cv.glmnet error)
c(MSEr(b.true, est.riPEERc), MSEr(b.true, est.cv.glmnet))
```

## Conclusion

- Example 1: graph highly informative about the regression coefficients: `riPEER` outperforms `cv.glmnet` in $b$ coefficient estimation substantially
- Example 2: graph non-informative about the regression coefficients: `riPEER` still outperforms `cv.glmnet` in $b$ coefficient estimation
- Example 3: `riPEERc` used as an ordinary ridge estimator (with the regularization parameter estimated automatically) outperforms `cv.glmnet` in coefficient estimation

# References

1. Karas, M., Brzyski, D., Dzemidzic, M., Goñi, J., Kareken, D.A., Randolph, T.W., Harezlak, J. (2017). Brain connectivity-informed regularization methods for regression. doi: https://doi.org/10.1101/117945