--- title: "The ridge function, matrices, and prediction" author: "Terry Therneau" date: Dec 2020 output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{The ridge function, matrices, and prediction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} options(continue=" ", width=70, prompt="> ") options(contrasts=c("contr.treatment", "contr.poly")) #ensure default options(show.significant.stars = FALSE) #statistical intelligence knitr::opts_chunk$set( collapse = TRUE, warning=FALSE, error=FALSE, tidy=FALSE, fig.asp=.75, fig.align="center", comment = "#>", highlight=TRUE, echo=TRUE, prompt=FALSE, fig.width=7, fig.height=5.5 ) library(survival) ``` # The ridge function The `ridge()` function was included in the survival package as a test of the code for penalized models. As the author of the package, I never actually use the function myself, and as a consequence it has never received any polish. But it does work. This neglect is simply due to the types of data that I analyse. This vignette was created to address a common issue with using the function, one that arises with some frequency on internet message boards. Hopefully this note will provide context and a reference for a solution. I will create a dummy dataset with 20 random 0/1 variables as a running example. ```{r, rdata} set.seed(1954) # force reproducability library(survival) n <- nrow(lung) snp <- matrix(rbinom(20*n, 1, p=.1), nrow=n) snpdata <- cbind(lung, data.frame(snp)) dim(snpdata) ``` ## The issue The ridge function makes the most sense when there are a large number of variables, for instance if one had 100 SNPs. Say that you write the call in the obvious way: ```{r, pass1} cfit1 <- coxph(Surv(time, status) ~ age + sex + ridge(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, theta=.1), data=snpdata) ``` The problem with this, of course, is that typing this statement is cumbersome for 20 SNPs and practically impossible if there were 500. One can use `.` in R formulas to stand for "all the other variables", but this does not work inside a function. This leads to the first work-around, which is to create the formula programatically. ```{r, pass2} xlist <- paste0("X", 1:20) myform <- paste("Surv(time, status) ~ age + sex + ridge(", paste(xlist, collapse= ", "), ", theta=.1)") cfit2 <- coxph(formula(myform), data=snpdata) all.equal(cfit1$loglik, cfit2$loglik) ``` This approach works well up to a point and then fails, once the model statement grows too long for the internal buffer of the R parser. The solution is to call the ridge function with a single matrix argument. ```{r, pass3} cfit3 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1), data = snpdata) ``` This fit has completely ignored the variables X1, X2, ..., X50 found in the dataframe. If the SNP data had come to us as a dataframe, then we would create a temporary matrix using the as.matrix command applied to the appropriate columns of said data. The problem with this approach comes if we want to do predictions using a new set of data. ```{r, pass4, error=TRUE} newsnp <- matrix(rbinom(20*4, 1, p=.12), nrow=4) prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), newsnp) predict(cfit3, newdata=prdata) ``` We get a failure message that "variable lengths differ". The reason is that R fitting functions look for their variables first in the `data=` argument, and then look in the primary working data for any that were not found. The call to cfit3 has 3 variables of age, sex, and snp; while the new dataframe prdata has variables of age, sex, X1, X2, ..., X20. The predict function pulls 4 rows from prdata (for age and sex), and 50 rows from the global variable snp. This clearly does not work. What if we put those variables in the the main data, instead of a dataframe? ```{r, pass5} prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), newsnp) snpsave <- snp snp <- newsnp age <- prdata$age sex <- prdata$sex new <- predict(cfit3) length(new) snp <- snpsave # restore the status quo rm(age, sex) ``` This time we have been tripped up by the `predict.coxph function`. If there is no newdata argument, the function assumes the user is asking for predictions using the original data, and that is what is produced. There is also the need to restore datasets that we overwrote. The most direct way around this is to do the prediction "by hand". Since this dataset has no factor variables, splines, or other terms that lead to special coding, it is fairly easy to do the computation. ```{r, pass6} prmat <- cbind(age= c(50, 65, 48, 70), sex=c(1,1,2,1), newsnp) drop(prmat %*% coef(cfit3)) # simplify results to vector #alternate (center) drop(prmat %*% coef(cfit3)) - sum(coef(cfit3)* cfit3$mean) ``` The `predict.coxph` function, by default, gives predictions for centered covariates. This has its roots in the internals of the `coxph` code, where centering is used to avoid overflow of the exp function. This is entirely optional when doing the prediction ourselves, unless one wants to match `predict.coxph`. (If I could go back in time, centering the predictions would be undone; it's effect has been mostly to cause confusion. Que sera sera.) Suppose that one still wanted to use the standard predict function, for instance if the model did include a factor variable, or simply to fit in to the usual R pattern? What is needed is a way to store a matrix directly into a dataframe. This is done regularly by the `model.frame` function; the code below causes data.frame to use that behavior for our matrices of SNP values. ```{r, pass7} # new data type ridgemat <- function(x) { class(x) <- c("ridgemat", class(x)) x } as.data.frame.ridgemat <- function(x, ...) as.data.frame.model.matrix(as.matrix(x), ...) snpdata2 <- cbind(lung, snp= ridgemat(snp)) names(snpdata2) cfit4 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1), data= snpdata2) prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1), snp= ridgemat(newsnp)) predict(cfit4, newdata=prdata) ``` Success! The downside to this is that the modified dataframe object is known to work with modeling functions, but is not guaranteed to be comparable with standard manipulations for dataframes, such as merge, subset, etc., or tidyverse concepts such as a tibble.