---
title: "The ridge function, matrices, and prediction"
author: "Terry Therneau"
date: Dec 2020
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The ridge function, matrices, and prediction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
options(continue="  ", width=70, prompt="> ")
options(contrasts=c("contr.treatment", "contr.poly")) #ensure default
options(show.significant.stars = FALSE) #statistical intelligence

knitr::opts_chunk$set(
  collapse = TRUE, warning=FALSE, error=FALSE, tidy=FALSE,
  fig.asp=.75, fig.align="center",
  comment = "#>", highlight=TRUE, echo=TRUE, prompt=FALSE,
  fig.width=7, fig.height=5.5
)

library(survival)
```
# The ridge function

The `ridge()` function was included in the survival package as a test of the
code for penalized models.  As the author of the package, I never actually
use the function myself, and as a consequence it has never received any
polish.  But it does work.  This neglect is simply due to the types of data
that I analyse.

This vignette was created to address a common issue with using the function,
one that arises with some frequency on internet message boards.
Hopefully this note will provide context and a reference for a solution.
I will create a dummy dataset with 20 random 0/1 variables as a running
example.

```{r, rdata}
set.seed(1954)  # force reproducability
library(survival)

n <- nrow(lung)
snp <- matrix(rbinom(20*n, 1, p=.1), nrow=n)
snpdata <- cbind(lung, data.frame(snp))
dim(snpdata)
```

## The issue

The ridge function makes the most sense when there are a large number of
variables, for instance if one had 100 SNPs. Say that you write the call in the
obvious way:

```{r, pass1}
cfit1 <- coxph(Surv(time, status) ~ age + sex + ridge(X1, X2, X3, X4, X5, X6,
                     X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, 
		     X19, X20, theta=.1), data=snpdata)
```

The problem with this, of course, is that typing this statement is cumbersome
for 20 SNPs and practically impossible if there were 500. 
One can use `.` in R formulas to
stand for "all the other variables", but this does not work inside a function.
This leads to the first work-around, which is to create the formula
programatically.

```{r, pass2}
xlist <-  paste0("X", 1:20)
myform <- paste("Surv(time, status) ~ age + sex + ridge(",
                paste(xlist, collapse= ", "), ", theta=.1)")
cfit2 <- coxph(formula(myform), data=snpdata)

all.equal(cfit1$loglik, cfit2$loglik)
```

This approach works well up to a point and then fails, once the model statement
grows too long for the internal buffer of the R parser. The solution is to call
the ridge function with a single matrix argument.

```{r, pass3}
cfit3 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
      	 data = snpdata)
```

This fit has completely ignored the variables X1, X2, ..., X50 found in the
dataframe. If the SNP data had come to us as a dataframe, then we would create a
temporary matrix using the as.matrix command applied to the appropriate
columns of said data.

The problem with this approach comes if we want to do predictions using a new
set of data.

```{r, pass4, error=TRUE}
newsnp <- matrix(rbinom(20*4, 1, p=.12), nrow=4)
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                     newsnp)
predict(cfit3, newdata=prdata)
```

We get a failure message that "variable lengths differ".  The reason is that R
fitting functions look for their variables first in the `data=` argument, and
then look in the primary working data for any that were not found.  The call to
cfit3 has 3 variables of age, sex, and snp; while the new dataframe prdata has
variables of age, sex, X1, X2, ..., X20.  The predict function pulls 4 rows from
prdata (for age and sex), and 50 rows from the global variable snp. This clearly
does not work.

What if we put those variables in the the main data, instead of a dataframe?

```{r, pass5}
prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                     newsnp)
snpsave <- snp
snp <- newsnp
age <- prdata$age
sex <- prdata$sex

new <- predict(cfit3)
length(new)

snp <- snpsave  # restore the status quo
rm(age, sex)
```

This time we have been tripped up by the `predict.coxph function`.  If there is
no newdata argument, the function assumes the user is asking for predictions
using the original data, and that is what is produced.  There is also the need
to restore datasets that we overwrote.

The most direct way around this is to do the prediction "by hand".  Since
this dataset has no factor variables, splines, or other terms that lead
to special coding, it is fairly easy to do the computation.

```{r, pass6}
prmat <- cbind(age= c(50, 65, 48, 70), sex=c(1,1,2,1), newsnp)
drop(prmat %*% coef(cfit3)) # simplify results to vector
#alternate (center)
drop(prmat %*% coef(cfit3)) - sum(coef(cfit3)* cfit3$mean)
```

The `predict.coxph` function, by default, gives predictions for centered
covariates.  This has its roots in the internals of the `coxph` code, where
centering is used to avoid overflow of the exp function. This is entirely
optional when doing the prediction ourselves, unless one wants to match
`predict.coxph`.  (If I could go back in time, centering the predictions would
be undone; it's effect has been mostly to cause confusion. Que sera sera.)

Suppose that one still wanted to use the standard predict function, for instance
if the model did include a factor variable, or simply to fit in to the usual R
pattern? What is needed is a way to store a matrix directly into a dataframe.
This is done regularly by the `model.frame` function; the code below causes
data.frame to use that behavior for our matrices of SNP values.

```{r, pass7}
# new data type
ridgemat <- function(x) {
    class(x) <- c("ridgemat", class(x))
    x
}
as.data.frame.ridgemat <- function(x, ...) 
    as.data.frame.model.matrix(as.matrix(x), ...)

snpdata2 <- cbind(lung, snp= ridgemat(snp))
names(snpdata2)

cfit4 <- coxph(Surv(time, status) ~ age + sex + ridge(snp, theta=.1),
      	     data= snpdata2)
     

prdata <- data.frame(age= c(50, 65, 48, 70), sex= c(1, 1, 2,1),
                       snp= ridgemat(newsnp))
predict(cfit4, newdata=prdata)
```

Success!

The downside to this is that the modified dataframe object is known to work
with modeling functions, but is not guaranteed to be comparable with standard
manipulations for dataframes, such as merge, subset, etc., or tidyverse
concepts such as a tibble.