\documentclass{article} \usepackage{url} \addtolength{\topmargin}{-0.5in} \addtolength{\textheight}{0.75in} \addtolength{\oddsidemargin}{-0.5in} \addtolength{\textwidth}{1in} %\VignetteIndexEntry{Two-phase designs in epidemiology} \usepackage{Sweave} \author{Thomas Lumley} \title{Two-phase designs in epidemiology} \begin{document} \maketitle This document explains how to analyse case--cohort and two-phase case--control studies with the ``survey'' package, using examples from \url{http://faculty.washington.edu/norm/software.html}. Some of the examples were published by Breslow \& Chatterjee (1999). The data are relapse rates from the National Wilm's Tumor Study (NWTS). Wilm's Tumour is a rare cancer of the kidney in children. Intensive treatment cures the majority of cases, but prognosis is poor when the disease is advanced at diagnosis and for some histological subtypes. The histological characterisation of the tumour is difficult, and histological group as determined by the NWTS central pathologist predicts much better than determinations by local institution pathologists. In fact, local institution histology can be regarded statistically as a pure surrogate for the central lab histology. In these examples we will pretend that the (binary) local institution histology determination (\texttt{instit}) is avavailable for all children in the study and that the central lab histology (\texttt{histol}) is obtained for a probability sample of specimens in a two-phase design. We treat the initial sampling of the study as simple random sampling from an infinite superpopulation. We also have data on disease stage, a four-level variable; on relapse; and on time to relapse. \section*{Case--control designs} Breslow \& Chatterjee (1999) use the NWTS data to illustrate two-phase case--control designs. The data are available at \url{http://faculty.washington.edu/norm/software.html} in compressed form; we first expand to one record per patient. <<>>= library(survey) load(system.file("doc","nwts.rda",package="survey")) nwtsnb<-nwts nwtsnb$case<-nwts$case-nwtsb$case nwtsnb$control<-nwts$control-nwtsb$control a<-rbind(nwtsb,nwtsnb) a$in.ccs<-rep(c(TRUE,FALSE),each=16) b<-rbind(a,a) b$rel<-rep(c(1,0),each=32) b$n<-ifelse(b$rel,b$case,b$control) index<-rep(1:64,b$n) nwt.exp<-b[index,c(1:3,6,7)] nwt.exp$id<-1:4088 @ As we actually do know \texttt{histol} for all patients we can fit the logistic regression model with full sampling to compare with the two-phase analyses <<>>= glm(rel~factor(stage)*factor(histol), family=binomial, data=nwt.exp) @ The second phase sample consists of all patients with unfavorable histology as determined by local institution pathologists, all cases, and a 20\% sample of the remainder. Phase two is thus a stratified random sample without replacement, with strata defined by the interaction of \texttt{instit} and \texttt{rel}. <<>>= dccs2<-twophase(id=list(~id,~id),subset=~in.ccs, strata=list(NULL,~interaction(instit,rel)),data=nwt.exp) summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=dccs2)) @ Disease stage at the time of surgery is also recorded. It could be used to further stratify the sampling, or, as in this example, to post-stratify. We can analyze the data either pretending that the sampling was stratified or using \texttt{calibrate} to post-stratify the design. <<>>= dccs8<-twophase(id=list(~id,~id),subset=~in.ccs, strata=list(NULL,~interaction(instit,stage,rel)),data=nwt.exp) gccs8<-calibrate(dccs2,phase=2,formula=~interaction(instit,stage,rel)) summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=dccs8)) summary(svyglm(rel~factor(stage)*factor(histol),family=binomial,design=gccs8)) @ \section*{Case--cohort designs} In the case--cohort design for survival analysis, a $P$\% sample of a cohort is taken at recruitment for the second phase, and all participants who experience the event (cases) are later added to the phase-two sample. Viewing the sampling design as progressing through time in this way, as originally proposed, gives a double sampling design at phase two. It is simpler to view the process \emph{sub specie aeternitatis}, and to note that cases are sampled with probability 1, and controls with probability $P/100$. The subcohort will often be determined retrospectively rather than at recruitment, giving stratified random sampling without replacement, stratified on case status. If the subcohort is determined prospectively we can use the same analysis, post-stratifying rather than stratifying. There have been many analyses proposed for the case--cohort design (Therneau \& Li, 1999). We consider only those that can be expressed as a Horvitz--Thompson estimator for the Cox model. First we load the data and the necessary packages. The version of the NWTS data that includes survival times is not identical to the data set used for case--control analyses above. <<>>= library(survey) library(survival) data(nwtco) ntwco<-subset(nwtco, !is.na(edrel)) @ Again, we fit a model that uses \texttt{histol} for all patients, to compare with the two-phase design <<>>= coxph(Surv(edrel, rel)~factor(stage)+factor(histol)+I(age/12),data=nwtco) @ We define a two-phase survey design using simple random superpopulation sampling for the first phase, and sampling without replacement stratified on \texttt{rel} for the second phase. The \texttt{subset} argument specifies that observations are in the phase-two sample if they are in the subcohort or are cases. As before, the data structure is rectangular, but variables measured at phase two may be \texttt{NA} for participants not included at phase two. We compare the result to that given by \texttt{survival::cch} for Lin \& Ying's (1993) approach to the case--cohort design. <<>>= (dcch<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel), subset=~I(in.subcohort | rel), data=nwtco)) svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12), design=dcch) subcoh <- nwtco$in.subcohort selccoh <- with(nwtco, rel==1|subcoh==1) ccoh.data <- nwtco[selccoh,] ccoh.data$subcohort <- subcoh[selccoh] cch(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age/12), data =ccoh.data, subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing") @ Barlow (1994) proposes an analysis that ignores the finite population correction at the second phase. This simplifies the standard error estimation, as the design can be expressed as one-phase stratified superpopulation sampling. The standard errors will be somewhat conservative. More data preparation is needed for this analysis as the weights change over time. <<>>= nwtco$eventrec<-rep(0,nrow(nwtco)) nwtco.extra<-subset(nwtco, rel==1) nwtco.extra$eventrec<-1 nwtco.expd<-rbind(subset(nwtco,in.subcohort==1),nwtco.extra) nwtco.expd$stop<-with(nwtco.expd, ifelse(rel & !eventrec, edrel-0.001,edrel)) nwtco.expd$start<-with(nwtco.expd, ifelse(rel & eventrec, edrel-0.001, 0)) nwtco.expd$event<-with(nwtco.expd, ifelse(rel & eventrec, 1, 0)) nwtco.expd$pwts<-ifelse(nwtco.expd$event, 1, 1/with(nwtco,mean(in.subcohort | rel))) @ The analysis corresponds to a cluster-sampled design in which individuals are sampled stratified by subcohort membership and then time periods are sampled stratified by event status. Having individual as the primary sampling unit is necessary for correct standard error calculation. <<>>= (dBarlow<-svydesign(id=~seqno+eventrec, strata=~in.subcohort+rel, data=nwtco.expd, weight=~pwts)) svycoxph(Surv(start,stop,event)~factor(stage)+factor(histol)+I(age/12), design=dBarlow) @ In fact, as the finite population correction is not being used the second stage of the cluster sampling could be ignored. We can also produce the stratified bootstrap standard errors of Wacholder et al (1989), using a replicate weights analysis <<>>= (dWacholder <- as.svrepdesign(dBarlow,type="bootstrap",replicates=500)) svycoxph(Surv(start,stop,event)~factor(stage)+factor(histol)+I(age/12), design=dWacholder) @ \subsection*{Exposure-stratified designs} Borgan et al (2000) propose designs stratified or post-stratified on phase-one variables. The examples at \url{http://faculty.washington.edu/norm/software.html} use a different subcohort sample for this stratified design, so we load the new \texttt{subcohort} variable <<>>= load(system.file("doc","nwtco-subcohort.rda",package="survey")) nwtco$subcohort<-subcohort d_BorganII <- twophase(id=list(~seqno,~seqno), strata=list(NULL,~interaction(instit,rel)), data=nwtco, subset=~I(rel |subcohort)) (b2<-svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12), design=d_BorganII)) @ We can further post-stratify the design on disease stage and age with \texttt{calibrate} <<>>= d_BorganIIps <- calibrate(d_BorganII, phase=2, formula=~age+interaction(instit,rel,stage)) svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12), design=d_BorganIIps) @ \section*{References} Barlow WE (1994). Robust variance estimation for the case-cohort design. \emph{Biometrics} 50: 1064-1072 Borgan \O, Langholz B, Samuelson SO, Goldstein L and Pogoda J (2000). Exposure stratified case-cohort designs, \emph{Lifetime Data Analysis} 6:39-58 Breslow NW and Chatterjee N. (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. \emph{Applied Statistics} 48:457-68. Lin DY, and Ying Z (1993). Cox regression with incomplete covariate measurements. \emph{Journal of the American Statistical Association} 88: 1341-1349. Therneau TM and Li H., Computing the Cox model for case-cohort designs. \emph{Lifetime Data Analysis} 5:99-112, 1999 Wacholder S, Gail MH, Pee D, and Brookmeyer R (1989) Alternate variance and efficiency calculations for the case-cohort design \emph{Biometrika}, 76, 117-123 \end{document}