%\VignetteIndexEntry{Copy Number Variation Tools}
%\VignetteDepends{CNVtools}
%\VignettePackage{CNVtools}
\documentclass[10pt]{article}
\usepackage{amsmath, amsfonts, amssymb, amsthm}
\usepackage{graphicx}
\usepackage{fullpage}

\title{CNVtools}
\author{Vincent Plagnol and Chris Barnes}
\date{\today}

\begin{document}
\maketitle
\tableofcontents

\section{What \texttt{CNVtools} is meant to do}
\subsection{Input data}
\texttt{CNVtools} is an \texttt{R} package meant to perform robust case-control and quantitative trait association testing of copy number variants (CNVs). The association testing procedure in \texttt{CNVtools} requires a one-dimensional normalized data summary per sample. A second vector is required to provide the case-control status or, alternatively, a quantitative trait. An optional third vector can be used to separate the samples into batches in situations where one expects the signal distribution to differ between batches because of potential technical artifacts. This issue of differential bias is discussed in detail in the companion publication in Nature Genetics, 2008 (see the reference at the end of this vignette).

\texttt{CNVtools} fits a mixture model to the one-dimensional CNV data summary. For this approach to make sense, it is essential that the distinct components can at least be distinguished in the histogram of the normalized signal. If this is not the case, mixture models are not an appropriate modeling tool and \texttt{CNVtools} cannot be used to analyze such data.

Note that no code is provided in this package for CNV signal normalization. However, it is frequent for multiple CNV probes located in the same chromosome region to provide information about the same CNV. It is therefore useful to combine the information across a small number of CNV probes to obtain a one-dimensional signal for each sample. In fact, this summary step is required in order to use the \texttt{CNVtools} routines, which need a one-dimensional data summary as input. This data summary step is the purpose of the functions \texttt{apply.pca} and \texttt{apply.ldf}, which implement two methods to summarize CNV data across multiple probes: principal component analysis and linear discriminant function analysis. Unlike the actual association testing routines (such as \texttt{CNVtest.binary}) and the underlying C code, these two functions are not essential to \texttt{CNVtools}, but we thought it would be convenient to include them.

\subsection{Genome-wide or single CNV data}
Unlike an \texttt{R} package such as ``snpMatrix'', which is specifically designed for genome-wide data, the \texttt{CNVtools} functions are implemented to analyze each CNV separately. To be more explicit, we did not design data structures to store and analyze data from multiple CNVs (for example, genome-wide). However, this package was initially designed to analyze CNV data generated by the Wellcome Trust Case Control Consortium (WTCCC), which performed a genome-wide scan for association for eight common disorders (approximately 19,000 samples). \texttt{CNVtools} is therefore well suited for large scale CNV association studies, but it is the user's task to write a wrapper in order to apply the \texttt{CNVtools} code separately for each CNV; a minimal sketch of such a wrapper is shown below.
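For illustration only, here is one possible sketch of such a wrapper; it is not evaluated when the vignette is built. The list \texttt{cnv.signal.list} (one matrix of normalized intensities per CNV) and the vectors \texttt{sample.ids}, \texttt{batch} and \texttt{trait} are hypothetical placeholders that the user must construct; only the \texttt{CNVtools} calls shown elsewhere in this vignette are used.

<<eval=FALSE>>=
## hypothetical inputs: cnv.signal.list is a list with one matrix of
## normalized intensities per CNV; sample.ids, batch and trait are
## vectors with one entry per sample
LR.statistics <- numeric(length(cnv.signal.list))
for (i in seq_along(cnv.signal.list)) {
  ## one-dimensional summary of the probes for this CNV
  one.dim.signal <- apply.pca(cnv.signal.list[[i]])
  fit <- CNVtest.binary(signal = one.dim.signal,
                        sample = sample.ids,
                        batch = batch,
                        disease.status = trait,
                        ncomp = 3, n.H0 = 3, n.H1 = 1)
  ## likelihood ratio statistic for this CNV
  LR.statistics[i] <- -2 * (fit$model.H0$lnL - fit$model.H1$lnL)
}
@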
To give some explicit computation time: for the WTCCC CNV association study, using recently purchased computers (as of 2009, a mixture of Dual-Core AMD Opteron 2220 and Quad-Core AMD Opteron 2384 processors), \texttt{CNVtools} could analyze approximately 3,000 CNVs typed in 19,000 individuals in an approximate total of 20 days of computing (6 hours using an 80-node computing cluster).

\section{First look at the data}
We first load an example CNV data set, called A112, for the two WTCCC control groups (1958 British Birth Cohort and National Blood Services). The data required by \texttt{CNVtools} is a matrix of normalized signal intensities. In this example each row represents an individual and each column represents a locus within the CNV. To get a feel for the data, we plot the histograms for the mean intensity as well as the first principal component (Figure \ref{mean-pca.fig}).

<<>>=
library(CNVtools)
data(A112)
head(A112)

raw.signal <- as.matrix(A112[, -c(1,2)])
dimnames(raw.signal)[[1]] <- A112$subject

mean.signal <- apply(raw.signal, MARGIN=1, FUN=mean)
pca.signal <- apply.pca(raw.signal)

pdf("fig/mean_pca_signal.pdf", width=10, height=5)
par(mfrow=c(1,2))
hist(mean.signal, breaks=50, main='Mean signal', cex.lab=1.3)
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)
dev.off()
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=11cm]{fig/mean_pca_signal.pdf}
\end{center}
\caption{Histograms for the mean intensity and the first principal component of the CNV A112.}
\label{mean-pca.fig}
\end{figure}

\section{Model selection using the Bayesian Information Criterion (BIC)}
To determine the number of components of the CNV for the downstream analysis, we can run the model selection algorithm. The models scanned by this function can be specified by the user; however, the default settings determine the number of components using some general models for the mean and variance of the components. We specify 3 iterations under $\mathbb{H}_0$ to increase the chances of locating the global maximum for each model. The output of the model selection gives the BIC (and AIC) for the different models. The model that minimizes the chosen statistic is the most likely model. This is demonstrated in Figure \ref{model-select.fig}.

<<>>=
batches <- factor(A112$cohort)
sample <- factor(A112$subject)
set.seed(0)
results <- CNVtest.select.model(signal=pca.signal, batch = batches,
                                sample = sample, n.H0 = 3, method="BIC",
                                v.ncomp = 1:5,
                                v.model.component = rep('gaussian',5),
                                v.model.mean = rep("~ strata(cn)",5),
                                v.model.var = rep("~1", 5))
ncomp <- results$selected
pdf("fig/modelselect.pdf",width=5,height=5)
plot(-results$BIC, xlab="n comp", ylab="-BIC", type="b", lty=2, col="red", pch = '+')
dev.off()
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=11cm]{fig/modelselect.pdf}
\end{center}
\caption{The BIC as a function of the number of components fit to the data, as output from the model selection stage. The most appropriate model is the one that minimizes the BIC.}
\label{model-select.fig}
\end{figure}
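As a quick sanity check, the selected number of components can be compared with the position of the minimum of the returned BIC values. The sketch below is not evaluated; it assumes the \texttt{results} object from the previous chunk and that the \texttt{BIC} vector is ordered as in \texttt{v.ncomp}.

<<eval=FALSE>>=
## index of the model minimizing the BIC, ordered as in v.ncomp
## (1 to 5 components); this should agree with results$selected
which.min(results$BIC)
results$selected
@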
\section{Clustering the PCA transformed data}
We can then cluster the result of the PCA analysis under the null hypothesis of no association between the number of copies and the case-control status. The data will be clustered only once, assuming the null hypothesis $\mathbb{H}_0$. Because of this, we do not have to specify any case-control status.

We assume a free model for the means and the variances for each number of copies using '$\sim$ strata(cn)'. We could have chosen free variances for each combination of batch and copy number using '$\sim$ strata(cn, batch)'. Alternatively, a variance model proportional to the number of copies is possible using '$\sim$ cn'. Note, however, that the formulation using strata is much quicker and more numerically robust, and should be used when possible. We can also provide an optional vector of starting values for the mean locations of the three clusters.

Note that we must check the status of the fit. Only 'C' should be accepted for further analysis. The possibilities include:
\begin{itemize}
\item['C'] Converged. This is the only acceptable status.
\item['M'] Maximum iterations reached. The EM algorithm did not converge.
\item['P'] Posterior density problem. The posterior probabilities are not monotonic.
\item['F'] Fit failed. Most likely due to a singularity at $\sigma = 0$.
\end{itemize}
The output is a list, and the first element of this list is the data frame of interest. In Figure \ref{pca-fit.fig} we plot the result of the clustering.

<<>>=
ncomp <- 3
batches <- factor(A112$cohort)
sample <- factor(A112$subject)
fit.pca <- CNVtest.binary ( signal = pca.signal, sample = sample, batch = batches, ncomp = ncomp, n.H0=3, n.H1=0, model.var= '~ strata(cn)')
print(fit.pca$status.H0)
pdf("fig/pca-fit.pdf", width=10, height=5)
par(mfrow=c(1,2))
cnv.plot(fit.pca$posterior.H0, batch = '58C', main = 'Cohort 58C', breaks = 50, col = 'red')
cnv.plot(fit.pca$posterior.H0, batch = 'NBS', main = 'Cohort NBS', breaks = 50, col = 'red')
dev.off()
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=11cm]{fig/pca-fit.pdf}
\end{center}
\caption{Output of the clustering procedure using the PCA, under the null hypothesis $\mathbb{H}_0$ of no allele frequency difference between the two cohorts. The colored lines show the posterior probability for each of the three copy number classes (copy number $= 1, 2$ or $3$). For clarity the scale for the posterior probabilities is not shown, but the maximum is 1 and the three posterior probabilities always add up to 1.}
\label{pca-fit.fig}
\end{figure}

\section{Assigning individuals to a copy number genotype}
The output from the clustering under $\mathbb{H}_0$ can be used to obtain the posterior probabilities and MAP estimates of an individual's cluster membership. This can also be applied after the LDF improvement (see below). The columns P1, P2 and P3 represent the posterior probabilities of belonging to components 1, 2 and 3. The column labeled cn contains the MAP assignment.

<<>>=
head(fit.pca$posterior.H0)
@
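For downstream analyses one may wish to retain only confident genotype assignments. The following sketch, which is not evaluated, keeps the samples whose maximum posterior probability exceeds 0.95; the cutoff is an arbitrary illustration, not a package default.

<<eval=FALSE>>=
post <- fit.pca$posterior.H0
## maximum posterior probability over the three copy number classes
max.post <- pmax(post$P1, post$P2, post$P3)
## keep only confident MAP assignments (0.95 is an arbitrary cutoff)
confident.calls <- post[max.post > 0.95, c('subject', 'cn')]
@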
\section{Improving using the LDF procedure}
It is now possible to use the posterior probabilities from the PCA procedure to improve the fit. This is done by using linear discriminant analysis.

<<>>=
ncomp <- 3
pca.posterior <- as.matrix((fit.pca$posterior.H0)[, paste('P', 1:ncomp, sep='')])
dimnames(pca.posterior)[[1]] <- (fit.pca$posterior.H0)$subject
ldf.signal <- apply.ldf(raw.signal, pca.posterior)

pdf("fig/ldf_pca_signal.pdf", width=10, height=5)
par(mfrow=c(1,2))
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)
hist(ldf.signal, breaks=50, main='LDF signal', cex.lab=1.3)
dev.off()
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=11cm]{fig/ldf_pca_signal.pdf}
\end{center}
\caption{Comparing the LDF and PCA analyses, both clustered under the null hypothesis of no association.}
\label{ldf-pca.fig}
\end{figure}

The results of the LDF analysis can now be seen in Figure \ref{ldf-pca.fig}, and we can observe a clear improvement. The data will then be much easier to cluster.

\section{Testing for genetic association with a dichotomous disease trait}
\subsection{Some mathematical details}
The association testing approach has been described previously (see the reference below) but for completeness we sketch the principle. We use a likelihood ratio approach to test for association between the genotype calls and the case-control status. Genotypes are called using a finite mixture model. Formally, this association test can be summarized as jointly fitting two linear models:
\begin{eqnarray}
X & = & \gamma + \theta^t Z + \epsilon \label{eqn1} \\
\text{logit}(Y) & = & \alpha + \beta X \label{eqn2}
\end{eqnarray}
The first model is the Gaussian or T mixture model, and the second model is a traditional logistic regression model. Notations are:
\begin{itemize}
\item $X$ is an $N$-dimensional vector of signal intensities, where $N$ is the number of samples in the study.
\item $Z$ is the $(N, G)$ matrix of genotype assignments, where $G$ designates the number of copy number classes: $Z_{i,j} = 1$ if and only if sample $i$ has genotype $j$. Each row $z_i$ of $Z$ is sampled from a multinomial distribution with probabilities $(\Phi_j)_{j=1}^G$ representing the genotype frequencies in the sampled population.
\item The error term $\epsilon$ is normally distributed with mean 0.
\item $\theta$ is a $G$-dimensional vector, linking the genotype status with the mean value of the signal intensity.
\item $\alpha$ and $\beta$ are scalars, and $\beta \neq 0$ under the alternative $\mathbb{H}_1$. Our default assumption is that the log-odds ratio is proportional to the signal $X$.
\item $Y$ is the $N$-dimensional binary vector describing the case-control status.
\end{itemize}
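Given the fitted log-likelihoods under both hypotheses, the likelihood ratio statistic and an asymptotic p-value can be computed as in the sketch below. The sketch is not evaluated; \texttt{fit} stands for any object returned by \texttt{CNVtest.binary} with \texttt{n.H1} $\geq 1$.

<<eval=FALSE>>=
## likelihood ratio statistic; under the null and the default linear
## odds model it is asymptotically chi-squared with one degree of freedom
LR.statistic <- -2 * (fit$model.H0$lnL - fit$model.H1$lnL)
p.value <- pchisq(LR.statistic, df = 1, lower.tail = FALSE)
@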
\subsection{Example}
Now that we have summarized the intensity data in an efficient manner, we can use it to test for genetic association between the two cohorts. Here we have defined an artificial trait, equal to 0 for 58C and 1 for NBS. We can specify the number of iterations under $\mathbb{H}_0$ and $\mathbb{H}_1$; here a single iteration under $\mathbb{H}_1$ suffices because the data quality is sufficient and does not require multiple iterations to be fitted properly.

<<>>=
ncomp <- 3
trait <- ifelse( A112$cohort == '58C', 0, 1)

fit.ldf <- CNVtest.binary ( signal = ldf.signal, sample = sample, batch = batches, disease.status = trait, ncomp = ncomp, n.H0=3, n.H1=1, model.var = "~cn")
print(fit.ldf$status.H0)
print(fit.ldf$status.H1)

pdf("fig/ldf-fit.pdf", width=10, height=5)
par(mfrow=c(1,2))
cnv.plot(fit.ldf$posterior.H0, batch = '58C', main = 'Cohort 58C', breaks = 50, col = 'red')
cnv.plot(fit.ldf$posterior.H0, batch = 'NBS', main = 'Cohort NBS', breaks = 50, col = 'red')
dev.off()

LR.statistic <- -2*(fit.ldf$model.H0$lnL - fit.ldf$model.H1$lnL)
print(LR.statistic)
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=11cm]{fig/ldf-fit.pdf}
\end{center}
\caption{Output of the clustering procedure using the LDF, under the null hypothesis $\mathbb{H}_0$ of no association. The colored lines show the posterior probability for each of the three copy number classes. For clarity the scale for the posterior probabilities is not shown, but the maximum is 1 and the three posterior probabilities always add up to 1.}
\label{ldf-fit.fig}
\end{figure}

If the fit is correct and there is indeed no association, this statistic should be distributed as $\chi^2$ with one degree of freedom. The fit can be checked in Figure \ref{ldf-fit.fig}.

Note that the assumed disease model, under the alternative hypothesis $\mathbb{H}_1$, is a linear odds model, which means that the effect on the log-odds is proportional to the number of copies. We might be interested in testing an allelic model, where the odds are not constrained by a linear trend. This is done by specifying the \texttt{model.disease} formula when fitting the data, as follows:

<<>>=
fit.ldf <- CNVtest.binary ( signal = ldf.signal, sample = sample, batch = batches, disease.status = trait, ncomp = 3, n.H0=3, n.H1=1, model.disease = " ~ as.factor(cn)")
print(fit.ldf$status.H0)
print(fit.ldf$status.H1)
LR.statistic <- -2*(fit.ldf$model.H0$lnL - fit.ldf$model.H1$lnL)
print(LR.statistic)
@

The default for \texttt{model.disease} is $\sim \texttt{cn}$. Introducing the \texttt{factor} adds one degree of freedom, canceling the default linear constraint. The resulting statistic is now distributed, under the null, as $\chi^2$ with 2 degrees of freedom.

\section{Testing for genetic association with a quantitative trait}
Now consider testing for association with a quantitative trait. The model now consists of a standard linear regression instead of a logistic regression. For this example we will generate a hypothetical Gaussian trait and test for association with the combined NBS and 58C individuals. The association test is done in a completely analogous way, namely the LR statistic under $\mathbb{H}_0$ should be distributed as $\chi^2$ with one degree of freedom, assuming a linear trend model.

<<>>=
batches <- rep("ALL",length(sample))
qt <- rnorm(length(sample), mean=9.0, sd=1.0)

fit.ldf <- CNVtest.qt(signal = ldf.signal, sample = sample, batch = batches, qt = qt, ncomp = ncomp, n.H0=3, n.H1=1, model.var = "~strata(cn)")
print(fit.ldf$status.H0)
print(fit.ldf$status.H1)
LR.statistic <- -2*(fit.ldf$model.H0$lnL - fit.ldf$model.H1$lnL)
print(LR.statistic)

pdf("fig/qt-fit.pdf", width=15, height=5)
qt.plot(fit.ldf)
dev.off()
@

\begin{figure}[!htb]
\begin{center}
\includegraphics[width=16cm]{fig/qt-fit.pdf}
\end{center}
\caption{Output of the quantitative trait association procedure on the LDF transformed data.
For the rightmost panel, the colored lines show the posterior probability for each of the three copy number classes.}
\label{qt-fit.fig}
\end{figure}

\section{Reference}
The statistical ideas underlying this package have been published in:\\
{\it A robust statistical method for case-control association testing with copy number variation}, Chris Barnes, Vincent Plagnol, Tomas Fitzgerald, Richard Redon, Jonathan Marchini, David G. Clayton and Matthew E. Hurles, Nature Genetics, 2008.

\end{document}