%\VignetteIndexEntry{DNAcopy} %\VignetteDepends{} %\VignetteKeywords{DNA Copy Number Analysis} %\VignettePackage{DNAcopy} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=FALSE} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \begin{document} \setkeys{Gin}{width=0.99\textwidth} \title{\bf DNAcopy: A Package for Analyzing DNA Copy Data} \author{Venkatraman E. Seshan$^1$ and Adam B. Olshen$^2$} \maketitle \begin{center} $^1$Department of Epidemiology and Biostatistics\\ Memorial Sloan-Kettering Cancer Center\\ {\tt seshanv@mskcc.org}\\ \ \\ $^2$Department of Epidemiology and Biostatistics\\ University of California, San Francisco\\ {\tt olshena@biostat.ucsf.edu} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} This document presents an overview of the {\tt DNAcopy} package. This package is for analyzing array DNA copy number data, which is usually (but not always) called array Comparative Genomic Hybridization (array CGH) data \citep{pinkel98, snijders01, wigler03}. It implements our methodology for finding change-points in these data \citep{olshen04}, which are points after which the (log) test over reference ratios have changed location. Our model is that the change-points correspond to positions where the underlying DNA copy number has changed. Therefore, change-points can be used to identify regions of gained and lost copy number. We also provide a function for making relevant plots of these data. \section{Data} We selected a subset of the data set presented in \cite{snijders01}. We are calling this data set {\tt coriell}. The data correspond to two array CGH studies of fibroblast cell strains. In particular, we chose the studies {\bf GM05296} and {\bf GM13330}. After selecting only the mapped data from chromosomes 1-22 and X, there are 2271 data points. There is accompanying spectral karyotype data (not included), which can serve as a gold standard. The data can be found at \\ \url{http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html} \section{An Example} Here we perform an analysis on the {\bf GM05296} array CGH study described above. <>= library(DNAcopy) @ <>= data(coriell) @ \noindent Before segmentation the data needs to be made into a CNA object. <>= CNA.object <- CNA(cbind(coriell$Coriell.05296), coriell$Chromosome,coriell$Position, data.type="logratio",sampleid="c05296") @ \noindent We generally recommend smoothing single point outliers before analysis. It is a good idea to check that the smoothing is proper for a particular data set. <>= smoothed.CNA.object <- smooth.CNA(CNA.object) @ \noindent After smoothing, if necessary, the segmentation is run. Here the default parameters are used. A brief discussion of parameters that can be adjusted is in the Tips section. <>= segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1) @ %Plot whole studies \noindent There are a number of plots that can be made. The first is ordering the data by chromosome and map positons. The red lines correspond to mean values in segments. Note that the points are in alternate colors to indicate different chromosomes. \pagebreak \begin{center} <>= plot(segment.smoothed.CNA.object, plot.type="w") @ \end{center} \noindent Another possibility is to plot by chromosome within a study. \begin{center} <>= plot(segment.smoothed.CNA.object, plot.type="s") @ \end{center} %Plot each chromosome across studies (6 per page) %\begin{center} %<>= %plot(segment.smoothed.CNA.object, plot.type="c", % cbys.layout=c(2,1), % cbys.nchrom=6) %@ %\end{center} %Plot by plateaus \noindent If there are multiple studies, one could plot by chromosome across studies using the option {\tt plot.type='c'}. A final plot orders the segment by their chromosome means. One can take the plateaus in this plot to determine what the mean values should be for calling segments gains or losses. In this case, maybe $0.4$ for gains and $-0.6$ for losses. For most data, these plateaus are much closer to zero. The next generation of this software will have automatic methods for calling gains and losses. \begin{center} <>= plot(segment.smoothed.CNA.object, plot.type="p") @ \end{center} \noindent Change-points are often found due to local trends in the data. An undo method is needed to get rid of unnecessary change-points. Below all splits that are not at least three SDs apart are removed. The following plot shows that all splits not corresponding to the gold standard results have been removed. <>= sdundo.CNA.object <- segment(smoothed.CNA.object, undo.splits="sdundo", undo.SD=3,verbose=1) @ \begin{center} <>= plot(sdundo.CNA.object,plot.type="s") @ \end{center} \section{Tips} \noindent A function that may be of interest that has not been mentioned is {\tt subset.CNA}. It allows for subsetting of a CNA object by chromosome and sample so that segmentation does not have to be run on a whole data set. Similarly, {\tt subset.DNAcopy} allows subsetting of DNAcopy objects, which contain the output of segmentation. The original default segmentation algorithm, because it was based on permutation, took $O(N^2)$ computations, where $N$ is the number of markers on a chromosome. The new default algorithm is much faster. It includes a hybrid approach to compute the $p$-value for segmenting based partly on permutation and partly on a Gaussian approximation (available in all versions after 1.2.0) and a stopping rule (available in all versions after 1.5.0) to declare change when there is a strong evidence for its presence \citep{venkat07}. We no longer recommend using overlapping windows for larger data sets. It is still possible to run the full permutations analysis using the option {\tt p.method='perm'}. If the new algorithm is still too slow, one can reduce the number of permutations in the hybrid method using the parameter {\tt nperm} (default is 10,000). However, the lower {\tt alpha} (the significance level for the test to accept change-points) is, the more permutations that are needed. The stopping boundary needs to be computed for any choice of {\tt nperm} and {\tt alpha} which is not the default which is done automatically within the function {\tt segment} or can be done externally using the function {\tt getbdry} and passed on to {\tt segment}. %\newpage \bibliographystyle{apalike} \bibliography{DNAcopy} \end{document}