%\VignetteIndexEntry{DNAcopy}
%\VignetteDepends{}
%\VignetteKeywords{DNA Copy Number Analysis}
%\VignettePackage{DNAcopy}

\documentclass[11pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\SweaveOpts{echo=FALSE}

\setlength{\textheight}{8.5in}
\setlength{\textwidth}{6in}
\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.25in}
\setlength{\evensidemargin}{0.25in}

\begin{document}
\setkeys{Gin}{width=0.99\textwidth}


\title{\bf DNAcopy: A Package for Analyzing DNA Copy Data}

\author{Venkatraman E. Seshan$^1$ and Adam B. Olshen$^2$}

\maketitle

\begin{center}
$^1$Department of Epidemiology and Biostatistics\\
Memorial Sloan-Kettering Cancer Center\\
{\tt seshanv@mskcc.org}\\
\ \\
$^2$Department of Epidemiology and Biostatistics\\
University of California, San Francisco\\
{\tt olshena@biostat.ucsf.edu}
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}

This document presents an overview of the {\tt DNAcopy} package. This
package is for analyzing array DNA copy number data, which is usually
(but not always) called array Comparative Genomic Hybridization (array
CGH) data \citep{pinkel98, snijders01, wigler03}.  It implements our
methodology for finding change-points in these data \citep{olshen04},
which are points after which the (log) test over reference ratios have
changed location.  Our model is that the change-points
correspond to positions where the underlying DNA copy number has
changed.  Therefore, change-points can be used to identify regions of
gained and lost copy number.  We also provide a function for making
relevant plots of these data.

\section{Data}

We selected a subset of the data set presented in \cite{snijders01}.
We are calling this data set {\tt coriell}.  The data correspond to
two array CGH studies of fibroblast cell strains.  In particular, we
chose the studies {\bf GM05296} and {\bf GM13330}.  After selecting
only the mapped data from chromosomes 1-22 and X, there are 2271 data
points.  There is accompanying spectral karyotype data (not included),
which can serve
as a gold standard.  The data can be found at \\
\url{http://www.nature.com/ng/journal/v29/n3/suppinfo/ng754_S1.html}

\section{An Example}

Here we perform an analysis on the {\bf GM05296} array CGH study
described above.


<<echo=TRUE,print=FALSE>>=
library(DNAcopy)
@ 

<<echo=TRUE,print=FALSE>>=
data(coriell)
@ 

\noindent
Before segmentation the data needs to be made into a CNA object.

<<echo=TRUE,print=FALSE>>=
CNA.object <- CNA(cbind(coriell$Coriell.05296),
                  coriell$Chromosome,coriell$Position,
                  data.type="logratio",sampleid="c05296")
@ 

\noindent
We generally recommend smoothing single point outliers before
analysis.  It is a good idea to check that the smoothing is proper for
a particular data set.

<<echo=TRUE,print=FALSE>>=
smoothed.CNA.object <- smooth.CNA(CNA.object)
@ 

\noindent
After smoothing, if necessary, the segmentation is run.  Here the
default parameters are used.  A brief discussion of parameters that
can be adjusted is in the Tips section.

<<echo=TRUE,print=FALSE>>=
segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
@

%Plot whole studies

\noindent
There are a number of plots that can be made.  The first is ordering
the data by chromosome and map positons.  The red lines correspond to
mean values in segments.  Note that the points are in alternate colors
to indicate different chromosomes.

\pagebreak

\begin{center}
<<fig=TRUE,echo=TRUE>>=
plot(segment.smoothed.CNA.object, plot.type="w")
@
\end{center}

\noindent
Another possibility is to plot by chromosome within a study.

\begin{center}
<<fig=TRUE,echo=TRUE,width=11,height=9>>=
plot(segment.smoothed.CNA.object, plot.type="s") 
@
\end{center}

%Plot each chromosome across studies (6 per page)

%\begin{center}
%<<fig=FALSE,echo=TRUE>>=
%plot(segment.smoothed.CNA.object, plot.type="c", 
%     cbys.layout=c(2,1), 
%     cbys.nchrom=6)
%@
%\end{center}

%Plot by plateaus

\noindent
If there are multiple studies, one could plot by chromosome across
studies using the option {\tt plot.type='c'}.  A final plot orders the
segment by their chromosome means.  One can take the plateaus in this
plot to determine what the mean values should be for calling segments
gains or losses.  In this case, maybe $0.4$ for gains and $-0.6$ for
losses.  For most data, these plateaus are much closer to zero.  The
next generation of this software will have automatic methods for
calling gains and losses.

\begin{center}
<<fig=TRUE,echo=TRUE>>=
plot(segment.smoothed.CNA.object, plot.type="p")
@
\end{center}

\noindent
Change-points are often found due to local trends in the data.  An
undo method is needed to get rid of unnecessary change-points.  Below all
splits that are not at least three SDs apart are removed.  The
following plot shows that all splits not corresponding to the gold
standard results have been removed.

<<echo=TRUE,print=FALSE,width=6,height=5>>=
sdundo.CNA.object <- segment(smoothed.CNA.object, 
                             undo.splits="sdundo", 
                             undo.SD=3,verbose=1)
@ 


\begin{center}
<<fig=TRUE,echo=TRUE,width=11,height=9>>=
plot(sdundo.CNA.object,plot.type="s")
@
\end{center}

\section{Tips}

\noindent
A function that may be of interest that has not been mentioned is {\tt
  subset.CNA}.  It allows for subsetting of a CNA object by chromosome and
sample so that segmentation does not have to be run on a whole data set.
Similarly, {\tt subset.DNAcopy} allows subsetting of DNAcopy objects, which
contain the output of segmentation.

The original default segmentation algorithm, because it was based on
permutation, took $O(N^2)$ computations, where $N$ is the number of markers on
a chromosome.  The new default algorithm is much faster.  It includes a hybrid
approach to compute the $p$-value for segmenting based partly on permutation
and partly on a Gaussian approximation (available in all versions after 1.2.0) 
and a stopping rule (available in all versions after 1.5.0) to declare change
when there is a strong evidence for its presence \citep{venkat07}.  We no
longer recommend using overlapping windows for larger data sets.  It is still
possible to run the full permutations analysis using the option {\tt
  p.method='perm'}.  If the new algorithm is still too slow, one can reduce the
number of permutations in the hybrid method using the parameter {\tt nperm}
(default is 10,000).  However, the lower {\tt alpha} (the significance level
for the test to accept change-points) is, the more permutations that are
needed.  The stopping boundary needs to be computed for any choice of {\tt
  nperm} and {\tt alpha} which is not the default which is done automatically
within the function {\tt segment} or can be done externally using the function
{\tt getbdry} and passed on to {\tt segment}.

%\newpage
\bibliographystyle{apalike}
\bibliography{DNAcopy}

\end{document}