%\VignetteIndexEntry{cghMCR findMCR}
%\VignetteKeyword{platform}
%\VignettePackage{cghMCR}
\documentclass[12pt]{article}
\usepackage{hyperref}
\usepackage{graphicx}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}
\author{Jianhua Zhang\and Bin Feng}

\title{How to use cghMCR}

\maketitle

\section{Overview}

Copy number data (arrayCGH or SNP) can be used to identify genomic regions (Regions Of Interest or ROI) showing gains or losses that are common across samples. Existing algorithms, MCR (Aguirre et. al. 2004), GISTIC(), or GTS(), for the identification of ROI rely on the probe level data, which may be a concern when array density increases (to 1 million for example) or when data generated by arrays of different densities need to be analyzed together. The \Rpackage{cghMCR} initially implemented simplified version of MCR. An alternative approach (Segment Gain Or Loss or SGOL) were then added by applying a modified version of GISTIC algorithm so that the computations can be done based on segmented data. This vignette demonstrate how to use \Rpackage{cghMCR} to identify genomic alterations across samples profiled using copy number platforms using the two approaches (e. g. arrayCGH or SNP array). 

Since both approaches use the output of CBS (\Rpackage{DNAcopy}), the first section of the vignette shows how to generate the segmented data using \Rpackage{DNAcopy} based on three raw arrayCGH data files to reduce the length of time required for the calculation. 
\section{From raw to segmented data}


\section{From raw data to segment list}

The raw data used for segment computation are downloaded from the TCGA web site \url{http://www.} and stored in the \textit{sampleData} directory of the package. Several \textit{Bioconductor} packages may be used to process the raw data. Here we choose to use \Rpackage{limma} to process and normalize the raw data. The three samples files are:

<<>>=
require("limma")
arrayFiles  <- list.files(system.file("sampleData", package = "cghMCR"), 
     full.names = TRUE, pattern = "TCGA")
arrayFiles
@ 

\Rfunction{read.maimages} is a generic function of the \Rpackage{limma} package that can be used to read the process the raw data. In the example below we used the default settings of the function. Curious readers may read the man page of \Rfunction{read.maimages} for descriptions of the parameters and their possible settings.

<<>>=
rawData <- read.maimages(arrayFiles, source = "agilent", columns = 
    list(R = "rMedianSignal", G = "gMedianSignal", Rb = "rBGMedianSignal", 
    Gb = "gBGMedianSignal"), annotation = c("Row", "Col", "ControlType", 
    "ProbeName", "GeneName", "SystematicName", "PositionX", "PositionY", 
    "gIsFeatNonUnifOL", "rIsFeatNonUnifOL", "gIsBGNonUnifOL", "rIsBGNonUnifOL",
    "gIsFeatPopnOL", "rIsFeatPopnOL", "gIsBGPopnOL", "rIsBGPopnOL", 
    "rIsSaturated", "gIsSaturated"), names = basename(arrayFiles))
@

Dye assignment defults to Cy5 = sample and Cy3 for reference in limma (set for expression arrays). However, arrayCGH experiments are usually carried out with sample dyed using Cy3. To take this into account, we define a design vector using 1 (Cy5 = sample) or -1 (Cy3 = sample) to indicate the dye assignment for each sample.

<<>>=
rawData$design <- c(-1, -1, -1)
@

The following code does the background correction and normalization within and then between arrays:

<<>>=
ma <- normalizeWithinArrays(backgroundCorrect(rawData, method = "minimum"), method = "loess")
@

Since some of the probes are not mapped to exact positions in the genome, we need to drop them together with the control probes.

<<>>=
chrom <- gsub("chr([0-9XY]+):.*", "\\1", ma$genes[, "SystematicName"])
dropMe <- c(which(!chrom %in% c(1:22, "X", "Y")), which(ma$genes[, "ControlType"] != 0))
@

A common approach to analyzing copy number data is to apply the \Rfunction{segment} function of \Rpackage{DNAcopy} to segment the normalized data so that chromosome regions with the same copy number have the same segment mean values.

<<>>=
require(DNAcopy, quietly = TRUE)
set.seed(25)
cna <- CNA(ma$M[-dropMe, ], 
    gsub("chr([0-9XY]+):.*", "\\1", ma$genes[-dropMe, "SystematicName"]),
    as.numeric(gsub(".*:([0-9]+)-.*", "\\1", 
      ma$genes[-dropMe, "SystematicName"])),
    data.type = "logratio", sampleid = colnames(ma$M)) 
segData <- segment(smooth.CNA(cna)) 
@

The \textit{segData} object contains the segment list that we are going to use in the sections to follow. The segment list can be extracted from the \textit{segData} object by issuing the following command.

<<>>=
mySeglist <- segData[["output"]]
head(mySeglist)
@

\section{Identifying Segment Gain Or Loss (SGOL)}

In this section or sections to follow, we are not going to used \textit{mySeglist} we created in the previous section as the data set only contains three samples. Instead , we will use a different set of sample data that was created the same way but with more samples to make the results more interesting. The sample data is stored in the \textit{data} subdirectory of the \Rpackage{CNTools} package and can be loaded into R by:

<<>>=
require(CNTools, quietly = TRUE)
data("sampleData", package = "CNTools")
head(sampleData)
@

The segment list shown above is a data frame but can not be used directly for computation across samples as each row only contains the segment data for a given segment within a sample. For computations on segmented data across samples, the segment list need to be converted into a matrix format with segments as rows and samples as columns. The \Rpackage{CNTools} packages provides the functionalities for data conversion and we are going to take the advantage of the package without detailing the algorithm of the conversion. Curious readers are encouraged to read the vignette of \Rpackage{CNTools} for detailed descriptions of the algorithms. Using the following code, we convert the segment list into a matrix format with by aligning samples based on chromosome segment defined by genes. Alternatively we can align samples based on overlapping chromosomal fragments. The \Rpackage{CNTools} vignette has an example for that. Since the sample data contains over 200 samples, we only take 20 random samples here for the sake of time.    

<<>>=
data(geneInfo)
data(sampleData, package = "CNTools")
set.seed(1234)
convertedData <- getRS(CNSeg(sampleData[which(is.element(sampleData[, "ID"], sample(unique(sampleData[, "ID"]), 20))), ]), by = "gene", imput = FALSE, 
XY = FALSE, geneMap = geneInfo)
@ 

Once we have the segment data converted into a matrix format, we can try to identify regions showing gains or losses that are common across samples. The \textit{imput} parameter indicates whether cells with missing values will be imputed and the \textit{parameter} indicates whether regions on the X and Y chromosome should be kept. Since our samples are a mixture of male and female DNAs profiled against pooled male human DNA, we choose to drop the data on X and Y chromosomes. Parameter \textit{geneMap} is needed when samples will be aligned by genes. The \Rpackage{CNTools} contains a built human gene information data set (geneInfo) that was used in the code above. Users working on other organisms or their own gene mapping information need to create a gene mapping data set following the same format as \textit{geneInfo} shown below:

Working on the data converted from segment list, we can compute the SGOL scores for genes (or chromosomal fragment if samples are aligned by regions) by calculating the summations (parameter \textit{method}) for all the positive values over a set threshold and all the negative values below a set threshold (\textit{threshold} below). 

<<>>=
require(cghMCR, quietly = TRUE)
SGOLScores <- SGOL(convertedData, threshold = c(-0.2, 0.2), method = sum)
plot(SGOLScores)
@

\begin{figure}[htbp]
  \begin{center}
    \includegraphics[scale=0.5]{sgolPlot}
    \caption{\label{Figure 1. Plot of SGOL scores along chromosomes to show regions with gains or losses}
    }
  \end{center}
\end{figure}

Based on the SGOL scores, genes in regions of gains or losses can be obtained by a set of thresholds, say -20 and 20.

<<>>=
GOIGains <- SGOLScores[which(as.numeric(unlist(gol(SGOLScores[, "gains"]))) > 
     20), "gains"]
GOILosses <- SGOLScores[which(as.numeric(unlist(gol(SGOLScores[, "losses"]))) < 
     -20), "losses"]
head(gol(GOIGains))
@

   
\section{Identifying Minimum Common Regions (MCR)}

The \textit{MCR} approach was implemented following the heuristics listed below:

\begin{itemize}
\item MCRs are identified based on the segments obtained using \Rpackage{DNAcopy} 
\item Segments above an upper (defined by a parameter \Robject{alteredHigh}) and lower (\Robject{alteredLow}) threshold values of percentile are identified as altered.   
\item If two or more altered segments are deparated by less than 500 kb, the entire region spanned by the segments is considered to be an altered span.
\item Highly altered segments or spans are retained as informative spans that define descrete locus boundaries.
\item Informative spanes are compared acress samples to identify overlapping groups of positive or negative value segments.
\item Minimal common regions (MCRs) are defined as contiguous spans having at least a recurrence rate defined by a parameter (\Robject{recurrence}) across samples. 
\end{itemize}    

We use the \textit{segData} that were generated by running \Rfunction{segements} as the input to the \Rfunction{cghMCR} function. The parameter \Robject{gapAllowed} is numeric and indicate how many basepairs should two adjancent segments be apart, below which the segments will be joined to form an altered span. Parameters \Robject{alteredLow} and \Robject{alteredHigh} are also numerics and specify the lower and upper percential threshold values. Only segements with means less or greater than the lower or upper threshold values will be considered as altered regions and included in the subsequent analysis. \Robject{recurrence} is an integer defining the rate of recurrence for a region to show gain/loss across samples before it can be declared as an MCR. Due to the small number of sample size, the parameters are set to values that result in presentable results rather than correctness.


<<>>=
cghmcr <- cghMCR(segData, gapAllowed = 500, alteredLow = 0.9,
 alteredHigh = 0.9, recurrence = 100)
mcrs <- MCR(cghmcr)
@ 

Using the above settings, we get a few MCRs that are common to the samples.

<<>>=
head(cbind(mcrs[, c("chromosome", "status", "mcr.start", "mcr.end", 
                     "samples")])) 
@ 

To include probe ids for the MCRs identified, we can call the function \Rfunction{mergeMCRProbes} to have probe ids within each MCR appended. Multiple probes are separated by a ",".

<<>>=
mcrs <- mergeMCRProbes(mcrs[mcrs[, "chromosome"] == "7", ], as.data.frame(segData[["data"]]))
head(cbind(mcrs[, c("chromosome", "status", "mcr.start", "mcr.end", 
                     "probes")]))
@ 


\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE>>=
sessionInfo()
@

\section{References}
Aguirre, AJ, C. Brennan, G. Bailey, R. Sinha, B. Feng, C. Leo, Y. Zhang,
J. Zhang, N. Bardeesy, C. Cauwels, C. Cordon-Cardo, MS Redston, RA DePinho and
L. Chin. High-resolution Characterization of the Pancreatic Adenocarcinoma
Genome. Proc Natl Acad Sci U S A. 2004. 101(24):9067-9072.


\end{document}