%\VignetteIndexEntry{Estimating Abundances with sonicLength} \documentclass{article} \begin{document} \title{Estimating Abundance Using the SonicLength Package} \author{Charles C. Berry} \date{\today} \maketitle \begin{abstract} This document reviews some key functions in the \texttt{sonicLength} package. Section \ref{sec:data-structure} shows how to set up data from an experiment in which sonication is used to retrieve sites. Sections \ref{sec:using-textttestabund} and \ref{sec:replicated-data} show how to call the \texttt{estAbund} function. Section \ref{sec:jackkn-corr} demonstrates how the \texttt{jackknife} option can be used to bias correct estimates or estimate standard errors. Simulation is helpful in assessing the properties of estimators and the \texttt{simSonic} function provides a means for simulating data; it is described in Section \ref{sec:simulating-data} \end{abstract} \section{Data Structures} \label{sec:data-structure} The experimental setup for retrieving retroviral insertion sites from data via sonication was laid out by Gillet et al \cite{pmid21228324}. The data from a collection of integration sites obtained by sonication can be represented by an R \texttt{data.frame} with columns for the \emph{Chromosome}, the \emph{Position} on the Chromosome, and the \emph{Strand} at which the integration site was found and a column for the \emph{Length} of the fragment recovered. The \texttt{data.frame} would have one row for each unique combination of the above variables observed. The following code block generates a \texttt{data.frame} that simulates such data, and displays the first and last 6 rows of it. The first 100 locations have one length each; the next 100 locations have 2,\dots, 100 length sampled, but duplicates of some of these were removed. The function \texttt{rfrag} generates samples from a geometric distribution with probability $0.02$ that has been stochastically truncated according to a logistic law with location 45 and scale 2.5 --- roughly corresponding to what is seen in actual data. @ <>= require(sonicLength) set.seed(123) chr <- sample(c(1:23,"X","Y"), 200, repl=TRUE ) pos <- sample(100000L,200) strand <- sample(c("+","-"),200,repl=TRUE) lens <- lapply(1:200, rfrag ) nlens <- sapply( lens, length ) loc.dframe <- data.frame( Chromosome=chr, Position=pos, Ort=strand ) len.dframe <- unique(cbind( loc.dframe[ rep( 1:200, nlens ) , ], length=unlist(lens) )) rbind( head( len.dframe ), tail( len.dframe ) ) @ %def The following code creates a unique identifier for each location and plots the actual number of lengths simulated versus the number of unique lengths for each location. The graph illustrates the essential difficulty with estimating the abundance of integration sites from the lengths of sonicated fragments: the more abundant sites tend to have multiple fragments of the same length. Since the fragments are subjected to PCR amplification, it is impossible to distinguish duplicates that result from amplification from those that result from unique sonicated fragments. @ <>= id <- with( len.dframe , paste(Chromosome, Position, Ort ) ) id.counts <- table(factor(id,unique(id))) plot( as.vector(nlens), as.vector(id.counts), xlab='number of sonicants',ylab='distinct lengths', xlim=c(1,100), ylim=c(1,100)) abline(a=0,b=1,col='gray') @ %def \section{Using \texttt{estAbund}} \label{sec:using-textttestabund} The function \texttt{estAbund} uses a maximum likelihood approach to finding the number of sonicants that underlie the number of distinct lengths seen. The following code shows how to use it. The call to \texttt{str(fit)} shows the structure of the resulting object, which has the estimated values (as \texttt{theta}), the estimated probability of recovering a sonicant of a given length (as \texttt{phi}), the number of iterations of the algorithm before convergence is achieved (as \texttt{iter}), the observed number of distinct lengths for each integration site (as \texttt{obs}), the call that generated the object, and optionally the estimated variance of \texttt{theta} (as \texttt{var.theta}). The figure shows how the estimates compare to the actual values that generated the data. @ <>= options( str = strOptions(strict.width="wrap") ) @ %def @ <>= fit <- estAbund(id, len.dframe$length) str( fit ) plot(nlens, fit$theta[unique(id)], xlab="actual sonicant count", ylab="estimated sonicant count" ) abline(a=0,b=1,col='gray') @ %def \section{Replicated Data} \label{sec:replicated-data} When there are replicates available, likelihood methods provide a means for combining the data from all of them by maximizing the likelihood using data from all the replicates simultaneously. Here more data is simulated for the same set of locations as were used above. An additional column called \texttt{repl} is added to keep track of the separate replicates. Also, the simulations use different sonication methods resulting in a different distribution of lengths for each of the three replicates. There are some differences in the object returned by \texttt{estAbund}; there is now a \texttt{data.frame} in the component called \texttt{lframe}, and it keeps track of the lengths (in column \texttt{x}) and replicates ( in column \texttt{strata}). Also, the \texttt{obs} component is now a matrix with one row for each integration site loction and one column for the count for each replicate. @ <>= len.dframe$repl <- 1 lens2 <- lapply(1:200,rfrag, rate=0.03 ) nlens2 <- sapply( lens2, length ) len.dframe2 <- unique(cbind( loc.dframe[ rep( 1:200, nlens2 ) , ], length=unlist(lens2), repl=2 )) lens3 <- lapply(1:200,rfrag, rate=0.04 ) nlens3 <- sapply( lens, length ) len.dframe3 <- unique(cbind( loc.dframe[ rep( 1:200, nlens3 ) , ], length=unlist(lens3), repl=3 )) len.dframe <- rbind(len.dframe,len.dframe2,len.dframe3) fit2 <- with(len.dframe, estAbund( paste(Chromosome, Position, Ort ), length, repl )) @ %def Here is the structure of the resulting object. @ <>= str( fit2 ) @ %def Here is a plot of the estimated values of \texttt{phi}. @ <>= with( fit2, plot( lframe$x[ lframe$orig ], phi, pch=16, col=lframe$strata[ lframe$orig ], xlab="length", ylab='phi', xlim=c(1,200) )) legend("topright",col=1:3,pch=rep(16,3),legend=paste("replicate",1:3)) @ %def And here the estimated number of sonicants are plotted against the actual number (which now range up to 300, since there are three replicates). @ <>= with( fit2, plot(3*nlens , theta[unique(id)], xlab="actual sonicant count", ylab="estimated sonicant count" ) ) abline(a=0,b=1,col='gray') @ %def \section{Jackknife Corrections} \label{sec:jackkn-corr} When replicated data are available, one can use the jackknife method to correct the bias of any estimator or to estimate its standard error. See Miller \cite{miller1974jackknife} for a review. In this section we add the \texttt{jackknife=TRUE} argument and the \texttt{theta.var=TRUE} argument to the call for \texttt{estAbund}. The former argument is one of the named arguments for \texttt{estAbund}, but the latter is not. The latter argument is one of the named arguments to the fitting function \texttt{maxEM} and is passed to it with the effect that estimates of the variances of the abundance estimates and of the proportional abundance estimates are calculated. After the fitting is completed, some of the structure of the resulting object is revealing by listing the names of the components of the list returned. @ <>= fit2 <- with(len.dframe, estAbund( paste(Chromosome, Position, Ort ), length, repl, jackknife=TRUE, theta.var=TRUE)) head.names <- function(x) head( names(x) ) sapply( fit2, head.names ) @ %def And here some of the structure of the jackknife component is shown. Basically, that component has the same length as the number of replicates\dots @ <>= length(fit2$jackknife) @ %def \dots and each element has the same element names as the parent object (except for \texttt{jackknife}. @ <>= str(fit2$jackknife[[1]]) @ %def And here is a demonstration of using the jackknife to correct bias in the estimate of the proportion represented by the most abundant site. @ <>= nreps <- length( fit2$jackknife ) argmax.theta <- which.max( fit2$theta ) ## function to extract the estimate maxpr <- function(x) prop.table( x$theta )[ argmax.theta ] ## apply to full sample maxpr.total <- maxpr( fit2 ) ## make pseudo observations pseudo.maxpr <- nreps * maxpr.total - (nreps-1) * sapply( fit2$jackknife, maxpr ) ## report results likeStdErr <- sqrt( fit2$var.theta$prop[ argmax.theta ] ) jackStdErr = sd ( pseudo.maxpr ) / sqrt(nreps) all.res <- c( uncorrected = unname(maxpr.total), corrected = mean(pseudo.maxpr), likeStdErr = likeStdErr, jackStdErr = jackStdErr) cbind(all.res) @ %def The bias correction has scant effect here. As it turns out, the estimator is nearly unbiased, so this is expected. The jackknife standard error is within error distance of the standard error based on asymptotic theory given that there are just \Sexpr{nreps} replicates. \section{Simulating Data} \label{sec:simulating-data} A convenience wrapper called \texttt{simSonic} is given for simulating data. Here it is used to simulate more data according to the estimated values of \texttt{fit2[['theta']]} and \texttt{fit2[['phi']]}, and the resulting estimate of \texttt{theta} is plotted against its parent. @ <>= more.data <- simSonic(fit2$theta,fit2$phi) fit3 <- do.call(estAbund, more.data ) plot( fit2$theta[names(fit3$theta)], fit3$theta ) abline(a=0,b=1) str(fit3) @ %def \begin{thebibliography}{9} \bibitem{pmid21228324} N.~A. Gillet, N.~Malani, A.~Melamed, N.~Gormley, R.~Carter, D.~Bentley, C.~Berry, F.~D. Bushman, G.~P. Taylor, and C.~R. Bangham. \newblock {{T}he host genomic environment of the provirus determines the abundance of {H}{T}{L}{V}-1-infected {T}-cell clones}. \newblock {\em Blood}, 117:3113--3122, Mar 2011. \bibitem{miller1974jackknife} R.~G. Miller. \newblock {{T}he jackknife-a review}. \newblock {\em {B}iometrika}, 61(1):1--15, 1974. \end{thebibliography} \end{document}