%\VignetteIndexEntry{An R package for Nucleosome positioning prediction}
%\VignetteKeywords{Nucleosome}
\documentclass[a4paper]{article}

\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}

\usepackage{ccaption}
\usepackage{natbib}

\setlength{\textwidth}{6.2in}
\setlength{\textheight}{8.5in}
\setlength{\parskip}{1.5ex plus0.5ex minus 0.5ex}
\setlength{\oddsidemargin}{0.1cm}
\setlength{\evensidemargin}{0.1cm}
\setlength{\headheight}{0.3cm}
\setlength{\arraycolsep}{0.1cm}
\renewcommand{\baselinestretch}{1}

\begin{document}

\title{An introduction to the NuPoP R package}
\author{Ji-Ping Wang\thanks{jzwang@northwestern.edu}, \ Liqun Xi\\ Department of Statistics, Northwestern University}
\maketitle

\section{About NuPoP}

\Rpackage{NuPoP} is an R package for \textbf{Nu}cleosome \textbf{Po}sitioning \textbf{P}rediction. \Rpackage{NuPoP} is built upon a duration hidden Markov model, in which the linker DNA length is explicitly modeled. The nucleosome or linker DNA state model can be chosen as either a fourth-order or a first-order Markov chain. \Rpackage{NuPoP} outputs the Viterbi prediction of the optimal nucleosome position map, the nucleosome occupancy score (from the forward and backward algorithms), and the nucleosome affinity score \citep{stat:XiWang2010,stat:WangWidom2008}. In addition to the R package, we also developed a web server prediction engine and a stand-alone Fortran program, available at http://nucleosome.stats.northwestern.edu. The \Rpackage{NuPoP} R package and the Fortran program can predict nucleosome positioning for a DNA sequence of any length.

For comments, please contact: jzwang@northwestern.edu. For technical details of \Rpackage{NuPoP} and the methods, please refer to \cite{stat:XiWang2010} and \cite{stat:WangWidom2008}.

\section{NuPoP functions}

\Rpackage{NuPoP} does not depend on any other R packages.
It has three major functions: \verb@predNuPoP@, \verb@readNuPoP@, and \verb@plotNuPoP@. The \verb@predNuPoP@ function predicts nucleosome positioning and nucleosome occupancy, the \verb@readNuPoP@ function reads in the prediction results, and the \verb@plotNuPoP@ function visualizes the predictions.

<<>>=
library(NuPoP)
@

The \verb@predNuPoP@ function calls a Fortran subroutine to process the DNA sequence and make predictions, and writes the predictions to a text file in the current working directory. The method is based on a duration hidden Markov model consisting of two states, one for the nucleosome and the other for the linker. For each state, both a first-order and a fourth-order Markov chain model are built in. For example, a sample DNA sequence, \Robject{test.seq}, is located in the \Robject{inst/extdata} subdirectory of the package. Call the \verb@predNuPoP@ function as follows:

<<>>=
predNuPoP(system.file("extdata", "test.seq", package="NuPoP"),species=7,model=4)
@

Note that the argument \Rfunarg{file} must give the complete path and file name of a DNA sequence file in FASTA format; the file may be located in any directory. For example, if the file \texttt{test.seq} is located in \texttt{/Users/jon/DNA}, the function can be called by \verb@predNuPoP(file="/Users/jon/DNA/test.seq",species=7,model=4)@. To avoid error messages, do not specify the path with a tilde, as in \verb@file="~/DNA/test.seq"@.

The argument \Rfunarg{species} can be specified as follows: 1 = Human; 2 = Mouse; 3 = Rat; 4 = Zebrafish; 5 = \textit{D.~melanogaster}; 6 = \textit{C.~elegans}; 7 = \textit{S.~cerevisiae}; 8 = \textit{C.~albicans}; 9 = \textit{S.~pombe}; 10 = \textit{A.~thaliana}; 11 = Maize; 0 = other. If \Rfunarg{species = 0} is specified, the algorithm identifies the species from 1--11 whose base composition is most similar to that of the input sequence, and uses the models of the selected species for prediction. The default value is 7.
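
The path restriction and the automatic species selection described above can be combined in a single call. The following is only a minimal sketch, assuming a hypothetical FASTA file at \texttt{\~{}/DNA/test.seq}; the variable name \Robject{seqFile} is ours and not part of the package. The base R function \verb@normalizePath@ expands the tilde into an absolute path before the path is passed to \verb@predNuPoP@. The chunk is marked \texttt{eval=FALSE} so that it is not run when the vignette is built.

<<eval=FALSE>>=
## Hypothetical file location (an assumption); replace with your own FASTA file.
## normalizePath() performs tilde expansion and, with mustWork=TRUE,
## stops with an error if the file does not exist.
seqFile <- normalizePath("~/DNA/test.seq", mustWork = TRUE)

## species = 0 asks predNuPoP to pick the species (1--11) whose base
## composition is most similar to that of the input sequence.
predNuPoP(file = seqFile, species = 0, model = 4)
@
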
The argument \Rfunarg{model} can be either 1 or 4, standing for the order of the Markov chain models used for the nucleosome and linker states.

The output file, generated in the current working directory, is named after the sequence file, with the suffix \texttt{\_Prediction1.txt} or \texttt{\_Prediction4.txt} appended. For the code above, the output file is \texttt{test.seq\_Prediction4.txt}. The five columns in the output file are
\begin{enumerate}
\item{\Robject{Position}: position in the input DNA sequence.}
\item{\Robject{P-start}: probability that the current position is the start of a nucleosome.}
\item{\Robject{Occup}: nucleosome occupancy score, defined as the probability that the given position is covered by a nucleosome.}
\item{\Robject{N/L}: 1 indicates that the given position is covered by a nucleosome and 0 indicates linker, based on the Viterbi prediction.}
\item{\Robject{Affinity}: nucleosome binding affinity score. This affinity score is defined for every 147 bp of DNA sequence centered at the given position. Therefore, for the first and last 73 bp of the DNA sequence, the affinity score is not defined (indicated as NA).}
\end{enumerate}

The output file can be imported into R with the \verb@readNuPoP@ function:

<<>>=
results=readNuPoP("test.seq_Prediction4.txt",startPos=1,endPos=5000)
results[1:5,]
@

A genomic sequence can be extremely long. The user can import only part of the predictions by specifying the start position (\Rfunarg{startPos}) and the end position (\Rfunarg{endPos}) in the \verb@readNuPoP@ function. For example, to visualize the prediction results from \Rfunarg{startPos=1} to \Rfunarg{endPos=5000}:

<<fig=TRUE>>=
plotNuPoP(results)
@

\bibliographystyle{apalike}
%\bibliography{nucleosomeNew.bib}
\begin{thebibliography}{}

\bibitem[Wang et~al., 2008]{stat:WangWidom2008}
Wang, J.-P., Fondufe-Mittendorf, Y., Xi, L., Tsai, G.-F., Segal, E., and Widom, J. (2008).
\newblock Preferentially quantized linker {DNA} lengths in \textit{Saccharomyces cerevisiae}.
\newblock {\em PLoS Computational Biology}, 4(9):e1000175.

\bibitem[Xi et~al., 2010]{stat:XiWang2010}
Xi, L., Fondufe-Mittendorf, Y., Xia, L., Flatow, J., Widom, J., and Wang, J.-P. (2010).
\newblock Predicting nucleosome positioning using a duration hidden {M}arkov model.
\newblock {\em BMC Bioinformatics}, 11:346. doi:10.1186/1471-2105-11-346.

\end{thebibliography}

\end{document}