%\VignetteIndexEntry{polySegratioMM} %\VignetteDepends{polySegratioMM} %\VignetteKeywords{polyploids,segregation ratio} %\VignettePackage{polySegratioMM} \documentclass[a4paper]{article} %\VignetteIndexEntry{polySegratioMM overview} %\VignettePackage{polySegratioMM} \usepackage{fancyhdr} \pagestyle{fancy} %%%\usepackage{palatino} \usepackage[T1]{fontenc} \usepackage[sc]{mathpazo} \linespread{1.05} % Palatino needs more leading (space between lines) \usepackage[round]{natbib} \usepackage{graphicx} \usepackage{url} \renewcommand{\sectionmark}[1]{\markright{\thesection.\#1}} \fancyhead{} \lhead{polySegratioMM} \rhead{\date{\today}} \cfoot{Mixture models for marker dosage in autopolyploids} \lfoot{} \rfoot{\thepage} \renewcommand{\headrulewidth}{0.4pt} \renewcommand{\footrulewidth}{0.4pt} \title{polySegratioMM: An R library for Bayesian mixture models for marker dosage in autopolyploids} \author{Peter Baker} \begin{document} \maketitle \SweaveOpts{prefix.string=tmp/tmp} It is well known that the dosage level of markers in autopolyploids and allopolyploids can be characterised by their observed segregation ratios. The related package \texttt{polySegratio} provides functions to allocate dosage by standard approaches and to simulate marker data sets for differing ploidies and levels of overdispersion. Note that these methods could equally well be applied to allopolyploids with specified expected segregation ratios. For details see \texttt{polySegratio}. A Bayesian approach was proposed by \cite{baker2010} where marker dosage estimation was obtained by fitting a finite mixture distribution. This library calls the \texttt{JAGS} software for Bayesian calculation. \texttt{JAGS 1.0} or higher must be installed following instructions from \url{http://mcmc-jags.sourceforge.net/}. Note that only the most recent version is used for testing with \texttt{R}. The \texttt{JAGS} executable must be in your path. Currently, no checking is carried out to ascertain whether or not \texttt{JAGS} is set up appropriately. To use the library, you need to attach it with <<>>= library(polySegratioMM) @ <>= ##library(cacheSweave) # comment this out after development phase ## to use: Sweave("polySegratioMM-overview.Rnw", driver=cacheSweaveDriver) ## or without cache: Sweave("polySegratioMM-overview.Rnw") op <- options() options(width=70, digits=4) @ \section{Simulated data} \label{sec:sim-data} Library functions are demonstrated on a simulated data set generated using the \texttt{sim.autoMarkers} function from the \texttt{polySegratio} package. The following \texttt{R} code can be used to generate 500 markers for 200 autohexaploid individuals exhibiting overdispersion with the parameter \texttt{shape1 = 25}. The underlying percentages of single double and triple dose markers are 70\%, 20\% and 10\%, respectively. \begin{Scode} hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200) \end{Scode} <>= data(hexmarkers) @ <<>>= ##<>= ## simulate small autohexaploid data set of 500 markers for 200 individuals ## with %70 Single, %20 Double and %10 Triple Dose markers ## created with ## hexmarkers <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200) ## save(hexmarkers, file="../../data/hexmarkers.RData") print(hexmarkers) @ Note that the segregation ratios for simulated or real data may be extracted by using \texttt{segregationRatios} which sets up the appropriate objects for testing marker dosage and plotting or summarising the marker data. <<>>= sr <- segregationRatios(hexmarkers$markers) @ \begin{figure} [htb] \begin{center} <>= print(plotTheoretical(ploidy.level=6, seg.ratios=sr, main="", expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200)) @ \caption{Segregation ratios of 500 simulated markers from 200 autohexaploid individuals. Percentages of single double and triple dose markers are 70\%, 20\% and 10\%, respectively. Data were generated assuming no overdispersion.} \label{fig:sim1} \end{center} \end{figure} For instance, as seen in Figure~\ref{fig:sim1}, segregation ratios may be plotted with %%##<>= \begin{Scode} plotTheoretical(ploidy.level=6, seg.ratios=sr, expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200) \end{Scode} %%@ On the other hand, consider a similar data set that exhibits overdispersion. This may be simulated as follows <>= ##<>= ## simulate small autohexaploid data set of 500 markers for 200 individuals ## with %70 Single, %20 Double and %10 Triple Dose markers ## hexmarkers.overdisp <- sim.autoMarkers(6,c(0.7,0.2,0.1),overdispersion=TRUE, shape1=30,n.markers=500,n.individuals=200) ## save(hexmarkers.overdisp, file="../../data/hexmarkers.overdisp.RData") data(hexmarkers.overdisp) ##print(hexmarkers.overdisp) @ \begin{Scode} hexmarkers.overdisp <- sim.autoMarkers(6,c(0.7,0.2,0.1),n.markers=500,n.individuals=200, overdispersion=TRUE, shape1=30) \end{Scode} <<>>= sr.overdisp <- segregationRatios(hexmarkers.overdisp$markers) @ \begin{figure} [htb] \begin{center} <>= print(plotTheoretical(ploidy.level=6, seg.ratios=sr.overdisp, main="", expected.segratio=NULL, proportions=c(0.7,0.2,0.1), n.individuals=200)) @ \caption{Segregation ratios of 500 simulated markers from 200 autohexaploid individuals. Percentages of single double and triple dose markers are 70\%, 20\% and 10\%, respectively. Data were generated from the Beta--Binomial distribution assuming a shape parameter \texttt{shape1} of 30.} \label{fig:sim2} \end{center} \end{figure} The histogram of marker segregation ratios, which is a useful graphical method for identifying overdispersion or outliers, is seen in Figure~\ref{fig:sim2}. Note that, due to overdispersion the theoretical distribution is narrower than the observed data. \section{A Bayesian mixture model approach} \label{sec:mix-model} For the $j^{th}$ marker $j=1 \ldots n$, we assume the observed number $r_j$ of dominant markers out of $N_j$ lines follows a binomial distribution denoted $Bin(N_j, Pk)$. If we knew the dosage $k$ then, following \cite{ripol99}, the expected value of $P_k$ may be written as \begin{equation} \label{eq:ripol1} P_k(k| m, x) = 1 - {{m-k \choose mx} \over {m \choose mx}} , k=0 \ldots m/2 \end{equation} where $m$ is the ploidy level or number of homologous chromosomes and the monoploid number $x$ is the number of chromosomes in a basic set. Note that for diploids $m=2$, tetraploids $m=4$ , octaploids then $m=8$ and so on and also that if there are no marker data missing then $N_j$ is simply the number of progeny. Since the dosage of each marker is unknown, we rely on the missing data representation of \cite{dempster77} and \cite{tannerandwong87} which is commonly adopted for MCMC computation in finite mixture models. An indicator variable $z_j$ corresponding to unknown marker dosage class $k$ is introduced where $z_j = k$ if the marker has dose $k$. For the $K$ components with $K \leq m/2$, consider the logit transformation of the true segregation proportions $P_k$ for dose $k, k=1 \dots K$. The the logit transformed segregation ratio $ \omega_k$ is then \begin{equation} \label{eq:logit} \omega_k = \log({P_k \over 1-P_k}). \end{equation} Let $z=(z_1 \ldots z_n)^T$ be a vector of unknown dosages (labelled $1,2 \dots K$ corresponding to simplex, duplex, triplex markers and so on), then $r_j$ is binomially distributed with known size parameter $N_j$ and unknown proportion parameter $\omega_{Z_j}$ which is the segregation ratio for marker dosage $z_j$. Hence, given marker dosage $z_j$ then \begin{eqnarray} \label{eq-bin-mod1} r_j|z_j & \sim & Bin \left( N_j, \omega_{Z_j} \right), \\ \mbox{ where } && \nonumber\\ \mbox{logit}( \omega_{Z_j} ) & = & \log ( { \omega_{Z_j} \over 1 - \omega_{Z_ j} } ) \sim N( \mu_{Z_j}, \tau_{Z_j}^{-1}) \nonumber \end{eqnarray} where $\mu_k$ and $\tau_k$ are the mean and precision $(\tau_k = 1/\sigma^2_k)$ of marker dosage class $k$ on the logit scale. Since the dosage is unknown, for the autohexaploid data generated here then for the logit($\omega_{z_k}$) can be modelled as a finite mixture of 3 normals \begin{equation} \label{eq:3normals} \mbox{logit}( \omega_{Z_j} ) \sim \pi_1 N(\mu_1,\tau_1^{-1}) + \pi_2 N(\mu_2,\tau_2^{-1}) + \dots + \pi_K N(\mu_K,\tau_K^{-1}) \end{equation} where $\mu_k$ is the mean and $\tau_k$ is the precision of component $k$ on the logit scale, and $\pi_k$ are the mixing proportions of the three components with $\sum_{k=1}^K\pi_k = 1$. The probability density function $f(x)$ of logit($\omega_k$) is \begin{equation} \label{eq:mix1} f(x) = \sum_{k=1}^K \pi_k \phi ( x | \mu_k, \tau_k^{-1}) \end{equation} where $\phi$ is the normal cumulative distribution function with parameters mean $\mu_k$ and variance $\sigma^2_k =\tau_k^{-1}$. Simulation studies suggested that incorporating strong prior information, such as the expected distributions of \cite{haldane30} provided the best method of allocating dosage. Further details may be found in \cite{baker2010}. \section{Specifying a model} \label{sec:model-spec} A mixture model may be set up with \texttt{setModel}. By default, only two parameters are required, namely the \texttt{ploidy.level} or the number of homologous chromosomes set either as a numeric or as a character string and also \texttt{n.components} or the number of components for mixture model (less than or equal to maximum number of possible dosages). By default, strong priors are set by using the formulae of \cite{haldane30} for the expected numbers and ratios of offspring for various parental configurations of autopolyploids. For the autohexaploid data generated above, the models are set with <<>>= x.mod1 <- setModel(3,6) # autohexaploid model with 3 components @ The \texttt{R} object \texttt{x.mod1} contains components describing aspects of the model such as the number of components, ploidy, expected segregation ratios and so on. Note that the \texttt{str} command is useful for displaying the internal structure of any \texttt{R} object. \section{Fitting a mixture model} \label{sec:model-fit} While various options are available for fine tuning the MCMC process, the simplest way to fit a mixture model to allocate marker dosages is with the wrapper function \texttt{runSegratioMM} as follows: %%##<>= \begin{Scode} mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1) \end{Scode} %%@ which automatically determines starting values, priors, length of burn in, number of iterations, and other parameters as well as producing summary statistics and diagnostic plots. <>= ## produced using the following but loaded as data to avoid the run time on slow machines ##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, burn.in=200, sample=500, plots=FALSE) ##mcmcHexRun <- runSegratioMM(sr.overdisp, x.mod1, plots=FALSE) ## save(mcmcHexRun, file="../../data/mcmcHexRun.RData") data(mcmcHexRun) @ To run \texttt{JAGS} without producing plots then set the \texttt{plots} option to \texttt{FALSE}. For the overdispersed data running this command produced the following selected output. While selected output is printed here the simple command \texttt{print(mcmcHexRun)} whould produce the following output and more. The summary of processing times: <<>>= print(mcmcHexRun$run.jags) @ And summary statistics for the posterior distributions of selected parameters: <<>>= print(mcmcHexRun$summary) @ Note that MCMC convergence diagnostic output is produced automatically. Assessing convergence is crucial in MCMC and poor convergence may result in mis--allocated marker dosages. The diagnostic statistics indicate that convergence was achieved. <<>>= print(mcmcHexRun$diagnostics) @ And finally, summaries of marker dosage allocations are produced: <<>>= print(mcmcHexRun$doses) @ Note that simply plotting \texttt{mcmcHexRun} will produce a histogram of segregation proportions and the fitted model but that other plots are easily produced. %% NB: new library default - 'sigma' rather than 'sigma[1]' \begin{figure} [htb] \begin{center} <>= print(plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")])) @ \caption{Trace and posterior density plots for the parameters parameters $p_1$, $\mu_1$, $\sigma_1$ and for the 140$^{th}$ marker for the overdispersed data.} \label{fig:trace1} \end{center} \end{figure} When the \texttt{plots} option of \texttt{runSegratioMM} is set to the default value of \texttt{TRUE}, numerous plots are produced including trace and density plots from the \texttt{CODA} package. These may also be extracted manually but the process is somewhat more complicated. For instance to obtain trace and density plots for the parameters $p_1$, $\mu_1$, $\sigma_1$ and for the 140$^{th}$ marker, as shown in Figure~\ref{fig:trace1}, then \texttt{CODA} may be used directly by following command. %%##<>= \begin{Scode} plot(mcmcHexRun$mcmc.mixture$mcmc.list[[1]][,c("P[1]","mu[1]","sigma","T[140]")]) \end{Scode} %%@ \begin{figure} [htb] \begin{center} <>= print(plot(mcmcHexRun, theoretical=TRUE, main="")) @ \caption{Fitted (blue) and theoretical (red) distributions for simulated segregation ratios with overdispersion for 500 markers from 200 individuals.} \label{fig:fitted1} \end{center} \end{figure} The histogram of segregation proportions with fitted and theoretical values shown in Figure~\ref{fig:fitted1} may be obtained by setting the \texttt{theoretical} option to \texttt{TRUE} as follows. %%##<>= \begin{Scode} print(plot(mcmcHexRun, theoretical=TRUE)) \end{Scode} %%@ \section{Assigning marker dosage} \label{sec:marker-dosage} Marker dosages allocations may be obtained directly from the object \texttt{mcmcHexRun}. The dosage with maximum posterior probability is simply \texttt{mcmcHexRun\$doses\$max.post.dosage}. A more conservative allocation is obtained by using \texttt{mcmcHexRun\$doses\$dosage[,"0.8"]} whereby the dosage with posterior probability over 0.8 is employed. For instance, to tabulate the number of markers (including those not allocated a dosage which are labelled NA) the \texttt{table} command can be employed. <<>>= cat("Employing maximum posterior probability\n") table(Dose=mcmcHexRun$doses$max.post.dosage, exclude=NULL) cat("Employing posterior probability > 0.8\n") table(Dose=mcmcHexRun$doses$dosage[,"0.8"], exclude=NULL) @ And of course since the data were simulated we can compare the estimated and true dosages obtained as \texttt{hexmarkers.overdisp\$true.doses\$dosage} via cross tabulation. Doses can also be obtained for the standard $\chi^2$ test by using the \texttt{test.segRatio} command from the \texttt{polySegratio} library. <<>>= cat("Employing theChi squared test\n") dose.chi <- test.segRatio(sr.overdisp, ploidy.level = 6) table(Chi2Dose=dose.chi$dosage, True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) cat("Employing maximum posterior probability\n") table(MixtureDose=mcmcHexRun$doses$max.post.dosage, True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) cat("Employing posterior probability > 0.8\n") table(MixtureDose=mcmcHexRun$doses$dosage[,"0.8"], True=hexmarkers.overdisp$true.doses$dosage, exclude=NULL) @ These tables show that far fewer markers are allocated a dosage using the standard $\chi^2$ test than by the mixture model. Fewer markers were misclassified using a posterior probability threshold of 0.8 rather than the maximum posterior probability as a basis for allocating dosage. \bibliographystyle{apalike} %%\bibliography{polyploid,qtl,Mybooks2-stat,thesis-eqtl,genomic,thesis}%,my_mcmc} % just include bibfile here \bibliography{polySegratioMM-overview} \subsection{Acknowledgments} \label{sec:acknowledgments} Karen Aitken, given her experience in tetraploids and sugarcane marker maps, has provided many valuable insights into marker dosage in autopolyploids. Additionally, Ross Darnell, Andrew George and Kerrie Mengersen provided useful comments and discussions. \end{document}