\documentclass[a4paper]{article} %\VignetteIndexEntry{Plotting Haplotype Networks} %\VignettePackage{pegas} \usepackage{ape} \newcommand{\loci}{\code{"loci"}} \newcommand{\NA}{\code{NA}} \newcommand{\pegas}{\pkg{pegas}} \author{Emmanuel Paradis} \title{Plotting Haplotype Networks with \pegas} \begin{document} \DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom=\color{darkblue}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom=\color{black}\vspace{-1.5em}} \maketitle \tableofcontents\vspace*{1pc}\hrule \vspace{1cm} \setkeys{Gin}{width=\textwidth} <>= options(width = 80, prompt = "> ") @ \vspace{1cm} \section{Introduction} Haplotype networks are powerful tools to explore the relationships among individuals characterised with genotypic or sequence data \cite{Huson2006, Paradis2018b}. \pegas\ has had tools to infer and plot haplotype networks since its first version (0.1, released in May 2009). These tools have improved over the years and are appreciated in the community working on population genetics and genomics (see John Bhorne's blog\footnote{https://johnbhorne.wordpress.com/2016/09/15/still-making-haplotype-networks-the-old-way-how-to-do-it-in-r/}). This document covers some aspects of drawing haplotype networks with \pegas\ with an emphasis on recent improvements. Not all details and options are covered here: see the respective help pages (\code{?plot.haploNet} and \code{?mutations}) for full details. The function \code{plotNetMDS}, which offers an alternative approach to plotting networks, is not considered in this document. \section{The Function \code{plot.haploNet}} The current version of \pegas\ includes five methods to reconstruct haplotype networks as listed in the table below. \newcommand\ditto{\textsl{\code{"}}} \begin{center} \begin{tabular}{llclc} \toprule Method & Acronym & Input data & Function & Ref.\\ \midrule Parsimony network & TCS & distances & \code{haploNet} & \cite{Templeton1992}\\ Minimum spanning tree & MST & \ditto & \code{mst} & \cite{Kruskal1956}\\ Minimum spanning network & MSN & \ditto & \code{msn} & \cite{Bandelt1999}\\ Randomized minimum spanning tree & RMST & \ditto & \code{rmst} & \cite{Paradis2018b}\\ Median-joining network & MJN & sequences & \code{mjn} & \cite{Bandelt1999}\\ \bottomrule \end{tabular} \end{center} All these functions output an object of class \code{"haploNet"} so that they are plotted with the same \code{plot} method (\code{plot.haploNet}). \subsection{Node Layout} The coordinates of the nodes (or vertices) representing the haplotypes are computed in two steps: first, an equal-angle algorithm borrowed from Felsenstein \cite{Felsenstein2004} is used; second, the spacing between nodes is optimised. The second step is ignored if the option \code{fast = TRUE} is used when calling \code{plot}. These two steps are detailed a bit in the next paragraphs. In the first step, the haplotype with the largest number of links is placed at the centre of the plot (i.e., its coordinates are $x=y=0$), then the haplotypes connected to this first haplotype are arranged around it and given equal angles. This is then applied recursively until all haplotypes are plotted. To perform this layout, an initial `backbone' network based on an MST is used, so there is no reticulation and the equal-angle algorithm makes sure that there is no segment-crossing. In practice, it is likely that this backbone MST is arbitrary with respect to the rest of the network. The other segments are then drawn on this MST. In the second step, a `global energy' is calculated based on the spaces between the nodes of the network (closer nodes imply higher energies). The nodes are then moved repeatedly, while keeping the initial structure of the backbone MST, until the global energy is not improved (decreased). We illustrate the procedure with the \code{woodmouse} data, a set of sequences of cytochrome \textit{b} from 15 woodmice (\textit{Apodemus sylvaticus}): <<>>= library(pegas) # loads also ape data(woodmouse) @ \noindent In order, to simulate some population genetic data, we sample, with replacement, 80 sequences, and create two hierarchical groupings: \code{region} with two levels each containing 40 haplotypes, and \code{pop} with four levels each containing 20 haplotypes: <<>>= set.seed(10) x <- woodmouse[sample.int(nrow(woodmouse), 80, TRUE), ] region <- rep(c("regA", "regB"), each = 40) pop <- rep(paste0("pop", 1:4), each = 20) table(region, pop) @ \noindent We extract the haplotypes which are used to reconstruct the RMST after computing the pairwise Hamming distances: <<>>= h <- haplotype(x) h d <- dist.dna(h, "N") nt <- rmst(d, quiet = TRUE) nt @ \noindent We now plot the network with the default arguments: <>= plot(nt) @ \noindent We compare the layout obtained with \code{fast = TRUE}: <>= plot(nt, fast = TRUE) @ \noindent By default, not all links are drawn. This is controlled with the option \code{threshold} which takes two values in order to set the lower and upper bounds of the number of mutations for a link to be drawn: <>= plot(nt, threshold = c(1, 14)) @ \noindent The visual aspect of the links is arbitrary: the links of the backbone MST are shown with continuous segments, while ``alternative'' links are shown with dashed segments. \subsection{Options} \code{plot.haploNet} has a relatively large number of options: <<>>= args(pegas:::plot.haploNet) @ \noindent Like for most \code{plot} methods, the first argument (\code{x}) is the object to be plotted. Until \pegas\ 0.14, all other arguments were defined with default values. In recent versions, as shown above, only \code{size} and \code{shape} are defined with default values; the other options, if not modified in the call to \code{plot}, are taken from a set of parameters which can be modified as explained in Section~\ref{sec:getsetopts}. The motivation for this new function definition is that in most cases users need to modify \code{size} and \code{shape} with their own data, such as haplotype frequencies or else, and these might be changed repeatedly (e.g., with different data sets or subsets). On the other hand, the other options are more likely to be used to modify the visual aspect of the graph, so it could be more useful to change them once during a session as explained later in this document. The size of the haplotype symbols can be used to display haplotype frequencies. The function \code{summary} can extract these frequencies from the \code{"haplotype"} object: <<>>= (sz <- summary(h)) @ \noindent It is likely that these values are not ordered in the same way than haplotypes are ordered in the network: <<>>= (nt.labs <- attr(nt, "labels")) @ \noindent It is simple to reorder the frequencies before using them into \code{plot}: <>= sz <- sz[nt.labs] plot(nt, size = sz) @ A similar mechanism can be used to show variables such as \code{region} or \code{pop}. The function \code{haploFreq} is useful here because it computes the frequencies of haplotypes for each region or population: <<>>= (R <- haploFreq(x, fac = region, haplo = h)) (P <- haploFreq(x, fac = pop, haplo = h)) @ \noindent Like with \code{size}, we have to reorder these matrices so that their rows are in the same order than in the network: <<>>= R <- R[nt.labs, ] P <- P[nt.labs, ] @ \noindent We may now plot the network with either information on haplotype frequencies by just changing the argument \code{pie}: <>= plot(nt, size = sz, pie = R, legend = c(-25, 30)) @ <>= plot(nt, size = sz, pie = P, legend = c(-25, 30)) @ The option \code{legend} can be: \begin{itemize} \item \code{FALSE} (the default): no legend is shown; \item \code{TRUE}: the user is asked to click where the legend should be printed; \item a vector of two values with the coordinates where the print the legend (for non-interactive use like in this vignette). \end{itemize} \section{New Features in \pegas\ 1.0} This section details some of the improvements made to haplotype network drawing after \pegas\ 0.14. \subsection{Improved `Replotting'} The graphical display of networks is a notoriously difficult problem, especially when there is an undefined number of links (or edges). The occurrence of reticulations makes line crossings almost inevitable. The packages \pkg{igraph} and \pkg{network} have algorithms to optimise the layouts of nodes and edges when plotting such networks. The function \code{replot} (introduced in \pegas\ 0.7, March 2015) lets the user modify the layout of nodes interactively by clicking on the graphical window where a network has been plotted beforehand. \code{replot}---which cannot be used in this non-interactive vignette---has been improved substantially: \begin{itemize} \item The explanations printed when the function is called are more detailed and the node to be moved is visually identified after clicking. \item The final coordinates, for instance saved with \code{xy <- replot()}, can be used directly into \code{plot(nt, xy = xy)}. This also makes possible to input coordinates calculated with another software. \item In previous versions, the limits of the plot tended to drift when increasing the number of node moves. This has been fixed, and the network is correctly displayed whatever the number of moves done. \end{itemize} \subsection{Haplotype Symbol Shapes} Haplotypes can be represented with three different shapes: circles, squares, or diamonds. The argument \code{shape} of \code{plot.haploNet} is used in the same way than \code{size} as explained above (including the evental need to reorder the values). Some details are given below about how these symbols are scaled. There are two ways to display a quantitative variable using the size of a circle: either with its radius ($r$) or with the area of the disc defined by the circle. This area is $\pi r^2$, so if we want the area of the symbols to be proportional to \code{size}, we should square-root these last values. However, in practice this masks variation if most values in \code{size} are not very different (see below). In \pegas, the diameters of the circles ($2r$) are equal to the values given by \code{size}. If these are very heterogeneous, they could be transformed with \code{size = sqrt(....} keeping in mind that the legend will be relative to this new scale. The next figure shows both ways of scaling the size of the circles: the top one is the scaling used in \pegas. <>= par(xpd = TRUE) size <- c(1, 3, 5, 10) x <- c(0, 5, 10, 20) plot(0, 0, type="n", xlim=c(-2, 30), asp=1, bty="n", ann=FALSE) other.args <- list(y = -5, inches = FALSE, add = TRUE, bg = rgb(1, 1, 0, .3)) o <- mapply(symbols, x = x, circles = sqrt(size / pi), MoreArgs = other.args) other.args$y <- 5 o <- mapply(symbols, x = x, circles = size / 2, MoreArgs = other.args) text(x, -1, paste("size =", size), font = 2, col = "blue") text(30, -5, expression("circles = "*sqrt(size / pi))) text(30, 5, "circles = size / 2") @ For squares and diamonds (\code{shape = "s"} and \code{shape = "d"}, respectively), they are scaled so that their areas are equal to the disc areas for the same values given to \code{size}. The figure below shows these three symbol shapes superposed for several values of this parameter. Note that a diamond is a square rotated 45\textdegree\ around its center. <>= x <- c(0, 6, 13, 25) plot(0, 0, type="n", xlim=c(-2, 30), asp=1, bty="n", ann=FALSE) other.args$y <- 0 o <- mapply(symbols, x = x, circles = size/2, MoreArgs = other.args) other.args$col <- "black" other.args$add <- other.args$inches <- NULL o <- mapply(pegas:::square, x = x, size = size, MoreArgs = other.args) o <- mapply(pegas:::diamond, x = x, size = size, MoreArgs = other.args) text(x, -7, paste("size =", size), font = 2, col = "blue") @ \subsection{The Function \code{mutations}} \code{mutations()} is a low-level plotting function which displays information about the mutations related to a particular link of the network. This function can be used interactively. For instance, the following is copied from an interactive \R\ session: \begin{verbatim} > mutations(nt) Link is missing: select one below 1: VII-I 2: VII-VIII 3: V-XI 4: III-VI 5: IX-X 6: IX-II 7: III-IX 8: VII-III 9: XV-V 10: XIII-IX 11: IX-V 12: IX-XII 13: III-XIV 14: III-IV 15: IX-XI 16: XV-XI 17: IX-IV 18: I-VIII 19: I-III 20: XIV-IV 21: II-XI 22: II-V Enter a link number: 18 Coordinates are missing: click where you want to place the annotations: The coordinates x = -8.880335, y = 16.313 are used \end{verbatim} \noindent The values entered interactively can be written in a script to reproduce the figure: <>= plot(nt) mutations(nt, 18, x = -8.9, y = 16.3, data = h) @ \noindent Like any low-level plotting function, \code{mutations()} can be called as many times as needed to display similar information on other links. The option \code{style} takes the value \code{"table"} (the default) or \code{"sequence"}. In the second, the positions of the mutations are drawn on a horizontal segment representing the sequence: <>= plot(nt) mutations(nt, 18, x = -8.9, y = 16.3, data = h) mutations(nt, 18, x = 10, y = 17, data = h, style = "s") @ \noindent The visual aspect of these annotations is controlled by parameters as explained in the next section. \subsection{Getting and Setting Options}\label{sec:getsetopts} The new version of \pegas\ has two ways to change some of the parameters of the plot: either by changing the appropriate option(s) in one of the above functions, or by setting these values with the function \code{setHaploNetOptions}, in which case all subsequent plots will be affected.\footnote{See \code{?par} for a similar mechanism with basic \R\ graphical functions.} The list of the option values currently in use can be printed with \code{getHaploNetOptions}. There is a relatively large number of options that affect either \code{plot.haploNet()} or \code{mutations()}. Their names are quite explicit so that the user should find which one(s) to modify easily: <<>>= names(getHaploNetOptions()) @ We see here several examples with the command \code{plot(nt, size = 2)} which is repeated after calling \code{setHaploNetOptions}: <>= plot(nt, size = 2) @ <>= setHaploNetOptions(haplotype.inner.color = "#CCCC4D", haplotype.outer.color = "#CCCC4D", show.mutation = 3, labels = FALSE) plot(nt, size = 2) @ <>= setHaploNetOptions(haplotype.inner.color = "blue", haplotype.outer.color = "blue", show.mutation = 1) par(bg = "yellow3") plot(nt, size = 2) @ <>= setHaploNetOptions(haplotype.inner.color = "navy", haplotype.outer.color = "navy") par(bg = "lightblue") plot(nt, size = 2) @ \bibliographystyle{plain} \bibliography{pegas} \end{document}