--- title: "corHMM 2.1: Generalized hidden Markov models" author: "James D. Boyko and Jeremy M. Beaulieu" output: pdf_document: fig_caption: yes vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Generalized corHMM} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(cache=FALSE) knitr::opts_chunk$set(tidy.opts=list(width.cutoff=75),tidy=TRUE) ``` The vignette is comprised of three sections, where we demonstrate all new extensions as well as other new and useful features: * __Background information__ * __Section 1 Default use of corHMM__ * 1.1: No hidden rate categores * 1.2: Any number of hidden rate categories * __Section 2 How to make and interpret custom models__ * 2.1: Creating and using custom rate matrices * 2.1.1: One rate category * 2.1.2: Any number of rate categories * 2.2: Some examples of "biologically informed" models * 2.2.1: Ordered habitat change * 2.2.2: Precursor model * 2.2.3: Ontological relationship of multiple characters * __Section 3 Estimating models when node states are fixed__ * 3.1: Fixing a single node * 3.2: Estimating rates under a parsimony reconstruction * 3.3: Fixing nodes when the model contains hidden states #Background information The original version of `corHMM` contained a number of distinct functions for conducting analyses of discrete morphological characters. This included the `corHMM()` function for fitting a hidden rates model, which uses "hidden" states as a means of allowing transition rates in a binary character to vary across a tree. In reality, the hidden rates model falls within a general class of models, hidden Markov models (HMM), that may also be applied to multistate characters. So, whether the focal trait is binary or contains multiple states, or whether the observed states represents a set of binary and multistate characters, hidden states can be applied as a means of allowing heterogeneity in the transition model. Choosing a model specific to your question is of utmost importance in any comparative method, and in this new version of `corHMM` we provide users with the tools to create their own hidden Markov models. Before delving into this further it may be worth showing a little of what is underneath the hood. To begin, consider a single binary character with states *0* and *1*. If the question was to understand the asymmetry in the transition between these two states, the model, **Q**, would be a simple 2x2 matrix, $$ Q= \begin{bmatrix} - & q_{0 \rightarrow 1} \\ q_{1 \rightarrow 0} & - \\ \end{bmatrix} $$ This *transition rate matrix* is read as describing the transition rate *from* ROW *to* COLUMN. Thus, there are only two states, 0 and 1, and two transitions going from 0 $\rightarrow$ 1, and from 1 $\rightarrow$ 0. However, if we introduce a second binary character, the number of possible states you *could* observe is expanded to account for all the combination of states between two characters -- that is, you could observe *00*, *01*, *10*, or *11*. To accommodate this, we need to expand our model such that it becomes a 4x4 matrix, $$ Q = \begin{bmatrix} - & q_{00 \rightarrow 01} & q_{00 \rightarrow 10} & q_{00 \rightarrow 11}\\ q_{01 \rightarrow 00} & - & q_{01 \rightarrow 10} & q_{01 \rightarrow 11}\\ q_{10 \rightarrow 00} & q_{10 \rightarrow 01} & - & q_{10 \rightarrow 11}\\ q_{11 \rightarrow 00} & q_{11 \rightarrow 01} & q_{11 \rightarrow 10} & -\\ \end{bmatrix} $$ This model is considerably more complex, as the number of transitions in this rate matrix now goes from 2 to 12. However, with these models we often make a simplifying assumption that we do not allow for transitions in two states to occur at the same time. In other words, if a lineage is in state *00* it must first transition to either state *01* or *10*, before transitioning to state *11*. Therefore, we can simplify the matrix by removing these "dual" transitions from the model completely, $$ Q = \begin{bmatrix} - & q_{00 \rightarrow 01} & q_{00 \rightarrow 10} & -\\ q_{01 \rightarrow 00} & - & - & q_{01 \rightarrow 11}\\ q_{10 \rightarrow 00} & - & - & q_{10 \rightarrow 11}\\ - & q_{11 \rightarrow 01} & q_{11 \rightarrow 10} & -\\ \end{bmatrix} $$ What we just described is the popular model of Pagel (1994), which tests for correlated evolution between two binary characters. But, one thing that is not obvious: the states in the model need not be represented as combinations of binary characters. For example, the focal character may be two characters, like say, flowers that are red with and without petals, and blue flowers with and without petals. One could just code it as a single multistate character, where *1*=red petals, *2*=red with no petals (i.e., sepals are red), *3*=blue petals, and *4*=blue with no petals (i.e., sepals are blue). The model would then be, $$ Q = \begin{bmatrix} - & q_{1 \rightarrow 2} & q_{1 \rightarrow 3} & q_{1 \rightarrow 4}\\ q_{2 \rightarrow 1} & - & q_{2 \rightarrow 3} & q_{2 \rightarrow 4}\\ q_{3 \rightarrow 1} & q_{3 \rightarrow 2} & - & q_{3 \rightarrow 4}\\ q_{4 \rightarrow 1} & q_{4 \rightarrow 2} & q_{4 \rightarrow 3} & -\\ \end{bmatrix} $$ Notice it is the same as before, but the states are transformed from binary combinations to a multistate character. As before, we may assume that transitions in two states cannot occur at the same time and remove the "dual" transitions. $$ Q = \begin{bmatrix} - & q_{1 \rightarrow 2} & q_{1 \rightarrow 3} & -\\ q_{2 \rightarrow 1} & - & - & q_{2 \rightarrow 4}\\ q_{3 \rightarrow 1} & - & - & q_{3 \rightarrow 4}\\ - & q_{4 \rightarrow 2} & q_{4 \rightarrow 3} & -\\ \end{bmatrix} $$ Again, exactly the same. The updated version of `corHMM()` now lets users transform a set of characters into a *single* multistate character. This means that two characters need not have the same number of character states -- that is, one trait could have four states, and the other could be binary. We also allow any model to be expanded to accomodate an arbitrary number of hidden states. Thus, `corHMM()` is completely flexible and naturally contains `rayDISC()` and `corDISC()` capabilities - standalone functions in previous versions of `corHMM` that may have been mistaken as different "methods." As this vignette will show, they are indeed nested within a broader class of HMMs. # Section 1: Default use of corHMM ## 1.1: No hidden rate categories To start, we'll use the primate dataset from Pagel and Meade (2006) that comes with `corHMM`: ```{r, message=FALSE, warning=FALSE} set.seed(1985) require(ape) require(expm) require(corHMM) data(primates) phy <- primates[[1]] phy <- multi2di(phy) data <- primates[[2]] ``` ```{r, fig.align="center", fig.width=6, fig.height=6} plot(phy, show.tip.label = FALSE) data.sort <- data.frame(data[, 2], data[, 3], row.names = data[,1]) data.sort <- data.sort[phy$tip.label, ] tiplabels(pch = 16, col = data.sort[,1]+1, cex = 0.5) tiplabels(pch = 16, col = data.sort[,2]+3, cex = 0.5, offset = 0.5) ``` We have two characters each with two possible states: trait 1 is the absence (black) or presence (red) of estrus advertisement in females, and trait 2 is single male (green) or multimale (blue) mating system in primates. The default use of `corHMM()` only requires that you declare your *phylogeny*, your *dataset*, and the number of *rate categories* (more detail about this later). We have updated `corHMM()` to handle different types of input data. Now to use `corHMM()`, the first column must be species names (as in the previous version), but there can be any number of data columns. If your dataset does have 2 or more columns of trait information, each column is taken to describe a separate character. Note that when the `corHMM()` call is used, the function automatically determines all the unique character combinations *observed* in the data set. In our primate example only 3 of the 4 possible combinations are observed, and so the model is constructed accordingly. Also, dual transitions are automatically disallowed. In other words, we expect that a species cannot go directly from estrus advertisement being absent in a single male mating system to having estrus advertisement in a multimale mating system. They must first evolve either estrus advertisement or multimale mating system. Let's give this a try: ```{r, eval=-1} MK_3state <- corHMM(phy = phy, data = data, rate.cat = 1) load("corHMMResults.Rsave") MK_3state ``` When you run your `corHMM` object you are greeted with a summary of the model. Your model fit is described by the log likelihood (lnL), Akaike information criterion (AIC), and sample size corrected Akaike information criterion (AICc). You are also given the number of rate categories (Rate.cat) and number of taxa (ntax). The *Rates* section of the output describes transition rates between states and are organized as a matrix. Again, the *transition rate matrix* is read as the transition rate **from** ROW **to** COLUMN. For example, if you were interested in the transition rate from State 1 (i.e., absence of estrus advertisement in a single male mating system) to State 2 (i.e., absence of estrus advertisement in a multimale mating system) you would be looking at the Row 1, Column 2, entry. For a time calibrated ultrametric tree, these rates will depend on the age of your phylogeny. You may also notice that `corHMM()` printed a state legend to the screen. Thus, you can obtain the exact coding for each species in an augmented dataframe provided by the `corHMM()` results object itself. This dataframe uses the initial user data to create columns that corresponds to how each species was represented in `corHMM()`: ```{r} head(MK_3state$data.legend) ``` Alternatively, a user can supply their dataset to getStateMat4Dat, which outputs a legend that is consistent with the `corHMM()` function. The other output is an index matrix (or rate matrix) that describes which rates are to be estimated in `corHMM()`. We provide an in-depth discussion about this part of the index matrix later: ```{r} getStateMat4Dat(data) ``` Finally, interpreting a Markov matrix can be difficult, especially when you're just starting out. This problem is compounded when users begin to apply the more complex hidden Markov models (i.e. setting rate.cat > 1). To help users, we have implemented a new plotting function: ```{r, fig.align="center", fig.width=6, fig.height=4} plotMKmodel(MK_3state) ``` This function uses a `corHMM` object (which is the result of running `corHMM()`) or a custom rate matrix (discussed in a later section) to plot the model in two parts. On the left is a ball and stick diagram that depicts the state transitions. On the right is a simplified rate matrix that contains rounded values from the solution output of `corHMM()`. The colors of the arrows correspond to the magnitude of the rates. The final new plotting tool we have made available to users is a stochastic character mapping function, `makeSimmap` (Bollback, 2006). We can use `makeSimmap` to create a character history for any `corHMM` model and then use plotSimmap (from the popular R-package, `phytools`) to plot the output. ```{r} phy = MK_3state$phy data = MK_3state$data model = MK_3state$solution model[is.na(model)] <- 0 diag(model) <- -rowSums(model) # run get simmap (can be plotted using phytools) simmap <- makeSimmap(tree=phy, data=data, model=model, rate.cat=1, nSim=1, nCores=1) # we import phytools plotSimmap for plotting phytools::plotSimmap(simmap[[1]], fsize=.5) ``` #### 1.2: A trait with any number of states and any number of hidden rate categories The major difference between this version of `corHMM` and previous versions is allowing models of any number of states and any number of hidden rate categories (*hidden rate categories will be explained in more depth in section 2*). Running a hidden Markov model (HMM) only requires assigning a value greater than 1 to the rate.cat input. Below, we have assigned 2 rate categories to the data from above: ```{r, eval=-1} HMM_3state <- corHMM(phy = phy, data = data, rate.cat = 2, model = "SYM", get.tip.states = TRUE) HMM_3state ``` Models with more states (larger state space) take longer to estimate because the number of transition rates increases. Hidden rate models further expand state space. For example, adding a second rate category incerases the number of transition rates from 4 to 10 (if the model is left as the default "ARD"). In section 1.1 we left our parameters unconstrained. We estimated all transisions as independent and allowed for transitions from all states to any other state. However, we can constrain a model in `corHMM` in two different ways. The easiest way is to set the model to either "SYM" or "ER". This is what we've done for the HMM_3state model above. By setting model = "SYM", we have forced the transition rates between any two states to be equal. In comparison, model = "ER" constrains all transition rates between states to be the same. Finally, model = "ARD" (the default) allows all transition rates to be independently estimated. Although "ER" and "SYM are common restrictions, it is often more useful to manually restrict your model to match a biological hypothesis (which is described in the next section). Finally, we set get.tip.states to be true because it is necessary for simmaps. Interpreting the estimated rate matrix for this hidden Markov model is intimidating. But, the same principles of interpreting the transition rate matrices apply -- that is, you still read rates from row to column. However, there is the added complexity of transitions among the different rate categories (as represented by R1 and R2). `plotMKmodel()` will plot the underlying structure of model in discrete parts. In the following example, the first 2 panels show how observed states transition within each rate category, and the last panel shows transitions among the different rate classes: ```{r, fig.align = "center", fig.width=10, fig.height=3} plotMKmodel(HMM_3state, display = "row") ``` And again we can plot the simmap of this `corHMM` result. It is important to note that a character history not only generates hypotheses about ancestral states, but is an effective way to visualize the tempo of evolution. This is particularly important for HMMs where rates of evolution can vary drastically across the tree. ```{r} #get simmap inputs from corhmm outputs phy = HMM_3state$phy data = HMM_3state$data model = HMM_3state$solution model[is.na(model)] <- 0 diag(model) <- -rowSums(model) # run get simmap (can be plotted using phytools) simmap <- makeSimmap(tree=phy, data=data, model=model, rate.cat=2, nSim=1, nCores=1) # we import phytools plotSimmap for plotting phytools::plotSimmap(simmap[[1]], fsize=.5) ``` # Section 2: How to make and interpret custom models ## 2.1: Creating and using custom rate matrices ### 2.1.1: One rate category At its core, the purpose of a rate matrix (i.e., `rate.mat`) is to indicate to `corHMM` which parameters are being estimated. It specifies to `corHMM()` which rates in the matrix are being estimated and if any of them are expected to be identical. A custom rate matrix allows you to specify explicit hypotheses. For example, such an approach allows for tests of evolution of traits in a particular order, tests of different rates of evolution in different clades, or tests of the presence of hidden precursors before a state can evolve. Let's start by using the `getStateMat4Dat()` function to get a generic `rate.mat` object: ```{r} LegendAndRateMat <- getStateMat4Dat(data) RateMat <- LegendAndRateMat$rate.mat RateMat ``` The numbers in this matrix are not rates, they are used to index the unique parameters to be estimated by `corHMM()`. Each distinct number is a parameter to be estimated independently from all others. Let's manually create the symmetric model we used in secion 1.2. In the symmetric model we want transitions *to* a state to be the same as *from* that state. This means that (1) $\rightarrow$ (2) & (2) $\rightarrow$ (1) are equal AND that (3) $\rightarrow$(2) and (2) $\rightarrow$ (3) are equal. In other words, based on the `rate.mat` above, we want parameters 1 & 2 to be equal and we want parameters 3 & 4 to be equal as shown below: ```{r} pars2equal <- list(c(1,2), c(3,4)) StateMatA_constrained <- equateStateMatPars(RateMat, pars2equal) StateMatA_constrained ``` To manually create a symmetric model, we used the equateStateMatPars() function, in which the first argument is the rate matrix being modified (i.e., rate.mat object) and second argument is list of the parameters to be equated. One thing to note is that you must have the appropriate number of rate categories since a user rate matrix is not duplicated or changed by `corHMM()`. Thus, this custom model can only be used if we set rate.cat=1 since that is the appropriate number of rate categories. We can now provide this customized `rate.mat` to `corHMM()`: ```{r, eval=-1} MK_3state_customSYM <- corHMM(phy = phy, data = data, rate.cat = 1, rate.mat = StateMatA_constrained) MK_3state_customSYM ``` ### 2.1.2: Any number of rate categories From a technical standpoint, hidden Markov models have a hierarchical structure that can be broken down into two components: a "state-dependent process" and an unobserved "parameter process" (Zucchini et al. 2017). In comparative biology, the standard "state-dependent process" model is a continuous-time Markov chain. The observed states could be any discretized trait such as presence or absence of extrafloral nectaries (Marazzi et al. 2012), woody or herbaceous growth habit (Beaulieu et al. 2013), or diet state across all animals (Roman-Palacios et al. 2019). However, a simple Markov process alone that assumes homogeneity through time and across taxa is often not adequate to capture the variation of real datasets (e.g. Beaulieu et al. 2013). One option is to say that the observed data is the product of several processes occurring in different parts of a phylogeny. The parameter process describes how several state-dependent processes relate to one another. Thus, observations are generated by a given state-dependent process depending on the state of the parameter process. It is initially unknown what the parameter process corresponds to biologically, hence the moniker "hidden" state. If you wanted to add hidden rate categories, you need to know: (1) the dynamics *within* each rate category (state-dependent processes), and (2) transitions *between* the different rate classes (parameter process). We begin by constructing two *within* rate category `rate.mat` objects (R1 and R2). In R1, we assume a drift-like hypothesis where all transition rates are equal. In R2, we assume that the combination of estrus advertisement and multimale mating systems are not lost once they evolve: ```{r} RateCat1 <- getStateMat4Dat(data)$rate.mat # R1 RateCat1 <- equateStateMatPars(RateCat1, c(1:4)) RateCat1 RateCat2 <- getStateMat4Dat(data)$rate.mat # R2 RateCat2 <- dropStateMatPars(RateCat2, 3) RateCat2 ``` With respect to transitions *among* the different rate classes, we have implemented a separate matrix generator, `getRateCatMat()`. By default, this function will assume that all transitions among the specified number of rate classes occur independently. In our example, we will generate a matrix that specifies how transitions between R1 and R2 occur. Note that R1 and R2 could represent a biologically-relevant, but unmeasured factor, such as, say, temperate or tropical environments, island or mainland, presence or absence of a third trait. Basically, it is everything and anything that can influence the evolution of your observed characters. For illustrative purposes, we will specify that the transition rate from R1 to R2 is the same as the rate from R2 to R1: ```{r} RateClassMat <- getRateCatMat(2) # RateClassMat <- equateStateMatPars(RateClassMat, c(1,2)) RateClassMat ``` We now group all of our rate classes together in a list. The first element of the list corresponds to R1, the second to R2, etc. ```{r} StateMats <- list(RateCat1, RateCat2) StateMats ``` We now have all the components necessary to create the full model using the `getFullMat()` function. This function requires that the first input be a list of the within rate class matrices and the second argument be the among rate class matrices: ```{r} FullMat <- getFullMat(StateMats, RateClassMat) FullMat ``` Even though we created this larger index matrix from individuals components, we may not be sure it's exactly what we want. We can use `plotMKmodel()` to take a look at the model setup *before* running the analysis. Here's an example function call: ```{r, eval=FALSE} plotMKmodel(FullMat, rate.cat = 2, display = "row", text.scale = 0.7) ``` Since this is the model we intended on making, we can run `corHMM()` with our custom matrix: ```{r, eval=-1} HMM_3state_custom <- corHMM(phy = phy, data = data, rate.cat = 2, rate.mat = FullMat, node.states = "none") HMM_3state_custom ``` We may plot the resulting parameter estimates as before: ```{r, fig.width=10, fig.height=3} plotMKmodel(HMM_3state_custom, display = "row", text.scale = 0.7) ``` ## 2.2: Biological examples ### 2.2.1: Ordered habitat change A lot of these new capabilities in `corHMM` were inspired by our current project examining the ancestral habitat during primary endosymbiosis. In our model, we have three possible habitats: marine, freshwater, and terrestrial. Our very large phylogeny of green plants contains many species with a diverse range of life histories. For example, cyanobacteria can move freely between all of these states, whereas some species may move between terrestrial and marine through freshwater. Some species may even move freely between aquatic states, but once they become terrestrial they are stuck there. In this section we will demonstrate how to create a custom hidden Markov model which satisfies all of these requirements. To do this we will use a simulated a dataset that contains these 3 states: ```{r} phy <- read.tree("randomBD.tree") load("simulatedData.Rsave") head(MFT_dat) summary(as.factor(MFT_dat[,2])) # how many of each state do we have? ``` As before, start off by getting a legend and rate matrix from this dataset: ```{r} MFT_LegendAndRate <- getStateMat4Dat(MFT_dat) MFT_LegendAndRate ``` Here, freshwater habitat will be State 1, marine habitat will be State 2,and terrestrial habitat will be State 3. Now, we need to create 3 different rate classes that are consistent with our hypotheses of how habitat changes occurs. We'll say that rate class R1 is one in which lineages cannot leave a terrestrial habitat, rate class R2 will allow linneages to transition between marine and terrestrial *only* through freshwater, and rate class R3 will be unrestricted movement between the habitats. For R1 we need terrestrial to be an absorbing state, meaning once terrestriality evolves it is not lost. Since 1 = freshwater, 2 = marine, and 3 = terrestrial, that means removing from (3) to (1) and from (3) to (2). ```{r} MFT_R1 <- dropStateMatPars(MFT_LegendAndRate$rate.mat, c(2,4)) MFT_R1 ``` For R2, we need to disallow transitions between terrestrial and marine. We disallow the positions (1,3) and (3,1) in the rate matrix. In this case, any lineage can move into freshwater and move out of freshwater, but they are not allowed to transition directly between terrestrial and marine habitats: ```{r} MFT_R2 <- dropStateMatPars(MFT_LegendAndRate$rate.mat, c(4,6)) MFT_R2 ``` For R3, we allow all possible transitions to occur, which is the default matrix provided by getStateMat4Dat: ```{r} MFT_R3 <- MFT_LegendAndRate$rate.mat MFT_R3 ``` Let's put all these matrices in a list, ```{r} MFT_ObsStateClasses <- list(MFT_R1, MFT_R2, MFT_R3) ``` Since we only have 100 species in this example, let's constrain our parameters a bit further and state that transitions between rate classes occur at the same rate: ```{r} MFT_RateClassMat <- getRateCatMat(3) # we have 3 rate classes MFT_RateClassMat <- equateStateMatPars(MFT_RateClassMat, 1:6) ``` Next, we put it all together into a corHMM compatible rate.mat: ```{r} MFT_FullMat <- getFullMat(MFT_ObsStateClasses, MFT_RateClassMat) MFT_FullMat ``` That's kind of difficult to interpret, so be sure to plot it out using `plotMKmodel()` ```{r, eval=FALSE} plotMKmodel(MFT_FullMat, rate.cat = 3, display = "square", text.scale = 0.9) ``` To run this model, we would only need to specify 1) the data, 2) the phylogeny, 3) this matrix, and 4) that this matrix has 3 rate categories: ```{r, eval=-1} MFT_res.corHMM <- corHMM(phy = phy, data = MFT_dat, rate.cat = 3, rate.mat = MFT_FullMat, node.states = "none", root.p = "maddfitz") MFT_res.corHMM ``` ### 2.2.2: The precursor model The precursor model of Marazzi et al. (2012) marks the beginning of HMMs being used in a phylogenetic comparative context. Marazzi et al. (2012) were interested in locating putative evolutionary precursors of plant extrafloral nectaries (EFNs). Specifically, there were 2 states, presence (1) or absence (0) of EFNs, but that only species with an unobserved, hidden "precursor" trait could gain EFNs. Here we show how you could design the canonical precursor model in `corHMM` using custom rate matrices. We will start by loading a simulated dataset of presence and absence of extrafloral nectaries a randomly generated birth-death tree: ```{r} head(Precur_Dat) ``` Next, generate an observed states only matrix using the input single binary trait data set: ```{r} Precur_LegendAndMat <- getStateMat4Dat(Precur_Dat) Precur_LegendAndMat ``` Based on the legend, the absence of EFNs will be State 1 and the presence of EFNs will be State 2. For a precursor model the transitions between the two observed states, 1 and 2, are modulated by a third, hidden trait, which we will call a precursor. The precursor is represented by being in State 1 (lacking EFNs), but being in the "precursor rate class" (R2 in this case). In other words, if we observe that a species lacks EFN's, we do not know if they also have the precursor (i.e., 1,R2) or not (i.e., 1,R1). We do know, however, that under a precursor model that if we observe EFN, they must always also have the precursor trait, and so the presence of EFNs is always (2,R2). So, we will use rate class R2 as a direct measurement of transitioning between presence and absence of EFNs. The first rate class, R1, will represent character changes in the absence of the precursor, which is not possible without first gaining the "precursor". So, we will generate the default matrix, then drop all possible transitions from this matrix: ```{r} Precur_R1 <- Precur_LegendAndMat$rate.mat Precur_R1 <- dropStateMatPars(Precur_R1, c(1,2)) Precur_R1 ``` The second rate class, R2, will represent how our character changes in the presence of the precursor. In this rate class, we expect that species can either gain or lose EFNs at the same rate: ```{r} Precur_R2 <- Precur_LegendAndMat$rate.mat Precur_R2 <- equateStateMatPars(Precur_R2, c(1,2)) Precur_R2 ``` Finally, we set up a matrix for that governs the transitions *among* the rate classes: ```{r} RateClassMat <- getRateCatMat(2) # RateClassMat <- equateStateMatPars(RateClassMat, c(1,2)) RateClassMat ``` Putting them rate classes together we almost get the right model, but we need to remove one extra transition rate between that connects rate class R1 and R2 in the presence of EFNs, because, again, the precursor model assumes that EFNs can *only* be gained in rate class 1. ```{r} Precur_FullMat <- getFullMat(list(Precur_R1, Precur_R2), RateClassMat) Precur_FullMat[c(4,2), c(2,4)] <- 0 Precur_FullMat ``` We now run `corHMM()` making sure to specify that we have 2 rate categories (or rate classes or hidden states - it's all the same). ```{r} Precur_res.corHMM <- corHMM(phy = phy, data = Precur_Dat, rate.cat = 2, rate.mat = Precur_FullMat, root.p = "maddfitz") Precur_res.corHMM ``` ### 2.2.3: Ontological relationship of multiple characters Lets say we had a dataset with multiple characters: 1) presence or absence of limbs, 2) presence or absence of fingers, 3) corporeal or incorporeal form. It could look something like this: ```{r} data(primates) phy <- primates[[1]] phy <- multi2di(phy) data <- primates[[2]] Limbs <- c("Limbs", "noLimbs")[data[,2]+1] Fings <- vector("numeric", length(phy$tip.label)) Fings[which(Limbs == "Limbs")] <- round(runif(length(which(Limbs == "Limbs")), 0, 1)) Corpo <- rep("corporeal", length(phy$tip.label)) Ont_Dat <- data.frame(sp = phy$tip.label, limbs = Limbs, fings = Fings, corp = Corpo) head(Ont_Dat) ``` Previously, the user would have had to convert this dataset into a format that could be read by the `rayDISC()` function. This task previously involved taking all possible unique combinations and creating a multistate character, but this version of `corHMM()` will internally do this for you: ```{r} Ont_LegendAndMat <- getStateMat4Dat(Ont_Dat) Ont_LegendAndMat ``` Even though there were 3 binary characters (meaning 8 possible states), only 3 combinations were actually observed. This is because all of the species were corporeal and thus the incorporeal form didn't factor into the matrix structure. The next thing to notice is that one of the potential states (No Limbs, Yes Fingers) is not present in the dataset and thus not included in the model. In addition, the transition from 3 (No Limbs, No Fingers) to 2 (Yes Limbs, Yes Fingers) is not allowed because it is impossible to have fingers without having limbs. Finally, all dual transitions have been removed. ```{r, eval=-1} Ont_res.corHMM <- corHMM(phy = phy, data = Ont_Dat, rate.cat = 1, rate.mat = Ont_LegendAndMat$rate.mat, node.states = "none") ``` Note that hidden states can be added to this model by following the examples above. # Section 3: Estimating models when node states are fixed ## 3.1: Fixing a single node We also added the ability to fix any or all nodes in the input phylogeny while estimating a model. This new feature was inspired by a request from Scott Edwards, who was interested in whether the range of rates of flight gain and loss will result in the highest probability of a volant ancestor to flightless lineages. He ran a series of ancestral state reconstructions under a range of rates of gain and loss of flight. These ancestral state reconstructions focused on a single ancestor at a time (since each ancestor will have a slightly different set of parameters) and recorded the probability of a volant ancestor. These analyses are included in the supplemental of Sackton et al. (2019). In this updated version of `corHMM`, a user can fix anywhere from a single node in a tree to an entire reconstruction from, say parsimony, and estimate the transitions rate. One can even obtain the likelihood of a reconstruction based on a fixed set of rates. To demonstrate, let's start by running a simple analysis of a binary character, but fixing the state of a single node in the the primate tree. Specifically, we are going to fix the most recent common ancestor (MRCA) of Gorilla gorilla and Homo sapiens as exhibiting estrus advertisement (i.e., State 1). The first step is to determine the the indices for each state: ```{r, eval=TRUE} data(primates) phy <- primates[[1]] phy <- multi2di(phy) data <- primates[[2]] getStateMat4Dat(data[,c(1,2)]) ``` Here, index 2 represents the presence of estrus advertisement in the model. The next step is to create a vector of node states. We start by generating a string of `NA` of length equal to the number of nodes plus the number of tips in the tree. The `NA` simply tells `corHMM()` to ignore as these are nodes that are not fixed. We then have to determine which node is the MRCA of *Gorilla gorilla* and *Homo sapiens*: ```{r, eval=TRUE} label.vector <- rep(NA, Ntip(phy) + Nnode(phy)) homo_gorilla <- getMRCA(phy,tip=c("Homo_sapiens", "Gorilla_gorilla")) homo_gorilla ``` The node number should be 64. To set the state of the node, simply replace the `NA` with the state of interest in the `label.vector`, in this case with a `2`, in the 64th element. To set the node labels, we will clip off the first 60 elements, as these represent the states of the tips, which we set differently: ```{r, eval=TRUE} label.vector[homo_gorilla] <- 2 phy$node.label <- label.vector[-c(1:Ntip(phy))] ``` Plotting the tree allows users to visually check whether the right node was fixed: ```{r, fig.align="center", fig.width=6, fig.height=6} plot(phy, cex=.5) nodelabels(phy$node.label) ``` From here simply input the `tree` object in `corHMM()` as normal, but the option fix.nodes needs to be set to `TRUE`: ```{r, echo=FALSE} fix.node64.estrus <- corHMM(phy, data[,c(1,2)], model="ER", rate.cat=1, fixed.nodes=TRUE) ``` We can then compare the fit of this model with another model where the same node is fixed to lacking estrus advertisement: ```{r, eval=TRUE} label.vector[homo_gorilla] <- 1 phy$node.label <- label.vector[-c(1:Ntip(phy))] fix.node64.noestrus <- corHMM(phy, data[,c(1,2)], model="ER", rate.cat=1, fixed.nodes=TRUE) fix.node64.noestrus ``` This comparison shows that the model where the MRCA of Gorilla gorilla and Homo sapiens is assumed to have exhibited estrus advertisement requires higher rates, and produces a substantially worse likelihood, than the model that assumes the MRCA lacked estrus advertisement. ## 3.2: Estimating rates under a parsimony reconstruction It is also possible to estimate transition rates where all nodes are fixed in the tree. For example, what if one wanted to examine the fit of a model where all nodes are fixed to according to a maximum parsimony reconstruction. Here we will use `phangorn` for these purposes. However, with `phangorn` there is the burden of dealing with their unique `phyDat` format. To deal with this, we implemented a function, `ConvertPhangornReconstructions()` that will convert the `phyDat` formatted output into something we can modify and input into `corHMM()`. Specifically, we can take the `mpr.recon` object from `phangorn` and convert the output as a vector and add them as node states in the phylogeny: ```{r, eval=TRUE, message=FALSE} library(phangorn) data.sort <- data.frame(data[,2], row.names=data[,1]) data.sort <- data.sort[phy$tip.label,] dat<-as.matrix(data.sort) rownames(dat) <- phy$tip.label dat<-phyDat(dat,type="USER", levels=c("0","1")) mpr.recon <- ancestral.pars(phy, dat, type = c("MPR")) mpr.recon.converted <- ConvertPhangornReconstructions(mpr.recon) phy$node.label <- mpr.recon.converted[(Ntip(phy)+1):length(mpr.recon.converted)] ``` Plotting the tree shows the parsimony reconstruction: ```{r, fig.align="center", fig.width=6, fig.height=6} plot(phy, cex=.5) nodelabels(phy$node.label) ``` Next, input the tree into `corHMM()` and obtain a rate estimate for this reconstruction: ```{r, eval=TRUE} fixed.parsimony.recon <- corHMM(phy, data[,c(1,2)], model="ER", rate.cat=1, fixed.nodes=TRUE) fixed.parsimony.recon ``` Interestingly, the parsimony reconstruction suggests a lot more change than if we estimated the states from the model itself. ## 3.3: Fixing nodes when the model contains hidden states Finally, if the model contains hidden states, the user needs to fix the state of the node based on the observed state *only*. Remember, since we cannot actually observe hidden states, we must treat the state of the node as ambiguous across all possible rate classes like we would a tip. Let's run a quick example where we fix the MRCA of *Gorilla gorilla* and *Homo sapiens* as lacking estrus advertisement: ```{r, eval=TRUE} label.vector <- rep(NA, Ntip(phy) + Nnode(phy)) homo_gorilla <- getMRCA(phy,tip=c("Homo_sapiens", "Gorilla_gorilla")) label.vector[homo_gorilla] <- 1 phy$node.label <- label.vector[-c(1:Ntip(phy))] fix.node64.noestrus <- corHMM(phy, data[,c(1,2)], model="ARD", rate.cat=2, fixed.nodes=TRUE) ``` Now, if we print out the line corresponding to our fixed node, ```{r, eval=TRUE} fix.node64.noestrus$states[homo_gorilla-Ntip(phy),] ``` there should be some uncertainty as to whether the absence of estrus advertisement is in R1 or R2. The total probability, however, of the node being in observed state 1 should sum to 1: ```{r, eval=TRUE} sum(fix.node64.noestrus$states[homo_gorilla-Ntip(phy),]) ``` ## References Beaulieu J.M., B.C. O'Meara, and M.J. Donoghue. 2013. Identifying hidden rate changes in the evolution of a binary morphologicalcharacter: the evolution of plant habit in campanulid angiosperms. Systematic Biology 62:725-737. Marazzi B., Ane C., Simon M.F., Delgado-Salinas A., Luckow M., Sanderson M.J. 2012. Locating Evolutionary Precursors on a Phylogenetic Tree. Evolution. 66:3918-3930. Pagel, M. 1994. Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters. Proceedings of the Royal Society, B. 255:37-45. Pagel, M., and A. Meade. 2006. Bayesian analysis of correlated evolution of discrete characters by reversible-jump Markov chain Monte Carlo. American Naturalist 167:808:825. Roman-Palacios C., Scholl J.P., Wiens J.J. 2019. Evolution of diet across the animal tree of life. Evolution Letters. 3:339-347. Sackton, T.B., P. Grayson, A. Cloutier, Z. Hu, J.S. Liu, N.E. Wheeler, P.P. Gardner, J.A. Clarke, A.J. Baker, M. Clamp, and S.V. Edwards. 2019. Convergent regulatory evolution and loss of flight in paleognathous birds. Science 364:74-78. Zucchini W., MacDonald I.L., Langrock R. 2017. Hidden Markov models for time series: an introduction using R. Chapman and Hall/CRC.