--- title: "queeems: Quantify the Extent of Evolutionary Evidence in Molecular Sequences" author: - name: Hassan Sadiq affiliation: - Department of Statistics and Actuarial Science, Stellenbosch University, South Africa email: hassan.t.sadiq@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show package: queeems bibliography: queeems.bib vignette: | %\VignetteIndexEntry{queeems Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction `queeems` is a R [@cran2025] package hosted on the Bioconductor [@bioc2004] platform. The package is useful for assessing the quality of genetic data that are meant for evolutionary investigations. Some invaluable methods and resources are available in the literature for similar purpose. These include: DAMBE [@xia2002], SatuTe [@manuel2025], TreSpEx [@struck2014], PAUP [@swoff2002], RASA [@lyons1996] and MUST [@herve1993]. Interested reader may consult @fleming2023 for a short review of some of the works. These existing techniques are designed primarily for investigations related to phylogeny reconstruction and many of them rely on adhoc frequentist-based hypotheses tests. `queeems` contains Bayesian functions that can be used to assess the quality of genetic data that are intended for use in phylogeny reconstruction or in broader evolutionary inquiries, including natural selection investigations. In this practical application of some of the primary functions in the package, I present analyses of simulated sequences to illustrate how to get the most out of this resource. # Installation Use the following code to install `queeems` from the Bioconductor platform. ```{r install, eval=FALSE} if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("queeems") ``` The developmental version of the package that contains the most recent updates, is available under the `devel` tab of the Bioconductor platform and can be installed by running the following code, just before the final install line above: `BiocManager::install(version="devel")`. # Data The molecular sequences analysed herein are generated with the `scoup` [@sadiq2026] package that allows genetic sequences to be simulated following a simplified Ornstein-Uhlenbeck (OU) process. A R file that contains the simulation code and a folder that contains the simulated data are available in the `inst/extdata/example/` folder within the `queeems` package. The folder also contains all the external data files relevant to any of the discussions presented here. In general, genetic sequence alignments were simulated on a balanced bifurcating evolutionary tree that has 64 leaves. For each data simulated, the length of the internal and the terminal branches were all kept equal to the same value. Some of the other necessary `scoup` inputs that are common for all the simulated data analysed here, are presented in the table below. *** | | | |:----------------------------------------------|---:| |*Number of terminal tree branches* | 64| |*Number of codon sites* | 100| |*Effective population size* |1000| |*Variance of synonymous selection coefficients*|0.01| |*OU reversion rate* |0.00| |*OU asymptotic variance* |0.00| |*Number of simulated data replicate per query* | 5| Table: Inputs of the `scoup` genetic sequence simulator that were kept unchanged for all the executions that generated the data sets analysed with `queeems` herein. Any other input of the simulator not listed in the table were left at their default values. # Example application of `queeems` There are two primary uses of `queeems`: (a.) to generate molecular entropy estimates and (b.) to make inference about the saturation status of genetic data. I demonstrate both cases by considering two sources of molecular substitution saturation: (a.) time since divergence from common ancestor as measured by tree lengths [@xia2002] and (b.) speed of molecular substitution as quantified by the magnitude of positive natural selection pressure [@manuel2025]. ## Tree length analyses In the code chunk below, I used the `queeems::seqSaturation` function to obtain site-wise and overall Bayes factors so that I may assess the saturation level of the corresponding simulated sequences. I also used the `queeems::molentropy` function to obtain @shannon1948 information entropy indices with respect to nucleotide, codon and amino acid bases. In addition to the tabulated `scoup` inputs, with respect to all the data sets generated for this section, the variance of non-synonymous selection coefficients ($\sigma_{`n`}^{2}$) was set equal to `0.25`. `queeems` is designed with optimal user flexibility in mind. Therefore, there is need to initiate the packages' operations with certain parameter values. The parameter settings that produced the results for this section are listed in the table below. *** | | | |------------------------------------|----:| |*Inference approach* | "A"| |*Discretised weight size* | 6| |*Proportion of weights to be $<$ 1* | 0.40| |*Weight upper-bound* | 7.00| |*Bayes factor threshold* | 3.00| |*Tolerance for saturated sites* | 0.01| Table: Parameter settings used to initiate the `queeems` analyses of saturation influenced by tree lengths. Readers should consult the function manual available with the package, for extensive discussions about the implications of the listed inputs. ```{r TL queeems, eval=TRUE} # ><>< # Load package into current R session library("queeems") # ><>< # Create useful summary function infoests <- function(xdata){ serr <- sd(xdata) / sqrt( length(xdata)) avg <- mean( xdata ) low <- avg - serr upp <- avg + serr outs <- c(low=low, avg=avg, upp=upp) return( outs ) } # ><>< # Retrieve path to simulated tutorial data folder datadir <- system.file("extdata/example/stemL/", package="queeems") # ><>< # Indicate number of simulated data replicates nreps <- 5 # ><>< # Set preferred Bayes factor threshold bflimit <- 3 # ><>< # Set preferred saturated sites tolerance ssprop <- 0.01 # Branch length blent <- seq(0.03, 0.15, 0.03) btags <- sprintf("%02.0f", blent*100) # ><>< # Create data frames to save interesting outputs cnames <- paste("bl", btags, sep="") rnames <- c("low","avg","upp") vnames <- paste("rep.", seq(1,nreps), sep="") homoBF <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames)) sbayes <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames)) smrybx <- matrix(NA, 3, length(btags), dimnames=list(rnames,cnames)) cdnTmp <- aasTmp <- vector("numeric", 100*nreps) nucTmp <- vector("numeric", 300*nreps) nucent <- codent <- aasent <- smrybx for(a1 in seq(1,length(btags))){ # ><>< # Read merged molecular sequences data as text seqsname <- file.path(datadir, paste("seqs",btags[a1],".txt", sep="")) seqstext <- readLines(seqsname) # ><>< # Extract information about number of simulated replicates sbreaks <- which(seqstext == " 64 300") # ><>< # Analyse data replicates captured from merged data file for(a0 in seq(1,nreps)){ nucentID <- seq(1+(a0-1)*300, a0*300) entropyID <- seq(1+(a0-1)*100, a0*100) # ><>< # Extract and temporarily save a sequence replicate init <- sbreaks[a0] + 1 halt <- sbreaks[a0] + 128 seqrep <- seqstext[seq(init,halt)] tempfile <- file.path(tempdir(), "iseqs.fasta") write.table(seqrep, file=tempfile, append=FALSE, quote=FALSE, row.names=FALSE, col.names=FALSE) # ><>< # Execute saturation analysis satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop) # ><>< # Extract Bayes factors homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",]) sbayes[a0,a1] <- sum(BFs(satest) > bflimit) # ><>< # Execute codon entropy analysis tempentropy <- molentropy(tempfile, "codon", "shannon") cdnTmp[entropyID] <- sitentropies(tempentropy) # ><>< # Process nucleotide entropy analysis tempentropy <- molentropy(tempfile, "dna", "shannon") nucTmp[nucentID] <- sitentropies(tempentropy) # ><>< # Transform nucleotide to amino acid sequences dnaformat <- Biostrings::readDNAStringSet(tempfile) aaformat <- Biostrings::translate(dnaformat) Biostrings::writeXStringSet(aaformat, tempfile) # ><>< # Implement amino acid entropy analysis tempentropy <- molentropy(tempfile, "aa", "shannon") aasTmp[entropyID] <- sitentropies(tempentropy) } # ><>< # Summarise and save entropy values nucent[rnames,a1] <- infoests( nucTmp )[rnames] codent[rnames,a1] <- infoests( cdnTmp )[rnames] aasent[rnames,a1] <- infoests( aasTmp )[rnames] # ><>< # Provide analysis progress update message("Completed analyses of stem length 0.", btags[a1]) } # ><>< # Site-wise Bayes factors peak print( head( BFs(satest))) # ><>< # Saturated site counts print(sbayes) # ><>< # Overall Bayes factor estimates print(homoBF) ``` Next, I present visual summaries of `queeems` analyses output. All the Bayes factors are based on the alternative hypothesis that the corresponding sequences are saturated. Therefore, it is expected that the Bayes factor estimates will increase as the tree length increases. Tree length is calculated as the distance from the origin of the rooted balanced bifurcating phylogeny to the tip of the terminal branches. The increasing pattern in Figure A, consequently demonstrates that the outputs from `queeems` are reliable. ```{r treeLength BFs, eval=TRUE} # ><>< # Function that makes plotting easier tfunc <- function(i, pty, clr, cx, mat, agl=90){ arrows(i, mat["low",i], i, mat["upp",i], .1, agl, 3, colors()[clr], lwd=3) points(i,mat["avg",i],pch=pty,lwd=2,col=colors()[clr],cex=cx,bg="white") return(0) } # ><>< # Summarise overall Bayes factor values bfsumm <- apply(homoBF,2,infoests) par(mar=c(4.2,4.2,0.2,0.2)) plot(0, 0, col="white", bty="n", xlab="", ylab="", xlim=c(1.0,5.0), ylim=c(0.0,7.0), xaxt="n", yaxt="n") phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0) axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.2, cex.axis=1.5) axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.2) text(1.3, 6.5, "A", cex=2.5, font=2) mtext("Bayes factor", 2, 2.2, cex=2) mtext("Tree Length", 1, 2.4, cex=2) box(lwd=2) ``` The inference from Figure A is interrogated further by plotting the Shannon entropy estimates, presented in Figure A.1 and the site-wise Bayes factors, presented in Figure A.2. To ensure accurate comparisons, the entropy estimates are scaled with the maximum possible values. The maximum Shannon entropy for a sequence with `n` possible bases is $-\log(1/n)$. For nucleotide, sense codon and amino acid bases, `n = 4`, `n = 61` and `n = 20`, respectively. The points in the Figure A.1 are the average and the arrow widths are proportional to the standard errors. The plot justifies the choice of codon as a robust unit for quantifying saturation. Amino acid appear to be moving faster to saturation (that is, closer to one) for shorter lengths. As the lengths get longer, nucleotide bases appears to close the gap with amino acids and most probably surpass it. The entropy analyses presented here were obtained by applying the popular @shannon1948 method on the base frequencies. Other counting metrics are accessible in `queeems`. These include: the `cncsentropy` and the `codondifferindex` functions. The package is also designed in a way that permits users to opt for `Renyi` entropy calculation method, if preferred. Readers may consult the help pages of the respective functions for further details. The histogram in Figure A.2 depicts the distribution of the site-wise Bayes factors. For brevity, only the output for the fifth replicate for the 1.05 tree length analysis is shown. As expected for non-saturated sequences, the distribution is positively skewed with majority of the sites returning Bayes factors less than 3.0. This implies that the corresponding genetic sequence is most likely not saturated. ```{r entropy, eval=TRUE} # ><>< # Scale entropy estimates by the maximum value eScaledNuc <- nucent / (-log(1/4)) eScaledCodon <- codent / (-log(1/61)) eScaledAmino <- aasent / (-log(1/20)) # ><>< # Plot distributions of entropy estimates par(mar=c(4.2,4.2,0.2,0.2)) bases <- c("Nucleotide","Codon","Amino acid") plot(0, 0, col="white", bty="n", xlab="", ylab="", xlim=c(1.0,5.0), ylim=c(.19,0.57), xaxt="n", yaxt="n") phold <- vapply(seq(1,5), tfunc, 0, 17, 1.5, eScaledNuc, FUN.VALUE=0) phold <- vapply(seq(1,5), tfunc, 1, 60, 1.5, eScaledCodon, FUN.VALUE=0) phold <- vapply(seq(1,5), tfunc, 8, 74, 1.5, eScaledAmino, FUN.VALUE=0) axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5) text(rep(3.7,3), c(.27,.24,.21)-.002, bases, cex=1.5, pos=4) axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7) points(rep(3.6,3), c(.27,.24,.21), cex=2.5, lwd=2, pch=c(0,1,8), col=colors()[c(17,60,74)]) text(3.8, 0.30, "Base", cex=2, font=2) mtext("Scaled Entropy", 2, 2.4, cex=2) text(1.3, 0.54, "A.1", cex=2.5, font=2) mtext("Tree Length", 1, 2.2, cex=2) box(lwd=2) ``` ```{r site BFs, eval=TRUE} # ><>< # Site-wise Bayes factors from replicate 5 of length 1.05 par(mar=c(4.2,4.2,0.2,0.2)) plot(0, 0, col="white", bty="n", xlab="", ylab="", xlim=c(.6,4.4), ylim=c(.03,0.85), xaxt="n", yaxt="n") mtext(c("Site-Wise Bayes Factors","Density"), c(1,2), 2.4, cex=2) axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7) hist(BFs(satest), freq=FALSE, add=TRUE, xaxt="n", yaxt="n", main="", col=colors()[17], lwd=2) axis(1, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5) text(4.1, 0.81, "A.2", cex=2, font=2) box(lwd=2) ``` ## Natural selection analyses The parameters input for the analyses in this section are the same as those used for the tree length analyses. The tree length was however fixed as `0.7` for all simulations. Non-synonymous selection coefficient variance ($\sigma_{`n`}^{2}$) is the variable in this section and values considered are, (0.1, 0.2, 0.3, 0.4, 0.5). Figure B contains summaries of the corresponding overall sequence Bayes factors. The points represent the averages and the arrows are proportional to the standard errors estimated over the five data replicates analysed for each $\sigma_{`n`}^{2}$ value. The analytical estimates of the ratio of non-synonymous to synonymous substitution rates ($\mathrm{d}N/\mathrm{d}S$) were obtained along with the sequences from `scoup`, where they were calculated using expressions from @spielman2015. ```{r dnds test, eval=TRUE} # ><>< # Retrieve path to simulated tutorial data folder datadir <- system.file("extdata/example/varNS/", package="queeems") # Non-synonymous variance value nsvary <- seq(1, 5) / 10 ntags <- sprintf("%.0f", nsvary*10) # ><>< # Create data frames to save interesting outputs cnames <- paste("ns", ntags, sep="") rnames <- c("low","avg","upp") homoBF <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames)) sbayes <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames)) for(a1 in seq(1,length(ntags))){ # ><>< # Read merged molecular sequences data as text seqsname <- file.path(datadir, paste("gdata",ntags[a1],".txt",sep="")) seqstext <- readLines(seqsname) # ><>< # Extract information about number of simulated replicates sbreaks <- which(seqstext == " 64 300") # ><>< # Analyse data replicates captured from merged data file for(a0 in seq(1,nreps)){ # ><>< # Extract and temporarily save a sequence replicate init <- sbreaks[a0] + 1 halt <- sbreaks[a0] + 128 seqrep <- seqstext[seq(init,halt)] tempfile <- file.path(tempdir(), "iseqs.fasta") write.table(seqrep, file=tempfile, append=FALSE, quote=FALSE, row.names=FALSE, col.names=FALSE) # ><>< # Execute saturation analysis satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop) # ><>< # Extract Bayes factors homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",]) sbayes[a0,a1] <- sum(BFs(satest) > bflimit) } # ><>< # Provide analysis progress update message("Completed analyses for \U03c3\U00b2 = 0.", ntags[a1]) } # ><>< # Overall log(Bayes factor) estimates print( log(homoBF)) # ><>< # Summarise overall Bayes factor and dN/dS values bfsumm <- apply(log(homoBF),2,infoests) dndsPath <- file.path(datadir, "dnds.csv") dndses <- read.table(dndsPath, sep=";", header=TRUE) # ><>< # Facilitate super-imposition of dN/dS and BFs transplot <- function(x, xrng, newrng){ xmin <- xrng[1] ymin <- newrng[1] xdiff <- diff(xrng) ydiff <- diff(newrng) newx <- (((x - xmin) / xdiff) * ydiff) + ymin return(newx) } newaxs <- seq(4,9) oldaxs <- transplot(newaxs, c(5,9), c(-.4,2.3)) newDnDs <- transplot(dndses, c(5,9), c(-.4,2.3)) dndsumm <- apply(newDnDs, 2, infoests) # ><>< # Add summaries of the BF values to plot par(mar=c(4.4,4.2,0.2,3.0)) plot(0, 0, col="white", bty="n", xlab="", ylab="", xlim=c(1.0,5.0), ylim=c(-.4,2.3), xaxt="n", yaxt="n") phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0) axis(1, seq(1,5), nsvary, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5) axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7) mtext(expression(sigma[n]^2), 1, 3.4, cex=2) mtext("Log(Bayes factor)", 2, 2.5, cex=2) text(1.2, 2.2, "B", cex=2.5, font=2) # ><>< # Include summaries of the dN/dS values phold <- vapply(seq(1,5), tfunc, 23, 17, 2, dndsumm, 60, FUN.VALUE=0) axis(4, oldaxs, newaxs, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7) mtext("dN/dS", 4, 2.0, cex=2) # ><>< # Add legend to plot text(c(2,2), c(2.2,2.0), c("BF","dN/dS"), cex=1.8, pos=4) arrows(1.7, 2.21, 2.0, 2.21, .1, 90, 3, colors()[50], 1, 3) arrows(1.7, 2.0, 2.0, 2.0, .1, 60, 3, colors()[17], 1, 3) points(c(1.85,1.85), c(2.21,2.0), cex=2, bg="white", lwd=3, pch=c(15,23), col=colors()[c(50,17)]) box(lwd=2) ``` It is illustrated in Figure B that higher values of $\sigma_{`n`}^{2}$ translates to higher magnitude of positive natural selection pressure, as measured by $\mathrm{d}N/\mathrm{d}S$. Therefore, as expected, the magnitude of the corresponding Bayes factors mostly increased accordingly. This inference further supports the authenticity of the inferences generated by the `queeems` package. # Conclusion I demonstrated that `queeems` avails increased flexibility that enables researchers to investigate genetic sequence saturation with better insights. The choice of the `queeems` parameters applied for the applications presented here are arbitrary. I appreciate that separate studies may require different values depending on the tolerance afforded by the conditions of the investigations being undertaken. A study about the sensitivity of the parameters is nearing completion. # Citation A more appropriate citation will be provided for the package in due course. In the meantime, kindly cite the package as, Sadiq, H. 2026. "queeems: Quantify the Extent of Evolutionary Evidence in Molecular Sequences". Bioconductor R Package. # References
# Session info The output of `sessionInfo()` from the computer where this file is generated is provided below. ```{r sessionInfo, echo=FALSE} sessionInfo() ```