---
title: "queeems: Quantify the Extent of Evolutionary Evidence in Molecular
Sequences"
author:
  - name: Hassan Sadiq
    affiliation:
    - Department of Statistics and Actuarial Science,
        Stellenbosch University, South Africa
    email: hassan.t.sadiq@gmail.com
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
package: queeems
bibliography: queeems.bib
vignette: |
    %\VignetteIndexEntry{queeems Tutorial}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
    BiocStyle::markdown()
```

# Introduction
`queeems` is a R [@cran2025] package hosted on the Bioconductor [@bioc2004]
platform. The package is useful for assessing the quality of genetic data
that are meant for evolutionary investigations. Some invaluable methods
and resources are available in the literature for similar purpose. These
include: DAMBE [@xia2002], SatuTe [@manuel2025], TreSpEx [@struck2014],
PAUP [@swoff2002], RASA [@lyons1996] and MUST [@herve1993]. Interested
reader may consult @fleming2023 for a short review of some of the works.
These existing techniques are designed primarily for investigations related
to phylogeny reconstruction and many of them rely on adhoc frequentist-based
hypotheses tests. `queeems` contains Bayesian functions that can be used
to assess the quality of genetic data that are intended for use in
phylogeny reconstruction or in broader evolutionary inquiries, including
natural selection investigations. In this practical application of some
of the primary functions in the package, I present analyses of simulated
sequences to illustrate how to get the most out of this resource.

# Installation
Use the following code to install `queeems` from the Bioconductor platform.
```{r install, eval=FALSE}
if(!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("queeems")
```
The developmental version of the package that contains the most recent
updates, is available under the `devel` tab of the Bioconductor platform
and can be installed by running the following code, just before the final
install line above: `BiocManager::install(version="devel")`.

# Data
The molecular sequences analysed herein are generated with the `scoup`
[@sadiq2026] package that allows genetic sequences to be simulated following
a simplified Ornstein-Uhlenbeck (OU) process. A R file that contains the
simulation code and a folder that contains the simulated data are available
in the `inst/extdata/example/` folder within the `queeems` package. The
folder also contains all the external data files relevant to any of the
discussions presented here.

In general, genetic sequence alignments were simulated on a balanced
bifurcating evolutionary tree that has 64 leaves. For each data simulated,
the length of the internal and the terminal branches were all kept equal
to the same value. Some of the other necessary `scoup` inputs that are
common for all the simulated data analysed here, are presented in the
table below.

***
|                                               |    |
|:----------------------------------------------|---:|
|*Number of terminal tree branches*             |  64|
|*Number of codon sites*                        | 100|
|*Effective population size*                    |1000|
|*Variance of synonymous selection coefficients*|0.01|
|*OU reversion rate*                            |0.00|
|*OU asymptotic variance*                       |0.00|
|*Number of simulated data replicate per query* |   5|
Table: Inputs of the `scoup` genetic sequence simulator that were kept
unchanged for all the executions that generated the data sets analysed
with `queeems` herein. Any other input of the simulator not listed in
the table were left at their default values.

# Example application of `queeems`
There are two primary uses of `queeems`: (a.) to generate molecular entropy
estimates and (b.) to make inference about the saturation status of genetic
data. I demonstrate both cases by considering two sources of molecular
substitution saturation: (a.) time since divergence from common ancestor
as measured by tree lengths [@xia2002] and (b.) speed of molecular
substitution as quantified by the magnitude of positive natural selection
pressure [@manuel2025].

## Tree length analyses
In the code chunk below, I used the `queeems::seqSaturation` function to
obtain site-wise and overall Bayes factors so that I may assess the
saturation level of the corresponding simulated sequences. I also used the
`queeems::molentropy` function to obtain @shannon1948 information entropy
indices with respect to nucleotide, codon and amino acid bases. In addition
to the tabulated `scoup` inputs, with respect to all the data sets
generated for this section, the variance of non-synonymous selection
coefficients ($\sigma_{`n`}^{2}$) was set equal to `0.25`. `queeems` is
designed with optimal user flexibility in mind. Therefore, there is need
to initiate the packages' operations with certain parameter values. The
parameter settings that produced the results for this section are listed
in the table below.

***
|                                    |     |
|------------------------------------|----:|
|*Inference approach*                |  "A"|
|*Discretised weight size*           |    6|
|*Proportion of weights to be $<$ 1* | 0.40|
|*Weight upper-bound*                | 7.00|
|*Bayes factor threshold*            | 3.00|
|*Tolerance for saturated sites*     | 0.01|
Table: Parameter settings used to initiate the `queeems` analyses
of saturation influenced by tree lengths. Readers should consult
the function manual available with the package, for extensive
discussions about the implications of the listed inputs.

```{r TL queeems, eval=TRUE}
# ><>< # Load package into current R session
library("queeems")

# ><>< # Create useful summary function
infoests <- function(xdata){
    serr <- sd(xdata) / sqrt( length(xdata))
    avg <- mean( xdata )
    low <- avg  -  serr
    upp <- avg  +  serr
    outs <- c(low=low, avg=avg, upp=upp)
    return( outs )
}

# ><>< # Retrieve path to simulated tutorial data folder
datadir <- system.file("extdata/example/stemL/", package="queeems")

# ><>< # Indicate number of simulated data replicates
nreps <- 5

# ><>< # Set preferred Bayes factor threshold
bflimit <- 3

# ><>< # Set preferred saturated sites tolerance
ssprop <- 0.01

# Branch length
blent <- seq(0.03, 0.15, 0.03)
btags <- sprintf("%02.0f", blent*100)

# ><>< # Create data frames to save interesting outputs
cnames <- paste("bl", btags, sep="")
rnames <- c("low","avg","upp")
vnames <- paste("rep.", seq(1,nreps), sep="")
homoBF <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames))
sbayes <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames))
smrybx <- matrix(NA, 3, length(btags), dimnames=list(rnames,cnames))
cdnTmp <- aasTmp <- vector("numeric", 100*nreps)
nucTmp <- vector("numeric", 300*nreps)
nucent <- codent <- aasent <- smrybx

for(a1 in seq(1,length(btags))){
    # ><>< # Read merged molecular sequences data as text
    seqsname <- file.path(datadir, paste("seqs",btags[a1],".txt", sep=""))
    seqstext <- readLines(seqsname)

    # ><>< # Extract information about number of simulated replicates
    sbreaks <- which(seqstext == "  64  300")

    # ><>< # Analyse data replicates captured from merged data file
    for(a0 in seq(1,nreps)){
        nucentID <- seq(1+(a0-1)*300, a0*300)
        entropyID <- seq(1+(a0-1)*100, a0*100)

        # ><>< # Extract and temporarily save a sequence replicate
        init <- sbreaks[a0] + 1
        halt <- sbreaks[a0] + 128
        seqrep <- seqstext[seq(init,halt)]
        tempfile <- file.path(tempdir(), "iseqs.fasta")
        write.table(seqrep, file=tempfile, append=FALSE,
            quote=FALSE, row.names=FALSE, col.names=FALSE)

        # ><>< # Execute saturation analysis
        satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop)

        # ><>< # Extract Bayes factors
        homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",])
        sbayes[a0,a1] <- sum(BFs(satest) > bflimit)

        # ><>< # Execute codon entropy analysis
        tempentropy <- molentropy(tempfile, "codon", "shannon")
        cdnTmp[entropyID] <- sitentropies(tempentropy)

        # ><>< # Process nucleotide entropy analysis
        tempentropy <- molentropy(tempfile, "dna", "shannon")
        nucTmp[nucentID] <- sitentropies(tempentropy)

        # ><>< # Transform nucleotide to amino acid sequences
        dnaformat <- Biostrings::readDNAStringSet(tempfile)
        aaformat <- Biostrings::translate(dnaformat)
        Biostrings::writeXStringSet(aaformat, tempfile)

        # ><>< # Implement amino acid entropy analysis
        tempentropy <- molentropy(tempfile, "aa", "shannon")
        aasTmp[entropyID] <- sitentropies(tempentropy)
    }
    # ><>< # Summarise and save entropy values
    nucent[rnames,a1] <- infoests( nucTmp )[rnames]
    codent[rnames,a1] <- infoests( cdnTmp )[rnames]
    aasent[rnames,a1] <- infoests( aasTmp )[rnames]

    # ><>< # Provide analysis progress update
    message("Completed analyses of stem length 0.", btags[a1])
}
# ><>< # Site-wise Bayes factors peak
print( head( BFs(satest)))

# ><>< # Saturated site counts
print(sbayes)

# ><>< # Overall Bayes factor estimates
print(homoBF)
```

Next, I present visual summaries of `queeems` analyses output. All the
Bayes factors are based on the alternative hypothesis that the
corresponding sequences are saturated. Therefore, it is expected that
the Bayes factor estimates will increase as the tree length increases.
Tree length is calculated as the distance from the origin of the rooted
balanced bifurcating phylogeny to the tip of the terminal branches. The
increasing pattern in Figure A, consequently demonstrates that the outputs
from `queeems` are reliable.

```{r treeLength BFs, eval=TRUE}
# ><>< # Function that makes plotting easier
tfunc <- function(i, pty, clr, cx, mat, agl=90){
    arrows(i, mat["low",i], i, mat["upp",i], .1, agl, 3, colors()[clr], lwd=3)
    points(i,mat["avg",i],pch=pty,lwd=2,col=colors()[clr],cex=cx,bg="white")
    return(0)
}

# ><>< # Summarise overall Bayes factor values
bfsumm <- apply(homoBF,2,infoests)

par(mar=c(4.2,4.2,0.2,0.2))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
    xlim=c(1.0,5.0), ylim=c(0.0,7.0), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0)
axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.2, cex.axis=1.5)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.2)
text(1.3, 6.5, "A", cex=2.5, font=2)
mtext("Bayes factor", 2, 2.2, cex=2)
mtext("Tree Length", 1, 2.4, cex=2)
box(lwd=2)
```

The inference from Figure A is interrogated further by plotting the Shannon
entropy estimates, presented in Figure A.1 and the site-wise Bayes factors,
presented in Figure A.2. To ensure accurate comparisons, the entropy
estimates are scaled with the maximum possible values. The maximum Shannon
entropy for a sequence with `n` possible bases is $-\log(1/n)$. For
nucleotide, sense codon and amino acid bases, `n = 4`, `n = 61` and
`n = 20`, respectively.

The points in the Figure A.1 are the average and the arrow widths are
proportional to the standard errors. The plot justifies the choice of codon
as a robust unit for quantifying saturation. Amino acid appear to be moving
faster to saturation (that is, closer to one) for shorter lengths. As the
lengths get longer, nucleotide bases appears to close the gap with amino
acids and most probably surpass it. The entropy analyses presented here
were obtained by applying the popular @shannon1948 method on the base
frequencies. Other counting metrics are accessible in `queeems`. These
include: the `cncsentropy` and the `codondifferindex` functions. The
package is also designed in a way that permits users to opt for `Renyi`
entropy calculation method, if preferred.  Readers may consult the help
pages of the respective functions for further details.

The histogram in Figure A.2 depicts the distribution of the site-wise Bayes
factors. For brevity, only the output for the fifth replicate for the 1.05
tree length analysis is shown. As expected for non-saturated sequences, the
distribution is positively skewed with majority of the sites returning Bayes
factors less than 3.0. This implies that the corresponding genetic sequence
is most likely not saturated. 

```{r entropy, eval=TRUE}
# ><>< # Scale entropy estimates by the maximum value
eScaledNuc <- nucent / (-log(1/4))
eScaledCodon <- codent / (-log(1/61))
eScaledAmino <- aasent / (-log(1/20))

# ><>< # Plot distributions of entropy estimates
par(mar=c(4.2,4.2,0.2,0.2))
bases <- c("Nucleotide","Codon","Amino acid")
plot(0, 0, col="white", bty="n", xlab="", ylab="",
    xlim=c(1.0,5.0), ylim=c(.19,0.57), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 0, 17, 1.5, eScaledNuc, FUN.VALUE=0)
phold <- vapply(seq(1,5), tfunc, 1, 60, 1.5, eScaledCodon, FUN.VALUE=0)
phold <- vapply(seq(1,5), tfunc, 8, 74, 1.5, eScaledAmino, FUN.VALUE=0)
axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
text(rep(3.7,3), c(.27,.24,.21)-.002, bases, cex=1.5, pos=4)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
points(rep(3.6,3), c(.27,.24,.21), cex=2.5, lwd=2,
    pch=c(0,1,8), col=colors()[c(17,60,74)])
text(3.8, 0.30, "Base", cex=2, font=2)
mtext("Scaled Entropy", 2, 2.4, cex=2)
text(1.3, 0.54, "A.1", cex=2.5, font=2)
mtext("Tree Length", 1, 2.2, cex=2)
box(lwd=2)
```

```{r site BFs, eval=TRUE}
# ><>< # Site-wise Bayes factors from replicate 5 of length 1.05
par(mar=c(4.2,4.2,0.2,0.2))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
    xlim=c(.6,4.4), ylim=c(.03,0.85), xaxt="n", yaxt="n")
mtext(c("Site-Wise Bayes Factors","Density"), c(1,2), 2.4, cex=2)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
hist(BFs(satest), freq=FALSE, add=TRUE, xaxt="n",
    yaxt="n", main="", col=colors()[17], lwd=2)
axis(1, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
text(4.1, 0.81, "A.2", cex=2, font=2)
box(lwd=2)
```

## Natural selection analyses
The parameters input for the analyses in this section are the same as those
used for the tree length analyses. The tree length was however fixed as
`0.7` for all simulations. Non-synonymous selection coefficient variance
($\sigma_{`n`}^{2}$) is the variable in this section and values considered
are, (0.1, 0.2, 0.3, 0.4, 0.5). Figure B contains summaries of the
corresponding overall sequence Bayes factors. The points represent the
averages and the arrows are proportional to the standard errors estimated
over the five data replicates analysed for each $\sigma_{`n`}^{2}$ value.
The analytical estimates of the ratio of non-synonymous to synonymous
substitution rates ($\mathrm{d}N/\mathrm{d}S$) were obtained along with
the sequences from `scoup`, where they were calculated using expressions
from @spielman2015.

```{r dnds test, eval=TRUE}
# ><>< # Retrieve path to simulated tutorial data folder
datadir <- system.file("extdata/example/varNS/", package="queeems")

# Non-synonymous variance value
nsvary <- seq(1, 5) / 10
ntags <- sprintf("%.0f", nsvary*10)

# ><>< # Create data frames to save interesting outputs
cnames <- paste("ns", ntags, sep="")
rnames <- c("low","avg","upp")
homoBF <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames))
sbayes <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames))

for(a1 in seq(1,length(ntags))){
    # ><>< # Read merged molecular sequences data as text
    seqsname <- file.path(datadir, paste("gdata",ntags[a1],".txt",sep=""))
    seqstext <- readLines(seqsname)

    # ><>< # Extract information about number of simulated replicates
    sbreaks <- which(seqstext == "  64  300")

    # ><>< # Analyse data replicates captured from merged data file
    for(a0 in seq(1,nreps)){

        # ><>< # Extract and temporarily save a sequence replicate
        init <- sbreaks[a0] + 1
        halt <- sbreaks[a0] + 128
        seqrep <- seqstext[seq(init,halt)]
        tempfile <- file.path(tempdir(), "iseqs.fasta")
        write.table(seqrep, file=tempfile, append=FALSE,
            quote=FALSE, row.names=FALSE, col.names=FALSE)

        # ><>< # Execute saturation analysis
        satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop)

        # ><>< # Extract Bayes factors
        homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",])
        sbayes[a0,a1] <- sum(BFs(satest) > bflimit)
    }
    # ><>< # Provide analysis progress update
    message("Completed analyses for \U03c3\U00b2 = 0.", ntags[a1])
}
# ><>< # Overall log(Bayes factor) estimates
print( log(homoBF))

# ><>< # Summarise overall Bayes factor and dN/dS values
bfsumm <- apply(log(homoBF),2,infoests)
dndsPath <- file.path(datadir, "dnds.csv")
dndses <- read.table(dndsPath, sep=";", header=TRUE)

# ><>< # Facilitate super-imposition of dN/dS and BFs
transplot <- function(x, xrng, newrng){
    xmin <- xrng[1]
    ymin <- newrng[1]
    xdiff <- diff(xrng)
    ydiff <- diff(newrng)
    newx <- (((x - xmin) / xdiff) * ydiff) + ymin
    return(newx)
}
newaxs <- seq(4,9)
oldaxs <- transplot(newaxs, c(5,9), c(-.4,2.3))
newDnDs <- transplot(dndses, c(5,9), c(-.4,2.3))
dndsumm <- apply(newDnDs, 2, infoests)

# ><>< # Add summaries of the BF values to plot
par(mar=c(4.4,4.2,0.2,3.0))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
    xlim=c(1.0,5.0), ylim=c(-.4,2.3), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0)
axis(1, seq(1,5), nsvary, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
mtext(expression(sigma[n]^2), 1, 3.4, cex=2)
mtext("Log(Bayes factor)", 2, 2.5, cex=2)
text(1.2, 2.2, "B", cex=2.5, font=2)

# ><>< # Include summaries of the dN/dS values
phold <- vapply(seq(1,5), tfunc, 23, 17, 2, dndsumm, 60, FUN.VALUE=0)
axis(4, oldaxs, newaxs, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
mtext("dN/dS", 4, 2.0, cex=2)

# ><>< # Add legend to plot
text(c(2,2), c(2.2,2.0), c("BF","dN/dS"), cex=1.8, pos=4)
arrows(1.7, 2.21, 2.0, 2.21, .1, 90, 3, colors()[50], 1, 3)
arrows(1.7, 2.0, 2.0, 2.0, .1, 60, 3, colors()[17], 1, 3)
points(c(1.85,1.85), c(2.21,2.0), cex=2, bg="white",
    lwd=3, pch=c(15,23), col=colors()[c(50,17)])
box(lwd=2)
```

It is illustrated in Figure B that higher values of $\sigma_{`n`}^{2}$
translates to higher magnitude of positive natural selection pressure,
as measured by $\mathrm{d}N/\mathrm{d}S$. Therefore, as expected, the
magnitude of the corresponding Bayes factors mostly increased accordingly.
This inference further supports the authenticity of the inferences
generated by the `queeems` package.

# Conclusion
I demonstrated that `queeems` avails increased flexibility that enables
researchers to investigate genetic sequence saturation with better insights.
The choice of the `queeems` parameters applied for the applications
presented here are arbitrary. I appreciate that separate studies
may require different values depending on the tolerance afforded by the
conditions of the investigations being undertaken. A study about the
sensitivity of the parameters is nearing completion.

# Citation
A more appropriate citation will be provided for the package in due course.
In the meantime, kindly cite the package as, <em>Sadiq, H. 2026. "queeems:
Quantify the Extent of Evolutionary Evidence in Molecular Sequences".
Bioconductor R Package.</em>

# References

<div id="refs"></div>

# Session info

The output of `sessionInfo()` from the computer where this file is generated
is provided below.

```{r sessionInfo, echo=FALSE}
sessionInfo()
```