queeems is a R (???) package hosted on the Bioconductor (???)
platform. The package is useful for assessing the quality of genetic data
that are meant for evolutionary investigations. Some invaluable methods
and resources are available in the literature for similar purpose. These
include: DAMBE (???), SatuTe (???), TreSpEx (???),
PAUP (???), RASA (???) and MUST (???). Interested
reader may consult (???) for a short review of some of the works.
These existing techniques are designed primarily for investigations related
to phylogeny reconstruction and many of them rely on adhoc frequentist-based
hypotheses tests. queeems contains Bayesian functions that can be used
to assess the quality of genetic data that are intended for use in
phylogeny reconstruction or in broader evolutionary inquiries, including
natural selection investigations. In this practical application of some
of the primary functions in the package, I present analyses of simulated
sequences to illustrate how to get the most out of this resource.
Use the following code to install queeems from the Bioconductor platform.
if(!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("queeems")
The developmental version of the package that contains the most recent
updates, is available under the devel tab of the Bioconductor platform
and can be installed by running the following code, just before the final
install line above: BiocManager::install(version="devel").
The molecular sequences analysed herein are generated with the scoup
(???) package that allows genetic sequences to be simulated following
a simplified Ornstein-Uhlenbeck (OU) process. A R file that contains the
simulation code and a folder that contains the simulated data are available
in the inst/extdata/example/ folder within the queeems package. The
folder also contains all the external data files relevant to any of the
discussions presented here.
In general, genetic sequence alignments were simulated on a balanced
bifurcating evolutionary tree that has 64 leaves. For each data simulated,
the length of the internal and the terminal branches were all kept equal
to the same value. Some of the other necessary scoup inputs that are
common for all the simulated data analysed here, are presented in the
table below.
| Number of terminal tree branches | 64 |
| Number of codon sites | 100 |
| Effective population size | 1000 |
| Variance of synonymous selection coefficients | 0.01 |
| OU reversion rate | 0.00 |
| OU asymptotic variance | 0.00 |
| Number of simulated data replicate per query | 5 |
queeemsThere are two primary uses of queeems: (a.) to generate molecular entropy
estimates and (b.) to make inference about the saturation status of genetic
data. I demonstrate both cases by considering two sources of molecular
substitution saturation: (a.) time since divergence from common ancestor
as measured by tree lengths (???) and (b.) speed of molecular
substitution as quantified by the magnitude of positive natural selection
pressure (???).
In the code chunk below, I used the queeems::seqSaturation function to
obtain site-wise and overall Bayes factors so that I may assess the
saturation level of the corresponding simulated sequences. I also used the
queeems::molentropy function to obtain (???) information entropy
indices with respect to nucleotide, codon and amino acid bases. In addition
to the tabulated scoup inputs, with respect to all the data sets
generated for this section, the variance of non-synonymous selection
coefficients (\(\sigma_{`n`}^{2}\)) was set equal to 0.25. queeems is
designed with optimal user flexibility in mind. Therefore, there is need
to initiate the packages’ operations with certain parameter values. The
parameter settings that produced the results for this section are listed
in the table below.
| Inference approach | “A” |
| Discretised weight size | 6 |
| Proportion of weights to be \(<\) 1 | 0.40 |
| Weight upper-bound | 7.00 |
| Bayes factor threshold | 3.00 |
| Tolerance for saturated sites | 0.01 |
# ><>< # Load package into current R session
library("queeems")
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Seqinfo
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
# ><>< # Create useful summary function
infoests <- function(xdata){
serr <- sd(xdata) / sqrt( length(xdata))
avg <- mean( xdata )
low <- avg - serr
upp <- avg + serr
outs <- c(low=low, avg=avg, upp=upp)
return( outs )
}
# ><>< # Retrieve path to simulated tutorial data folder
datadir <- system.file("extdata/example/stemL/", package="queeems")
# ><>< # Indicate number of simulated data replicates
nreps <- 5
# ><>< # Set preferred Bayes factor threshold
bflimit <- 3
# ><>< # Set preferred saturated sites tolerance
ssprop <- 0.01
# Branch length
blent <- seq(0.03, 0.15, 0.03)
btags <- sprintf("%02.0f", blent*100)
# ><>< # Create data frames to save interesting outputs
cnames <- paste("bl", btags, sep="")
rnames <- c("low","avg","upp")
vnames <- paste("rep.", seq(1,nreps), sep="")
homoBF <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames))
sbayes <- matrix(NA, nreps, length(btags), dimnames=list(vnames,cnames))
smrybx <- matrix(NA, 3, length(btags), dimnames=list(rnames,cnames))
cdnTmp <- aasTmp <- vector("numeric", 100*nreps)
nucTmp <- vector("numeric", 300*nreps)
nucent <- codent <- aasent <- smrybx
for(a1 in seq(1,length(btags))){
# ><>< # Read merged molecular sequences data as text
seqsname <- file.path(datadir, paste("seqs",btags[a1],".txt", sep=""))
seqstext <- readLines(seqsname)
# ><>< # Extract information about number of simulated replicates
sbreaks <- which(seqstext == " 64 300")
# ><>< # Analyse data replicates captured from merged data file
for(a0 in seq(1,nreps)){
nucentID <- seq(1+(a0-1)*300, a0*300)
entropyID <- seq(1+(a0-1)*100, a0*100)
# ><>< # Extract and temporarily save a sequence replicate
init <- sbreaks[a0] + 1
halt <- sbreaks[a0] + 128
seqrep <- seqstext[seq(init,halt)]
tempfile <- file.path(tempdir(), "iseqs.fasta")
write.table(seqrep, file=tempfile, append=FALSE,
quote=FALSE, row.names=FALSE, col.names=FALSE)
# ><>< # Execute saturation analysis
satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop)
# ><>< # Extract Bayes factors
homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",])
sbayes[a0,a1] <- sum(BFs(satest) > bflimit)
# ><>< # Execute codon entropy analysis
tempentropy <- molentropy(tempfile, "codon", "shannon")
cdnTmp[entropyID] <- sitentropies(tempentropy)
# ><>< # Process nucleotide entropy analysis
tempentropy <- molentropy(tempfile, "dna", "shannon")
nucTmp[nucentID] <- sitentropies(tempentropy)
# ><>< # Transform nucleotide to amino acid sequences
dnaformat <- Biostrings::readDNAStringSet(tempfile)
aaformat <- Biostrings::translate(dnaformat)
Biostrings::writeXStringSet(aaformat, tempfile)
# ><>< # Implement amino acid entropy analysis
tempentropy <- molentropy(tempfile, "aa", "shannon")
aasTmp[entropyID] <- sitentropies(tempentropy)
}
# ><>< # Summarise and save entropy values
nucent[rnames,a1] <- infoests( nucTmp )[rnames]
codent[rnames,a1] <- infoests( cdnTmp )[rnames]
aasent[rnames,a1] <- infoests( aasTmp )[rnames]
# ><>< # Provide analysis progress update
message("Completed analyses of stem length 0.", btags[a1])
}
## Completed analyses of stem length 0.03
## Completed analyses of stem length 0.06
## Completed analyses of stem length 0.09
## Completed analyses of stem length 0.12
## Completed analyses of stem length 0.15
# ><>< # Site-wise Bayes factors peak
print( head( BFs(satest)))
## [1] 1.957368 2.039552 1.311460 1.100639 1.297898 1.487825
# ><>< # Saturated site counts
print(sbayes)
## bl03 bl06 bl09 bl12 bl15
## rep.1 0 1 2 2 3
## rep.2 1 0 1 1 1
## rep.3 0 1 0 3 4
## rep.4 0 1 2 1 2
## rep.5 1 2 1 0 2
# ><>< # Overall Bayes factor estimates
print(homoBF)
## bl03 bl06 bl09 bl12 bl15
## rep.1 0.3660 0.9109 1.4864 1.4864 3.7668
## rep.2 0.9109 0.3660 0.9109 0.9109 0.9109
## rep.3 0.3660 0.9109 0.3660 3.7668 13.5471
## rep.4 0.3660 0.9109 1.4864 0.9109 1.4864
## rep.5 0.9109 1.4864 0.9109 0.3660 1.4864
Next, I present visual summaries of queeems analyses output. All the
Bayes factors are based on the alternative hypothesis that the
corresponding sequences are saturated. Therefore, it is expected that
the Bayes factor estimates will increase as the tree length increases.
Tree length is calculated as the distance from the origin of the rooted
balanced bifurcating phylogeny to the tip of the terminal branches. The
increasing pattern in Figure A, consequently demonstrates that the outputs
from queeems are reliable.
# ><>< # Function that makes plotting easier
tfunc <- function(i, pty, clr, cx, mat, agl=90){
arrows(i, mat["low",i], i, mat["upp",i], .1, agl, 3, colors()[clr], lwd=3)
points(i,mat["avg",i],pch=pty,lwd=2,col=colors()[clr],cex=cx,bg="white")
return(0)
}
# ><>< # Summarise overall Bayes factor values
bfsumm <- apply(homoBF,2,infoests)
par(mar=c(4.2,4.2,0.2,0.2))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
xlim=c(1.0,5.0), ylim=c(0.0,7.0), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0)
axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.2, cex.axis=1.5)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.2)
text(1.3, 6.5, "A", cex=2.5, font=2)
mtext("Bayes factor", 2, 2.2, cex=2)
mtext("Tree Length", 1, 2.4, cex=2)
box(lwd=2)
The inference from Figure A is interrogated further by plotting the Shannon
entropy estimates, presented in Figure A.1 and the site-wise Bayes factors,
presented in Figure A.2. To ensure accurate comparisons, the entropy
estimates are scaled with the maximum possible values. The maximum Shannon
entropy for a sequence with n possible bases is \(-\log(1/n)\). For
nucleotide, sense codon and amino acid bases, n = 4, n = 61 and
n = 20, respectively.
The points in the Figure A.1 are the average and the arrow widths are
proportional to the standard errors. The plot justifies the choice of codon
as a robust unit for quantifying saturation. Amino acid appear to be moving
faster to saturation (that is, closer to one) for shorter lengths. As the
lengths get longer, nucleotide bases appears to close the gap with amino
acids and most probably surpass it. The entropy analyses presented here
were obtained by applying the popular (???) method on the base
frequencies. Other counting metrics are accessible in queeems. These
include: the cncsentropy and the codondifferindex functions. The
package is also designed in a way that permits users to opt for Renyi
entropy calculation method, if preferred. Readers may consult the help
pages of the respective functions for further details.
The histogram in Figure A.2 depicts the distribution of the site-wise Bayes factors. For brevity, only the output for the fifth replicate for the 1.05 tree length analysis is shown. As expected for non-saturated sequences, the distribution is positively skewed with majority of the sites returning Bayes factors less than 3.0. This implies that the corresponding genetic sequence is most likely not saturated.
# ><>< # Scale entropy estimates by the maximum value
eScaledNuc <- nucent / (-log(1/4))
eScaledCodon <- codent / (-log(1/61))
eScaledAmino <- aasent / (-log(1/20))
# ><>< # Plot distributions of entropy estimates
par(mar=c(4.2,4.2,0.2,0.2))
bases <- c("Nucleotide","Codon","Amino acid")
plot(0, 0, col="white", bty="n", xlab="", ylab="",
xlim=c(1.0,5.0), ylim=c(.19,0.57), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 0, 17, 1.5, eScaledNuc, FUN.VALUE=0)
phold <- vapply(seq(1,5), tfunc, 1, 60, 1.5, eScaledCodon, FUN.VALUE=0)
phold <- vapply(seq(1,5), tfunc, 8, 74, 1.5, eScaledAmino, FUN.VALUE=0)
axis(1, seq(1,5), seq(3,15,3)/100*7, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
text(rep(3.7,3), c(.27,.24,.21)-.002, bases, cex=1.5, pos=4)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
points(rep(3.6,3), c(.27,.24,.21), cex=2.5, lwd=2,
pch=c(0,1,8), col=colors()[c(17,60,74)])
text(3.8, 0.30, "Base", cex=2, font=2)
mtext("Scaled Entropy", 2, 2.4, cex=2)
text(1.3, 0.54, "A.1", cex=2.5, font=2)
mtext("Tree Length", 1, 2.2, cex=2)
box(lwd=2)
# ><>< # Site-wise Bayes factors from replicate 5 of length 1.05
par(mar=c(4.2,4.2,0.2,0.2))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
xlim=c(.6,4.4), ylim=c(.03,0.85), xaxt="n", yaxt="n")
mtext(c("Site-Wise Bayes Factors","Density"), c(1,2), 2.4, cex=2)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
hist(BFs(satest), freq=FALSE, add=TRUE, xaxt="n",
yaxt="n", main="", col=colors()[17], lwd=2)
axis(1, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
text(4.1, 0.81, "A.2", cex=2, font=2)
box(lwd=2)
The parameters input for the analyses in this section are the same as those
used for the tree length analyses. The tree length was however fixed as
0.7 for all simulations. Non-synonymous selection coefficient variance
(\(\sigma_{`n`}^{2}\)) is the variable in this section and values considered
are, (0.1, 0.2, 0.3, 0.4, 0.5). Figure B contains summaries of the
corresponding overall sequence Bayes factors. The points represent the
averages and the arrows are proportional to the standard errors estimated
over the five data replicates analysed for each \(\sigma_{`n`}^{2}\) value.
The analytical estimates of the ratio of non-synonymous to synonymous
substitution rates (\(\mathrm{d}N/\mathrm{d}S\)) were obtained along with
the sequences from scoup, where they were calculated using expressions
from (???).
# ><>< # Retrieve path to simulated tutorial data folder
datadir <- system.file("extdata/example/varNS/", package="queeems")
# Non-synonymous variance value
nsvary <- seq(1, 5) / 10
ntags <- sprintf("%.0f", nsvary*10)
# ><>< # Create data frames to save interesting outputs
cnames <- paste("ns", ntags, sep="")
rnames <- c("low","avg","upp")
homoBF <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames))
sbayes <- matrix(NA, nreps, length(ntags), dimnames=list(NULL,cnames))
for(a1 in seq(1,length(ntags))){
# ><>< # Read merged molecular sequences data as text
seqsname <- file.path(datadir, paste("gdata",ntags[a1],".txt",sep=""))
seqstext <- readLines(seqsname)
# ><>< # Extract information about number of simulated replicates
sbreaks <- which(seqstext == " 64 300")
# ><>< # Analyse data replicates captured from merged data file
for(a0 in seq(1,nreps)){
# ><>< # Extract and temporarily save a sequence replicate
init <- sbreaks[a0] + 1
halt <- sbreaks[a0] + 128
seqrep <- seqstext[seq(init,halt)]
tempfile <- file.path(tempdir(), "iseqs.fasta")
write.table(seqrep, file=tempfile, append=FALSE,
quote=FALSE, row.names=FALSE, col.names=FALSE)
# ><>< # Execute saturation analysis
satest <- seqSaturation(tempfile, "A", 6, 0.4, 7, bflimit, ssprop)
# ><>< # Extract Bayes factors
homoBF[a0,a1] <- as.numeric( summary(satest)["fullBF",])
sbayes[a0,a1] <- sum(BFs(satest) > bflimit)
}
# ><>< # Provide analysis progress update
message("Completed analyses for \U03c3\U00b2 = 0.", ntags[a1])
}
## Completed analyses for σ² = 0.1
## Completed analyses for σ² = 0.2
## Completed analyses for σ² = 0.3
## Completed analyses for σ² = 0.4
## Completed analyses for σ² = 0.5
# ><>< # Overall log(Bayes factor) estimates
print( log(homoBF))
## ns1 ns2 ns3 ns4 ns5
## [1,] -1.00512195 -0.09332216 -0.09332216 2.6061725 -0.09332216
## [2,] 1.32622583 0.39635709 -0.09332216 0.3963571 -0.09332216
## [3,] -0.09332216 4.14920673 2.60617250 1.3262258 4.14920673
## [4,] -0.09332216 -1.00512195 -1.00512195 0.3963571 2.60617250
## [5,] -0.09332216 -0.09332216 0.39635709 -1.0051219 0.39635709
# ><>< # Summarise overall Bayes factor and dN/dS values
bfsumm <- apply(log(homoBF),2,infoests)
dndsPath <- file.path(datadir, "dnds.csv")
dndses <- read.table(dndsPath, sep=";", header=TRUE)
# ><>< # Facilitate super-imposition of dN/dS and BFs
transplot <- function(x, xrng, newrng){
xmin <- xrng[1]
ymin <- newrng[1]
xdiff <- diff(xrng)
ydiff <- diff(newrng)
newx <- (((x - xmin) / xdiff) * ydiff) + ymin
return(newx)
}
newaxs <- seq(4,9)
oldaxs <- transplot(newaxs, c(5,9), c(-.4,2.3))
newDnDs <- transplot(dndses, c(5,9), c(-.4,2.3))
dndsumm <- apply(newDnDs, 2, infoests)
# ><>< # Add summaries of the BF values to plot
par(mar=c(4.4,4.2,0.2,3.0))
plot(0, 0, col="white", bty="n", xlab="", ylab="",
xlim=c(1.0,5.0), ylim=c(-.4,2.3), xaxt="n", yaxt="n")
phold <- vapply(seq(1,5), tfunc, 15, 50, 2.5, bfsumm, FUN.VALUE=0)
axis(1, seq(1,5), nsvary, tck=-.01, lwd=2, padj=-.5, cex.axis=1.5)
axis(2, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
mtext(expression(sigma[n]^2), 1, 3.4, cex=2)
mtext("Log(Bayes factor)", 2, 2.5, cex=2)
text(1.2, 2.2, "B", cex=2.5, font=2)
# ><>< # Include summaries of the dN/dS values
phold <- vapply(seq(1,5), tfunc, 23, 17, 2, dndsumm, 60, FUN.VALUE=0)
axis(4, oldaxs, newaxs, las=1, tck=-0.01, lwd=2, cex.axis=1.5, hadj=0.7)
mtext("dN/dS", 4, 2.0, cex=2)
# ><>< # Add legend to plot
text(c(2,2), c(2.2,2.0), c("BF","dN/dS"), cex=1.8, pos=4)
arrows(1.7, 2.21, 2.0, 2.21, .1, 90, 3, colors()[50], 1, 3)
arrows(1.7, 2.0, 2.0, 2.0, .1, 60, 3, colors()[17], 1, 3)
points(c(1.85,1.85), c(2.21,2.0), cex=2, bg="white",
lwd=3, pch=c(15,23), col=colors()[c(50,17)])
box(lwd=2)
It is illustrated in Figure B that higher values of \(\sigma_{`n`}^{2}\)
translates to higher magnitude of positive natural selection pressure,
as measured by \(\mathrm{d}N/\mathrm{d}S\). Therefore, as expected, the
magnitude of the corresponding Bayes factors mostly increased accordingly.
This inference further supports the authenticity of the inferences
generated by the queeems package.
I demonstrated that queeems avails increased flexibility that enables
researchers to investigate genetic sequence saturation with better insights.
The choice of the queeems parameters applied for the applications
presented here are arbitrary. I appreciate that separate studies
may require different values depending on the tolerance afforded by the
conditions of the investigations being undertaken. A study about the
sensitivity of the parameters is nearing completion.
A more appropriate citation will be provided for the package in due course. In the meantime, kindly cite the package as, Sadiq, H. 2026. “queeems: Quantify the Extent of Evolutionary Evidence in Molecular Sequences”. Bioconductor R Package.
The output of sessionInfo() from the computer where this file is generated
is provided below.
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] queeems_0.99.7 Biostrings_2.79.5 Seqinfo_1.1.0
## [4] XVector_0.51.0 IRanges_2.45.0 S4Vectors_0.49.3
## [7] BiocGenerics_0.57.1 generics_0.1.4 BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 cli_3.6.6 knitr_1.51
## [4] magick_2.9.1 rlang_1.2.0 xfun_0.57
## [7] otel_0.2.0 gtools_3.9.5 jsonlite_2.0.0
## [10] htmltools_0.5.9 tinytex_0.59 sass_0.4.10
## [13] rmarkdown_2.31 grid_4.6.0 evaluate_1.0.5
## [16] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.12
## [19] lifecycle_1.0.5 bookdown_0.46 BiocManager_1.30.27
## [22] compiler_4.6.0 Rcpp_1.1.1-1.1 lattice_0.22-9
## [25] digest_0.6.39 R6_2.6.1 magrittr_2.0.5
## [28] Matrix_1.7-5 bslib_0.10.0 tools_4.6.0
## [31] cachem_1.1.0