Title: | Minicircle Sequence Classes (MSC) Analyses |
Version: | 1.3 |
Date: | 2023-06-28 |
Description: | This is a analysis toolkit to streamline the analyses of minicircle sequence diversity in population-scale genome projects. rKOMICS is a user-friendly R package that has simple installation requirements and that is applicable to all 27 trypanosomatid genera. Once minicircle sequence alignments are generated, rKOMICS allows to examine, summarize and visualize minicircle sequence diversity within and between samples through the analyses of minicircle sequence clusters. We showcase the functionalities of the (r)KOMICS tool suite using a whole-genome sequencing dataset from a recently published study on the history of diversification of the Leishmania braziliensis species complex in Peru. Analyses of population diversity and structure highlighted differences in minicircle sequence richness and composition between Leishmania subspecies, and between subpopulations within subspecies. The rKOMICS package establishes a critical framework to manipulate, explore and extract biologically relevant information from mitochondrial minicircle assemblies in tens to hundreds of samples simultaneously and efficiently. This should facilitate research that aims to develop new molecular markers for identifying species-specific minicircles, or to study the ancestry of parasites for complementary insights into their evolutionary history. ***** !! WARNING: this package relies on dependencies from Bioconductor. For Mac users, this can generate errors when installing rKOMICS. Install Bioconductor and ComplexHeatmap at advance: install.packages("BiocManager"); BiocManager::install("ComplexHeatmap") *****. |
License: | GPL-2 | GPL-3 [expanded from: GPL] |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.1 |
Imports: | ggplot2, ape, circlize, ComplexHeatmap, reshape2, utils, stats, dplyr, factoextra, FactoMineR, ggpubr, magrittr, stringr |
Suggests: | viridis |
Depends: | R (≥ 2.10) |
NeedsCompilation: | no |
Packaged: | 2023-06-28 11:25:36 UTC; mgeerts |
Author: | Frederik Van den Broeck [aut], Manon Geerts [aut, cre] |
Maintainer: | Manon Geerts <geertsmanon@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-06-29 22:40:03 UTC |
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Value
No return value, called for side effect
Example dataset
Description
A dataset containing example inputs.
Usage
exData
Format
A list with seven objects
- strain
a character vector containing the strain names.
- subspecies
a factor specifying to which subspecie the strains belong to.
- HCN
a numerical vector with their corresponding median genome wide coverage.
- fastafiles
a character vector containing the file names of the minicircle sequences in fasta format.
- ucs
a character vector containing the file names of the cluster information in uc format.
- mapstats
a character vector containing the files names of the mapping statistics in text format.
- depthstats
a character vector containing the files names of the depth statistics in text format.
Example cluster matrices
Description
A list containing 15 example cluster matrices with percent identities of 80, 85 and 88:100.
Usage
matrices
Format
a list of different cluster matrices.
Check the read depth of assembled minicircles
Description
The msc.depth function allows you to analyze the read depth of assembled minicircles using depth statistics generated by KOMICS. These statistics provide information such as the average, median, minimum, and maximum read depth per site for each minicircle contig. By standardizing the median read depths per minicircle contig to the median genome-wide read depths, you can estimate minicircle copy numbers.
Usage
msc.depth(depthstats, groups, HCN = NULL)
Arguments
depthstats |
character vector containing the file names of the depth statistics generated by KOMICS. Each file should be in the format "sample_name.depthstats.txt" (e.g., sampleA.depthstats.txt, sampleB.depthstats.txt,...). |
groups |
a vector specifying the groups (e.g., species) to which the samples belong. |
HCN |
an optional numeric vector containing haploid copy numbers of the corresponding samples. This argument is set to null by default. |
Value
all |
a table that merges the depth statistics of all samples. The table includes the average, median, minimum, and maximum per site read depth. |
plots |
a plot per sample that visualizes the median read depth distribution. |
medianRD |
a graph summarizing the median read depth distribution of all samples. |
CN |
a graph summarizing the copy number (if HCN is not null) of all samples. |
Examples
require(ggpubr)
data(exData)
### run function
depth <- msc.depth(depthstats = system.file("extdata",
exData$depthstats, package = "rKOMICS"), groups = exData$species,
HCN = exData$medGWD/2)
### visualize results
hist(depth$all[,"MEDIAN.DEPTH"], breaks=100,
main="Global median depth distribution",xlab = (''))
### alter plot
annotate_figure(depth$plots$CUM29A1, fig.lab = "CUM29A1",
fig.lab.pos = "bottom.right", fig.lab.face = 'italic')
Visualization of cluster matrices
Description
The msc.heatmap function generates a heatmap that summarizes the presence or absence of Minicircle Sequence Classes (MCSs) between groups of samples. It takes an input cluster matrix, generated using the msc.matrix function, and visualizes the clustering patterns of MCSs.
Usage
msc.heatmap(clustmatrix, samples, groups)
Arguments
clustmatrix |
a cluster matrix generated with the msc.matrix function. This matrix represents the presence or absence of MCSs in each sample. |
samples |
a vector containing the sample names. |
groups |
a vector specifying the groups (e.g., species) to which the samples belong. |
Value
a heatmap that visualizes the clustering patterns of MCSs between sample groups. The heatmap provides an overview of the presence or absence of MCSs and helps identify shared or distinct MCS patterns among the groups.
Examples
data(exData)
data(matrices)
### run function
msc.heatmap(matrices[["id80"]], groups = exData$species,
samples = exData$samples )
### run function on every cluster matrix with subset of samples
### you will be asked to confirm
table(exData$species)
hybrid <- which(exData$species=="hybrid")
# msc.heatmap(matrices[["id97"]], groups = exData$species[hybrid],
# samples = exData$samples[hybrid])
Length of minicircles
Description
The msc.length function allows you to check the length of minicircle sequences based on a single FASTA file. This function helps determine the size distribution of minicircle sequences.
Usage
msc.length(file, samples, groups)
Arguments
file |
the name of the FASTA file that contains all the minicircle sequences. The file should be in the format "all.minicircles.circ.fasta". |
samples |
a character vector containing the sample names. |
groups |
a vector of the same length as the samples, specifying the groups (e.g., subspecies) to which the samples belong. |
Value
length |
a numerical vector containing the lengths of the minicircle sequences. Each element corresponds to the length of a specific minicircle sequence. |
plot |
a histogram that visualizes the frequency distribution of minicircle sequence lengths. The histogram provides an overview of the length distribution of the minicircles. |
Examples
require(ggplot2)
require(ggpubr)
### run function
bf <- msc.length(file = system.file("extdata", "all.minicircles.fasta", package="rKOMICS"),
samples = exData$samples, groups = exData$subspecies)
af <- msc.length(file = system.file("extdata", "all.minicircles.circ.fasta", package="rKOMICS"),
samples = exData$samples, groups = exData$subspecies)
length(which(bf$length<800))
length(which(bf$length>1400))
### visualize results
hist(af$length, breaks=50)
### alter plot
ggarrange(bf$plot + labs(caption = "Before filtering"),
af$plot + labs(caption = "After filtering"), nrow=2)
Build cluster matrix
Description
The msc.matrix function reads the output of clustering analyses (UC file) for specified minimum percent identity (MPI) values and organizes the data into a matrix format. This matrix represents the presence or absence of Minicircle Sequence Classes (MSCs) in each sample. The resulting matrix simplifies downstream analyses and visualizations by eliminating the need for manual data manipulation and reformatting.
Usage
msc.matrix(files, samples, groups)
Arguments
files |
a character vector containing the names of the UC files generated by the VSEARCH tool. Each file represents the output of clustering analysis for a specific minimum percent identity (MPI), such as all.minicircles.circ.id70.uc, all.minicircles.circ.id80.uc, and so on. Please ensure that your file names end with 'idxx.uc' for this function to work properly. |
samples |
a character vector containing the sample names. |
groups |
a vector of the same length as the samples, specifying the groups (e.g., species) to which the samples belong. |
Value
a a list that contains one cluster matrix per percent identity. Each matrix represents the presence or absence of MSCs in each sample. In the cluster matrix, a value of 0 indicates that the MSC is not present in the sample, while a value higher than 0 indicates that the MSC is found at least once in the sample.
Examples
data(exData)
### run function
matrices <- msc.matrix(files = system.file("extdata", exData$ucs, package="rKOMICS"),
samples = exData$samples,
groups = exData$species)
### or:
data(matrices)
### show matrix with id 95%
matrices[["id95"]]
rowSums(matrices[["id95"]]) # --> frequency of MSC across all samples
colSums(matrices[["id95"]]) # --> number of MSC per sample
Prinicple Component Analysis based on MSC
Description
The msc.pca function allows you to perform Principle Component Analysis (PCA) to summarize the variation of Minicircle Sequence Classes (MSCs) in all samples or in a subset of samples.
Usage
msc.pca(clustmatrix, samples, groups, n = 20, labels = TRUE, title = NULL)
Arguments
clustmatrix |
a cluster matrix obtained from the msc.matrix function. The cluster matrix represents the presence or absence of MSCs in each sample, where rows represent MSCs and columns represent samples. |
samples |
a vector containing the names of the samples. This can include all samples or a subset of samples that you want to analyze. |
groups |
a vector specifying the groups (e.g., species) to which the samples belong. |
n |
the number of clusters to select with the highest contribution to PCA. By default, it is set to 20. |
labels |
a logical parameter indicating whether to use labels on the PCA plot or not. If set to TRUE (default), the plot will display sample labels. |
title |
the title of the graph. You can provide a title for the PCA plot if desired. |
Value
plot |
a PCA plot that visualizes the clustering of samples based on the presence/absence of MSCs. The plot helps identify clusters and patterns of similarity or dissimilarity between samples. |
eigenvalues |
a barplot showing the percentage of explained variances by each principal component. This plot provides insights into the contribution of each principal component to the overall variation in the data. |
clustnames |
a A list of cluster names with the highest contribution to PCA. This list helps identify the MSC clusters that have the most influence on the PCA results. |
Examples
data(matrices)
data(exData)
### run function with all samples
res.pca <- lapply(matrices, function(x) msc.pca(x, samples = exData$samples,
groups = exData$species, n=30, labels=FALSE, title=NULL))
res.pca$id95$eigenvalues
res.pca$id95$plot
### use clusters with highest contribution to visualize in a heatmap
msc.heatmap(matrices[["id95"]][res.pca$id95$clustnames,], samples = exData$samples,
groups = exData$species)
### run function with a subset of samples
### you will be asked to confirm
table(exData$species)
hybrid <- which(exData$species=="hybrid")
# pca.subset <- msc.pca(clustmatrix = matrices[["id97"]],
# samples = exData$samples[hybrid],
# groups = exData$species[hybrid], labels = TRUE,
# title = "PCA only with hybrids")
Check the quality of the assembly
Description
The msc.quality function allows you to summarize mapping statistics generated by KOMICS to assess the quality of minicircle assembly and mapping. It focuses on examining the frequencies of various mapped read categories, including read frequency, mapped read frequency, high-quality mapped read frequency, and the proportion of (near-)perfect alignments of CSB3-containing reads.
Usage
msc.quality(mapstats, groups)
Arguments
mapstats |
a character vector containing the file names of mapping statistics generated by KOMICS. These files provide information about the mapping and quality of the assembled minicircles. |
groups |
a vector specifying the groups (e.g., species) to which the samples belong. |
Value
all |
a table merging the mapping statistics of all samples. The mapping statistics include the number of mapped reads (MR), mapped reads with high quality (MR_HQ), CSB3-containing mapped reads (MR_CSB3), and CSB3-containing mapped reads with high quality (MR_CSB3_HQ). |
proportions |
a list of tables containing the proportions of the mentioned mapping statistics. It provides insights into the relative frequencies and proportions of each category. |
plots |
barplots visualizing the mapping statistics and proportions. These plots help visualize and compare the mapping quality across samples. |
Examples
data(exData)
### run function
map <- msc.quality(mapstats = system.file("extdata", exData$mapstats, package = "rKOMICS"),
exData$species)
lapply(map$proportions, mean)$MR_HQ
lapply(map$proportions, mean)$MR_CSB3_HQ
### visualize results
barplot(map$proportions$MR)
Minicircle Sequence Cluster richness
Description
The msc.richness function calculates the measure of minicircle richness per sample by estimating the number of Minicircle Sequence Classes (MSCs) in each sample. It takes into account different minimum percent identities (MPIs) and returns a table of richness estimates per sample and per MPI. Additionally, it generates a boxplot that illustrates the minicircle richness across samples based on the estimated MSCs over a range of MPIs.
Usage
msc.richness(clustmatrices, samples, groups)
Arguments
clustmatrices |
a list of cluster matrices obtained from the msc.matrix function. Each cluster matrix represents the presence or absence of MSCs in each sample. |
samples |
a vector containing the names of the samples. |
groups |
a vector specifying the group (e.g., species) to which each sample belongs. |
Value
table |
a table summarizing the number of MSCs per sample at different percent identities. The table provides an overview of the estimated minicircle richness in each sample across the MPIs. |
plot |
a boxplot visualizing the minicircle richness across samples. The boxplot represents the distribution of richness estimates in each sample over the range of MPIs considered. |
Examples
require(ggplot2)
data(matrices)
data(exData)
#### run function
richness <- msc.richness(matrices, samples = exData$samples, groups = exData$species)
apply(richness$table[which(richness$table$group=="L. peruviana"),-(1:2)], 2, mean)
apply(richness$table[which(richness$table$group=="L. braziliensis"),-(1:2)], 2, mean)
apply(richness$table[which(richness$table$group=="hybrid"),-(1:2)], 2, mean)
#### visualize results
barplot(richness$table[,"id93"], names.arg = richness$table[,1],
las=2, cex.names=0.4, main="N of MSC at id 93")
#### adjust plot
richness$plot + ggtitle("MSC richness across % id") +
theme(axis.text.x = element_text(angle=45, hjust=1))
### show results of subset
table(exData$species)
hybrid <- which(exData$species=="hybrid")
# richness.subset <- msc.richness(matrices, samples = exData$samples[hybrid],
# groups = exData$species[hybrid])
Retrieve sequences
Description
The msc.seqs function retrieves the DNA sequence of a Minicircle Sequence Classes (MSC) together with all its hit sequences from a FASTA file and a corresponding UC file. This function is useful for extracting and analyzing specific MSCs and their associated hit sequences.
Usage
msc.seqs(fastafile, ucfile, clustnumbers, writeDNA = TRUE)
Arguments
fastafile |
the name of the FASTA file containing all minicircle sequences. |
ucfile |
the name of the UC file. |
clustnumbers |
a character vector containing the cluster numbers (in the format "C0", "C1", etc.) of the MSCs for which you want to retrieve the sequences. These cluster numbers specify the MSCs and their associated hit sequences that need to be extracted from the FASTA file and UC file. |
writeDNA |
a logical parameter that is set to TRUE by default. When set to TRUE, this parameter will write the extracted sequences into separate FASTA files in the current directory. |
Value
a table that summarizes the number of hit sequences found in each MSC, the MSC names, and the samples where the MSCs are present. This table provides an overview of the extracted sequences and their distribution across samples.
one FASTA file per MSC with all its hit sequences. These FASTA files can be further used for downstream analyses or sequence comparisons.
Examples
data(exData)
### select a subset of MSC
Lpe <- which(exData$species == "L. peruviana")
specific <- msc.subset(matrices[[7]], subset = Lpe)
### run function
seq <- msc.seqs(fastafile = system.file("extdata", "all.minicircles.circ.fasta", package="rKOMICS"),
ucfile = system.file("extdata", exData$ucs, package="rKOMICS")[7],
clustnumbers = specific$clustnumbers, writeDNA = FALSE)
Minicircle Sequence Classes similarity
Description
The function msc.similarity returns a measure of minicircle sequence composition within and between groups of samples. Specifically, it estimates the absolute and relative number of Minicircle Sequence Classes (MSCs) that are unique to each group or shared between two or more groups. The function returns tables and barplots that summarize the number of unique or shared MSCs for each minimum percent identity (MPI) separately or combined over all MPIs.
Usage
msc.similarity(clustmatrices, samples, groups)
Arguments
clustmatrices |
a list of cluster matrices. |
samples |
a vector containing the names of the samples. This can include all samples or it can be a subset. |
groups |
a vector, of equal length as samples, specifying to which group (e.g. species) the samples belong to. |
Value
absfreq |
a list per percent identity containing absolute frequency values of shared and unique MSCs. |
absfreq.plot |
a list of barplots visualizing previous results. |
relfreq |
a list per percent identity containing relative frequency values of shared and unique MSCs. |
relfreq.plot |
one barplot visualizing previous results. |
Examples
require(viridis)
data(matrices)
data(exData)
### run function
sim <- msc.similarity(matrices, samples = exData$samples,
groups = exData$species)
### visualize results (absolute frequencies)
barplot(sim$absfreq$id93)
### adjust plot (relative frequencies)
sim$relfreq.plot + scale_fill_viridis(discrete = TRUE)
sim$relfreq$id97["2"]*100
sim$relfreq$id97["3"]*100
### reduce number of groups
groups <- exData$species
levels(groups)[levels(groups)!='hybrid'] <- "non-hybrid"
sim.red <- msc.similarity(matrices, samples = exData$samples, groups = groups)
sim.red$relfreq.plot + scale_fill_viridis(discrete = TRUE)
Specific Minicircle Sequence Classes
Description
The msc.subset function allows you to identify specific Minicircle Sequence Classes (MSCs) for a subset of samples based on the output of the msc.matrix function. It helps in extracting and analyzing MSCs that are present in a particular subset of samples.
Usage
msc.subset(clustmatrix, subset)
Arguments
clustmatrix |
a cluster matrix obtained from the msc.matrix function. The cluster matrix represents the presence or absence of MSCs in each sample. |
subset |
a numerical vector indicating the subset of samples for which you want to identify specific MSCs. The values in the subset vector correspond to the indices of the samples to be included. |
Value
clustnumbers |
a vector containing the names of the specific MSCs present in the subset of samples. These are the MSCs that are found in the indicated subset. |
freq |
frequency values indicating the occurrence of the specific MSCs in the subset of samples. These values represent the number of times each MSC appears in the subset. |
matrix |
a subset of the cluster matrix containing only the specific MSCs found in the subset of samples. For samples not included in the subset, the values in the matrix should have the value 0, indicating the absence of the MSC. |
sum |
the total number of MSC found in the indicated subset of samples. |
Examples
data(matrices)
data(exData)
### selecting a group of samples e.g. all L. peruviana species
Lpe <- which(exData$species == "L. peruviana")
### run function
specific <- msc.subset(matrices[["id97"]], subset = Lpe)
### visualize results (check if it is indeed specific)
heatmap(specific$matrix) # or:
msc.heatmap(specific$matrix, samples = exData$samples, groups = exData$species)
### find specific MSC with highest frequency
which.max(specific$freq)
Cluster Analyses
Description
The function msc.uc reads the output of the clustering analyses (UC file) for each specified minimum percent identity (MPI) into a single list, which will be analyzed automatically to calculate and visualize, per MPI, the number of minicircle sequence classes (MSCs), the proportion of perfect alignments (i.e. alignments without any insertion/deletion, but allowing point mutations) and the number of alignment gaps. Gaps are defined by i) the number of insertions/deletions and ii) the length in base pairs of each individual insertion/deletion. It also issues a warning when large gaps (>500 bp) are found, which points the user to anomalous alignments due to e.g. artificial dimers introduced by the assembly process. This allows the user to make an informed decision about the MPI (or MPI's) that best captures minicircle sequence richness within a (group of) sample(s) while minimizing the number and length of alignment gaps.
Usage
msc.uc(files)
Arguments
files |
a character vector that includes the file names of UC files (produced by USEARCH or VSEARCH), such as all.minicircles.circ.id70.uc, all.minicircles.circ.id80.uc, and so on. Please ensure that your file names end with 'idxx.uc' for this function to work properly. |
Value
MSCs |
a numerical vector containing the number of MSC per MPI. |
perfect aligments |
a numerical vector containing the proportions of perfect alignments per MPI. |
insertions |
a list showing the insertion lengths per MPI. Each element in the list corresponds to a specific MPI, and it provides the lengths of identified insertions. |
deletions |
a list showing the deletions lengths per MPI. Each element in the list corresponds to a specific MPI, and it provides the lengths of identified deletions. |
insertions summary |
a table showing the length and the number of insertions across different MPIs. |
deletions summary |
a table showing the length and the number of deletions across different MPIs. |
plots |
various plots showing previous results. |
Examples
data(exData)
### run function
ucs <- msc.uc(files = system.file("extdata", exData$ucs, package="rKOMICS"))
ucs$MSCs["100"]
ucs$MSCs["97"]
### results
ucs$plots
Filtering of minicircle sequences
Description
The preprocess function is used to filter minicircle sequences based on sequence length and circularization success. When minicircle sequences are assembled with KOMICS, individual fasta files are generated for each sample. This function allows you to filter these sequences based on their length and whether they are circularized or not. The filtered sequences are then written into individual FASTA files in the current working directory.
Usage
preprocess(files, groups, circ = TRUE, min = 500, max = 1500, writeDNA = TRUE)
Arguments
files |
a character vector containing the names of the fasta files. Each file corresponds to the minicircle sequences of a specific sample. The file names should be in the format sampleA.minicircles.fasta, sampleB.minicircles.fasta, and so on (output of KOMICS). |
groups |
a factor specifying the group (e.g., species) to which each sample belongs. It should have the same length as the list of files, indicating the group assignment for each sample. |
circ |
a logical parameter that determines whether non-circularized minicircle sequences should be included or excluded from the filtering process. By default, non-circularized sequences are excluded (circ = TRUE). If you are interested in including non-circularized sequences, you can set the parameter to FALSE. |
min |
the minimum length threshold for filtering minicircle sequences. Sequences with a length below this threshold will be excluded. The default value is set to 500. |
max |
the maximum length threshold for filtering minicircle sequences. Sequences with a length above this threshold will be excluded. The default value is set to 1500. |
writeDNA |
a logical parameter that determines whether the filtered minicircle sequences should be written in FASTA format to the current working directory. By default, the filtered sequences are written (writeDNA = TRUE). If you are only interested in other output values like plots and summary, you can set this parameter to FALSE. |
Value
samples |
the sample names based on the input files. |
N_MC |
a table containing the sample names, the corresponding group assignment, and the number of minicircle sequences (N_MC) before and after filtering. |
plot |
a barplot visualizing the number of minicircle sequences per sample before and after filtering. |
summary |
the total number of minicircle sequences before and after filtering. |
Examples
require(ggplot2)
data(exData)
### setwd("")
### run function
table(exData$species)
pre <- preprocess(files = system.file("extdata", exData$fastafiles, package="rKOMICS"),
groups = exData$species,
circ = TRUE, min = 500, max = 1200, writeDNA = FALSE)
pre$summary
### visualize results
barplot(pre$N_MC[,"beforefiltering"],
names.arg = pre$N_MC[,1], las=2, cex.names=0.4)
### alter plot
pre$plot + labs(caption = paste0('N of MC sequences before and after filtering, ', Sys.Date()))
Minicircle Sequence Classes (MSC) Analyses
Description
This is a analysis toolkit to streamline the analyses of minicircle sequence diversity in population-scale genome projects. rKOMICS is a user-friendly R package that has simple installation requirements and that is applicable to all 27 trypanosomatid genera. Once minicircle sequence alignments are generated, rKOMICS allows to examine, summarize and visualize minicircle sequence diversity within and between samples through the analyses of minicircle sequence clusters. We showcase the functionalities of the (r)KOMICS tool suite using a whole-genome sequencing dataset from a recently published study on the history of diversification of the Leishmania braziliensis species complex in Peru. Analyses of population diversity and structure highlighted differences in minicircle sequence richness and composition between Leishmania subspecies, and between subpopulations within subspecies. The rKOMICS package establishes a critical framework to manipulate, explore and extract biologically relevant information from mitochondrial minicircle assemblies in tens to hundreds of samples simultaneously and efficiently. This should facilitate research that aims to develop new molecular markers for identifying species-specific minicircles, or to study the ancestry of parasites for complementary insights into their evolutionary history. ***** !! WARNING: this package relies on dependencies from Bioconductor. For Mac users, this can generate errors when installing rKOMICS. Install Bioconductor and ComplexHeatmap at advance: install.packages("BiocManager"); BiocManager::install("ComplexHeatmap") *****.
Details
The DESCRIPTION file:
This package was not yet installed at build time.
Index of help topics:
exData Example dataset matrices Example cluster matrices msc.depth Check the read depth of assembled minicircles msc.heatmap Visualization of cluster matrices msc.length Length of minicircles msc.matrix Build cluster matrix msc.pca Prinicple Component Analysis based on MSC msc.quality Check the quality of the assembly msc.richness Minicircle Sequence Cluster richness msc.seqs Retrieve sequences msc.similarity Minicircle Sequence Classes similarity msc.subset Specific Minicircle Sequence Classes msc.uc Cluster Analyses preprocess Filtering of minicircle sequences rKOMICS Minicircle Sequence Classes (MSC) Analyses read.uc Read in uc files
Author(s)
Frederik Van den Broeck frederik.vandenbroeck@kuleuven.be
Manon Geerts mgeerts@itg.be
References
Geerts, M., Schnaufer, A. & Van den Broeck, F. rKOMICS: an R package for processing mitochondrial minicircle assemblies in population-scale genome projects. BMC Bioinformatics 22, 468 (2021). doi: 10.1186/s12859-021-04384-1
Van den Broeck F, Savill NJ, Imamura H, Sanders M, Maes I, Cooper S, et al. Ecological divergence and hybridization of Neotropical Leishmania parasites. Proc Natl Acad Sci U S A. 2020;117. doi: 10.1073/pnas.1920136117.
See Also
Github: https://frebio.github.io/
komics-suite: https://frebio.github.io/komics/
Read in uc files
Description
The read.uc function is used to read the output of clustering analyses from a UC file. The function stores the information in two tables: one for hit records (H) and one for cluster records (C).
Usage
read.uc(file)
Arguments
file |
the name of the UC file that contains the clustering analysis results. The file should be specified with its full name, including the extension (e.g., all.minicircles.circ.id70.uc). |
Value
hits |
a table containing all the hit records from the UC file. Each row of the table represents a hit record, providing information about the alignment between a query sequence and a target sequence. |
clusters |
a table containing all the cluster records from the UC file. Each row of the table represents a cluster record, providing information about the clustering of sequences into clusters. |
clustnumbers |
a vector containing the cluster numbers (0-based). Each element of the vector represents a cluster identified in the clustering analysis. |