Title: | Biologically Informed Metabolomic Databases from 'PubChem' |
Version: | 1.0.0 |
Description: | All 'PubChem' compounds are downloaded to a local computer, but for each compound, only partial records are used. The data are organized into small files referenced by 'PubChem' CID. This package also contains functions to parse the biologically relevant compounds from all 'PubChem' compounds, using biological database sources, pathway presence, and taxonomic relationships. Taxonomy is used to generate a lowest common ancestor taxonomy ID (NCBI) for each biological metabolite, which then enables creation of taxonomically specific metabolome databases for any taxon. |
License: | GPL-3 |
Encoding: | UTF-8 |
Imports: | foreach, doParallel, R.utils, data.table, dplyr, rcdk, stringr |
Depends: | R (≥ 3.5.0) |
LazyData: | true |
Suggests: | utils, knitr, rmarkdown, formatR |
RoxygenNote: | 7.3.2 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-08-22 23:01:51 UTC; cbroeckl |
Author: | Corey Broeckling [aut, cre] |
Maintainer: | Corey Broeckling <cbroeckl@colostate.edu> |
Repository: | CRAN |
Date/Publication: | 2025-08-28 07:40:07 UTC |
build.cid.lca
Description
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' as input to generate a relationship between pubchem CID and the lowest common ancestor NCBI taxid
Usage
build.cid.lca(
pc.directory = NULL,
tax.sources = "LOTUS - the natural products occurrence database",
use.pathways = TRUE,
use.conserved.pathways = FALSE,
threads = 8,
cid.taxid.object = NULL,
taxid.hierarchy.object = NULL,
cid.pwid.object = NULL,
min.taxid.table.length = 3,
output.directory = NULL
)
Arguments
pc.directory |
directory from which to load pubchem .Rdata files. alternatively provide cid.taxid.object, taxid.hierarchy.object, and cid.pwid.object as data.table R objects. |
tax.sources |
vector. which taxonomy sources should be used? defaults to c("LOTUS - the natural products occurrence database", "The Natural Products Atlas", "KNApSAcK Species-Metabolite Database", "Natural Product Activity and Species Source (NPASS)"). |
use.pathways |
logical. default = TRUE, should pathway data be used in building lowest common ancestor, when taxonomy is associated with a pathway? |
use.conserved.pathways |
logical. default = FALSE, should 'conserved' pathways be used? when false, only pathways with an assigned taxonomy are used. |
threads |
integer. number of threads to use when finding lowest common ancestor. parallel processing via DoParallel and foreach packages. |
cid.taxid.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
taxid.hierarchy.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.pwid.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
min.taxid.table.length |
integer. when there are few taxa reported to synthesize a particular compound, and those few taxa are spread widely across biology, the LCA concept breaks down. This value controls the decision as to whether to determine LCA within taxonomic ranks, rather within the full taxonomy hierarchy. see details. |
output.directory |
directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided). If both directories are NULL, not saved, only returned as in memory |
Details
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function
Some metabolism is highly conserved - all species perform those reactions. Other metabolism is highly specific - there is one know species to produce that metabolite. Sometimes, it is in between. The lowest common ancestor approach allows us to analyze these patterns and put them to use to generalize metabolites for metabolomics across species.
Biology is more complex than that though. Natural products are often reported as being synthesized by an organism which is in symbiosis with a second organism. The taxonomic assignment is sometimes both organisms, even if neither would create that product in isolation, or if only one is actually capable of producing that metabolite. In these situations, the LCA approach can break down. For example, if a bacteria is in symbiosis with an algae, and each is listed as producing the metabolite, the LCA will be assigned as '1' - the root of all biology, since we have to go back to the base of the taxonomic tree to find the common taxonomic ancestor of prokaryotes and eukaryotes. In this example, there are two unique species, genera, families, orders, etc listed in the full taxonomic hierarchy for this metabolite.
The 'min.unique.taxid.ct' variable controls sensitivity to this phenomenon in assigning LCA. The number of unique taxa which are mapped to each metabolite varies by taxonomic level. it may map to two species, but only one genus. in that case, the genus is assigned as the LCA. However, if the metabolite maps to two unique species, two unique genera, two unique families, two unique kingdoms, and one unique domain, we should ask ourselves whether this sparse patterns supports that this metabolite should be marked as conserved' or 'primary.' What makes more intuitive sense is to conclude that there are may be extenuating circumstances which have resulted from unique biology. For example, Ceratodictyol B is reported from Haliclona cymaeformis and Ceratodictyon spongiosum, one of which is a red algal symbiont of the other. At each taxonomic level, there are either 0, 1, or 2 unique taxonomy IDs. 0 unique levels is uninteresting - that just reflects that there is no taxonomy assigned for those lineages at that level.
What is more interesting is the number of unique levels of the number of unique taxonomy ids. in the case of Ceratodictyol B, the only other value is '2'. There are 2 unique taxonomy IDs at each level species, genus, order, class, and phylum. So there are five taxonomic levels that have exactly 2 unique taxonomy IDs, and there are no taxonomic levels which have more than 2 unique taxids. We will call this the taxid.ct.table length, where the taxid.ct.table is the table of frequencies of the number of unique taxids at each taxonomic level. the length is the number of unique values when IGNORING '0' or '1'. When the taxid.ct.table length is less than or equal to min.taxid.table.length, the lca is calcluated within the lowest taxonomic level that has the most frequent unique taxonomy ID count.
For the Ceratodictyol B example, this would mean that we would find that '2' was the most common number of unique taxids reported, so we find that the lowest taxonomic level which reports two unique taxids is 'species'. LCA is for assigned to those two species. If however, there were two Ceratodicyon spp reported, then the species level would have 3 unique taxids, and there would be 4 levels (rather than five) which have 2unique taxids. the lowest taxonomic level with 2 unique taxids, the most frequent count observed, would now be 'genus', so LCA would be assigned for within each level of 'genus'. This would mean that the first LCA would be assigned to the Ceratodicyon genus, since there are multiple Ceratodicyon species reported, and then a second LCA would be assigned to the Haliclona cymaeformis species. Sorry it is so complicated. Life is complicated.
Value
a data frame containing pubchem CID ('cid'), and lowest common ancestor ('lca') NCBI taxonomy ID integer. will also save to pc.directory as .Rdata file.
Author(s)
Corey Broeckling
Examples
data('cid.taxid', package = "pubchem.bio")
data('taxid.hierarchy', package = "pubchem.bio")
data('cid.pwid', package = "pubchem.bio")
cid.lca.out <- build.cid.lca(
tax.sources = "LOTUS - the natural products occurrence database",
use.pathways = FALSE,
threads = 1, cid.taxid.object = cid.taxid,
taxid.hierarchy.object = taxid.hierarchy,
cid.pwid.object = cid.pwid)
head(cid.lca.out)
build.primary.metabolome
Description
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function to filter a dataset created by 'build.pubchem.bio' function
Usage
build.primary.metabolome(
pc.directory = NULL,
get.properties = FALSE,
threads = 8,
db.name = "primary.metabolome",
rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
pubchem.bio.object = NULL,
output.directory = NULL,
keep.primary.only = TRUE,
min.tax.ct = 3
)
Arguments
pc.directory |
directory from which to load pubchem .Rdata files |
get.properties |
logical. if TRUE, will return rcdk calculated properties: XLogP, TPSA, HBondDonorCount and HBondAcceptorCount. |
threads |
integer. how many threads to use when calculating rcdk properties. parallel processing via DoParallel and foreach packages. |
db.name |
character. what do you wish the file name for the saved version of this database to be? default = 'primary.metabolome.' Saved as an .Rdata file in the 'pc.directory' location. |
rcdk.desc |
vector. character vector of valid rcdk descriptors. default = rcdk.desc <- c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc' . To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default parameters are relatively fast to calculate - some descriptors take a very long time to calculate. you can calculate as many as you wish, but processing time will increase the more descriptors are added. |
pubchem.bio.object |
R data.table, generally produced by build.pubchem.bio; preferably, define pc.directory |
output.directory |
directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved. |
keep.primary.only |
logical. If TRUE, only biological metabolites scored as 'primary' are returned. If FALSE, full dataset of metabolites is returned, with new logical column, 'primary' |
min.tax.ct |
integer. if assigned an integer value, only those metabolites with at least min.tax.ct unique taxonomy assigments are considered 'primary'. default = 3. |
Details
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function
Value
a data frame containing pubchem CID ('cid'), and lowest common ancestor ('lca') NCBI taxonomy ID integer. will also save to pc.directory as .Rdata file.
Author(s)
Corey Broeckling data('pubchem.bio', package = "pubchem.bio") my.primary.db <- build.primary.metabolome( pubchem.bio.object = pubchem.bio, get.properties = FALSE, threads = 1) head(my.taxon.db)
build.pubchem.bio
Description
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function
Usage
build.pubchem.bio(
pc.directory = NULL,
use.bio.sources = TRUE,
bio.sources = c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI",
"LIPID MAPS", "MassBank of North America (MoNA)"),
use.pathways = TRUE,
pathway.sources = NULL,
use.taxid = TRUE,
taxonomy.sources = NULL,
use.parent.cid = TRUE,
remove.salts = TRUE,
get.properties = TRUE,
threads = 8,
rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
cid.lca.object = NULL,
cid.sid.object = NULL,
cid.pwid.object = NULL,
cid.parent.object = NULL,
cid.taxid.object = NULL,
cid.formula.object = NULL,
cid.smiles.object = NULL,
cid.inchikey.object = NULL,
cid.monoisotopic.mass.object = NULL,
cid.title.object = NULL,
cid.cas.object = NULL,
cid.pmid.ct.object = NULL,
output.directory = NULL
)
Arguments
pc.directory |
directory from which to load pubchem .Rdata files. alternatively, provide R data.tables for ALL cid.property.object options defined below. |
use.bio.sources |
logical. If TRUE (default) use the bio.source vector of sources, incorporating all CIDs from those bio databases. |
bio.sources |
vector of source names from which to extract pubchem CIDs. all can be found here: https://pubchem.ncbi.nlm.nih.gov/sources/. deafults to c("Metabolomics Workbench", "Human Metabolome Database (HMDB)", "ChEBI", "LIPID MAPS", "MassBank of North America (MoNA)") |
use.pathways |
logical. should all CIDs from any biological pathway data be incorporated into database? |
pathway.sources |
character. vector of sources to be used when adding metabolites to pubchem bio database. default = NULL, using all pathway sources. |
use.taxid |
logical. should all CIDs associated with a taxonomic identifier (taxid) be used? |
taxonomy.sources |
character. vector of sources to be used when adding taxonomically related metabolites to database. Default = NULL, using all sources. |
use.parent.cid |
logical. should CIDs be replaced with parent CIDs? default = TRUE. |
remove.salts |
logical. should salts be removed from dataset? default = TRUE. salts recognized as '.' in smiles string. performed after 'use.parent.cid'. |
get.properties |
logical. if TRUE, will return rcdk calculated properties: XLogP, TPSA, HBondDonorCount and HBondAcceptorCount. |
threads |
integer. how many threads to use when calculating rcdk properties. parallel processing via DoParallel and foreach packages. |
rcdk.desc |
vector. character vector of valid rcdk descriptors. default = rcdk.desc <- c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc' . To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default parameters are relatively fast to calculate - some descriptors take a very long time to calculate. you can calculate as many as you wish, but processing time will increase the more descriptors are added. |
cid.lca.object |
R data.table, generally produced by build.cid.lca; preferably, define pc.directory |
cid.sid.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.pwid.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.parent.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.taxid.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.formula.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.smiles.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.inchikey.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.monoisotopic.mass.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.title.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.cas.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
cid.pmid.ct.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
output.directory |
directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved. |
Details
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function
Value
a data frame containing pubchem CID, title, formula, monoisotopic molecular weight, inchikey, smiles, cas, optionally rcdk properties
Author(s)
Corey Broeckling
Examples
data('cid.sid', package = "pubchem.bio")
data('cid.pwid', package = "pubchem.bio")
data('cid.parent', package = "pubchem.bio")
data('cid.taxid', package = "pubchem.bio")
data('cid.formula', package = "pubchem.bio")
data('cid.smiles', package = "pubchem.bio")
data('cid.inchikey', package = "pubchem.bio")
data('cid.monoisotopic.mass', package = "pubchem.bio")
data('cid.title', package = "pubchem.bio")
data('cid.cas', package = "pubchem.bio")
data('cid.pmid.ct', package = "pubchem.bio")
data('cid.lca', package = "pubchem.bio")
pc.bio.out <- build.pubchem.bio(use.pathways = FALSE, use.parent.cid = FALSE,
get.properties = FALSE, threads = 1,
cid.sid.object = cid.sid, cid.pwid.object = cid.pwid,
cid.parent.object = cid.parent, cid.taxid.object = cid.taxid,
cid.formula.object = cid.formula, cid.smiles.object = cid.smiles,
cid.inchikey.object = cid.inchikey,
cid.monoisotopic.mass.object = cid.monoisotopic.mass,
cid.title.object = cid.title, cid.cas.object = cid.cas,
cid.pmid.ct.object = cid.pmid.ct, cid.lca.object = cid.lca)
head(pc.bio.out)
build.taxon.metabolome
Description
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function to filter a dataset created by 'build.pubchem.bio' function
Usage
build.taxon.metabolome(
pc.directory = NULL,
taxid = c(),
get.properties = FALSE,
full.scored = TRUE,
keep.scored.only = FALSE,
aggregation.function = max,
threads = 8,
db.name = "custom.metabolome",
rcdk.desc = c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor",
"org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"),
pubchem.bio.object = NULL,
cid.lca.object = NULL,
taxid.hierarchy.object = NULL,
output.directory = NULL
)
Arguments
pc.directory |
directory from which to load pubchem .Rdata files |
taxid |
integer vector of integer NCBI taxonomy IDs. i.e. c(9606, 1425170 ) for Homo sapiens and Homo heidelbergensis. |
get.properties |
logical. if TRUE, will return rcdk calculated properties: XLogP, TPSA, HBondDonorCount and HBondAcceptorCount. |
full.scored |
logincal. default = FALSE. When false, only metabolites which map to the taxid(s) are returned. When TRUE, all metabolites are returned, with scores assigned based on the distance of non-mapped metabolites to the root node. i.e. specialized metabolites from distantly related species are going to be scored at or near zero, specialized metabolites of mores similar species higher, and more conserved metabolites will score higher than ore specialized. |
keep.scored.only |
logical. If TRUE, biological metabolites with NA for the taxonomy score are removed before returning. |
aggregation.function |
function. default = max. can use mean, median, min, etc, or a custom function. Defines how the aggregate score will be calculated when multiple taxids are used. |
threads |
integer. how many threads to use when calculating rcdk properties. parallel processing via DoParallel and foreach packages. |
db.name |
character. what do you wish the file name for the saved version of this database to be? default = 'custom.metabolome', but could be 'taxid.4071' or 'Streptomyces', etc. Saved as an .Rdata file in the 'pc.directory' location. |
rcdk.desc |
vector. character vector of valid rcdk descriptors. default = rcdk.desc <- c("org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.AcidicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.BasicGroupCountDescriptor", "org.openscience.cdk.qsar.descriptors.molecular.TPSADescriptor"). To see descriptor categories: 'dc <- rcdk::get.desc.categories(); dc' . To see the descriptors within one category: 'dn <- rcdk::get.desc.names(dc[4]); dn'. Note that the four default parameters are relatively fast to calculate - some descriptors take a very long time to calculate. you can calculate as many as you wish, but processing time will increase the more descriptors are added. |
pubchem.bio.object |
R data.table, generally produced by build.pubchem.bio; preferably, define pc.directory |
cid.lca.object |
R data.table, generally produced by build.cid.lca; preferably, define pc.directory |
taxid.hierarchy.object |
R data.table, generally produced by get.pubchem.ftp; preferably, define pc.directory |
output.directory |
directory to which the pubchem.bio database is saved. If NULL, will try to save in pc.directory (if provided), else not saved. |
Details
utilizes downloaded and properly formatted local pubchem data created by 'get.pubchem.ftp' function
Value
a data frame containing pubchem CID ('cid'), and lowest common ancestor ('lca') NCBI taxonomy ID integer. will also save to pc.directory as .Rdata file.
Author(s)
Corey Broeckling
Examples
data('cid.lca', package = "pubchem.bio")
data('pubchem.bio', package = "pubchem.bio")
data('taxid.hierarchy', package = "pubchem.bio")
my.taxon.db <- build.taxon.metabolome(
pubchem.bio.object = pubchem.bio,
cid.lca.object = cid.lca, taxid.hierarchy.object = taxid.hierarchy,
get.properties = FALSE, threads = 1, taxid = c(1))
head(my.taxon.db)
cid.accurate.mass.rda
Description
A subset of the full cid.accurate.mass, for example code
Format
data.table, stored in .rda format
Source
subset of cid.accurate.mass file from get.pubchem.ftp
cid.cas.rda
Description
A subset of the full cid.cas, for example code
Format
data.table, stored in .rda format
Source
subset of cid.cas file from get.pubchem.ftp
cid.formula.rda
Description
A subset of the full cid.formula, for example code
Format
data.table, stored in .rda format
Source
subset of cid.formula file from get.pubchem.ftp
cid.inchi.rda
Description
A subset of the full cid.inchi, for example code
Format
data.table, stored in .rda format
Source
subset of cid.inchi file from get.pubchem.ftp
cid.inchikey.rda
Description
A subset of the full cid.inchikey, for example code
Format
data.table, stored in .rda format
Source
subset of cid.inchikey file from get.pubchem.ftp
cid.lca.rda
Description
A subset of the full cid.lca, for example code
Format
data.table, stored in .rda format
Source
subset of cid.lca file from get.pubchem.ftp
cid.mesh.function.rda
Description
A subset of the full cid.mesh, for example code
Format
data.table, stored in .rda format
Source
subset of cid.mesh file from get.pubchem.ftp
cid.mesh.name.rda
Description
A subset of the full cid.mesh.name, for example code
Format
data.table, stored in .rda format
Source
subset of cid.mesh.name file from get.pubchem.ftp
cid.monoisotopic.mass.rda
Description
A subset of the full cid.monoisotopic, for example code
Format
data.table, stored in .rda format
Source
subset of cid.monoisotopic file from get.pubchem.ftp
cid.monoisotopic.mass.rda
Description
A subset of the full cid.monoisotopic.mass, for example code
Format
data.table, stored in .rda format
Source
subset of cid.accurate.mass file from get.pubchem.ftp
cid.parent.rda
Description
A subset of the full cid.parent, for example code
Format
data.table, stored in .rda format
Source
subset of cid.parent file from get.pubchem.ftp
cid.pmid.rda
Description
A subset of the full cid.pmid, for example code
Format
data.table, stored in .rda format
Source
subset of cid.pmid file from get.pubchem.ftp
cid.pmid.ct.rda
Description
A subset of the full cid.pmid.ct, for example code
Format
data.table, stored in .rda format
Source
subset of cid.pmid.ct file from get.pubchem.ftp
cid.preferred.rda
Description
A subset of the full cid.preferred, for example code
Format
data.table, stored in .rda format
Source
subset of cid.preferred file from get.pubchem.ftp
cid.pwid.rda
Description
A subset of the full cid.pwid, for example code
Format
data.table, stored in .rda format
Source
subset of cid.pwid file from get.pubchem.ftp
cid.sid.rda
Description
A subset of the full cid.sid, for example code
Format
data.table, stored in .rda format
Source
subset of cid.sid file from get.pubchem.ftp
cid.smiles.rda
Description
A subset of the full cid.smiles, for example code
Format
data.table, stored in .rda format
Source
subset of cid.smiles file from get.pubchem.ftp
cid.synonym.rda
Description
A subset of the full cid.synonym, for example code
Format
data.table, stored in .rda format
Source
subset of cid.synonym file from get.pubchem.ftp
cid.taxid.rda
Description
A subset of the full cid.taxid, for example code
Format
data.table, stored in .rda format
Source
subset of cid.taxid file from get.pubchem.ftp
cid.title.rda
Description
A subset of the full cid.title, for example code
Format
data.table, stored in .rda format
Source
subset of cid.title file from get.pubchem.ftp
export.msfinder
Description
export pubchem.bio pc.bio syle data.table to format suitable for MSFinder input.
Usage
export.msfinder(pc.bio.object = NULL, export.file.name = NULL)
Arguments
pc.bio.object |
input data.table, generated from 'build.pubchem.bio' or 'build.taxon.metabolome' functions |
export.file.name |
valid file path and name. Extension should be listed as '.tsv'. |
Details
takes output from 'build.pubchem.bio' or 'build.taxon.metabolome' functions, reformatting, and exporting to input format suitable for MSFinder.
Value
nothing - file written to disk.
Author(s)
Corey Broeckling
export.pubchem.bio
Description
export pubchem.bio pc.bio syle data.table to tab delimited text file for import into other programs. all columns exported.
Usage
export.pubchem.bio(pc.bio.object = NULL, export.file.name = NULL)
Arguments
pc.bio.object |
input data.table, generated from 'build.pubchem.bio' or 'build.taxon.metabolome' functions |
export.file.name |
valid file path and name. Extension should be listed as '.tsv'. |
Details
takes output from 'build.pubchem.bio' or 'build.taxon.metabolome' functions, reformatting, and exporting to input format suitable for MSFinder.
Value
nothing - file written to disk.
Author(s)
Corey Broeckling
export.sirius
Description
export pubchem.bio pc.bio syle data.table to format suitable for sirius input.
Usage
export.sirius(pc.bio.object = NULL, export.file.name = NULL)
Arguments
pc.bio.object |
input data.table, generated from 'build.pubchem.bio' or 'build.taxon.metabolome' functions |
export.file.name |
valid file path and name. Extension should be listed as '.tsv'. |
Details
takes output from 'build.pubchem.bio' or 'build.taxon.metabolome' functions, reformatting, and exporting to input format suitable for Sirius.
Value
nothing - file written to disk.
Author(s)
Corey Broeckling
get.pubchem.ftp
Description
first step to building a local selective, biologically focused, pubchem data repository focused on metabolomics informatics
Usage
get.pubchem.ftp(
pc.directory = NULL,
timeout = 50000,
rm.tmp.files = TRUE,
threads = 2
)
Arguments
pc.directory |
character. directory to which data will be saved |
timeout |
numeric. timeout setting for FTP download. setting options(timeout) value too small will generate errors for large files. default = 50000. |
rm.tmp.files |
logical. should temporary files be removed after completion of download and parsing? Default = TRUE. |
threads |
integer. the number of parallel threads to be used by foreach %dopar% during processing of taxonomy hierarchy data. |
Details
this function downloads and unzips files from pubchem and NCBI taxonomy FTP sites as a first step in building a local metabolomics repository.
Value
nothing. all data are saved to disk for later loading
Author(s)
Corey Broeckling
Examples
## Not run:
my.dir <- "C:/Temp/20250725"
# or some other valid directory.
# this will be created assuming 'C:/Temp' exists.
get.pubchem.ftp(
pc.directory = my.dir,
timeout = 50000,
rm.tmp.files = TRUE
)
## End(Not run)
pc.bio.subset.rda
Description
A small dataset of a pubchem.bio metabolome scored by taxon, for inclusion in vignette
Format
data.table, stored in .rda format
Source
pubchem.bio data.table output derived from build.pubchem.bio function
pubchem.bio.rda
Description
A subset of a full pubchem.bio biological metabolome, for example code
Format
data.table, stored in .rda format
Source
subset of pubchem.bio file from build.pubchem.bio function
sub.taxid.hierarchy.rda
Description
A small dataset of a the taxonomy hierarchy, for inclusion in vignette
Format
data.table, stored in .rda format
Source
pubchem.bio data.table output from NCBI Taxonomy data
taxid.hierarchy.rda
Description
A subset of the full taxid.hierarchy, for example code
Format
data.table, stored in .rda format
Source
pubchem.bio data.table output from NCBI Taxonomy data