The ontoProc2 package has two aims:
The best way to work with an ontology in this system is
to use semsql_connect. The ontology argument is the
short string that the INCAtools project uses as part
of the filename for that ontology. For Gene Ontology, the
string is “go”.
## <SemsqlConn> prefix: GO | labeled terms: 88,356
The report method provides details.
##
## ============================================================
## SemsqlConn Object
## ============================================================
##
## Connection Details:
## ----------------------------------------
## Database path: /home/pkgbuild/.cache/R/BiocFileCache/171028fd52ed1_go.db
## Ontology prefix: GO
## Status: ✓ Connected
##
## Database Statistics:
## ----------------------------------------
## Labeled terms: 88,356
## Direct edges: 214,146
## Entailed edges: 9,336,957
## Definitions: 55,200
##
## Terms by Prefix (top 5):
## ----------------------------------------
## GO: 48,251
## CHEBI: 23,969
## _: 7,497
## UBERON: 4,783
## CL: 1,307
##
## Key Tables Available:
## ----------------------------------------
## ✓ rdfs_label_statement
## ✓ has_text_definition_statement
## ✓ edge
## ✓ entailed_edge
## ✓ rdfs_subclass_of_statement
## ✓ owl_some_values_from
## ✓ has_oio_synonym_statement
##
## ============================================================
## Use methods like search_labels(), get_ancestors(), etc.
## Run ?SemsqlConn for documentation.
## ============================================================
The back end is SQLite. We can enumerate the tables available:
## [1] 100
## [1] "all_problems" "annotation_property_node"
## [3] "anonymous_class_expression" "anonymous_expression"
## [5] "anonymous_individual_expression" "anonymous_property_expression"
Individual tables are readily accessible.
## # Source: table<`statements`> [?? x 8]
## # Database: sqlite 3.51.2 [/home/pkgbuild/.cache/R/BiocFileCache/171028fd52ed1_go.db]
## stanza subject predicate object value datatype language graph
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 obo:go/extensions/go-… obo:go… owl:vers… <NA> 2026… <NA> <NA> <NA>
## 2 obo:go/extensions/go-… obo:go… oio:hasO… <NA> 1.2 <NA> <NA> <NA>
## 3 obo:go/extensions/go-… obo:go… oio:defa… <NA> gene… <NA> <NA> <NA>
## 4 obo:go/extensions/go-… obo:go… dcterms:… cc:by… <NA> <NA> <NA> <NA>
## 5 obo:go/extensions/go-… obo:go… dce:title <NA> Gene… <NA> <NA> <NA>
## 6 obo:go/extensions/go-… obo:go… dce:desc… <NA> The … <NA> <NA> <NA>
## 7 obo:go/extensions/go-… obo:go… IAO:0000… GO:00… <NA> <NA> <NA> <NA>
## 8 obo:go/extensions/go-… obo:go… IAO:0000… GO:00… <NA> <NA> <NA> <NA>
## 9 obo:go/extensions/go-… obo:go… IAO:0000… GO:00… <NA> <NA> <NA> <NA>
## 10 obo:go/extensions/go-… obo:go… owl:vers… obo:g… <NA> <NA> <NA> <NA>
## # ℹ more rows
To investigate the ontology, searching through RDF labels is a natural approach.
Additional filtering could be useful here to focus on GO terms. The _riog...
labels have special roles in RDF inference, and this will be addressed in
vignettes to be added in the future.
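As a minimal sketch of the prefix filtering mentioned above, the following base-R code keeps only GO terms from a small data frame. The data frame `hits` is a toy stand-in for a search result, not actual package output, and the column names are assumed to match those returned by search_labels.

```r
# Toy stand-in for a label search result; rows are illustrative only.
hits <- data.frame(
  subject = c("GO:0008150", "UBERON:0000465", "GO:0003674", "PATO:0000001"),
  label = c("biological_process", "material anatomical entity",
            "molecular_function", "quality"),
  stringsAsFactors = FALSE
)

# Keep only rows whose CURIE carries the GO prefix.
go_hits <- hits[startsWith(hits$subject, "GO:"), ]
go_hits
```

Matching on "GO:" with the colon included avoids accidentally keeping any prefix that merely begins with the letters GO.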
Let’s improve the query:
Clearly it will be valuable to filter away obsolete terms. We will investigate the use of edge tables to accomplish this in a future vignette.
The ontologyX suite of Daniel Greene and colleagues provides very convenient ontology handling functions. We can transform the SQLite data to this format. We’ll illustrate with the Cell Ontology (CL).
## Connected to SemanticSQL database: /home/pkgbuild/.cache/R/BiocFileCache/17102851e4bb87_cl.db
## Primary ontology prefix: CL
## Warning in ontologyIndex::ontology_index(name = nn, parents = pl): Some parent
## terms not found: BFO:0000002, BFO:0000004, SO:0000704 (16 more)
## Ontology with 18801 terms
##
## Properties:
## id: character
## name: list
## parents: list
## children: list
## ancestors: list
## obsolete: logical
## Roots:
## UBERON:0000105 - life cycle stage
## GO:0008150 - biological_process
## GO:0003674 - molecular_function
## UBERON:0000465 - material anatomical entity
## UBERON:0000466 - immaterial anatomical entity
## GO:0005575 - cellular_component
## PATO:0000001 - quality
## PR:000010543 - myeloperoxidase
## IAO:0000027 - data item
## NCBITaxon:131567 - cellular organisms
## ... 940 more
A convenience function assists with visualizations:
The S7 class design in this package was initiated by a request to Anthropic Claude to use S7 in establishing code that mirrors the tasks accomplished in the INCAtools jupyter notebook.
The code of search_labels is:
## <S7_method> method(search_labels, ontoProc2::SemsqlConn)
## function (x, pattern, limit = 20L)
## {
## dbGetQuery(x@con, "SELECT subject, value AS label FROM rdfs_label_statement\n WHERE value LIKE ? LIMIT ?",
## param = list(paste0("%", pattern, "%"), as.integer(limit)))
## }
## <environment: namespace:ontoProc2>
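To make the query semantics concrete, here is a rough base-R imitation of `WHERE value LIKE '%pattern%' LIMIT limit` on a toy label table. The helper `like_labels` and the table `toy_labels` are hypothetical and not part of ontoProc2. The sketch assumes SQLite's default behaviour, in which LIKE ignores ASCII case, and it treats the pattern as literal text, so the LIKE wildcards `%` and `_` are not interpreted as they would be in the real query.

```r
# Hypothetical helper: a base-R imitation of
#   WHERE value LIKE '%pattern%' LIMIT limit
like_labels <- function(labels, pattern, limit = 20L) {
  # Compare lower-cased text to mimic LIKE's default case-insensitivity.
  # fixed = TRUE means `pattern` is matched literally (no wildcards).
  keep <- grepl(tolower(pattern), tolower(labels$value), fixed = TRUE)
  out <- labels[keep, c("subject", "value"), drop = FALSE]
  names(out) <- c("subject", "label")
  head(out, as.integer(limit))
}

# Toy label table; identifiers and labels are taken from output in this vignette.
toy_labels <- data.frame(
  subject = c("CL:0007011", "CL:0000540", "CL:0000107"),
  value = c("enteric neuron", "neuron", "autonomic neuron"),
  stringsAsFactors = FALSE
)

enteric_hits <- like_labels(toy_labels, "Enteric")
first_two <- like_labels(toy_labels, "neuron", limit = 2L)
enteric_hits
first_two
```

Because the match is a substring match, a search for "neuron" returns every label containing that word, which is why the later examples filter for an exact label.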
The INCAtools notebook discusses the fact that rdfs_label_statement is a SQLite table “view”.
The notebook indicates that a SPARQL query on an RDF store for the following computation would be “quite hard”. We want to find all the “edges” leading from “enteric neuron”, which would constitute the set of subject-predicate-object statements about this cell type with “enteric neuron” as subject.
In this code we use the concept of a “CURIE” (compact URI): a short identifier of the form prefix:local_id, in which the prefix names the source ontology that defines the concept and the local part is usually a numeric code (for example, CL:0007011).
if (!is_connected(clss)) clss <- reconnect(clss)
entcurie <- search_labels(clss, "enteric neuron") |>
filter(grepl("^CL", subject)) |>
dplyr::select(subject) |>
unlist()
entcurie
## subject
## "CL:0007011"
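The structure of a CURIE can be illustrated with a small base-R helper that splits an identifier at its first colon. The function `split_curie` is hypothetical and is not provided by ontoProc2; it is only a sketch of the prefix and local-part convention.

```r
# Hypothetical helper: split CURIEs into prefix and local part at the first colon.
split_curie <- function(curie) {
  data.frame(
    curie = curie,
    prefix = sub(":.*$", "", curie),       # text before the first colon
    local_id = sub("^[^:]*:", "", curie),  # text after the first colon
    stringsAsFactors = FALSE
  )
}

parts <- split_curie(c("CL:0007011", "UBERON:0002005", "rdfs:subClassOf"))
parts
```

The third example shows that predicates such as rdfs:subClassOf follow the same prefix convention even though their local part is not numeric.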
## subject subject_label predicate predicate_label object
## 1 CL:0007011 enteric neuron BFO:0000050 part of UBERON:0002005
## 2 CL:0007011 enteric neuron BFO:0000050 part of UBERON:0002005
## 3 CL:0007011 enteric neuron RO:0002100 has soma location UBERON:0002005
## 4 CL:0007011 enteric neuron RO:0002100 has soma location UBERON:0002005
## 5 CL:0007011 enteric neuron RO:0002202 develops from CL:0002607
## 6 CL:0007011 enteric neuron rdfs:subClassOf <NA> CL:0000029
## 7 CL:0007011 enteric neuron rdfs:subClassOf <NA> CL:0000107
## object_label
## 1 enteric nervous system
## 2 enteric nervous system
## 3 enteric nervous system
## 4 enteric nervous system
## 5 migratory enteric neural crest cell
## 6 neural crest derived neuron
## 7 autonomic neuron
Here the underlying code is performing a join:
## <S7_method> method(get_direct_edges, ontoProc2::SemsqlConn)
## function (x, term_id, direction = "outgoing")
## {
## stopifnot(direction %in% c("outgoing", "incoming", "both"))
## query.init <- "SELECT\n e.subject,\n sl.value AS subject_label,\n e.predicate,\n pl.value AS predicate_label,\n e.object,\n ol.value AS object_label\n FROM edge e\n LEFT JOIN rdfs_label_statement sl ON e.subject = sl.subject\n LEFT JOIN rdfs_label_statement pl ON e.predicate = pl.subject\n LEFT JOIN rdfs_label_statement ol ON e.object = ol.subject\n WHERE"
## if (direction == "outgoing") {
## query.fin <- "e.subject = ?"
## query = paste(query.init, query.fin)
## return(dbGetQuery(x@con, query, param = list(term_id)))
## }
## else if (direction == "incoming") {
## query.fin <- "e.object = ?"
## query = paste(query.init, query.fin)
## return(dbGetQuery(x@con, query, param = list(term_id)))
## }
## else if (direction == "both") {
## query.fin <- "e.subject = ? OR e.object = ?"
## query = paste(query.init, query.fin)
## return(dbGetQuery(x@con, query, param = list(term_id,
## term_id)))
## }
## }
## <environment: namespace:ontoProc2>
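The effect of the three LEFT JOINs can be imitated in base R with merge(..., all.x = TRUE). This is only a sketch on toy data frames whose rows mirror a few of the edges printed earlier; the object names `edge`, `labels`, and `out` are chosen here for illustration. It shows why a predicate with no entry in the label table, such as rdfs:subClassOf, ends up with a missing predicate label rather than being dropped.

```r
# Toy edge table; rows mirror some of the edges printed for "enteric neuron".
edge <- data.frame(
  subject = c("CL:0007011", "CL:0007011", "CL:0007011"),
  predicate = c("BFO:0000050", "RO:0002202", "rdfs:subClassOf"),
  object = c("UBERON:0002005", "CL:0002607", "CL:0000107"),
  stringsAsFactors = FALSE
)

# Toy label table; note that there is no label row for rdfs:subClassOf.
labels <- data.frame(
  subject = c("CL:0007011", "BFO:0000050", "RO:0002202",
              "UBERON:0002005", "CL:0002607", "CL:0000107"),
  value = c("enteric neuron", "part of", "develops from",
            "enteric nervous system",
            "migratory enteric neural crest cell", "autonomic neuron"),
  stringsAsFactors = FALSE
)

# One left join per role: subject, predicate, object.
out <- merge(edge, setNames(labels, c("subject", "subject_label")),
             by = "subject", all.x = TRUE)
out <- merge(out, setNames(labels, c("predicate", "predicate_label")),
             by = "predicate", all.x = TRUE)
out <- merge(out, setNames(labels, c("object", "object_label")),
             by = "object", all.x = TRUE)

# Restore the column order used by the SQL query.
out <- out[, c("subject", "subject_label", "predicate",
               "predicate_label", "object", "object_label")]
out
```

An inner join would silently discard the subclass edge here, so the left join is what keeps unlabeled predicates visible.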
The notebook mentions that the “entailed edges” table includes all statements that can be inferred from the application of base axioms of the ontology.
## id label predicate
## 1 BFO:0000002 <NA> rdfs:subClassOf
## 2 BFO:0000004 <NA> rdfs:subClassOf
## 3 BFO:0000040 <NA> rdfs:subClassOf
## 4 UBERON:0001062 anatomical entity rdfs:subClassOf
## 5 UBERON:0000061 anatomical structure rdfs:subClassOf
## 6 CL:0000107 autonomic neuron rdfs:subClassOf
## 7 CL:0000000 cell rdfs:subClassOf
## 8 CL:0000211 electrically active cell rdfs:subClassOf
## 9 CL:0000393 electrically responsive cell rdfs:subClassOf
## 10 CL:0000404 electrically signaling cell rdfs:subClassOf
## 12 CL:0000255 eukaryotic cell rdfs:subClassOf
## 13 UBERON:0000465 material anatomical entity rdfs:subClassOf
## 14 CL:0002319 neural cell rdfs:subClassOf
## 15 CL:0000029 neural crest derived neuron rdfs:subClassOf
## 16 CL:0000540 neuron rdfs:subClassOf
## 17 CL:2000032 peripheral nervous system neuron rdfs:subClassOf
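The relationship between direct and entailed edges can be sketched with a transitive closure over toy subclass edges. The identifiers with the EX prefix are invented for illustration, and `ancestors_of` is a hypothetical helper. Only subclass transitivity is imitated here; this is not a description of how the entailed edge table in the database is actually computed.

```r
# Toy direct subclass edges (child -> parent); EX identifiers are invented.
direct <- data.frame(
  subject = c("EX:4", "EX:3", "EX:2", "EX:4"),
  object = c("EX:3", "EX:2", "EX:1", "EX:2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all ancestors reachable from `term` along direct edges.
ancestors_of <- function(term, direct) {
  found <- character(0)
  frontier <- term
  while (length(frontier) > 0) {
    parents <- unique(direct$object[direct$subject %in% frontier])
    frontier <- setdiff(parents, found)  # only follow parents not yet visited
    found <- c(found, frontier)
  }
  found
}

# Build a toy "entailed" table: one row per (term, ancestor) pair.
entailed <- do.call(rbind, lapply(unique(direct$subject), function(s) {
  data.frame(subject = s, predicate = "rdfs:subClassOf",
             object = ancestors_of(s, direct), stringsAsFactors = FALSE)
}))
entailed
```

In this toy case the four direct edges expand to six entailed pairs, which gives a small-scale picture of why the entailed edge count reported for GO is so much larger than the direct edge count.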
The INCAtools notebook includes an example of finding all neurons that are part of the forebrain. This involves identifying CURIEs for relations and anatomical structures, and therefore working with the Relation Ontology (RO) and UBERON.
## Connected to SemanticSQL database: /home/pkgbuild/.cache/R/BiocFileCache/17102852a1be12_uberon.db
## Primary ontology prefix: UBERON
## Connected to SemanticSQL database: /home/pkgbuild/.cache/R/BiocFileCache/1710282ff82f0c_ro.db
## Primary ontology prefix: RO
First question: What’s the CURIE for “forebrain” in UBERON?
fbcur <- search_labels(ub, "forebrain", limit = 1000) |>
filter(label == "forebrain") |>
select(subject) |>
unlist()
fbcur
## subject
## "UBERON:0001890"
Second question: What’s the CURIE for “has soma location” in RO?
## subject
## "RO:0002100"
Third question: What’s the CURIE for “neuron” in CL?
ncur <- search_labels(clss, "neuron", limit = 1000) |>
filter(label == "neuron") |>
select(subject) |>
unlist()
ncur
## subject
## "CL:0000540"
Now we use three steps to obtain the solution.
First, enumerate all cell types that are located in forebrain.
clinfb <- tbl(clss@con, "entailed_edge") |>
filter(predicate == loccur, object == fbcur) |>
select(subject) |>
collect() |>
unlist()
length(clinfb)
## [1] 185
Second, restrict these to the cell types that are entailed subclasses (rdfs:subClassOf) of “neuron”.
clisneur <- tbl(clss@con, "entailed_edge") |>
filter(predicate == "rdfs:subClassOf", object == ncur) |>
filter(subject %in% clinfb) |>
select(subject) |>
collect() |>
unlist()
length(clisneur)
## [1] 185
Finally, get the labels.
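The three steps can be imitated end to end on toy data frames in base R. The three CURIEs are the ones found above; the EX identifiers, their labels, and the tables `entailed_edge` and `labels` are invented for illustration and do not come from the Cell Ontology database.

```r
fbcur <- "UBERON:0001890"  # forebrain
loccur <- "RO:0002100"     # has soma location
ncur <- "CL:0000540"       # neuron

# Toy entailed edges; the EX cell types are invented.
entailed_edge <- data.frame(
  subject = c("EX:1", "EX:1", "EX:2", "EX:3"),
  predicate = c(loccur, "rdfs:subClassOf", loccur, "rdfs:subClassOf"),
  object = c(fbcur, ncur, fbcur, ncur),
  stringsAsFactors = FALSE
)

# Toy label table for the invented cell types.
labels <- data.frame(
  subject = c("EX:1", "EX:2", "EX:3"),
  label = c("toy forebrain neuron", "toy forebrain glial cell",
            "toy spinal neuron"),
  stringsAsFactors = FALSE
)

# Step 1: cell types with soma located in the forebrain.
in_fb <- with(entailed_edge, subject[predicate == loccur & object == fbcur])

# Step 2: of those, keep the ones that are subclasses of neuron.
is_neuron <- with(entailed_edge,
                  subject[predicate == "rdfs:subClassOf" & object == ncur])
fb_neurons <- intersect(in_fb, is_neuron)

# Step 3: attach labels to the surviving identifiers.
fb_neuron_labels <- labels[labels$subject %in% fb_neurons, ]
fb_neuron_labels
```

Only the toy term that satisfies both conditions survives, which is the same intersection logic that the two entailed_edge queries apply to the real tables.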
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] S7_0.2.2 DBI_1.3.0 dplyr_1.2.1 DT_0.34.0
## [5] ontoProc2_0.99.16 BiocStyle_2.40.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 rappdirs_0.3.4 sass_0.4.10
## [4] generics_0.1.4 RSQLite_2.4.6 digest_0.6.39
## [7] magrittr_2.0.5 evaluate_1.0.5 grid_4.6.0
## [10] bookdown_0.46 fastmap_1.2.0 blob_1.3.0
## [13] R.oo_1.27.1 jsonlite_2.0.0 R.utils_2.13.0
## [16] ontologyIndex_2.12 ontologyPlot_1.7 graph_1.90.0
## [19] tinytex_0.59 BiocManager_1.30.27 purrr_1.2.2
## [22] crosstalk_1.2.2 Rgraphviz_2.56.0 codetools_0.2-20
## [25] httr2_1.2.2 jquerylib_0.1.4 paintmap_1.0
## [28] cli_3.6.6 rlang_1.2.0 dbplyr_2.5.2
## [31] R.methodsS3_1.8.2 bit64_4.8.0 withr_3.0.2
## [34] cachem_1.1.0 yaml_2.3.12 otel_0.2.0
## [37] tools_4.6.0 memoise_2.0.1 filelock_1.0.3
## [40] BiocGenerics_0.58.0 curl_7.1.0 vctrs_0.7.3
## [43] R6_2.6.1 magick_2.9.1 stats4_4.6.0
## [46] BiocFileCache_3.2.0 lifecycle_1.0.5 htmlwidgets_1.6.4
## [49] bit_4.6.0 pkgconfig_2.0.3 pillar_1.11.1
## [52] bslib_0.10.0 Rcpp_1.1.1-1.1 glue_1.8.1
## [55] xfun_0.57 tibble_3.3.1 tidyselect_1.2.1
## [58] knitr_1.51 htmltools_0.5.9 rmarkdown_2.31
## [61] compiler_4.6.0