MonarchR Basics

Shawn T O’Neil

Vignette updated: May-04-2026

Background: Biomedical Knowledge Graphs

Knowledge graphs (KGs) represent entities (such as genes, diseases, or phenotypes) and the relationships between them; for example that the CFTR gene causes the disease Cystic fibrosis. In the Monarch Initiative KG, this information is stored in a directed, labeled property graph, with properties attached to the entities (nodes) and relationships (edges). The KGX format used by Monarch provides a simple specification for KGs:

In the following figure, these required properties are shown in bold:

Small KG visualization showing node and edge properties, with id, category, subject, and object bolded, and other properties such as name present.

The monarchr package provides access to the cloud-hosted Monarch Initiative KG, and others in KGX format via graph-database and file-based “engines”, with a simple but flexible set of query functions. The example below finds all of the sub-types of Neimann-Pick Disease, and all of the genes associated with those. (You an use the Monarch Initiative website to identify IDs for genes, diseases, phenotypes and more, or search_nodes() described below.)

# MONDO:0001982 Niemann-Pick Disease (13 subtypes)

g <- monarch_engine() |>
    fetch_nodes(query_ids = c("MONDO:0001982")) |>
    expand(predicates = "biolink:subclass_of", direction = "in", transitive = TRUE) |>
    expand(categories = "biolink:Gene")

plot(g)
g

Backed by tidygraph and igraph, monarchr is part of a larger ecosystem of graph analyses in R:

Engines

Two kinds of engines are supported, providing largely the same functionality: one supporting access to Neo4j graph databases, and another supporting file-based access to TSV-formatted .tar.gz KGs hosted at kghub.io. Regardless of source, KGs must conform to the KGX specification.

The most accessible engine provided by monarchr is the monarch_engine(), designed specifically to access the cloud-hosted Monarch Initiative KG.

monarch <- monarch_engine()

Using another Neo4j database via neo4j_engine() is as simple as specifying a Neo4j Bolt endpoint, and username and password if required. This may be used to host your own Neo4j database for efficient KG access, based on the Monarch Neo4j docker deployment.

neodb <- neo4j_engine("http://localhost:7687", username = "user", password = "pass")

Finally, the file_engine() provides KG access to KGX-formatted, .tar.gz files containing tab-separated *_nodes.tsv and *_edges.tsv files. The package comes bundled with a small example KG providing information about Ehlers-Danlos and Marfan syndrome for demonstration purposes. You can also provide a local file path, or a URL to a remote file, which will be downloaded to the current working directory. (This example graph can also be loaded with data(eds_marfan_kg).)

filename <- system.file("extdata", "eds_marfan_kg.tar.gz", package = "monarchr")
# filename <- "https://kghub.io/kg-obo/mondo/2024-03-04/mondo_kgx_tsv.tar.gz"

eds_marfan_kg <- file_engine(filename)

Fetching Nodes

Engines support a fetch_nodes() function for pulling nodes by specified criteria. It returns a tidygraph-based graph, but only nodes, not any edges that may connect them. These can be specified as a set of IDs:

g <- monarch |>
    fetch_nodes(query_ids = c("MONDO:0009061", "HGNC:1884"))

g

Nodes can also be fetched in bulk. For example, we can fetch all of the nodes where in_taxon_label equals "Homo sapiens". For demonstration purposes, we’ll limit the number of results to 20 (ordered by node ID).

h_sapiens_nodes <- monarch |>
    fetch_nodes(in_taxon_label == "Homo sapiens", limit = 20)

h_sapiens_nodes

While the syntax does not support all R expressions (for compatibility with database-backed engines), standard comparisons ==, !=, <=, >=, <, and > are, as are logical operators &, |, and !, and grouping with ()s. Since many node properties are multi-valued, we provide a special %in_list% operator to allow exact match against entries. Here we fetch all the Homo sapiens Genes, using %in_list% to account for the multi-valued nature of node category.1

human_genes <- monarch |>
    fetch_nodes(in_taxon_label == "Homo sapiens" & "biolink:Gene" %in_list% category,
        limit = 20
    )

human_genes

A convenience %~% operator provides regular-expression matching against node properties (but not multi-valued ones like category).

fibrosis_disease_matches <- monarch |>
    fetch_nodes(description %~% ".*[fF]ibrosis.*" & "biolink:Disease" %in_list% category,
        limit = 20
    )

fibrosis_disease_matches

Expanding

From a set of nodes (and possibly edges that connect them), we can “expand” by fetching their collective neighborhood from the KG engine. Here we fetch 10 Gene nodes, and in the first call to expand() fetch nodes connected by biolink:causes or biolink:gene_associated_with_condition predicates. We then take the resulting graph (which contains both disease and gene nodes), and fetch the neighborhood of nodes with category biolink:PhenotypicFeature. Since there are many such results, we add a limit for the visualization.2

# MONDO:0018982, MONDO:0011871: Neimann-Pick type B and C

g <- monarch |>
    fetch_nodes(query_ids = c("MONDO:0018982", "MONDO:0011871")) |>
    expand(predicates = c("biolink:causes", "biolink:gene_associated_with_condition")) |>
    expand(categories = c("biolink:PhenotypicFeature"), limit = 10)

plot(g)

We can also expand iteratively using the expand_n function. This allows us to grow our graph to a certain degree of distance from the original input nodes. In the below example, we begin with a graph containing the disease “MONDO:0007525” and its directly associated phenotypes.

But what if we wish to get each of the subphenotypes within each of those phenotypes? And the sub-subphenotypes of those phenotypes? We can use the expand_n function to iteratively get a list of descendant subphenotypes up to 3 degrees of graph distance (n=3). You can adjust the maximum extent of the distance by increasing/decreasing the n parameter.

## Using example KGX file packaged with monarchr
data(eds_marfan_kg)

g <- eds_marfan_kg |>
    fetch_nodes(query_ids = "MONDO:0007525") |>
    expand(
        predicates = "biolink:has_phenotype",
        categories = "biolink:PhenotypicFeature"
    )

g_expanded <- g |> expand_n(predicates = "biolink:subclass_of", n = 3)

We can limit the expansion based on several factors, which are combined in an “and” fashion:

Lastly, the transitive parameter (which defaults to FALSE) can be used to fetch hierarchical, transitive relationships like biolink:subclass_of. When using transitive = TRUE, direction must be "in" or "out", and predicates must be a single entry. The first example above illustrated this; here we explore the super-types and sub-types of Niemann-Pick, by first fetching all of its subtypes, and then considering the collective ancestry of those nodes in the KG. Not all ancestors are diseases (e.g. biolink:Entity), so we restrict the result set of the second expansion.

# MONDO:0001982 Niemann-Pick Disease (13 subtypes)

hierarchy <- monarch |>
    fetch_nodes(query_ids = c("MONDO:0001982")) |>
    expand(predicates = "biolink:subclass_of", direction = "in", transitive = TRUE) |>
    expand(predicates = "biolink:subclass_of", direction = "out", transitive = TRUE, categories = "biolink:Disease")

plot(hierarchy)

Monarch-Specific Features

While most features of monarchr are applicable to any KGX-formatted KG, some are specific to functionality provided by the Monarch Initiative.

Searching

While fetch_nodes() and expand() apply to all types of engines, monarchr also exposes functionality specific to the Monarch Initiatives’ API services. The first of these is monarch_search(), which uses the same search API as the Monarch website, returning a graph with matching nodes backed by a monarch_engine().

cf_hits <- monarch_search("Cystic fibrosis", category = "biolink:Disease", limit = 5)

cf_hits

Semantic Similarity

Monarch’s Semantic Similarity tool provides matches between given query and target nodes, based on their semantic similarity in the KG. Here we grab a few phenotypes of Ehlers-Danlos syndrome, classic type ("MONDO:0007522") and Marfan Syndrome ("MONDO:0007947"), and we find the best matching Marfan phenotype for each EDS phenotype. The result is returned as a graph, containing only computed "computed:best_matches" edges:

eds_phenos <- monarch |>
    fetch_nodes(query_ids = "MONDO:0007522") |>
    expand(predicates = "biolink:has_phenotype", limit = 10)

marfan_phenos <- monarch |>
    fetch_nodes(query_ids = "MONDO:0007947") |>
    expand(predicates = "biolink:has_phenotype", limit = 10)

sim <- monarch_semsim(eds_phenos, marfan_phenos)

plot(sim)
sim

Note that the Disease nodes were included in the match as well. Although not displayed by plot(), edges also include the metric, similarity score, and ID of the least common ancestor relating the nodes.

We can use tidygraph’s graph_join() to see this result in the context of the original query and target graphs.

both_phenos <- eds_phenos |>
    graph_join(marfan_phenos) |>
    graph_join(sim)

plot(both_phenos)
both_phenos

The monarch_semsim result contain additional properties for monarch_semsim_metric, monarch_semsim_score, and monarch_semsim_ancestor_id, which we can use for plotting as well (loading ggraph to avoid this error):

library(ggraph)
plot(both_phenos, edge_color = monarch_semsim_score)

Other Features

Other useful features are covered in dedicated Vignettes:




Session Info

sessioninfo::session_info()

  1. The %in_list% and other operators work the same for file_engine()s, which store multi-valued properties in list columns. Multi-value properties are typically represented as list columns in R, but %in% does not provide the desired semantics–try using both %in% and %in_list% to filter a data.frame on values of a list column to see the difference. Note that %in_list% can only take a length-one character vector as the left-hand side, and the right hand side must be a single node property name in the KG.↩︎

  2. The limit parameter for expand() is more complex than for fetch_nodes(), because the query graph also contains edges which may be part of the neighborhood that would be fetched; expand() attempts to fetch limit new edges that are not part of the query already. Further, unlike in fetch_nodes() where nodes are ordered by ID, edges returned by expand() have no explicit order, and results with limit should not be considered reproducible or representative.↩︎