---
title: "Similarity between two groups of terms"
author: "Zuguang Gu ( z.gu@dkfz.de )"
date: '`r Sys.Date()`'
output:
  html_vignette:
    css: main.css
    toc: true
vignette: >
  %\VignetteIndexEntry{06. Similarity between two groups of terms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center")
```

The methods of group similarity implemented in **simona** are mainly from the [supplementary file](https://academic.oup.com/bib/article/18/5/886/2562801#supplementary-data) of the paper ["Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017"](https://doi.org/10.1093/bib/bbw067). The original notation has been slightly modified for consistency, and more explanations have been added in this vignette.

There are two groups of terms, denoted as $T_p$ and $T_q$ and represented as two sets:

$$ T_p = \{ a_1, a_2, ...\} \\
   T_q = \{ b_1, b_2, ... \} $$

where $a_i$ is a term in set $T_p$ and $b_j$ is a term in set $T_q$.

The wrapper function `group_sim()` calculates semantic similarities between two groups of terms with a specific method. Note that the method name can be partially matched.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = ..., control = list(...))
```

Some of the group similarity methods make no assumption about which similarity measure between single terms to use. If annotations are already provided in the DAG object, *Sim_Lin_1998* is used by default; otherwise *Sim_WP_1994* is used. The term similarity method can be set via the `term_sim_method` parameter in `control`. Additional parameters for a specific `term_sim_method` can also be set in `control`.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = ...,
    control = list(term_sim_method = ...))
```

All supported group similarity methods are:
```{r}
library(simona)
all_group_sim_methods()
```
## Pairwise term similarity-based methods
### GroupSim_pairwise_avg
Denote $S(a, b)$ as the semantic similarity between terms $a$ and $b$, where $a$
is from group $p$ and $b$ is from group $q$. The similarity between group $p$
and group $q$ is the average similarity over every pair of individual terms in
the two groups:
$$ \mathrm{GroupSim}(p, q) = \frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q}S(a, b) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1093/bioinformatics/btg153.
### GroupSim_pairwise_max
The similarity is defined as the maximal $S(a, b)$ among all pairs of terms in group $p$ and $q$:
$$ \mathrm{GroupSim}(p, q) = \max_{a \in T_p, b \in T_q}S(a, b) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_max",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1109/TCBB.2005.50.
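
The average and maximum definitions can be illustrated with a minimal sketch in base R. It is not the **simona** implementation: it assumes the pairwise similarities $S(a, b)$ are already available as a matrix with one row per term in $T_p$ and one column per term in $T_q$, and the matrix `S` and its values are made up for illustration.

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("a1", "a2"), c("b1", "b2", "b3")))

# GroupSim_pairwise_avg: sum over all pairs divided by |T_p| * |T_q|
sim_avg <- sum(S) / (nrow(S) * ncol(S))

# GroupSim_pairwise_max: similarity of the single most similar pair
sim_max <- max(S)
```

The average uses every pair, so a few unrelated terms pull it down, while the maximum depends on one pair only.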
### GroupSim_pairwise_BMA
BMA stands for "best-match average". First define similarity of a term $x$ to a group of terms $T$ as
$$ S(x, T) = \max_{y \in T} S(x, y) $$
which corresponds to the most similar term in $T$ to $x$. Then the BMA similarity is calculated as:
$$ \mathrm{GroupSim}(p, q) = \frac{1}{2}\left( \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q) + \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1155/2012/975783.
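
As a rough sketch of the best-match average (again assuming a precomputed similarity matrix with made-up values, not the package internals), $S(a, T_q)$ corresponds to the row maxima and $S(b, T_p)$ to the column maxima:

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("a1", "a2"), c("b1", "b2", "b3")))

# S(a, T_q): best match in T_q for every term a in T_p (row maxima)
best_p <- apply(S, 1, max)

# S(b, T_p): best match in T_p for every term b in T_q (column maxima)
best_q <- apply(S, 2, max)

# BMA: average the two per-group means of the best matches
sim_bma <- (mean(best_p) + mean(best_q)) / 2
```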
### GroupSim_pairwise_BMM
BMM stands for "best-match max". It is defined as:
$$ \mathrm{GroupSim}(p, q) = \max \left \{ \frac{1}{|T_p|}\sum_{a \in T_p} S(a, T_q), \frac{1}{|T_q|}\sum_{b \in T_q} S(b, T_p) \right \} $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1186/1471-2105-7-302.
### GroupSim_pairwise_ABM
ABM stands for "average best-match". It is defined as:
$$ \mathrm{GroupSim}(p, q) = \frac{1}{|T_p| + |T_q|} \left( \sum_{a \in T_p} S(a, T_q) + \sum_{b \in T_q} S(b, T_p) \right) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1186/1471-2105-14-284.
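
BMM and ABM reuse the same best-match values as BMA and differ only in how they are combined. The following sketch assumes a made-up similarity matrix `S`, with the sum of the two group sizes as the ABM denominator; it is for illustration only.

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE)

best_p <- apply(S, 1, max)  # S(a, T_q) for every a in T_p
best_q <- apply(S, 2, max)  # S(b, T_p) for every b in T_q

# BMM: the larger of the two per-group means
sim_bmm <- max(mean(best_p), mean(best_q))

# ABM: pool all best-match values and divide by the total number of terms
sim_abm <- (sum(best_p) + sum(best_q)) / (length(best_p) + length(best_q))
```

BMA gives both groups the same weight, whereas ABM gives every term the same weight, so the two coincide when the groups have the same size.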
### GroupSim_pairwise_HDF
First define the distance of a term $x$ to a group of terms $T$:
$$D(x, T) = 1 - S(x, T)$$
Then the Hausdorff distance between the two groups is:
$$ \mathrm{HDF}(p, q) = \max \left\{ \max_{a \in T_p} D(a, T_q), \max_{b \in T_q} D(b, T_p) \right\} $$
The final similarity is:
$$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{HDF}(p, q) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
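
A small sketch of the Hausdorff-based similarity, under the assumption that the distance from each term in one group is measured to the other group and that the pairwise similarities are given as a made-up matrix `S`:

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE)

D_p <- 1 - apply(S, 1, max)  # D(a, T_q) for every a in T_p
D_q <- 1 - apply(S, 2, max)  # D(b, T_p) for every b in T_q

# Hausdorff distance: the worst best-match distance in either direction
hdf <- max(max(D_p), max(D_q))
sim_hdf <- 1 - hdf
```

Because $D = 1 - S$, the result equals the smallest best-match similarity over all terms in both groups, so a single poorly matched term determines the score.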
### GroupSim_pairwise_MHDF
Instead of using the maximal distance from a group to the other group, MHDF uses the mean distance:
$$ \mathrm{MHDF}(p, q) = \max \left\{ \frac{1}{|T_p|} \sum_{a \in T_p} D(a, T_q), \frac{1}{|T_q|} \sum_{b \in T_q} D(b, T_p) \right\} $$
The final similarity is:
$$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{MHDF}(p, q) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1109/ICPR.1994.576361.
### GroupSim_pairwise_VHDF
It is defined as:
$$ \mathrm{VHDF}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} D^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} D^2(b, T_p)} \right) $$
The final similarity is:
$$ \mathrm{GroupSim}(p, q) = 1 - \mathrm{VHDF}(p, q) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1073/pnas.0702965104.
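
The MHDF and VHDF variants can be sketched in the same way as HDF. The matrix `S` is made up, and the code only mirrors the formulas in this vignette rather than the package internals:

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE)

D_p <- 1 - apply(S, 1, max)  # D(a, T_q) for every a in T_p
D_q <- 1 - apply(S, 2, max)  # D(b, T_p) for every b in T_q

# MHDF: the larger of the two mean distances
mhdf <- max(mean(D_p), mean(D_q))
sim_mhdf <- 1 - mhdf

# VHDF: average of the two root-mean-square distances
vhdf <- (sqrt(mean(D_p^2)) + sqrt(mean(D_q^2))) / 2
sim_vhdf <- 1 - vhdf
```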
### GroupSim_pairwise_Froehlich_2007
The similarity is:
$$ \mathrm{GroupSim}(p, q) = \exp(-\mathrm{HDF}(p, q)) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1186/1471-2105-8-166.
### GroupSim_pairwise_Joeng_2014
Similar to VHDF, but it directly uses the similarity:
$$ \mathrm{GroupSim}(p, q) = \frac{1}{2} \left( \sqrt{\frac{1}{|T_p|} \sum_{a \in T_p} S^2(a, T_q)} + \sqrt{\frac{1}{|T_q|} \sum_{b \in T_q} S^2(b, T_p)} \right) $$
The term semantic similarity method and the IC method can be set via the `control` argument, for example:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
```

Other parameters for the `term_sim_method` can also be set in the `control` list.
Paper link: https://doi.org/10.1109/tcbb.2014.2343963.
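
The last two pairwise methods can be sketched from the same best-match values. The matrix `S` is hypothetical and the code only mirrors the formulas in this vignette:

```r
# Hypothetical pairwise term similarities: rows are terms in T_p, columns are terms in T_q
S <- matrix(c(0.8, 0.2, 0.5,
              0.3, 0.9, 0.4),
            nrow = 2, byrow = TRUE)

best_p <- apply(S, 1, max)  # S(a, T_q) for every a in T_p
best_q <- apply(S, 2, max)  # S(b, T_p) for every b in T_q

# Froehlich_2007: exponential transform of the Hausdorff distance
hdf <- max(1 - best_p, 1 - best_q)
sim_froehlich <- exp(-hdf)

# Joeng_2014: average of the two root-mean-square best-match similarities
sim_joeng <- (sqrt(mean(best_p^2)) + sqrt(mean(best_q^2))) / 2
```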
## Pairwise edge-based methods
### GroupSim_SimALN
It is based on the average distance between every pair of terms in the two groups:
$$ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} D_\mathrm{sp}(a, b)\right) $$
Alternatively, the longest distance $\mathrm{len}(a, b)$ between two terms can be used:
$$ \mathrm{GroupSim}(p, q) = \exp\left(-\frac{1}{|T_p|*|T_q|} \sum_{a \in T_p, b \in T_q} \mathrm{len}(a, b)\right) $$
The `distance` parameter takes the value `"longest_distances_via_LCA"`
(the default) or `"shortest_distances_via_NCA"`:

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimALN",
    control = list(distance = "shortest_distances_via_NCA"))
```

Paper link: https://doi.org/10.1109/CBMS.2008.27.
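
As a minimal sketch of the formula, assume the pairwise distances between terms have already been taken from the DAG (by either of the two distance definitions) and stored in a matrix. The matrix `D` and its values are made up for illustration:

```r
# Hypothetical pairwise distances: rows are terms in T_p, columns are terms in T_q
D <- matrix(c(2, 4, 3,
              5, 1, 3),
            nrow = 2, byrow = TRUE)

# GroupSim_SimALN: exponential decay of the average pairwise distance
avg_dist <- sum(D) / (nrow(D) * ncol(D))
sim_aln <- exp(-avg_dist)
```

The exponential maps an average distance of zero to a similarity of 1 and lets the similarity approach 0 as the groups move further apart in the DAG.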
## Groupwise IC-based methods
This category of methods depends on the IC of terms in the two groups as well as their ancestor terms.

### GroupSim_SimGIC, GroupSim_SimDIC and GroupSim_SimUIC

Denote $\mathcal{A}_p$ and $\mathcal{A}_q$ as the two sets of ancestors of terms in group $p$ and $q$ respectively, where $\mathcal{A}_a$ is the set of ancestors of term $a$:
$$
\begin{align*}
\mathcal{A}_p &= \bigcup_{a \in T_p} \mathcal{A}_a \\
\mathcal{A}_q &= \bigcup_{b \in T_q} \mathcal{A}_b \\
\end{align*}
$$
The _GroupSim_SimGIC_, _GroupSim_SimDIC_ and _GroupSim_SimUIC_ are very similar. They are
based on the IC of the ancestor terms, defined as:
$$
\begin{align*}
\mathrm{GroupSim}_\mathrm{SimGIC}(p, q) &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q} \mathrm{IC}(x)} \\
\mathrm{GroupSim}_\mathrm{SimDIC}(p, q) &= \frac{2 * \sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x) + \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x)} \\
\mathrm{GroupSim}_\mathrm{SimUIC}(p, q) &= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q} \mathrm{IC}(x)}{\max\left\{\sum\limits_{x \in \mathcal{A}_p} \mathrm{IC}(x), \sum\limits_{x \in \mathcal{A}_q} \mathrm{IC}(x) \right\}} \\
\end{align*}
$$
The IC method can be set via the `control` argument. By default, if annotations are associated with the DAG, _IC_annotation_ is used; otherwise _IC_offspring_ is used.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimGIC",
    control = list(IC_method = ...))
```

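
To make the ancestor-based formulas concrete, here is a small sketch for SimGIC and SimDIC. The term names, IC values and ancestor sets are all hypothetical. In practice they would come from the DAG and the chosen IC method.

```r
# Hypothetical IC values for the terms of a toy DAG
ic <- c(root = 0, t1 = 1.2, t2 = 0.8, t3 = 2.5, t4 = 3.1, t5 = 2.0)

# Hypothetical ancestor sets of the two groups
A_p <- c("root", "t1", "t3", "t4")
A_q <- c("root", "t1", "t2", "t5")

common <- intersect(A_p, A_q)

# SimGIC: IC of the shared ancestors relative to the IC of all ancestors
sim_gic <- sum(ic[common]) / sum(ic[union(A_p, A_q)])

# SimDIC: Dice-like version, normalized by the two per-group IC sums
sim_dic <- 2 * sum(ic[common]) / (sum(ic[A_p]) + sum(ic[A_q]))
```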
### GroupSim_SimUI, GroupSim_SimDB, GroupSim_SimUB and GroupSim_SimNTO
These four methods are based on the counts of ancestor terms:
$$
\begin{align*}
\mathrm{GroupSim}_\mathrm{SimUI}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p \cup \mathcal{A}_q|} \\
\mathrm{GroupSim}_\mathrm{SimDB}(p, q) &= \frac{2*|\mathcal{A}_p \cap \mathcal{A}_q|}{|\mathcal{A}_p| + |\mathcal{A}_q|} \\
\mathrm{GroupSim}_\mathrm{SimUB}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\max\{|\mathcal{A}_p|, |\mathcal{A}_q|\}} \\
\mathrm{GroupSim}_\mathrm{SimNTO}(p, q) &= \frac{|\mathcal{A}_p \cap \mathcal{A}_q|}{\min\{|\mathcal{A}_p|, |\mathcal{A}_q|\}}
\end{align*}
$$
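
The four count-based measures only need the two ancestor sets. A minimal sketch with hypothetical ancestor sets (the term names are made up):

```r
# Hypothetical ancestor sets of the two groups
A_p <- c("root", "t1", "t3", "t4")
A_q <- c("root", "t1", "t2")

n_common <- length(intersect(A_p, A_q))
n_union  <- length(union(A_p, A_q))

sim_ui  <- n_common / n_union                          # SimUI (Jaccard-like)
sim_db  <- 2 * n_common / (length(A_p) + length(A_q))  # SimDB (Dice-like)
sim_ub  <- n_common / max(length(A_p), length(A_q))    # SimUB
sim_nto <- n_common / min(length(A_p), length(A_q))    # SimNTO
```

Since the denominators are ordered as $\min \le$ mean $\le \max \le$ union, the four values always satisfy SimUI $\le$ SimUB $\le$ SimDB $\le$ SimNTO.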
```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimUI")
```

### GroupSim_SimCOU

Let's write $\mathcal{A}_p$ and $\mathcal{A}_q$ as two vectors $\mathbf{v_p}$ and $\mathbf{v_q}$. Taking $\mathbf{v_p}$ as an example, it is $\mathbf{v_p} = (w_1, ..., w_n)$ where $n$ is the total number of terms in the DAG. The value $w_i$ is assigned to the corresponding term $t_i$ and is defined as:

$$ w_i = \left\{ \begin{array}{ll} \mathrm{IC}(t_i) & \textrm{if } t_i \in \mathcal{A}_p \\ 0 & \textrm{otherwise} \end{array} \right. $$

The semantic similarity is defined as the cosine similarity between the two vectors:

$$ \mathrm{GroupSim}(p, q) = \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \| \cdot \left \| \mathbf{v_q} \right \|} $$

It can also be written as:

$$ \mathrm{GroupSim}(p, q) = \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sqrt{\sum\limits_{x \in \mathcal{A}_p}\mathrm{IC}(x)^2} \cdot \sqrt{\sum\limits_{x \in \mathcal{A}_q}\mathrm{IC}(x)^2}} $$

The IC method can be set via the `control` argument. By default, if annotations are associated with the DAG, _IC_annotation_ is used; otherwise _IC_offspring_ is used.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimCOU",
    control = list(IC_method = ...))
```

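
The equivalence of the vector form and the set form can be checked with a short sketch. The IC values and ancestor sets below are hypothetical:

```r
# Hypothetical IC values for all terms in a toy DAG
ic <- c(root = 0, t1 = 1.2, t2 = 0.8, t3 = 2.5, t4 = 3.1, t5 = 2.0)

# Hypothetical ancestor sets of the two groups
A_p <- c("root", "t1", "t3", "t4")
A_q <- c("root", "t1", "t2", "t5")

# Vector form: IC for terms in the ancestor set, 0 otherwise
v_p <- ifelse(names(ic) %in% A_p, ic, 0)
v_q <- ifelse(names(ic) %in% A_q, ic, 0)
sim_cou_vec <- sum(v_p * v_q) / (sqrt(sum(v_p^2)) * sqrt(sum(v_q^2)))

# Set form: squared IC summed over the shared and the per-group ancestors
common <- intersect(A_p, A_q)
sim_cou_set <- sum(ic[common]^2) / (sqrt(sum(ic[A_p]^2)) * sqrt(sum(ic[A_q]^2)))
```

Both expressions give the same value, because only terms present in both ancestor sets contribute non-zero products to $\mathbf{v_p} \cdot \mathbf{v_q}$.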
### GroupSim_SimCOT
The semantic similarity is defined as:
$$
\begin{align*}
\mathrm{GroupSim}(p, q) &= \frac{ \mathbf{v_p} \cdot \mathbf{v_q} }{\left \| \mathbf{v_p} \right \|^2 + \left \| \mathbf{v_q} \right \|^2 - \mathbf{v_p} \cdot \mathbf{v_q}} \\
&= \frac{\sum\limits_{x \in \mathcal{A}_p \cap \mathcal{A}_q}\mathrm{IC}(x)^2}{\sum\limits_{x \in \mathcal{A}_p \cup \mathcal{A}_q}\mathrm{IC}(x)^2}
\end{align*}
$$
The IC method can be set via the `control` argument. By default, if annotations are associated with the DAG, _IC_annotation_ is used; otherwise _IC_offspring_ is used.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimCOT",
    control = list(IC_method = ...))
```

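
The two lines of the SimCOT definition can be compared numerically in the same way. The sketch uses hypothetical IC values and ancestor sets, with IC-weighted indicator vectors built as described for *GroupSim_SimCOU*:

```r
# Hypothetical IC values for all terms in a toy DAG
ic <- c(root = 0, t1 = 1.2, t2 = 0.8, t3 = 2.5, t4 = 3.1, t5 = 2.0)

# Hypothetical ancestor sets of the two groups
A_p <- c("root", "t1", "t3", "t4")
A_q <- c("root", "t1", "t2", "t5")

# IC-weighted indicator vectors over all terms
v_p <- ifelse(names(ic) %in% A_p, ic, 0)
v_q <- ifelse(names(ic) %in% A_q, ic, 0)

# Vector form
dot <- sum(v_p * v_q)
sim_cot_vec <- dot / (sum(v_p^2) + sum(v_q^2) - dot)

# Set form: squared IC over the intersection divided by squared IC over the union
sim_cot_set <- sum(ic[intersect(A_p, A_q)]^2) / sum(ic[union(A_p, A_q)]^2)
```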
## Groupwise edge-based methods
### GroupSim_SimLP
It is the largest depth of terms in $\mathcal{A}_p \cap \mathcal{A}_q$, where $\delta(t)$ denotes the depth of term $t$:
$$ \mathrm{GroupSim}(p, q) = \max\{\delta(t): t \in \mathcal{A}_p \cap \mathcal{A}_q\} $$
```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimLP")
```

Link: https://bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOvis.html#go-induced-distances.

### GroupSim_Ye_2005

It is a normalized version of *GroupSim_SimLP*:

$$
\begin{align*}
\mathrm{GroupSim}(p, q) &= \max\left\{\frac{\delta(t) - \delta_\mathrm{min}}{\delta_\mathrm{max} - \delta_\mathrm{min}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\} \\
    &= \max\left\{\frac{\delta(t) }{\delta_\mathrm{max}}: t \in \mathcal{A}_p \cap \mathcal{A}_q\right\}
\end{align*}
$$

The second equality holds because the minimal depth $\delta_\mathrm{min}$ is zero, which is the depth of the root.

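
Both depth-based measures can be sketched with a named vector of depths. The depths and ancestor sets are hypothetical, and the sketch assumes $\delta_\mathrm{max}$ is the largest depth in the whole DAG:

```r
# Hypothetical depths of all terms in a toy DAG (the root has depth 0)
depth <- c(root = 0, t1 = 1, t2 = 2, t3 = 2, t4 = 3, t5 = 3, t6 = 4)

# Hypothetical ancestor sets of the two groups
A_p <- c("root", "t1", "t3", "t4")
A_q <- c("root", "t1", "t2", "t3", "t5")

common <- intersect(A_p, A_q)

# GroupSim_SimLP: depth of the deepest shared ancestor
sim_lp <- max(depth[common])

# GroupSim_Ye_2005: the same value scaled by the maximal depth (assumed: max over the DAG)
sim_ye <- sim_lp / max(depth)
```

SimLP is an unbounded depth, so values from DAGs of different heights are not directly comparable. The normalized version lies between 0 and 1.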
```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_Ye_2005")
```

Paper link: https://doi.org/10.1038/msb4100034.

## Annotated items-based methods

This category of methods considers the items annotated to the two groups of terms.

### GroupSim_SimCHO

It is based on the annotated items. Denote $\sigma(t)$ as the total number of annotated items of $t$ (after merging all its offspring terms). The similarity is calculated as:

$$ \mathrm{GroupSim}(p, q) = \frac{\log(C_{pq}/C_\mathrm{max})}{\log(C_\mathrm{min}/C_\mathrm{max})} $$

where $C_{pq} = \min\{\sigma(t): t \in T_p \cap T_q \}$, $C_\mathrm{min}$ is the minimal number of annotated items in the DAG, which in most cases is 1, and $C_\mathrm{max}$ is the maximal number of annotated items, which is the total number of items annotated to the complete DAG.

The similarity can also be written in terms of $\mathrm{IC}_\mathrm{anno}$:

$$ \mathrm{GroupSim}(p, q) = \frac{\max\limits_{x \in T_p \cap T_q}\mathrm{IC}(x)}{\mathrm{IC}_\mathrm{max}} $$

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimCHO")
```

### GroupSim_SimALD

The similarity is calculated as:

$$ \mathrm{GroupSim}(p, q) = \max\left\{ 1 - \frac{\sigma(x)}{C_\mathrm{max}}: x \in T_p \cap T_q \right\} $$

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_SimALD")
```

## Set-based methods

Since $T_p$ and $T_q$ are two sets, the Kappa coefficient, Jaccard coefficient, Dice coefficient and overlap coefficient can be naturally used.

```{r, eval = FALSE}
group_sim(dag, group1, group2, method = "GroupSim_Jaccard",
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Dice",
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Overlap",
    control = list(universe = ...))
group_sim(dag, group1, group2, method = "GroupSim_Kappa",
    control = list(universe = ...))
```

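
This vignette does not spell out the set-based formulas, so the following sketch uses the usual textbook definitions of the Jaccard, Dice and overlap coefficients on two hypothetical term sets. It ignores the `universe` option and omits the Kappa coefficient, and it should not be read as the exact **simona** implementation.

```r
# Hypothetical groups of terms
T_p <- c("t1", "t2", "t3", "t4")
T_q <- c("t3", "t4", "t5")

n_common <- length(intersect(T_p, T_q))

# Textbook definitions of the three coefficients
sim_jaccard <- n_common / length(union(T_p, T_q))
sim_dice    <- 2 * n_common / (length(T_p) + length(T_q))
sim_overlap <- n_common / min(length(T_p), length(T_q))
```

Unlike the other categories, these coefficients only count identical terms and do not use the DAG structure, so two groups of closely related but non-identical terms get a similarity of zero.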
## Session info
```{r}
sessionInfo()
```