---
title: "Using the R package `motifcluster`"
author: "William G. Underwood"
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
  pdf_document:
    citation_package: natbib
    number_sections: true
    highlight: tango
keep_tex: TRUE
bibliography: refs.bib
vignette: >
  %\VignetteIndexEntry{using_the_motifcluster_package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#> "
)
set.seed(200401293)
old_options <- options()
options(digits = 2)
library(mclust)
# nolint start: line_length_linter
```


\tableofcontents
\pagebreak

# Introduction

This vignette demonstrates how to use the R package `motifcluster`.
These methods are detailed in the paper
*Motif-Based Spectral Clustering of Weighted Directed Networks*,
which is available at [`arXiv:2004.01293`](arxivpaper)
[@UnderwoodElliottCucuringu_2020_Motifs].
The functionality of the `motifcluster` package
falls into a few main categories:

- Building motif adjacency matrices
- Sampling random weighted directed networks
- Spectral embedding with motif adjacency matrices
- Motif-based spectral clustering

This vignette comprehensibly demonstrates all of these functionalities,
showcasing the full capability of the `motifcluster` package.
We first load some helpful packages for this tutorial:

```{r, eval = FALSE}
library(devtools)
library(mclust)
```

The `motifcluster` package can be installed from GitHub with:

```{r, eval = FALSE}
install_github("wgunderwood/motifcluster/R")
```

The package can then be loaded from within R with:

```{r, results = "hide"}
library(motifcluster)
```

# Building motif adjacency matrices

The main novelty in the `motifcluster` package is its ability to
build a wide variety of motif adjacency matrices (MAMs), and to do so quickly.
There are several options to consider when building an MAM,
which are covered in this section.

## An example network

In order to demonstrate the construction of MAMs,
we first need a small weighted directed network
$\mathcal{G}_1$ to use as an example.
Note that throughout this package we represent networks by their
weighted directed adjacency matrices (possibly in sparse form).
This means that for use alongside R packages such as `igraph`,
one must manually convert between adjacency matrices and
`igraph` objects.

```{r}
G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
```

## Basic motif adjacency matrix construction

The `build_motif_adjacency_matrix` function is the main workhorse
for building MAMs with `motifcluster`.
Let's use it to build an MAM for the network $\mathcal{G}_1$.
First we must choose a motif to look for.
A full list can be obtained with:

```{r}
get_motif_names()
```

Let's use the 3-cycle motif $\mathcal{M}_1$.

```{r}
build_motif_adjacency_matrix(G1, motif_name = "M1")
```

Note that all the entries are zero except for entries
$(1,2)$,
$(1,4)$,
$(2,1)$,
$(2,4)$,
$(4,1)$,
and
$(4,2)$.
This is because vertices 1, 2 and 4 form an exact copy of
the motif $\mathcal{M}_1$ in the network $\mathcal{G}_1$,
and the  $(i,j)$th MAM entry simply counts the number of instances
containing both vertices $i$ and $j$.

## Functional and structural motif adjacency matrices

Looking at our example network $\mathcal{G}_1$ again,
you might notice that there is seemingly another instance
of the motif $\mathcal{M}_1$
in our network $\mathcal{G}_1$,
on the vertices 2, 3 and 4,
albeit with an "extra" edge from 2 to 3.
The reason for this is that we instructed
`build_motif_adjacency_matrix`
to look for *structural* motif instances (this is the default).
Structural instances require an exact match, with no extra edges.
If we want to also include instances which may have extra edges present,
we must instead use functional motif instances:


```{r}
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func")
```

This time we also pick up the 3-cycle on vertices 2, 3 and 4.
Vertices 2 and 4 therefore occur in two distinct instances
of the motif, and so their motif adjacency matrix entries are equal
to two.

## Weighted motif adjacency matrices

Our example network $\mathcal{G}_1$ has weighted edges,
which we have not yet used:
so far our MAMs have been simply counting
instances of motifs.
This is because the default weighting scheme is
"unweighted", assigning every instance a weight of one.

### Mean-weighted instances

We could instead use the "mean" weighting scheme,
where every instance is assigned a weight equal to its
mean edge weight.
The $(i,j)$th MAM entry is then defined as the sum of these
instance weights across all instances containing
both vertices $i$ and $j$:

```{r}
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "mean")
```

The 3-cycle on vertices 1, 2 and 4
has edge weights of 2, 3 and 4, so its mean edge weight is 3.
Similarly the 3-cycle on vertices 2, 3 and 4 has mean edge weight of 4.
Vertices 2 and 4 appear in both, so their mutual MAM entries are the
sum of these two mean weights, which is 7.

### Product-weighted instances

We can also use the "product" weighting scheme,
where every instance is assigned a weight equal to the
product of its edge weights.
The $(i,j)$th MAM entry is then defined as the sum of these
instance weights across all instances containing
both vertices $i$ and $j$:

```{r}
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "product")
```

The 3-cycle on vertices 1, 2 and 4
has edge weights of 2, 3 and 4, so the product of its edge weights is 24.
Similarly the 3-cycle on vertices 2, 3 and 4 has product of edge weights of 60.
Vertices 2 and 4 appear in both, so their shared MAM entries are the
sum of these two product weights, which is 84.

### Computation method

The final argument to `build_motif_adjacency_matrix`
is the `mam_method` argument.
This does not affect the value returned but may impact the
amount of time taken to return the MAM.
In general the "sparse" method (the default)
is faster on large sparse networks.
The "dense" method, which uses fewer operations but on
denser matrices, tends to be faster for small dense networks.


```{r}
mam_sparse <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "sparse")
mam_dense  <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "dense")
all(mam_sparse == mam_dense)
```

# Sampling random weighted directed networks

Building adjacency matrices by hand is tedious,
so it is useful to have methods for generating
the adjacency matrices of networks drawn from some probabilistic model.
We use (weighted) directed stochastic block models (DSBMs) and
(weighted) bipartite stochastic block models (BSBMs).

## Directed stochastic block models

First let's sample the adjacency matrix of a DSBM
which has two blocks of vertices;
the first containing five vertices and the second containing three.
We use strong within-block connections,
with the diagonal entries of the connection matrix set to $0.9$.
The between-block connections are weaker, with the off-diagonal
connection matrix entries set to $0.2$.
Note how the resulting adjacency matrix is denser
on its diagonal blocks
$\{1, \dots, 5\} \times \{1, \dots, 5\}$
and
$\{6, \dots, 8\} \times \{6, \dots, 8\}$,
and is sparser on its off-diagonal blocks
$\{1, \dots, 5\} \times \{6, \dots, 8\}$
and
$\{6, \dots, 8\} \times \{1, \dots, 5\}$.
The entries which lie exactly on the diagonal will
always be zero,
since we only consider networks without self-loops.

```{r}
block_sizes <- c(5, 3)
connection_matrix <- matrix(c(
  0.9, 0.2,
  0.2, 0.9
), nrow = 2, byrow = TRUE)
```

\pagebreak

```{r}
sample_dsbm(block_sizes, connection_matrix)
```

### Constant-weighted directed stochastic block models

The matrix above has binary entries,
indicating that it is the adjacency matrix of an unweighted
directed network.
The `motifcluster` package also allows sampling of weighted directed networks.
The simplest example of this is "constant" weighting,
where we simply multiply each block of the adjacency matrix by a constant.


```{r}
weight_matrix <- matrix(c(
  5, 2,
  2, 5
), nrow = 2, byrow = TRUE)
sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "constant")
```

### Poisson-weighted directed stochastic block models

We can also use weights drawn randomly from a Poisson distribution,
where each block in the adjacency matrix has its own mean parameter.
This returns an adjacency matrix with weights which could be any natural
number, but is equal in expectation to the constant version.
Note that in this scheme it is possible for the weight to be zero,
removing an edge which might have otherwise been present.

```{r}
sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "poisson")
```

## Bipartite stochastic block models

The `motifcluster` package can also be used to sample
bipartite networks.
The vertices of a bipartite network are partitioned
into "source" and "destination" vertices,
and edges are only permitted to go from source vertices
to destination vertices.
Let's sample a DSBM with a single block of two source vertices
and two blocks of destination vertices, with
sizes of three and two respectively.
We can use a strong connection probability of 0.9 to the first block of
destination vertices,
and a weaker connection probability of 0.2 to the second.

```{r}
source_block_sizes <- c(2)
destination_block_sizes <- c(3, 2)
bipartite_connection_matrix <- matrix(c(0.9, 0.2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes, bipartite_connection_matrix)
```

### Weighted bipartite stochastic block models

Similarly to the more general directed stochastic block models,
we can also use constant-weighted or Poisson-weighted edges for bipartite
stochastic block models.

```{r}
bipartite_weight_matrix <- matrix(c(7, 2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes, bipartite_connection_matrix,
  bipartite_weight_matrix, sample_weight_type = "poisson")
```


# Spectral embedding with motif adjacency matrices

Spectral methods involve performing eigenvalue and
eigenvector operations on matrices related to networks.
We work here with weighted *undirected* networks
(which have symmetric adjacency matrices),
because motif adjacency matrices are always symmetric.

## Laplacian matrices

We can construct two types of Laplacian matrix for a network
using the `motifcluster` package.
First we create an example of a weighted undirected network $\mathcal{G}_2$.

\pagebreak

```{r}
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
```

### Combinatorial Laplacian

The combinatorial Laplacian of an adjacency matrix $G$ is
$L_\mathrm{c} = D - G$,
where $D$ is the diagonal matrix of weighted vertex degrees:

```{r}
build_laplacian(G2, type_lap = "comb")
```

### Random-walk Laplacian

The random-walk Laplacian of an adjacency matrix $G$ is
$L_\mathrm{rw} = I - D^{-1}G$,
where $D$ is the diagonal matrix of weighted vertex degrees
and $I$ is the identity matrix:

```{r}
build_laplacian(G2, type_lap = "rw")
```

## Laplace embedding

Once we have constructed the desired Laplacian matrix,
we use it to embed each vertex into $\mathbb{R}^l$
by finding the eigenvectors associated with
its first (smallest magnitude) few eigenvalues.
Below we use the random-walk Laplacian,
and embedding dimension $l=2$:

```{r}
spectrum <- run_laplace_embedding(G2, num_eigs = 2, type_lap = "rw")
spectrum$vals
spectrum$vects
```

For a random-walk Laplacian,
the first eigenvalue is always zero (up to machine precision)
and its corresponding eigenvector is constant.

## Motif embedding

Motif embedding is simply the process of building an MAM
and performing Laplace embedding with it.
As an example we use the
`run_motif_embedding`
function on the network $\mathcal{G}_3$ below.

```{r}
G3 <- matrix(c(
  0, 0, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
```

An artifact of building MAMs is that although the original network may
be connected, there is no guarantee that the MAM is also connected.
Hence the MAM is restricted to its largest connected component
before the Laplacian is formed.
We observe this with the network $\mathcal{G}_3$,
in which only three of the four vertices are embedded.

```{r}
spectrum <- run_motif_embedding(G3, motif_name = "M1", motif_type = "func",
  mam_weight_type = "unweighted", mam_method = "sparse", num_eigs = 2,
  restrict = TRUE, type_lap = "rw")
spectrum$vals
spectrum$vects
```


# Motif-based spectral clustering

The overall aim of `motifcluster` is to use motifs for spectral clustering,
so now we see how to extract clusters from the motif-based
eigenvector embeddings.
The `run_motif_clustering` function handles the entire process of
building an MAM, restricting it to its largest connected component,
performing eigenvector embedding, and extracting clusters.
We therefore take the opportunity to showcase the ability of
`motifcluster` to recover the blocks of a DSBM,
demonstrating all of the methods outlined in this vignette.

Let's use a DSBM with three blocks of 10 nodes each.

```{r}
block_sizes <- rep(10, 3)
```

We use strong connections of 0.8 within the blocks,
and weaker connections of 0.3 between the blocks.

```{r}
connection_matrix <- matrix(c(
  0.8, 0.2, 0.2,
  0.2, 0.8, 0.2,
  0.2, 0.2, 0.8
), nrow = 3)
```

We also set the within-block edges to be Poisson-weighted with
mean 20,
and the between-block edges to be Poisson-weighted with smaller
mean 10.

```{r}
weight_matrix <- matrix(c(
  20, 10, 10,
  10, 20, 10,
  10, 10, 20
), nrow = 3)

G4 <- sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "poisson")
```

Now we can run the motif-based spectral clustering algorithm
on this network with the 3-cycle motif $\mathcal{M}_1$.
We build a functional MAM,
weighting the instances by their mean edge weights,
using the sparse formulation.
We restrict this MAM to its largest connected component.
Then we construct a random-walk Laplacian and embed it using the
first four eigenvalues and eigenvectors.
Finally we extract three clusters.

```{r}
motif_cluster <- run_motif_clustering(G4, motif_name = "M1", motif_type = "func",
  mam_weight_type = "mean", mam_method = "sparse", type_lap = "rw",
  num_eigs = 4, num_clusts = 3
)
```

We can evaluate the performance by comparing it to the ground-truth
labels using the adjusted Rand index
from the `mclust` package:

```{r}
truth <- c(rep(1, 10), rep(2, 10), rep(3, 10))
mclust::adjustedRandIndex(motif_cluster$clusts, truth)
```

A larger value indicates better recovery of the blocks,
with a value of one indicating perfect agreement.


[arxivpaper]: https://arxiv.org/abs/2004.01293


# References


```{r, include = FALSE}
options(old_options)
#nolint end
```