--- title: "Using the R package `motifcluster`" author: "William G. Underwood" date: "`r format(Sys.Date(), '%B %d, %Y')`" output: pdf_document: citation_package: natbib number_sections: true highlight: tango keep_tex: TRUE bibliography: refs.bib vignette: > %\VignetteIndexEntry{using_the_motifcluster_package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#> " ) set.seed(200401293) old_options <- options() options(digits = 2) library(mclust) # nolint start: line_length_linter ``` \tableofcontents \pagebreak # Introduction This vignette demonstrates how to use the R package `motifcluster`. These methods are detailed in the paper *Motif-Based Spectral Clustering of Weighted Directed Networks*, which is available at [`arXiv:2004.01293`](arxivpaper) [@UnderwoodElliottCucuringu_2020_Motifs]. The functionality of the `motifcluster` package falls into a few main categories: - Building motif adjacency matrices - Sampling random weighted directed networks - Spectral embedding with motif adjacency matrices - Motif-based spectral clustering This vignette comprehensibly demonstrates all of these functionalities, showcasing the full capability of the `motifcluster` package. We first load some helpful packages for this tutorial: ```{r, eval = FALSE} library(devtools) library(mclust) ``` The `motifcluster` package can be installed from GitHub with: ```{r, eval = FALSE} install_github("wgunderwood/motifcluster/R") ``` The package can then be loaded from within R with: ```{r, results = "hide"} library(motifcluster) ``` # Building motif adjacency matrices The main novelty in the `motifcluster` package is its ability to build a wide variety of motif adjacency matrices (MAMs), and to do so quickly. There are several options to consider when building an MAM, which are covered in this section. ## An example network In order to demonstrate the construction of MAMs, we first need a small weighted directed network $\mathcal{G}_1$ to use as an example. Note that throughout this package we represent networks by their weighted directed adjacency matrices (possibly in sparse form). This means that for use alongside R packages such as `igraph`, one must manually convert between adjacency matrices and `igraph` objects. ```{r} G1 <- matrix(c( 0, 2, 0, 0, 0, 0, 2, 3, 0, 4, 0, 0, 4, 0, 5, 0 ), byrow = TRUE, nrow = 4) ``` ## Basic motif adjacency matrix construction The `build_motif_adjacency_matrix` function is the main workhorse for building MAMs with `motifcluster`. Let's use it to build an MAM for the network $\mathcal{G}_1$. First we must choose a motif to look for. A full list can be obtained with: ```{r} get_motif_names() ``` Let's use the 3-cycle motif $\mathcal{M}_1$. ```{r} build_motif_adjacency_matrix(G1, motif_name = "M1") ``` Note that all the entries are zero except for entries $(1,2)$, $(1,4)$, $(2,1)$, $(2,4)$, $(4,1)$, and $(4,2)$. This is because vertices 1, 2 and 4 form an exact copy of the motif $\mathcal{M}_1$ in the network $\mathcal{G}_1$, and the $(i,j)$th MAM entry simply counts the number of instances containing both vertices $i$ and $j$. ## Functional and structural motif adjacency matrices Looking at our example network $\mathcal{G}_1$ again, you might notice that there is seemingly another instance of the motif $\mathcal{M}_1$ in our network $\mathcal{G}_1$, on the vertices 2, 3 and 4, albeit with an "extra" edge from 2 to 3. The reason for this is that we instructed `build_motif_adjacency_matrix` to look for *structural* motif instances (this is the default). Structural instances require an exact match, with no extra edges. If we want to also include instances which may have extra edges present, we must instead use functional motif instances: ```{r} build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func") ``` This time we also pick up the 3-cycle on vertices 2, 3 and 4. Vertices 2 and 4 therefore occur in two distinct instances of the motif, and so their motif adjacency matrix entries are equal to two. ## Weighted motif adjacency matrices Our example network $\mathcal{G}_1$ has weighted edges, which we have not yet used: so far our MAMs have been simply counting instances of motifs. This is because the default weighting scheme is "unweighted", assigning every instance a weight of one. ### Mean-weighted instances We could instead use the "mean" weighting scheme, where every instance is assigned a weight equal to its mean edge weight. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$: ```{r} build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func", mam_weight_type = "mean") ``` The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so its mean edge weight is 3. Similarly the 3-cycle on vertices 2, 3 and 4 has mean edge weight of 4. Vertices 2 and 4 appear in both, so their mutual MAM entries are the sum of these two mean weights, which is 7. ### Product-weighted instances We can also use the "product" weighting scheme, where every instance is assigned a weight equal to the product of its edge weights. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$: ```{r} build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func", mam_weight_type = "product") ``` The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so the product of its edge weights is 24. Similarly the 3-cycle on vertices 2, 3 and 4 has product of edge weights of 60. Vertices 2 and 4 appear in both, so their shared MAM entries are the sum of these two product weights, which is 84. ### Computation method The final argument to `build_motif_adjacency_matrix` is the `mam_method` argument. This does not affect the value returned but may impact the amount of time taken to return the MAM. In general the "sparse" method (the default) is faster on large sparse networks. The "dense" method, which uses fewer operations but on denser matrices, tends to be faster for small dense networks. ```{r} mam_sparse <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "sparse") mam_dense <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "dense") all(mam_sparse == mam_dense) ``` # Sampling random weighted directed networks Building adjacency matrices by hand is tedious, so it is useful to have methods for generating the adjacency matrices of networks drawn from some probabilistic model. We use (weighted) directed stochastic block models (DSBMs) and (weighted) bipartite stochastic block models (BSBMs). ## Directed stochastic block models First let's sample the adjacency matrix of a DSBM which has two blocks of vertices; the first containing five vertices and the second containing three. We use strong within-block connections, with the diagonal entries of the connection matrix set to $0.9$. The between-block connections are weaker, with the off-diagonal connection matrix entries set to $0.2$. Note how the resulting adjacency matrix is denser on its diagonal blocks $\{1, \dots, 5\} \times \{1, \dots, 5\}$ and $\{6, \dots, 8\} \times \{6, \dots, 8\}$, and is sparser on its off-diagonal blocks $\{1, \dots, 5\} \times \{6, \dots, 8\}$ and $\{6, \dots, 8\} \times \{1, \dots, 5\}$. The entries which lie exactly on the diagonal will always be zero, since we only consider networks without self-loops. ```{r} block_sizes <- c(5, 3) connection_matrix <- matrix(c( 0.9, 0.2, 0.2, 0.9 ), nrow = 2, byrow = TRUE) ``` \pagebreak ```{r} sample_dsbm(block_sizes, connection_matrix) ``` ### Constant-weighted directed stochastic block models The matrix above has binary entries, indicating that it is the adjacency matrix of an unweighted directed network. The `motifcluster` package also allows sampling of weighted directed networks. The simplest example of this is "constant" weighting, where we simply multiply each block of the adjacency matrix by a constant. ```{r} weight_matrix <- matrix(c( 5, 2, 2, 5 ), nrow = 2, byrow = TRUE) sample_dsbm(block_sizes, connection_matrix, weight_matrix, sample_weight_type = "constant") ``` ### Poisson-weighted directed stochastic block models We can also use weights drawn randomly from a Poisson distribution, where each block in the adjacency matrix has its own mean parameter. This returns an adjacency matrix with weights which could be any natural number, but is equal in expectation to the constant version. Note that in this scheme it is possible for the weight to be zero, removing an edge which might have otherwise been present. ```{r} sample_dsbm(block_sizes, connection_matrix, weight_matrix, sample_weight_type = "poisson") ``` ## Bipartite stochastic block models The `motifcluster` package can also be used to sample bipartite networks. The vertices of a bipartite network are partitioned into "source" and "destination" vertices, and edges are only permitted to go from source vertices to destination vertices. Let's sample a DSBM with a single block of two source vertices and two blocks of destination vertices, with sizes of three and two respectively. We can use a strong connection probability of 0.9 to the first block of destination vertices, and a weaker connection probability of 0.2 to the second. ```{r} source_block_sizes <- c(2) destination_block_sizes <- c(3, 2) bipartite_connection_matrix <- matrix(c(0.9, 0.2), nrow = 1) sample_bsbm(source_block_sizes, destination_block_sizes, bipartite_connection_matrix) ``` ### Weighted bipartite stochastic block models Similarly to the more general directed stochastic block models, we can also use constant-weighted or Poisson-weighted edges for bipartite stochastic block models. ```{r} bipartite_weight_matrix <- matrix(c(7, 2), nrow = 1) sample_bsbm(source_block_sizes, destination_block_sizes, bipartite_connection_matrix, bipartite_weight_matrix, sample_weight_type = "poisson") ``` # Spectral embedding with motif adjacency matrices Spectral methods involve performing eigenvalue and eigenvector operations on matrices related to networks. We work here with weighted *undirected* networks (which have symmetric adjacency matrices), because motif adjacency matrices are always symmetric. ## Laplacian matrices We can construct two types of Laplacian matrix for a network using the `motifcluster` package. First we create an example of a weighted undirected network $\mathcal{G}_2$. \pagebreak ```{r} G2 <- matrix(c( 0, 2, 0, 0, 2, 0, 4, 3, 0, 4, 0, 5, 0, 3, 5, 0 ), byrow = TRUE, nrow = 4) ``` ### Combinatorial Laplacian The combinatorial Laplacian of an adjacency matrix $G$ is $L_\mathrm{c} = D - G$, where $D$ is the diagonal matrix of weighted vertex degrees: ```{r} build_laplacian(G2, type_lap = "comb") ``` ### Random-walk Laplacian The random-walk Laplacian of an adjacency matrix $G$ is $L_\mathrm{rw} = I - D^{-1}G$, where $D$ is the diagonal matrix of weighted vertex degrees and $I$ is the identity matrix: ```{r} build_laplacian(G2, type_lap = "rw") ``` ## Laplace embedding Once we have constructed the desired Laplacian matrix, we use it to embed each vertex into $\mathbb{R}^l$ by finding the eigenvectors associated with its first (smallest magnitude) few eigenvalues. Below we use the random-walk Laplacian, and embedding dimension $l=2$: ```{r} spectrum <- run_laplace_embedding(G2, num_eigs = 2, type_lap = "rw") spectrum$vals spectrum$vects ``` For a random-walk Laplacian, the first eigenvalue is always zero (up to machine precision) and its corresponding eigenvector is constant. ## Motif embedding Motif embedding is simply the process of building an MAM and performing Laplace embedding with it. As an example we use the `run_motif_embedding` function on the network $\mathcal{G}_3$ below. ```{r} G3 <- matrix(c( 0, 0, 0, 0, 0, 0, 2, 3, 0, 4, 0, 0, 4, 0, 5, 0 ), byrow = TRUE, nrow = 4) ``` An artifact of building MAMs is that although the original network may be connected, there is no guarantee that the MAM is also connected. Hence the MAM is restricted to its largest connected component before the Laplacian is formed. We observe this with the network $\mathcal{G}_3$, in which only three of the four vertices are embedded. ```{r} spectrum <- run_motif_embedding(G3, motif_name = "M1", motif_type = "func", mam_weight_type = "unweighted", mam_method = "sparse", num_eigs = 2, restrict = TRUE, type_lap = "rw") spectrum$vals spectrum$vects ``` # Motif-based spectral clustering The overall aim of `motifcluster` is to use motifs for spectral clustering, so now we see how to extract clusters from the motif-based eigenvector embeddings. The `run_motif_clustering` function handles the entire process of building an MAM, restricting it to its largest connected component, performing eigenvector embedding, and extracting clusters. We therefore take the opportunity to showcase the ability of `motifcluster` to recover the blocks of a DSBM, demonstrating all of the methods outlined in this vignette. Let's use a DSBM with three blocks of 10 nodes each. ```{r} block_sizes <- rep(10, 3) ``` We use strong connections of 0.8 within the blocks, and weaker connections of 0.3 between the blocks. ```{r} connection_matrix <- matrix(c( 0.8, 0.2, 0.2, 0.2, 0.8, 0.2, 0.2, 0.2, 0.8 ), nrow = 3) ``` We also set the within-block edges to be Poisson-weighted with mean 20, and the between-block edges to be Poisson-weighted with smaller mean 10. ```{r} weight_matrix <- matrix(c( 20, 10, 10, 10, 20, 10, 10, 10, 20 ), nrow = 3) G4 <- sample_dsbm(block_sizes, connection_matrix, weight_matrix, sample_weight_type = "poisson") ``` Now we can run the motif-based spectral clustering algorithm on this network with the 3-cycle motif $\mathcal{M}_1$. We build a functional MAM, weighting the instances by their mean edge weights, using the sparse formulation. We restrict this MAM to its largest connected component. Then we construct a random-walk Laplacian and embed it using the first four eigenvalues and eigenvectors. Finally we extract three clusters. ```{r} motif_cluster <- run_motif_clustering(G4, motif_name = "M1", motif_type = "func", mam_weight_type = "mean", mam_method = "sparse", type_lap = "rw", num_eigs = 4, num_clusts = 3 ) ``` We can evaluate the performance by comparing it to the ground-truth labels using the adjusted Rand index from the `mclust` package: ```{r} truth <- c(rep(1, 10), rep(2, 10), rep(3, 10)) mclust::adjustedRandIndex(motif_cluster$clusts, truth) ``` A larger value indicates better recovery of the blocks, with a value of one indicating perfect agreement. [arxivpaper]: https://arxiv.org/abs/2004.01293 # References ```{r, include = FALSE} options(old_options) #nolint end ```