---
title: "Getting started with moc.gapbk"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with moc.gapbk}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
```

## Overview

The `moc.gapbk` package implements the **Multi-Objective Clustering Algorithm Guided by a-Priori Biological Knowledge** (MOC-GaPBK) proposed by Parraga-Alava and others (2018). The algorithm combines:

* NSGA-II as the underlying multi-objective evolutionary engine,
* Path-Relinking as an intensification strategy, and
* Pareto Local Search as a diversification strategy.

It receives two distance matrices and produces a set of non-dominated clustering solutions. The second matrix is typically used to encode a-priori biological knowledge (for example, semantic similarity between genes).

## Basic usage

```{r basic}
library(moc.gapbk)

set.seed(2025)

# Toy data: 50 objects (e.g. genes) described by 20 features (e.g. samples).
x <- matrix(stats::runif(50 * 20, min = -5, max = 10), nrow = 50, ncol = 20)

# Two distance matrices over the same set of objects.
# Here we use amap if available (correlation distance is biologically
# common), and fall back to base R otherwise so the vignette knits
# under any configuration.
if (requireNamespace("amap", quietly = TRUE)) {
  d1 <- as.matrix(amap::Dist(x, method = "euclidean"))
  d2 <- as.matrix(amap::Dist(x, method = "correlation"))
} else {
  d1 <- as.matrix(stats::dist(x, method = "euclidean"))
  d2 <- as.matrix(stats::dist(x, method = "manhattan"))
}

res <- moc.gapbk(dmatrix1 = d1, dmatrix2 = d2,
                 num_k = 3, generation = 5, pop_size = 6)
```

### Pareto-front population

`res$population` contains the medoids that survived the last generation, together with the values of the two objective functions, the Pareto ranking and the crowding distance.

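To make the Pareto ranking easier to interpret, the following sketch shows how non-dominated solutions can be picked out from two objective values. It is an illustration on made-up numbers, not the package's internal code: the data frame `obj`, its columns `f1` and `f2`, and the helper `is_dominated()` are hypothetical names, and both objectives are assumed to be minimised.

```{r pareto-sketch}
# Hypothetical objective values for five candidate solutions.
# Assumption: lower is better for both f1 and f2.
obj <- data.frame(f1 = c(1.0, 2.0, 3.0, 2.5, 4.0),
                  f2 = c(5.0, 3.0, 1.0, 3.5, 4.5))

# Row i is dominated when some row is no worse on both objectives
# and strictly better on at least one of them.
is_dominated <- function(i, obj) {
  no_worse <- obj$f1 <= obj$f1[i] & obj$f2 <= obj$f2[i]
  better   <- obj$f1 <  obj$f1[i] | obj$f2 <  obj$f2[i]
  any(no_worse & better)
}

dominated <- vapply(seq_len(nrow(obj)), is_dominated, logical(1), obj = obj)

# Keep only the non-dominated rows (the first Pareto front).
obj[!dominated, ]
```

Rows 1 to 3 are kept because no other row is at least as good on both objectives; rows 4 and 5 are dominated by row 2. The first rows of the population actually returned by `moc.gapbk()` above can be inspected directly:
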
```{r population}
head(res$population)
```

### Cluster assignments per solution

`res$matrix.solutions` is a data frame whose columns are the clustering assignments produced by each non-dominated solution.

```{r matrix-solutions}
head(res$matrix.solutions)
```

### Convenient per-solution vectors

`res$clustering` exposes the same information as a list of named integer vectors, ready to be passed to validation indices, plotting helpers, etc.

```{r clustering-vec}
str(res$clustering[[1]])
table(res$clustering[[1]])
```

## Enabling Path-Relinking and Pareto Local Search

The full algorithm activates the intensification and diversification strategies through the `local_search` argument. Because the cost of Pareto Local Search is quadratic in the size of the Pareto front, the option is left disabled elsewhere in this vignette, and the example below is shown but not evaluated.

```{r local-search, eval = FALSE}
res_full <- moc.gapbk(d1, d2, num_k = 3, generation = 10, pop_size = 10,
                      local_search = TRUE, cores = 2)
```

## Tips for biological applications

In bioinformatics workflows, `dmatrix1` is usually a distance derived from numerical expression profiles (for example, correlation or Euclidean distance on log-expression values), while `dmatrix2` is a distance derived from a-priori biological knowledge (for example, semantic similarity between Gene Ontology terms). The Xie-Beni validity index is computed independently on each matrix, and each resulting value serves as one of the two objective functions of the NSGA-II engine.

## Backward compatibility

Versions before 0.2.0 exported the function as `moc.gabk` (with a single `p`). That name is preserved as a deprecated alias and emits a warning; all new code should call `moc.gapbk` directly.

## References

Parraga-Alava, J., Dorn, M., Inostroza-Ponta, M. (2018). A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. *BioData Mining* 11(1), 1-16.