The best advice on using the clustering functions in
clusterExperiment for large datasets is to avoid
calculating any \(NxN\) distance or
similarity matrix. Such matrices take a long time to calculate and a
large amount of memory to store.
The most likely reason to calculate such a matrix is the clustering routine used: methods like PAM or hierarchical clustering require a distance matrix and are therefore not good choices for large datasets.
Prior to version 2.5.5, our functions would internally
calculate a distance matrix if the clustering algorithm needed it, and
it was hard for the user to realize they had selected a clustering
routine that required such a matrix. We have now added the argument
makeMissingDiss which, if FALSE, will not
calculate any needed distance matrices and will instead return an error. We
recommend setting this argument to FALSE with large
datasets as a precaution. If you then hit an error,
select a different clustering algorithm that does not need a \(NxN\) distance matrix.
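As a minimal sketch of this advice (the data object `se` below is a toy placeholder, and exact argument defaults may differ across versions of clusterExperiment):

```r
library(clusterExperiment)

## Toy placeholder data: 100 features x 60 samples.
set.seed(1)
se <- matrix(rnorm(100 * 60), nrow = 100)

## Use a routine ("kmeans") that clusters the data matrix directly, and
## set makeMissingDiss=FALSE so that any routine that would silently
## require an NxN dissimilarity errors out instead of computing it.
ce <- clusterSingle(se,
                    mainClusterArgs = list(clusterFunction = "kmeans",
                                           clusterArgs = list(k = 5)),
                    makeMissingDiss = FALSE)
```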
Note that this check may not catch PAM, because PAM takes as
input a data matrix \(x\) (see
?pam). But if an \(x\)
matrix is given as input, the pam function simply
calculates the distance matrix internally! The option
makeMissingDiss=FALSE may not catch this, since the actual
clustering function allows an input matrix \(x\). (This is unfortunate for large
datasets, and we may in the future change how we classify the possible
input into PAM, classifying it as a method that only accepts distance
matrices so that it is caught by
makeMissingDiss=FALSE.)
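To see the issue concretely, cluster::pam accepts a raw data matrix and then builds the \(NxN\) dissimilarity itself; a small sketch:

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)  # 200 samples, 2 features

## Given a raw matrix, pam() computes the full NxN dissimilarity
## internally, so passing x instead of dist(x) does not avoid the
## NxN cost for large N.
resFromMatrix <- pam(x, k = 3)
resFromDiss   <- pam(dist(x), k = 3)   # same clustering, distances precomputed
```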
Similarly, any options involving silhouette distance will
create a \(NxN\) matrix as part of the
silhouette computation in the cluster package. This includes
the findBestK=TRUE argument. These options should only be
considered for moderately sized datasets where the calculation (and
storage) of the \(NxN\) matrix is not a
problem.
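For orientation, a hedged sketch of requesting findBestK (passing it via mainClusterArgs is an assumption here; see ?mainClustering), appropriate only when \(N\) is moderate:

```r
library(clusterExperiment)

## Toy placeholder data: 100 features x 60 samples.
set.seed(1)
se <- matrix(rnorm(100 * 60), nrow = 100)

## findBestK=TRUE picks k by silhouette width over kRange, which
## requires the NxN dissimilarity -- acceptable here (N = 60), but
## costly for large datasets.
ce <- clusterSingle(se,
                    mainClusterArgs = list(clusterFunction = "pam",
                                           findBestK = TRUE,
                                           kRange = 2:6))
```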
Unfortunately, subsampling and consensus clustering (with
makeConsensus) operate by clustering based on the
proportion of shared clusterings per pair of samples, which past
versions of clusterExperiment stored as a \(NxN\) matrix (see the main tutorial
vignette for an explanation of these methods). While we are working on
methods to avoid calculating this matrix, they do not yet completely
avoid the \(NxN\)
matrix.
We have, however, made some infrastructure changes in version 2.5.5
to allow avoiding the \(NxN\) matrix for subsampling and consensus
clustering if the user has defined a clustering function that does so (see
details below).
We have also in version 2.5.5 changed the clustering
functions to allow the user to request clustering of only unique
representations of the combinations of clusterings from subsampling or
in makeConsensus, significantly reducing the size of the
\(NxN\) matrix used in the actual
clustering step (see below).
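A hedged sketch of such a request (the removeDup argument is described in detail below; the exact placement of the arguments is an assumption):

```r
library(clusterExperiment)

## Toy placeholder data: 100 features x 60 samples.
set.seed(1)
se <- matrix(rnorm(100 * 60), nrow = 100)

## Subsample-and-cluster, then have the main clustering step work only
## on the unique combinations of clusterings (removeDup=TRUE),
## shrinking the matrix that is actually clustered.
ce <- clusterSingle(se, subsample = TRUE, sequential = FALSE,
                    subsampleArgs = list(clusterFunction = "kmeans",
                                         clusterArgs = list(k = 5)),
                    mainClusterArgs = list(clusterFunction = "hierarchical01",
                                           clusterArgs = list(removeDup = TRUE)))
```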
Here we document some infrastructure changes made to allow for avoidance of the \(NxN\) matrix for subsampling and consensus clustering. These do not, as of yet, actually provide the ability to avoid the \(NxN\) calculation for the clustering, but do set up an infrastructure where the user can now provide the appropriate clustering routine to avoid it.
Prior to version 2.5.5, the
results of subsampling would be saved as a \(NxN\) matrix, corresponding to
the proportion of times two samples were clustered together across the
\(B\) subsamples. As of
2.5.5, the results are simply saved as a \(NxB\) matrix, giving the (integer-valued)
cluster assignments of each sample in each subsample. This \(NxB\) matrix must itself be clustered to
get anything interesting, and whether the clustering of that matrix will
require calculating a \(NxN\) matrix depends on the clustering routine set
in the mainClusterArgs (see below).

Also as of 2.5.5, the makeConsensus command now expects
clustering techniques that will work directly on the \(NxB\) matrices of clusterings, rather than
directly calculating the \(NxN\)
matrix. Again, this requires a clustering routine that works on a \(NxB\) matrix of clusterings, and whether
the clustering of that matrix will require calculating a NxN matrix
depends on the clustering routine (see below).

The built-in clustering functions that accept clusterings as input
(inputType="cat", see
?ClusterFunction) currently do this by simply internally
calculating the \(NxN\) matrix (and
this is NOT controlled by the makeMissingDiss argument, since the
actual clustering function that is called calculates it, not the
clusterExperiment infrastructure – similar to PAM above).
We are working on creating a clustering routine that avoids this step;
if the user has such a clustering routine, they can provide it
to our functions (see the main vignette and
?ClusterFunction for how to integrate a user-defined
function).

If requested by the user, in makeConsensus only the \(M\) unique combinations of
clusters are clustered; this can affect the results, since it ignores
the number of samples represented by each of the \(M\) combinations (important for methods
like kmeans that take averages across the samples). However, it can
dramatically reduce the size, no longer requiring calculation or storage
of all the dissimilarities between identically clustered samples. To
choose this option, set clusterArgs=list(removeDup=TRUE) in
the list of arguments passed to either mainClusterArgs
or subsampleArgs. This can also be done for the clustering
function of subsampling, but is likely to lead to much less of a
reduction in size.

The \(NxN\) matrix need no longer be stored in the coClustering slot of
the ClusterExperiment object. Instead we allow for either
storage of the \(NxN\) matrix or the
\(NxB\) matrix, or even just the
indices of the clusterings that make up the \(NxB\) matrix. This slot was primarily used
for the plotCoClustering command (basically a heatmap of
the \(NxN\) matrix), which is unlikely
to be of practical use for extremely large datasets. However, the
plotCoClustering command will calculate that \(NxN\) matrix on the fly from the \(NxB\) matrix that is stored, so again
should be avoided for large datasets.

The package is compatible with HDF5 matrices, meaning that the package will run if the data given is a reference to an HDF5 file. However, the code may achieve this compatibility simply by bringing the full matrix into memory. In particular, the default clustering routines do not work natively with the HDF5 representation, meaning that they must bring the full dataset into memory for their calculations.
The only exception to this is the method “mbkmeans”, which calls
the clustering routine from the package of the same name. This package
implements a version of kmeans (“Mini-Batch kmeans”) that truly works
with the structure of the HDF5 datasets to avoid bringing the full
dataset into memory. “Mini-batch kmeans” refers to only using a
proportion of the data (a “batch”) at each iteration of the clustering.
The mbkmeans package integrates this with HDF5 files, among
other formats, meaning that mbkmeans actually is written (in C code) so
as to not bring the entire dataset into memory but only the subset (or
batch) needed for any particular calculation.
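A hedged sketch of using the built-in “mbkmeans” option on HDF5-backed data (the file name, dataset name, and the clusterArgs argument name are assumptions, not tested code):

```r
library(clusterExperiment)
library(HDF5Array)

## Hypothetical on-disk data: "counts.h5" containing a dataset "counts".
## The HDF5Array object is only a reference; the data stay on disk.
hdf5Counts <- HDF5Array("counts.h5", "counts")

## With clusterFunction="mbkmeans", mini-batch kmeans reads only the
## batch needed per iteration rather than loading the full matrix.
ce <- clusterSingle(hdf5Counts,
                    mainClusterArgs = list(clusterFunction = "mbkmeans",
                                           clusterArgs = list(k = 10)))
```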
Unlike the mbkmeans package itself, however, the integration in
clusterExperiment has not been tested to ensure that the
full dataset is not inadvertently brought into memory by other
components of the clusterExperiment infrastructure. This is an
ongoing area for improvement. (So far, the integration of
mbkmeans as a built-in option in
clusterExperiment has only been tested to the extent of
verifying that it successfully runs the clustering routine.)
Further comments:
If using mbkmeans with subsample=TRUE, note that the “classify”
step (i.e. the assignment of samples that were not part of the
subsample to a cluster) is not part of mbkmeans, and it may
bring the entire matrix into memory (when classify is All
or OutOfSample).
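Given that caveat, a hedged sketch of keeping classification in-sample during subsampling (the argument name classifyMethod and the value "InSample" are assumptions; see ?subsampleClustering):

```r
library(clusterExperiment)

## Toy placeholder data standing in for an HDF5-backed matrix.
set.seed(1)
se <- matrix(rnorm(100 * 60), nrow = 100)

## classifyMethod="InSample" assigns clusters only to the samples drawn
## in each subsample, avoiding the out-of-sample classification step
## that may pull the full matrix into memory.
ce <- clusterSingle(se, subsample = TRUE, sequential = FALSE,
                    subsampleArgs = list(clusterFunction = "mbkmeans",
                                         classifyMethod = "InSample"),
                    mainClusterArgs = list(clusterFunction = "hierarchical01"))
```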