One essential functionality in the analysis of large scale datasets is the ability to perform dimensionality reduction techniques and plot the outputs, such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAPs). The mesa package currently supports functionality for PCAs and UMAPs.
We will use the exampleTumourNormal data object for these examples.
First we will show the functionality for PCA, which is a linear dimensionality reduction technique. PCA is performed with mesa using a two-step approach:
getPCA() function to generate the PCA resultsplotPCA() function.Please note that the qsea package also has functions with the same names; the mesa functions are replacements for these.
The function getPCA calculates a PCA on a qseaSet:
pca0 <- getPCA(exampleTumourNormal)
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------By default, this will perform the PCA using beta values, and consider the 1000 most variable windows across all samples (or all windows if there are less than 1000).
We then can use the plotPCA() function to plot the result.
#>
#> $pca1$PC2vsPC3
The output of getPCA() is the PCA result object (technically a mesaDimRed object). This includes a copy of the sampleTable, detached from the original qseaSet.
As shown above, plotPCA() returns two separate plots by default; one for PC1 vs PC2 and another for PC2 vs PC3.
In the same way as for a qseaSet, the sampleTable may be updated directly, using mutate or left_join.
PlotPCA function argumentsThe colour and shape arguments to plotPCA() specify variables to use to set the colour and/or shape of the points (samples) on the PCA plot. These variables need to be in the qseaSet sample table.
Multiple variables can be specified for colour and a PCA plot will be produced for each variable.
Only one variable can be specified for shape.
plotPCA(pca1,
colour = c("patient", "age", "transformed_age"),
shape = "tumour")
#> $pca1
#> $pca1$patient
#> $pca1$patient$PC1vsPC2#>
#> $pca1$patient$PC2vsPC3
#>
#>
#> $pca1$age
#> $pca1$age$PC1vsPC2
#>
#> $pca1$age$PC2vsPC3
#>
#>
#> $pca1$transformed_age
#> $pca1$transformed_age$PC1vsPC2
#>
#> $pca1$transformed_age$PC2vsPC3
Here, many PCA plots have been produced and displayed (PC1 vs PC2 and PC2 vs PC2 for three different colour variable annotations. To have more control over which plots are displayed you can save the plots to a variable and then choose which ones to display.
PCAplots1 <- plotPCA(pca1,
colour = c("patient", "age", "transformed_age"),
shape = "tumour")
PCAplots1$pca1$patient$PC1vsPC2The function aims to select sensible colour and shape palettes based on the properties of the variable being used.
For the transformed_age variable (that was generated in the mutate example above), a diverging colour palette is used, but it is not symmetric around zero. That is, the darkest blue maps to approx -0.5 whereas the darkest red maps to approx 2.
Depending on the variable, it may be appropriate that the darkest blue maps to -2 (i.e. equivalent to the darkest red mapping to 2), and for this the symDivColourScale argument can be set to TRUE, as seen below (and we also replot the above version for comparison).
PCAplots2 <- plotPCA(pca1, colour = "transformed_age", shape = "tumour", symDivColourScale = TRUE)
PCAplots1$pca1$transformed_age$PC1vsPC2By default, filled shapes (21-25) with a black outline are used for points unless a categorical variable with more than 5 categories (including NA) is set to the shape, in which case the unfilled shapes are used (0-20)*. Filled shapes are generally preferred, as they remain more distinguishable when many points overlap.
*Note: Shapes 19 and 20 are excluded from default palettes as they are difficult to distinguish from shape 16.
As a rule, we do not allow mixing of filled and unfilled shapes on the same plot. This is because the variable provided to the colour argument is mapped to the fill aesthetic for filled shapes and the colour aesthetic for unfilled shapes in the underlying ggplot2 geom.
plotPCA(pca1,
colour = "tumour",
shape = "tissue")
#> $pca1
#> $pca1$tumour
#> $pca1$tumour$PC1vsPC2#>
#> $pca1$tumour$PC2vsPC3
#>
#> $pca1$tumour$PC2vsPC3
It is possible to specify your own colour and shape palettes using the colourPalette and shapePalette arguments.
plotPCA(pca1,
colour = "tumour",
colourPalette = c("firebrick","blue"),
shape = "tumour",
shapePalette = c(15, 8))
#> $pca1
#> $pca1$tumour
#> $pca1$tumour$PC1vsPC2#>
#> $pca1$tumour$PC2vsPC3
These can be a named vector, in order to specify which value of the annotation gets which colour/shape:
plotPCA(pca1,
colour = "tumour",
colourPalette = c("Tumour" = "firebrick", "Normal" = "blue"),
shape = "tumour",
shapePalette = c("Tumour" = 15, "Normal" = 8))
#> $pca1
#> $pca1$tumour
#> $pca1$tumour$PC1vsPC2#>
#> $pca1$tumour$PC2vsPC3
If a variable used for colour annotation contains NA values then these are displayed with a dark grey colour by default (this can be changed using the NAcolour argument).
If a variable used for shape annotation contains NA values then these are displayed explicitly with shape 25 by default for a filled shape palette and shape 7 by default for an unfilled shape palette (this can be changed using the NAshape argument).
plotPCA(pca1, components = c(1,2), colour = "stage")
#> $pca1
#> $pca1$stage
#> $pca1$stage$PC1vsPC2As can be seen in the last example above, the components argument to plotPCA() can be used to specify which components to plot. Above we only ask for PC1 vs PC2.
If you would like more than one plot, then use a list. For example, the default behaviour is to set components = list(c(1, 2), c(2, 3)) which plots both PC1 vs PC2 and PC2 vs PC3.
This is controlled using the pointSize argument, which has a default value of 2.
plotPCA(pca1,
components = c(1, 2),
colour = "patient",
shape = "tumour")
#> $pca1
#> $pca1$patient
#> $pca1$patient$PC1vsPC2
plotPCA(pca1,
components = c(1, 2),
colour = "patient",
shape = "tumour",
pointSize = 4)
#> $pca1
#> $pca1$patient
#> $pca1$patient$PC1vsPC2Sample names can be displayed for each point by setting showSampleNames = TRUE.
plotPCA(pca1,
components = c(1, 2),
colour = "patient",
shape = "tumour",
showSampleNames = TRUE)
#> $pca1
#> $pca1$patient
#> $pca1$patient$PC1vsPC2Alternatively, plotly::ggplotly() can be used to generate an interactive version of the plot which gives information for each point when you rollover the point with the mouse. The plotlyAnnotations argument can be used to specify which variables to display in the pop-up boxes. The sample name will always be displayed, as will any variables being used colour and shape annotations.
PCAplots3 <- plotPCA(pca1,
components = c(1, 2),
colour = "patient",
shape = "tumour",
plotlyAnnotations = c("stage", "age", "gender"))Plotly plots are great for interactive use in html reports.
The downside of using them is that the legend often gets displayed in a complicated, unhelpful way and the formatting of the plot is often not as good and hard to control.
Sometimes it makes sense to plot a non-plotly and plotly version of the same plot.
getPCA()When you run getPCA() some messages are printed and will appear in the console window and also below the code chunk in the source window. This gives the following information:
makeDataTable() (if applicable), the mesa function that generates the table of values (beta values by default)pca <- getPCA(qseaSet = exampleTumourNormal)
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------getPCA() function argumentsBy default, the getPCA function will only consider a subset of the most variable windows across the qseaSet. This has two benefits, the first being computational efficiency, but the second is to prevent the effects of minor batch effects from appearing when they are summed over many windows (although that is also worth investigating separately). This is seen particularly when using beta values, as how close a beta value can get to 0 or 1 is determined by the enrichment and number of fragments sequenced; a sample with 5 million fragments might be assigned a beta value of 0.95 in a window, but 0.99 in the same window if it had 10 million fragments.
This selection of the number of windows can be set using the topVarNum argument of getPCA(). If not explicitly set, the default value is the top 1000 most variable windows.
More than one value can be specified for topVarNum and PCA results will be produced for each. This is computationally more efficient than running getPCA() multiple times with a single topVarNum value each time.
Specifying a value of NA will result in all windows being used (i.e. no filtering based on most variable windows).
pca2 <- getPCA(exampleTumourNormal, topVarNum = c(100, 250, 500, NA))
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across all 10 samples:
#> Colon1_N, Colon1_T, Colon2_N, Colon2_T, Lung1_N, Lung1_T, Lung2_N, Lung2_T, Lung3_N, Lung3_T.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.19 resulting in 100 windows.
#> Performing PCA with 10 samples and 100 windows
#> -> pca1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.13 resulting in 250 windows.
#> Performing PCA with 10 samples and 250 windows
#> -> pca2.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.0485 resulting in 500 windows.
#> Performing PCA with 10 samples and 500 windows
#> -> pca3.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca4.
#> ------------------------------Here, four different values have been used and the messages output tell you the name assigned to each PCA result.
When plugging this output into plotPCA, plots are produced for all four PCA versions, with all options for the colour parameter:
PCAplots4 <- plotPCA(pca2, colour = c("patient", "age"), shape = "tumour")
PCAplots4$pca1$patient$PC1vsPC2If you specify a value larger than the number of available windows, then PCA will be performed with all available windows and redundant values will be ignored:
pca <- getPCA(exampleTumourNormal, topVarNum = c(250, 500, 750, 1000, NA))
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 750, 1000
#> These values are not used; PCA is already being done with all remaining windows.
#> Calculating standard deviation for each window across all 10 samples:
#> Colon1_N, Colon1_T, Colon2_N, Colon2_T, Lung1_N, Lung1_T, Lung2_N, Lung2_T, Lung3_N, Lung3_T.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.13 resulting in 250 windows.
#> Performing PCA with 10 samples and 250 windows
#> -> pca1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.0485 resulting in 500 windows.
#> Performing PCA with 10 samples and 500 windows
#> -> pca2.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca3.
#> ------------------------------It is possible to select which samples are used to calculate the standard deviation for each window using the topVarSamples argument. This may be either an explicit vector of sample names,
tumourNames <- exampleTumourNormal |> filter(tumour == "Tumour") |> getSampleNames()
getPCA(exampleTumourNormal, topVarNum = c(50), topVarSamples = tumourNames)
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across 5 of 10 samples:
#> Colon1_T, Colon2_T, Lung1_T, Lung2_T, Lung3_T.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across 5 samples (windowSd1).
#> Standard deviation threshold = 0.254 resulting in 50 windows.
#> Performing PCA with 10 samples and 50 windows
#> -> pca1.
#> ------------------------------
#> Object containing 1 dimensionality reduction objects for 10 samplesOr a string to apply (as a regular expression):
getPCA(exampleTumourNormal, topVarNum = c(50), topVarSamples = "_T")
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across 5 of 10 samples:
#> Colon1_T, Colon2_T, Lung1_T, Lung2_T, Lung3_T.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across 5 samples (windowSd1).
#> Standard deviation threshold = 0.254 resulting in 50 windows.
#> Performing PCA with 10 samples and 50 windows
#> -> pca1.
#> ------------------------------
#> Object containing 1 dimensionality reduction objects for 10 samplesThe default is to use all samples.
The slowest part of the PCA is often generating the table of values (e.g. beta values), which we often refer to as the dataTable. For large numbers of windows and/or samples this can take a considerable amount of time and could become a bottleneck if you need to run getPCA() multiple times.
The getPCA() function therefore allows for the dataTable to be generated externally and then input into getPCA(). This means getPCA() can be run multiple times with the dataTable generated only once.
pca1 <- getPCA(exampleTumourNormal) # dataTable generated within getPCA
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------For example, we can use getBetaTable to generate the table to use:
dataTable <- getBetaTable(exampleTumourNormal)
#> Generating table of beta values for 819 regions across 10 samples.
pca2 <- getPCA(exampleTumourNormal, dataTable = dataTable) # dataTable generated externally and passed into getPCA
#> Generating table of beta values for 50 regions across 5 samples.
#> ------------------------------
#> Initial number of windows = 819.
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------And this gives the same result:
Another option is to run getPCA() and set the returnDataTable argument to TRUE. The dataTable will then be part of the function output and can be passed into any further calls to getPCA().
The dataTable is not given as part of the output by default since the dataTable can often be large and storing it every time getPCA() is run would be memory-inefficient.
pca3 <- getPCA(exampleTumourNormal, returnDataTable = TRUE) # dataTable generated within getPCA and returned as part of output
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------methods::slot(pca1,"dataTable") # NULL
#> data frame with 0 columns and 0 rows
methods::slot(pca3, "dataTable") # dataTable has been stored in output
#> # A tibble: 599 × 14
#> seqnames start end CpG_density Colon1_N Colon1_T Colon2_N Colon2_T
#> <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 7 25005001 25005300 13.4 0.0565 0.0171 0.0275 0.0165
#> 2 7 25017601 25017900 8.38 0.122 0.0695 0.918 0.818
#> 3 7 25017901 25018200 17.9 0.0240 0.0660 0.265 0.250
#> 4 7 25018201 25018500 16.5 0.0472 0.0143 0.112 0.0669
#> 5 7 25018501 25018800 17.5 0.0241 0.0140 0.0426 0.0133
#> 6 7 25018801 25019100 15.3 0.0493 0.0149 0.0692 0.0559
#> 7 7 25019101 25019400 18.2 0.0932 0.0380 0.220 0.0646
#> 8 7 25019401 25019700 17.1 0.0466 0.0528 0.0881 0.0258
#> 9 7 25035001 25035300 9.29 0.0855 0.327 0.308 0.389
#> 10 7 25047301 25047600 9.54 0.225 0.0787 0.0722 0.0447
#> # ℹ 589 more rows
#> # ℹ 6 more variables: Lung1_N <dbl>, Lung1_T <dbl>, Lung2_N <dbl>,
#> # Lung2_T <dbl>, Lung3_N <dbl>, Lung3_T <dbl>We can then use the dataTable calculated in a previous getPCA call, but ask getPCA to return a different number of top windows:
pca4 <- getPCA(exampleTumourNormal,
dataTable = methods::slot(pca3, "dataTable"),
topVarNum = 250)
#> Generating table of beta values for 50 regions across 5 samples.
#> ------------------------------
#> Initial number of windows = 599.
#> No windows have missing values.
#> ------------------------------
#> Calculating standard deviation for each window across all 10 samples:
#> Colon1_N, Colon1_T, Colon2_N, Colon2_T, Lung1_N, Lung1_T, Lung2_N, Lung2_T, Lung3_N, Lung3_T.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 10 samples (windowSd1).
#> Standard deviation threshold = 0.13 resulting in 250 windows.
#> Performing PCA with 10 samples and 250 windows
#> -> pca1.
#> ------------------------------By default, getPCA will use beta values to generate the PCA. This may be changed via the normMethod option, for instance to nrpm as shown below, although this has a tendency to show more enrichment-based variability than the use of the beta values, which correct this to some extent.
pca_beta <- getPCA(exampleTumourNormal, normMethod = "beta")
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of beta values for 819 regions across 10 samples.
#> -----------
#> Filtered out 220 windows with at least one missing value: 599 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 599): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 599 windows
#> -> pca1.
#> ------------------------------
pca_nrpm <- getPCA(exampleTumourNormal, normMethod = "nrpm")
#> ------------------------------
#> Initial number of windows = 819.
#> -----------
#> Generating table of nrpm values for 819 regions across 10 samples.
#> -----------
#> No windows have missing values.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 819): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 819 windows
#> -> pca1.
#> ------------------------------
pca_beta |> plotPCA(colour = "tissue", shape = "tumour", components = c(1,2))
#> $pca1
#> $pca1$tissue
#> $pca1$tissue$PC1vsPC2pca_nrpm |> plotPCA(colour = "tissue", shape = "tumour", components = c(1,2))
#> $pca1
#> $pca1$tissue
#> $pca1$tissue$PC1vsPC2Whilst PCA is an unsupervised method for dimensionality reduction, sometimes there is a set of windows which are of particular interest, and you wish to see how the samples cluster when considering these. Ideally these have been selected in an orthogonal manner rather than generated using the samples that are being plotted in the PCA. For instance, differentially methylated regions may have been found using a comparison between methylation arrays (a different technology), and you wish to see how these same windows split the samples.
This may be done by pre-filtering the qseaSet object via filterByOverlaps:
exampleTumourNormal |>
filterByOverlaps(regionsToOverlap = tibble(seqnames = 7, start = 25002001, end = 25117900)) |>
getPCA()
#> ------------------------------
#> Initial number of windows = 38.
#> -----------
#> Generating table of beta values for 38 regions across 10 samples.
#> -----------
#> Filtered out 11 windows with at least one missing value: 27 windows remaining.
#> ------------------------------
#> The following topVarNum values are larger than the number of remaining windows (= 27): 1000
#> These values are not used; PCA will be done with all remaining windows instead.
#> ------------------------------
#> No filtering of windows based on window standard deviation.
#> Performing PCA with 10 samples and 27 windows
#> -> pca1.
#> ------------------------------
#> Object containing 1 dimensionality reduction objects for 10 samplesFor a full list of arguments, see the documentation (?getPCA).
Uniform Manifold and Projection (UMAP) is another method for performing dimensionality reduction. This is a nonlinear technique, attempting to reduce the N-dimensional space onto a 2 or 3 dimensional projection, attempting to preserve both local distances between points as well as the global topology. This is more appropriate for large datasets so we use the [qsea::getExampleQseaSet()] function to create an example dataset with two types of samples, before calculating and plotting a UMAP using the getUMAP and plotUMAP functions:
set.seed(1)
qseaSetRandom <- qsea::getExampleQseaSet(repl = 20)
#> selecting windows with low CpG density for background read estimation
#> 1.48% of the windows have enrichment pattern density of at most 0.5 per fragment
#> and are used for background reads estimation
umap1 <- qseaSetRandom |>
getUMAP()
#> ------------------------------
#> Initial number of windows = 10000.
#> -----------
#> Generating table of beta values for 9800 regions across 2 samples.
#> -----------
#> Filtered out 3239 windows with at least one missing value: 6561 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across all 40 samples:
#> Sim1T, Sim2T, Sim3T, Sim4T, Sim5T, Sim6T, Sim7T, Sim8T, Sim9T, Sim10T, Sim11T, Sim12T, Sim13T, Sim14T, Sim15T, Sim16T, Sim17T, Sim18T, Sim19T, Sim20T, Sim1N, Sim2N, Sim3N, Sim4N, Sim5N, Sim6N, Sim7N, Sim8N, Sim9N, Sim10N, Sim11N, Sim12N, Sim13N, Sim14N, Sim15N, Sim16N, Sim17N, Sim18N, Sim19N, Sim20N.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 40 samples (windowSd1).
#> Standard deviation threshold = 0.203 resulting in 1000 windows.
#> Performing UMAP with 40 samples and 1000 windows
#> -> umap1.
#> ------------------------------
plotUMAP(umap1, colour = "group")
#> $umap1
#> $umap1$group
#> $umap1$group$UMAP1vsUMAP2The vast majority of the options for getUMAP are the same as for getPCA, but while the PCA algorithm has no further options, the UMAP algorithm has a number of options which can affect the output. The underlying implementation is via the [uwot::umap()] function, and any additional parameters provided to getUMAP are passed to this function. Of particular note are the two options, n_neighbors and min_dist. n_neighbors controls how many neighbouring samples are considered to build the local structure, while min_dist controls how far apart the resulting embedded points must be. There is an excellent interactive example available to see how these two parameters influence the output, for a range of different datatypes.
Here we show two UMAPs built off the same dataset with different values of min_dist and n_neighbors:
qseaSetRandom |>
getUMAP(min_dist = 0.01, n_neighbors = 2) |>
plotUMAP(colour = "group")
#> ------------------------------
#> Initial number of windows = 10000.
#> -----------
#> Generating table of beta values for 9800 regions across 2 samples.
#> -----------
#> Filtered out 3239 windows with at least one missing value: 6561 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across all 40 samples:
#> Sim1T, Sim2T, Sim3T, Sim4T, Sim5T, Sim6T, Sim7T, Sim8T, Sim9T, Sim10T, Sim11T, Sim12T, Sim13T, Sim14T, Sim15T, Sim16T, Sim17T, Sim18T, Sim19T, Sim20T, Sim1N, Sim2N, Sim3N, Sim4N, Sim5N, Sim6N, Sim7N, Sim8N, Sim9N, Sim10N, Sim11N, Sim12N, Sim13N, Sim14N, Sim15N, Sim16N, Sim17N, Sim18N, Sim19N, Sim20N.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 40 samples (windowSd1).
#> Standard deviation threshold = 0.203 resulting in 1000 windows.
#> Performing UMAP with 40 samples and 1000 windows
#> -> umap1.
#> ------------------------------
#> $umap1
#> $umap1$group
#> $umap1$group$UMAP1vsUMAP2
qseaSetRandom |>
getUMAP(min_dist = 1, n_neighbors = 40) |>
plotUMAP(colour = "group")
#> ------------------------------
#> Initial number of windows = 10000.
#> -----------
#> Generating table of beta values for 9800 regions across 2 samples.
#> -----------
#> Filtered out 3239 windows with at least one missing value: 6561 windows remaining.
#> ------------------------------
#> Calculating standard deviation for each window across all 40 samples:
#> Sim1T, Sim2T, Sim3T, Sim4T, Sim5T, Sim6T, Sim7T, Sim8T, Sim9T, Sim10T, Sim11T, Sim12T, Sim13T, Sim14T, Sim15T, Sim16T, Sim17T, Sim18T, Sim19T, Sim20T, Sim1N, Sim2N, Sim3N, Sim4N, Sim5N, Sim6N, Sim7N, Sim8N, Sim9N, Sim10N, Sim11N, Sim12N, Sim13N, Sim14N, Sim15N, Sim16N, Sim17N, Sim18N, Sim19N, Sim20N.
#> -> column name windowSd1.
#> ------------------------------
#> Filtering windows based on standard deviation across all 40 samples (windowSd1).
#> Standard deviation threshold = 0.203 resulting in 1000 windows.
#> Performing UMAP with 40 samples and 1000 windows
#> -> umap1.
#> ------------------------------
#> $umap1
#> $umap1$group
#> $umap1$group$UMAP1vsUMAP2As a final note, we state that UMAPs should be used with care, as they are easy to misinterpret.
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.80.0
#> [3] rtracklayer_1.72.0 BiocIO_1.22.0
#> [5] Biostrings_2.80.0 XVector_0.52.0
#> [7] GenomicRanges_1.64.0 Seqinfo_1.2.0
#> [9] tibble_3.3.1 org.Mm.eg.db_3.23.0
#> [11] org.Hs.eg.db_3.23.1 AnnotationDbi_1.74.0
#> [13] IRanges_2.46.0 S4Vectors_0.50.0
#> [15] Biobase_2.72.0 BiocGenerics_0.58.0
#> [17] generics_0.1.4 stringr_1.6.0
#> [19] tidyr_1.3.2 ggplot2_4.0.3
#> [21] mesa_0.99.1 dplyr_1.2.1
#> [23] qsea_1.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] fs_2.1.0
#> [2] matrixStats_1.5.0
#> [3] bitops_1.0-9
#> [4] enrichplot_1.32.0
#> [5] lubridate_1.9.5
#> [6] httr_1.4.8
#> [7] RColorBrewer_1.1-3
#> [8] doParallel_1.0.17
#> [9] tools_4.6.0
#> [10] utf8_1.2.6
#> [11] R6_2.6.1
#> [12] lazyeval_0.2.3
#> [13] uwot_0.2.4
#> [14] GetoptLong_1.1.1
#> [15] withr_3.0.2
#> [16] prettyunits_1.2.0
#> [17] gridExtra_2.3
#> [18] preprocessCore_1.74.0
#> [19] cli_3.6.6
#> [20] Cairo_1.7-0
#> [21] hues_0.2.0
#> [22] scatterpie_0.2.6
#> [23] labeling_0.4.3
#> [24] sass_0.4.10
#> [25] S7_0.2.2
#> [26] Rsamtools_2.28.0
#> [27] systemfonts_1.3.2
#> [28] yulab.utils_0.2.4
#> [29] DOSE_4.6.0
#> [30] dichromat_2.0-0.1
#> [31] plotrix_3.8-14
#> [32] limma_3.68.1
#> [33] RSQLite_2.4.6
#> [34] FNN_1.1.4.1
#> [35] gridGraphics_0.5-1
#> [36] TxDb.Hsapiens.UCSC.hg19.knownGene_3.22.1
#> [37] shape_1.4.6.1
#> [38] gtools_3.9.5
#> [39] crosstalk_1.2.2
#> [40] GO.db_3.23.1
#> [41] Matrix_1.7-5
#> [42] abind_1.4-8
#> [43] lifecycle_1.0.5
#> [44] yaml_2.3.12
#> [45] snakecase_0.11.1
#> [46] SummarizedExperiment_1.42.0
#> [47] gplots_3.3.0
#> [48] SparseArray_1.12.2
#> [49] BiocFileCache_3.2.0
#> [50] HMMcopy_1.54.0
#> [51] grid_4.6.0
#> [52] blob_1.3.0
#> [53] crayon_1.5.3
#> [54] ggtangle_0.1.2
#> [55] lattice_0.22-9
#> [56] GenomicFeatures_1.64.0
#> [57] cigarillo_1.2.0
#> [58] KEGGREST_1.52.0
#> [59] magick_2.9.1
#> [60] pillar_1.11.1
#> [61] knitr_1.51
#> [62] ComplexHeatmap_2.28.0
#> [63] rjson_0.2.23
#> [64] boot_1.3-32
#> [65] codetools_0.2-20
#> [66] glue_1.8.1
#> [67] ggiraph_0.9.6
#> [68] ggfun_0.2.0
#> [69] fontLiberation_0.1.0
#> [70] data.table_1.18.2.1
#> [71] vctrs_0.7.3
#> [72] png_0.1-9
#> [73] treeio_1.36.1
#> [74] gtable_0.3.6
#> [75] cachem_1.1.0
#> [76] xfun_0.57
#> [77] S4Arrays_1.12.0
#> [78] MEDIPSData_1.48.0
#> [79] iterators_1.0.14
#> [80] statmod_1.5.1
#> [81] nlme_3.1-169
#> [82] ggtree_4.2.0
#> [83] bit64_4.8.0
#> [84] fontquiver_0.2.1
#> [85] progress_1.2.3
#> [86] filelock_1.0.3
#> [87] UpSetR_1.4.0
#> [88] GenomeInfoDb_1.48.0
#> [89] bslib_0.10.0
#> [90] irlba_2.3.7
#> [91] KernSmooth_2.23-26
#> [92] otel_0.2.0
#> [93] colorspace_2.1-2
#> [94] DBI_1.3.0
#> [95] DNAcopy_1.86.0
#> [96] tidyselect_1.2.1
#> [97] bit_4.6.0
#> [98] compiler_4.6.0
#> [99] curl_7.1.0
#> [100] httr2_1.2.2
#> [101] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
#> [102] fontBitstreamVera_0.1.1
#> [103] DelayedArray_0.38.1
#> [104] plotly_4.12.0
#> [105] scales_1.4.0
#> [106] caTools_1.18.3
#> [107] ChIPseeker_1.48.0
#> [108] MEDIPS_1.64.0
#> [109] rappdirs_0.3.4
#> [110] digest_0.6.39
#> [111] rmarkdown_2.31
#> [112] htmltools_0.5.9
#> [113] pkgconfig_2.0.3
#> [114] MatrixGenerics_1.24.0
#> [115] dbplyr_2.5.2
#> [116] fastmap_1.2.0
#> [117] rlang_1.2.0
#> [118] GlobalOptions_0.1.4
#> [119] htmlwidgets_1.6.4
#> [120] UCSC.utils_1.8.0
#> [121] farver_2.1.2
#> [122] jquerylib_0.1.4
#> [123] zoo_1.8-15
#> [124] jsonlite_2.0.0
#> [125] BiocParallel_1.46.0
#> [126] GOSemSim_2.38.0
#> [127] RCurl_1.98-1.18
#> [128] magrittr_2.0.5
#> [129] ggplotify_0.1.3
#> [130] patchwork_1.3.2
#> [131] Rcpp_1.1.1-1.1
#> [132] ape_5.8-1
#> [133] ggnewscale_0.5.2
#> [134] gdtools_0.5.0
#> [135] stringi_1.8.7
#> [136] MASS_7.3-65
#> [137] plyr_1.8.9
#> [138] parallel_4.6.0
#> [139] ggrepel_0.9.8
#> [140] hms_1.1.4
#> [141] circlize_0.4.18
#> [142] igraph_2.3.0
#> [143] enrichit_0.1.4
#> [144] reshape2_1.4.5
#> [145] biomaRt_2.68.0
#> [146] XML_3.99-0.23
#> [147] evaluate_1.0.5
#> [148] foreach_1.5.2
#> [149] tweenr_2.0.3
#> [150] purrr_1.2.2
#> [151] polyclip_1.10-7
#> [152] clue_0.3-68
#> [153] ggforce_0.5.0
#> [154] janitor_2.2.1
#> [155] restfulr_0.0.16
#> [156] RSpectra_0.16-2
#> [157] tidytree_0.4.7
#> [158] tidydr_0.0.6
#> [159] viridisLite_0.4.3
#> [160] TxDb.Hsapiens.UCSC.hg38.knownGene_3.22.0
#> [161] aplot_0.2.9
#> [162] memoise_2.0.1
#> [163] plyranges_1.32.0
#> [164] GenomicAlignments_1.48.0
#> [165] cluster_2.1.8.2
#> [166] timechange_0.4.0