--- title: "Sparse Robust PCA with the ssMRCD Estimator for Multi-Source Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Sparse Robust PCA with the ssMRCD Estimator for Multi-Source Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, warning = FALSE, fig.dim = c(7, 4.5), comment = "#>" ) ``` This vignette is based on the real world example described in Puchhammer, Wilms and Filzmoser (2024). However, only a subset of the data is analysed to reduce the necessary runtime due to the large number of groups in the original data analysis. All functions are included in the package `ssMRCD`. ```{r} library(ssMRCD) library(dplyr) library(ggplot2) library(robustbase) library(tidyr) library(ggridges) ``` ## Data Preparation Weather data is made available by GeoSphere Austria (2024) () and is pre-cleaned and saved in the data frame object `weatherHoheWarte`. We consider a set of weather variables for the weather station Hohe Warte in Vienna, Austria over the years 2000-2023. The decision of variables and time lines are based on data quality, i.e. missing data. Additional information can be found on the helping page. ```{r, eval = FALSE} ? weatherHoheWarte ``` ```{r load data} data("weatherHoheWarte") head(weatherHoheWarte) ``` Only data from 2000 onward is used. ```{r data preparation} varnames_short= c("cl", "rad", "vp", "wmax", "ap", "hum", "prec", "sight", "sun", "tmax", "tmin", "t", "w") # filter data weather = weatherHoheWarte %>% dplyr::filter(year > 2000) # select variables X_data = weather %>% select(cloud_cover:wind_velocity) %>% as.matrix colnames(X_data) = varnames_short # get number of variables p = dim(X_data)[2] ``` To facilitate a stable algorithm, the data is scaled robustly. High variance complicates the strategy to choose \code{rho} during the algorithm since it is based on the average overall variance. ```{r scale data} X_data = scale(X_data, center = robustbase::colMedians(X_data), scale = apply(X = X_data, MARGIN = 2, FUN = mad)) ``` ## Calculating the ssMRCD Estimator for Yearly Groups Since each year is supposed to be similar but can also contain certain irregular weather phenomena, the ssMRCD estimator (Puchhammer and Filzmoser, 2024) is applicable to take advantage of the additional information of prior and following years. For the ssMRCD estimator we use the function `ssMRCD`. The neighborhoods are years, thus acquiring comparability by including one weather cycle per group. Weights are based on the temporal proximity and linear decline of influence, which lasts a maximum of 5 years and calculated with the function `time_weights`. ```{r groups and weights} N_groups = as.numeric(as.factor(weather$year)) N = max(N_groups) W = time_weights(max(N_groups), c(5:1)) ``` The optimal amount of smoothing is tuned with the residuals-based criteria described in Puchhammer, Wilms and Filzmoser (2024). Since the optimization of values takes some time (around 10 min), the optimal parameter selection is hard-coded in the second code chunk. To see the optimal smoothing criterion plotted over lambda run the first code chunk below. ```{r hyperparameter tuning, eval = FALSE} set.seed(1234) lambda = seq(0, 1, by = 0.05) opt_lambda = ssMRCD(X = X_data, groups = N_groups, weights = W, lambda = lambda, tuning = list(method = "residuals", plot = TRUE), alpha = 0.75) ``` ```{r} lambda_grid = seq(0, 1, by = 0.05) residual_values = c(3.199500,3.188801,3.182297,3.175784,3.171462,3.167695,3.165151,3.162758, 3.161896,3.161929,3.162562, 3.163831,3.165651,3.168228,3.171867,3.175790, 3.180868,3.186437,3.192634,3.200292,3.208845) lambda_opt = 0.4 ggplot() + geom_line(aes(x = lambda_grid, y = residual_values)) + geom_point(aes(x = lambda_grid, y = residual_values)) + geom_point(aes(x = lambda_opt, y = residual_values[lambda_grid == lambda_opt]), col = "red") + labs(x = expression(lambda), y = "Residual Value", title = "Optimal Smoothing Based on Residuals") + theme_minimal() ``` Here, the ssMRCD with the optimal lambda is calculated. ```{r ssMRCD} # get optimal smoothing parameter and estimator set.seed(123) ssMRCD_weather = ssMRCD(X = X_data, groups = weather$year, weights = W, lambda = 0.4, alpha = 0.75) # collect covariance matrices COVS = ssMRCD_weather$MRCDcov plot(ssMRCD_weather, type = c("ellipses", "convergence"), variables = c("hum", "sun")) ``` By plotting the mean estimates of the ssMRCD estimator we see clear temporal trends in the data and the different variables, . ```{r dynamic means} # collect means mu_timeseries = do.call(cbind, ssMRCD_weather$MRCDmu) colnames(mu_timeseries) = 2001:2023 rownames(mu_timeseries) = colnames(X_data) # plot means dependent on year ggplot(mu_timeseries %>% data.frame %>% mutate(Var1 = rownames(.)) %>% tidyr::pivot_longer(cols = X2001:X2023) %>% group_by(Var1) %>% mutate(value = (value-min(value))/(max(value) - min(value)), name = as.numeric(gsub("X", "", name)))) + geom_line(aes(x = name, y = value, group = Var1)) + geom_smooth(aes(x = name, y = value, group = Var1), se = F) + facet_wrap(vars(Var1)) + theme_minimal() + labs(x = "", y = "") ``` ## Robust Sparse PCA The covariances from the ssMRCD-estimator can be used as plug-in for the sparse robust multi-source PCA. ### Hyper-Parameter Tuning To get the optimal distribution of sparsity $\gamma$ we use the AUC for the path of explained variance and sparsity of entries and groups. The optimal amount of sparsity $\eta$ is based on the trade-off product optimization (TPO) criterion. The parameter tuning is included in the function \code{msPCA} when \code{eta} and/or \code{gamma} are numeric vectors. Here, only \code{eta} is tuned. ```{r pca tuning, eval = TRUE} set.seed(12345) pca = msPCA(eta = seq(0, 3, 0.25), gamma = 0.5, COVS = COVS, k = p) pca$plot_tpo ``` The optimal parameter values is $\eta=$ `r pca$eta`. ```{r} summary(pca, k = 3) ``` ### Sparse Robust PCA - Number of PCs When deciding how many PCs are enough to describe the data well, a scree-plot inspired box-plot scree-plot can visualize the explained variance well for multiple groups. Note, that we adjust the amount of sparsity for further components based on the remaining variance possible to explain. ```{r scree plot} # plot scree plot screeplot(x = pca, text = TRUE, size = 3, gnames = 2001:2023, cutoff = 0.8) ``` We select the first three PCs as sufficient, since 80% of overall variance is explained by them. ```{r optimal k} # optimal number of components optimal_k = 3 ``` ## Diagnostic Plots When it comes to plotting, we also need to align the loadings for a clear picture. Multiple approaches are possible and implemented in the function `align` with different `type` settings. ```{r align loadings} ?align pca$PC[, 1:optimal_k, ] = align(PC = pca$PC[, 1:optimal_k, ],type = "mean") ``` ### Loadings Loadings can be plotted via the function `plot` with `type = "loadings"`. Keep in mind, that each PC for each source is unique only up to a factor -1. While in theory this is not remarkable, for comparing results in a sensible manner, the PCs should be oriented to reflect similarities specifically. The function `align_PC` provides various approach to orient them in a similar direction and for applications different `type` arguments can be explored. ```{r plot loadings} # plot aligned loadings plot(x = pca, type = "loadings", gnames = 2001:2023, k = 1) plot(x = pca, type = "loadings", gnames = 2001:2023, k = 2) plot(x = pca, type = "loadings", gnames = 2001:2023, k = 3) ``` Also something like a biplot can be visualized to see the differences in the loadings. ```{r biplot} biplot(pca) ``` ### Scores, Score- and Orthogonal-Distances To improve interpretation of the components, we calculate the scores as well as the score distance and the orthogonal distance with the function `scores`. You can either use the ssMRCD object, then the data from the covariance estimation is used and centered accordingly, or you can specify the input manually. ```{r scores} sc = scores(X = X_data, groups = weather$year, PC = pca$PC, mu = ssMRCD_weather$MRCDmu, Sigma = ssMRCD_weather$MRCDcov) sc2 = scores(PC = pca$PC, ssMRCD = ssMRCD_weather) ``` Possible visualizations include univariate densities for each group via histograms of the scores. ```{r} plot(x = pca, ssMRCD = ssMRCD_weather, type = "scores") ``` The score distance and orthogonal distance are calculated via the `scores` function. Some densities can be shown for each group similar to Puchhammer, Wilms and Filzmoser (2024). ```{r distance distribution} # distances score_sd = sc$SD score_od = sc$OD dat = data.frame(time = weather$time, year = weather$year, OD = score_od, SD = score_sd, groups = weather$year) dat_melt = dat %>% tidyr::pivot_longer(cols = c("OD", "SD")) ggplot(dat_melt %>% dplyr::filter(name %in% c( "SD", "OD"))) + ggridges::geom_density_ridges_gradient(aes(x = value, y = groups, fill = as.factor(groups), group = groups), scale = 10, rel_min_height = 0.01, gradient_lwd = 0.5) + scale_x_continuous(expand = c(0.03, 0)) + scale_y_discrete(expand = c(0.0, 0, 0.1, 0)) + scale_fill_discrete(name = "") + ggridges::theme_ridges(font_size = 13, grid = TRUE) + theme(axis.title.y = element_blank(), legend.position = "right", panel.grid.major.y = element_line() ) + labs(x = "") + facet_grid(cols = vars(name), scales = "free") ``` Also, a distance-distance plot can be plotted. ```{r distance distance plot} plot(pca, type = "score_distances", ssMRCD = ssMRCD_weather, k = 3) ``` For more fun, the two distances can be visualized via Gif in a bivariate way for each group separately. ```{r gif, eval = FALSE} # use function saveGIF library(animation) param = 2001:2023 saveGIF(interval = 0.1, movie.name = "PCA_scoreDist_bivariate.gif", expr = { for (i in param ){ g = ggplot(dat %>% dplyr::filter (year %in% seq(i-5, i+5, 1))) + geom_density_2d_filled(aes(x = SD, y = OD), alpha = 1) + scale_x_sqrt(limits = c(0,40)) + ylim(c(0, 6.2)) + labs(title = i) print(g) }} ) ``` ## References GeoSphere Austria (2022): . Puchhammer, P., & Filzmoser, P. (2024). Spatially smoothed robust covariance estimation for local outlier detection. _Journal of Computational and Graphical Statistics, 33(3)_, 928-940.. Puchhammer, P., Wilms, I., & Filzmoser, P. (2024). Sparse outlier-robust PCA for multi-source data. _arXiv preprint arXiv:2407.16299._ .