--- title: "Getting Started with BCFM: A Complete Workflow" author: "Allison Tegge, Virginia Tech" date: "`r format(Sys.Date(), '%B %d, %Y')`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 number_sections: false vignette: > %\VignetteIndexEntry{Getting Started with BCFM: A Complete Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6 ) library(BCFM) library(dplyr) library(tidyr) library(ggplot2) ``` # Introduction This vignette demonstrates a complete workflow for using the **BCFM** (Bayesian Clustering Factor Models) package. We'll use a simulated dataset that comes with the package to illustrate: - Data preparation and exploration - Model selection across different numbers of groups and factors - Fitting the optimal model - Comprehensive visualization of results --- # Load and Explore the Data For this tutorial, we use a simulated dataset included in the BCFM package. The dataset contains 200 observations, 20 variables, and 5 time points, with a known structure of 4 groups and 3 latent factors. ```{r simulated-data-exploration} # Load the simulated dataset data("sim.data", package = "BCFM") # Examine the structure str(sim.data) # Extract dimensions n_obs <- dim(sim.data)[1] # 200 observations n_vars <- dim(sim.data)[2] # 20 variables cat("Dataset dimensions:\n") cat(" Observations:", n_obs, "\n") cat(" Variables:", n_vars, "\n") ``` --- # Model Selection The first step is to identify the optimal number of groups (clusters) and latent factors. We use `BCFM.model.selection()` to fit models across a grid of possible configurations and compare them using IC. ```{r run-model-selection} # Specify variables to use for clustering cluster.vars <- paste0("V", 1:n_vars) # Create output directory for results (using tempdir() for CRAN compliance) output_dir <- file.path(tempdir(), "BCFM_analysis") # Run model selection # Note: grouplist and factorlist can be vectors (e.g., c(2, 3, 4)) or sequences (e.g., 2:4) # The function will fit a model for each combination of groups and factors BCFM.model.selection( data = sim.data, cluster.vars = cluster.vars, grouplist = 4, # Number of groups factorlist = 3, # Number of factors n.iter = 100, # Number of MCMC iterations (use 50000+ for real analysis) burnin = 10, # Burnin for IC calculation (use 10000+ for real analysis) every = 10, # Retains the output of every "every" MCMC iterations cluster.size = 0.01, # Minimum proportion required for each cluster (default 0.05) output_dir = output_dir, # Specify where to save results seed = 123 # Optional seed for reproducibility ) ``` > **Note:** For demonstration, we use minimal iterations (100) and burnin (10). For actual analysis, use `n.iter = 50000` and `burnin = 10000` or higher to ensure convergence. > **Output Directory:** The `output_dir` parameter specifies where model results will be saved. By default, results are saved to `tempdir()`. For your own analyses, you can specify any directory where you have write permissions. --- # Examining Output Files After running model selection, check the output directory: ```{r check-output} # Check what files were created list.files(output_dir) # Load IC results (if multiple models were fitted) load(file.path(output_dir, "IC.Rdata")) print(IC.matrix) # Load the fitted model results load(file.path(output_dir, "results-covarianceF-g4-f3.Rdata")) # Examine the structure str(SDresult$Result, max.level = 1) ``` When testing multiple configurations (e.g., `grouplist = 2:4, factorlist = 2:4`), an `IC.Rdata` file is created for model comparison. --- # Visualization The BCFM package provides comprehensive visualization functions to explore your results. ## Factor Loadings Matrix The factor loadings matrix **B** shows the relationship between observed variables and latent factors. Each column represents a latent factor, and each row represents a variable. Higher absolute values indicate stronger associations. ```{r factor-loadings-matrix} # Check dimensions and calculate posterior mean of B (after burnin) n_iter <- dim(SDresult$Result$B)[1] burnin <- min(100, floor(n_iter * 0.2)) # Use 20% as burnin or 100, whichever is smaller cat("Total iterations:", n_iter, "\n") cat("Using burnin:", burnin, "\n") # Calculate posterior mean after burnin B.matrix <- apply(SDresult$Result$B[burnin:n_iter, , ], MARGIN = c(2, 3), FUN = mean) B.matrix <- as.data.frame(B.matrix) # Set variable names rownames(B.matrix) <- paste0("Var", 1:nrow(B.matrix)) # Set informative column names for factors colnames(B.matrix) <- paste0("Factor ", 1:ncol(B.matrix)) # Print the factor loadings matrix knitr::kable(round(B.matrix, 3), caption = "Posterior Mean Factor Loadings", align = 'c') ``` ## Latent Profiles Visualize the mean profiles for each cluster: ```{r latent-profiles-plot, results='hide'} ggplot_latent.profiles(Gibbs = SDresult$Result) ``` ## Cluster Means Explore the posterior distributions of cluster means: ```{r mu-density-plot, results='hide'} ggplot_mu.density(Gibbs = SDresult$Result, add.legend = TRUE) ``` ## Cluster Probabilities ### Density Plots Examine the posterior distributions of cluster membership probabilities: ```{r probs-density-plot, results='hide'} ggplot_probs.density(Gibbs = SDresult$Result) ``` ### Trace Plots Check convergence of cluster probabilities: ```{r probs-trace-plot, results='hide'} ggplot_probs.trace(Gibbs = SDresult$Result) ``` ## Cluster Assignments Heatmap Visualize the cluster membership assignments over MCMC iterations: ```{r heatmap-plot, results='hide'} ggplot_Zit.heatmap(Gibbs = SDresult$Result) ``` ## Additional Visualizations The BCFM package includes several other plotting functions to explore your results in depth: - `ggplot_B.trace()` - Check convergence of factor loadings - `ggplot_B.CI()` - Examine credible intervals for loadings - `ggplot_omega.density()` - Visualize within-cluster covariance structures - `ggplot_sigma2.CI()` - Examine error variances - `ggplot_tau.CI()` - Examine factor variances - `ggplot_variability()` - Understand variance decomposition across factors See the function documentation for details. --- # Interpreting Results ## Cluster Assignments The heatmap shows how observations are assigned to clusters. Stable assignments (consistent colors across iterations) indicate good convergence and clear cluster structure. ## Factor Loadings The loading matrix **B** shows which variables are associated with each latent factor. Large credible intervals that include zero suggest weak associations. ## Cluster Characteristics The latent profiles plot reveals the distinctive patterns of each cluster across variables, helping to interpret what makes each cluster unique. ## Convergence Diagnostics The trace plots for factor loadings and cluster probabilities help assess MCMC convergence. Look for: - No trends over iterations (stationary) - Good mixing (values move freely) - Similar patterns across chains ## Variance Components - **Sigma² (Error variances)**: Larger values indicate more unexplained variance for those variables - **Tau (Factor variances)**: Shows the relative importance of each latent factor - **Variance decomposition**: Illustrates how much variance each factor explains --- # Accessing Model Components You can directly access various components of the fitted model: ```{r model-components} # Cluster assignments (most likely cluster for each observation) cluster_assignments <- SDresult$Result$Z # Loading matrix loadings <- SDresult$Result$B # Cluster means cluster_means <- SDresult$Result$mu # Cluster probabilities cluster_probs <- SDresult$Result$probs # Factor scores factor_scores <- SDresult$Result$X # Show dimensions cat("Dimensions of key components:\n") cat(" B (loadings):", dim(loadings), "\n") cat(" mu (means):", dim(cluster_means), "\n") cat(" X (scores):", dim(factor_scores), "\n") ``` --- # Summary This vignette covered: 1. **Model Selection** - Using `BCFM.model.selection()` to fit models 2. **Output Examination** - Exploring result files and IC matrices 3. **Factor Analysis** - Interpreting loadings, convergence, and variance 4. **Clustering** - Visualizing cluster profiles and assignments 5. **Diagnostics** - Assessing model fit and convergence For more details, see `?BCFM.model.selection` and individual function help pages. --- # Session Info ```{r session-info} sessionInfo() ```