---
title: "Train and Save a BERTopic Model"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Train and Save a BERTopic Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)

# Replace the path below with the path of your Python environment,
# then uncomment the command below.
# Tip: BERTOPICR_VENV should be the folder that contains `pyvenv.cfg`.
# Sys.setenv(
#   BERTOPICR_VENV = "C:/path/to/your/venv",
#   NOT_CRAN = "true"
# )

# 1. Define the Python modules you need
required_modules <- c("bertopic", "umap", "hdbscan", "sklearn", "numpy",
                      "sentence_transformers", "torch")

# macOS: if reticulate fails to load Python libraries, run once per session.
if (identical(Sys.info()[["sysname"]], "Darwin")) {
  bertopicr::configure_macos_homebrew_zlib()
}

# Optional: point reticulate at a user-specified virtualenv
venv <- Sys.getenv("BERTOPICR_VENV")
if (nzchar(venv)) {
  venv_cfg <- file.path(venv, "pyvenv.cfg")
  if (file.exists(venv_cfg)) {
    reticulate::use_virtualenv(venv, required = TRUE)
  } else {
    message("Warning: BERTOPICR_VENV does not point to a valid virtualenv: ", venv)
  }
}

# Try to find Python, but don't crash if it's missing
# (e.g. on another user's machine)
if (!reticulate::py_available(initialize = TRUE)) {
  try(reticulate::use_python(Sys.which("python"), required = FALSE), silent = TRUE)
}

# 2. Check whether the modules are installed
python_ready <- tryCatch({
  # Attempt to initialize Python and check the modules
  py_available(initialize = TRUE) &&
    all(vapply(required_modules, py_module_available, logical(1)))
}, error = function(e) FALSE)

# 3. Only evaluate chunks when Python is ready and NOT_CRAN is set
run_chunks <- python_ready && identical(Sys.getenv("NOT_CRAN"), "true")
knitr::opts_chunk$set(eval = run_chunks)

if (!python_ready) {
  message("Warning: Required Python modules are not available. ",
          "Vignette code will not run.")
} else {
  message("Python environment ready: ", reticulate::py_config()$python)
  if (!identical(Sys.getenv("NOT_CRAN"), "true")) {
    message("Note: Set NOT_CRAN=true to run Python-dependent chunks locally.")
  }
}
```

This vignette shows how to train a BERTopic model from R and persist it to disk along with the R-side extras (probabilities, reduced embeddings, and dynamic topic outputs). Set `eval = TRUE` for the chunks you want to run.

## Load R packages

Python environment selection and checks are handled in the hidden setup chunk at the top of the vignette.

```{r}
library(reticulate)
library(bertopicr)
library(readr)
library(dplyr)
```

## GPU availability (optional)

```{r}
reticulate::py_run_string("import torch; print(torch.cuda.is_available())")
# prints True if a CUDA GPU is available, otherwise False
```

## Load sample data

Below, the German sample data frame is used for topic analysis.

```{r}
sample_path <- system.file("extdata", "spiegel_sample.rds", package = "bertopicr")
df <- read_rds(sample_path)
docs <- df |> pull(text_clean)
```

## Train the model

`train_bertopic_model()` is a convenience function. For more options and parameter fine-tuning, see the other vignette (`topics_spiegel.Rmd`) or the Quarto file (`inst/extdata/topics_spiegel.qmd`). For the remaining settings of `train_bertopic_model()`, check its help file.
```{r}
topic_model <- train_bertopic_model(
  docs = docs,
  top_n_words = 50L, # integer number of top words per topic
  embedding_model = "Qwen/Qwen3-Embedding-0.6B", # choose a (multilingual) model from huggingface.co
  embedding_show_progress = TRUE,
  timestamps = df$date, # set to NULL if not applicable to your data
  classes = df$genre,   # set to NULL if not applicable to your data
  representation_model = "keybert" # keyword generation for each topic
)
```

## Save the model and extras

> BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model, please make sure that the environments in which you save and load the model are **exactly** the same. The version of BERTopic, its dependencies, and Python need to remain the same.

```{r}
save_bertopic_model(topic_model, "topic_model")
```
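
## Reload the model (sketch)

To use the saved model in a later session, it has to be loaded back in. The following is a minimal sketch that assumes `save_bertopic_model()` persisted the underlying Python `BERTopic` object (via pickle) under the path `"topic_model"`; if `bertopicr` ships its own loader, prefer that instead. Per the warning above, the load must happen in the same environment (same BERTopic, dependency, and Python versions) as the save.

```{r}
library(reticulate)

# Assumption: "topic_model" is the pickle path written by save_bertopic_model().
bt <- import("bertopic")
loaded_model <- bt$BERTopic$load("topic_model")

# get_topic_info() returns a pandas data frame with one row per topic,
# including the outlier topic (-1) if HDBSCAN produced one.
info <- py_to_r(loaded_model$get_topic_info())
head(info)
```

`BERTopic.load()` restores only the Python-side model; any R-side extras (probabilities, reduced embeddings, dynamic topic outputs) would need to be saved and reloaded separately, e.g. as `.rds` files.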