---
title: "Getting Started with SportMiner"
author: "Praveen D Chougale and Usha Ananthakumar"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with SportMiner}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Introduction

**SportMiner** is a comprehensive R package for mining, analyzing, and visualizing scientific literature in sport science domains. It provides an end-to-end workflow for:

- Retrieving abstracts from the Scopus database
- Preprocessing and cleaning text data
- Performing advanced topic modeling (LDA, STM, CTM)
- Creating publication-ready visualizations
- Analyzing keyword co-occurrence networks

This vignette demonstrates the core functionality of SportMiner through a practical example.

## Installation

```{r install, eval=FALSE}
install.packages("SportMiner")
```

## Setting Up Your Scopus API Key

Before using SportMiner, you need a Scopus API key. You can obtain one by registering at the [Elsevier Developer Portal](https://dev.elsevier.com/).

```{r api-key}
library(SportMiner)

# Option 1: Set the key directly
sm_set_api_key("your_api_key_here")

# Option 2: Set it via an environment variable (recommended)
# Add to your .Renviron file:
#   SCOPUS_API_KEY=your_api_key_here
# Then restart R and run:
sm_set_api_key()
```

## Step 1: Retrieve Papers from Scopus

Let's search for papers on talent identification in sport science that use principal component analysis or cluster analysis.

```{r search}
# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)

# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)

# View the data structure
head(papers[, c("title", "year", "author_keywords")])
```

## Step 2: Preprocess Text Data

Convert the raw abstracts into a clean, stemmed word-count format.

```{r preprocess}
# Preprocess abstracts
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 3
)

# View the processed data
head(processed_data)
```

## Step 3: Create Document-Term Matrix

Transform the word counts into a sparse matrix suitable for topic modeling.

```{r dtm}
# Create the DTM
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Check dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))
```

## Step 4: Select Optimal Number of Topics

Use coherence-based selection to find the best number of topics.

```{r optimal-k}
# Test different values of k
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 16, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$results)
print(paste("Optimal k:", k_selection$optimal_k))
```

## Step 5: Train Topic Model

Fit an LDA model using the optimal k.

```{r train-lda}
# Train the model
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 500
)
```

## Step 6: Visualize Topics

### Top Terms per Topic

```{r plot-terms}
# Plot the top terms
sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)
```

### Topic Frequency Distribution

```{r plot-frequency}
# Plot the document distribution
sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)
```

### Topic Trends Over Time

```{r plot-trends}
# Add a doc_id to papers for joining
papers$doc_id <- paste0("doc_", seq_len(nrow(papers)))

# Plot trends
sm_plot_topic_trends(
  model = lda_model,
  dtm = dtm,
  metadata = papers,
  doc_id_col = "doc_id"
)
```

## Step 7: Keyword Co-occurrence Network

Visualize how author keywords co-occur across papers.

```{r keyword-network}
# Create the network
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 2,
  top_n = 30
)
print(network_plot)
```

## Advanced: Compare Multiple Models

Compare LDA, STM, and CTM to find the best-performing model.

```{r compare-models}
# Run the comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)

# Get the recommendation
print(paste("Recommended model:", comparison$recommendation))

# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]
```

## Customizing Visualizations

All plotting functions use the custom `theme_sportminer()` theme, but you can customize the output further.

```{r custom-theme}
library(ggplot2)

# Create a plot with custom theme settings
p <- sm_plot_topic_frequency(lda_model, dtm)

# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on 100 papers from Scopus (2011-2025)"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)
```

## Best Practices

1. **API Rate Limits**: Scopus enforces rate limits. Choose `max_count` carefully and add delays between large queries (see the first sketch after this list).

2. **Reproducibility**: Always set a seed when running topic models:

   ```r
   sm_train_lda(dtm, k = 10, seed = 1729)
   ```

3. **Hyperparameter Tuning**: Experiment with `min_term_freq` and `max_term_freq` in `sm_create_dtm()` to balance vocabulary size against model performance (see the second sketch after this list).

4. **Model Selection**: Don't rely solely on coherence scores. Inspect the top terms for each topic to confirm the topics are interpretable.
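To illustrate point 1, the following minimal sketch splits a large retrieval into per-year queries and pauses between API calls. It assumes only the `sm_search_scopus()` arguments and the `PUBYEAR` query operators shown in Step 1; the 5-second delay is an illustrative value, not a documented Scopus limit.

```{r batched-search}
# Minimal sketch: retrieve papers year by year with a pause between
# requests. The delay value is illustrative, not an official limit.
years <- 2011:2015

batches <- lapply(years, function(y) {
  yearly_query <- paste0(
    'TITLE-ABS-KEY("talent identification") AND DOCTYPE(ar) ',
    'AND PUBYEAR > ', y - 1, ' AND PUBYEAR < ', y + 1
  )
  result <- sm_search_scopus(query = yearly_query, max_count = 100)
  Sys.sleep(5)  # spread requests out to stay under the rate limit
  result
})

# Stack the yearly batches into one data frame
papers_all <- do.call(rbind, batches)
```

For point 3, a small sweep over `min_term_freq` shows how the vocabulary size responds. This sketch only reuses the `sm_create_dtm()` call and the `dtm$ncol` field from Step 3; the candidate frequency values are arbitrary starting points.

```{r vocab-sweep}
# Minimal sketch: sweep min_term_freq and report the resulting
# vocabulary size, reusing the sm_create_dtm() arguments from Step 3.
for (freq in c(2, 3, 5, 10)) {
  dtm_try <- sm_create_dtm(
    word_counts = processed_data,
    min_term_freq = freq,
    max_term_freq = 0.5
  )
  cat("min_term_freq =", freq, "-> terms:", dtm_try$ncol, "\n")
}
```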
## Next Steps

- Explore the package documentation for a detailed function reference
- Experiment with different preprocessing and modeling parameters
- Contact the maintainer for bug reports and feature requests

## Citation

If you use SportMiner in your research, please cite:

```r
citation("SportMiner")
```

## References

- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. *Journal of Machine Learning Research*, 3, 993-1022.
- Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. *Journal of Statistical Software*, 91(2), 1-40.