--- title: "PubMatrixR: Literature Co-occurrence Analysis" subtitle: "A comprehensive guide to analyzing publication relationships" output: rmarkdown::html_vignette author: "ToledoEM" date: "`r Sys.Date()`" vignette: > %\VignetteIndexEntry{PubMatrixR: Literature Co-occurrence Analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup-options, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", warning = FALSE, message = FALSE, out.width = "100%", dpi = 150 ) ``` ```{r setup, message=FALSE} library(PubMatrixR) library(knitr) library(kableExtra) library(dplyr) library(pheatmap) library(ggplot2) ``` ## Introduction PubMatrixR is an R package designed to analyze literature co-occurrence patterns by systematically searching PubMed and PMC databases. This vignette demonstrates how to: - Create co-occurrence matrices from literature searches - Visualize results using custom heatmaps with overlap percentage clustering - Work with gene sets from MSigDB - Create bar plots showing publication patterns by gene - Export results for further analysis ### Acknowledgments This package is a heavily fork of the original [PubMatrixR](https://github.com/tslaird/PubMatrixR) by tslaird. Our gratitude to the original author. ### NCBI API Key (Recommended) For better performance and higher rate limits, we recommend obtaining an NCBI API key: - **Without API key**: 3 requests per second - **With API key**: 10 requests per second To obtain your free NCBI API key, visit: Once you have your API key, you can use it in PubMatrixR like this: ```{r, eval=FALSE} result <- PubMatrix( A = gene_set_1, B = gene_set_2, API.key = "your_api_key_here", Database = "pubmed" ) ``` ## Preparing Gene Sets ### Making the gene lists from MSigDB For this example, we'll extract genes related to WNT signaling and obesity from the MSigDB database: `msigdf` is optional and not required to build this vignette. The example below is shown for reference only and is not executed during package checks. ```{r gene_extraction, eval = FALSE} # Extract WNT-related genes A <- msigdf::msigdf.human %>% dplyr::filter(grepl(geneset, pattern = "wnt", ignore.case = TRUE)) %>% dplyr::pull(symbol) %>% unique() # Extract obesity-related genes B <- msigdf::msigdf.human %>% dplyr::filter(grepl(geneset, pattern = "obesity", ignore.case = TRUE)) %>% dplyr::pull(symbol) %>% unique() # Sample genes for demonstration (making them equal in length) A <- sample(A, 10, replace = FALSE) B <- sample(B, 10, replace = FALSE) ``` ### Fallback Example Dataset When MSigDB is not available, we use these representative gene sets: ```{r example_genes} # WNT signaling pathway genes A <- c("WNT1", "WNT2", "WNT3A", "WNT5A", "WNT7B", "CTNNB1", "DVL1") # Obesity-related genes B <- c("LEPR", "ADIPOQ", "PPARG", "TNF", "IL6", "ADRB2", "INSR") ``` ## Literature Analysis ### Running PubMatrixR Now we'll search for co-occurrences between our gene sets in PubMed literature: ```{r pubmatrix_analysis, eval = FALSE} # Intentionally not run during package checks: this performs live NCBI queries. # Run actual PubMatrix analysis current_year <- as.integer(format(Sys.Date(), "%Y")) result <- PubMatrix( A = A, B = B, Database = "pubmed", daterange = c(2020, current_year), outfile = "pubmatrix_result" ) ``` ```{r pubmatrix_analysis_offline} # Offline deterministic example used for vignette rendering/package checks. result <- outer(seq_along(B), seq_along(A), function(i, j) { 8 + (i * 6) + (j * 5) + ((i + j) %% 4) * 3 + ((i * j) %% 5) }) result <- as.data.frame(result, check.names = FALSE) colnames(result) <- A rownames(result) <- B ``` ### Results Table The co-occurrence matrix shows the number of publications mentioning each pair of genes: ```{r results_table} kable(result, caption = "Co-occurrence Matrix: WNT Genes vs Obesity Genes (Publication Counts)", align = "c", format = if (knitr::pandoc_to() == "html") "html" else "markdown" ) %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "center" ) %>% kableExtra::add_header_above(c(" " = 1, "Obesity Genes" = length(B))) ``` ## Visualization ### Publication Count Bar Plots Let's first examine which genes have the highest publication counts: ```{r bar_plots, fig.width=10, fig.height=7, out.width="100%", dpi=150} # Create data frame for List A genes (rows) colored by List B genes (columns) a_genes_data <- data.frame( gene = rownames(result), total_pubs = rowSums(result), stringsAsFactors = FALSE ) # Add color coding based on max overlap with B genes a_genes_data$max_b_gene <- apply(result, 1, function(x) colnames(result)[which.max(x)]) a_genes_data$max_overlap <- apply(result, 1, max) # Create data frame for List B genes (columns) colored by List A genes (rows) b_genes_data <- data.frame( gene = colnames(result), total_pubs = colSums(result), stringsAsFactors = FALSE ) # Add color coding based on max overlap with A genes b_genes_data$max_a_gene <- apply(result, 2, function(x) rownames(result)[which.max(x)]) b_genes_data$max_overlap <- apply(result, 2, max) bar_plot_theme <- theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 10), axis.title = element_text(size = 11), axis.text = element_text(size = 10), panel.grid.major.y = element_blank(), legend.position = "bottom", legend.title = element_text(size = 10), legend.text = element_text(size = 9), legend.box = "vertical" ) n_fill_a <- length(unique(a_genes_data$max_b_gene)) n_fill_b <- length(unique(b_genes_data$max_a_gene)) # Plot A genes colored by their strongest B gene partner p1 <- ggplot(a_genes_data, aes(x = reorder(gene, total_pubs), y = total_pubs, fill = max_b_gene)) + geom_col(width = 0.75) + coord_flip() + labs( title = "List A Genes by Publication Count", subtitle = "Colored by strongest List B partner", x = "Genes (List A)", y = "Total Publications", fill = "Strongest B Partner" ) + scale_fill_viridis_d(option = "D", end = 0.9) + guides(fill = guide_legend(nrow = max(1, ceiling(n_fill_a / 4)), byrow = TRUE)) + bar_plot_theme # Plot B genes colored by their strongest A gene partner p2 <- ggplot(b_genes_data, aes(x = reorder(gene, total_pubs), y = total_pubs, fill = max_a_gene)) + geom_col(width = 0.75) + coord_flip() + labs( title = "List B Genes by Publication Count", subtitle = "Colored by strongest List A partner", x = "Genes (List B)", y = "Total Publications", fill = "Strongest A Partner" ) + scale_fill_viridis_d(option = "D", end = 0.9) + guides(fill = guide_legend(nrow = max(1, ceiling(n_fill_b / 4)), byrow = TRUE)) + bar_plot_theme print(p1) print(p2) ``` ### Heatmap with Overlap Percentages The heatmap displays overlap percentages calculated from the publication co-occurrence counts: ```{r heatmap_with_numbers_asymmetric, fig.width=10.5, fig.height=7.5, out.width="100%", dpi=150, fig.cap="Overlap percentage heatmap with values displayed in each cell"} plot_pubmatrix_heatmap(result, title = "WNT-Obesity Overlap (%)", show_numbers = TRUE, cellwidth = 44, cellheight = 32, width = 12, height = 10 ) ``` ### Clean Heatmap For a cleaner visualization without numbers, useful for presentations: ```{r heatmap_clean_asymmetric, fig.width=10.5, fig.height=7.5, out.width="100%", dpi=150, fig.cap="Co-occurrence heatmap without numbers for better visual clarity"} plot_pubmatrix_heatmap(result, title = "WNT-Obesity Co-occurrence (Clean)", show_numbers = FALSE, cellwidth = 44, cellheight = 32, width = 12, height = 10 ) ``` Asymmetric lists ```{r asymmetric_example, eval = FALSE} # Intentionally not run during package checks: this performs live NCBI queries. A <- c("NCOR2", "NCSTN", "NKD1", "NOTCH1", "NOTCH4", "NUMB", "PPARD", "PSEN2", "PTCH1", "RBPJ", "SKP2", "TCF7", "TP53") B <- c("EIF1", "EIF1AX", "EIF2B1", "EIF2B2", "EIF2B3", "EIF2B4", "EIF2B5", "EIF2S1", "EIF2S2", "EIF2S3", "ELAVL1") # Run actual PubMatrix analysis current_year <- as.integer(format(Sys.Date(), "%Y")) result <- PubMatrix( A = A, B = B, Database = "pubmed", daterange = c(2020, current_year), outfile = "pubmatrix_result" ) ``` ```{r asymmetric_example_offline} # Offline deterministic example used for vignette rendering/package checks. A <- c("NCOR2", "NCSTN", "NKD1", "NOTCH1", "NOTCH4", "NUMB", "PPARD", "PSEN2", "PTCH1", "RBPJ", "SKP2", "TCF7", "TP53") B <- c("EIF1", "EIF1AX", "EIF2B1", "EIF2B2", "EIF2B3", "EIF2B4", "EIF2B5", "EIF2S1", "EIF2S2", "EIF2S3", "ELAVL1") result <- outer(seq_along(B), seq_along(A), function(i, j) { 12 + (i * 4) + (j * 3) + ((i * j) %% 9) + ((i + 2 * j) %% 6) }) result <- as.data.frame(result, check.names = FALSE) colnames(result) <- A rownames(result) <- B ``` ### Results Table The co-occurrence matrix shows the number of publications mentioning each pair of genes: ```{r results_table_asymmetric} kable(result, caption = "Co-occurrence Matrix: Longer Lists (Publication Counts)", align = "c", format = if (knitr::pandoc_to() == "html") "html" else "markdown" ) %>% kableExtra::kable_styling( bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "center" ) %>% kableExtra::add_header_above(c(" " = 1, "A Genes" = length(A))) ``` ### Bar Plots for Asymmetric Lists ```{r bar_plots_asymmetric, fig.width=10, fig.height=8, out.width="100%", dpi=150} # Create data frame for List A genes (rows) colored by List B genes (columns) a_genes_data2 <- data.frame( gene = rownames(result), total_pubs = rowSums(result), stringsAsFactors = FALSE ) # Add color coding based on max overlap with B genes a_genes_data2$max_b_gene <- apply(result, 1, function(x) colnames(result)[which.max(x)]) a_genes_data2$max_overlap <- apply(result, 1, max) # Create data frame for List B genes (columns) colored by List A genes (rows) b_genes_data2 <- data.frame( gene = colnames(result), total_pubs = colSums(result), stringsAsFactors = FALSE ) # Add color coding based on max overlap with A genes b_genes_data2$max_a_gene <- apply(result, 2, function(x) rownames(result)[which.max(x)]) b_genes_data2$max_overlap <- apply(result, 2, max) bar_plot_theme <- theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 10), axis.title = element_text(size = 11), axis.text = element_text(size = 10), panel.grid.major.y = element_blank(), legend.position = "bottom", legend.title = element_text(size = 10), legend.text = element_text(size = 8.5), legend.box = "vertical" ) n_fill_a2 <- length(unique(a_genes_data2$max_b_gene)) n_fill_b2 <- length(unique(b_genes_data2$max_a_gene)) # Plot A genes colored by their strongest B gene partner p3 <- ggplot(a_genes_data2, aes(x = reorder(gene, total_pubs), y = total_pubs, fill = max_b_gene)) + geom_col(width = 0.75) + coord_flip() + labs( title = "List A Genes by Publication Count (Asymmetric)", subtitle = "Colored by strongest List B partner", x = "Genes (List A)", y = "Total Publications", fill = "Strongest B Partner" ) + scale_fill_viridis_d(option = "D", end = 0.9) + guides(fill = guide_legend(nrow = max(1, ceiling(n_fill_a2 / 4)), byrow = TRUE)) + bar_plot_theme # Plot B genes colored by their strongest A gene partner p4 <- ggplot(b_genes_data2, aes(x = reorder(gene, total_pubs), y = total_pubs, fill = max_a_gene)) + geom_col(width = 0.75) + coord_flip() + labs( title = "List B Genes by Publication Count (Asymmetric)", subtitle = "Colored by strongest List A partner", x = "Genes (List B)", y = "Total Publications", fill = "Strongest A Partner" ) + scale_fill_viridis_d(option = "D", end = 0.9) + guides(fill = guide_legend(nrow = max(1, ceiling(n_fill_b2 / 4)), byrow = TRUE)) + bar_plot_theme print(p3) print(p4) ``` ### Heatmap with Overlap Percentages The heatmap displays overlap percentages calculated from the publication co-occurrence counts: ```{r heatmap_with_numbers_asymmetric2, fig.width=10.5, fig.height=7.5, out.width="100%", dpi=150, fig.cap="Overlap percentage heatmap with values displayed in each cell"} plot_pubmatrix_heatmap(result, title = "Asymmetric Lists Overlap (%)", show_numbers = TRUE, cellwidth = 44, cellheight = 32, width = 12, height = 10 ) ``` ### Clean Heatmap For a cleaner visualization without numbers, useful for presentations: ```{r heatmap_clean_asymmetric2, fig.width=10.5, fig.height=7.5, out.width="100%", dpi=150, fig.cap="Co-occurrence heatmap without numbers for better visual clarity"} plot_pubmatrix_heatmap(result, title = "Asymmetric Lists Co-occurrence (Clean)", show_numbers = FALSE, cellwidth = 44, cellheight = 32, width = 12, height = 10 ) ``` ## Summary This vignette demonstrated: 1. **Gene Set Preparation**: Using MSigDB or manual curation to create meaningful gene lists 2. **Literature Analysis**: Running PubMatrixR to generate co-occurrence matrices 3. **Data Visualization**: Creating publication-ready heatmaps with custom color schemes and bar plots 4. **Results Interpretation**: Understanding co-occurrence patterns in the literature The resulting matrices and visualizations can help identify: - Strong literature connections between gene sets - Potential research gaps (low co-occurrence pairs) - Patterns in publication trends over time - Most studied genes and their strongest literature partners ## System Information ```{r system_info} sessionInfo() ```