---
title: "Introduction to scToppR"
package: "`r pkg_ver('scToppR')`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{1. Introduction to scToppR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r opts, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

scToppR is a package that allows seamless, workflow-based interaction with ToppGene, a portal for gene enrichment analysis. Researchers can use scToppR to directly query ToppGene's databases and conduct analysis with a few lines of code. scToppR's availability on Bioconductor ensures easy installation and integration with other Bioconductor workflows, allowing researchers to incorporate functional enrichment analysis from ToppGene into their existing pipelines.

The use of data from ToppGene is governed by their Terms of Use:
https://toppgene.cchmc.org/navigation/termsofuse.jsp 

This vignette demonstrates the use of scToppR within a differential expression workflow, from differential expression results through pathway analysis to visualization. The examples make live API calls to ToppGene when an internet connection is available and otherwise fall back to pre-computed results included with the package, so the vignette can be built without network access.

# Installation

```{r installation, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("scToppR")
```


# Load Data
As an introduction, this vignette works with the `FindAllMarkers` output from Seurat's PBMC 3k clustering tutorial: [https://satijalab.org/seurat/articles/pbmc3k_tutorial.html](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)

If you follow that tutorial, the markers table is produced by this line:

```
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
```
Alternatively, this markers table is included in the scToppR package:
```{r setup}
library(scToppR)
data("pbmc.markers")
head(pbmc.markers)
```


With this data we can run the function `toppFun` to get results from ToppGene. `toppFun` accepts three input formats, selected with the `type` argument:

- A vector of gene symbols (`type = "marker_list"`):
  a plain vector of gene symbols without any additional information.

- A data frame of cluster marker genes (`type = "marker_df"`):
  each column is a different cluster or cell type, and the rows of a column hold the marker genes for that cluster.

- A data frame of differentially expressed genes (`type = "degs"`):
  the typical output of a differential expression analysis such as DESeq2, containing gene symbols and statistics including p values and log fold changes. If the data frame has a cluster or cell type column, the function can run the ToppGene analysis for each cluster separately.
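
To make the three input shapes concrete, here is a minimal sketch built with base R only. The gene symbols, cluster names, and numeric values are toy values chosen for illustration and are not taken from the PBMC data; the column names of the `degs` example are assumed to mirror those of `pbmc.markers`.

```r
# Toy inputs illustrating the three shapes selected by `type` (all values are made up)

# type = "marker_list": a plain character vector of gene symbols
marker_list <- c("CD3E", "IL7R", "CCR7")

# type = "marker_df": one column per cluster; rows hold that cluster's marker genes
marker_df <- data.frame(
    T_cells = c("CD3E", "IL7R", "CCR7"),
    B_cells = c("MS4A1", "CD79A", "CD79B")
)

# type = "degs": one row per gene, with statistics and an optional cluster column
degs <- data.frame(
    gene = c("CD3E", "IL7R", "MS4A1", "CD79A"),
    cluster = c("T_cells", "T_cells", "B_cells", "B_cells"),
    p_val_adj = c(1e-30, 1e-12, 1e-25, 1e-08),
    avg_log2FC = c(1.8, 1.2, 2.5, 2.1)
)

# Inspect the structure of each object
str(marker_list)
str(marker_df)
str(degs)
```

Comparing the `str()` output of your own object with these shapes is a quick way to decide which `type` value applies.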

The `pbmc.markers` data is in the "degs" format, so we set `type = "degs"` in `toppFun`. We also specify the relevant columns for clusters, genes, p values, and log fold changes:

```{r toppFun_1}
# Query ToppGene live when online (requires internet);
# otherwise load the pre-computed results shipped with the package
if (curl::has_internet()) {
    toppdata.pbmc <- toppFun(
        input_data = pbmc.markers,
        type = "degs",
        topp_categories = NULL,
        cluster_col = "cluster",
        gene_col = "gene",
        p_val_col = "p_val_adj",
        logFC_col = "avg_log2FC"
    )
} else {
    data("toppdata.pbmc")
}

head(toppdata.pbmc)
```


Setting `topp_categories` to `NULL`, as in the call above, runs `toppFun` on all ToppGene categories. You may instead provide one or more specific categories as a list. To see all available ToppGene categories, use the function `get_ToppCats()`:

```{r topp_cats}
get_ToppCats()
```

`toppFun` accepts additional parameters; see its documentation (`?toppFun`) for details.

The results of toppFun (whether from a live API call or loaded from cached data) are organized into a data frame with the following structure:

```{r toppData_structure}
# Examine the structure of the results
str(toppdata.pbmc)
cat("Number of enriched terms:", nrow(toppdata.pbmc), "\n")
cat("Categories analyzed:", length(unique(toppdata.pbmc$Category)), "\n")
cat("Clusters analyzed:", length(unique(toppdata.pbmc$Cluster)), "\n")
```
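
Because the result is an ordinary data frame, base R is enough for a quick tally of enriched terms per cluster and category. The sketch below uses a small stand-in data frame containing only the `Cluster` and `Category` columns referenced above; its rows, and the "Pathway" category label, are invented for illustration.

```r
# Stand-in for a toppFun result (toy rows, not real enrichment output)
toy_results <- data.frame(
    Cluster = c("0", "0", "0", "1", "1"),
    Category = c(
        "GeneOntologyMolecularFunction", "GeneOntologyMolecularFunction", "Pathway",
        "GeneOntologyMolecularFunction", "Pathway"
    )
)

# Cross-tabulate the number of terms per cluster (rows) and category (columns)
term_counts <- table(toy_results$Cluster, toy_results$Category)
print(term_counts)

# Keep only the rows belonging to a single category
mf_only <- toy_results[toy_results$Category == "GeneOntologyMolecularFunction", ]
nrow(mf_only)
```

The same two steps should apply to `toppdata.pbmc`, assuming it carries the `Cluster` and `Category` columns used in the chunk above.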

## Plotting

scToppR can automatically create dot plots for each ToppGene category with `toppPlot`:

```{r toppPlot_1}
plots <- toppPlot(toppdata.pbmc,
    category = "GeneOntologyMolecularFunction",
    clusters = NULL
)
plots[1]
```

This creates a list of plots, one per cluster, for a single category. Here the category "GeneOntologyMolecularFunction" was requested and the `clusters` parameter was left at its default of `NULL`, so all available clusters are used. When multiple clusters are plotted, set `combine = TRUE` to return a single patchwork object; leaving `combine = FALSE` returns a list of ggplot objects. With `save = TRUE`, the function automatically saves each individual plot using the file name format `{category}_{cluster}_dotplot.pdf`.
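
As a quick illustration of that naming pattern, the sketch below builds the file names one would expect for three clusters. It only reproduces the quoted pattern with `paste0`; the cluster labels are hypothetical and the code does not call scToppR.

```r
# Reconstruct the documented pattern {category}_{cluster}_dotplot.pdf
category <- "GeneOntologyMolecularFunction"
clusters <- c("0", "1", "2") # hypothetical cluster labels

expected_files <- paste0(category, "_", clusters, "_dotplot.pdf")
expected_files
```

A vector like this can be handy for checking that every expected plot was written, for example with `file.exists()`.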

scToppR can also create balloon plots showing overlapping terms between all clusters.

```{r toppBalloon}
toppBalloon(toppdata.pbmc, categories = "GeneOntologyMolecularFunction")
```

This function also has a `save` parameter that automatically saves the plots, which is helpful when multiple categories are visualized.


## Saving 

The results of the ToppGene query can be saved with `toppSave`. By default it writes a separate file for each cluster; to write one large file instead, set the parameter `split = FALSE`. Files are saved as Excel spreadsheets by default, but this can be changed with the `format` parameter, which must be one of `c("xlsx", "csv", "tsv")`.

```{r save}
tmpdir <- tempdir()
toppSave(toppdata.pbmc, filename = "PBMC", save_dir = tmpdir, split = TRUE, format = "xlsx")
```
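
The difference between split and combined output can be sketched with base R alone. The example below is not how `toppSave` is implemented; it is a hypothetical stand-in that writes a toy results table either as one CSV file per cluster or as a single file. The `Name` column, the output directory, and the file names are all invented for illustration.

```r
# Toy results table (made-up rows; only the Cluster/Category columns echo the vignette)
toy_results <- data.frame(
    Cluster = c("0", "0", "1"),
    Category = "GeneOntologyMolecularFunction",
    Name = c("term A", "term B", "term C")
)

# Write everything into a scratch directory under tempdir()
out_dir <- file.path(tempdir(), "toy_topp_output")
dir.create(out_dir, showWarnings = FALSE)

# Split-style output: one file per cluster
per_cluster <- split(toy_results, toy_results$Cluster)
for (cl in names(per_cluster)) {
    write.csv(per_cluster[[cl]],
        file = file.path(out_dir, paste0("toy_cluster_", cl, ".csv")),
        row.names = FALSE
    )
}

# Combined output: all clusters in a single file
write.csv(toy_results,
    file = file.path(out_dir, "toy_all_clusters.csv"),
    row.names = FALSE
)

list.files(out_dir)
```

Per-cluster files are convenient for sharing individual results, while a single file is easier to filter and re-import as a whole.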

```{r sessionInfo}
sessionInfo()
```