---
title: "Using the GEOquery Package"
author: "Sean Davis"
date: "2025-08-16"
last-modified: {{< meta last-modified >}}
format:
  html:
    toc: true
vignette: >
  %\VignetteIndexEntry{Using GEOquery (Quarto)}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

```{r setup}
#| message: false
#| warning: false
#| echo: false
library(GEOquery)
library(knitr)
```

# Introduction to GEO and GEOquery

The NCBI Gene Expression Omnibus ([GEO](https://www.ncbi.nlm.nih.gov/geo/)) was established in 2000 as a public repository for high-throughput molecular abundance data, primarily microarray data at the time. Today, GEO hosts a diverse array of data types including gene expression, genomic DNA, and protein abundance measurements from various technologies like microarrays, next-generation sequencing, and mass spectrometry.

GEOquery was created to bridge the gap between this vast resource and Bioconductor's analytical tools. First released in 2007, GEOquery has evolved alongside GEO itself, adapting to new data types and formats over time.

## Why GEOquery?

Before GEOquery, researchers would need to:

1. Manually download data from the GEO website
2. Parse complex SOFT format files
3. Construct data structures suitable for analysis
4. Integrate metadata with expression data

GEOquery automates this entire process, allowing researchers to focus on analysis rather than data acquisition and formatting. Because data acquisition becomes part of a workflow, script, or document, GEOquery also facilitates automation and reproducibility.

## GEO Data Organization

Understanding GEO's data organization is essential for effective use of GEOquery:

1. **Platform (GPL)**: Describes array design, probes, or detectable elements
2. **Sample (GSM)**: Contains individual experiment measurements
3. **Series (GSE)**: Groups related samples together, typically representing a complete study
4. **Dataset (GDS)**: Curated by GEO staff, represents biologically and statistically comparable samples

# Getting Started with GEOquery

Before working with GEOquery (or any R or Bioconductor package), you must first install it. Installation is a one-time, or at least infrequent, operation.

First ensure that you have [installed R and Bioconductor](https://bioconductor.org/install/). Then install GEOquery:

```{r}
#| eval: false
BiocManager::install('GEOquery')
```

Before using GEOquery, load the package. This must be done _each time you start a new R session_.

```{r}
library(GEOquery)
```

## Downloading a GEO Series

The most common use case is downloading a GEO Series (GSE), which typically represents a complete study:

```{r}
# Download GSE2553
gse <- getGEO("GSE2553")
class(gse)
```

Notice that `getGEO` returns a list. This is because a single GSE can contain experiments from multiple platforms. Each element of the list is an `ExpressionSet` containing data from one platform:

```{r}
length(gse)
gse[[1]]
```

## Historical Context: SOFT Format vs GSEMatrix Files

GEO originally provided data in SOFT (Simple Omnibus Format in Text) format, which contained extensive information but was slow to parse for large datasets. In response to community needs, GEO introduced GSEMatrix files, a more efficient, tab-delimited format.

GEOquery defaults to using GSEMatrix files (`GSEMatrix=TRUE`) because:

1. Parsing is substantially faster (often by 10-100x)
2. Memory usage is more efficient
3. The resulting `ExpressionSet` objects are directly usable with Bioconductor tools

# Searching GEO Programmatically

While GEO's web interface is powerful, programmatic searches enable automated data discovery and retrieval. GEOquery provides direct access to GEO's search capabilities:

```{r}
#| tbl-cap: "Available GEO search fields."
#| label: tbl-search-fields
# What fields can we search?
fields <- searchFieldsGEO()
kable(fields)
```

GEO uses a specific search syntax with field identifiers in square brackets:

```{r}
#| label: tbl-example-search
#| tbl-cap: "Top search results for studies related to COVID-19 in humans with GEO-calculated RNA-seq counts available."
# Find RNA-seq studies related to COVID-19 in humans
results <- searchGEO('covid-19[All Fields] AND "rnaseq counts"[Filter] AND Homo sapiens[ORGN]')
results |>
  dplyr::mutate(Summary = paste(strtrim(Summary, 120), '...')) |>
  dplyr::mutate(Title = paste(strtrim(Title, 120), '...')) |>
  head() |>
  kable()
```

The search capabilities mirror GEO's web interface but allow for integration with R workflows.

# RNA-seq Quantifications: NCBI's Solution to Reanalysis Challenges

## The Challenge of RNA-seq Reanalysis

A major barrier to exploiting the massive volume of public RNA-seq data has been the computational cost and expertise required to consistently process raw reads into usable expression values. Different processing pipelines can produce different results, making cross-study comparisons challenging.

## NCBI's RNA-seq Quantification Pipeline

To address this challenge, in 2020-2021, the NCBI SRA and GEO teams developed a standardized pipeline that precomputes RNA-seq gene expression counts for human and mouse datasets. As described in their [documentation](https://www.ncbi.nlm.nih.gov/geo/info/rnaseqcounts.html), this pipeline:

1. Processes RNA-seq data from SRA using the HISAT2 aligner
2. Generates gene expression counts using the featureCounts program
3. Provides consistent annotation based on current genome builds
4. Makes counts available in standardized formats

GEOquery provides direct access to these precomputed counts:

```{r}
# Check if RNA-seq quantifications are available
has_quant <- hasRNASeqQuantifications("GSE164073")
has_quant
```

```{r}
#| eval: false
# Get genome build and species information
genome_info <- getRNASeqQuantGenomeInfo("GSE164073")
genome_info
```

```{r}
#| eval: false
# Download and construct a SummarizedExperiment
se <- getRNASeqData("GSE164073")
se
```

This feature saves researchers significant time and computational resources while ensuring standardized processing across datasets.

# Understanding Supplementary Files in GEO

GEO accessions often include supplementary files containing raw data, processing scripts, or additional results not captured in the standard GEO formats. These files are invaluable for:

1. Accessing raw data (e.g., CEL files, FASTQ files)
2. Understanding custom processing pipelines
3. Retrieving additional metadata or results

GEOquery makes accessing these files straightforward:

```{r}
# List available supplementary files without downloading
supp_files <- getGEOSuppFiles('GSE63137', fetch_files = FALSE)
head(supp_files)
```

You can filter files by pattern to find specific file types:

```{r}
# Find all text files
txt_files <- getGEOSuppFiles('GSE63137', fetch_files = FALSE, filter_regex = 'txt')
head(txt_files)
```

You can also download specific files or all supplementary files:

```{r}
#| eval: false
# Download all supplementary files for a sample
getGEOSuppFiles('GSM15789') # Files saved to a new directory
```

# Navigating Between R and the GEO Web Interface

Sometimes you may want to examine a GEO record in its web interface. GEOquery provides convenience functions for this:

```{r}
#| eval: false
# Get the URL for a GEO accession
url <- urlForAccession("GSE262484")
url

# Open a browser to the GEO page
browseGEOAccession("GSE262484")
```

For RNA-seq datasets specifically, there's a convenience function to search for RNA-seq counts on the GEO website:

```{r}
#| eval: false
browseWebsiteRNASeqSearch()
```

These functions bridge the programmatic and web interfaces to GEO, allowing seamless transitions between analytical and exploratory modes.

# Working with GDS Datasets

GEO DataSets (GDS) are curated collections of samples, processed and normalized to be directly comparable. While less common in modern workflows, they remain available and GEOquery supports them:

```{r}
# Download a GDS dataset
gds <- getGEO("GDS507")
gds
```

GDS objects can be converted to Bioconductor data structures:

```{r}
# Convert to ExpressionSet (with log2 transformation)
eset <- GDS2eSet(gds, do.log2=TRUE)
eset
```

```{r}
# Or to a limma MAList
malist <- GDS2MA(gds)
class(malist)
```

These conversions are particularly useful for integrating older GEO datasets into modern analytical workflows.

# Advanced Features

## Getting GSE Data Tables

Some GSE records contain data tables with important metadata not captured in the standard GSE structure:

```{r}
# Get data tables from GSE3494
dt_list <- getGSEDataTables("GSE3494")
names(dt_list)
head(dt_list[[1]])
```

## Working with GPL Platforms

Platform records (GPL) contain important probe annotations:

```{r}
# Get a platform record
gpl <- getGEO("GPL96")
head(Table(gpl)[, 1:5])
```

When retrieving GSE records, GEOquery can automatically include GPL annotation:

```{r}
#| eval: false
# Get GSE with GPL annotation
gse_with_gpl <- getGEO("GSE2553", AnnotGPL=TRUE)
head(fData(gse_with_gpl[[1]]))
```

# Reporting Bugs and Contributing

As GEO continues to evolve, GEOquery adapts to support new features and data types. If you encounter issues:

1. Check the [Bioconductor Support site](https://support.bioconductor.org/)
2. Report bugs on [GitHub](https://github.com/seandavi/GEOquery/issues)
3. Consider contributing via pull requests

# Citing GEOquery

If you use GEOquery in your research, please cite:

```{r}
citation("GEOquery")
```

# Session Information

```{r}
sessionInfo()
```
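
# Appendix: Anatomy of an ExpressionSet

The section on downloading a GEO Series notes that each element returned by `getGEO` is an `ExpressionSet`. As a minimal sketch of what that object holds, the chunk below builds a toy `ExpressionSet` by hand with Biobase rather than downloading one. The probe names, sample names, group labels, and values are all made up for illustration and do not come from any GEO record.

```{r}
library(Biobase)

# Hypothetical expression matrix: 3 probes (rows) x 2 samples (columns)
expr <- matrix(
  c(5.1, 6.2, 7.3, 5.0, 6.5, 7.1),
  nrow = 3,
  dimnames = list(c("probe1", "probe2", "probe3"), c("GSM_A", "GSM_B"))
)

# Hypothetical sample metadata; row names must match the matrix column names
pheno <- data.frame(
  group = c("control", "treated"),
  row.names = colnames(expr)
)

# Combine measurements and sample metadata into one object
eset <- ExpressionSet(
  assayData = expr,
  phenoData = AnnotatedDataFrame(pheno)
)

dim(eset)          # number of features and samples
exprs(eset)        # the expression matrix
pData(eset)        # per-sample metadata
sampleNames(eset)  # sample identifiers
```

The same accessors (`exprs`, `pData`, `sampleNames`) should apply to an object such as `gse[[1]]`, where the sample metadata is whatever the submitter provided to GEO.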