Contents

1 Abstract

AnnotatedBCGEData is an R/Bioconductor package that provides access to more than 100 gene-expression datasets generated using microarray and RNA-Sequencing technologies. We collected these datasets from public repositories, including Gene Expression Omnibus, ArrayExpress, and The Cancer Genome Atlas. Where feasible, we obtained and reprocessed the raw data using computational pipelines and annotations that are standardized and more updated than those used to process the data originally. Additionally, we annotated the sample metadata using ontology terms from the National Cancer Institute Thesaurus. This was necessary because different data submitters used different terminology to describe breast-cancer patients and samples. Mapping the metadata to ontology terms makes it possible for users to search the data using these standardized terms.

2 Installation instructions

Get the latest stable R release from CRAN.

Then install AnnotatedBCGEData from Bioconductor using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("AnnotatedBCGEData")

Or install the development version:

BiocManager::install("srp33/AnnotatedBCGEData")

Finally, load the package:

library(AnnotatedBCGEData)

3 Data retrieval

Attempting to store all of the data within this package would be infeasible due to the data’s size. Thus, we store the data remotely. Once a user has identified dataset(s) to analyze, our package retrieves the data from a public Zenodo repository. (If they prefer, researchers can find details about the data and access the files directly by exploring our Zenodo repository with a Web browser.)

3.1 Searching for ontology terms

This package contains functions designed to help users filter the included datasets based on their associated ontology terms. The first function that researchers typically use is searchOntologyTerms. Its purpose is to help users identify ontology terms mapped to the data. It accepts two parameters: a string to search by and what should be searched. The second parameter allows the user to indicate whether the string is expected to match the “Name” (default), “Definition”, “URI”, or “Code” of the ontology term(s). The function returns a tibble with information about the matching terms. The user reviews the matching term(s) and retrieves the corresponding ontology code(s), which can then be passed to the searchForDatasetsByField or searchForDatasetsByValue functions, which return information about datasets associated with the term(s).

In the following example, we are using the searchOntologyTerms function to search for ontology terms that match the string, “Progesterone Receptor Status”. Because we did not specify a value for the term_type argument, the default of “Name” is used.

searchOntologyTerms("Progesterone Receptor Status")
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17488901'!
## Rows: 518 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): URI, Preferred Label, Synonyms, Definitions, code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
##   URI                                           Name  Synonyms Definitions Code 
##   <chr>                                         <chr> <chr>    <chr>       <chr>
## 1 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 2 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… Indicates … C161…
## 3 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 4 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 5 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… PgR Sta… An immunoh… C147…

In the following similar example, we are searching for ontology terms that match the shorter (and lowercase) string, “receptor status”.

searchOntologyTerms("receptor status")
## Rows: 518 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): URI, Preferred Label, Synonyms, Definitions, code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 11 × 5
##    URI                                          Name  Synonyms Definitions Code 
##    <chr>                                        <chr> <chr>    <chr>       <chr>
##  1 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… "An immuno… C147…
##  2 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… "Indicates… C161…
##  3 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Rece… Recepto… "Refers to… C942…
##  4 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… "An immuno… C147…
##  5 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… Estroge… "The estro… C161…
##  6 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… "An immuno… C147…
##  7 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… "An immuno… C147…
##  8 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… "An immuno… C147…
##  9 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… "An immuno… C147…
## 10 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… PgR Sta… "An immuno… C147…
## 11 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… "An immuno… C147…

The function returns a tibble containing details about ontology terms related to both progesterone receptor status and estrogen receptor status.

The National Cancer Institute Thesaurus provides synonyms for most terms. When a user searches by “Name” with the searchOntologyTerms function, it automatically checks against synonyms as well. This is helpful when the search text does not match the “preferred name” of an ontology term but instead matches a synonym. For example, we could use the following code to search for ontology terms related to progesterone receptor status using a synonym that is commonly used in medical contexts.

searchOntologyTerms("PgR Status")
## Rows: 518 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): URI, Preferred Label, Synonyms, Definitions, code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 5 × 5
##   URI                                           Name  Synonyms Definitions Code 
##   <chr>                                         <chr> <chr>    <chr>       <chr>
## 1 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 2 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… Indicates … C161…
## 3 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 4 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… Progest… An immunoh… C147…
## 5 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaur… Prog… PgR Sta… An immunoh… C147…

Another alternative is to search ontology term definitions. This is helpful when you don’t know the exact term but understand the concept you’re looking for. In this example, we are searching for definitions matching the string, “immunohistochemical”.

searchOntologyTerms("immunohistochemical", term_type = "Definition")
## Rows: 518 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): URI, Preferred Label, Synonyms, Definitions, code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 16 × 5
##    URI                                          Name  Synonyms Definitions Code 
##    <chr>                                        <chr> <chr>    <chr>       <chr>
##  1 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… An immunoh… C147…
##  2 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… An immunoh… C147…
##  3 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Immu… Immunoh… The result… C175…
##  4 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… An immunoh… C147…
##  5 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… An immunoh… C147…
##  6 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… An immunoh… C147…
##  7 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… The result… C185…
##  8 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… Progest… An immunoh… C147…
##  9 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… IGF1… IGF1R I… An immunoh… C843…
## 10 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… An immunoh… C147…
## 11 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… An immunoh… C147…
## 12 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… An immunoh… C147…
## 13 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Prog… PgR Sta… An immunoh… C147…
## 14 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… Estr… ER Stat… An immunoh… C147…
## 15 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… An immunoh… C147…
## 16 http://ncicb.nci.nih.gov/xml/owl/EVS/Thesau… HER2… HER2/Ne… An immunoh… C141…

In this case, the function returns a tibble containing details about ontology terms for concepts related to using immunohistochemical methods for profiling. This includes terms related to progesterone receptor status as well as other cell-surface markers.

3.2 Identifying datasets associated with specific ontology terms

After we have identified ontology terms for the research topic we wish to study, we can identify datasets associated with those terms. The searchForDatasetsByField function accepts codes for ontology terms. Each of these codes should be a unique identifier associated with an ontology term (typically found using the searchOntologyTerms function). The searchForDatasetsByField function checks the ontology terms associated with metadata columns (fields) and returns a tibble with the identifier(s) of the associated dataset(s), the names of columns that have been mapped to any of the terms, and the associated codes. The dataset identifiers can then be passed to the getDataset function to retrieve the data.

Similarly, the searchForDatasetsByValue function accepts codes for ontology terms that have been mapped to values within a given column (field). The function searches across the datasets and returns a tibble with information about the matching datasets. The dataset identifiers can then be passed to the getDataset function to retrieve the data.

In the previous examples, we searched for ontology terms related to progesterone receptor status. One of the matching terms used “Progesterone Receptor Status” as its name “C16149” as its code. We also identified “C16150” as the code associated with “Estrogen Receptor Status.” In the following example, we are using the searchForDatasetsByField function to search for datasets with a column (field) that has been mapped to either of these ontology terms.

searchForDatasetsByField(c("C16149", "C16150"))
## ℹ Number of records: 3
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17583904'!
## Rows: 6079 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): dataset, orig_field, NCIT_field, NCIT_values, orig_values, NCIT_fie...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 116 × 3
##    Dataset_ID Field Code  
##    <chr>      <chr> <chr> 
##  1 GSE11001   er    C16150
##  2 GSE19615   er    C16150
##  3 GSE19697   er    C16150
##  4 GSE20194   er    C16150
##  5 GSE24185   er    C16150
##  6 GSE2990    er    C16150
##  7 GSE43365   er    C16150
##  8 GSE48390   er    C16150
##  9 GSE50948   er    C16150
## 10 GSE5460    er    C16150
## # ℹ 106 more rows

Alternatively (or instead), we can use the searchForDatasetsByValue function to search for the datasets with values in any column that have been associated with a specified ontology term. For example, if we wanted to identify datasets with samples that had been indicated as positive for the progesterone receptor, we could use the code “C15496” for the Progesterone Receptor Positive term.

searchForDatasetsByValue("C15496")
## Rows: 6079 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (9): dataset, orig_field, NCIT_field, NCIT_values, orig_values, NCIT_fie...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 48 × 5
##    Dataset_ID        Field          Field_Code Original_Value              Code 
##    <chr>             <chr>          <chr>      <chr>                       <chr>
##  1 GSE2603           path_pr_status C16149     P                           C154…
##  2 GSE61304          pgr            C16149     1                           C154…
##  3 GSE6532_U133A     pgr            C16149     1                           C154…
##  4 GSE6532_U133B     pgr            C16149     1                           C154…
##  5 GSE6532_U133Plus2 pgr            C16149     1                           C154…
##  6 GSE9195           pgr            C16149     1                           C154…
##  7 GSE45255          pgr_status     C16149     PgR+||positive||focally po… C154…
##  8 GSE58984          pgr_status     C16149     1                           C154…
##  9 GSE96058_HiSeq    pgr_status     C16149     1                           C154…
## 10 GSE96058_NextSeq  pgr_status     C16149     1                           C154…
## # ℹ 38 more rows

3.3 Obtaining dataset-level metadata

In addition to using the ontology, users can decide which datasets to analyze based on metadata about the datasets. This is possible using the getDatasetMetadata function, which accepts a vector of dataset identifiers. Below is an example.

getDatasetMetadata(c("GSE41197", "GSE59772", "GSE10797"))
## ℹ Number of records: 3
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17780657'!
## Rows: 102 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (12): accession_id, contact_city, contact_country, contact_institute, av...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 6
##   Dataset_ID Source Experiment_Type               Title   Summary Overall_Design
##   <chr>      <chr>  <chr>                         <chr>   <chr>   <chr>         
## 1 GSE10797   GEO    Expression profiling by array Transc… The mo… Breast tissue…
## 2 GSE41197   GEO    Expression profiling by array Differ… We use… RNA from huma…
## 3 GSE59772   GEO    Expression profiling by array Analys… Triple… Three FFPE br…

4 Obtaining gene-expression data (as SummarizedExperiment objects)

Installing our package does not download the datasets themselves. The data files are stored in a Zenodo repository and retrieved on demand. To obtain data, use the getDataset function, which stores the data and metadata in a SummarizedExperiment object.

The SummarizedExperiment object for each dataset includes 3 items. One is the gene-expression data, a numeric matrix with samples as columns and genes as rows. Ensembl gene identifiers are used as row names. Each data value is the expression level for its respective sample and gene. Second is the sample metadata, which often includes patient demographics and sample details. Each row represents a sample, and each column represents a different metadata variable. Third is feature metadata: information about the genes that were profiled for each sample. Each row represents a gene and provides information like the Entrez gene ID, gene symbol, and chromosome identifier.

GSE41197_SE = getDataset("GSE41197")
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17428997'!
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 25)
## ✔ Successfully fetched list of published records!
## Either the data was not found in the cache or a newer version of the data was identified. Downloading now.
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17428997'!
## Rows: 9593 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (5): Dataset_ID, HGNC_Symbol, Ensembl_Gene_ID, Chromosome, Gene_Biotype
## dbl (17): Entrez_Gene_ID, GSM1010328, GSM1010329, GSM1010330, GSM1010331, GS...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 16 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): Dataset_ID, Sample_ID, Platform_ID, Patient_ID, disease_state
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
expression_data = assay(GSE41197_SE)
sample_metadata = colData(GSE41197_SE)
feature_data = rowData(GSE41197_SE)

The colData and rowData functions of the SummarizedExperiment class return specialized data structures. These can be easily converted to data frames using as.data.frame. However, converting directly to the tibble type is undesirable because it removes the row names.

The getDataset function creates a cache (using BiocFileCache) on the user’s machine. All downloaded files are saved to this cache so that files do not need to be re-downloaded when the same dataset is retrieved. By default, the cache location is set using tempdir(). This directory is removed when the R session ends. Therefore, the user may specify an alternative cache directory if they wish to ensure the cached files persist between R sessions.

In some cases, multiple versions of a given dataset are available on Zenodo. The getDataset function allows the user to specify which version of the data to use. By default, the most recent version is downloaded. The v argument accepts an integer specifying the version. The following example illustrates retrieving a specific version of a dataset.

GSE41197_v2 = getDataset("GSE41197", v=2)
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17428997'!
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 25)
## ✔ Successfully fetched list of published records!
## Either the data was not found in the cache or a different version was requested. Downloading now.
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched published record for concept DOI '10.5281/zenodo.17428997'!
## ℹ Number of records: 4
## ℹ Successfully fetched list of published records - page 1 (size = 25)
## ✔ Successfully fetched list of published records!
## ℹ Number of records: 1
## ℹ Successfully fetched list of published records - page 1 (size = 10)
## ✔ Successfully fetched list of published records!
## ✔ Successfully fetched record for DOI '10.5281/zenodo.17429158'!
## Rows: 9593 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (5): Dataset_ID, HGNC_Symbol, Ensembl_Gene_ID, Chromosome, Gene_Biotype
## dbl (17): Entrez_Gene_ID, GSM1010328, GSM1010329, GSM1010330, GSM1010331, GS...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 16 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): Dataset_ID, Sample_ID, Platform_ID, Patient_ID, disease_state
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
v2_expression_data = assay(GSE41197_v2)
v2_metadata = colData(GSE41197_v2)
v2_feature_data = rowData(GSE41197_v2)

If the specified version does not exist or needs to be downloaded, a message will be displayed.

The getDataset function accepts a single dataset identifier. At the time of package development, the Zenodo API limits the number of requests per minute to 30. Because of this, we have included logic in our package to pause and retry if a request to download a dataset fails. These API limits should not cause problems in most cases, but users should be aware that this may cause delays when downloading large amounts of data.

4.1 Example 1: Plotting gene-expression values

In this example, we demonstrate simple data exploration. We first access the gene-expression matrix. To make the data easier to plot, we then convert the matrix into a tibble and pivot the data. We use ggplot2 functions to create a box plot showing the gene expression levels across the samples for 10 genes. (The values are centered at zero, which explains why some values are negative.)

# Set the random seed so the graph is reproducible.
set.seed(0)

as_tibble(expression_data, rownames='Ensembl_Gene_ID') %>%
  head(n = 10) %>%
  pivot_longer(-Ensembl_Gene_ID, names_to = "Sample_ID", values_to = "Expression_Level") %>%
  ggplot(aes(x=Ensembl_Gene_ID, y=Expression_Level)) +
  geom_boxplot() +
  geom_jitter(color = "red", alpha = 0.2) +
  labs(x='Ensembl Gene ID', y='Expression level') +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4.2 Example 2: Using metadata to create a heatmap

In this example, we download the data and metadata with the eventual goal of creating a heatmap of gene-expression levels. First, we load two datasets, convert them into tibbles, and identify samples from stromal cells. Then we limit the SummarizedExperiment data to those samples.

# Retrieve the data from Zenodo
GSE10797_SE <- getDataset("GSE10797")
GSE59772_SE <- getDataset("GSE59772")

# Retrieve stroma sample IDs as vectors
GSE10797_stroma_sample_ids = colData(GSE10797_SE) %>%
  as_tibble(rownames = 'Sample_ID') %>%
  filter(tissue_source == 'stromal cells from breast cancer patient') %>%
  pull(Sample_ID)

GSE59772_stroma_sample_ids = colData(GSE59772_SE) %>%
  as_tibble(rownames = 'Sample_ID') %>%
  filter(tissue == 'Stroma') %>%
  pull(Sample_ID)

GSE10797_SE = GSE10797_SE[,colnames(GSE10797_SE) %in% GSE10797_stroma_sample_ids]
GSE59772_SE = GSE59772_SE[,colnames(GSE59772_SE) %in% GSE59772_stroma_sample_ids]

Now we filter the data to genes that both data sets have in common.

common_genes <- intersect(rownames(GSE10797_SE), rownames(GSE59772_SE))

GSE10797_SE <- GSE10797_SE[common_genes,]
GSE59772_SE <- GSE59772_SE[common_genes,]
# This also has the benefit that the genes will be in the same order in both.

Finally, we create a tibble with the data from both sources. After doing so, we calculate the variance across the 5 samples. We then select the 10 genes with the highest variance and create a heatmap of those. Please note that this analysis is simplified. In an actual data analysis, we would need to account for systematic biases due to combining datasets from disparate sources.

combined_data = cbind(assay(GSE10797_SE), assay(GSE59772_SE)) %>%
  as_tibble(rownames = 'Gene_ID') %>%
  pivot_longer(-Gene_ID, names_to = 'Sample_ID', values_to = 'Expression_Level')

top_variance_genes = group_by(combined_data, Gene_ID) %>%
  summarize(Variance = var(Expression_Level)) %>%
  arrange(desc(Variance)) %>%
  head(n = 10) %>%
  pull(Gene_ID)

combined_data = filter(combined_data, Gene_ID %in% top_variance_genes)

ggplot(combined_data, aes(x = Sample_ID, y = Gene_ID, fill = Expression_Level)) +
  geom_tile() +
  labs(x = 'Sample', y = 'Gene', fill = 'Expression level', 
       title = 'Genes in stromal samples with highest variance') + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        plot.title = element_text(hjust = 0.5, face = "bold"))

4.3 Example 3: Using Feature Data

In this final example, we use gene annotations to identify which genes to plot. We filter each dataset to only include lncRNA genes on either chromosome 1 or 2. Because we previously limited the genes to those present in both datasets, we only need to use one of the datasets to identify the relevant genes.

gene_ids = rowData(GSE10797_SE) %>%
  as_tibble(rownames = 'Ensembl_Gene_ID') %>%
  filter(Chromosome %in% 1:2 & Gene_Biotype == "lncRNA") %>%
  pull(Ensembl_Gene_ID)

GSE10797_SE <- GSE10797_SE[gene_ids,]
GSE59772_SE <- GSE59772_SE[gene_ids,]

Next, we extract the expression values, put them into tibbles, and pivot them.

gse10797_data = assay(GSE10797_SE) %>%
  as_tibble(rownames = 'Gene_ID') %>%
  pivot_longer(-Gene_ID, names_to = 'Sample_ID', values_to = 'Expression_Level') %>%
  mutate(Dataset_ID = "GSE10797")

gse59772_data = assay(GSE59772_SE) %>%
  as_tibble(rownames = 'Gene_ID') %>%
  pivot_longer(-Gene_ID, names_to = 'Sample_ID', values_to = 'Expression_Level') %>%
  mutate(Dataset_ID = "GSE59772")

combined_data = bind_rows(gse10797_data, gse59772_data)

Finally, we combine the data sets using bind_rows() and read it into a ggplot.

ggplot(combined_data, aes(x = Gene_ID, y = Expression_Level, fill = Dataset_ID)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = 'Gene', y='Expression Level', 
       title = 'lncRNA expression levels on chromosomes 1 & 2') +
  coord_flip()

5 Session Info

utils::sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.42.0 Biobase_2.72.0             
##  [3] GenomicRanges_1.64.0        Seqinfo_1.2.0              
##  [5] IRanges_2.46.0              S4Vectors_0.50.0           
##  [7] BiocGenerics_0.58.0         generics_0.1.4             
##  [9] MatrixGenerics_1.24.0       matrixStats_1.5.0          
## [11] zen4R_0.10.5                tidyr_1.3.2                
## [13] tibble_3.3.1                dplyr_1.2.1                
## [15] ggplot2_4.0.3               AnnotatedBCGEData_0.99.3   
## [17] BiocStyle_2.40.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    farver_2.1.2        blob_1.3.0         
##  [4] filelock_1.0.3      S7_0.2.2            fastmap_1.2.0      
##  [7] BiocFileCache_3.2.0 XML_3.99-0.23       digest_0.6.39      
## [10] lifecycle_1.0.5     RSQLite_2.4.6       magrittr_2.0.5     
## [13] compiler_4.6.0      rlang_1.2.0         sass_0.4.10        
## [16] tools_4.6.0         utf8_1.2.6          yaml_2.3.12        
## [19] knitr_1.51          labeling_0.4.3      S4Arrays_1.12.0    
## [22] bit_4.6.0           curl_7.1.0          DelayedArray_0.38.1
## [25] plyr_1.8.9          xml2_1.5.2          RColorBrewer_1.1-3 
## [28] abind_1.4-8         withr_3.0.2         purrr_1.2.2        
## [31] grid_4.6.0          scales_1.4.0        tinytex_0.59       
## [34] dichromat_2.0-0.1   cli_3.6.6           rmarkdown_2.31     
## [37] crayon_1.5.3        otel_0.2.0          httr_1.4.8         
## [40] tzdb_0.5.0          DBI_1.3.0           cachem_1.1.0       
## [43] stringr_1.6.0       parallel_4.6.0      BiocManager_1.30.27
## [46] XVector_0.52.0      vctrs_0.7.3         Matrix_1.7-5       
## [49] jsonlite_2.0.0      bookdown_0.46       hms_1.1.4          
## [52] bit64_4.8.0         magick_2.9.1        jquerylib_0.1.4    
## [55] keyring_1.4.1       glue_1.8.1          stringi_1.8.7      
## [58] gtable_0.3.6        pillar_1.11.1       rappdirs_0.3.4     
## [61] htmltools_0.5.9     R6_2.6.1            dbplyr_2.5.2       
## [64] httr2_1.2.2         vroom_1.7.1         evaluate_1.0.5     
## [67] lattice_0.22-9      readr_2.2.0         memoise_2.0.1      
## [70] bslib_0.10.0        Rcpp_1.1.1-1.1      SparseArray_1.12.2 
## [73] xfun_0.57           pkgconfig_2.0.3