--- title: "BloodCancerMultiOmics2017 - data overview" author: "Małgorzata Oleś" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{BloodCancerMultiOmics2017 - data overview} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Prerequisites ```{r loadlib, message=FALSE} library("BloodCancerMultiOmics2017") # additional library("Biobase") library("SummarizedExperiment") library("DESeq2") library("reshape2") library("ggplot2") library("dplyr") library("BiocStyle") ``` # Introduction Primary tumor samples from blood cancer patients underwent functional and molecular characterization. `r Biocpkg("BloodCancerMultiOmics2017")` includes the resulting preprocessed data. A quick overview of the available data is provided below. For the details on experimental settings please refer to: S Dietrich\*, M Oleś\*, J Lu\* et al. *Drug-perturbation-based stratification of blood cancer*
*J. Clin. Invest.* (2018); 128(1):427–445. doi:10.1172/JCI93801. \* equal contribution # Data overview Load all of the available data. ```{r} data("conctab", "drpar", "lpdAll", "patmeta", "day23rep", "drugs", "methData", "validateExp", "dds", "exprTreat", "mutCOM", "cytokineViab") ``` The data sets are objects of different classes (`data.frame`, `ExpressionSet`, `NChannelSet`, `RangedSummarizedExperiment`, `DESeqDataSet`), and include data for either all studied patient samples or only a subset of these. The overview below shortly describes and summarizes the data available. Please note that the presence of a given patient sample ID within the data set doesn't necessarily mean that the data is available for this sample (the slot could be filled with NAs). Patient samples per data set. ```{r numberOfSamples} samplesPerData = list( drpar = colnames(drpar), lpdAll = colnames(lpdAll), day23rep = colnames(day23rep), methData = colnames(methData), patmeta = rownames(patmeta), validateExp = unique(validateExp$patientID), dds = colData(dds)$PatID, exprTreat = unique(pData(exprTreat)$PatientID), mutCOM = rownames(mutCOM), cytokineViab = unique(cytokineViab$Patient) ) ``` List of all samples present in data sets. ```{r} (samples = sort(unique(unlist(samplesPerData)))) ``` Total number of samples. ```{r} length(samples) ``` A plot summarizing the presence of a given patient sample within each data set. ```{r sampleOverlap, fig.height=4, fig.width=8, echo=FALSE} plotTab = melt(samplesPerData, value.name="PatientID") plotTab$L1 = factor(plotTab$L1, levels=c("patmeta", "mutCOM", "lpdAll", "methData", "exprTreat", "dds", "cytokineViab", "day23rep", "validateExp", "drpar")) # order of the samples in the plot tmp = do.call(cbind, lapply(samplesPerData[c("drpar", "validateExp", "day23rep", "dds", "exprTreat", "methData", "cytokineViab")], function(x) { samples %in% x })) rownames(tmp) = samples ord = order(tmp[,1], tmp[,2], tmp[,3], tmp[,4], tmp[,5], tmp[,6], tmp[,7], decreasing=TRUE) ordSamples = rownames(tmp)[ord] plotTab$PatientID = factor(plotTab$PatientID, levels=ordSamples) ggplot(plotTab, aes(x=PatientID, y=L1)) + geom_tile(fill="lightseagreen") + scale_y_discrete(expand=c(0,0)) + ylab("Data objects") + xlab("Patient samples") + geom_vline(xintercept=seq(10, length(samples),10), color="grey") + geom_hline(yintercept=seq(0.5, length(levels(plotTab$L1)), 1), color="dimgrey") + theme(panel.grid=element_blank(), text=element_text(size=18), axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(color="gainsboro")) ``` The classification below stratifies data sets according to different types of experiments performed and included. Please refer to the manual for a more detailed information on the content of these data objects. ## Patient metadata Patient metadata is provided in the `patmeta` object. ```{r} # Number of patients per disease sort(table(patmeta$Diagnosis), decreasing=TRUE) # Number of samples from pretreated patients table(!patmeta$IC50beforeTreatment) # IGHV status of CLL patients table(patmeta[patmeta$Diagnosis=="CLL", "IGHV"]) ``` ## High-throughput drug screen data The viability measurements from the high-throughput drug screen are included in the `drpar` object. The metadata about the drugs and drug concentrations used can be found in `drugs` and `conctab` objects, respectively. The `drpar` object includes multiple channels, each of which consists of cells' viability data for a single drug concentration step. Channels `viaraw.1_5` and `viaraw.4_5` contain the mean viability score between multiple concentration steps as indicated at the end of the channel name. ```{r} channelNames(drpar) # show viability data for the first 5 patients and 7 drugs in their lowest conc. assayData(drpar)[["viaraw.1"]][1:7,1:5] ``` Drug metadata. ```{r} # number of drugs nrow(drugs) # type of information included in the object colnames(drugs) ``` Drug concentration steps (c1 - lowest, c5 - highest). ```{r} head(conctab) ``` The reproducibility of the screening platform was assessed by screening `r unname(ncol(day23rep))` patient samples in two replicates. The viability measurements are available for two time points: 48 h and 72 h after adding the drug. The screen was performed for `r length(unique(fData(day23rep)$DrugID))` drugs in 1-2 different drug concentrations (`r table(table(fData(day23rep)$DrugID))["1"]` in 1 and `r table(table(fData(day23rep)$DrugID))["2"]` in 2 drug concentrations). This data is provided in `day23rep`. ```{r} channelNames(day23rep) # show viability data for 48 h time point for all patients marked as # replicate 1 and 3 first drugs in all their conc. drugs2Show = unique(fData(day23rep)$DrugID)[1:3] assayData(day23rep)[["day2rep1"]][fData(day23rep)$DrugID %in% drugs2Show,] ``` The follow-up drug screen, which confirmed the targets and the signaling pathway dependence of the patient samples was performed for `r length(unique(validateExp$patientID))` samples and the following drugs: `r paste(unique(validateExp$Drug), collapse=", ")`. | Drug name | Target | |-------------|--------| | Cobimetinib | MEK | | Trametinib | MEK | | SCH772984 | ERK1/2 | | Ganetespib | Hsp90 | | Onalespib | Hsp90 | The data is included in the `validateExp` object. ```{r} head(validateExp) ``` Moreover, we also performed a small drug screen in order to check the influence of the different cytokines/chemokines on the viability of the samples. These data are included in `cytokineViab` object. ```{r} head(cytokineViab) ``` ## Gene mutation data The `mutCOM` object contains information on the presence of gene mutations in the studied patient samples. ```{r} # there is only one channel with the binary type of data for each gene channelNames(mutCOM) # the feature data includes detailed information about mutations in # TP53 and BRAF genes, as well as clone size of #del17p13, KRAS, UMODL1, CREBBP, PRPF8, trisomy12 mutations colnames(fData(mutCOM)) ``` ## Gene expression data RNA-Seq data preprocessed with `r Biocpkg("DESeq2")` is provided in the `dds` object. ```{r} # show count data for the first 5 patients and 7 genes assay(dds)[1:7,1:5] # show the above with patient sample ids assay(dds)[1:7,1:5] %>% `colnames<-` (colData(dds)$PatID[1:5]) # number of genes and patient samples nrow(dds); ncol(dds) ``` Additionally, `r length(unique(pData(exprTreat)$PatientID))` patient samples underwent gene expression profiling using Illumina microarrays before and 12 h after treatment with `r tmp=unique(pData(exprTreat)$DrugID); length(tmp[!is.na(tmp)])` drugs. These data are included in the `exprTreat` data object. ```{r} # patient samples included in the data set (p = unique(pData(exprTreat)$PatientID)) # type of metadata included for each gene colnames(fData(exprTreat)) # show expression level for the first patient and 3 first probes Biobase::exprs(exprTreat)[1:3, pData(exprTreat)$PatientID==p[1]] ``` ## DNA methylation data DNA methylation included in `methData` object contains data for `r ncol(methData)` patient samples and 5000 of the most variable CpG sites. ```{r} # show the methylation for the first 7 CpGs and the first 5 patient samples assay(methData)[1:7,1:5] # type of metadata included for CpGs colnames(rowData(methData)) # number of patient samples screened with the given platform type table(colData(methData)$platform) ``` ## Other Object `lpdAll` is a convenient assembly of data contained in the other data objects mentioned earlier in this vignette. For details, please refer to the manual. ```{r} # number of rows in the dataset for each type of data table(fData(lpdAll)$type) # show viability data for drug ibrutinib, idelalisib and dasatinib # (in the mean of the two lowest concentration steps) and # the first 5 patient samples Biobase::exprs(lpdAll)[which( with(fData(lpdAll), name %in% c("ibrutinib", "idelalisib", "dasatinib") & subtype=="4:5")), 1:5] ``` # Original data The raw data from the whole exome sequencing, RNA-seq and DNA methylation arrays is stored in the European Genome-Phenome Archive (EGA) under accession number EGAS0000100174. The preprocesed DNA methylation data, which include complete list of CpG sites (not only the 5000 with the highest variance) can be accessed through Bioconductor ExperimentHub platform. ```{r eval=FALSE} library("ExperimentHub") eh = ExperimentHub() obj = query(eh, "CLLmethylation") meth = obj[["EH1071"]] # extract the methylation data ``` # Session info ```{r} sessionInfo() ```