%\VignetteIndexEntry{Processed human microRNA-overexpression data from GEO, and sequence information from TargetScan, and targetScore from TargetScore} \documentclass[12pt]{article} \usepackage[left=1in,top=1in,right=1in, bottom=1in]{geometry} \usepackage{Sweave} \usepackage{times} \usepackage{hyperref} \usepackage{subfig} \usepackage{natbib} \usepackage{graphicx} \hypersetup{ colorlinks, citecolor=black, filecolor=black, linkcolor=black, urlcolor=black } \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\SD}{Supplementary Data} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\TopHat}{\software{TopHat}} \newcommand{\Bowtie}{\software{Bowtie}} \bibliographystyle{plainnat} \title{Processed human microRNA-overexpression data from GEO, and sequence information from TargetScan, and targetScore from TargetScore} \author{Yue Li \\ \texttt{yueli@cs.toronto.edu}} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \section{MicroRNA perturbation datasets} We collected 84 Gene Expression Omnibus (GEO) series corresponding to 6 platforms, 77 human cells or tissues, and 112 distinct miRNAs. To our knowledge, this is by far the largest miRNA-overexpression data compendium. To automate the data download and processing, we developed a pipeline written in R, making use of the function \texttt{getGEO} from \emph{GEOquery} R/Bioconductor package (\cite{Davis:2007fc}). For each dataset, the pipeline downloads the raw or processed data (if available) and calculates (when necessary) the log fold-change (logFC) in treatment (miRNA transfected) vs (mock) control, taking into account the unique properties of each data. Next, we combined all of the logFC data columns into a single $N \times M$ matrix for all of the $N = 19177$ RefSeq mRNAs (NM\_* obtained from UCSC) and $M = 286$ datasets. Missing data (logFC) for some genes across studies were imputed using \texttt{impute.knn} from \emph{impute} R package (\cite{Troyanskaya:2001uh}). For miRNA transfection data having multiple measurements (in different studies), we picked the one whose logFC correlate the most with the validated targets from mirTarBase \cite{Hsu:2011ed} or average them if no validated target available. <>= library(TargetScoreData) transfection_data <- get_miRNA_transfection_data()$transfection_data datasummary <- list( `MicroRNA` = table(names(transfection_data)), `GEO Series` = table(sapply(transfection_data, function(df) df$Series[1])), `Platform` = table(sapply(transfection_data, function(df) df$platform[1])), `Cell/Tissue` = table(sapply(transfection_data, function(df) df$cell[1]))) print(lapply(datasummary, length)) @ \section{TargetScan context score and PCT} TargetScan context score and PCT for all of the predicted sites (including conserved and nonconserved sites) downloaded from TargetScan website (\url{http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_61}) <>= targetScanCS <- get_TargetScanHuman_contextScore() targetScanPCT <- get_TargetScanHuman_PCT() head(targetScanCS) dim(targetScanCS) head(targetScanPCT) dim(targetScanPCT) @ \section{TargetScore} Encouraged by the superior performance of TargetScore (manuscript in peer-review), we applied TargetScore to all of the transfection data above. For further exploring miRNA targetome and their associations, we enclose the targetScores results in this package. <>= targetScoreMatrix <- get_precomputed_targetScores() head(names(targetScoreMatrix)) head(targetScoreMatrix[[1]]) @ We can reproduce targetScores using the above data as demonstrated in the following example (require \Rpackage{TargetScore} package). As a convenience function, we applied a wrapper function called \texttt{getTargetScores} that does the following: (1) given a miRNA ID, obtain fold-change(s) from logFC.imputed matrix or use the user-supplied fold-changes; (2) retrives TargetScan context score (CS) and PCT (if found); (3) obtain validated targets from the local mirTarBase file; (4) compute targetScore. We apply \texttt{getTargetScores} function using miRNA hsa-miR-1, which we know has all three types of data, namely logFC, targetScan context score, and PCT. <>= library(TargetScore) library(gplots) myTargetScores <- getTargetScores("hsa-miR-1", tol=1e-3, maxiter=200) table((myTargetScores$targetScore > 0.1), myTargetScores$validated) # a very lenient cutoff # obtain all of targetScore for all of the 112 miRNA logFC.imputed <- get_precomputed_logFC() mirIDs <- unique(colnames(logFC.imputed)) # takes time # targetScoreMatrix <- mclapply(mirIDs, getTargetScores) # names(targetScoreMatrix) <- mirIDs @ \section{Session Info} <>= sessionInfo() @ \bibliography{TargetScoreData} \end{document}