%\VignetteIndexEntry{pathprintGEOData}
\documentclass{article}
\usepackage{Sweave}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage[tableposition=top]{caption}
\usepackage{ifthen}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage[usenames]{color}
\definecolor{midnightblue}{rgb}{0.098,0.098,0.439}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\setkeys{Gin}{width=\textwidth}
\begin{document}
\SweaveOpts{concordance=TRUE}
\title{About pathprintGEOData}
\maketitle

\section{Description}
This package contains the data used by the pathprint package, including 
fingerprint and metadata data frames and chipframe. The fingerprint matrices 
contain pathway Fingerprint vectors that have been pre-calculated for ~390,000 publicly available arrays from the GEO corpus, spanning 6 species and 31 
platforms. All data are accompanied by their associated metadata. 

The data in this package were obtained using the method described by Altschuler 
et al. (2013, PMID: 23890051). The package  
\href{https://bioconductor.org/packages/release/bioc/html/GEOquery.html}
{GEOquery} was used to retrieve normalized expression tables for all of the 
experiments of each platform, all normalization methods were accepted. The 
expression data was mapped to Entrez Gene identifications using systematically 
updated annotations from \href{http://ailun.ucsf.edu/}{AILUN}(Array Information 
Library Universal Navigator). Multiple probes were merged to unique Entrez Gene 
IDs by taking the mean probe set intensity.  H. sapiens canonical pathway gene 
sets were compiled from Reactome, Wiki-pathways and KEGG (Kyoto Encyclopedia of 
Genes and Genomes). Static modules were constructed independently by decomposing
a network that extended curated pathways with non-curated sources of 
information, including protein-protein interactions, gene co-expression, protein
domain interaction, GO annotations and text-mined protein interactions. 
M. musculus, R. norvegicus, D. rerio, D. melanogaster, and C. elegans gene sets 
were inferred using homology based on the 
\href{https://www.ncbi.nlm.nih.gov/homologene}{HomoloGene} database. Pathway 
expression scores were calculated for each pathway in each array based on the 
mean squared ranked expression of the member genes. The full set of GEO 
experiments was used to calculate a static pathway expression background 
distribution for each pathway across each platform. A signed probability of 
expression (POE) was calculated based on a two-component uniform-normal mixture 
model, representing the probability that a pathway expression score has 
significant low (negative) or high (positive) expression. POE values were 
converted to a ternary score (-1,0,1) by application of a symmetric threshold to
produce the final pathprint matrix.

\section{Using pathprintGEOData with pathprint package}

The data in this package are primarily used by the pathprint package. For the 
following examples to work, the pathprint package needs to be installed. For 
further explanations of some of the functions mentioned in the examples please 
refer to pathprint. Furthermore, the SummarizedExperiment package is required to
extract the two matrices from the SummarizedExperiment object.

<<>>=
# use the pathprint library
library(pathprint)
library(SummarizedExperiment)
library(pathprintGEOData)

# load  the data
data(SummarizedExperimentGEO)

ds = c("chipframe", "genesets","pathprint.Hs.gs","platform.thresholds",
    "pluripotents.frame")
data(list = ds)

# see available platforms
names(chipframe)

# extract GEO.fingerprint.matrix and GEO.metadata.matrix
GEO.fingerprint.matrix = assays(geo_sum_data)$fingerprint
GEO.metadata.matrix = colData(geo_sum_data)

# create consensus fingerprint for pluripotent samples
pluripotent.consensus<-consensusFingerprint(
    GEO.fingerprint.matrix[,pluripotents.frame$GSM],
    threshold=0.9)

# calculate distance from the pluripotent consensus
geo.pluripotentDistance<-consensusDistance(
    pluripotent.consensus, GEO.fingerprint.matrix)

# plot histograms
par(mfcol = c(2,1), mar = c(0, 4, 4, 2))

geo.pluripotentDistance.hist<-hist(
    geo.pluripotentDistance[,"distance"],
    nclass = 50, xlim = c(0,1), 
    main = "Distance from pluripotent consensus")

par(mar = c(7, 4, 4, 2))

hist(geo.pluripotentDistance[
    pluripotents.frame$GSM, "distance"],
    breaks = geo.pluripotentDistance.hist$breaks, 
    xlim = c(0,1), 
    main = "", 
    xlab = "above: all GEO, below: pluripotent samples")

# annotate top 100 matches not in original seed with metadata
geo.pluripotentDistance.noSeed<-geo.pluripotentDistance[
    !(rownames(geo.pluripotentDistance)
    %in% 
    pluripotents.frame$GSM),
    ]

top.noSeed.meta<-GEO.metadata.matrix[
    match(
    head(rownames(geo.pluripotentDistance.noSeed), 100),
                            rownames(GEO.metadata.matrix)),
    ]
print(top.noSeed.meta[, c(1:4)])
@
\end{document}