% \VignetteIndexEntry{An R data package for sequence}
\documentclass{article}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{times}
\usepackage{titlesec}
\usepackage{sidecap}
\usepackage{float}
\usepackage{rotating}

%%%add more package for tables
\usepackage[table]{xcolor}
\usepackage[margin=1in]{geometry}
\usepackage{tabularx}
\usepackage{enumitem}


%%%

\textwidth=6.3in
\textheight=8in
%\parskip=.3cm
\oddsidemargin=.2in
\evensidemargin=.2in
\headheight=.5in

\usepackage{titlesec}

\setcounter{secnumdepth}{4}

\titleformat{\paragraph}
{\normalfont\normalsize\bfseries}{\theparagraph}{1em}{}
\titlespacing*{\paragraph}
{0pt}{3.25ex plus 1ex minus .2ex}{1.5ex plus .2ex}

%------------------------------------------------------------
% newcommand
%------------------------------------------------------------
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rexpression}[1]{\texttt{#1}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

%%%
\renewcommand{\familydefault}{\sfdefault}
\renewcommand{\arraystretch}{1.5}






%%%

\begin{document}
\SweaveOpts{concordance=TRUE}

%------------------------------------------------------------
\title{seq2pathway.data Vignette}
%------------------------------------------------------------
\author{Bin Wang}
%\date{}

\SweaveOpts{highlight=TRUE, tidy=TRUE, keep.space=TRUE, keep.blank.space=FALSE, keep.comment=TRUE}
\SweaveOpts{prefix.string=Fig}


\maketitle
\small\tableofcontents


%-------------------------------------------
\section{Abstract}
%-------------------------------------------

  \texttt{Seq2pathway.data} is supporting data for the \texttt{seq2pathway} package. The package includes pre-defined gene sets which are constructed from \texttt{R} package \texttt{org.Hs.eg.db}\cite{Carlson} for GeneOntology and the Molecular Signatures Database (MsigDB)\cite{Subramanian} for other functional gene sets. The gene locus definitions in the package is built from the \texttt{GENCODE project}\cite{Harrow}, including \texttt{GENCODE} \texttt{20} (hg38), \texttt{GENCODE} \texttt{19} (hg19), \texttt{GENCODE} \texttt{mmvM4} (mm10) and \texttt{GENCODE} \texttt{mmvM1} (mm9) currently.
  
  
%-------------------------------------------
\section{Data}
%-------------------------------------------

We grouped all data to nine categories. The data 2.1 to 2.5 is aimed to internal funtion use only. The data 2.6 to 2.9 is demo data for the \texttt{seq2pathway} package.
%------------------------------------
\subsection{GO\_BP\_list; GO\_MF\_list; GO\_CC\_list}
%-------------------------------------
These data contains all gene symbol lists extracted from an R object \texttt{org.Hs.egGO2EG}. 
Note that \texttt{org.Hs.egGO2EG} ``provides mappings between entrez gene identifiers and the GO identifiers that they are directly associated with. This mapping and its reverse mapping do NOT associate the child terms from the GO ontology with the gene. Only the directly evidenced terms are represented here.''\cite{Carlson}


The list \texttt{GO\_BP\_list} includes 9407 GO biological process terms, and the names of list stand for the names of GO terms. The corresponding element of each list is the gene symbol of each term. Similarly, the list \texttt{GO\_MF\_list} includes 3529 GO molecular function terms, and the list \texttt{GO\_CC\_list} includes 1198 GO cell component terms.

Data in this category will mainly be invoked internally by the functions in the \texttt{seq2pathway} package.

%------------------------------------
\subsection{Des\_BP\_list; Des\_MF\_list; Des\_CC\_list}
%-------------------------------------
These three lists include the description information of each GO terms. The name of each list stands for the name of GO terms. The corresponding element of each list is the description of each term. The desctiption is extracted from R object GO.db\cite{CarlsonGO}. The length of the list \texttt{Des\_BP\_list} is equal to the length of the list \texttt{GO\_BP\_list}. The similiar pattern applies to the other two lists.

Data in this category will mainly be invoked internally by the functions in the \texttt{seq2pathway} package.

%------------------------------------
\subsection{GO\_GENCODE\_df\_hg\_v19; GO\_GENCODE\_df\_hg\_v20; GO\_GENCODE\_df\_mm\_vM1; GO\_GENCODE\_df\_mm\_vM4}
%-------------------------------------
The gene set varies from different gene databese sources. The gene symbols of GO terms are from \texttt{org.Hs.egGO2EG}\cite{Carlson}, the gene set in our annoation package is from the \texttt{GENCODE}\cite{Harrow} data sets, which attribute gene set by differenct organisms and assembly versions. These data frames record the common genes from different databases. For example, the \texttt{GO\_GENCODE\_df\_hg\_v19} data frame records the common genes from \texttt{org.Hs.egGO2EG} and \texttt{GENCODE} human species \texttt{19} version. In summary, there are 17998 genes in \texttt{GO\_GENCODE\_df\_hg\_v19}, 18015 genes in \texttt{GO\_GENCODE\_df\_hg\_v20}, 14328 genes in \texttt{GO\_GENCODE\_df\_mm\_vM1}, and 15075 genes in \texttt{GO\_GENCODE\_df\_mm\_vM4} respectively. 

Data in this category will mainly be invoked internally by the functions in the \texttt{seq2pathway} package.


%------------------------------------
\subsection{Msig\_GENCODE\_df\_hg\_v19; Msig\_GENCODE\_df\_hg\_v20; Msig\_GENCODE\_df\_mm\_vM1; Msig\_GENCODE\_df\_mm\_vM4}
%-------------------------------------
These four data frames record the common genes from Molecular Signatures Database (\texttt{MsigDB})\cite{Subramanian} and the \texttt{GENCODE}\cite{Harrow} data sets. \texttt{MsigDB} is a collection of annotated gene sets. There are 22751 genes in \texttt{Msig\_GENCODE\_df\_hg\_v19}, 22721 genes in \texttt{Msig\_GENCODE\_df\_hg\_v20}, 15420 genes in \texttt{Msig\_GENCODE\_df\_mm\_vM1}, and 15528 genes in \texttt{Msig\_GENCODE\_df\_mm\_vM4}. 

Data in this category will mainly be invoked internally by the functions in the \texttt{seq2pathway} package.


%------------------------------------
\subsection{gencode\_coding}
%-------------------------------------
\texttt{gencode\_coding} is an R vector which collects all protein coding gene symbols from \texttt{GENCODE}\cite{Harrow} human version20. There are 19810 unique coding genes symbols in \texttt{gencode\_coding} object.

Data \texttt{gencode\_coding} will mainly be invoked internally by the functions in the \texttt{seq2pathway} package.

%------------------------------------
\subsection{MsigDB\_C5}
%------------------------------------
\texttt{MsigDB\_C5} is a gene-sets collection from \texttt{MsigDB}\cite{Subramanian} in GSA.genesets format, which could be open with \texttt{R} package \texttt{GSA}\cite{Efron}. \texttt{MsigDB\_C5} consists of genes annotated by the GO terms. As an R object, \texttt{MsigDB\_C5} is a list of 3 elements. The first element includes multiple sub lists. Each sub list is a gene list for one gene set. The second element records the names of genesets, and the last element is the descriptions of genesets. \texttt{Seq2pathway.data} is a supporting data package for the \texttt{seq2pathway} package. \texttt{MsigDB\_C5} is used as demo data for functions in the \texttt{seq2pathway} package. More details could be found in the vignette of \texttt{seq2pathway} package.
      \begin{verbatim}
> data(MsigDB_C5,package="seq2pathway.data")
> class(MsigDB_C5)     
[1] "GSA.genesets"
> names(MsigDB_C5)
[1] "genesets"             "geneset.names"        "geneset.descriptions"
      \end{verbatim} 
  
      
%------------------------------------
\subsection{gene\_description}
%------------------------------------
\texttt{gene\_description} is a data frame with two columns. The data give the gene descrption based on the \text{biomaRt}\cite{Durinck} package. \texttt{gene\_description} is a demo data for the \texttt{seq2pathway} package. More details could be found at the Vignette (\textbf{5.1.4 Add description for genes}) of the \texttt{seq2pathway} package.

      \begin{verbatim}
> data(gene_description,package="seq2pathway.data")
> head(gene_description)
          \end{verbatim}

      \begin{table}[H]
        \centering
         \small
          \resizebox{\textwidth}{!}{%
             \begin{tabular*}{\textwidth}{rr}
                \hline
                 hgnc\_symbol & description \\ 
                \hline
                ABCD4 & ATP-binding cassette, sub-family D (ALD),  member 4  [Source:HGNC Symbol;Acc:68 ] \\
                ABHD12B & abhydrolase domain containing 12B [Source:HGNC Symbol;Acc:19837] \\
                ABHD4 & abhydrolase domain containing 4 [Source:HGNC Symbol;Acc:20154] \\
                ACIN1 & apoptotic chromatin condensation inducer 1 [Source:HGNC Symbol;Acc:17066] \\
                ACOT1 & acyl-CoA thioesterase 1 [Source:HGNC Symbol;Acc:33128] \\
                ACOT2 & acyl-CoA thioesterase 2 [Source:HGNC Symbol;Acc:18431] \\
                \hline
            \end{tabular*}}
      %\label{}
      %\caption{add description}
      \end{table}   
      
%------------------------------------
\subsection{dat\_gene2path\_chip; dat\_gene2path\_RNA}
%------------------------------------
\texttt{dat\_gene2path\_chip} and \texttt{dat\_gene2path\_RNA} are demo data for functions in the \texttt{seq2pathway} package. Each of them is an R list object with 2 elements. The first element \texttt{gene2pathway\_result.2} is a list of \texttt{gene2pathway} test result, and the second element \texttt{gene2pathway\_result.FET} is a list of Fisher's exact test result. More details and usage could be found at \texttt{Examples} section in the vignette of \texttt{seq2pathway} package.

Small code for checking the data \texttt{dat\_gene2path\_chip} is provided below:
    \begin{verbatim}
> data(dat_gene2path_chip,package="seq2pathway.data")
> names(dat_gene2path_chip)
[1] "gene2pathway_result.2"   "gene2pathway_result.FET"
> class(dat_gene2path_chip$gene2pathway_result.2)
[1] "list"
> names(dat_gene2path_chip$gene2pathway_result.2)
[1] "GO_BP" "GO_CC" "GO_MF"
> head(dat_gene2path_chip$gene2pathway_result.2$GO_BP)
    \end{verbatim}
      
      \begin{table}[H]
        \centering
         \tiny
        \resizebox{\textwidth}{!}{%
             \begin{tabular*}{\textwidth}{l p{6cm}rrrp{5cm}}
    \hline
     & Des & peakscore & peakscore & Intersect & Intersect  \\ 
     &     & pathscore & pathscore & \_Count   & \_gene \\
     &     & \_Normalized & \_Pvalue  \\
    \hline
    GO:0000082 & 
    The mitotic cell cycle transition by which a cell in G1 commits to S phase. The process begins with the build up of G1 cyclin-dependent kinase (G1 CDK), resulting in the activation of transcription of G1 cyclins. The process ends with the positive feedback of the G1 cyclins on the G1 CDK which commits the cell to S phase, in which DNA replication is initiated.
    & 0.32017745 & 0.12 & 11 &
    CDKN3 GPR132 MNAT1 POLE2 PSMA3 PSMA6 PSMB5 PSMC1 PSMC6 PSME1 PSME2
    \\
    GO:0000086 &
    The mitotic cell cycle transition by which a cell in G2 commits to M phase. The process begins when the kinase activity of M cyclin/CDK complex reaches a threshold high enough for the cell cycle to proceed. This is accomplished by activating a positive feedback loop that results in the accumulation of unphosphorylated and active M cyclin/CDK complex.
    & -0.33586010 & 0.49 & 5 &
    AJUBA DYNC1H1 HSP90AA1 LIN52 MNAT1 \\
    GO:0000122 & 
    Any process that stops, prevents, or reduces the frequency, rate or extent of transcription from an RNA polymerase II promoter.
    & -0.11535853 & 0.16 & 20 &
    AJUBA BMP4 DACT1 DICER1 ESR2 FOXA1 GSC JDP2 NKX2-1 PPM1A PRMT5 PSEN1 RCOR1 SALL2 SIX1 SNW1 STRN3 YY1 ZBTB1 ZBTB42 \\
    GO:0000209 & 
    Addition of multiple ubiquitin groups to a protein, forming a ubiquitin chain.
    & 0.17070465 & 0.11 & 11 &
    ASB2 G2E3 PSMA3 PSMA6 PSMB11 PSMB5 PSMC1 PSMC6 PSME1 PSME2 RNF31 \\
    GO:0000278 & 
    Progression through the phases of the mitotic cell cycle, the most common eukaryotic cell cycle, which canonically comprises four successive phases called G1, S, G2, and M and includes replication of the genome and the subsequent segregation of chromosomes into daughter cells. In some variant cell cycles nuclear replication or nuclear division may not be followed by cell division, or G1 and G2 phases may be absent.
    & 0.06368249 & 0.04 & 16 &
    AJUBA DYNC1H1 HSP90AA1 LIN52 MNAT1 NEK9 POLE2 PSMA3 PSMA6 PSMB11 PSMB5 PSMC1 PSMC6 PSME1 PSME2 VRK1    \\
    GO:0000398 & 
    The joining together of exons from one or more primary transcripts of messenger RNA (mRNA) and the excision of intron sequences, via a spliceosomal mechanism, so that mRNA consisting only of the joined exons is produced.
    &  -0.55767621 & 0.59 & 8 & 
    CPSF2 HNRNPC NOVA1 PABPN1 PAPOLA PNN SNW1 SRSF5 \\
    
    \hline
    \end{tabular*}}
          %\label{table:GO BP of gene2pathway result.2}
      %\caption{GO BP of gene2pathway result.2}
      \end{table} 
%------------------------------------
\subsection{dat\_seq2pathway\_GOterms; dat\_seq2pathway\_Msig}
%------------------------------------
\texttt{dat\_seq2pathway\_GOterms} and \texttt{dat\_seq2pathway\_Msig} are demo data for functions in the \texttt{seq2pathway} package. Each of them is an R list object with 3 elements. The first element \texttt{seq2gene\_result} is a list with annotaton tables. The seconde element \texttt{gene2pathway\_result.FAIME} is a list of \texttt{gene2pathway} \texttt{FAIME}\cite{yang12} test result. And the third element \texttt{gene2pathway\_result.FET} is a list of Fisher's exact test results. More details and usage could be found in the \texttt{Examples} section in the vignette of the \texttt{seq2pathway} package.

    \begin{verbatim}
> data(dat_seq2pathway_Msig,package="seq2pathway.data")
> names(dat_seq2pathway_Msig)
[1] "seq2gene_result"   "gene2pathway_result.FAIME"    "gene2pathway_result.FET"  
> class(dat_seq2pathway_Msig$seq2gene_result)
[1] "list"
> names(dat_seq2pathway_Msig$seq2gene_result)
[1] "seq2gene_FullResult"           "seq2gene_CodingGeneOnlyResult"
> class(dat_seq2pathway_Msig$gene2pathway_result.FAIME)
[1] "data.frame"
> class(dat_seq2pathway_Msig$gene2pathway_result.FET)
[1] "data.frame"
    \end{verbatim}

\begin{thebibliography}{50}
    \bibitem{Carlson} Carlson M., \emph{org.Hs.eg.db: Genome wide annotation for Human.}, 
    R package version 3.0.0.
    
    \bibitem{Subramanian} Subramanian, Tamayo, et al. ,
    \emph{Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles}, 
    PNAS \textbf{102} (2005), 1545--1550.
    
    \bibitem{Harrow} Harrow J, et al.,
    \emph{GENCODE: The reference human genome annotation for The ENCODE Project}, 
    Genome Res. \textbf{9} (2012), 1760--1774.
    
    \bibitem{CarlsonGO} Carlson M., \emph{GO.db: A set of annotation maps describing the entire Gene Ontology.}, 
    R package version 3.0.0.

    \bibitem{Efron} Efron, B. and Tibshirani, \emph{R. On testing the significance of sets of genes.}, 
    Stanford tech report rep {2006}.

    \bibitem{Durinck} Durinck S, Spellman P, Birney E and Huber W,
    \emph{Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt}, 
    Nature Protocols \textbf{4} (2009), 1184--1192.
    
    \bibitem{yang12} X. Yang, K. Regan, Y. Huang, Q. Zhang, J. Li, T. Y. Seiwert, et al.,
    \emph{Single sample expression-anchored mechanisms predict survival in head and neck cancer}, 
    PLoS Comput Biol \textbf{8} (2012), e1002350.
    
\end{thebibliography}



\end{document}