%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%

%\VignetteIndexEntry{cgdv17: extract from Complete Genomics diversity panel}
%\VignetteDepends{GGtools,TxDb.Hsapiens.UCSC.hg19.knownGene,parallel}
%\VignetteKeywords{genetic sequencing}
%\VignettePackage{cgdv17}

\documentclass[12pt]{article}
\usepackage{auto-pst-pdf}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Exploring the Complete Genomics Diversity panel}
\author{VJ Carey}

\maketitle
\tableofcontents

\clearpage

\section{Introduction}

Complete Genomics Inc. distributes a collection of
data on deeply sequenced genomes (from
Coriell cell lines) from 11 different
human populations.

<<lkd>>=
library(cgdv17)
data(popvec)
popvec[1:5]
table(popvec)
@

The data are distributed with many details; VJC obtained
the masterVar TSV files from the Complete Genomics ftp2 site,
converted these to VCF 4.0 in Oct. 2011, using a tool
noted at

\begin{verbatim}
http://community.completegenomics.com/tools/m/cgtools/219.aspx
\end{verbatim}

The conversion tool used was released with various caveats.  Perhaps
the whole conversion should be redone with official tools.

The purpose of this note is to explore some basic structural features
of the data, so that relevant genetic structures can be identified
for analytic programming and interpretation.  We focus on variant calls
on chromosome 17.

Formal restrictions on publications related to these data
are as follows.
\begin{verbatim}
1. The Coriell and ATCC Repository number(s) of the
cell line(s) or the DNA sample(s) must be cited in
publication or presentations that are based on the
use of these materials.
2. You must reference our Science paper (R. Drmanac,
et. al. Science 327(5961), 78. [DOI: 10.1126/
science.1181498])
3. You must provide the version number of the
Complete Genomics assembly software with which
the data was generated. This can be found in the
header of the summary.tsv file (\# Software_Version).
\end{verbatim}

\clearpage

\section{Contents of a VCF header}

There is a lot of redundancy among the headers for the 46 files,
so one was isolated for distribution.
<<lkh>>=
data(h1)
h1
h1[[1]]$Sample
h1[[1]]$Header$META
h1[[1]]$Header$INFO
h1[[1]]$Header$FORMAT
@

\clearpage

\section{Variant calls for chromosome 17}

\subsection{Recording structural variation for an individual}

We created a provisional container for the call data on chromosome
17.  Tabix facilities were used to filter and index the data from the
full VCF to all of chromosome 17.

At present it is not clear how to model a collection of deeply sequenced chromosomes.
I have used VariantAnnotation:::readVcf, which must
be applied separately for each individual, given the
Complete Genomics distribution.
The focus is on structural information in the rowRanges component
of the VCF object returned by \texttt{readVcf}, which is a GRanges instance.
From the elementMetadata
I removed FILTER and added \verb+geno()$GT+
information.  This gives us information to specific
variants and phase for some variants, depending on the
string content of the GT information.

The \texttt{getRVS} function will collect file references for
the serialized GRanges.
<<lkd>>=
rv = getRVS("cgdv17")
rv
@

Data on one individual can be extracted using \texttt{getrd()}.
We will confine attention to variants with quality score in the
top quartile of its distribution for this individual.
<<lkd1>>=
R85 = getrd(rv, "NA06985")
length(R85)
summary(elementMetadata(R85)$QUAL)
kp = which(elementMetadata(R85)$QUAL >= 166)
R85hiq = R85[kp]
@

A small excerpt gives us a sense of the sorts of variation
to be encountered:
<<lkty>>=
elementMetadata(R85hiq)[11:20,]
refs = elementMetadata(R85hiq)$REF
alts = elementMetadata(R85hiq)$ALT
genos = elementMetadata(R85hiq)$geno
table(nchar(refs))
alts[grep(",",unlist(alts))]
@

Summary: references are recorded as DNAStrings, alternatives are compressed
character strings with commas, and the phasing of the individual-level calls can be
derived by parsing the \texttt{geno} component.

\subsection{Isolating variants in the vicinity of a gene, for an individual}

We are interested in gene ORMDL3.  We will use the hg19 transcriptDb to
obtain the locations and tabulate higher quality variants observed 100kb up and
downstream of the transcript.

<<getlocs>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx19 = TxDb.Hsapiens.UCSC.hg19.knownGene
library(org.Hs.eg.db)
get("ORMDL3", revmap(org.Hs.egSYMBOL))
ortx = transcriptsBy(tx19, "gene")$"94103"
seqlevels(R85hiq) = "chr17"
aro = subsetByOverlaps(R85hiq, ortx+100000)
table(elementMetadata(aro)$geno)
alts = unlist(elementMetadata(aro)$ALT)
alts[nchar(alts)>1]
@

There are deletion and insertion events, but I don't see any simple
way of isolating and counting them at the moment.  Some code will
be added to address this.

We can use VariantAnnotation to obtain structural contexts.
It takes over a minute to use locateVariants, so I just show the
code and results here for now.

\begin{verbatim}
mycache = new.env(hash=TRUE)
lvaro = locateVariants(aro, tx19, cache=mycache)
lvaro[1:4,]
table(lvaro$loca,sapply((lvaro$geneID), function(x)strsplit(x, ",")[[1]][1]))
DataFrame with 4 rows and 5 columns
    queryID location      txID                    geneID     cdsID
  <integer> <factor> <integer> <CompressedCharacterList> <integer>
1         1   intron     60959                     22806    189661
2         1   intron     60960                     22806    189661
3         1   intron     60961                     22806    189661
4         1   intron     60962                     22806    189661
            
             124626 1440 22806 284110 2886 55876 5709 94103 9862
  3'UTR           0    0     0      0    0     0    0     0   20
  5'UTR           0    0     0      1    0     8    0     0   10
  coding          0    0     0      6    0     0    0     0   10
  intergenic     34    1    10      4    1     2   25    10    0
  intron         28    0   105     42    0   138   99     6    0
\end{verbatim}

We see that this search for variants near ORMDL3 identifies
variants affecting other nearby genes.

\section{Filtering and analyzing variants on multiple individuals}

The analysis of a ragged variant set requires infrastructure.  We
will illustrate with a focused analysis of variants in the vicinity
of ORMDL3.  We have used the GGdata and hmyriB36 packages to collect
expression data on 12 individuals in the diversity cohort, in the
CY17 smlSet instance.  This includes expression on all genes on chr17, and the HapMap
phase 2 genotypes as well.

<<getd>>=
suppressPackageStartupMessages(library(GGtools))
data(CY17)
CY17
sn = sampleNames(CY17)
@

\subsection{Sample filtering}

The ragged variant set can be filtered to these individuals.
<<getse>>=
rv17 = rv[, sn]
rv17
@

\subsection{Counting variants in a specified region, with
  quality filtering}

The variant counting function takes two key parameters in
addition to the variant set:
a region within which to count, and a lower bound on call
quality for retained variants.  A third additional parameter
tells how to iterate over samples with an lapply-like function.

Since ORMDL3 is on the minus strand, the upstream region is
to the right.  We will create a region from start site
to 50k upstream.
<<lkco,cache=TRUE>>=
if (length(ortx)>1) ortx = ortx[2]
ortss = end(ortx)
ortup50 = GRanges("chr17", IRanges(ortss, width=50000))
cv50k = countVariants(rv17, ortup50, 160, lapply )
<<lkcnt>>=
cv50k
@
We see that the second sample seems to have a quality problem.
We will now drop it from both the expression and variant structures.
<<drops>>=
if (length(sampleNames(rv17))==12) rv17 = rv17[,-2]
if (length(sampleNames(CY17))==12) CY17 = CY17[,-2]
#redo
<<doagain,cache=TRUE>>=
cv50k = countVariants(rv17, ortup50, 160, lapply )
@

We can acquire the full data on variants in
the region under the quality constraint using variantGRanges.
<<lkv,cache=TRUE>>=
vv50k = variantGRanges( rv17, ortup50, 160, lapply )
<<domolk>>=
vv50k[[1]][1:5]
sapply(vv50k,length)
@

As a naive hint of a connection of ``variant burden'' with
ORMDL3 expression, consider the following display.
<<lkdis,fig=TRUE,keep.source=TRUE>>=
ORMDL3ex = as.numeric(exprs(CY17[genesym("ORMDL3"),]))
ygr = ifelse((1:11)<=4, "red", "green")
plot(ORMDL3ex~cv50k, col=ygr, pch=19, 
      ylab="variant count from 50kb upstream to TSS")
legend(10, 8.5, pch=19, col=c("red", "green"), legend=c("CEU", "YRI"))
summary(lm(ORMDL3ex~cv50k*factor(ygr)))
@

\subsection{Enumerating variants by structural context}

Now we focus on variants in the ORMDL3 coding region.
<<purerng,cache=TRUE>>=
library(parallel)
options(mc.cores=max(c(2, parallel::detectCores()-2)))
vv = variantGRanges( rv17, ortx, 160, mclapply )
vvv = lapply(vv, function(x) renameSeqlevels(x, c("17"="chr17")))
mycache = new.env(hash=TRUE)
locs = lapply(vvv, function(x) {
  locateVariants(x, tx19, CodingVariants(),cache=mycache)
})
@

Further work:  illustrate the predictCoding behavior, 
streamline the catalog of variants relevant to a given
gene over the 46 individuals.  Relate to population membership, and, where
available, to expression variation.


\end{document}