% % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{cgdv17: extract from Complete Genomics diversity panel} %\VignetteDepends{GGtools,TxDb.Hsapiens.UCSC.hg19.knownGene,parallel} %\VignetteKeywords{genetic sequencing} %\VignettePackage{cgdv17} \documentclass[12pt]{article} \usepackage{auto-pst-pdf} \usepackage{amsmath,pstricks} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Exploring the Complete Genomics Diversity panel} \author{VJ Carey} \maketitle \tableofcontents \clearpage \section{Introduction} Complete Genomics Inc. distributes a collection of data on deeply sequenced genomes (from Coriell cell lines) from 11 different human populations. <>= library(cgdv17) data(popvec) popvec[1:5] table(popvec) @ The data are distributed with many details; VJC obtained the masterVar TSV files from the Complete Genomics ftp2 site, converted these to VCF 4.0 in Oct. 2011, using a tool noted at \begin{verbatim} http://community.completegenomics.com/tools/m/cgtools/219.aspx \end{verbatim} The conversion tool used was released with various caveats. Perhaps the whole conversion should be redone with official tools. The purpose of this note is to explore some basic structural features of the data, so that relevant genetic structures can be identified for analytic programming and interpretation. We focus on variant calls on chromosome 17. Formal restrictions on publications related to these data are as follows. \begin{verbatim} 1. The Coriell and ATCC Repository number(s) of the cell line(s) or the DNA sample(s) must be cited in publication or presentations that are based on the use of these materials. 2. You must reference our Science paper (R. Drmanac, et. al. Science 327(5961), 78. [DOI: 10.1126/ science.1181498]) 3. You must provide the version number of the Complete Genomics assembly software with which the data was generated. This can be found in the header of the summary.tsv file (\# Software_Version). \end{verbatim} \clearpage \section{Contents of a VCF header} There is a lot of redundancy among the headers for the 46 files, so one was isolated for distribution. <>= data(h1) h1 h1[[1]]$Sample h1[[1]]$Header$META h1[[1]]$Header$INFO h1[[1]]$Header$FORMAT @ \clearpage \section{Variant calls for chromosome 17} \subsection{Recording structural variation for an individual} We created a provisional container for the call data on chromosome 17. Tabix facilities were used to filter and index the data from the full VCF to all of chromosome 17. At present it is not clear how to model a collection of deeply sequenced chromosomes. I have used VariantAnnotation:::readVcf, which must be applied separately for each individual, given the Complete Genomics distribution. The focus is on structural information in the rowRanges component of the VCF object returned by \texttt{readVcf}, which is a GRanges instance. From the elementMetadata I removed FILTER and added \verb+geno()$GT+ information. This gives us information to specific variants and phase for some variants, depending on the string content of the GT information. The \texttt{getRVS} function will collect file references for the serialized GRanges. <>= rv = getRVS("cgdv17") rv @ Data on one individual can be extracted using \texttt{getrd()}. We will confine attention to variants with quality score in the top quartile of its distribution for this individual. <>= R85 = getrd(rv, "NA06985") length(R85) summary(elementMetadata(R85)$QUAL) kp = which(elementMetadata(R85)$QUAL >= 166) R85hiq = R85[kp] @ A small excerpt gives us a sense of the sorts of variation to be encountered: <>= elementMetadata(R85hiq)[11:20,] refs = elementMetadata(R85hiq)$REF alts = elementMetadata(R85hiq)$ALT genos = elementMetadata(R85hiq)$geno table(nchar(refs)) alts[grep(",",unlist(alts))] @ Summary: references are recorded as DNAStrings, alternatives are compressed character strings with commas, and the phasing of the individual-level calls can be derived by parsing the \texttt{geno} component. \subsection{Isolating variants in the vicinity of a gene, for an individual} We are interested in gene ORMDL3. We will use the hg19 transcriptDb to obtain the locations and tabulate higher quality variants observed 100kb up and downstream of the transcript. <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) tx19 = TxDb.Hsapiens.UCSC.hg19.knownGene library(org.Hs.eg.db) get("ORMDL3", revmap(org.Hs.egSYMBOL)) ortx = transcriptsBy(tx19, "gene")$"94103" seqlevels(R85hiq) = "chr17" aro = subsetByOverlaps(R85hiq, ortx+100000) table(elementMetadata(aro)$geno) alts = unlist(elementMetadata(aro)$ALT) alts[nchar(alts)>1] @ There are deletion and insertion events, but I don't see any simple way of isolating and counting them at the moment. Some code will be added to address this. We can use VariantAnnotation to obtain structural contexts. It takes over a minute to use locateVariants, so I just show the code and results here for now. \begin{verbatim} mycache = new.env(hash=TRUE) lvaro = locateVariants(aro, tx19, cache=mycache) lvaro[1:4,] table(lvaro$loca,sapply((lvaro$geneID), function(x)strsplit(x, ",")[[1]][1])) DataFrame with 4 rows and 5 columns queryID location txID geneID cdsID 1 1 intron 60959 22806 189661 2 1 intron 60960 22806 189661 3 1 intron 60961 22806 189661 4 1 intron 60962 22806 189661 124626 1440 22806 284110 2886 55876 5709 94103 9862 3'UTR 0 0 0 0 0 0 0 0 20 5'UTR 0 0 0 1 0 8 0 0 10 coding 0 0 0 6 0 0 0 0 10 intergenic 34 1 10 4 1 2 25 10 0 intron 28 0 105 42 0 138 99 6 0 \end{verbatim} We see that this search for variants near ORMDL3 identifies variants affecting other nearby genes. \section{Filtering and analyzing variants on multiple individuals} The analysis of a ragged variant set requires infrastructure. We will illustrate with a focused analysis of variants in the vicinity of ORMDL3. We have used the GGdata and hmyriB36 packages to collect expression data on 12 individuals in the diversity cohort, in the CY17 smlSet instance. This includes expression on all genes on chr17, and the HapMap phase 2 genotypes as well. <>= suppressPackageStartupMessages(library(GGtools)) data(CY17) CY17 sn = sampleNames(CY17) @ \subsection{Sample filtering} The ragged variant set can be filtered to these individuals. <>= rv17 = rv[, sn] rv17 @ \subsection{Counting variants in a specified region, with quality filtering} The variant counting function takes two key parameters in addition to the variant set: a region within which to count, and a lower bound on call quality for retained variants. A third additional parameter tells how to iterate over samples with an lapply-like function. Since ORMDL3 is on the minus strand, the upstream region is to the right. We will create a region from start site to 50k upstream. <>= if (length(ortx)>1) ortx = ortx[2] ortss = end(ortx) ortup50 = GRanges("chr17", IRanges(ortss, width=50000)) cv50k = countVariants(rv17, ortup50, 160, lapply ) <>= cv50k @ We see that the second sample seems to have a quality problem. We will now drop it from both the expression and variant structures. <>= if (length(sampleNames(rv17))==12) rv17 = rv17[,-2] if (length(sampleNames(CY17))==12) CY17 = CY17[,-2] #redo <>= cv50k = countVariants(rv17, ortup50, 160, lapply ) @ We can acquire the full data on variants in the region under the quality constraint using variantGRanges. <>= vv50k = variantGRanges( rv17, ortup50, 160, lapply ) <>= vv50k[[1]][1:5] sapply(vv50k,length) @ As a naive hint of a connection of ``variant burden'' with ORMDL3 expression, consider the following display. <>= ORMDL3ex = as.numeric(exprs(CY17[genesym("ORMDL3"),])) ygr = ifelse((1:11)<=4, "red", "green") plot(ORMDL3ex~cv50k, col=ygr, pch=19, ylab="variant count from 50kb upstream to TSS") legend(10, 8.5, pch=19, col=c("red", "green"), legend=c("CEU", "YRI")) summary(lm(ORMDL3ex~cv50k*factor(ygr))) @ \subsection{Enumerating variants by structural context} Now we focus on variants in the ORMDL3 coding region. <>= library(parallel) options(mc.cores=max(c(2, parallel::detectCores()-2))) vv = variantGRanges( rv17, ortx, 160, mclapply ) vvv = lapply(vv, function(x) renameSeqlevels(x, c("17"="chr17"))) mycache = new.env(hash=TRUE) locs = lapply(vvv, function(x) { locateVariants(x, tx19, CodingVariants(),cache=mycache) }) @ Further work: illustrate the predictCoding behavior, streamline the catalog of variants relevant to a given gene over the 46 individuals. Relate to population membership, and, where available, to expression variation. \end{document}