%\VignetteIndexEntry{Detecting heterogenity in population structure across chromosomes with the "CAnD" package} %\VignettePackage{CAnD} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \usepackage{cite} \usepackage{amsmath,amsfonts} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=FALSE,png=TRUE,include=FALSE,width=4,height=4.5,resolution=150} \setkeys{Gin}{width=0.5\textwidth} % use a vertical rule for R output instead of prompts for R input \usepackage{fancyvrb} \definecolor{darkgray}{gray}{0.2} \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1em,formatcom={\color{darkgray}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em,frame=leftline,framerule=.6pt,rulecolor=\color{darkgray},framesep=1em,formatcom={\color{darkgray}}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \author{Caitlin McHugh$^{1*}$ \\ [1em] \small{$^{1}$ Department of Biostatistics, University of Washington} \\ \small{\texttt{$^*$mchughc (at) uw.edu}}} \title{Detecting Heterogeneity in Population Structure Across Chromosomes: the CAnD Package} \date{\today} \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% With the advent of dense, accurate and inexpensive genomic data, researchers are able to perform analyses that estimate ancestry across the entire genome. In particular, ancestry can be inferred across regions of the genome that are interesting for a disease trait, or can be inferred chromosome-wide to identify regions that have been passed down by an ancestral population. The \Rpackage{CAnD} package provides functionality for the method that compares proportion ancestry in a sample set across chromosomes or chromosomal regions [1]. We calculate p-values for the observed difference in ancestry across chromosomes, properly accounting for multiple testing. An overall CAnD statistic and p-value are stored for each analysis. This vignette describes a typical analysis workflow and includes some details regarding the statistical theory behind \Rpackage{CAnD}. For more technical details, please see reference [1]. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Example workflow for CAnD} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Preparing a data set for analysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% For our example, we will use a set of simulated data, the \Rcode{ancestries} data set from the \Rpackage{CAnD} package. We begin by loading relevant libraries, subsetting the data, and producing summary statistics. <>= library(CAnD) data(ancestries) dim(ancestries) @ We initially can look at the columns of our \Rcode{ancestries} object that correspond to the estimated proportion ancestries of chromosome one. <>= ancestries[1:2,c(1,2,25,48)] @ The \Rcode{ancestries} data.frame holds simulated proportions for a set of 50 samples. Every row corresponds to a sample and each sample has a unique id, stored as \Rcode{IID}. We imagine the proportions displayed in \Rcode{ancestries} were estimated from a program such as FRAPPE [2], ADMIXTURE [3] or RFMix [4]. In this particular example, three ancestral subpopulations were assumed, namely \Rcode{Euro}, \Rcode{Afr} and \Rcode{Asian}. The proportions can be locus-specific ancestry averaged across chromosomes, or could be any other sort of ancestral estimate for a portion of the genome. Furthermore, there can be any number of ancestral populations. Of course, the results are only interesting with two or more ancestries. In our sample data set, every sample has a column corresponding to the ancestral proportion for each of the three ancestries for all autosomal chromosomes 1-22 and the X chromosome. The three proportions should sum to one for each chromosome within a sample. First we can examine the estimated proportions, both by sample and by ancestry. We create histograms of these values, seen in Figure~\ref{fig:hists}. <>= par(mfrow=c(2,2)) hist(ancestries$Afr_1,main="Est Proportion African Ancestry\nChromosome 1", xlab="Chr 1 Prop African Ancest",cex.main=0.8) hist(ancestries$Afr_2,main="Est Proportion African Ancestry\nChromosome 2", xlab="Chr 2 Prop African Ancest",cex.main=0.8) afrCols <- seq(from=25,to=(25+22)) asianCols <- seq(from=(25+22+1),to=ncol(ancestries)) hist(as.numeric(ancestries[1,afrCols]),main="Est Proportion African Ancestry\nGenome-wide, Sample 1", xlab="Sample 1 Prop African Ancestry",cex.main=0.8) hist(as.numeric(ancestries[1,asianCols]),main="Est Proportion Asian Ancestry\nGenome-wide, Sample 1", xlab="Sample 1 Prop Asian Ancestry",cex.main=0.8) @ \begin{figure} \centering \includegraphics{CAnD-plotHists} \caption{Histograms of estimated proportion African ancestry for all samples on chromosome 1 and 2, and estimated proportion African and Asian ancestry, genome-wide, for Sample 1.} \label{fig:hists} \end{figure} The \Rcode{data.frame} is the only input file required to run the CAnD tests. For each test, we will subset the columns to the particular ancestry of interest. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Running the CAnD Test}\label{sec:CAnD} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The CAnD test detects heterogeneity in population structure patterns across chromosomes. CAnD uses local ancestry estimated from SNP genotype data to identify significant differences in ancestral contributions to chromosomes in samples from admixed populations. Statistically, CAnD compares a chromosome $c$ with a pool of all other chromosomes. The null hypothesis is that the mean difference between ancestry proportion on chromosome $c$ and the mean ancestry proportion across all other chromosomes is zero. For more details, see Section~\ref{paramMethods} and reference [1]. We will perform the CAnD test on the estimated proportions of European ancestry. In order to do this, we first subset \Rcode{ancestries} to the columns of interest. <>= euroCols <- seq(from=2,to=(2+22)) head(ancestries[,euroCols[20:23]],2) colnames(ancestries[euroCols]) euroEsts <- ancestries[,euroCols] dim(euroEsts) head(euroEsts[,1:5],2) @ Then, we can simply run the CAnD test across all chromosomes for the estimated European ancestry in our 50 samples. <>= param_cRes <- CAnD(euroEsts) param_cRes test(param_cRes) overallpValue(param_cRes) overallStatistic(param_cRes) BonfCorr(param_cRes) @ We notice that the CAnD p-value is significant when considering the difference in chromosomal estimates of European ancestry genome-wide. To further investigate this, we can examine the p-values calculated for each chromosome. <>= pValues(param_cRes) @ The p-value comparing the X chromosome to the autosomes is highly significant, implying that the estimated European ancestry on the X chromosome is statistically significantly different from that on the autosomes. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Visualizing Results} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% There are two plotting functions available in \Rpackage{CAnD} to visualize results from the CAnD method. The \Rfunction{plotPvals} function plots the calculated $p$-values against each chromosome/chromosomal region. We will show the results from the CAnD test in Figure~\ref{fig:pvals}. <>= plotPvals(param_cRes, main="CAnD P-values\nProportion European Ancestry Genome-wide") @ \begin{figure} \centering \includegraphics{CAnD-plotb} \caption{The calculated p-value from the CAnD method to detect heterogeneity in proportion European ancestry by chromosome.} \label{fig:pvals} \end{figure} The \Rfunction{barPlotAncest} function plots the proportion ancestry for a given chromosome/chromosomal region for each sample. This visualization is an efficient way to compare the proportions ancestry across the entire sample. Note this is simply a summary plot and does not require running of the CAnD tests to produce. We see the results for our sample in Figure~\ref{fig:barplot}. <>= chr1 <- ancestries[,c("Euro_1","Afr_1","Asian_1")] barPlotAncest(chr1,title="Chromosome 1 Ancestry Proportions") @ \begin{figure} \centering \includegraphics{CAnD-plot2} \caption{Barplot of chromosome 1 ancestry proportions, ordered by increasing proportion European ancestry.} \label{fig:barplot} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Methods in brief} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Define the proportion ancestry from subpopulation $k$ for individual $i$ to be $a_{ik}$, $i \in \{1,\dots,N\}$. Let $G_{-c}=\{1,2,\dots,c-1,c+1,\dots,22,X\}$. For a given chromosome of interest $c$, we calculate the pooled mean of all chromosomes excluding $c$ as $$ a_{ik}^{-c}=\frac{1}{22}\sum_{M\in G_{-c}}a_{ik}^M $$ The difference in ancestry between a given chromosome $c$ and the average of all other chromosomes, in individual $i$ and for a given ancestry subpopulation $k$, is $$ D_{ik}^c=a_{ik}^{-c}-a_{ik}^c $$ Denote the mean $D_{ik}^c$ across all individuals $i$ as $\overline{D_k^c}$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{CAnD Methods}\label{paramMethods} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The CAnD method tests for heterogeneity across $m$ chromosomes [1]. We first define the t-statistic comparing differences in ancestry subpopulation $k$ on chromosome $c$ with a pool of the other chromosomes as $$ T_k = \overline{D_k^c} / \sqrt{v^2_k/n} $$ where $v_k^2=\frac{1}{n-1}\sum_{i=1}^n (D_{ik}^c-\overline{D_k^c})^2$ is the sample variance. Note this statistic takes into account the average ancestry difference between chromosome $c$ and the mean ancestry of the other chromosomes across all individuals as well as within individuals. $T_k$ has $n-1$ degrees of freedom and is a test statistic that tests the null hypothesis that the mean difference between the ancestry proportion on chromosome $c$ and the ancestry proportion across all other chromosomes for subpopulation $k$ is zero. We calculate $T_k$ for each chromosome $c$ of interest, and obtain $m$ $p$-values $p_c,c\in \{ 1,\dots,m\}$. Then, we define $$\widetilde T_k = \begin{pmatrix}T_k^1 \\ T_k^2 \\ \vdots \\ T_k^m \end{pmatrix} \sim \mbox{MVN}(0, \Sigma) $$ It follows that combined CAnD statistic, allowing for correlation between $T_k^c$ and $T_k^{c'}$, is \begin{equation} \chi^2_{CAnD}=\widetilde T'_k \Sigma^{-1} \widetilde T_k \end{equation} which follows a $\chi^2_m$ distribution under the null hypothesis. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <>= toLatex(sessionInfo()) @ <>= options(prompt="> ", continue="+ ") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{References} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 1. McHugh, C., Brown, L., Thornton, T. Detecting heterogeneity in population structure across the genome in admixed populations. Genetics, 2016. 2. Tang, H., Peng, J., Wang, P., Risch, N.J. Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology, 2005. 3. Alexander, D.H., Novembre, J., Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 2009. 4. Maples, B.K., Gravel, S., Kenny, E.E., Bustamante, C.D. RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. American Journal of Human Genetics, 2013. \end{document}