\documentclass[a4paper]{article} %\VignetteIndexEntry{Reading Files} %\VignettePackage{pegas} \usepackage{ape} \newcommand{\loci}{\code{"loci"}} \newcommand{\genind}{\code{"genind"}} \newcommand{\NA}{\code{NA}} \newcommand{\pegas}{\pkg{pegas}} \newcommand{\adegenet}{\pkg{adegenet}} \newcommand{\gdata}{\pkg{gdata}} \newcommand{\poppr}{\pkg{poppr}} \newcommand{\genetix}{\textsc{Genetix}} \newcommand{\fstat}{\textsc{Fstat}} \newcommand{\genepop}{\textsc{Genepop}} \newcommand{\struc}{\textsc{Structure}} \author{Emmanuel Paradis} \title{Reading Genetic Data Files Into R with \adegenet\ and \pegas} \begin{document} \DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom=\color{darkblue}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom=\color{black}\vspace{-1.5em}} \maketitle\vspace*{1pc}\hrule \vspace{1cm} <>= options(width = 80, prompt = "> ") @ \vspace{1cm} \adegenet\ \cite{Jombart2008} and \pegas\ \cite{Paradis2010} are two packages that complement each other for population genetic analyses in R. Since various computer programs have been used by population geneticists, most of them with their own data file formats, it is crucial that R can read them to ease users to switch to R. The present document explains how to read several file formats commonly used in population genetics. The following formats are considered: \begin{itemize} \item Text (ASCII) in tabular form, \item VCF files, \item \fstat\ data format, \item \genetix\ data format, \item \genepop\ data format, \item \struc\ data format, \item Excel. \end{itemize} Except the last one, these files are stored in text (usually standard ASCII) format the differences being in the layout of the data. \section{Data Structures in \adegenet\ and \pegas} First, let's have a brief look on the way allelic data are stored by our two packages. \adegenet\ has the class \code{"genind"} where individuals are rows and alleles are columns in a matrix named \code{tab}. This is an S4 class, so the elements are accessed with the \code{@} operator (e.g., \code{x@tab}). Additional information are stored in other slots (\code{@ind.names}, \code{@pop}, \dots) The details can be found with \code{class?genind}. \pegas\ has the class \code{"loci"} which is a simple data frame with a mandatory attribute named \code{"locicol"} which identifies the columns that are loci; the other columns are additional (individual) variables that may be of any kind. The loci are coded with factors where the different levels are the observed genotypes with the alleles separated with a forward slash, for instance, `A/A' for a classical genotype, or `132/148' for a microsatellite locus. This is an S3 class.\footnote{For details: emmanuelparadis.github.io/pegas/DefinitionDataClassesPegas.pdf.} Some examples are given in the next subsection. With version 0.6, \pegas\ supports phased and unphased genotypes (A|A and A/A, respectively). In unphased genotypes, the alleles are sorted with uppercase first, and then in alphabetical order, so that a/A is changed into A/a (even if the latter was not observed in the original data file). There is no need to choose between these two data structures: they are used by each package and they can be converted between each other with the functions \code{genind2loci} (or the generic form \code{as.loci}) and \code{loci2genind}. Therefore, it is straightforward to run analyses with both packages on the same data. \section{Reading Genetic Data Files} \subsection{Reading Text Tabular Files}\label{sec:tab} It is intuitive to organise allelic data in a tabular form where each row is an individual and the columns are loci and possibly other variables such as population, spatial locations, and so on. A simple example of such a file would be (file `toto'):\footnote{File contents are printed in blue to distinguish them from R input/output.} \color{blue} \begin{verbatim} a/a \end{verbatim} \color{black} This file can be read with the \pegas\ function \code{read.loci}: <<>>= library(pegas) x <- read.loci("toto", header = FALSE) x print(x, details = TRUE) class(x) @ Since the file has no label for the column, we used \code{header = FALSE}. Note that printing a \code{"loci"} object displays a very brief account of the data; the option \code{details = TRUE} allows to print them like a standard data frame.\footnote{The function \code{View} can also be used: this will use the same spreadsheet interface than \code{fix} but the data cannot be edited (see below).} If the same data were formatted with a different allele separator (file `titi'): \color{blue} \begin{verbatim} a-a \end{verbatim} \color{black} Then this file would be read with: <<>>= y <- read.loci("titi", header = FALSE, allele.sep = "-") print(y, details = TRUE) identical(x, y) @ Let us have a look on the different options of \code{read.loci}: <<>>= args(read.loci) @ We already know \code{file}, \code{header}, and \code{allele.sep}. Note the default value for this last option which specifies that the allele separator can be either a slash or a vertical bar. \code{loci.sep} is the separator of the columns (not only the loci) which is one or several spaces by default (use \code{sep = "\textbackslash t"} if a tabulation). \code{col.pop} must be an integer giving the index of the column that will be labelled ``population'' in the returned data; by default there is none. \code{col.loci} does the same for the loci; by default all columns are treated as loci except if \code{col.pop} is used. Finally `\code{\dots}' may be any further (named) arguments passed to \code{read.table} (e.g., \code{skip} in case there are comments at the top of the file). Any level of ploidy is accepted and \pegas\ checks the order of the alleles (see above). For instance the file `tutu' is: \color{blue} \begin{verbatim} a/a/A A/a/a \end{verbatim} \color{black} <<>>= print(read.loci("tutu", FALSE), TRUE) @ Phased and unphased genotypes can be mixed in a file. For instance the file `tyty' is: \color{blue} \begin{verbatim} Loc1 A/a a/A A|a a|A \end{verbatim} \color{black} <<>>= X <- read.loci("tyty") print(X, TRUE) summary(X) @ A more realistic example with four columns---an allozyme locus, a microsat locus, a population assignment, and a phenotypic variable---might be (file `tata'): \color{blue} \begin{verbatim} Adh2 SSR1 pop size IndA1 A/A 100/200 A 2.3 IndA2 A/a 100/120 A 2.5 IndB1 A/A 100/100 B 2.1 IndB2 a/a 120/120 B 2.8 \end{verbatim} \color{black} which will be read with: <<>>= z <- read.loci("tata", loci.sep = "\t", col.loci = 2:3, col.pop = 4, row.names = 1) z print(z, details = TRUE) @ Note \code{row.names} which is passed with the `\code{\dots}' argument. To make sure that only the first and the second columns are treated as loci, let us extract the alleles from this data set: <<>>= getAlleles(z) @ We may check that the attribute \code{"locicol"} has been set correctly, but usually the user does not need: <<>>= attr(z, "locicol") @ Finally we display the internal structure of the data to see that the additional variables are treated as they should be: <<>>= str(z) @ \subsection{Reading VCF Files} Starting with version 0.6, \pegas\ can read VCF files with the function \code{read.vcf}. Version 0.8 has a completely rewritten code: <<>>= args(read.vcf) @ By default, the first 10,000 loci are read. The option \code{which.loci} is an alternative way to specify which loci to read in the file. For instance, the following is the same than the default (the arguments \code{from} and \code{to} are ignored here): \begin{verbatim} read.vcf(file, which.loci = 1:1e4) \end{verbatim} In practice, the numbers passed to this option will be obtained from additional functions which query information from VCF files (see \code{?VCFloci} for more information). \code{read.vcf} returns an object of class \code{"loci"}. \subsection{Importing \fstat, \genetix, \genepop, and \struc\ Data Files} These four programs have their own data format. Roughly, these formats have the same idea: they store the genotypes of individuals from different populations. So, they store genotypes at several loci and an individual categorical variable. Additionally, the \genetix\ and \struc\ formats allow for individual labels. \adegenet\ includes four data files in each of these formats of the same microsatellite data set. These files can be displayed in the R console with: \begin{Schunk} \begin{Sinput} > file.show(system.file("files/nancycats.dat", package="adegenet")) > file.show(system.file("files/nancycats.gtx", package="adegenet")) > file.show(system.file("files/nancycats.gen", package="adegenet")) > file.show(system.file("files/nancycats.str", package="adegenet")) \end{Sinput} \end{Schunk} If you want to copy these files into the working directory to further display or edit them with your favourite editor, use these commands: \begin{Schunk} \begin{Sinput} > file.copy(system.file("files/nancycats.dat", package = "adegenet"), getwd()) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \begin{Sinput} > file.copy(system.file("files/nancycats.gtx", package = "adegenet"), getwd()) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \begin{Sinput} > file.copy(system.file("files/nancycats.gen", package = "adegenet"), getwd()) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \begin{Sinput} > file.copy(system.file("files/nancycats.str", package = "adegenet"), getwd()) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \end{Schunk} \adegenet\ provides four functions to read these formats. Reading the first three formats is straightforward: \begin{Schunk} \begin{Sinput} > A <- read.fstat("nancycats.dat", quiet = TRUE) > B <- read.genetix("nancycats.gtx", quiet = TRUE) > C <- read.genepop("nancycats.gen", quiet = TRUE) \end{Sinput} \end{Schunk} Reading a \struc\ file is slightly more complicated because the function needs some exta information. This can be done interactively (the default), or by specifying the appropriate options in which case we will use \code{ask = FALSE}: \begin{Schunk} \begin{Sinput} > D <- read.structure("nancycats.str", onerowperind=FALSE, n.ind=237, n.loc=9, col.lab=1, col.pop=2, ask=FALSE, quiet=TRUE) \end{Sinput} \end{Schunk} All four data sets are identical (we only compare the \code{tab} slots): \begin{Schunk} \begin{Sinput} > identical(A@tab, C@tab) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \begin{Sinput} > identical(B@tab, D@tab) \end{Sinput} \begin{Soutput} [1] TRUE \end{Soutput} \end{Schunk} Once the data have been read into R, they can be analysed with \adegenet\ or with \pegas\ after eventually converting them with \code{as.loci}. We now delete the data files: \begin{Schunk} \begin{Sinput} > unlink(c("nancycats.dat", "nancycats.gtx", "nancycats.gen", "nancycats.str")) \end{Sinput} \end{Schunk} Finally, \pegas\ has the function \code{read.gtx} to read a \genetix\ data file and return an object of class \loci. This function has no option. \subsection{Importing Excel Files} Excel is widely used for trivial data management, but clearly these data must be exported to other programs for most analyses. This also applies to the free spreadsheet editors such as OpenOffice's Calc or Gnumeric. Several solutions to get such data into R are given below. I assume that the allelic data in the spreadsheet are in a tabular form similar to what we have seen in Section \ref{sec:tab}, so the objective is to have them in R as a \code{"loci"} object. \begin{enumerate} \item The simplest solution is to save the spreadsheet as a text file using either the tab-delimited or comma-separated-variable (csv) format. This can be done with any spreadsheet editor since Calc or Gnumeric can import Excel files. Once the text file is created, \code{read.loci} can be used with the option \code{loci.sep = "\textbackslash t"} or \code{loci.sep = ","}, as well as any other that may be needed. \item If the ``Save as\dots'' solution does not work, it is possible to save a sheet, or part of it, in a text file by following these steps: \begin{enumerate} \item Open the file, again this may be done with any program. \item Select the cells you want to export; this can be done by clicking once on the top-left cell, and then clicking a second time on the bottom-right cell after pressing the Shift key (this could avoid you a tunnel syndrome and is much easier if many cells must be selected). \item Copy the selected cells in the clipboard (usually Ctrl-C). \item Open a text editor (do not use a word processor), paste the content of the clipboard (usually Ctrl-V), and save the file. \end{enumerate} The text file can now be read with \code{read.loci(..., loci.sep = "\textbackslash t")}. \item If Perl is installed on your computer (this is true for almost all Linux distributions), you can use the function \code{read.xls} from the package \pkg{gdata} (available on CRAN) to read directly an Excel file into R (the Perl program actually does the same job than the user does manually in the ``Save as\dots'' solution above). By default the first sheet is used, but this can be changed with the \code{sheet} option. The returned object is a data frame and can be converted as a \code{"loci"} object with \code{as.loci}. In that case, the same options that in \code{read.loci} can be used (see \code{?as.loci}.). In my experience, \code{read.xls} works well with small to moderate size Excel files but can be very slow with bigger files ($>$ 10 MB). \end{enumerate} We also note the function \code{read.genealex} in the package \poppr\ \cite{Kamvar2014} which reads a Genalex file that has been exported into csv format and returns an object of class \code{"genind"}. \section{An Example From Dryad} This section shows how the \code{jaguar} data set was prepared. This data set, deliver with \pegas, was published by Haag et al.\ \cite{Haag2010}, and was deposited on Dryad.\footnote{http://datadryad.org/resource/doi:10.5061/dryad.1884} The main data file is in Excel format and can be accessed directly at the locations indicated below so that it can be read remotely with \gdata: \begin{Schunk} \begin{Sinput} > f <- "http://datadryad.org/bitstream/handle/10255/dryad.1885/\ MicrosatelliteData.xls?sequence=1" > library(gdata) > x <- read.xls(f, row.names = 1) \end{Sinput} \begin{Soutput} essai de l'URL 'http://datadryad.org/bitstream/handle/10255/\ dryad.1885/MicrosatelliteData.xls?sequence=1' Content type 'application/vnd.ms-excel;charset=ISO-8859-1'\ length 29184 bytes (28 KB) ================================================== downloaded 28 KB \end{Soutput} \end{Schunk} The object \code{x} is a data frame with the row names set correctly with the line identifiers of the original file since we have used the option \code{row.names = 1}. In this data frame each column is an allele so that two columns are used to represent the genotype at a given locus. This is clearly not the format used by the class \code{"loci"}. Furthermore, some rows indicating the populations have been inserted with missing values (\NA) for all columns. Fortunately, the rows with genotypes have no \NA, so it is easy to find the rows with the population names, and drop them before transforming the data frame with the function \code{alleles2loci}. This function has been specially designed to transform such data sets. The commands are relatively straightforward: \begin{Schunk} \begin{Sinput} > s <- apply(x, 1, anyNA) > y <- alleles2loci(x[!s, ]) \end{Sinput} \end{Schunk} We can now extract the population names and assign them to each observation; this is slightly more complicated, but the logical is based on the fact that the rows below a population name should be assigned to it:\footnote{In practice, a \code{for} loop can be used: it would be less efficient but more intuitive and easier to read.} \begin{Schunk} \begin{Sinput} > w <- which(s) > n <- diff(c(w, nrow(x) + 1)) - 1 > pop <- factor(rep(1:4, n), labels = names(w)) > y$population <- pop \end{Sinput} \end{Schunk} The data are now ready to be analysed in R. We can check that the row- and colnames are correctly set with the labels from original file: \begin{Schunk} \begin{Sinput} > y \end{Sinput} \begin{Soutput} Allelic data frame: 59 individuals 13 loci 1 additional variable \end{Soutput} \begin{Sinput} > dimnames(y) \end{Sinput} \begin{Soutput} [[1]] [1] "bPon01" "bPon02" "bPon133" "bPon134" "bPon135" [6] "bPon140" "bPon137" "bPon139" "bPon138" "bPon136" [11] "bPon141" "bPon143" "bPon142" "bPon124" "bPon366" [16] "bPon12" "bPon91" "bPon04" "bPon25" "bPon48" [21] "bPon49" "bPon50" "bPon51" "bPon52" "bPon53" [26] "bPon54" "bPon35" "bPon46" "bPon40" "bPon41" [31] "bPon47" "bPon78" "bPon36" "bPon359" "bPon44" [36] "bPon80" "bPon03" "bPon11" "bPon15" "bPon16" [41] "bPon17" "bPon18" "bPon19" "bPon20" "bPon21" [46] "bPon22" "bPon23" "bPon27" "bPon29" "bPon30" [51] "bPon31" "bPon32" "bPon38" "bPon45" "bPon130" [56] "bPon131" "bPon132" "bPon58" "bPon24" [[2]] [1] "FCA742" "FCA723" "FCA740" "FCA441" [5] "FCA391" "F98" "F53" "F124" [9] "F146" "F85" "F42" "FCA453" [13] "FCA741" "population" \end{Soutput} \end{Schunk} \section{Editing and Writing Genetic Data Files} After the data have been read into R, they can be manipulated in the standard way. This is straightforward for the class \code{"loci"} since it is a direct extension of data frames. \pegas\ has a few method functions adapted to \loci: \code{rbind}, \code{cbind}, and the indexing operator \code{[}. Some other functions, such as \code{subset}, \code{split}, \code{rownames}, \code{colnames}, or the \code{\$} operator, can be used without problem since they respect additional attributes. Others, such as \code{transform}, drop the attributes and so will return a simple data frame. \adegenet\ allows to edit an object of class \genind, but since this is an S4 class, the elements are accessed with the \code{@} operator. The help page \code{?genind} describes them in details. A few functions are provided to ease the manipulation of \genind\ objects because setting all elements by hand may be tedious: \code{seploc} splits the data with respect to each locus, \code{seppop} does the same with respect to each population, and \code{repool} allows to do the opposite operation. It is also possible to select a part of a \genind\ object with the usual indexing operator \code{[}, the indices will apply to the rows and/or columns of the \code{@tab} slot, and the other slots will be modified accordingly. Some care should be paid when using numerical indexing on columns because something like \code{x[, 1]} will only select one allele column, eventually excluding other alleles of the same locus. It is safer to use the option \code{nloc} which specifies the loci to be selected, and so will select the appropriate allele columns. Further details are available in \code{?pop} as well as information on some other functions. In addition to standard data editing, \pegas\ allows to edit a \loci\ object with \code{edit} or \code{fix}. The command \code{edit(x)} opens the data in R's spreadsheet-like data editor. Here are a few points about this procedure: \begin{itemize} \item It is possible to change the row and column labels (\code{rownames} and \code{colnames}). \item It is possible to add new rows (individuals): if some columns are not filled they will be given \code{NA}. \item You can add new genotypes and/or alleles to a locus column: the levels of the corresponding factor will be adjusted and a warning message will inform you of that. \item New columns may be added, but they can only be numerical or character vectors. \item Like most R functions, \code{edit} returns its results in the console if no assignment has been done, so you may prefer to call the editor with \code{x <- edit(x)} or \code{fix(x)}. If you forgot the assignment and don't want to lose all the changes you did, after you have closed the editor you can save the modified data with (only if you don't do any other operation after \code{edit}): \begin{verbatim} x <- .Last.value \end{verbatim} \end{itemize} \bibliographystyle{plain} \bibliography{pegas} \end{document}