%% Sweave("GeneR/inst/doc/GeneR.Rnw") \documentclass[a4paper]{article} \usepackage{graphics} \title{GeneR Package} \author{L. Cottret, A. Lucas, O. Rogier, E. Marrakchi, V. Lefort,\\ P. Durosay, A. Viari, C. Thermes \& Y. d'Aubenton-Carafa} %\VignetteIndexEntry{Introduction to GeneR} %\VignettePackage{GeneR} \SweaveOpts{echo=FALSE} \usepackage{a4wide} \begin{document} \maketitle \tableofcontents \section{Overview} To summarize, we can split major functions into 5 categories. \subsection*{Reading and writing sequences} GeneR has been designed for fast sequence retrieving even from very large sequence databanks, in Fasta, Embl or Genbank formats. It is also possible to enter sequences directly from a R command line. \subsection*{Handling sequences} The usual copy-paste of parts of sequences or other manipulations can be performed by functions using vectors of strands and positions. Annotations from the features field within formatted sequence entries can be extracted and used directly in vectors. By this way, it is easy to extract sequence fragments of interest. \subsection*{Analyzing sequences } Some tools are designed to count oligo-nucleotides, to look for exact word positions or to shuffle sequences (useful for statistical validations). \subsection*{Manipulation of regions on a chromosome} In all the amount of annotations, we often have mass informations on genes (introns, CDS, isoforms) but we have to deduce intergenic regions, exons or more sophisticated regions. We provide a complete set of tools to easily compute any subregions, without an exhaustive texture on a whole chromosome. \subsection*{Genetic tools } Finally, the package also contains functions related to genetic and structural aspects of the sequences : ORF localization, translation, or RNA secondary structure determination (with extention of GeneR: GeneRfold package).\\ \section{Installation} %From your R session, type: %\begin{verbatim} %source("http://www.bioconductor.org/getBioC.R") %getBioC(libName='GeneR',develOK=TRUE) %\end{verbatim} %Or as for all R package sources, you can try also something like: With package sources, you can try also something like: \begin{verbatim} R CMD INSTALL GeneR_x.x-x.tar.gz \end{verbatim} or try in the menu (for Windows and mac users). \section{Quick Start} First steps on GeneR could be to check man page of GeneR: \begin{verbatim} library(GeneR) help(GeneR) \end{verbatim} \noindent And for impatient: \begin{verbatim} demo(GeneR) \end{verbatim} \section{Usage} \subsection{Basique manipulation with sequences as character string} Standard R commands allow to extract subpart of a character string, or append two character string, put in upper case (functions {\em substr, paste, toupper}). We add tools for the specific use of genetic sequences: insert a sequence into a first one, compute the reverse complementary, count mono, di or tri-nucleotides of sequence and the possibility to import sequence from a Embl, Fasta or GeneBank file.\\ \noindent {\bf Example} \noindent Insert a poly A into sequence <>= library(GeneR) s <- "gtcatgcatgctaggtgacagttaaaatgcgtctaggtgacagtctaacaa" s2 <- insertSeq(s,"aaaaaaaaaaaaaa",20) s2 @ \noindent Compute the reverse complementary: <>= library(GeneR) strComp(s) @ \noindent Count mono nucleotides, di-nucleotides (wsize can be from 1 to 15 if computer has enough memory). <>= strCompoSeq(s2,wsize=1) strCompoSeq(s2,wsize=2) @ \noindent Get some data from the web <>= seqNcbi("BY608190",file='toto.fa') strReadFasta('toto.fa',from=10,to=35) @ \subsection{Manipulation of sequences with buffers} \subsubsection{Introduction} To work on large sequences (i.e. a whole chromosome), we use a C++ adapted class that requiers a minimum memory allocation. Standard use: a chromosome load in the buffer, all computation done with GeneR functions, and only usefull results returns to standard R objects.\\ \subsubsection{Basic manipulation} \noindent Sequence read from a fasta file: <>= readFasta('toto.fa') readFasta('toto.fa',seqno=1) @ \noindent Show current buffer <>= getSeq(from=10,to=35) @ \noindent Sames previous functions with buffers: Extract from multiple positions <>= getSeq(seqno=0,from=c(1,10,360),to=c(10,20,0)) assemble(seqno=0,destSeqno=1,from=c(1,10,360),to=c(10,20,0)) getSeq(seqno=1) @ \noindent See the end of sequence <>= sizeSeq(seqno=0) size0 <- sizeSeq(seqno=0) getSeq(seqno=0,from=size0-20,to=0) @ \noindent Append and concat: <>= getSeq(0,from=1,to=35) getSeq(1) appendSeq(0,1) getSeq(seqno=0,from=size0-20,to=size0+10) concat(seqno1=0,seqno2=1,destSeqno=3, from1=2,to1=10,from2=8,to2=0, strand1=1) getSeq(3) @ \subsubsection{Look for motifs, composition, masked position} Several tools are designed to \begin{itemize} \item gets match positions of an oligomer in fragments of a sequence \item gets ORFs (Open Reading Frames) from a sequence. \item Get composition in mono, di or trinucleotides of fragments of a sequence \item mask sequence in lower case or with letter 'N' \item get masked position from a sequence. \item change sequence to Dna or Rna. \end{itemize} \noindent Look for motifs <>= exactWord("AAAG",seqno=0) @ \noindent Find Orfs <>= getOrfs(seqno=0) maxOrf(seqno=0) @ \noindent Mask sequences while lowering case of parts of sequence <>= x=c(5,15,30);y=c(10,20,35) lowerSeq(from=x,to=y) getSeq(from=1,to=35) @ \noindent And the reverse: get masked positions from a sequence: <>= posMaskedSeq(seqno=0,type="lower") posMaskedSeq(seqno=0,type="upper") @ \noindent Or masked with N: <>= s <- "ATGCtgTGTTagtacATNNNNNNNNNNNNNNNTGGGTTTaAAAattt" placeString(s,upper=FALSE,seqno=0) posMaskedSeq(seqno=0,type="N") @ \noindent Change from Dna to Rna alphabet <>= dnaToRna() getSeq() @ \subsubsection{Adresses manipulations} \paragraph{Presentation} When GeneR load a subset of a larger sequence stored in a bank file, % (by example a gene from a whole chromosome). it will store the following informations in the C adapted class (buffers, by default 100 buffers than can be extended if necessary): \begin{itemize} \item subsequence (i.e. the succession of A,T,G,C). \item postions of the extremities of the subsequence in the master sequence \item size of the whole sequence in the bank file \item name of the sequence \end{itemize} For specific purposes as renaming a sequence, all these variables can be viewed and carefully changed at any time (here functions {\em getAccn} and {\em setAccn}). Several sequences can be stored simultaneously and called by their buffer number. Strand is another global variable which can be set and viewed (functions {\em getStrand} and {\em setStrand}). It is used as input parameter in many functions to analyze complementary strand. It was designed to avoid doing explicitly the complement of the loaded strand then to store it in a buffer with, as consequence, loss of the informations linked to the master sequence. We have defined 3 types of addresses on a subsequence extracted from a master sequence: \begin{itemize} \item Absolute addresses i.e. addresses on the master sequence, from the 5' end of the input strand refered as forward (noted A) \item Real addresses, i.e. addresses on the master sequence, from the 5' end of one of strands (noted R) \item Relative addresses, i.e. addresses on working subsequence, from the 5' end of one of strands (noted T). \end{itemize} Let's show an example, if we read sequence from 11 to 25 from a gene of size 40 (see Figure\ref{adressage}). Obviously, when an entire sequence is stored, real and relative addresses will be the same. Although all functions using positions need and return absolute addresses, 6 functions allow to convert R, A, T into any other type (functions {\em RtoA, RtoT, AtoR, AtoT, TtoR, TtoA}). A global variable {\em strand} is used to convert positions (see {\em setStrand getStrand}). We keep globals variables of sequences... We implement tools to ... \begin{figure} \resizebox{\textwidth}{!}{\includegraphics{GeneR_address}} \caption{Adresses of a sequence} \label{adressage} \end{figure} \paragraph{Example} \noindent Imagine our chromosome is as follow, and we study only part of chromosome between position 11 to 25: <>= s <- "xxxxxxxxxxATGTGTCGTTAATTGxxxxxxxxxxxxxxx" placeString(s) setStrand(0) writeFasta(file="toto.fa") readFasta(file="toto.fa",from=11,to=25) getSeq() @ \noindent Orfs on 'Absolute' and 'relative' Adresses <>= getOrfs() AtoT(getOrfs()) @ \noindent More Fun: <>= getSeq() exactWord(word="AA") AtoT(exactWord(word="AA")[[1]]) setStrand(1) getSeq() exactWord(word="AA") AtoT(exactWord(word="AA")[[1]]) exactWord(word="TT") AtoT(exactWord(word="TT")[[1]]) @ \subsection{Bank files} \subsubsection{Retrieve sequences} We develop 2 functions to get sequence files from internet: seqUrl (from a srs server) and seqNcbi (from Ncbi). Sequence are retrieve in Fasta or Embl format for seqUrl; Fasta or GenBank format for seqNcib. <>= seqNcbi("BY608190",file="bank.fa") seqNcbi("BY608190",file="BY608190.gbk",type="genbank") #seqSrs("embl|BY608190|BY608190",file="BY608190.embl",submotif=TRUE) @ <>= setStrand(0) seqNcbi("BY608190",file="bank.fa") seqNcbi("BY608190",file="BY608190.gbk",type="genbank") #seqSrs("embl|BY608190|BY608190",file="BY608190.embl",submotif=TRUE) @ \subsubsection{Fasta files} Basic tools to get or write informations from bank file <>= readFasta(file="bank.fa", name="gi|26943372|gb|BY608190.1|BY608190", from=1,to=30) getSeq() fastaDescription(file="bank.fa", name="gi|26943372|gb|BY608190.1|BY608190") @ And to write a to the fasta bank file: <>= s="cgtagctagctagctagctagctagctagcta" placeString(s,seqno=0) writeFasta(seqno=0,file="bank.fa",name="MySequence", comment="A sequence Generated by R",append=TRUE) ## Show bank file cat(paste(readLines("bank.fa"),collapse='\n')) @ \subsubsection{Embl files} We made basic tools to get sequence and write Embl files. <>= # readEmbl(file='BY608190.embl',name='BY608190',from=10,to=30) # getSeq() s<-"gtcatgcatcctaggtgtcagggaaaatgcgtctacgtgacagtctaacaa" placeString(s) # Add lines with "CC bla bla bla" and a line "XX" writeEmblComment(file="toto.embl",code="CC",text="This is a comment for \ this dummy sequence... I try to be long enough to show that this comment \ will be written on several lines",append=FALSE) # Add a line with "FT CDS bla bla bla" writeEmblLine(file="toto.embl",code="FT",header="CDS",text="<1..12", nextfield = FALSE) # Add lines with "FT bla bla bla" writeEmblLine(file="toto.embl",code="FT",header="",text="/codon_start=2", nextfield = FALSE) writeEmblLine(file="toto.embl",code="FT",header="",text="/gene=\"toto\"", nextfield = FALSE) writeEmblLine(file="toto.embl",code="FT",header="",text="/note=\"Here is \ what I think about this gene\"",nextfield = TRUE) ## Translation prot <- translate(seqno=0,from=getOrfs()[1,1],to=getOrfs()[1,2]) writeEmblLine (file="toto.embl",code='FT',header='', text=paste('/translation="',prot ,'\"', sep=''),nextfield =FALSE) # Add sequence writeEmblSeq(file="toto.embl") ## Show file cat(paste(readLines("toto.embl"),collapse='\n')) @ \subsubsection{GenBank files} We have a function to open any sequence from a GeneBank files <>= readGbk(file='BY608190.gbk',name='BY608190',from=10,to=30) getSeq() @ \subsection{Segments manipulations} \subsubsection{Overview} We made a set of tools manipulation segments. The aim of these classes is to describe regions on chromosomes that are discontinuous segments on a line like: \begin{verbatim} 1 10 25 40 Region A: ########## ####### Region B: #### #### ############# Region C: ###### #### \end{verbatim} \noindent We made two kind of class \begin{itemize} \item {\tt segSet}: segments set, is a matrix nx2 composed of a column of "from", and a column of "to". Used to describe a region like 'A' or 'B' in our example. (A matrix 3x2 to describe region 'B'). \item {\tt globalSeg}: a list of {\tt segSet}. It allows the notion of list of discontinuous segments (our use: a list of gene's models as a gene's model is stored as a list of its exons). In our sample, globalSeg will be the list of the 3 regions A,B and C. Note that it store more information than just a matrix with 2 columns containing all {\tt segments} of theses 3 regions. \end{itemize} \noindent For a better comprehension of other man pages, we introduce this notation: \begin{itemize} \item a segment is just a part of a line determined by two values (from and to) \item an object of class segSet is a set of {\tt n} segments, determined by a matrix {\tt n}x2 \item an object of class globalSeg is a set of segSets, determined by a list of matrix. \end{itemize} \subsubsection{Examples} We show segments usage with figure: \begin{itemize} \item Figure~\ref{segset} presents our objects \item Figure~\ref{unionseg} show union, intersection, and basic manipulation on object of class {\em segSet}. \item Figure~\ref{unionglobseg} show union, intersection, and basic manipulation on object of class {\em globalSeg}. \item Figure~\ref{unary} show unary operators on our objects. \end{itemize} \begin{center} \begin{figure} <>= a = matrix(c(1,5,15,45,17,38, 100,120,130,140, 135,145,142,160), ncol=2,byrow=TRUE) b = matrix(c(15,18, 28,45, 1,10, 15,20, 25,40, 17,23, 35,38,100,105, 110,120),ncol=2,byrow=TRUE) a <- as.segSet(a) b <- as.segSet(b) A = list( matrix( c( 1, 15, 17, 5, 45, 38),ncol=2), matrix( c( 100 , 120),ncol=2), matrix( c( 130, 135, 140, 145),ncol=2), matrix( c( 142 , 160),ncol=2)) B = list( matrix( c(15, 28, 18, 45),ncol=2), matrix( c(1, 15, 25, 10, 20, 40),ncol=2), matrix( c(17, 35, 23, 38),ncol=2), matrix( c(100, 110, 105, 120),ncol=2)) A = as.globalSeg(A) B = as.globalSeg(B) c = or(a,b) d = and(a,b) e = not(a,b) f = Xor(a,b) par(mfrow=c(4,1)) ##par(mfrow=c(2,1)) plot(a,xlim=c(1,160),main="a (segSet)") plot(b,xlim=c(1,160),main="b (segSet)") plot(A,xlim=c(1,160),main="A (globalSet)") plot(B,xlim=c(1,160),main="B (globalSet)") @ \caption{Representation of the two class of elements.} \label{segset} \end{figure} \end{center} \begin{center} \begin{figure} <>= g = or(a) h = and(a) #i = not(a) #j = Xor(a) C = or(A,B) D = and(A,B) E = not(A,B) F0 = Xor(A,B) G = or(A) H = and(A) #I = not(A) #J = Xor(A) par(mfrow=c(4,1)) ##par(mfrow=c(2,1)) plot(c,xlim=c(1,160),main="a or b") plot(d,xlim=c(1,160),main="a and b") plot(e,xlim=c(1,160),main="a not b") plot(f,xlim=c(1,160),main="a Xor b") @ \caption{Union, intersection, substraction on segments set.} \label{unionseg} \end{figure} \begin{figure} <>= par(mfrow=c(4,1)) ##par(mfrow=c(2,1)) plot(C,xlim=c(1,160),main="A or B") plot(D,xlim=c(1,160),main="A and B") plot(E,xlim=c(1,160),main="A not B") plot(F0,xlim=c(1,160),main="A Xor B") @ \caption{Union, intersection, substraction on global segments.} \label{unionglobseg} \end{figure} \noindent \begin{figure} <>= par(mfrow=c(4,1)) ##par(mfrow=c(2,1)) plot(G,xlim=c(1,160),main="or A") plot(H,xlim=c(1,160),main="and A") #plot(I,xlim=c(1,160),main="not A") #plot(J,xlim=c(1,160),main="Xor A") plot(g,xlim=c(1,160),main="or a") ##plot.segSet(h,xlim=c(1,160),main="and a") #plot(i,xlim=c(1,160),main="not a") #plot(j,xlim=c(1,160),main="Xor a") @ \label{unary} \caption{Unary operator.} \end{figure} \end{center} \subsection{Random sequences} \subsubsection{Presentation} We made two tools to generate a random sequence, or to randomize an existing sequence. Function {\em randomSeq} creates a random sequence from a distribution of nulcleotides, of ploy-nucleotides. A real composition of nucleotides can be use from function {\em compoSeq}, with param {\tt p=TRUE}. Function {\em shufleSeq} generate a sequence while assembling at random specific number of each nucleotides (or ploy-nucleotides). These number of nucleotide can be provided by function {\em compoSeq}, with param {\tt p=FALSE}: it is then a re-assemblage of all nucleotides (or tri-nucleotides, or ploy-nucleotides) of a real sequence. \subsubsection{Example} <>= ## Set seed of your choice (not requiered) set.seed(3) #### ---- RANDOMSEQ ---- ## Create a sequence of size 30, GC rich randomSeq(prob = c(0.20, 0.30, 0.20, 0.30), letters = c("T", "C","A", "G"), n = 30) ## [1] "CTGGAACCGAGGGGTTCATCCCCCCAGTGA" ## use with bi-nucleotides randomSeq(prob=rep(0.0625,16),letters = c("TT","TC","TA","TG", "CT","CC","CA","CG","AT","AC","AA","AG","GT","GC","GA","GG"),n=10) ## [1] "CGCATGATCCCAGGCTAACT" #### ---- SHUFFLESEQ ---- ## Create a sequence with 7 T, 3 C and A, and 4 G. shuffleSeq(count=c(7,3,3,4,0),letters=c("T","C","A","G","N")) ## [1] "TATCTTTTGTCGGACGA" ## Same with bi-nucleotides shuffleSeq(count=c(rep(4,4),rep(2,4),rep(1,4),rep(0,4)), letters = c("TT","TC","TA","TG","CT","CC","CA", "CG","AT","AC","AA","AG","GT","GC","GA","GG")) ## [1] "TCTTTCCATTCCTTCTAGTGTACCCGTATACGTGTCTGTGTACTTCAACAACTTAT" ## From a real sequence: seqNcbi("BY608190",file="BY608190.fa") readFasta("BY608190.fa") ## create a random sequence from a real tri-nucleotides distribution ## Size of sequence will be 10*3. randomSeq(compoSeq(wsize=3,p=TRUE),n=10) ## re assemble real tri-nucleotides of a real sequence shuffleSeq(compoSeq(wsize=3,from=1,to=30,p=FALSE)) @ \subsection{Profiles} \subsubsection{Overview} Computes profile(s) of user defined quantities around sites of interest in sequence fragments. Profile(s) is(are) constituted of bins of equal size around the sites of interest named origins. It produces for each bin, and for each quantity the mean, the standard deviation and the number of valid events. \subsubsection{Examples} <>= ## We create a sequence with a biais every 100 nucleotides s <- "" for(i in 1:20) s <- paste(s, randomSeq(n=100),randomSeq(prob=c(0.3,1,1,1,0)/3.3,n=100),sep="") placeString(s,seqno=0) dens <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20), seqno=0, fun=seqSkew,nbinL=10,nbinR=10,sizeBin=10) plot(dens$skta,main="TA skew") ## Example with flagged 'N' ## We create a sequence with a biais every 100 nucleotides s <- "" for(i in 1:20) s <- paste(s, randomSeq(n=100),randomSeq(prob=c(0.3,1,1,1,0.2)/3.5,n=100),sep="") placeString(s,seqno=1) dens2 <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20), seqno=1, fun=compoSeq,nbinL=10,nbinR=10,sizeBin=10) plot(dens2$T,main="#T") ## The same but more permissive (allow 4 N in each bin) dens3 <- densityProfile(ori=200*(1:20)-100,from=200*(0:19)+1,to=200*(1:20), seqno=1, fun=compoSeq,nbinL=10,nbinR=10,sizeBin=10,threshold=4) plot(dens3$T,main="#T") @ \section{A complete Example} In this section, we will gets intronic, 3' UTR and intergenic compositions on a whole chromosome (Human: chromosome 3). Data will be retrieve from Ucsc Mysql server, (it requiers to install RMySql package). First download a fasta file of chromosome 3: \begin{verbatim} download.file('http://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/chr3.fa.gz', destfile='chr3.fa.gz') \end{verbatim} Gets all information on this gene from Ucsc Mysql database: \begin{verbatim} library(RMySQL) con <- dbConnect("MySQL", host="genome-mysql.cse.ucsc.edu", user="genome", dbname="hg17") rs <- dbSendQuery(con, statement = paste( "SELECT * ", "FROM knownGene", "WHERE chrom = 'chr3'", "AND strand = '+'" )) ucscSQLdataKnownGene <- fetch(rs, n = -1) rs <- dbSendQuery(con, statement = paste( "SELECT * ", "FROM chr3_mrna" )) ucscSQLdataMRNA <- fetch(rs, n = -1) \end{verbatim} Now we transfrom matrix of data (n row: one by gene) to intervals (a list of n elements, one by gene). We will transform something like \begin{verbatim} ucscSQLdataMRNA[1:3,] \end{verbatim} to a list adapted to our computations: \begin{verbatim} Genes3plus <- apply(ucscSQLdataKnownGene,1,function(u){ matrix( c(as.integer(strsplit(as.character(u[9]),",")[[1]]), as.integer(strsplit(as.character(u[10]),",")[[1]])), ncol=2) }) Genes3plus <- as.globalSeg(Genes3plus) getsOneExon <- function(u) { start<- as.integer(strsplit(as.character(u[22]),",")[[1]]) stop <- as.integer(strsplit(as.character(u[20]),",")[[1]]) + start if(length(start) >0) return(matrix(c(start,stop),ncol=2)) else return(NULL) } AllMrna3 <- apply(ucscSQLdataMRNA,1,getsOneExon) AllMrna3 <- as.golbalSeg(AllMrna3) rm(ucscSQLdataKnownGene,ucscSQLdataMRNA,rs) AllMrna3[1:4] \end{verbatim} Here we gets {\tt Genes3plus}: an object with all known human genes on strand forward and on chromosome 3, and {\tt AllMrna3}: an object with all possible human genes on chromosome 3. No we easily deduce all known introns (on strand forward), and all possible genic regions: \begin{verbatim} Genes3ranges <- range(Genes3plus) possibleExons <- or(AllMrna3) rangeDNA <- matrix(range.globalSeg(possibleExons,global=TRUE),ncol=2) notExon <- xorRecouvr(as.matrix(possibleExons),rangeDNA) knownIntrons <- xorRecouvr(as.matrix(Genes3plus),as.matrix(Genes3ranges)) ## Safe Introns safeIntrons <- and(knownIntrons,notExon) rangeMRNA <- range.intervals(possibleExons) safeIntergenic <- xorRecouvr(as.matrix(possibleExons),as.matrix(rangeMRNA)) \end{verbatim} Now compute\dots \begin{verbatim} readFasta("chr3.fa") safeIntrons <- as.matrix(safeIntrons) safeIntergenic <- as.matrix (safeIntergenic) composIntrons <-compoSeq(from=safeIntrons[,1],to=safeIntrons[,2]) composIntergenic <-compoSeq(from=safeIntergenic[,1],to=safeIntergenic[,2]) composIntrons composIntergenic \end{verbatim} \section*{Aknowledgments} A. Viary \end{document}