%\VignetteIndexEntry{STRINGdb Vignette} %\VignetteDepends{} %\VignetteKeywords{STRINGdb} %\VignettePackage{STRINGdb} \documentclass[10pt]{article} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \SweaveOpts{concordance=TRUE, strip.white=false} \title{STRINGdb Package Vignette} \author{Andrea Franceschini and Damian Szklarczyk} \date{} \maketitle \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} \section{INTRODUCTION} STRING (https://www.string-db.org) is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. Each interaction is associated with a combined confidence score that integrates the various evidences. STRING v12.0 covers 59.3 million proteins from 12,535 organisms and more than 20 billion interactions. As you will learn in this guide, the STRING database can be useful to add meaning to lists of genes, for example the best hits coming out from a screen or the most differentially expressed genes coming out from a microarray or RNA-seq experiment. We provide the STRINGdb R package in order to facilitate our users in accessing the STRING database from R. In this guide we explain, with examples, most of the package's features and functionalities.\\ In the STRINGdb R package we use the new ReferenceClasses of R (search for "ReferenceClasses" in the R documentation.). Besides we make use of the iGraph package (http://igraph.sourceforge.net) as a data structure to represent our protein-protein interaction network. \\ To begin, you should first know the NCBI taxonomy identifiers of the organism on which you have performed the experiment (e.g. 9606 for Human, 10090 for mouse). If you don't know that, you can search the NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) or start looking at our species table (that you can also use to verify that your organism is represented in the STRING database). \\ Hence, if your species is not Human (i.e. our default species), you can find it and their taxonomy identifiers on STRING webpage under the 'organisms' section (https://string-db.org/cgi/input.pl?input\_page\_active\_form=organisms) or download the full list in the download section of STRING website. \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} <>= library(STRINGdb) string_db <- STRINGdb$new( version="12.0", species=9606, score_threshold=400, network_type="full", link_data="detailed", input_directory="") @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} As it has been shown in the above commands, you start instantiating the STRINGdb reference class. In the constructor of the class you can also define the STRING version to be used and a threshold for the combined scores of the interactions, such that any interaction below that threshold is not loaded in the object (by default the score threshold is set to 400).\\ You can also specify the network type "functional" for full functional STRING network or "physical" for physical subnetwork, which link only the proteins which share a physical complex.\\ Besides, if you specify a local directory to the parameter input-directory, the database files will be downloaded into this directory and most of the methods can be used off-line. Otherwise, the database files will be saved and cached in a temporary directory that will be cleaned automatically when the R session is closed. By setting \texttt{link\_data="detailed"} we also load the evidence score columns, such as the experimental score, in addition to the combined score.\\\\ For a better understanding of the package two other commands can be useful: <>= STRINGdb$methods() # To list all the methods available. STRINGdb$help("get_graph") # To visualize their documentation. @ For all the methods that we are going to explain below, you can always use the help function in order to get additional information/parameters with respect to those explained in this guide.\\\\ As an example, we use the analyzed data of a microarray study taken from GEO (Gene Expression Omnibus, GSE9008). This study investigates the activity of Resveratrol, a natural phytoestrogen found in red wine and a variety of plants, in A549 lung cancer cells. Microarray gene expression profiling after 48 hours exposure to Revestarol has been performed and compared to a control composed by A549 lung cancer cells threated only with ethanol. This data is already analyzed for differential expression using the limma package: the genes are sorted by fdr corrected pvalues and the log fold change of the differential expression is also reported in the table. <>= data(diff_exp_example1) head(diff_exp_example1) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} As a first step, we map the gene names to the STRING database identifiers using the "map" method. In this particular example, we map from gene HUGO names, but our mapping function supports several other common identifiers (e.g. Entrez GeneID, ENSEMBL proteins, RefSeq transcripts ... etc.).\\\\ The map function adds an additional column with STRING identifiers to the dataframe that is passed as first parameter. <>= example1_mapped <- string_db$map( diff_exp_example1, "gene", removeUnmappedRows = TRUE ) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} As you may have noticed, the previous command prints a warning showing the number of genes that we failed to map. In this particular example, we cannot map all the probes of the microarray that refer to position of the chromosome that are not assigned to a real gene (i.e. all the LOC genes). If we remove all these LOC genes before the mapping we obtain a much lower percentage of unmapped genes (i.e. < 6 \%). \\ If you set to FALSE the "removeUnmappedRows" parameter, than the rows which corresponds to unmapped genes are left and you can manually inspect them.\\ Finally, we extract the genes with p-value < 0.05 and, for the examples below, we focus on the top 200 genes sorted by p-value. The image shows clearly the genes and how they are possibly functionally related. On the top of the plot, we insert a pvalue that represents the probability that you can expect such an equal or greater number of interactions by chance. <>= options(SweaveHooks=list(fig=function() par(mar=c(2.1, 0.1, 4.1, 2.1)))) @ \setkeys{Gin}{width=1.2\linewidth} <>= example1_mapped_pval05 <- subset(example1_mapped, pvalue < 0.05) example1_mapped_pval05 <- example1_mapped_pval05[order(example1_mapped_pval05$pvalue), ] hits <- example1_mapped_pval05$STRING_id[1:200] @ <>= string_db$plot_network( hits ) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} If you want more control over the appearance of the network image, the \texttt{get\_png} and \texttt{get\_link} methods expose additional styling options. In particular, you can switch between \texttt{evidence} and \texttt{confidence} edge styles, hide node labels, hide disconnected proteins, disable the structure pictures inside bubbles, use the flat node design, center node labels, and change the label font size. \section{PAYLOAD MECHANISM} This R library provides the ability to interact with the STRING payload mechanism. The payload appears as an additional colored "halo" around the bubbles. For example, this allows to color in green the genes that are down-regulated and in red the genes that are up-regulated. For this mechanism to work, we provide a function that posts the information on our web server. <>= # filter by p-value and add a color column # (i.e. green down-regulated gened and red for up-regulated genes) example1_mapped_pval05 <- string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05), logFcColStr="logFC" ) @ <>= # post payload information to the STRING server payload_id <- string_db$post_payload( example1_mapped_pval05$STRING_id, colors=example1_mapped_pval05$color ) @ <>= # display a STRING network png with the "halo" string_db$plot_network( hits, payload_id=payload_id ) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} \section{ENRICHMENT} We provide a method to compute the enrichment in Gene Ontology (Process, Function and Component), KEGG and Reactome pathways, PubMed publications, UniProt Keywords, and PFAM/INTERPRO/SMART domains for your set of proteins all in one simple call. The enrichment itself is computed using an hypergeometric test and the FDR is calculated using Benjamini-Hochberg procedure. <>= enrichment <- string_db$get_enrichment( hits ) head(enrichment, n=20) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} You can also retrieve an enrichment figure directly from STRING for a selected category. The image can be returned as a PNG array or as SVG markup, and you can customize the category, similarity-based grouping, color palette, the number of terms shown, and the variable displayed on the X-axis. <>= enrichment_img <- string_db$get_enrichment_figure( hits, category = "Component", group_by_similarity = 0.8, color_palette = "yellow_pink", x_axis = "signal" ) grid::grid.newpage() grid::grid.raster(enrichment_img, interpolate = FALSE) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} If you have performed your experiment on a predefined set of proteins, it is important to run the enrichment statistics using that set as a background (otherwise you would get a wrong p-value !). Hence, before to launch the method above, you may want to set the background: <>= backgroundV <- example1_mapped$STRING_id string_db$set_background(backgroundV) @ If you already have a \texttt{STRINGdb} object, calling \texttt{set\_background} is sufficient and you do not need to instantiate a new object again. Passing \texttt{backgroundV} to the constructor is simply an alternative when you already know the background at initialization time: <>= string_db <- STRINGdb$new( score_threshold=400, backgroundV = backgroundV ) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} Once the background has been set, the enrichment results can change substantially because the statistical test is now performed against the selected background rather than the whole proteome. In many cases the enrichment signals become smaller, which is expected: the test is now calibrated against the proteins that were actually measurable or considered in the experiment, rather than against all proteins in the organism. <>= enrichment_background_corrected <- string_db$get_enrichment( hits ) head(enrichment_background_corrected, n=20) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} If you just want to know which terms are assigned to your set of proteins, and not necessarily which ones are enriched, you can use "get\_annotations" method. This method outputs the terms from most categories (the exceptions are KEGG terms due to licensing issues and PubMed due to the size of the output) that are associated with your set of proteins. <>= annotations <- string_db$get_annotations( hits ) head(annotations, n=20) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} \section{CLUSTERING} The iGraph package provides several clustering/community algorithms: "fastgreedy", "walktrap", "spinglass", "edge.betweenness". We encapsulate this in an easy-to-use function that returns the clusters in a list. <>= # get clusters clustersList <- string_db$get_clusters(example1_mapped_pval05$STRING_id[1:600]) @ Now we visualize the first four clusters one by one. <>= par(mar = c(0, 0, 0.5, 0)) string_db$plot_network(clustersList[[1]], add_summary = FALSE) par(mar = c(0, 0, 0.5, 0)) string_db$plot_network(clustersList[[2]], add_summary = FALSE) par(mar = c(0, 0, 0.5, 0)) string_db$plot_network(clustersList[[3]], add_summary = FALSE) par(mar = c(0, 0, 0.5, 0)) string_db$plot_network(clustersList[[4]], add_summary = FALSE) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} \section{ADDITIONAL PROTEIN INFORMATION} You can get a table that contains all the proteins that are present in our database of the species of interest. The protein table also includes the preferred name, the size and a short description of each protein. <>= string_proteins <- string_db$get_proteins() head(string_proteins) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} In the following section we will show how to query STRING with R on some specific proteins. In the examples, we will use the famous tumor proteins TP53 and ATM .\\\\ First we need to get the STRING identifier of those proteins, using our mp method: <>= tp53 = string_db$mp( "tp53" ) atm = string_db$mp( "atm" ) @ The mp method (i.e. map proteins) is an alternative to our map method, to be used when you need to map only one or few proteins.\\ It takes in input a vector of protein aliases and returns a vector with the STRING identifiers of those proteins. \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} You can retrieve the interactions that connect certain input proteins between each other.\\ The "get\_interactions" method returns a data frame describing the interactions among the queried proteins. The table contains the STRING identifiers in the \texttt{from} and \texttt{to} columns, the corresponding preferred names in \texttt{from\_name} and \texttt{to\_name}, the \texttt{combined\_score}, and any additional evidence scores available for the current \texttt{link\_data} setting. This makes it easy to inspect both the interacting proteins and the supporting scores for each interaction. <>= string_db$get_interactions( c(tp53, atm) ) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} If you want to retrieve the first-shell interaction partners of one or more proteins from the locally loaded graph, you can use the "get\_interaction\_partners" method.\\ The output keeps the same interaction-table structure as "get\_interactions". <>= string_db$get_interaction_partners( c(tp53, atm), limit=5 )[, c("from", "to", "combined_score", "experimental", "from_name", "to_name")] @ You can also output only interactions supported by a particular type of evidence. Here we retrieve only experimentally supported interactions, keeping only the protein names together with the experimental score: <>= interaction_partners <- string_db$get_interaction_partners( c(tp53, atm), limit=5 ) interaction_partners_exp <- subset(interaction_partners, experimental > 0, select = c("from_name", "to_name", "experimental")) interaction_partners_exp @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} STRING provides a way to get homologous proteins: in our database we store ALL-AGAINST-ALL alignments within species. You can retrieve all paralogs of the protein using "get\_paralogs" method. <>= # Get all homologs of TP53 in human. string_db$get_paralogs(tp53) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} STRING also stores best hits (as measured by bitscore) between proteins from different species. "get\_homologs\_besthits" lets you retrieve these homologs. <>= # get the best hits of TP53 in all STRING species tp53_besthits <- string_db$get_homologs_besthits(tp53) tp53_besthits_100 <- subset(tp53_besthits, bitscore > 100) nrow(tp53_besthits_100) head(tp53_besthits_100, n=10) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} ... or you can specify the species of interest: <>= # get the homologs of the following two proteins in the mouse (i.e. species_id=10090) string_db$get_homologs_besthits(c(tp53, atm), target_species_id=10090) @ \begin{Schunk} \begin{Sinput} \end{Sinput} \end{Schunk} \section{CITATION} Please cite:\\\\ Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-D646. doi: 10.1093/nar/gkac1000. \end{document}