---
title: "Secondary identifiers"
author: "Egon Willighagen"
package: BridgeDbR
output:
  BiocStyle::html_document:
    toc_float: true
    includes:
      in_header: bioschemas.html
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{Secondary IDs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: tutorial.bib
---

# Introduction

Databases use identifiers to point to records. One of the goals of BridgeDb is
to map the identifier of a record in one database to the identifier of a
matching record in another database. This is explained in the Tutorial vignette.

However, within a single database, linking one identifier to another can also
be useful. Many databases have older and newer identifiers, where the newer
identifier replaces the older one. For example, [HGNC](https://www.genenames.org/)
replaces symbols to make naming more consistent or to resolve ambiguity.

The recent [sec2pri](https://github.com/sec2pri) and BridgeDb [Tiwid](https://github.com/bridgedb/tiwid)
projects track outdated identifiers. These projects define secondary identifiers
as identifiers that should not be used anymore. Some of them have been replaced
by new identifiers, called primary identifiers.

This vignette explains how to use `sec2pri` BridgeDb identifier mapping files
to detect secondary identifiers and, where possible, to suggest the primary
identifier that replaces them.

The [Bioconductor BridgeDbR package](https://doi.org/10.18129/B9.bioc.BridgeDbR)
page describes how to install the package. After installation, the library can be loaded with the following command:

```{r, eval=FALSE}
library(BridgeDbR)
```

## Downloading sec2pri databases

The first step is to download a BridgeDb identifier mapping database. This
database is a `.bridge` file, just like those from the BridgeDb website. The difference
is that these `sec2pri` databases focus on the primary and secondary identifiers of
a single data source.

Here we will use ChEBI [@Hastings2016] as an example. The files can be downloaded
as artifacts produced by one of the GitHub Actions workflows in the `mapping_preprocessing`
repository, specifically the [ChEBI workflow](https://github.com/sec2pri/mapping_preprocessing/actions/workflows/chebi.yml).

Click the most recent `Check and test ChEBI updates` run, and note the
artifacts:

![](chebi_action_run_with_artifacts.png)

Click the `chebi_processed` artifact and save it locally as a `chebi_processed.zip`
file. Unzip the file to get the `ChEBI_secID2priID.bridge` file.
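The unzip step can also be done from within R. The following is a minimal sketch that uses `utils::unzip()` from base R; it assumes the archive was saved as `chebi_processed.zip` in the current working directory, and the variable name `zipFile` is only illustrative.

```r
# Assumption: the artifact was saved under this name in the working directory
zipFile <- "chebi_processed.zip"

if (file.exists(zipFile)) {
  # List the archive contents first, to confirm the .bridge file is present
  print(utils::unzip(zipFile, list = TRUE)$Name)
  # Extract all files into the current working directory
  utils::unzip(zipFile, exdir = ".")
} else {
  message("Archive not found: ", zipFile)
}
```

If the archive is stored elsewhere, adjust `zipFile` and `exdir` accordingly.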

## Loading the sec2pri database

The downloaded file is then loaded as follows (see also the Tutorial vignette):

```{r, eval=FALSE}
sec2pri = BridgeDbR::loadDatabase("ChEBI_secID2priID.bridge")
```

## Analyzing ChEBI identifiers

Let's say we have the ChEBI identifier `CHEBI:1234` in our dataset and we want to
know whether it is a primary or a secondary identifier. We can ask the `sec2pri`
database which other identifiers it is mapped to:

```{r, eval=FALSE}
BridgeDbR::map(sec2pri, source="Ce", identifier="CHEBI:1234")
```

The output looks like this:

```
  source identifier target     mapping isPrimary
1     Ce CHEBI:1234     Ce  CHEBI:1234         F
2     Ce CHEBI:1234     Ce CHEBI:19730         F
3     Ce CHEBI:1234     Ce CHEBI:28423         T
```

The first two columns are your input identifier, and columns three and four are
the mapped identifiers. The fifth column indicates whether the mapped identifier is
primary (`T`) or secondary (`F`). The first row maps `CHEBI:1234` to itself with
`F`, so we see here that `CHEBI:1234` is a secondary identifier.

We also see that there is a matching primary identifier, `CHEBI:28423`.
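The same reading of the table can be done in code. The sketch below rebuilds the output shown above as a plain data frame, so that it runs without the database, and then inspects the row in which the query identifier maps to itself. It assumes that `isPrimary` is returned as the strings `"T"` and `"F"` and that such a self-mapping row is present; the names `selfRows` and `querySecondary` are only illustrative.

```r
# The mapping table, copied by hand from the output shown above
mappedIDs <- data.frame(
  source     = c("Ce", "Ce", "Ce"),
  identifier = c("CHEBI:1234", "CHEBI:1234", "CHEBI:1234"),
  target     = c("Ce", "Ce", "Ce"),
  mapping    = c("CHEBI:1234", "CHEBI:19730", "CHEBI:28423"),
  isPrimary  = c("F", "F", "T"),
  stringsAsFactors = FALSE
)

# Rows in which the query identifier maps to itself
selfRows <- mappedIDs[mappedIDs[, "mapping"] == mappedIDs[, "identifier"], ]

# The query identifier is secondary when its own row is flagged "F"
querySecondary <- any(selfRows[, "isPrimary"] == "F")
querySecondary
```

For this table, `querySecondary` is `TRUE`.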

We can extract the primary identifier with regular R code, for example:

```{r, eval=FALSE}
mappedIDs = BridgeDbR::map(sec2pri, source="Ce", identifier="CHEBI:1234")
# keep only the ChEBI ("Ce") rows that are flagged as primary
mappedIDs[which(mappedIDs[,"target"] == "Ce" & mappedIDs[,"isPrimary"] == "T"),]
```
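When many identifiers need to be checked, it can help to wrap this extraction in a small function that only returns a replacement when it is unambiguous. The following is a sketch, not part of BridgeDbR: `primaryOrNA` is a hypothetical helper, and the data frame is a hand-made copy of the output shown earlier so that the example runs without the database.

```r
# Hypothetical helper (not part of BridgeDbR): return the primary identifier
# if exactly one is found for the given target system code, otherwise NA
primaryOrNA <- function(mappedIDs, target = "Ce") {
  keep <- which(mappedIDs[, "target"] == target &
                mappedIDs[, "isPrimary"] == "T")
  primaries <- unique(mappedIDs[keep, "mapping"])
  if (length(primaries) == 1) primaries else NA_character_
}

# The mapping table, copied by hand from the output shown earlier
exampleIDs <- data.frame(
  source     = c("Ce", "Ce", "Ce"),
  identifier = c("CHEBI:1234", "CHEBI:1234", "CHEBI:1234"),
  target     = c("Ce", "Ce", "Ce"),
  mapping    = c("CHEBI:1234", "CHEBI:19730", "CHEBI:28423"),
  isPrimary  = c("F", "F", "T"),
  stringsAsFactors = FALSE
)

primaryOrNA(exampleIDs)        # "CHEBI:28423"
primaryOrNA(exampleIDs[1:2, ]) # NA, because no primary row is left
```

Returning `NA` for zero or multiple primary identifiers is a design choice made here to flag cases that need manual inspection.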

## Conclusion

With the appropriate `sec2pri` mapping database, you can use this approach to identify
secondary identifiers and, where possible, replace them with the matching primary identifier.

Please keep in mind that most databases have their own reasons for replacing
an identifier, and their own rules for how to do so. In some databases the
relationships are not 1-to-1, and manual inspection is then recommended.

# References {.unnumbered}

# Session info

Here is the output of `sessionInfo()` on the system on which this document was
compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r sessionInfo, echo=FALSE}
sessionInfo()
```