---
title: "Contribution guidelines"
author:
    - name: Christophe Vanderaa
    - name: Laurent Gatto
output:
    BiocStyle::html_document:
        self_contained: yes
        toc: true
        toc_float: true
        toc_depth: 2
        code_folding: show
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('scpdata')`"
vignette: >
    %\VignetteIndexEntry{Contribution guidelines}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

Welcome to the `scpdata` package, and thank you for your interest in
contributing!

The `scpdata` data package is a repository of curated mass
spectrometry-based single-cell proteomics (SCP) datasets. The purpose
of `scpdata` is to provide users with streamlined access to
high-quality SCP data, alleviating the need for time-consuming data
wrangling. We currently provide data at the peptide-to-spectrum match
(PSM) level, the peptide level and/or the protein level. The package
also encompasses a large diversity of technologies, including DDA and
DIA, label-free and multiplexed experiments from various laboratories
such as the Slavov Lab, the Kelly Lab, and the Schoof Lab.

Contributions are very much welcome. We happily accept major
contributions, such as adding a new dataset, as well as minor
contributions, such as fixing typos or improving current documentation.

To facilitate our collaboration, this vignette will guide you through
the process of adding a new dataset to the package. We will first get
you started with some basic guidelines on how to contribute using
GitHub. We'll proceed with a description of the data structure and the
data pieces we expect. Next, we will provide an overview of the
package's folder structure to help you navigate through the project.
Finally, we'll explain the workflow you should follow to add your
dataset to the repository.

# Getting started with GitHub

1. Fork the `scpdata` GitHub repository ([click
   here](https://github.com/UCLouvain-CBIO/scpdata/fork)). 
2. Clone the forked repo locally using `git`:

```
git clone git@github.com:YOUR_USER_NAME/scpdata
```
3. Adapt the cloned repo as desired. Do not forget to regularly
   `git commit` your changes.
4. Once finished, send your improvements and/or new features as a [pull
   request](https://github.com/UCLouvain-CBIO/scpdata/compare).

If you have any questions or face any hurdles, do not hesitate to open
a [new
issue](https://github.com/UCLouvain-CBIO/scpdata/issues/new/choose)
and we'll be happy to provide additional guidance. 

# What do we expect?

## `QFeatures` object

All datasets in `scpdata` are stored in a `QFeatures` object (see
[intro
vignette](https://uclouvain-cbio.github.io/scp/articles/QFeatures_nutshell.html)).
The object is created following the
[`scp`](https://github.com/UCLouvain-CBIO/scp) data framework, as
described in [this short
demo](https://uclouvain-cbio.github.io/scp-teaching/read_scp_data).

### Feature data

We refer to feature data as the data generated by MS data
identification and quantification tools. Depending on the tool,
features may represent PSMs, peptides and/or proteins. For instance,
MaxQuant provides an `evidence.txt` file with PSM-level information,
a `peptides.txt` file with peptide-level information and
`proteinGroups.txt` with protein-level information. We encourage
adding as many of the three feature levels as possible when
contributing a dataset to `scpdata`. 

For each feature, the tools provide quantification data as well as
feature annotations. These two pieces of information should be
stored separately within a `SingleCellExperiment` object: feature
annotations are stored in the `rowData` and the quantitative values
are stored in the `assay`.
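
As a minimal sketch of this separation, the example below splits a small
table into an assay and a `rowData`. The table, its column names and the
assay name `intensity` are hypothetical and only serve as illustration;
real exports contain many more columns.

```r
library(SingleCellExperiment)

## Hypothetical peptide table, as exported by a quantification tool:
## two annotation columns followed by two quantification columns.
peptides <- data.frame(
    Sequence = c("AAGLK", "LVEQK", "PEPTIDER"),
    Protein = c("P1", "P1", "P2"),
    cell_1 = c(1200, 850, NA),
    cell_2 = c(980, NA, 430),
    stringsAsFactors = FALSE
)
quantCols <- c("cell_1", "cell_2")

## The quantitative values go in the assay ...
quant <- as.matrix(peptides[, quantCols])
rownames(quant) <- peptides$Sequence
## ... and the remaining columns go in the rowData.
annot <- DataFrame(
    peptides[, !colnames(peptides) %in% quantCols],
    row.names = peptides$Sequence
)
sce <- SingleCellExperiment(
    assays = list(intensity = quant),
    rowData = annot
)
```

In practice, `scp::readSingleCellExperiment()` performs this split for
you from a table and the names of its quantification columns.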

### Sample annotations

Sample annotations contain information about each sample (single cell)
in the dataset. This information is generated by the experimenter
and should contain biological descriptors, such as the cell line or
the treatment applied, and technical descriptors, such as the day of
acquisition, the acquisition batch, the LC batch, etc. The sample
annotations are stored in the `colData` of the `QFeatures` object. 
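
When descriptors are encoded in the sample names, they can be extracted
into a proper annotation table. The sketch below assumes hypothetical
sample names of the form `CellLine_Batch_Well`; adapt the separator and
the column names to your own data.

```r
## Hypothetical sample names encoding the cell line, the acquisition
## batch and the well position, separated by underscores.
sampleNames <- c("HeLa_batch1_A1", "HeLa_batch2_A2", "U937_batch1_B1")

## Split each name into its descriptors, one column per descriptor.
fields <- do.call(rbind, strsplit(sampleNames, "_", fixed = TRUE))
sampleAnnotation <- data.frame(
    CellLine = fields[, 1],
    Batch = fields[, 2],
    Well = fields[, 3],
    row.names = sampleNames,
    stringsAsFactors = FALSE
)
```

The row names of the table are the sample names, so they can be matched
to the column names of the assays when the table is used as `colData`.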

If you want to contribute to `scpdata` with a dataset you generated
yourself, we suggest you read the last section of the initial
recommendations for SCP experiments, which provides a comprehensive
discussion of the descriptors you should collect:

> Gatto, Laurent, Ruedi Aebersold, Juergen Cox, Vadim Demichev, Jason
> Derks, Edward Emmott, Alexander M. Franks, et al. 2023. “Initial
> Recommendations for Performing, Benchmarking and Reporting
> Single-Cell Proteomics Experiments.” Nature Methods 20 (3): 375–86. 

## Experiment description

We also require the collection of experimental data that describes the
dataset. This information is commonly retrieved from the publication
associated with the dataset and provides a scientific context to the
dataset. This information is used for building the dataset
documentation.

## Data source information

Finally, the `ExperimentHub` project, on which `scpdata` relies,
requires a thorough description of the data sources for every
dataset.

# Folder structure

We here provide an overview of the key folders and files relevant when
contributing a new dataset. The current files may provide a source of
inspiration when preparing a new dataset.

## inst/scripts/

The folder contains all R scripts used to generate the `QFeatures`
objects from the source files, one script for each dataset. Each
script is named as follows: `make-data_` + `DATASET_NAME` + `.R`. 

Note the file called `make-metadata.R`. It generates a CSV table
required by `ExperimentHub` where each line corresponds to a dataset
and the columns contain the data source information. The table is 
stored in `inst/extdata/metadata.csv`, which should never be changed
manually.

## R/

The folder contains 3 R scripts, but new contributions should only
consider `data.R` and can safely ignore the other two. The
`data.R` script contains the documentation for each dataset, formatted
using `roxygen2` markup. 

## man/

The folder contains the compiled documentation manuals, one for each
dataset. These were automatically generated by `roxygen2` and
should never be changed manually.

# Workflow

In practice, contributing a new dataset involves 6 steps.

## 1. Collect data

If you want to contribute an already published dataset, identify the
data sources for all feature data and the sample annotations. This is
generally provided in the article, but you may need to request
additional information from the authors.

If you want to contribute your own dataset, make sure that all
feature data and the sample annotation table are available from a
public repository (e.g. PRIDE, MassIVE or Zenodo).

## 2. Create the `QFeatures` object

Create a new R script, `inst/scripts/make-data_DATASET_NAME.R`, which
contains all the code to convert the source data into the
`QFeatures` object. Here are some tips and tricks for generating a
high-quality dataset:

- Sample annotations are often cluttered, and spread over different
  tables or contained within sample names. Generating high-quality
  sample annotations may be time-consuming and frustrating, but don't
  overlook this task: sample annotations are essential for rigorous
  and accurate downstream analysis.
- Converting feature data tables and annotation tables into
  `QFeatures` or `SingleCellExperiment` objects can be streamlined
  using
  [`scp::readSCP()`](https://uclouvain-cbio.github.io/scp/reference/readSCP.html)
  and
  [`scp::readSingleCellExperiment()`](https://uclouvain-cbio.github.io/scp/reference/readSingleCellExperiment.html),
  respectively.
- Always start with the lowest feature level (e.g. PSMs). If available,
  you should add peptide and protein data using
  [`QFeatures::addAssay()`](https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-class.html).
  You should then add links between the assays. This is streamlined
  using
  [`QFeatures::addAssayLink()`](https://rformassspectrometry.github.io/QFeatures/reference/AssayLinks.html).
- Make sure to add data with as little processing as possible. For
  instance, MaxQuant provides peptide intensities, but also iBAQ and
  MaxLFQ normalised values. You should favour the former over the
  latter two, which you could add as supplementary assays (for
  example, see
  [here](https://github.com/UCLouvain-CBIO/scpdata/blob/master/inst/scripts/make-data_woo2022_macrophage.R)).
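
The assembly steps above can be sketched as follows. This is a toy
illustration under stated assumptions, not the code used for any
existing dataset: the values, the assay names `psms` and `peptides`,
and the linking variable `Sequence` are all hypothetical.

```r
library(QFeatures)
library(SingleCellExperiment)

## Hypothetical toy data: 3 PSMs mapping to 2 peptides, in 2 cells.
cells <- c("cell_1", "cell_2")
psmNames <- paste0("PSM", 1:3)
pepNames <- c("AAGLK", "LVEQK")
psms <- SingleCellExperiment(
    assays = list(matrix(c(10, 20, 30, 40, 50, 60), ncol = 2,
                         dimnames = list(psmNames, cells))),
    rowData = DataFrame(Sequence = c("AAGLK", "AAGLK", "LVEQK"),
                        row.names = psmNames)
)
peptides <- SingleCellExperiment(
    assays = list(matrix(c(30, 30, 90, 60), ncol = 2,
                         dimnames = list(pepNames, cells))),
    rowData = DataFrame(Sequence = pepNames, row.names = pepNames)
)
sampleAnnotation <- DataFrame(CellLine = c("HeLa", "U937"),
                              row.names = cells)

## Start from the lowest feature level ...
dataset_name <- QFeatures(list(psms = psms), colData = sampleAnnotation)
## ... add the next level ...
dataset_name <- addAssay(dataset_name, peptides, name = "peptides")
## ... and link the two assays through a shared rowData variable.
dataset_name <- addAssayLink(dataset_name, from = "psms", to = "peptides",
                             varFrom = "Sequence", varTo = "Sequence")
```

With real data, the first step is typically replaced by a call to
`scp::readSCP()`, and protein data would be added and linked to the
peptides in the same way.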

## 3. Document the dataset

Add the data documentation and the data collection procedure in 
`scpdata/R/data.R`, using the `roxygen2` markup language. The
documentation is structured as follows, but the easiest approach is to
use the documentation of an existing dataset as a template:

- *Title*: First author et al. Year (Journal): minimal description.
- *Description*: short description of the dataset. What and how many
  cells were acquired? What technology? What is the research question?
- *Format*: describe your `QFeatures` object. Describe each assay,
  namely which feature level it contains, the number of features and
  the number of cells/samples.
- *Data acquisition*: summarise the data acquisition protocol, namely
  the sample isolation, sample preparation, liquid chromatography,
  mass spectrometry and raw data processing.
- *Data collection*: summarise the steps you undertook to generate the
  `QFeatures` object, and where to find the script you created.
- *Source*: link to the public repository with the source data.
- *References*: if published, refer to the original work that
  acquired the data.
- *Example*: add an example to show how to retrieve the dataset. To
  avoid the associated overhead when testing the package, we recommend
  adding the example as follows: 

```
##' \donttest{
##' dataset_name()
##' }
```
- *Keywords*: add the line `##' @keywords datasets`.
- `"dataset_name"`: end the documentation with the name of your
  dataset, ensuring your dataset is correctly exported. 
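
Put together, a documentation entry could look like the skeleton below.
This is a hedged sketch: the section titles mirror the list above, all
capitalised text is a placeholder to replace, and the exact tags used in
the existing entries of `R/data.R` should take precedence if they
differ.

```r
##' FIRST_AUTHOR et al. YEAR (JOURNAL): MINIMAL DESCRIPTION
##'
##' SHORT DESCRIPTION OF THE DATASET.
##'
##' @format A `QFeatures` object with N assays. DESCRIBE EACH ASSAY.
##'
##' @section Data acquisition:
##'
##' SUMMARY OF THE ACQUISITION PROTOCOL.
##'
##' @section Data collection:
##'
##' SUMMARY OF THE STEPS IN `inst/scripts/make-data_dataset_name.R`.
##'
##' @source LINK TO THE PUBLIC REPOSITORY.
##'
##' @references REFERENCE TO THE ORIGINAL WORK.
##'
##' @examples
##' \donttest{
##' dataset_name()
##' }
##'
##' @keywords datasets
"dataset_name"
```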

## 4. Update metadata
 
Add the data source information in the `inst/scripts/make-metadata.R`
script and run the complete script to update
`inst/extdata/metadata.csv`. You can use a previous dataset as a
template. All fields are mandatory: Title, Description, BiocVersion,
Genome, SourceType, SourceUrl, SourceVersion, Species, TaxonomyId,
Coordinate_1_based, DataProvider, Maintainer, RDataClass,
DispatchClass, PublicationDate, NumberAssays, PreprocessingSoftware,
LabelingProtocol, PsmsAvailable, PeptidesAvailable, ProteinsAvailable,
ContainsSingleCells, Notes. See
`?ExperimentHubData::makeExperimentHubMetadata` for a comprehensive
description of the fields. 

Next, ensure that your updated `metadata.csv` file is valid by
running `ExperimentHubData::makeExperimentHubMetadata("scpdata")`.
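
Before running the validation, a quick check that no mandatory field was
forgotten can save a round trip. The sketch below is not part of
`scpdata` or `ExperimentHubData`: `findMissingFields()` is a
hypothetical helper, the toy entry uses placeholder values, and the
field list is simply copied from the paragraph above.

```r
## Fields listed above as mandatory.
requiredFields <- c(
    "Title", "Description", "BiocVersion", "Genome", "SourceType",
    "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataClass",
    "DispatchClass", "PublicationDate", "NumberAssays",
    "PreprocessingSoftware", "LabelingProtocol", "PsmsAvailable",
    "PeptidesAvailable", "ProteinsAvailable", "ContainsSingleCells",
    "Notes"
)

## Hypothetical helper: return the mandatory fields that are absent
## from the columns of a metadata table.
findMissingFields <- function(meta) {
    setdiff(requiredFields, colnames(meta))
}

## Toy entry with placeholder values, deliberately incomplete.
entry <- data.frame(
    Title = "dataset_name",
    RDataClass = "QFeatures",
    stringsAsFactors = FALSE
)
missingFields <- findMissingFields(entry)
```

The same helper can be applied to the generated table, for instance
`findMissingFields(read.csv("inst/extdata/metadata.csv"))`, in which
case an empty result means that all the listed columns are present.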

## 5. Create a pull request

Push any change you made to GitHub and open a pull request to notify
us of your contribution. The pull request should include all the
commits related to the dataset you want to contribute. In the
description, indicate where we can retrieve your `QFeatures` object,
e.g. on Zenodo.
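
A minimal sketch for exporting the object as an Rda file before
uploading it is shown below. The file name, the temporary directory and
the compression settings are assumptions, and a plain list stands in for
the `QFeatures` object created by your `make-data` script.

```r
## Stand-in for the QFeatures object created by your make-data script.
dataset_name <- list(placeholder = TRUE)

## Save the object under its dataset name, with strong compression to
## keep the uploaded file small.
rdaFile <- file.path(tempdir(), "dataset_name.Rda")
save(dataset_name, file = rdaFile,
     compress = "xz", compression_level = 9)

## Reload the file in a clean environment to check that it contains
## the expected object.
env <- new.env()
loaded <- load(rdaFile, envir = env)
```

Here `loaded` holds the names of the objects stored in the file, which
should be the name of your dataset only.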

## 6. Almost done!

Once your pull request is submitted, we will take over and carry out
the following steps: 

1. We will review your changes to ensure they comply with the above
   guidelines. We may request changes. 
2. We will contact the Bioconductor team
   ([hubs@bioconductor.org](mailto:hubs@bioconductor.org)) to upload
   your Rda to Microsoft Azure, if needed, and to update the
   `metadata.csv` on their server. See the [help
   page](https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html#uploading-data-to-microsoft-azure-genomic-data-lake)
   for more information. 
3. We will compile the documentation with `roxygen2` and check that
   the package is still valid. We may request changes.
4. We will update the `NEWS.md` file and bump the package version.
5. If this is your first contribution, we will add your name to the
   package authors.