---
title: "Link high quality flycodes and nanobody sequences by NGS filtering"
author: 
- name: Lennart Opitz
  affiliation: Functional Genomics Center Zurich
  email: lennart.opitz@fgcz.ethz.ch
- name: Christian Panse
  affiliation: Functional Genomics Center Zurich
  email: cp@fgcz.ethz.ch
package: NestLink
date: '2018-08-29, 2018-11-08'
bibliography:
  - NestLink.bib
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{1. Link high quality flycode and nanobody sequences by NGS filtering.}
  %\VignetteEncoding{UTF-8}
output: BiocStyle::html_document
abstract: |
  This vignette demonstrates the processing of raw reads from NGS to generate the flycode to nanobody linkage used in the proteomics data analysis. The workflow includes several filtering steps to extract high quality nanobody and flycode amino acid sequences.
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<!-- TODO(LO): Define Input -->
The following content is described in more detail in @NestLink, (under review  NMETH-A35040).

# Load Package
```{r workflow_chunk0, message=FALSE, warning=FALSE}
library(NestLink)
```

# Load Data

```{r}
library(ExperimentHub)

eh <- ExperimentHub()

query(eh, "NestLink")
```

# Define Input & Output-Folder

<!-- TODO(LO): do we have a reference to the entire data set? no-->

```{r define.input.output}
# dataFolder <- file.path(path.package(package = 'NestLink'), 'extdata')
# expFile <- list.files(dataFolder, pattern='*.fastq.gz', full.names = TRUE)

expFile <- query(eh, c("NestLink", "NL42_100K.fastq.gz"))[[1]]
scratchFolder <- tempdir()
setwd(scratchFolder)
```

# Load known nanobodies (NB) for QC Checks

For data QC some known NB were spiked in.
Here, we load the NB DNA sequences and translate them to the corresponding AA sequences.

```{r load.knownNB}
# knownNB_File <- list.files(dataFolder,
#      pattern='knownNB.txt', full.names = TRUE)
knownNB_File <- query(eh, c("NestLink", "knownNB.txt"))[[1]]

knownNB_data <- read.table(knownNB_File, sep='\t',
      header = TRUE, row.names = 1, stringsAsFactors = FALSE)

knownNB <- Biostrings::translate(DNAStringSet(knownNB_data$Sequence))
names(knownNB) <- rownames(knownNB_data)
knownNB <- sapply(knownNB, toString)
```

# Set Example Parameters

The workflow uses the first 100 reads only for a rapid processing time.

```{r setupParameter}
param <- list()
param[['nReads']] <- 100 #Number of Reads from the start of fastq file to process
param[['maxMismatch']] <- 1 #Number of accepted mismatches for all pattern search steps
param[['NB_Linker1']] <- "GGCCggcggGGCC" #Linker Sequence left to nanobody
param[['NB_Linker2']] <- "GCAGGAGGA" #Linker Sequence right to nanobody
param[['ProteaseSite']] <- "TTAGTCCCAAGA" #Sequence next to flycode
param[['FC_Linker']] <- "GGCCaaggaggcCGG" #Linker Sequence next to flycode
param[['knownNB']] <- knownNB
param[['minRelBestHitFreq']] <- 0.8 #minimal fraction of the dominant nanobody for a specific flycode
param[['minConsensusScore']] <- 0.9 #minimal fraction per sequence position in nanabody consensus sequence calculation
param[['minNanobodyLength']] <- 348 #minimal nanobody length in [nt]
param[['minFlycodeLength']] <- 33  #minimal flycode length in [nt]
param[['FCminFreq']] <- 1 #minimal number of subreads for a specific flycode to keep it in the analysis
```

# Run NGS Workflow

The following steps are included:

- read FASTQ
- filter
- extract
- translate into AA sequences


```{r filterExtractTranslateSequences, message=FALSE}
system.time(NB2FC <- runNGSAnalysis(file = expFile[1], param))
```

# Create Input FASTA File for Proteomics Analysis

```{r sanityCheck.NB.FC.linkage}
head(NB2FC, 2)
```

```{r write.AA.FASTA}
head(nanobodyFlycodeLinking.as.fasta(NB2FC))
```

To analyze the expressed flycodes mass spectrometry is used.
the FASTA file containing the nanobody - flycode linkage can
be written to a file using functions such as `cat`.

The exec directory provides alternative shell scripts
using command line GNU tools and AWK.

# Session info

Here is the output of the `sessionInfo()` command.

```{r sessionInfo, echo=FALSE}
sessionInfo()

# References