<!--
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Airway smooth muscle cells}
-->

# Airway smooth muscle cells

Here we provide the code which was used to contruct the
*RangedSummarizedExperiment* object of the *airway* experiment data
package. The experiment citation is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM,
Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr,
Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling
Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates
Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun
13;9(6):e99625.
PMID: [24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665).
GEO: [GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778).

From the abstract, a brief description of the RNA-Seq experiment on
airway smooth muscle (ASM) cell lines: "Using RNA-Seq, a
high-throughput sequencing method, we characterized transcriptomic
changes in four primary human ASM cell lines that were treated with
dexamethasone - a potent synthetic glucocorticoid (1 micromolar for 18
hours)."

# Obtaining sample information from GEO

The following code chunk obtains the sample information from the
series matrix file downloaded from GEO. The columns are then parsed
and new columns with shorter names and factor levels are added.

```{r}
suppressPackageStartupMessages( library( "GEOquery" ) )
suppressPackageStartupMessages( library( "airway" ) )
dir <- system.file("extdata",package="airway")
geofile <- file.path(dir, "GSE52778_series_matrix.txt")
gse <- getGEO(filename=geofile)
pdata <- pData(gse)[,grepl("ch1",names(pData(gse)))]
names(pdata) <- c("treatment","tissue","ercc_mix","cell","celltype")
pdataclean <- data.frame(treatment=sub("treatment: (.*)","\\1",pdata$treatment),
                         cell=sub("cell line: (.*)","\\1",pdata$cell),
                         row.names=rownames(pdata))
pdataclean$dex <- ifelse(grepl("Dex",pdataclean$treatment),"trt","untrt")
pdataclean$albut <- ifelse(grepl("Albut",pdataclean$treatment),"trt","untrt")
pdataclean$SampleName <- rownames(pdataclean)
pdataclean$treatment <- NULL
```

The information which connects the sample information from GEO with
the SRA run id is downloaded from
[SRA](http://www.ncbi.nlm.nih.gov/sra/?term=SRP033351) using the
**Send to: File** button.

```{r}
srafile <- file.path(dir, "SraRunInfo_SRP033351.csv")
srp <- read.csv(srafile)
srpsmall <- srp[,c("Run","avgLength","Experiment","Sample","BioSample","SampleName")]
```

These two *data.frames* are merged and then we subset to only the
samples not treated with albuterol (these samples were not included in
the analysis of the publication).

```{r}
coldata <- merge(pdataclean, srpsmall, by="SampleName")
rownames(coldata) <- coldata$Run
coldata <- coldata[coldata$albut == "untrt",]
coldata$albut <- NULL
coldata
```

Finally, the sample table was saved to a CSV file for future
reference. This file is included in the `inst/extdata` directory of
this package.

```{r eval=FALSE}
write.csv(coldata, file="sample_table.csv")
```

# Downloading FASTQ files from SRA

A file containing the SRA run numbers was created: `files`. This
file was used to download the sequenced reads from the SRA using
`wget`. The following command was used to extract the FASTQ file from
the `.sra` files, using the
[SRA Toolkit](http://www.ncbi.nlm.nih.gov/books/NBK47540/) 

```
cat files | parallel -j 7 fastq-dump --split-files {}.sra
```

# Aligning reads

The reads were aligned using the 
[STAR read aligner](https://code.google.com/p/rna-star/) 
to GRCh37 using the annotations from Ensembl release 75.

```
for f in `cat files`; do STAR --genomeDir ../STAR/ENSEMBL.homo_sapiens.release-75 \
--readFilesIn fastq/$f\_1.fastq fastq/$f\_2.fastq \
--runThreadN 12 --outFileNamePrefix aligned/$f.; done
```

[SAMtools](http://samtools.sourceforge.net/) was used to generate BAM files.

```
cat files | parallel -j 7 samtools view -bS aligned/{}.Aligned.out.sam -o aligned/{}.bam
```

# Counting reads

A transcript database for the homo sapiens Ensembl genes was obtained
from Biomart.

```{r eval=FALSE}
library( "GenomicFeatures" )
txdb <- makeTranscriptDbFromBiomart( biomart="ensembl", dataset="hsapiens_gene_ensembl")
exonsByGene <- exonsBy( txdb, by="gene" )
```

The BAM files were specified using the `SRR` id from the SRA. A yield
size of 2 million reads was used to cap the memory used during
read counting.

```{r eval=FALSE}
sampleTable <- read.csv( "sample_table.csv", row.names=1 )
fls <- file.path("aligned",rownames(sampleTable), ".bam")
library( "Rsamtools" )
bamLst <- BamFileList( fls, yieldSize=2000000 )
```

The following `summarizeOverlaps` call distributed the 8 paired-end
BAM files to 8 workers. This used a maximum of 16 Gb per worker and
the time elapsed was 50 minutes.

```{r eval=FALSE}
library( "BiocParallel" )
register( MulticoreParam( workers=8 ) )
library( "GenomicAlignments" )
airway <- summarizeOverlaps( features=exonsByGene, reads=bamLst,
                            mode="Union", singleEnd=FALSE,
                            ignore.strand=TRUE, fragments=TRUE )
```

The sample information was then added as column data.

```{r eval=FALSE}
colData(airway) <- DataFrame( sampleTable )
```

Finally, we attached the `MIAME` information using the Pubmed ID.

```{r eval=FALSE}
library( "annotate" )
miame <- list(pmid2MIAME("24926665"))
miame[[1]]@url <- "http://www.ncbi.nlm.nih.gov/pubmed/24926665"
# because R's CHECK doesn't like non-ASCII characters in data objects
# or in vignettes. the actual char was used in the first argument
miame[[1]]@abstract <- gsub("micro","micro",abstract(miame[[1]]))
miame[[1]]@abstract <- gsub("beta","beta",abstract(miame[[1]]))
metadata(airway) <- miame
save(airway, file="airway.RData")
```

# Information on the RangedSummarizedExperiment

Below we print out some basic summary statistics on the `airway`
object which is provided by this experiment data package.

```{r}
library("airway")
data(airway)
airway
as.data.frame(colData(airway))
summary(colSums(assay(airway))/1e6)
metadata(rowRanges(airway))
```

# Session information

```{r}
sessionInfo()
```