The {arrayQualityMetrics} package produces, through a
single function call, a comprehensive HTML report of quality
metrics about a microarray dataset McCall,
Murakami, and Irizarry (2010). The quality metrics are mainly on
the per array level, i.e., they can be used to assess the
relative quality of different arrays within a dataset. Some of the
metrics can also be used to diagnose batch effects, and thus the quality
of the overall dataset.
The report can be extended to contain further diagnostics through additional arguments, and we will see examples for this in Section @ref(sec:ext).
The aim of the {arrayQualityMetrics} package is to
produce information that is relevant for your decision making - not, to
make the decision. It will often be applied to two, somewhat distinct,
use cases: (i) assessing quality of a ``raw’’ dataset, in order to get
feedback on the experimental procedures that produced the data; (ii)
assessing quality of a normalised dataset, in order to decide whether
and how to use the dataset (or subsets of arrays in it) for subsequent
data analysis.
Different types of microarray data (one colour, two colour,
Affymetrix, Illumina) are represented by different object classes in
Bioconductor. The function arrayQualityMetrics() will work
in the same way for all of them. Further information about its arguments
can be found in its manual page.
When the function arrayQualityMetrics() is finished, a
report is produced in the directory specified by the function’s
outdir argument. By default, a directory with a suitable
name is created in the current working directory. This directory
contains an HTML page index.html that can be opened by a
browser. The report contains a series of plots explained by text. Some
of the plots are interactive (see Figure @ref(fig:svgplots)).
Technically, this is achieved by the use of SVG (scalable vector
graphics) and JavaScript, and it requires that you use a recent (HTML5
capable) web browser 1. Other plots, where interactivity is less
relevant, are provided as bitmaps (PNG format) and are also linked to
PDF files that provide high resolution versions e.,g. for
publication.
Plus (+) or minus (-) symbols at the begin of different section headings of the report (as in the left panel of Figure @ref(fig:svgplots)) indicate that you can show or hide these sections by clicking on the heading. After (re)loading, all sections are shown except for the Outlier detection barplots, which are hidden and can be expanded by clicking on them.
An example plot from the report. The plot shows the arrays (points) in a two-dimensional plot area spanned by the first two axes of a principal component analysis (PCA). By moving the mouse over the points, the corresponding array’s metadata is displayed in the table to the right of the plot. By clicking on a point, it can be selected or deselected. Selected arrays are indicated by larger points or wider lines in the plots and by ticked checkboxes in the array table shown in the right panel. Arrays can also be (de)selected by clicking the checkboxes. Initially, when the report is loaded (or reloaded) by the browser, all arrays are selected that were called outliers by at least one criterion.
Metadata about the arrays is shown at the top of the report as a
table (see Figure @ref(fig:svgplots)). It is extracted from the
phenoData slot of the data object supplied to
arrayQualityMetrics(). It can be useful to adjust the
contents this slot before producing the report, and to make sure it
contains the right quantity of information to make an informative report
- not too much, not too little.
In the case of AffyBatch input, some Affymetrix specific
sections are added to the standard report. Also for other types of
arrays, sections can be added to the standard report if certain metadata
are present in the input object (see Section @ref(sec:ext)).
The function arrayQualityMetrics() also produces an R
object (essentially, a big list) with all the information contained in
the report, and this object can be used by downstream tools for
programmatic analysis of the report. This is discussed in the vignette
If you are working with Affymetrix GeneChips, an
AffyBatch object is the most appropriate way to import your
raw data into Bioconductor. Starting from CEL files, this is typically
done using the function ReadAffy() from the
{affy} package 2
Here, we use the dataset MLL.A, an object of class
AffyBatch provided in the data package
{ALLMLL}.
Now that the data are loaded, we can call
arrayQualityMetrics() 3.
library("arrayQualityMetrics")
arrayQualityMetrics(expressionset = MLL.A[, 1:5],
outdir = "Report_for_MLL_A",
force = TRUE,
do.logtransform = TRUE)This is the simplest way of calling the function. We give a name to
the directory (outdir) and we overwrite the possibly
existing files of this directory (force). Finally, we set
do.logtransform to logarithm transform the intensities. You
can then view the report by directing your browser to the file
index.html in the directory whose name is indicated by
outdir.
We can call the RMA algorithm on MLL.A to obtain a
preprocessed dataset. The preprocessing includes background correction,
between array intensity adjustment (normalisation) and probeset
summarisation. The resulting object nMLL is of class
ExpressionSet and contains one value (expression estimate)
for each gene for each array.
We can then call again the function
arrayQualityMetrics().
We do not need to set do.logtransform as after
rma() the data are already logarithm transformed.
If you are working on one colour arrays other than Affymetrix
genechips, you can load your data into Bioconductor as an
ExpressionSet object , or if you work with Illumina data
and the {beadarray} package, as an
ExpressionSetIllumina object. You can then proceed exactly
as above.
The package {limma} imports a wide range of data formats
used for two colour arrays and produces objects of class
RGList or MAList. When presented with an
object of these classes, arrayQualityMetrics() tries to
convert them into an NChannelSet and then proceeds with
calling its NChannelSet method.
Alternatively, you can create an NChannelSet to contain
your data
`from scratch''. The documentation of the{Biobase}` package
gives instructions on how to do so.
The arrayQualityMetrics() function expects the
assayData slot of the NChannelSet object to
contain the elements R and G, for the
red'' and thegreen’’ intensities. Optionally, it can
contain elements Rb and Gb for associated
`background'' intensities. As an alternative to all that, thearrayQualityMetrics()function also acceptsNChannelSetobjects with a single slotexprs, and will then simply behave like it does for (single-colour)ExpressionSet`
objects.
As an example, we consider the dataset CCl4 from the
data package {CCl4} and normalize it using the variance
stabilization method available in the package {vsn}.
You can use the {ArrayExpress} package (Kauffmann et al. 2009) to download datasets
from the EBI’s ArrayExpress database. The resulting
ExpressionSet, AffyBatch or
NChannelSet objects can be directly fed into
arrayQualityMetrics().
A useful feature of arrayQualityMetrics() is the
possibility to show the results in the context of an experimental factor
of interest, i.,e. a categorical variable associated with the arrays
such as hybridisation date, treatment level or
replicate number. Specifying a factor does not change
how the quality metrics are computed. By setting the argument
intgroup() to contain the names of one or multiple columns
of the data object’s phenoData slot, a bar on the side of
the heatmap with colours representing the respective factors is added.
Similarly, the colours of the boxplots and density plots reflect the
levels of the first of the factors named by intgroup().
We use the nMLL example again, and create artificial array
metadata factors condition and batch (see
Section~@ref(sec:rin) for a more realistic example).
{#sec:ext}
Some of the quality metrics that the package can compute require specific information about the features on the arrays. To use these, you need to make sure that this information is provided in your input object. We use the nCCl4 example again.
{#sec:spatial}
To plot the spatial distributions of the intensities of the arrays,
arrayQualityMetrics() needs the spatial coordinates of the
features on the chip. For AffyBatch or
BeadLevelList, this information is automatically available
without further user input. For the other types of objects, two columns
corresponding to \(X\) and \(Y\) coordinates of the features are
required in the featureData slot of the object. These
columns should be named “X” and “Y”. If the arrays are split into
blocks, rows and columns, then the function addXYfromGAL()
(please check its manual page for details) can used to convert the row,
column and blocks indices into absolute “X” and “Y” coordinates on the
array. In the example of the dataset CCl4, the coordinates
of the spots are in the columns named “Row” and “Column” of the
featureData (the slot of the object containing the annotation of the
probes). We copy this information into columns named “X” and “Y”
respectively
The next call to arrayQualityMetrics() with this refined
version of nCCl4 (see Section~@ref(sec:rin)) will now
include this information in the report, and the spatial distribution of
the intensities will be shown.
The report can also include an assessment of the effect of the target
mapping of the reporters. You can define a featureData
column named hasTarget that indicates, by logical
TRUE, if the reporter matches a known transcript, and by
FALSE, if not. In the CCl4 example, many of the
reporter names are RefSeq identifiers, while others are not. Thus, we
let hasTarget indicate whether the name begins with
“NM”.
featureData(nCCl4)$hasTarget = (regexpr("^NM", featureData(nCCl4)$Name) > 0)
table(featureData(nCCl4)$hasTarget)##
## FALSE TRUE
## 33296 10332
The next call to arrayQualityMetrics() with this refined
version of nCCl4 (see Section @ref(sec:rin)) will now
include this information in the report, and the spatial distribution of
the intensities will be shown.
{#sec:rin}
The RNA hybridized to the arrays in the {CCl4} dataset
was intentionally made to good, medium or poor quality, and this is
recorded by a so-called RIN value (see CCl4 vignette).
## Cy3 Cy5 RIN.Cy3 RIN.Cy5
## 1 DMSO CCl4 9.0 9.7
## 2 DMSO CCl4 9.0 5.0
## 3 DMSO CCl4 9.0 2.5
## 4 DMSO CCl4 9.0 2.5
## 5 CCl4 DMSO 9.7 9.0
## 6 CCl4 DMSO 5.0 9.0
## 7 CCl4 DMSO 2.5 9.0
## 8 DMSO CCl4 9.0 9.7
## 9 CCl4 DMSO 9.7 9.0
## 10 CCl4 DMSO 5.0 9.0
## 11 CCl4 DMSO 2.5 9.0
## 12 DMSO CCl4 9.0 9.7
## 13 DMSO CCl4 9.0 5.0
## 14 DMSO CCl4 9.0 5.0
## 15 DMSO CCl4 9.0 2.5
## 16 CCl4 DMSO 9.7 9.0
## 17 CCl4 DMSO 5.0 9.0
## 18 CCl4 DMSO 2.5 9.0
The RIN is always 9 for the reference (DMSO), the relevant value is that for the test sample (CCl4).
RIN = with(pd, ifelse( Cy3=="CCl4", RIN.Cy3, RIN.Cy5))
fRIN = factor(RIN)
levels(fRIN) = c("poor", "medium", "good")
pData(nCCl4)$"RNA-integrity" = fRINNow we can use this to set the argument intgroup when
calling the function arrayQualityMetrics().
arrayQualityMetrics(expressionset = nCCl4,
outdir = "Report_for_nCCl4_with_RIN",
force = TRUE,
intgroup = "RNA-integrity")Boxplots, PCA plot and heatmap in the report will now indicate the
values of the factor RNA-integrity for each array.
R version 4.5.3 (2026-03-11) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.4 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] CCl4_1.49.0 limma_3.67.3
[3] vsn_3.79.6 hgu133acdf_2.18.0
[5] ALLMLL_1.51.0 affy_1.89.0
[7] Biobase_2.71.0 BiocGenerics_0.57.1
[9] generics_0.1.4 arrayQualityMetrics_3.67.4 [11] BiocStyle_2.39.0
loaded via a namespace (and not attached): [1] DBI_1.3.0 deldir_2.0-4
gridExtra_2.3
[4] rlang_1.2.0 magrittr_2.0.5 matrixStats_1.5.0
[7] compiler_4.5.3 RSQLite_2.4.6 png_0.1-9
[10] systemfonts_1.3.2 vctrs_0.7.3 reshape2_1.4.5
[13] stringr_1.6.0 crayon_1.5.3 fastmap_1.2.0
[16] backports_1.5.1 XVector_0.51.0 labeling_0.4.3
[19] rmarkdown_2.31 preprocessCore_1.73.0 bit_4.6.0
[22] xfun_0.57 cachem_1.1.0 jsonlite_2.0.0
[25] blob_1.3.0 BeadDataPackR_1.63.0 beadarray_2.61.3
[28] jpeg_0.1-11 cluster_2.1.8.2 R6_2.6.1
[31] bslib_0.10.0 stringi_1.8.7 RColorBrewer_1.1-3
[34] genefilter_1.93.0 rpart_4.1.27 GenomicRanges_1.63.2 [37]
jquerylib_0.1.4 Rcpp_1.1.1-1 Seqinfo_1.1.0
[40] knitr_1.51 gcrma_2.83.0 base64enc_0.1-6
[43] IRanges_2.45.0 Matrix_1.7-5 splines_4.5.3
[46] nnet_7.3-20 rstudioapi_0.18.0 yaml_2.3.12
[49] codetools_0.2-20 hwriter_1.3.2.1 lattice_0.22-9
[52] plyr_1.8.9 withr_3.0.2 KEGGREST_1.51.1
[55] S7_0.2.1-1 evaluate_1.0.5 foreign_0.8-91
[58] survival_3.8-6 Biostrings_2.79.5 affyio_1.81.0
[61] BiocManager_1.30.27 MatrixGenerics_1.23.0 KernSmooth_2.23-26
[64] checkmate_2.3.4 stats4_4.5.3 S4Vectors_0.49.2
[67] ggplot2_4.0.2 scales_1.4.0 xtable_1.8-8
[70] glue_1.8.1 Hmisc_5.2-5 maketools_1.3.2
[73] tools_4.5.3 interp_1.1-6 hexbin_1.28.5
[76] sys_3.4.3 data.table_1.18.2.1 affyPLM_1.87.0
[79] annotate_1.89.0 buildtools_1.0.0 XML_3.99-0.23
[82] grid_4.5.3 latticeExtra_0.6-31 AnnotationDbi_1.73.1 [85]
colorspace_2.1-2 htmlTable_2.4.3 Formula_1.2-5
[88] cli_3.6.6 textshaping_1.0.5 gridSVG_1.7-7
[91] setRNG_2024.2-1 svglite_2.2.2 gtable_0.3.6
[94] sass_0.4.10 digest_0.6.39 htmlwidgets_1.6.4
[97] farver_2.1.2 memoise_2.0.1 htmltools_0.5.9
[100] lifecycle_1.0.5 httr_1.4.8 statmod_1.5.1
[103] bit64_4.8.0
If in doubt, please see the notes about browser compatibility at the top of the report; or contact the Bioconductor mailing list.↩︎
For more information on how to produce an
AffyBatch from your data, please see the documentation of
the {affy} package.↩︎
For this vignette, in order to save computation time, we only call the function on the first 5 arrays; in your own application, you can call it on the complete data object.↩︎