---
title: "The biomformat package vignette"
author: "Paul J. McMurdie"
output: 
  BiocStyle::html_document:
    fig_height: 7
    fig_width: 10
    toc: yes
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{The biomformat package Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

`r library("knitr")`
`r opts_chunk$set(cache=FALSE, fig.width=9, message=FALSE, warning=FALSE)`

## Paul J. McMurdie & Joseph N. Paulson

If you find *biomformat* and/or its tutorials useful, 
please acknowledge and cite *biomformat* in your publications:

Paul J. McMurdie and Joseph N Paulson (2015). *biomformat: An interface package for the BIOM file format.* R/Bioconductor package version 1.0.0. 


## [phyloseq Home Page](https://github.com/joey711/biomformat/)

phyloseq has many more utilities for interacting
with this kind of data than are provided in this I/O-oriented package.


## Motivation for the biomformat package

This is an [R Markdown document](http://www.rstudio.com/ide/docs/r_markdown). Markdown is a simple formatting syntax for authoring web pages. Further details on [R markdown here](http://www.rstudio.com/ide/docs/r_markdown).

The BIOM file format (canonically pronounced "biome") is designed to be a general-use format for representing biological sample by observation contingency tables. BIOM is a recognized standard for [the Earth Microbiome Project](http://www.earthmicrobiome.org/) and is a [Genomics Standards Consortium](http://gensc.org/) candidate project. Please see [the biom-format home page](http://biom-format.org/) for more details.

This demo is designed to provide an overview of the biom package to get you started using it quickly. The biom package itself is intended to be a utility package that will be depended-upon by other packages in the future. It provides I/O functionality, and functions to make it easier to with data from biom-format files. It does not (and probably should not) provide statistical analysis functions. However, it does provide tools to access data from BIOM format files in ways that are extremely common in R (such as `"data.frame"`, `"matrix"`, and `"Matrix"` classes).

**Package versions** at the time (`r date()`) of this build:

```{r packages}
library("biomformat"); packageVersion("biomformat")
```


# Read BIOM format

Here is an example importing BIOM formats of different types into R using the `read_biom` function. The resulting data objects in R are given names beginning with `x`.

```{r read-biom-examples}
min_dense_file   = system.file("extdata", "min_dense_otu_table.biom", 
                               package = "biomformat")
min_sparse_file  = system.file("extdata", "min_sparse_otu_table.biom", 
                               package = "biomformat")
rich_dense_file  = system.file("extdata", "rich_dense_otu_table.biom", 
                               package = "biomformat")
rich_sparse_file = system.file("extdata", "rich_sparse_otu_table.biom", 
                               package = "biomformat")
min_dense_file   = system.file("extdata", "min_dense_otu_table.biom", package = "biomformat")
rich_dense_char  = system.file("extdata", "rich_dense_char.biom", package = "biomformat")
rich_sparse_char  = system.file("extdata", "rich_sparse_char.biom", package = "biomformat")
x1 = read_biom(min_dense_file)
x2 = read_biom(min_sparse_file)
x3 = read_biom(rich_dense_file)
x4 = read_biom(rich_sparse_file)
x5 = read_biom(rich_dense_char)
x6 = read_biom(rich_sparse_char)
x1
```

It would be hard to interpret and wasteful of RAM to stream all the data from a BIOM format file to the standard out when printed with `print` or `show` methods. Instead, a brief summary of the contents BIOM data is shown. 


# Access BIOM data

To get access to the data in a familiar form appropriate for many standard R functions, we will need to use accessor functions, also included in the biom package.

### Core observation data

The core "observation" data is stored in either sparse or dense matrices in the BIOM format file, and sparse matrix support is carried through on the R side via [the Matrix package](http://cran.r-project.org/web/packages/Matrix/index.html). The variables `x1` and `x2`, assigned above from BIOM files, have similar (but not identical) data. They are small enough to observe directly as tables in standard output in an R session:

```{r accessor-examples-table}
biom_data(x1)
biom_data(x2)
```

As you can see above, `x1` and `x2` are represented in R by slightly different matrix classes, assigned automatically based on the data. However, most operations in R will understand this automatically and you should not have to worry about the precise matrix class. However, if the R function you are attempting to use is having a problem with these fancier classes, you can easily coerce the data to the simple, standard `"matrix"` class using the `as` function:

```{r matrix-coercion}
as(biom_data(x2), "matrix")
```

### Observation Metadata

Observation metadata is metadata associated with the individual units being counted/recorded in a sample, as opposed to measurements of properties of the samples themselves. For microbiome census data, for instance, observation metadata is often a taxonomic classification and anything else that might be known about a particular OTU/species. For other types of data, it might be metadata known about a particular genome, gene family, pathway, etc. In this case, the observations are counts of OTUs and the metadata is taxonomic classification, if present. The absence of observation metadata is also supported, as we see for the minimal cases of `x1` and `x2`, where we see `r NULL` instead.

```{r observ-meta}
observation_metadata(x1)
observation_metadata(x2)
observation_metadata(x3)
observation_metadata(x4)[1:2, 1:3]
class(observation_metadata(x4))
```

### Sample Metadata

Similarly, we can access metadata -- if available -- that describe properties of the samples. We access this sample metadata using the `sample_metadata` function.

```{r plot-examples}
sample_metadata(x1)
sample_metadata(x2)
sample_metadata(x3)
sample_metadata(x4)[1:2, 1:3]
class(sample_metadata(x4))
```


### Plots

The data really is accessible to other R functions.

```{r plot}
plot(biom_data(x4))
boxplot(as(biom_data(x4), "vector"))
heatmap(as(biom_data(x4), "matrix"))
```


# Write BIOM format (JSON / v1)

The biom objects in R can be written to a file/connection using the `write_biom` function. If you modified the biom object, this may still work as well, but no guarantees about this as we are still working on internal checks. The following example writes `x4` to a temporary file, then reads it back using `read_biom` and stores it as variable `y`. The exact comparison of these two objects using the `identical` function shows that they are exactly the same in R.
```{r write-biom-examples}
outfile = tempfile()
write_biom(x4, outfile)
y = read_biom(outfile)
identical(x4, y)
```

Furthermore, it is possible to invoke standard operating system commands through the R `system` function -- in this case to invoke the `diff` command available on Unix-like systems or the `FC` command on Windows -- in order to compare the original and temporary files directly. Note that this is shown here for convenience, but not automatically run with the rest of the script because of the OS-dependence. During development, though, this same command is tested privately and no differences are reported between the files.

```{r compare-files-diff, eval=FALSE}
# On Unix OSes
system(paste0("diff ", rich_sparse_file, outfile))
# On windows
system(paste0("FC ", rich_sparse_file, outfile))
```


# HDF5 (BIOM v2) read and write

BIOM v2 uses the [HDF5](https://www.hdfgroup.org/solutions/hdf5/) binary format,
which stores both sample-major and observation-major compressed-sparse representations
of the count matrix.
Reading is handled automatically by `read_biom()` — it detects HDF5 files by
their magic bytes and routes to `read_hdf5_biom()` internally.
Writing to HDF5 is done with `write_hdf5_biom()`.

Both functions require the Bioconductor package
[rhdf5](https://bioconductor.org/packages/rhdf5).
If it is not installed they stop with a clear installation message rather than
a cryptic C-level error.

```{r hdf5-available, include=FALSE}
hdf5_ok <- requireNamespace("rhdf5", quietly = TRUE)
```

```{r hdf5-read, eval=hdf5_ok}
hdf5_file <- system.file("extdata", "rich_sparse_otu_table_hdf5.biom",
                          package = "biomformat")
xh <- read_biom(hdf5_file)
xh
biom_data(xh)
```

Round-trip through `write_hdf5_biom()`: write the object to a temporary HDF5
file, read it back, and confirm the count data and metadata are preserved.

```{r hdf5-write, eval=hdf5_ok}
hdf5_out <- tempfile(fileext = ".biom")
write_hdf5_biom(xh, hdf5_out)

xh2 <- read_biom(hdf5_out)
# Count matrices are identical
identical(biom_data(xh), biom_data(xh2))
# Observation metadata (taxonomy) is preserved
identical(observation_metadata(xh), observation_metadata(xh2))
# Sample metadata is preserved
identical(sample_metadata(xh), sample_metadata(xh2))
```

You can also convert an existing JSON/v1 biom object to HDF5 v2 format:

```{r hdf5-convert, eval=hdf5_ok}
json_to_hdf5 <- tempfile(fileext = ".biom")
write_hdf5_biom(x4, json_to_hdf5)
x4_hdf5 <- read_biom(json_to_hdf5)
identical(biom_data(x4), biom_data(x4_hdf5))
```


# Tidy long-format output

For downstream analysis with tidyverse tools it is often convenient to have
the count data in *long format*: one row per (feature, sample) pair, with
sample and observation metadata joined in automatically.

`biomformat` provides two methods for this:

* `as.data.frame(x)` — returns a plain `data.frame`  
* `as_tibble.biom(x)` — returns a `tibble` (requires the
  [tibble](https://tibble.tidyverse.org/) package)

```{r tidy-dataframe}
long_df <- as.data.frame(x4)
head(long_df)
dim(long_df)   # n_obs * n_samples rows, feature/sample/count + metadata cols
```

The result has one row per (feature × sample) pair.
Columns always include `feature_id`, `sample_id`, and `count`.
Any observation metadata (e.g. taxonomy) and sample metadata
(e.g. body site, barcode) are joined in as additional columns:

```{r tidy-colnames}
names(long_df)
```

```{r tidy-tibble, eval=requireNamespace("tibble", quietly=TRUE)}
library(tibble)
long_tbl <- as_tibble.biom(x4)
long_tbl
```

### purrr-style summarisation over samples

With the long-format data frame in hand, standard
[purrr](https://purrr.tidyverse.org/) + dplyr patterns work directly.
The chunk below is guarded so it only runs when purrr is available,
but the base-R equivalent (using `tapply` / `lapply`) is shown alongside
each example for clarity.

```{r purrr-available, include=FALSE}
purrr_ok <- requireNamespace("purrr", quietly = TRUE) &&
            requireNamespace("dplyr", quietly = TRUE)
```

```{r purrr-summary, eval=purrr_ok}
library(purrr)
library(dplyr)

# Total counts per sample (purrr + dplyr)
long_df |>
  group_by(sample_id) |>
  summarise(total_counts = sum(count), .groups = "drop") |>
  arrange(desc(total_counts))
```

```{r purrr-map, eval=purrr_ok}
# Per-sample Shannon diversity using purrr::map_dbl
long_df |>
  group_by(sample_id) |>
  group_split() |>
  purrr::map_dbl(function(df) {
    p <- df$count / sum(df$count)
    p <- p[p > 0]
    -sum(p * log(p))
  }) |>
  setNames(unique(long_df$sample_id))
```

```{r tapply-fallback, eval=!purrr_ok}
# Base-R equivalent when purrr is not available
tapply(long_df$count, long_df$sample_id, function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
})
```


# SummarizedExperiment interoperability

[SummarizedExperiment](https://bioconductor.org/packages/SummarizedExperiment)
is the standard Bioconductor container for rectangular feature-by-sample assay
data. `biomformat` provides two ways to coerce a `biom` object into a
`SummarizedExperiment`:

1. The explicit constructor `biom_to_SummarizedExperiment(x)`
2. S4 coercion syntax `as(x, "SummarizedExperiment")`

Both require the `SummarizedExperiment` and `S4Vectors` Bioconductor packages.

```{r se-available, include=FALSE}
se_ok <- requireNamespace("SummarizedExperiment", quietly = TRUE) &&
         requireNamespace("S4Vectors",            quietly = TRUE)
```

```{r se-constructor, eval=se_ok}
library(SummarizedExperiment)

se <- biom_to_SummarizedExperiment(x4)
se
```

The count matrix is stored in the `"counts"` assay slot, observation metadata
goes to `rowData`, and sample metadata goes to `colData`:

```{r se-slots, eval=se_ok}
# Count matrix (same as biom_data(x4))
head(assay(se, "counts"))

# Sample metadata in colData
colData(se)[, 1:3]

# Observation (OTU) metadata in rowData
rowData(se)[, 1:3]
```

The S4 `as()` coercion is equivalent:

```{r se-as-coercion, eval=se_ok}
se2 <- as(x4, "SummarizedExperiment")
identical(assay(se, "counts"), assay(se2, "counts"))
```

From here the full Bioconductor ecosystem is available — for example,
downstream use with `DESeq2`, `edgeR`, `limma`, etc.


# Constructing a BIOM from R data

The `make_biom()` function builds a valid biom object from standard R matrices
and data frames. This is the recommended path for workflows like
[dada2](https://benjjneb.github.io/dada2/) where you already have a count
matrix and taxonomy table in R.

```{r make_biom_workflow}
# Simulate a small ASV count table (rows = features, cols = samples)
set.seed(42)
mat <- matrix(
  sample(0:50, 15, replace = TRUE), nrow = 3, ncol = 5,
  dimnames = list(
    paste0("ASV", 1:3),
    paste0("Samp", 1:5)
  )
)

# Taxonomy table: one row per feature, one list-valued column "taxonomy"
# Each element is a character vector of rank assignments (kingdom -> species).
tax <- data.frame(
  taxonomy = I(list(
    c("Bacteria", "Firmicutes",      "Clostridia",   "Clostridiales",
      "Lachnospiraceae", "Blautia", "NA"),
    c("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacteriales",
      "Enterobacteriaceae", "Escherichia", "coli"),
    c("Bacteria", "Bacteroidetes",  "Bacteroidia",  "Bacteroidales",
      "Bacteroidaceae", "Bacteroides", "fragilis")
  )),
  row.names = rownames(mat)
)

# Build the biom object
x <- make_biom(data = mat, observation_metadata = tax,
               matrix_element_type = "int")
x

# Write to a JSON BIOM file and read back
tmp <- tempfile(fileext = ".biom")
write_biom(x, tmp)
y <- read_biom(tmp)

# Count data survives the round-trip
identical(biom_data(x), biom_data(y))

# Observation metadata is preserved (taxonomy values, not column name)
head(observation_metadata(y))
```

> **Large datasets:** `write_biom()` serialises to a single JSON string and
> will fail for very large tables (> ~2 GB serialised).  For large datasets
> use `write_hdf5_biom()` instead — HDF5 has no size constraint.

```{r make_biom_hdf5, eval=requireNamespace("rhdf5", quietly=TRUE)}
# For large datasets, write HDF5 (BIOM v2) instead
hdf5_tmp <- tempfile(fileext = ".biom")
write_hdf5_biom(x, hdf5_tmp)
z <- read_biom(hdf5_tmp)
identical(biom_data(x), biom_data(z))
```


# Subsetting biom_data() by name

`biom_data()` accepts character vectors for the `rows` and `columns` arguments,
allowing you to subset by feature or sample name directly without needing to
look up integer indices first.

```{r named_subsetting}
biom_file <- system.file("extdata", "rich_sparse_otu_table.biom",
                          package = "biomformat")
x <- read_biom(biom_file)

# All samples for a specific feature
biom_data(x, rows = "GG_OTU_3")

# A specific set of samples for two features
biom_data(x, rows = c("GG_OTU_1", "GG_OTU_3"),
              columns = c("Sample1", "Sample3", "Sample5"))
```


# Session info

```{r session-info}
sessionInfo()
```