Compression

HDF5 supports transparent data compression, allowing you to drastically reduce the file size of your datasets with minimal effort. While the HDF5 ecosystem has historically relied on standard gzip and szip, modern data pipelines require higher throughput and advanced techniques like lossy floating-point compression and optimized bitshuffling.

Powered by hdf5lib, h5lite bundles an extensive suite of state-of-the-art compression filters natively, meaning you can use modern codecs like Blosc2, Zstandard (Zstd), LZ4, and ZFP without installing any external system dependencies.

This vignette covers how to configure these compression pipelines using the h5_compression() function, how to choose the right algorithm, how to tune chunk sizes, and how to inspect your results using h5_inspect().

library(h5lite)
file <- tempfile(fileext = ".h5")

The compress Argument and h5_compression()

For simple use cases, you can pass a configuration string directly to the compress argument of h5_write(). h5lite handles the underlying chunking requirements automatically.

# Standard gzip compression at level 5
h5_write(rnorm(1000), file, "data/simple_gzip", compress = "gzip-5")

# High-performance Blosc2 with Zstandard
h5_write(rnorm(1000), file, "data/simple_blosc2", compress = "blosc2-zstd-5")

For advanced control over the entire compression pipeline (including chunk sizing, pre-filters, data scaling, and checksums), use the h5_compression() function to build a configuration object to pass to h5_write().

# Advanced pipeline: LZ4 compression + optimal integer packing + Fletcher32 checksum
cmp <- h5_compression(
  compress    = "lz4-9", 
  int_packing = TRUE, 
  checksum    = TRUE,
  chunk_size  = 512 * 1024 # 512 KB chunks
)

h5_write(1:1000, file, "data/advanced", compress = cmp)

Valid Compression Strings Reference

The compress argument accepts a specific string syntax that defines both the codec and its operational level. The reference below lists all valid combinations and indicates whether each codec requires, permits, or forbids a level or parameter suffix.

Optional level suffix (defaults applied if omitted):
  gzip, zstd, lz4, bzip2, bshuf-zstd,
  blosc1-lz4, blosc1-gzip, blosc1-zstd,
  blosc2-lz4, blosc2-gzip, blosc2-zstd
  Examples: "zstd-7", "blosc2-lz4"

No suffix allowed (strict exact match):
  none, lzf, snappy, bshuf-lz4,
  szip-nn, szip-ec, zfp-rev,
  blosc1, blosc1-snappy, blosc2, blosc2-ndlz
  Examples: "bshuf-lz4", "blosc2"

Required parameter suffix (requires bits or tolerance):
  zfp-prec, zfp-rate, zfp-acc,
  blosc2-zfp-prec, blosc2-zfp-rate, blosc2-zfp-acc
  Examples: "zfp-rate-8", "zfp-acc-0.01"
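To make the rules concrete, here is a brief sketch showing one string from each category (the exact error messages depend on h5lite's validation):

```r
# Optional suffix: a level may be given or omitted (defaults apply)
cmp_a <- h5_compression("zstd")          # default compression level
cmp_b <- h5_compression("zstd-7")        # explicit level 7

# No suffix allowed: these must match exactly
cmp_c <- h5_compression("bshuf-lz4")     # valid
try(h5_compression("bshuf-lz4-5"))       # invalid: no level suffix permitted

# Required parameter: ZFP modes need a bits or tolerance value
cmp_d <- h5_compression("zfp-acc-0.01")  # valid
try(h5_compression("zfp-acc"))           # invalid: tolerance is required
```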

Choosing a Codec: Modern vs. Legacy

With so many options available, selecting the right codec depends on whether you are optimizing for extreme read/write speed, minimal file size, or universal compatibility.

1. Standalone Modern Codecs

If you prefer not to use the Blosc2 wrapper, you can call modern codecs directly:

2. Gzip (The Universal Standard)

3. Legacy Codecs (Obsolete or Niche)


Lossy Compression: ZFP and Scale-Offset

For massive numeric datasets, lossless compression may not provide enough space savings. h5lite supports two methods that discard mathematically insignificant precision in exchange for much higher compression ratios.

ZFP (Floating-Point & Integer)

ZFP is a specialized algorithm designed for high-throughput, lossy compression of numerical arrays. It can achieve very high compression ratios, but it only operates on numeric data.

(Note: the standalone "zfp-..." codecs support both integers and floats; however, when ZFP is wrapped inside Blosc2 via "blosc2-zfp-...", it can only encode floating-point values.)

# Lossy compression: decompressed values will be accurate to within +/- 0.05
cmp_zfp <- h5_compression("zfp-acc-0.05")
h5_write(rnorm(1e5), file, "data/zfp_floats", compress = cmp_zfp)

Scale-Offset (Integer Packing & Float Rounding)

The native HDF5 Scale-Offset filter mathematically scales your data so it can be stored using fewer bits. It processes data one chunk at a time, and automatically reverses these operations when you read the file to reproduce your original values.

# 1. Integer Packing Example
# A dataset with a small range of values (e.g., years 2000 to 2050)
years <- sample(2000:2050, 100000, replace = TRUE)

# By default, R uses 32-bit integers. 
# With int_packing = TRUE, HDF5 subtracts 2000 from all values,
# leaving numbers from 0 to 50, which fit perfectly into just 6 bits!
cmp_int <- h5_compression("lz4-9", int_packing = TRUE)
h5_write(years, file, "data/packed_years", compress = cmp_int)

# 2. Float Rounding Example
# Sensor data where anything beyond 2 decimal places is just noise
sensor_data <- rnorm(100000, mean = 98.6, sd = 0.5)

# Multiplies by 10^2 (e.g., 98.614... -> 9861.4...), rounds to 9861, and bit-packs.
# When read back into R, it is automatically divided by 100 to restore 98.61.
cmp_float <- h5_compression("zstd-5", float_rounding = 2)
h5_write(sensor_data, file, "data/rounded_sensors", compress = cmp_float)
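As a sanity check, the rounded values can be read back and compared against the originals. The reader function is shown here as h5_read(), which is an assumed name; substitute the package's actual read function:

```r
# Read the rounded data back (h5_read() is an assumed reader name)
recovered <- h5_read(file, "data/rounded_sensors")

# With float_rounding = 2, the round-trip error per value should be
# at most half of 10^-2, i.e. 0.005
max(abs(sensor_data - recovered))
```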

Filter Interactions & Invalid Combinations

Filters in HDF5 operate in a sequential pipeline, and certain filters destroy the underlying byte structures that downstream algorithms rely on. h5_compression() strictly enforces mutual exclusions and will throw an error if you attempt an invalid combination:

  1. Shuffling vs. Scale-Offset: Pre-filters like Bitshuffle and Byte Shuffle rearrange the byte stream to group similar bits together for better compression. Scale-Offset (int_packing or float_rounding) packs data into non-standard bit widths, which destroys byte alignment. h5lite therefore disables all automatic shuffling whenever Scale-Offset is active.

  2. Mathematical vs. Shuffling Codecs: ZFP and Szip perform mathematical compression directly on raw numerical values. They will fail outright, or corrupt the data, if the bitstream is rearranged beforehand. Do not combine ZFP or Szip with Scale-Offset, Bitshuffle, or Blosc2 pre-filters.

  3. String Data Limitations: Szip and ZFP cannot be applied to character vectors. String compression relies on standard algorithms like gzip or zstd, and only works on fixed-length strings. Variable-length strings (such as those containing NA values) cannot be compressed by chunk filters at all.
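For example, requesting a forbidden pairing should fail immediately at configuration time (a sketch; the exact error text depends on h5lite):

```r
# ZFP performs mathematical compression; Scale-Offset packing cannot precede it
try(h5_compression("zfp-acc-0.01", int_packing = TRUE))

# Szip similarly cannot follow float rounding
try(h5_compression("szip-nn", float_rounding = 2))
```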


Tuning Chunk Size

HDF5 does not compress a dataset as one monolithic block. Instead, it divides the dataset into smaller “chunks” and compresses each independently.

By default, h5_compression() targets a 1 MB chunk size (chunk_size = 1048576), a sensible default for most workloads. However, you may want to tune this to match your specific access patterns:

# Optimizing for reading small, 100KB slices at a time
cmp_chunk <- h5_compression("blosc2-zstd-5", chunk_size = 102400)
h5_write(matrix(rnorm(10000), 100, 100), file, "data/tuned_chunks", compress = cmp_chunk)

Evaluating Results with h5_inspect()

It can be difficult to know exactly how well your compression strategy is working. The h5_inspect() function allows you to peek under the hood of any dataset, revealing its storage layout, chunk dimensions, the exact filter pipeline applied, and the resulting compression ratio.

# Write some highly compressible (sequential) integer data
cmp_pack <- h5_compression('lz4-9', int_packing = TRUE, checksum = TRUE)
h5_write(matrix(5001:5100, 10, 10), file, "inspect/packed_mtx", compress = cmp_pack)

# Inspect the dataset's properties
h5_inspect(file, "inspect/packed_mtx")

Output:

<HDF5 Dataset Properties>
  Type:    uint16              Size:    200.00 B
  Layout:  chunked             Disk:    120.00 B
  Chunks:  [10 x 10]           Ratio:   1.67x
  Pipeline: scaleoffset -> lz4 -> fletcher32

You can use this compression ratio readout to iteratively test different h5_compression() configurations until you find the perfect balance for your specific data.
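One simple workflow is to write the same data under several candidate codecs and inspect each result side by side (a sketch; it assumes the object returned by h5_inspect() prints the summary shown above):

```r
# Compare several codecs on identical data
x <- rnorm(1e5)
for (codec in c("gzip-5", "zstd-5", "lz4-9", "blosc2-zstd-5")) {
  path <- paste0("bench/", codec)
  h5_write(x, file, path, compress = codec)
  print(h5_inspect(file, path))
}
```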

# Clean up
unlink(file)

For additional details about these codecs and the underlying library, please see https://cmmr.github.io/hdf5lib/articles/compression.html.