HDF5 supports transparent data compression, allowing you to
drastically reduce the file size of your datasets with minimal effort.
While the HDF5 ecosystem has historically relied on standard
gzip and szip, modern data pipelines require
higher throughput and advanced techniques like lossy floating-point
compression and optimized bitshuffling.
Powered by hdf5lib, h5lite bundles an
extensive suite of state-of-the-art compression filters natively,
meaning you can use modern codecs like Blosc2,
Zstandard (Zstd), LZ4, and
ZFP without installing any external system
dependencies.
This vignette covers how to configure these compression pipelines
using the h5_compression() function, how to choose the
right algorithm, how to tune chunk sizes, and how to inspect your
results using h5_inspect().
## The compress Argument and h5_compression()

For simple use cases, you can pass a configuration string directly to
the compress argument of h5_write(). h5lite handles the underlying
chunking requirements automatically.
```r
# Standard gzip compression at level 5
h5_write(rnorm(1000), file, "data/simple_gzip", compress = "gzip-5")

# High-performance Blosc2 with Zstandard
h5_write(rnorm(1000), file, "data/simple_blosc2", compress = "blosc2-zstd-5")
```

For advanced control over the entire compression pipeline - including
chunk sizing, pre-filters, data scaling, and checksums - use the
h5_compression() function to build a configuration object to pass to
h5_write().
```r
# Advanced pipeline: LZ4 compression + integer packing + Fletcher32 checksum
cmp <- h5_compression(
  compress    = "lz4-9",
  int_packing = TRUE,
  checksum    = TRUE,
  chunk_size  = 512 * 1024  # 512 KB chunks
)
h5_write(1:1000, file, "data/advanced", compress = cmp)
```

The compress argument accepts specific string syntaxes that define both
the codec and its operational level. The table below lists all valid
combinations and indicates whether they require, permit, or forbid a
level or parameter suffix.
| Suffix Rule | Valid Codec Strings | Examples |
|---|---|---|
| Optional Level Suffix (defaults applied if omitted) | gzip, zstd, lz4, bzip2, bshuf-zstd, blosc1-lz4, blosc1-gzip, blosc1-zstd, blosc2-lz4, blosc2-gzip, blosc2-zstd | "zstd-7", "blosc2-lz4" |
| No Suffix Allowed (strict exact match) | none, lzf, snappy, bshuf-lz4, szip-nn, szip-ec, zfp-rev, blosc1, blosc1-snappy, blosc2, blosc2-ndlz | "bshuf-lz4", "blosc2" |
| Required Parameter Suffix (requires bits or tolerance) | zfp-prec, zfp-rate, zfp-acc, blosc2-zfp-prec, blosc2-zfp-rate, blosc2-zfp-acc | "zfp-rate-8", "zfp-acc-0.01" |
With so many options available, selecting the right codec depends on whether you are optimizing for extreme read/write speed, minimal file size, or universal compatibility.
Blosc2 is a high-performance meta-compressor optimized for binary data. It automatically handles multi-threading and applies a highly optimized internal bitshuffle algorithm before passing the data to a sub-compressor.
- "blosc2-zstd-[level]": Offers the best overall balance of extreme
  read/write speeds and excellent compression ratios. It effectively
  replaces standard gzip for modern analytical workloads.
- "blosc2-lz4-[level]": Exceptionally fast. Best used when read/write
  speed is the absolute highest priority and storage space is less of
  a concern.
If you prefer not to use the Blosc2 wrapper, you can call modern codecs directly:
- "zstd-[level]": Zstandard (levels 1-22). Vastly superior to gzip in
  both speed and compression ratio.
- "lz4-[level]": Standard LZ4 (level 0) or LZ4-HC (levels 1-12).
- "gzip-[level]": Levels 1-9 (default is 5). Every compiled HDF5
  library supports gzip, so use it when you plan to share your .h5
  files with external collaborators using older Python/Julia tools, or
  when archiving data for long-term storage where universal
  compatibility is mandatory.
- "szip-nn" / "szip-ec": Historically fast for scientific data,
  provided here via the permissively licensed libaec library. Because
  the original library was frequently missing from legacy HDF5
  distributions, szip never saw universal adoption and is now largely
  obsolete compared to Blosc2 or Zstd.
- "blosc1", "snappy", "lzf", "bzip2": Included strictly for backward
  compatibility, so you can read archived .h5 files and write to
  legacy data processing pipelines. These early-generation algorithms
  lack the multi-threading, speed, and compression ratios of modern
  alternatives, making them generally unsuitable for new datasets.
For massive numeric datasets, lossless compression may not provide
enough space savings. h5lite supports two methods to
discard mathematically insignificant precision in exchange for massive
compression ratios.
ZFP is a specialized algorithm designed for high-throughput, lossy compression of numerical arrays. It offers very high ratios but applies only to numeric data.
(Note: The standalone "zfp-..." codecs support both
integers and floats. However, if ZFP is wrapped inside Blosc2 via
"blosc2-zfp-...", it can only encode floating-point
values).
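As a sketch of lossy ZFP in practice, the snippet below writes with a fixed-accuracy tolerance and verifies the error bound on read-back. Note that h5_read() is an assumption here (this vignette only demonstrates writing); substitute the package's actual read function if it differs.

```r
library(h5lite)

file <- tempfile(fileext = ".h5")
x <- rnorm(100000)

# Fixed-accuracy ZFP: absolute error bounded by the stated tolerance
cmp_zfp <- h5_compression("zfp-acc-0.001")
h5_write(x, file, "data/lossy_zfp", compress = cmp_zfp)

# Read back and check the guarantee (h5_read() is assumed, not shown above)
y <- h5_read(file, "data/lossy_zfp")
max(abs(x - y))  # should not exceed 0.001
```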
- Accuracy ("zfp-acc-[tolerance]"): Guarantees that no decompressed
  value will differ from the original by more than the given absolute
  tolerance (e.g., "zfp-acc-0.001").
- Precision ("zfp-prec-[bits]"): Preserves a specific number of bits of
  precision (e.g., "zfp-prec-16").
- Rate ("zfp-rate-[bits]"): Forces the compressed data to use exactly a
  certain number of bits of storage per value (e.g., "zfp-rate-8").

The native HDF5 Scale-Offset filter mathematically scales your data so
it can be stored using fewer bits. It processes data one chunk at a
time and automatically reverses these operations when you read the
file, reproducing your original values.
- Integer Packing (int_packing): When you set int_packing = TRUE, HDF5
  subtracts the minimum value in the chunk from all other values, then
  encodes the resulting smaller values using the minimum number of bits
  necessary. For datasets with small ranges or many zeros, this saves a
  large amount of space. (Alternatively, passing a number like
  int_packing = 8 forces the data to be packed into exactly 8 bits.)
- Float Rounding (float_rounding): When you pass an integer (like
  float_rounding = 3), HDF5 multiplies all floating-point values by
  10^3 to shift the decimal point, then rounds the results to the
  nearest whole integer. Once they are integers, it applies the same
  bit-packing method described above. When the data is decoded, the
  operations are run in reverse to restore the original values, minus
  any precision lost during the initial rounding step.
```r
# 1. Integer Packing Example
# A dataset with a small range of values (e.g., years 2000 to 2050)
years <- sample(2000:2050, 100000, replace = TRUE)

# By default, R uses 32-bit integers.
# With int_packing = TRUE, HDF5 subtracts 2000 from all values,
# leaving numbers from 0 to 50, which fit into just 6 bits.
cmp_int <- h5_compression("lz4-9", int_packing = TRUE)
h5_write(years, file, "data/packed_years", compress = cmp_int)

# 2. Float Rounding Example
# Sensor data where anything beyond 2 decimal places is just noise
sensor_data <- rnorm(100000, mean = 98.6, sd = 0.5)

# Multiplies by 10^2 (e.g., 98.614... -> 9861.4...), rounds to 9861, and bit-packs.
# When read back into R, values are divided by 100 to restore 98.61.
cmp_float <- h5_compression("zstd-5", float_rounding = 2)
h5_write(sensor_data, file, "data/rounded_sensors", compress = cmp_float)
```

Filters in HDF5 operate in a sequential pipeline, and certain filters
destroy the underlying byte structures that downstream algorithms rely
on. h5_compression() strictly enforces mutual exclusions and will throw
an error if you attempt an invalid combination:
- Shuffling vs. Scale-Offset: Pre-filters like Bitshuffle and Byte
  Shuffle rearrange the byte stream to group similar bits together for
  better compression. Scale-Offset (int_packing or float_rounding)
  packs data into non-standard bit widths, which destroys byte
  alignment. Therefore, all automatic shuffling is disabled whenever
  Scale-Offset is active.
- Mathematical vs. Shuffling Codecs: ZFP and Szip perform mathematical
  compression directly on raw numerical values and will fail or corrupt
  data if the bitstream is rearranged beforehand. Do not combine ZFP or
  Szip with Scale-Offset, Bitshuffle, or Blosc2 pre-filters.
- String Data Limitations: Szip and ZFP cannot be applied to character
  vectors. String compression relies on standard algorithms like gzip
  or zstd, and only works on fixed-length strings. Variable-length
  strings (such as those containing NA values) cannot be compressed by
  chunk filters at all.
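Assuming these exclusions are enforced at configuration time, as stated above, an invalid combination fails fast rather than silently producing a corrupt dataset:

```r
# Hypothetical invalid pipeline: ZFP combined with Scale-Offset packing.
# h5_compression() should reject this combination with an error.
try(h5_compression("zfp-acc-0.001", int_packing = TRUE))
```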
HDF5 does not compress a dataset as one monolithic block. Instead, it divides the dataset into smaller “chunks” and compresses each independently.
By default, h5_compression() targets a 1 MB chunk size
(chunk_size = 1048576), which works well for most workloads. However,
you should tune this manually depending on your specific access
patterns:
- Too Small (< 10 KB): Imposes huge metadata overhead. The internal
  HDF5 B-tree will bloat the file size, and the compression algorithms
  won't have enough data to identify repeating patterns.
- Too Large (> 50 MB): If you only want to read a tiny slice (e.g., 10
  rows) of your dataset, HDF5 is forced to load and decompress the
  entire chunk containing those rows into memory. Overly large chunks
  cause massive read latency for subsetting operations.
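Because chunk_size is specified in bytes, it can help to translate these limits into element counts when reasoning about a given dataset. A quick back-of-the-envelope calculation for 8-byte doubles:

```r
# The 1 MB default corresponds to this many 8-byte doubles per chunk:
1048576 / 8   # 131072 doubles

# The rough guardrails above, expressed as element counts:
10 * 1024 / 8      # ~10 KB floor  -> 1280 doubles
50 * 1024^2 / 8    # ~50 MB ceiling -> 6553600 doubles
```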
```r
# Optimizing for reading small, 100 KB slices at a time
cmp_chunk <- h5_compression("blosc2-zstd-5", chunk_size = 102400)
h5_write(matrix(rnorm(10000), 100, 100), file, "data/tuned_chunks", compress = cmp_chunk)
```

## h5_inspect()

It can be difficult to know exactly how well your compression strategy
is working. The h5_inspect() function lets you peek under the hood of
any dataset, revealing its storage layout, chunk dimensions, the exact
filter pipeline applied, and the resulting compression ratio.
```r
# Write some highly compressible (sequential) integer data
cmp_pack <- h5_compression("lz4-9", int_packing = TRUE, checksum = TRUE)
h5_write(matrix(5001:5100, 10, 10), file, "inspect/packed_mtx", compress = cmp_pack)

# Inspect the dataset's properties
h5_inspect(file, "inspect/packed_mtx")
```

Output:

```
<HDF5 Dataset Properties>
  Type: uint16        Size: 200.00 B
  Layout: chunked     Disk: 120.00 B
  Chunks: [10 x 10]   Ratio: 1.67x
  Pipeline: scaleoffset -> lz4 -> fletcher32
```
You can use this compression ratio readout to iteratively test
different h5_compression() configurations until you find
the perfect balance for your specific data.
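One way to run that iteration is a small benchmarking loop, sketched below. It assumes only the functions shown earlier in this vignette; print() is used because auto-printing does not occur inside a for loop.

```r
library(h5lite)

file <- tempfile(fileext = ".h5")
x <- rnorm(1e6)

# Write the same data under several candidate codecs, then compare
# the Ratio readouts from h5_inspect() side by side.
for (codec in c("gzip-5", "zstd-5", "blosc2-lz4-5", "blosc2-zstd-5")) {
  h5_write(x, file, paste0("bench/", codec), compress = codec)
  print(h5_inspect(file, paste0("bench/", codec)))
}
```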
For additional details about these codecs and the underlying library, please see https://cmmr.github.io/hdf5lib/articles/compression.html.