BigDataStatMeth

Overview

BigDataStatMeth provides scalable statistical computing for matrices stored in HDF5 files. The package is designed as a two-level tool: it provides a standard R interface for users working with HDF5-backed matrices, and a reusable C++ infrastructure for developers implementing new block-wise statistical methods.

The R interface is based on HDF5Matrix objects and S3 methods, so users can work with familiar R calls such as dim(), [, %*%, crossprod(), scale(), cor(), svd(), prcomp(), qr(), chol(), and solve(). The C++ infrastructure provides classes and routines for managing HDF5 files, groups, and datasets, together with block-wise numerical methods that can be reused from Rcpp-based code.

flowchart LR
    subgraph R["R interface"]
        A["HDF5Matrix objects"] --> B["S3 generics: scale, crossprod, svd, qr, prcomp ..."]
    end
    subgraph CPP["C++ infrastructure"]
        C["C++ classes: files, groups, datasets"] --> D["Block-wise numerical routines"]
    end
    B --> D
    D --> E["HDF5 storage on-disk matrices"]
    C --> E

Most users will interact with the R/S3 interface. Developers can build on the C++ headers to extend the package with new HDF5-backed methods while retaining efficient execution through compiled code.

Key Features

HDF5-backed matrices through a familiar HDF5Matrix/S3 interface
Block-wise algorithms that process matrices larger than available RAM one block at a time
Parallel processing with multi-threaded operations
Comprehensive decompositions — SVD, PCA, QR, Cholesky, eigendecomposition, and pseudoinverse
Statistical transformations — centering, scaling, correlation, sweep, and aggregations by row/column
C++ developer infrastructure — reusable classes and routines for building new scalable methods without reimplementing HDF5 management or block iteration
HDF5 compression and file-space reuse — controlled disk usage across iterative workflows

Installation

From CRAN (Stable Release)

install.packages("BigDataStatMeth")

From GitHub (Development Version)

# Install devtools if needed
install.packages("devtools")

devtools::install_github("isglobal-brge/BigDataStatMeth")

System Requirements

R packages:

Matrix
RcppEigen
RSpectra

System dependencies:

HDF5 library (>= 1.8)
C++17 compatible compiler
For Windows: Rtools

Quick Start

library(BigDataStatMeth)

h5file <- tempfile(fileext = ".h5")

set.seed(1)
X <- matrix(rnorm(500 * 100), nrow = 500, ncol = 100)

# Write an in-memory matrix to HDF5
X_h5 <- hdf5_create_matrix(
  filename = h5file,
  dataset  = "data/X",
  data     = X,
  overwrite = TRUE
)

dim(X_h5)
colMeans(X_h5)

# Standard R operations on the HDF5-backed matrix
XtX_h5  <- crossprod(X_h5)
X_sc_h5 <- scale(X_h5)

# Decompositions
svd_res <- svd(X_h5, nu = 5, nv = 5, center = TRUE, scale = TRUE)
pca_res <- prcomp(X_h5, center = TRUE, scale. = TRUE, ncomponents = 5)

close(X_h5)
hdf5_close_all()

Core Functionality

Standard R interface (`HDF5Matrix`/S3)

| Category | Representative calls |
| --- | --- |
| Core object handling | `hdf5_create_matrix()`, `hdf5_matrix()`, `dim()`, `nrow()`, `ncol()`, `is_open()`, `close()` |
| HDF5 inspection and I/O | `list_datasets()`, `hdf5_import()`, `hdf5_import_multiple()`, `as.matrix()`, `as.data.frame()` |
| Subsetting and assignment | `X[i, j]`, `X[i, j] <- value` |
| Dimension names | `rownames()`, `colnames()`, `dimnames()` |
| Element-wise arithmetic | `X + Y`, `X - Y`, `X * Y`, `X / Y` |
| Matrix algebra | `%*%`, `crossprod()`, `tcrossprod()`, `cbind()`, `rbind()` |
| Aggregations | `colSums()`, `rowSums()`, `colMeans()`, `rowMeans()`, `colVars()`, `rowVars()`, `colSds()`, `rowSds()`, `colMins()`, `rowMins()`, `colMaxs()`, `rowMaxs()` |
| Scalar summaries | `mean()`, `var()`, `sd()` |
| Normalization and transformations | `scale()`, `sweep()` |
| Correlation | `cor()` |
| Decompositions | `svd()`, `prcomp()`, `eigen()`, `pseudoinverse()` |
| Factorizations and solvers | `qr()`, `chol()`, `solve()` |
| Diagonal operations | `diag()`, `diag<-()`, `diag_op()`, `diag_scale()` |
| Split, reduce, and apply | `split_dataset()`, `split()`, `reduce()`, `apply_function()` |
| Resource management and options | `hdf5matrix_options()`, `show_hdf5matrix_options()`, `hdf5_close_all()` |

Specialized helpers (`bd*`)

Some utilities do not map directly to an existing base R generic and retain the bd* prefix. Examples include bdCreate_hdf5_group(), bdmove_hdf5_dataset(), and bdWrite_hdf5_dimnames(). These functions are part of the package API and are documented in their corresponding help pages.

Global Options

Common settings for HDF5-backed computations can be configured with hdf5matrix_options(). These include parallel execution, number of threads, block size, and HDF5 compression level.

hdf5matrix_options(
  paral       = TRUE,
  threads     = 4L,
  block_size  = 512L,
  compression = 6L
)

These settings are especially useful for operations dispatched through standard R generics, where the usual R call does not always expose all low-level execution parameters. Operation-specific parameters can also be passed directly when a method supports them (see ?svd.HDF5Matrix, ?prcomp.HDF5Matrix, ?qr.HDF5Matrix).
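To make the role of a block size concrete, the following standard-library C++ sketch splits a dimension of length `n` into consecutive ranges of at most `block_size` elements. It is a generic illustration only: `block_ranges` is a hypothetical helper, not a BigDataStatMeth function, and how the package maps `block_size` onto rows or columns is not specified here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of BigDataStatMeth): returns (start, count)
// pairs that cover [0, n) in consecutive chunks of at most block_size.
std::vector<std::pair<std::size_t, std::size_t>>
block_ranges(std::size_t n, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve((n + block_size - 1) / block_size);  // ceiling division
    for (std::size_t start = 0; start < n; start += block_size) {
        // The last block may be shorter than block_size.
        ranges.emplace_back(start, std::min(block_size, n - start));
    }
    return ranges;
}
```

For `n = 1200` and `block_size = 512` this yields the ranges (0, 512), (512, 512), and (1024, 176). A smaller block size lowers peak memory per step at the cost of more read operations.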

C++ Infrastructure for New Methods

The C++ API is a central part of BigDataStatMeth. The package exposes C++ classes for HDF5 files, groups, and datasets, and implements block-wise routines for matrix algebra, decompositions, and statistical operations. These are the same building blocks used internally by the R/S3 interface.

This design allows developers to focus on the statistical or numerical method itself, rather than reimplementing HDF5 file handling, block iteration, or data movement.

#include <memory>   // std::make_unique
#include <string>

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string group, std::string dataset) {

    // Own the dataset through a smart pointer so it is released on every exit path
    auto ds = std::make_unique<BigDataStatMeth::hdf5Dataset>(filename, group, dataset, false);
    ds->openDataset();

    // Block-wise processing using BigDataStatMeth routines
    // ...

    // ds is automatically closed and released when it goes out of scope
}

See Developing Methods for complete examples in both R and C++.
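As a rough idea of what a block-wise statistical routine looks like in general, the sketch below computes column means and sample standard deviations (the quantities a `scale()`-style transformation needs) while visiting the rows one block at a time. It uses only the C++ standard library and an in-memory row-major `std::vector` as a stand-in for an on-disk dataset; `ColumnStats` and `blockwise_column_stats` are hypothetical names, and this is not the package's implementation. The running update is Welford's algorithm, chosen here for numerical stability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type (not part of BigDataStatMeth).
struct ColumnStats {
    std::vector<double> mean;
    std::vector<double> sd;  // sample standard deviation (n - 1 denominator)
};

// X is a row-major n x p matrix. Rows are visited in blocks of block_rows;
// in a disk-backed setting each block would be read from the file instead.
ColumnStats blockwise_column_stats(const std::vector<double>& X,
                                   std::size_t n, std::size_t p,
                                   std::size_t block_rows) {
    assert(block_rows > 0 && n > 1 && X.size() == n * p);
    std::vector<double> mean(p, 0.0), m2(p, 0.0);
    std::size_t seen = 0;  // rows processed so far, across all blocks

    for (std::size_t start = 0; start < n; start += block_rows) {
        const std::size_t rows = std::min(block_rows, n - start);
        for (std::size_t r = 0; r < rows; ++r) {
            ++seen;
            const double* row = X.data() + (start + r) * p;
            for (std::size_t j = 0; j < p; ++j) {
                // Welford update: running mean and sum of squared deviations.
                const double delta = row[j] - mean[j];
                mean[j] += delta / static_cast<double>(seen);
                m2[j] += delta * (row[j] - mean[j]);
            }
        }
    }

    ColumnStats out{mean, std::vector<double>(p, 0.0)};
    for (std::size_t j = 0; j < p; ++j) {
        out.sd[j] = std::sqrt(m2[j] / static_cast<double>(n - 1));
    }
    return out;
}
```

Only the two length-`p` accumulators persist between blocks, so the state carried across the loop does not grow with the number of rows.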

Documentation

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/

Getting Started — installation and first steps
Fundamentals — HDF5 storage and block-wise computing concepts
Workflows — complete analysis examples
Developing Methods — building new statistical methods using the R and C++ APIs
API Reference — complete function documentation (R and C++)

# List available vignettes
vignette(package = "BigDataStatMeth")

# View the main vignette
vignette("BigDataStatMeth")

Performance

BigDataStatMeth is designed for efficiency at scale:

Block-wise computation — process very large matrices with a controlled, fixed memory footprint
Parallel algorithms — multi-core support for matrix operations and decompositions
Optimized I/O — efficient HDF5 chunking and access patterns
File-space reuse — space released by removed or overwritten intermediate datasets is tracked and reused within the same file
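The fixed memory footprint mentioned in the list above can be illustrated with a block-wise cross-product. The sketch below accumulates X^T X over row blocks, so only one block of rows plus the p x p result is held at any time. It is a plain standard-library illustration under stated assumptions: the in-memory vector stands in for an HDF5 dataset, `blockwise_crossprod` is a hypothetical name, and the package's own routines are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of BigDataStatMeth): computes X^T X for a
// row-major n x p matrix by summing the contribution of each row block.
std::vector<double> blockwise_crossprod(const std::vector<double>& X,
                                        std::size_t n, std::size_t p,
                                        std::size_t block_rows) {
    assert(block_rows > 0 && X.size() == n * p);
    std::vector<double> XtX(p * p, 0.0);
    std::vector<double> block;  // reused buffer: at most block_rows * p values

    for (std::size_t start = 0; start < n; start += block_rows) {
        const std::size_t rows = std::min(block_rows, n - start);

        // Stand-in for reading one block of rows from disk.
        const auto first = X.begin() + static_cast<std::ptrdiff_t>(start * p);
        block.assign(first, first + static_cast<std::ptrdiff_t>(rows * p));

        // Add this block's contribution: XtX += block^T * block.
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t i = 0; i < p; ++i)
                for (std::size_t j = 0; j < p; ++j)
                    XtX[i * p + j] += block[r * p + i] * block[r * p + j];
    }
    return XtX;  // row-major p x p
}
```

The decomposition works because X^T X is the sum of B^T B over the row blocks B of X, so the result does not depend on how the rows are partitioned (up to floating-point rounding).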

Use Cases

BigDataStatMeth is suited for any analytical workflow that involves large matrix operations. Typical scenarios include:

Large-scale matrix computations — multiplication, crossproducts, and element-wise operations on matrices that exceed available RAM
Dimensionality reduction — PCA and SVD on wide or tall matrices stored on disk
Statistical inference — regression, Cholesky-based solvers, and correlation analysis at scale
Multi-dataset integration — combining and analyzing matrices across multiple data sources, with support for multi-omics workflows
Method development — building and prototyping new scalable statistical methods using the C++ infrastructure without reimplementing HDF5 management or block iteration

HDF5 Resource Management

HDF5-backed objects keep file handles open while they are in use. Objects can be closed individually with close(), and all open HDF5 handles managed by the package can be closed with hdf5_close_all().

close(X_h5)
hdf5_close_all()

After calling hdf5_close_all(), HDF5-backed objects that were open should be reopened before being used again. Calling gc() may also help trigger R finalizers for objects that are no longer referenced.
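For C++ code that manages its own handles, the usual way to get the same close-exactly-once behaviour is a scope guard. The sketch below is a generic standard-library illustration, not a BigDataStatMeth class: `ScopedHandle` is a hypothetical name, and the close action is supplied by the caller as a callback rather than tied to any HDF5 call.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical scope guard (not part of BigDataStatMeth): runs a caller-supplied
// close action once, either on an explicit close() or when the object is destroyed.
class ScopedHandle {
public:
    explicit ScopedHandle(std::function<void()> on_close)
        : on_close_(std::move(on_close)) {
        assert(on_close_);  // a close action is required
    }
    ~ScopedHandle() { close(); }

    // Non-copyable: two owners would close the same resource twice.
    ScopedHandle(const ScopedHandle&) = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;

    // Idempotent: a second call does nothing.
    void close() {
        if (open_) {
            open_ = false;
            on_close_();
        }
    }
    bool is_open() const { return open_; }

private:
    std::function<void()> on_close_;
    bool open_ = true;
};
```

Making `close()` idempotent means an explicit early close and the destructor can coexist without releasing the resource twice.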

Citation

If you use BigDataStatMeth in your research, please cite:

Pelegri-Siso D, Gonzalez JR (2026). BigDataStatMeth: Statistical Methods
for Big Data Using Block-wise Algorithms and HDF5 Storage.
R package version 2.0.0, https://github.com/isglobal-brge/BigDataStatMeth

BibTeX entry:

@Manual{bigdatastatmeth,
  title  = {BigDataStatMeth: Statistical Methods for Big Data Using Block-wise Algorithms and HDF5 Storage},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year   = {2026},
  note   = {R package version 2.0.0},
  url    = {https://github.com/isglobal-brge/BigDataStatMeth},
}

Contributing

Contributions are welcome. Please:

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Commit your changes (git commit -m 'Add new feature')
Push to the branch (git push origin feature/new-feature)
Open a Pull Request

Development Guidelines

Follow existing code style (Rcpp coding standards for C++, tidyverse style for R)
Add tests for new functionality
Update documentation — Roxygen2 for R functions, Doxygen for C++ headers
Run R CMD check before submitting

Getting Help

Documentation: https://isglobal-brge.github.io/BigDataStatMeth/
Issues: GitHub Issues

License

MIT License — see LICENSE file for details.

Authors

Dolors Pelegri-Siso, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal, Barcelona Institute for Global Health

Juan R. Gonzalez, Bioinformatics Research Group in Epidemiology (BRGE), ISGlobal, Barcelona Institute for Global Health

Acknowledgments

Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).
