Address streaming writes and p-value correction conflicts #128

@araikes

Challenge

When p-value correction is requested (the default: correct.p.value.terms = c("fdr")), all results must be held in memory. For millions of elements × dozens of statistics columns, this data frame can be very large. The correction itself via stats::p.adjust() requires the complete vector of p-values across all elements:

df_out[[tempstr.corrected]] <- stats::p.adjust(df_out[[tempstr.raw]], method = methodstr)
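To make the constraint concrete, here is a small sketch (toy p-values and a made-up chunking, not taken from the package) comparing a global FDR correction with a hypothetical chunk-by-chunk correction:

```r
# Toy p-values standing in for one p-value column across six elements.
p_all <- c(0.001, 0.01, 0.02, 0.03, 0.04, 0.5)

# Global correction over the complete vector (what the code above does).
p_global <- stats::p.adjust(p_all, method = "fdr")

# Hypothetical streaming shortcut: correct two chunks of three independently.
chunk_id <- rep(1:2, each = 3)
p_chunked <- unlist(
  lapply(split(p_all, chunk_id), stats::p.adjust, method = "fdr"),
  use.names = FALSE
)

# FALSE: the adjusted values depend on the rank and count across all elements.
chunks_match_global <- isTRUE(all.equal(p_global, p_chunked))
```

The two vectors differ, so correcting each streamed chunk on its own is not a valid substitute for correcting the full vector.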

This means you cannot do a fully streaming, low-memory run if you want FDR correction — the entire results matrix must exist in memory simultaneously. As a result, when streaming writes and p-value corrections are both active, results are written twice.

# First: uncorrected chunks streamed during the loop
writer <- .results_stream_write_block(writer, chunk_df)

# Then: full corrected results overwritten after the loop
if (!is.null(writer) && (need_term_correction || need_model_correction)) {
  writeResults(
    fn.output = write_results_file,
    df.output = df_out,
    analysis_name = write_results_name,
    overwrite = TRUE
  )
}

This doubles the I/O cost.

Consider a two-pass approach to FDR correction:

  1. During the main element-wise loop, stream full results to HDF5 as currently done, but additionally collect only the p-value columns into memory using a pre-allocated matrix.
  2. After the model fitting loop, correct the p-values using stats::p.adjust() on the in-memory p-value matrix, then patch the HDF5 file with the corrected columns.
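The two steps above could look roughly like the following sketch. It is not the package's implementation: `stream_block()` and `patch_block()` are hypothetical stand-ins that keep blocks in an in-memory list instead of an HDF5 file, the model output is simulated, and the `.fdr` column suffix is an assumed naming scheme.

```r
set.seed(1)
n_elements <- 1000L
chunk_size <- 250L
p_cols <- c("age.p.value", "sex.p.value", "model.p.value")

# Hypothetical stand-ins for the HDF5 writer: a real implementation would
# append to, and later patch, datasets in the results file.
disk <- new.env()
disk$blocks <- list()
stream_block <- function(block) {
  disk$blocks[[length(disk$blocks) + 1L]] <- block
}
patch_block <- function(i, corrected) {
  disk$blocks[[i]] <- cbind(disk$blocks[[i]], corrected)
}

# Pass 1: stream full results, keep only p-values in a pre-allocated matrix.
p_raw <- matrix(NA_real_, nrow = n_elements, ncol = length(p_cols),
                dimnames = list(NULL, p_cols))
starts <- seq(1L, n_elements, by = chunk_size)
for (s in starts) {
  rows <- s:min(s + chunk_size - 1L, n_elements)
  # Simulated results for this chunk (placeholder for the real model fits).
  chunk_df <- data.frame(
    element_id = rows,
    age.estimate = rnorm(length(rows)),
    age.p.value = runif(length(rows)),
    sex.p.value = runif(length(rows)),
    model.p.value = runif(length(rows))
  )
  stream_block(chunk_df)
  p_raw[rows, ] <- as.matrix(chunk_df[, p_cols])
}

# Pass 2: correct each p-value column over all elements, then patch blocks.
p_fdr <- apply(p_raw, 2, stats::p.adjust, method = "fdr")
colnames(p_fdr) <- paste0(p_cols, ".fdr")
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1L, n_elements)
  patch_block(i, p_fdr[rows, , drop = FALSE])
}
```

In this sketch only `p_raw` and `p_fdr` span all elements in memory; the full per-chunk results are handed to the writer once and are never rewritten, only extended with the corrected columns.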

Theoretical benefit:

Consider an example analysis with 1 million elements. The formula FD ~ age + sex + site produces an intercept and 3 predictors (4 terms), with var.terms = c("estimate", "statistic", "p.value") and var.model = c("adj.r.squared", "p.value"). The output then contains the following columns:

| Column type | Count | Description |
|---|---|---|
| element_id | 1 | Element index |
| <term>.estimate | 4 | Per-term estimates |
| <term>.statistic | 4 | Per-term t-statistics |
| <term>.p.value | 4 | Per-term p-values |
| model.adj.r.squared | 1 | Model R² |
| model.p.value | 1 | Model F-test p-value |
| Total columns | 15 | |
| P-value columns only | 5 | 4 term + 1 model |

| Approach | In-memory footprint at 1M elements |
|---|---|
| Current (full df_out) | 1M × 15 × 8 bytes = 120 MB |
| Two-pass (p-values only) | 1M × 5 × 8 bytes = 40 MB |
| Savings | 67% |

With full.outputs = TRUE the column count grows to roughly 30 or more, which makes the savings even larger: the number of p-value columns stays the same while the total number of columns grows.
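The footprint arithmetic can be reproduced with a few lines of R. This is only a back-of-the-envelope sketch: it assumes every column is stored as 8-byte doubles, and the figure of 30 columns for full.outputs = TRUE is an assumed round number for illustration.

```r
# Footprint of a dense block of doubles: rows x columns x 8 bytes, in MB.
footprint_mb <- function(n_elements, n_cols, bytes_per_value = 8) {
  n_elements * n_cols * bytes_per_value / 1e6
}

n_elements <- 1e6
n_terms <- 4                       # intercept + age + sex + site
total_cols <- 1 + 3 * n_terms + 2  # element_id + term stats + model stats
p_cols <- n_terms + 1              # term p-values + model p-value

current_mb <- footprint_mb(n_elements, total_cols)
two_pass_mb <- footprint_mb(n_elements, p_cols)
savings_pct <- round(100 * (1 - two_pass_mb / current_mb))

# Assumed total of 30 columns with full.outputs = TRUE (illustrative only).
full_mb <- footprint_mb(n_elements, 30)
savings_full_pct <- round(100 * (1 - two_pass_mb / full_mb))
```

Under these assumptions the default case gives 120 MB versus 40 MB (67% saved), and the 30-column case gives 240 MB versus 40 MB (83% saved).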
