faster cactus-phast on fragmented references #1946

Open
glennhickey wants to merge 2 commits into master from phast-release

@glennhickey

Collaborator

Don't bother running on unaligned contigs.

Also: add a general BED selection option for cactus-phast.

glennhickey and others added 2 commits June 29, 2026 16:38
…ntigs

Restrict cactus-phast to the reference regions worth scoring so the chunker
doesn't waste time on contigs phyloP can't score:

- --bedRanges <ranges.bed>: restrict the analysis to the reference ranges in a
  BED file (same option name/format as cactus-hal2maf). The BED parser is
  factored into a shared maf_chunk.parse_bed_ranges() now imported by both
  tools. Ranges are clamped to contig length and overlapping/touching ranges
  are merged (so the per-base wig has no duplicate positions); out-of-range
  intervals are warned about.

- Automatic exclusion of reference contigs unaligned to anything. phast_setup
  (the one job that already localizes the HAL -- no extra HAL copy) runs
  halAlignedExtract on the reference, a single scan of its top segments, to
  find contigs with at least one aligned base and drops the rest at planning
  time. Default on; --keepUnalignedContigs disables it. Only applied for a leaf
  reference (halAlignedExtract reports alignment to the parent only, so an
  internal/ancestral reference is skipped); degrades to "process all" if the
  scan fails or returns nothing.
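
For readers skimming the two bullets above, here is a minimal sketch of the planning-time logic they describe. It is not the code in this PR: the function names, the tuple layout, and the warning strings are hypothetical, BED intervals are assumed to be 0-based half-open, and skipping ranges on contigs missing from the reference is an assumption the commit message does not state. The real parser lives in maf_chunk.parse_bed_ranges(), whose signature is not shown here.

```python
from collections import defaultdict


def clamp_and_merge(ranges, contig_lengths):
    """Clamp (contig, start, end) ranges to contig length, then merge
    overlapping or touching ranges. Returns (merged_by_contig, warnings)."""
    by_contig = defaultdict(list)
    warnings = []
    for contig, start, end in ranges:
        length = contig_lengths.get(contig)
        if length is None:
            # Assumption: unknown contigs are skipped with a warning.
            warnings.append(f"{contig}: not in reference, skipped")
            continue
        c_start, c_end = max(0, start), min(end, length)
        if (c_start, c_end) != (start, end):
            warnings.append(f"{contig}:{start}-{end} is out of range")
        if c_start < c_end:
            by_contig[contig].append((c_start, c_end))

    merged = {}
    for contig, intervals in by_contig.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous range so no
                # position is emitted twice.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[contig] = [tuple(iv) for iv in out]
    return merged, warnings


def select_contigs(all_contigs, aligned_contigs):
    """Keep contigs with at least one aligned base. If the scan failed or
    returned nothing (None or empty), fall back to processing everything."""
    if not aligned_contigs:
        return list(all_contigs)
    return [c for c in all_contigs if c in aligned_contigs]
```

Merging touching ranges as well as overlapping ones matters for the wig output: two adjacent ranges scored separately are harmless, but two overlapping ones would emit the shared positions twice.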

Co-Authored-By: Claude Opus 4.8 <[email protected]>

The chunker runs --chunkCores taffy|mafDuplicateFilter|bgzip pipelines
concurrently but requested no memory, so it ran at Toil's ~2 GiB default
regardless of -j. On a many-way MAF at -j 32 that starves the pipelines: a
taffy child is OOM-killed and the truncated stream surfaces downstream as
mafDuplicateFilter "premature end to maf file". A 577-way run OOMs at 2 GiB
but completes under 4 GiB at -j 32 (~100-130 MiB/pipeline), so request
128 MiB/core + a 2 GiB base (~6 GiB at -j 32), overridable with the new
--chunkMemory; --doubleMem still covers pathologically dense regions.
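
The sizing rule above works out as follows. This is only a sketch of the arithmetic stated in the commit message; the function name and the override argument are hypothetical, and how --chunkMemory is actually parsed and passed to Toil is not shown here.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

BASE_BYTES = 2 * GIB        # fixed base from the commit message
PER_CORE_BYTES = 128 * MIB  # per-pipeline allowance from the commit message


def chunk_memory_request(chunk_cores, override_bytes=None):
    """Return the chunker's memory request in bytes.

    override_bytes stands in for a user-supplied --chunkMemory value;
    when it is given, it replaces the computed default.
    """
    if override_bytes is not None:
        return override_bytes
    return BASE_BYTES + chunk_cores * PER_CORE_BYTES


# At -j 32: 2 GiB + 32 * 128 MiB = 6 GiB
print(chunk_memory_request(32) / GIB)
```

The 128 MiB per core sits at the top of the observed ~100-130 MiB per pipeline, so the 2 GiB base is what provides the headroom over the 4 GiB at which the 577-way run was seen to complete.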

Co-Authored-By: Claude Opus 4.8 <[email protected]>