Auto-download missing BUSCO lineage in annotate; quiet per-contig log in predict#72

Merged
nextgenusfs merged 4 commits into main from fix-logging on May 28, 2026

@nextgenusfs

Owner

Summary

Two small, unrelated user-facing fixes surfaced while exercising the dockerized pipeline end-to-end on Aspergillus nidulans.

1. Auto-download missing BUSCO lineage in annotate

The annotate command builds the BUSCO lineage path from FUNANNOTATE2_DB and the species + odb_version, but unlike train and predict it never checked whether that path actually exists on disk before handing it to buscolite. If it doesn't, buscolite dies with this confusing error:

[May 28 01:59 PM] BUSCOlite [conserved ortholog] search using aspergillus models
[May 28 01:59 PM] Error: /opt/funannotate2_db/aspergillus_odb12 is not a directory

This is easy to hit in docker. The earlier train/predict run inside the container downloaded the lineage to /opt/funannotate2_db/<species>_odbXX, but that path is ephemeral — when the user launches a fresh container for the annotate step the lineage is gone, even though the host's predict_results/ still has the gene models from the earlier run.

Mirror the isdir-check + tarball download/extract block that already lives in predict.py and train.py, and fall back to a clear critical error if the download/extract somehow doesn't materialise the directory (rather than letting buscolite crash with its less informative message).

  • funannotate2/annotate.py: add download and runSubprocess to the utilities import, then insert the download block right after busco_model_path is constructed.
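The check-then-download flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the block in annotate.py: the function name `ensure_lineage_dir`, the `fetch` callback, and the assumed tarball layout (a single top-level `<species>_odbXX/` directory) are hypothetical, whereas the real code relies on funannotate2's own `download` and `runSubprocess` utilities.

```python
import logging
import os
import tarfile


def ensure_lineage_dir(db_dir, lineage, fetch, logger):
    """Return the lineage directory, downloading and extracting it if missing.

    Hypothetical sketch: `fetch(dest_path)` stands in for whatever writes the
    lineage tarball to `dest_path`; it is not a funannotate2 function.
    """
    model_path = os.path.join(db_dir, lineage)
    if os.path.isdir(model_path):
        return model_path  # already cached on disk, nothing to do

    os.makedirs(db_dir, exist_ok=True)
    tarball = os.path.join(db_dir, f"{lineage}.tar.gz")
    logger.info("BUSCO lineage %s not found, downloading", lineage)
    fetch(tarball)
    # Assumes the tarball unpacks to a top-level directory named `lineage`.
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(db_dir)
    os.remove(tarball)  # clean up the archive once extracted

    if not os.path.isdir(model_path):
        # Fail early with a clear message instead of letting the
        # downstream tool report "<path> is not a directory".
        logger.critical("Could not install BUSCO lineage at %s", model_path)
        raise SystemExit(1)
    return model_path
```

Because the directory check comes first, calling the function again on a populated database directory returns immediately without touching the network, which is the behaviour a fresh container needs only once.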

2. Demote per-contig success log in predict to debug

On an assembly with many contigs the

[…] Successfully ran tools for contig_4.fasta: snap, glimmerhmm, augustus
[…] Successfully ran tools for contig_5.fasta: snap, glimmerhmm, augustus
[…] Successfully ran tools for contig_6.fasta: snap, glimmerhmm, augustus
[…]

line emitted once per contig inside abinitio_wrapper drowns the user-facing info log — a 1000+ contig draft assembly would push every other useful message off-screen.

The downstream "<tool> predictions filtered: N kept, M filtered" summary lines that follow the parallel run already convey aggregate per-tool success at info level, so the per-contig detail is redundant at info. Demote to debug so it's still available with --debug for troubleshooting individual contigs but stays out of the default log.

Failure and OOM lines are intentionally left at warning/error — when something breaks the per-contig identity is essential context.

  • funannotate2/predict.py: change logger.info(...) to logger.debug(...) for the per-contig success line in abinitio_wrapper.
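The effect of the level change can be illustrated with the standard `logging` module. This is only a sketch: `report_contig`, the logger name, and the wording of the failure message are hypothetical and are not taken from abinitio_wrapper.

```python
import logging

logger = logging.getLogger("predict_sketch")


def report_contig(contig, tools, ok=True):
    """Log the outcome of the ab initio tools for one contig.

    Hypothetical helper for illustration; not the body of abinitio_wrapper.
    """
    if ok:
        # Success detail only matters when troubleshooting, so log at debug.
        logger.debug("Successfully ran tools for %s: %s", contig, ", ".join(tools))
    else:
        # Failures stay visible at the default level, with the contig name.
        # The message text here is made up for the example.
        logger.warning("Tools failed for %s: %s", contig, ", ".join(tools))
```

With the logger at its usual info level, a thousand successful contigs produce no output from this function, while a single failing contig still shows up by name.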

Verification

  • python -c 'import ast; ast.parse(open("funannotate2/annotate.py").read())' and the same for predict.py — syntax OK.
  • from funannotate2.annotate import annotate — import OK.

Pull Request opened by Augment Code with guidance from the PR author

Jon Palmer added 4 commits May 28, 2026 07:07
The annotate command builds the BUSCO lineage path from FUNANNOTATE2_DB
and the species/odb_version, but unlike train and predict it never
checked whether that path actually exists on disk before handing it
to buscolite. If it doesn't, buscolite dies with a confusing
"<path> is not a directory" error.

This shows up in docker. The previous train/predict run inside the
container downloads the lineage to /opt/funannotate2_db/<species>_odbXX,
but that path is ephemeral — when the user launches a new container
for the annotate step the lineage is gone, even though the host's
predict_results/ still has the gene models from the earlier run.

Mirror the isdir-check + tarball download/extract block that already
lives in predict.py and train.py, and fall back to a clear critical
error if the download/extract somehow doesn't materialise the
directory (rather than letting buscolite crash with its less
informative message).

- funannotate2/annotate.py: add `download` and `runSubprocess` to the
  utilities import, then insert the download block right after the
  busco_model_path is constructed.

On an assembly with many contigs the "Successfully ran tools for
<contig>: snap, glimmerhmm, augustus" line emitted once per contig
inside abinitio_wrapper drowns the user-facing info log — a 1000+
contig draft assembly would push every other useful message off-screen.

The downstream "<tool> predictions filtered: N kept, M filtered"
summary lines that follow the parallel run already convey aggregate
per-tool success at info level, so the per-contig detail is redundant
at info. Demote to debug so it's still available with --debug for
troubleshooting individual contigs but stays out of the default log.

Failure and OOM lines are intentionally left at warning/error — when
something breaks the per-contig identity is essential context.

- funannotate2/predict.py: change `logger.info(...)` to `logger.debug(...)`
  for the per-contig success line in abinitio_wrapper.

Replace the three identical copies of "compute lineage path, download
tarball if missing, extract, clean up" in train.py, predict.py and
annotate.py with a single ensure_busco_lineage(species, logger) helper
in utilities.py.

- utilities.py: add ensure_busco_lineage(); imports env from .config.
- train.py: drop get_odb_version/download/load_json imports; call the
  helper once near the top of train() so busco_model_path is always
  defined for the params.json output, regardless of --training-set.
- predict.py: drop get_odb_version/download/load_json/runSubprocess
  imports; replace the inline block with a single helper call.
- annotate.py: drop get_odb_version/download/runSubprocess imports;
  replace the recently-added download block with a single helper call.

No behavior change beyond train.py now resolving the lineage up front
(previously it was lazy and skipped when --training-set was supplied);
the download is idempotent and re-uses the cached directory when
present, so the only observable effect is a one-time download on a
fresh DB if a user supplies their own training set.

Commit 4d55eed moved the 'Successfully ran tools for <contig>: ...'
message from logger.info to logger.debug to avoid log pollution on
multi-contig assemblies. Update the test that asserted the message
landed on info so it now checks debug_messages instead, plus a
negative assertion that nothing matching 'Successfully ran tools' is
emitted at info level.
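The shape of such a test might look like the sketch below. It assumes a recording logger that exposes `info_messages` and `debug_messages` lists; the `RecordingLogger` class and the `log_contig_success` stand-in are hypothetical and do not reproduce the project's actual test fixtures.

```python
class RecordingLogger:
    """Minimal stand-in logger that records messages per level."""

    def __init__(self):
        self.info_messages = []
        self.debug_messages = []

    def info(self, msg):
        self.info_messages.append(msg)

    def debug(self, msg):
        self.debug_messages.append(msg)


def log_contig_success(logger, contig, tools):
    # Stand-in for the per-contig success line in abinitio_wrapper.
    logger.debug(f"Successfully ran tools for {contig}: {', '.join(tools)}")


def test_success_message_is_debug_only():
    logger = RecordingLogger()
    log_contig_success(logger, "contig_4.fasta", ["snap", "glimmerhmm", "augustus"])
    # Positive assertion: the message is recorded at debug level.
    assert any("Successfully ran tools" in m for m in logger.debug_messages)
    # Negative assertion: nothing matching it reaches info level.
    assert not any("Successfully ran tools" in m for m in logger.info_messages)
```

The negative assertion is what guards against the line being promoted back to info by accident.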
@nextgenusfs nextgenusfs marked this pull request as ready for review May 28, 2026 14:21
@nextgenusfs nextgenusfs merged commit 0c1307e into main May 28, 2026
5 checks passed