Auto-download missing BUSCO lineage in annotate; quiet per-contig log in predict #72
Merged
added 4 commits
May 28, 2026 07:07
The annotate command builds the BUSCO lineage path from FUNANNOTATE2_DB and the species/odb_version, but unlike train and predict it never checked whether that path actually exists on disk before handing it to buscolite. If it doesn't, buscolite dies with a confusing "<path> is not a directory" error.

This shows up in Docker. The previous train/predict run inside the container downloads the lineage to /opt/funannotate2_db/<species>_odbXX, but that path is ephemeral: when the user launches a new container for the annotate step the lineage is gone, even though the host's predict_results/ still has the gene models from the earlier run.

Mirror the isdir-check + tarball download/extract block that already lives in predict.py and train.py, and fall back to a clear critical error if the download/extract somehow doesn't materialise the directory (rather than letting buscolite crash with its less informative message).

- funannotate2/annotate.py: add `download` and `runSubprocess` to the utilities import, then insert the download block right after the busco_model_path is constructed.
On an assembly with many contigs the "Successfully ran tools for <contig>: snap, glimmerhmm, augustus" line emitted once per contig inside abinitio_wrapper drowns the user-facing info log: a 1000+ contig draft assembly would push every other useful message off-screen.

The downstream "<tool> predictions filtered: N kept, M filtered" summary lines that follow the parallel run already convey aggregate per-tool success at info level, so the per-contig detail is redundant at info. Demote to debug so it's still available with --debug for troubleshooting individual contigs but stays out of the default log.

Failure and OOM lines are intentionally left at warning/error: when something breaks, the per-contig identity is essential context.

- funannotate2/predict.py: change `logger.info(...)` to `logger.debug(...)` for the per-contig success line in abinitio_wrapper.
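The effect of the demotion can be illustrated with a small standalone sketch. Nothing here is taken from predict.py: `make_logger`, `report_contig` and `run_demo` are hypothetical names, the contig names are invented, and the "10 kept, 2 filtered" counts are made-up placeholders for the summary line.

```python
import io
import logging


def make_logger(name, level, stream):
    """Build an isolated logger that writes '<LEVEL> <message>' lines to stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def report_contig(logger, contig, tools, failed=()):
    """Hypothetical stand-in for the per-contig reporting in abinitio_wrapper."""
    if failed:
        # Failures stay at warning so the contig name is always visible.
        logger.warning("Tools failed for %s: %s", contig, ", ".join(failed))
    # Success detail is only useful when troubleshooting, so it goes to debug.
    logger.debug("Successfully ran tools for %s: %s", contig, ", ".join(tools))


def run_demo(level):
    """Run the same four contigs at the given log level and return the log lines."""
    stream = io.StringIO()
    logger = make_logger(f"demo.{level}", level, stream)
    for i in range(1, 4):
        report_contig(logger, f"contig_{i}", ["snap", "glimmerhmm", "augustus"])
    report_contig(logger, "contig_4", ["snap"], failed=["augustus"])
    # Aggregate summary with invented counts, kept at info level.
    logger.info("augustus predictions filtered: 10 kept, 2 filtered")
    return stream.getvalue().splitlines()


default_log = run_demo(logging.INFO)   # what a user sees without --debug
debug_log = run_demo(logging.DEBUG)    # what a user sees with --debug
```

At info level only the warning and the summary survive, while the debug run also carries one success line per contig. That is the trade-off this commit is after.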
Replace the three identical copies of "compute lineage path, download tarball if missing, extract, clean up" in train.py, predict.py and annotate.py with a single ensure_busco_lineage(species, logger) helper in utilities.py.

- utilities.py: add ensure_busco_lineage(); imports env from .config.
- train.py: drop get_odb_version/download/load_json imports; call the helper once near the top of train() so busco_model_path is always defined for the params.json output, regardless of --training-set.
- predict.py: drop get_odb_version/download/load_json/runSubprocess imports; replace the inline block with a single helper call.
- annotate.py: drop get_odb_version/download/runSubprocess imports; replace the recently-added download block with a single helper call.

No behavior change beyond train.py now resolving the lineage up front (previously it was lazy and skipped when --training-set was supplied). The download is idempotent and re-uses the cached directory when present, so the only observable effect is a one-time download on a fresh DB if a user supplies their own training set.
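A helper of the kind described in this commit might look roughly like the sketch below. It is not the code in utilities.py. The function name `ensure_lineage_dir`, its parameters, the injected `fetch_tarball` callable and the `demo_odbXX` directory name are all assumptions chosen so the example runs offline. According to the commit message, the real helper takes `(species, logger)` and resolves the database location itself.

```python
import logging
import os
import tarfile
import tempfile


def ensure_lineage_dir(db_dir, lineage, fetch_tarball, logger):
    """Return the lineage directory under db_dir, fetching it if missing.

    fetch_tarball(lineage, dest_path) is a hypothetical callable that writes a
    .tar.gz holding a top-level <lineage>/ directory to dest_path.
    """
    lineage_path = os.path.join(db_dir, lineage)
    if os.path.isdir(lineage_path):
        return lineage_path  # cached copy present, so the call is idempotent

    logger.info("BUSCO lineage %s not found, downloading", lineage)
    os.makedirs(db_dir, exist_ok=True)
    tarball = os.path.join(db_dir, lineage + ".tar.gz")
    fetch_tarball(lineage, tarball)
    try:
        # The archive is assumed to come from a trusted source.
        with tarfile.open(tarball, "r:gz") as tar:
            tar.extractall(db_dir)
    finally:
        # Clean up the tarball whether or not extraction succeeded.
        if os.path.exists(tarball):
            os.remove(tarball)

    if not os.path.isdir(lineage_path):
        # Fail with a clear message instead of letting a later step crash.
        # Exiting here is an assumption about how the error is surfaced.
        logger.critical("BUSCO lineage directory %s is still missing", lineage_path)
        raise SystemExit(1)
    return lineage_path


calls = []


def fake_fetch(lineage, dest_path):
    """Offline stand-in for a download: builds a tiny tarball locally."""
    calls.append(lineage)
    with tempfile.TemporaryDirectory() as src:
        os.makedirs(os.path.join(src, lineage))
        with open(os.path.join(src, lineage, "dataset.cfg"), "w") as fh:
            fh.write("name=" + lineage + "\n")
        with tarfile.open(dest_path, "w:gz") as tar:
            tar.add(os.path.join(src, lineage), arcname=lineage)


logger = logging.getLogger("lineage_demo")
db_dir = tempfile.mkdtemp()
first = ensure_lineage_dir(db_dir, "demo_odbXX", fake_fetch, logger)
second = ensure_lineage_dir(db_dir, "demo_odbXX", fake_fetch, logger)
```

The second call returns the cached directory without invoking the fetcher again. This is why resolving the lineage up front in train() costs at most one download on a fresh database.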
Commit 4d55eed moved the 'Successfully ran tools for <contig>: ...' message from logger.info to logger.debug to avoid log pollution on multi-contig assemblies. Update the test that asserted the message landed on info so it now checks debug_messages instead, plus a negative assertion that nothing matching 'Successfully ran tools' is emitted at info level.
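The shape of such a test can be sketched as follows. This is an illustration rather than the project's actual test: `ListHandler`, `report_success` and the test function name are hypothetical, and only the `debug_messages` name and the asserted message prefix come from the commit message.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log records in memory so a test can inspect them by level."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.records = []

    def emit(self, record):
        self.records.append(record)

    def messages(self, levelno):
        return [r.getMessage() for r in self.records if r.levelno == levelno]


def report_success(logger, contig, tools):
    """Hypothetical stand-in for the success line in abinitio_wrapper."""
    logger.debug("Successfully ran tools for %s: %s", contig, ", ".join(tools))


def test_success_line_is_debug_only():
    logger = logging.getLogger("predict_log_test")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        report_success(logger, "contig_1", ["snap", "glimmerhmm", "augustus"])
    finally:
        logger.removeHandler(handler)

    debug_messages = handler.messages(logging.DEBUG)
    info_messages = handler.messages(logging.INFO)
    # Positive assertion: the message is emitted at debug level.
    assert any("Successfully ran tools" in m for m in debug_messages)
    # Negative assertion: nothing matching leaks out at info level.
    assert not any("Successfully ran tools" in m for m in info_messages)
    return debug_messages, info_messages


debug_messages, info_messages = test_success_line_is_debug_only()
```

Pairing the positive check with the negative one guards against the message being logged at both levels, which a debug-only assertion would not catch.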
Summary
Two small, unrelated user-facing fixes surfaced while exercising the dockerized pipeline end-to-end on Aspergillus nidulans.
1. Auto-download missing BUSCO lineage in `annotate`

The `annotate` command builds the BUSCO lineage path from `FUNANNOTATE2_DB` and the species + odb_version, but unlike `train` and `predict` it never checked whether that path actually exists on disk before handing it to buscolite. If it doesn't, buscolite dies with a confusing "<path> is not a directory" error.

This is easy to hit in Docker. The earlier `train`/`predict` run inside the container downloaded the lineage to `/opt/funannotate2_db/<species>_odbXX`, but that path is ephemeral: when the user launches a fresh container for the `annotate` step the lineage is gone, even though the host's `predict_results/` still has the gene models from the earlier run.

Mirror the `isdir`-check + tarball download/extract block that already lives in `predict.py` and `train.py`, and fall back to a clear critical error if the download/extract somehow doesn't materialise the directory (rather than letting buscolite crash with its less informative message).

- `funannotate2/annotate.py`: add `download` and `runSubprocess` to the `utilities` import, then insert the download block right after `busco_model_path` is constructed.

2. Demote per-contig success log in `predict` to debug

On an assembly with many contigs the "Successfully ran tools for <contig>: snap, glimmerhmm, augustus" line emitted once per contig inside `abinitio_wrapper` drowns the user-facing info log: a 1000+ contig draft assembly would push every other useful message off-screen.

The downstream `"<tool> predictions filtered: N kept, M filtered"` summary lines that follow the parallel run already convey aggregate per-tool success at info level, so the per-contig detail is redundant at info. Demote to `debug` so it's still available with `--debug` for troubleshooting individual contigs but stays out of the default log.

Failure and OOM lines are intentionally left at `warning`/`error`: when something breaks, the per-contig identity is essential context.

- `funannotate2/predict.py`: change `logger.info(...)` to `logger.debug(...)` for the per-contig success line in `abinitio_wrapper`.

Verification

- `python -c 'import ast; ast.parse(open("funannotate2/annotate.py").read())'` and the same for `predict.py`: syntax OK.
- `from funannotate2.annotate import annotate`: import OK.

Pull Request opened by Augment Code with guidance from the PR author