VirBench is a benchmark for evaluating how well different AI agents and tools retrieve viral sequence data. Each agent is given a natural-language query describing a pathogen and a set of filters (host, date range, sequence length, geographic location, etc.) and asked to return the number of matching sequences. Results are compared against a manually curated set of expected counts based on the NCBI Virus web UI.
NOTE: The full VirBench benchmark dataset is not publicly released to prevent potential data leakage into large language model training corpora, which could compromise the validity of evaluation results by enabling models to retrieve or memorize answers. Researchers interested in accessing the benchmark are encouraged to contact the corresponding authors to request access.
If you use VirBench or gget virus in a publication, please cite our preprint:
Nasri, F., Gurev, S., Varilly, P., Ramesh, K., O’Leary, N. A., Cool, J., Renard, B. Y., Sabeti, P. C., & Luebbert, L. (2026). Deterministic access to global viral sequence data enables robust agentic scientific discovery. arXiv:2606.06749.
VirBench/
├── config.yaml  # Central configuration (benchmark CSV path, etc.)
├── requirements.txt  # Python dependencies
│
├── docs/  # Benchmark inputs and reference data
│   ├── virbench_v{VERSION}_DO_NOT_SHARE.csv  # Benchmark queries and expected counts - CONTACT US FOR ACCESS
│   └── gget_virus_docs.md  # gget virus module documentation (provided to agents)
│
├── src/  # Benchmark scripts and shared utilities
│   ├── utils.py  # Shared data classes, CSV parsing, query building, LLM extraction
│   ├── benchmark_edison_analysis.py  # Edison Analysis agent benchmark
│   ├── benchmark_biomni.py  # Biomni data-lake agent benchmark
│   ├── benchmark_gget_virus.py  # gget virus benchmark (no agent)
│   ├── benchmark_claude.py  # Claude benchmark (Anthropic API + code execution)
│   ├── benchmark_gpt.py  # GPT benchmark (OpenAI Responses API + code execution)
│   └── rerun_errors.py  # Rerun failed runs for all technologies
│
├── results/  # Benchmark output (CSV summaries + JSON reports)
│   ├── edison_analysis/  # Edison agent results
│   │   └── <date>/  # Results grouped by run date (YYYY-MM-DD)
│   ├── biomni/  # Biomni agent results
│   │   └── <model>/  # Results grouped by LLM model name
│   ├── gget_virus/  # gget virus results
│   ├── claude/  # Claude results
│   │   └── <model>/  # Results grouped by model name
│   └── gpt/  # GPT results
│       └── <model>/  # Results grouped by model name
│
├── notebooks/  # Analysis and visualization notebooks
│
└── figures/  # Plots generated in notebooks
# Add the virbench_v{VERSION}_DO_NOT_SHARE.csv file (CONTACT US FOR ACCESS) into the docs folder
# (Optional) To use a different benchmark version, edit config.yaml
# Set up environment
cp .env.example .env # add your API keys
pip install -r requirements.txt
# Run a quick test (first query only, 3 runs)
python src/benchmark_claude.py --test
python src/benchmark_gpt.py --test
python src/benchmark_edison_analysis.py --test
python src/benchmark_biomni.py --test
python src/benchmark_gget_virus.py --test
# Full benchmark run
python src/benchmark_claude.py
# With gget virus
python src/benchmark_claude.py --use-gget-virus
# Custom model / output directory
# By default, results are saved in a model subdirectory (e.g., results/gpt/gpt-5.2-pro/)
python src/benchmark_claude.py --model claude-sonnet-4-20250514 -o results/claude
python src/benchmark_gpt.py --model gpt-5.5-pro -o results/gpt

All queries are defined in docs/virbench_v{VERSION}_DO_NOT_SHARE.csv. Each row specifies:
- query_id and pathogen — identifies the query
- tax_id — NCBI taxonomy ID for the virus
- expected_count — the ground-truth sequence count based on manual queries through the NCBI Virus web UI
- Filter columns — host, geographic_location, date ranges, sequence length bounds, refseq_only, annotated, segment, lineage, and many others
All queries include a maximum release date to increase the stability of expected counts over time.
The shared build_query() function in src/utils.py converts each row into a natural-language prompt (e.g. "Retrieve viral sequences from NCBI for TaxID 3052518 (CCHFV) with the following filters: geographic location: Africa, collected on or after 2020-01-01, minimum sequence length: 11000 bp...").
The structure of the prompt produced by build_query() varies depending on the query type:
- Standard queries (most rows): the prompt opens with "Retrieve viral sequences from NCBI for TaxID {tax_id} ({pathogen})" followed by any filter criteria (host, dates, sequence length, etc.).
- Accession queries (is_accession=TRUE): instead of searching by taxonomy, the prompt asks "Retrieve the sequence(s) that belong to NCBI accession ID {tax_id}". In these rows, the tax_id column holds an accession ID (e.g. NC_105106.1) rather than a numeric taxonomy ID.
- All-virus queries (no tax_id or pathogen): the prompt opens with "Retrieve all viral sequences (for any virus) from NCBI" and relies entirely on the filter columns to narrow results.
Filter columns (host, date ranges, sequence length, geographic location, lineage, etc.) are converted into human-readable phrases and appended after the opening sentence. Boolean flags like vaccine_strain and lab_passaged produce phrases such as "exclude vaccine strains" or "only lab-passaged samples". Columns in the exclude set (is_sars_cov2, is_alphainfluenza, is_accession) are used for routing logic but are not included in the prompt text.
When gget virus mode is enabled, the instruction "Use the gget virus module installable with 'pip install gget==0.30.3'. The documentation is attached." is prepended to the prompt.
Example — standard query (query 1, CCHFV from Africa, without gget virus):
Retrieve viral sequences from NCBI for TaxID 3052518 (Crimean-Congo Hemorrahagic fever virus (CCHFV)) that adhere to the following criteria: geographic location of sample collection: Africa, collected on or after 2020-01-01, collected on or before 2025-12-31, released on or after 2020-01-01, released on or before 2025-12-31, minimum sequence length: 11000 bp, maximum 10 ambiguous characters (N's). Return only the count of sequences that match these criteria.
Example — same query with gget virus:
Use the gget virus module installable with 'pip install gget==0.30.3'. The documentation is attached. Retrieve viral sequences from NCBI for TaxID 3052518 (Crimean-Congo Hemorrahagic fever virus (CCHFV)) that adhere to the following criteria: geographic location of sample collection: Africa, collected on or after 2020-01-01, collected on or before 2025-12-31, released on or after 2020-01-01, released on or before 2025-12-31, minimum sequence length: 11000 bp, maximum 10 ambiguous characters (N's). Return only the count of sequences that match these criteria.
Example — accession query (query 89):
Retrieve the sequence(s) that belong to NCBI accession ID NC_105106.1. Return only the count of sequences that match these criteria.
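The three prompt shapes and the optional gget virus prefix can be illustrated with a minimal sketch. This is not the actual build_query() from src/utils.py: the function name sketch_build_query, its parameters, and the assumption that filter phrases arrive as ready-made strings are hypothetical. The wording of the opening sentences, the "that adhere to the following criteria" connector, and the closing sentence are taken from the examples above.

```python
# Hypothetical sketch of the prompt assembly described above; the real
# build_query() in src/utils.py may be structured differently.
GGET_PREFIX = (
    "Use the gget virus module installable with 'pip install gget==0.30.3'. "
    "The documentation is attached. "
)
CLOSING = " Return only the count of sequences that match these criteria."


def sketch_build_query(tax_id=None, pathogen=None, is_accession=False,
                       phrases=(), use_gget_virus=False):
    """Assemble a prompt from an opening sentence and filter phrases."""
    if is_accession:
        # Accession rows store the accession ID in the tax_id column.
        opening = f"Retrieve the sequence(s) that belong to NCBI accession ID {tax_id}"
    elif tax_id is None and pathogen is None:
        opening = "Retrieve all viral sequences (for any virus) from NCBI"
    else:
        opening = f"Retrieve viral sequences from NCBI for TaxID {tax_id} ({pathogen})"

    if phrases:
        opening += " that adhere to the following criteria: " + ", ".join(phrases)

    prompt = opening + "." + CLOSING
    return GGET_PREFIX + prompt if use_gget_virus else prompt


if __name__ == "__main__":
    print(sketch_build_query(tax_id="NC_105106.1", is_accession=True))
    print(sketch_build_query(
        tax_id=3052518,
        pathogen="CCHFV",
        phrases=["geographic location of sample collection: Africa",
                 "collected on or after 2020-01-01"],
        use_gget_virus=True,
    ))
```

Routing columns such as is_accession only select the opening sentence in this sketch; they never appear in the prompt text, which matches the behavior described for the exclude set.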
| Agent | Script | How it works |
|---|---|---|
| Edison Analysis | benchmark_edison_analysis.py | Proprietary agent via Edison client SDK. Queries are sent as analysis jobs. |
| Biomni | benchmark_biomni.py | Data-lake agent accessed via HTTP API. Uses claude-sonnet-4-20250514 as its LLM (configurable via --llm). |
| gget virus | benchmark_gget_virus.py | Direct API call to gget.virus() with no LLM involved. Serves as a deterministic baseline. |
| Claude | benchmark_claude.py | Claude via the Anthropic Messages API with tool use (default: claude-sonnet-4-20250514, configurable via --model). Tools: execute_python (local code execution) and web_search (server-side web search). Optionally enhanced with K-Dense scientific skills (see below). Claude Sonnet 4 (claude-sonnet-4-20250514) is the latest publicly available Anthropic model that can be used for this evaluation, as subsequent models have biosafety-related access restrictions that prevent viral sequence retrieval. To evaluate Anthropic's latest frontier model, we obtained special research access from Anthropic to run Claude Opus 4.7 without these restrictions. |
| GPT | benchmark_gpt.py | GPT via the OpenAI Responses API with tool use (default: gpt-5.2-pro, configurable via --model). Tools: execute_python (local code execution) and web_search_preview (server-side web search). |
Both the Claude and GPT benchmarks call their respective APIs with two tools:
- Web search (server-side): a built-in tool provided by each API (web_search_20250305 for Claude, web_search_preview for GPT). The model can search the web for API documentation, examples, or any other information. Search execution and result retrieval are handled entirely by the provider; the benchmark script does not need to process these tool calls.
- execute_python (client-side): a custom function tool that runs model-generated Python code locally in a subprocess with a 120-second timeout. The script executes the code and returns stdout/stderr (truncated to 10K chars) as the tool result.
The agentic loop sends the query, lets the model call tools (up to 25 turns), and collects the final integer response. This gives the model full autonomy to search for documentation, install packages (e.g. pip install gget), write multi-step scripts, and retry on errors.
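A client-side tool of this kind can be sketched as follows. This is a minimal illustration, not the benchmark's own implementation: the function name run_python_snippet, the timeout message, and the choice to concatenate stdout and stderr before truncating are assumptions. The 120-second timeout and 10K-character limit come from the description above.

```python
import subprocess
import sys


def run_python_snippet(code, timeout_s=120, max_chars=10_000):
    """Run model-generated code in a fresh interpreter and return its output.

    Hypothetical helper: stdout and stderr are concatenated and truncated so
    the tool result stays small enough to send back to the model.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout_s} seconds"
    return (proc.stdout + proc.stderr)[:max_chars]


if __name__ == "__main__":
    print(run_python_snippet("print(6 * 7)"))
```

Running the code in a separate interpreter process keeps a crash or an infinite loop in generated code from taking down the benchmark script itself.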
For safety reasons, the Claude and GPT benchmarks restrict outbound network access from execute_python to a configurable domain allowlist. Before each code execution, a socket-level guard is injected that monkey-patches socket.getaddrinfo and socket.create_connection to block connections to any host not on the allowlist. This catches all common Python HTTP libraries (requests, urllib, httpx, etc.) at the socket layer. Raw IP connections are also blocked.
The default allowlist permits only NCBI and PyPI domains:
| Domain | Purpose |
|---|---|
| ncbi.nlm.nih.gov | NCBI web and FTP |
| eutils.ncbi.nlm.nih.gov | NCBI E-utilities API |
| api.ncbi.nlm.nih.gov | NCBI Datasets API |
| ftp.ncbi.nlm.nih.gov | NCBI FTP |
| pypi.org | pip package index |
| files.pythonhosted.org | pip package downloads |
To override the default allowlist, set the EXECUTE_PYTHON_DOMAIN_ALLOWLIST environment variable to a comma-separated list of domains:
export EXECUTE_PYTHON_DOMAIN_ALLOWLIST="ncbi.nlm.nih.gov,eutils.ncbi.nlm.nih.gov,pypi.org"

Subdomain matching is supported: allowing ncbi.nlm.nih.gov also permits eutils.ncbi.nlm.nih.gov.
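The allowlist logic can be sketched as below. This is an illustrative sketch under stated assumptions, not the guard the benchmark injects: the helper names (load_allowlist, is_host_allowed, install_guard) are hypothetical, and only socket.getaddrinfo is patched here, whereas the benchmark is described as also patching socket.create_connection.

```python
import ipaddress
import os
import socket

DEFAULT_ALLOWLIST = (
    "ncbi.nlm.nih.gov", "eutils.ncbi.nlm.nih.gov", "api.ncbi.nlm.nih.gov",
    "ftp.ncbi.nlm.nih.gov", "pypi.org", "files.pythonhosted.org",
)


def load_allowlist(env=os.environ):
    """Read the comma-separated override, falling back to the defaults."""
    raw = env.get("EXECUTE_PYTHON_DOMAIN_ALLOWLIST", "")
    domains = [d.strip().lower() for d in raw.split(",") if d.strip()]
    return domains or list(DEFAULT_ALLOWLIST)


def is_host_allowed(host, allowlist):
    """Allow exact matches and subdomains; reject raw IP addresses."""
    if isinstance(host, bytes):
        host = host.decode("ascii", "ignore")
    host = str(host).lower().rstrip(".")
    try:
        ipaddress.ip_address(host)
        return False  # raw IP connections are blocked
    except ValueError:
        pass
    return any(host == d or host.endswith("." + d) for d in allowlist)


def install_guard(allowlist):
    """Monkey-patch socket.getaddrinfo; returns the original for restoring."""
    original = socket.getaddrinfo

    def guarded_getaddrinfo(host, *args, **kwargs):
        if not is_host_allowed(host, allowlist):
            raise PermissionError(f"Network access to {host!r} is not allowed")
        return original(host, *args, **kwargs)

    socket.getaddrinfo = guarded_getaddrinfo
    return original


if __name__ == "__main__":
    allowed = load_allowlist()
    print(is_host_allowed("eutils.ncbi.nlm.nih.gov", ["ncbi.nlm.nih.gov"]))
    print(is_host_allowed("example.com", allowed))
```

Matching on a leading dot (host.endswith("." + d)) is what keeps a lookalike such as evilncbi.nlm.nih.gov from passing as a subdomain of ncbi.nlm.nih.gov.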
Each benchmark script follows the same pattern:
- Parse the benchmark CSV (path defined in config.yaml) into query configurations
- For each query, run NUM_RUNS (default 3) independent trials:
  - Build the natural-language query from the config
  - Send it to the agent/tool and collect the raw response
  - For Edison and Biomni, extract the integer count from the response using claude-sonnet-4-20250514 (extract_count_from_response)
  - Compare against the expected count
- Write results incrementally: after every single run, append to both the CSV summary and JSON report so partial results are preserved if the process is interrupted
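The trial loop with incremental writes can be sketched as follows. This is a simplified illustration that writes only the CSV summary: the function names (append_row, run_trials), the (query_id, prompt, expected_count) tuple layout, and the ask_agent callable are hypothetical stand-ins. The column names are those listed for the CSV summary.

```python
import csv
import os
import time

FIELDS = ["query_id", "run_number", "expected_count", "retrieved_count",
          "is_correct", "error", "duration_seconds"]


def append_row(csv_path, row):
    """Append one result row, writing the header only for a new file."""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)


def run_trials(queries, ask_agent, csv_path, num_runs=3):
    """Run each query num_runs times, persisting every run immediately."""
    for query_id, prompt, expected in queries:
        for run_number in range(1, num_runs + 1):
            start = time.perf_counter()
            retrieved, error = None, ""
            try:
                retrieved = ask_agent(prompt)  # hypothetical agent call
            except Exception as exc:  # record the failure, keep going
                error = repr(exc)
            append_row(csv_path, {
                "query_id": query_id,
                "run_number": run_number,
                "expected_count": expected,
                "retrieved_count": retrieved,
                "is_correct": retrieved == expected,
                "error": error,
                "duration_seconds": round(time.perf_counter() - start, 3),
            })
```

Because each row is flushed to disk as soon as its run finishes, an interruption loses at most the run in progress.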
All agent benchmarks support a --use-gget-virus / -gv flag. When enabled, the agent is instructed to use the gget virus Python module and the documentation from docs/gget_virus_docs.md is provided (either uploaded to the agent's environment or prepended to the system prompt). This tests whether giving agents access to a purpose-built tool improves retrieval accuracy.
The Claude benchmark supports loading K-Dense scientific skills into the system prompt via the --kdense flag. Each skill is a curated documentation bundle (a SKILL.md file plus optional references/ and scripts/ directories) that teaches the model how to use a specific bioinformatics tool or database API.
The skills relevant to this benchmark are biopython (NCBI Entrez access, sequence I/O) and gget (unified queries to 20+ genomic databases). Loading these gives the model detailed API references and working code examples it can adapt when writing its NCBI queries.
Flags:
| Flag | Description |
|---|---|
| --kdense DIR | Path to the scientific-skills directory. Loads SKILL.md from every skill folder. |
| --kdense-skills s1,s2 | Comma-separated list of skill folder names to load instead of all ~140 skills. |
| --kdense-refs | Also load references/*.md and scripts/*.py from each selected skill (detailed API docs and example code). |
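How these flags could translate into text for the system prompt is sketched below. The function name load_skills, its parameters, and the way the files are concatenated are assumptions for illustration; the benchmark script may assemble the prompt differently. The file layout (SKILL.md, references/*.md, scripts/*.py) follows the description above.

```python
from pathlib import Path


def load_skills(skills_dir, selected=None, include_refs=False):
    """Concatenate skill documentation into one block of prompt text.

    Hypothetical helper: `selected` mirrors --kdense-skills and
    `include_refs` mirrors --kdense-refs.
    """
    parts = []
    for folder in sorted(p for p in Path(skills_dir).iterdir() if p.is_dir()):
        if selected and folder.name not in selected:
            continue
        skill_md = folder / "SKILL.md"
        if not skill_md.is_file():
            continue  # not a skill folder
        parts.append(skill_md.read_text(encoding="utf-8"))
        if include_refs:
            extras = sorted(folder.glob("references/*.md"))
            extras += sorted(folder.glob("scripts/*.py"))
            parts.extend(p.read_text(encoding="utf-8") for p in extras)
    return "\n\n".join(parts)
```

For example, load_skills("data/claude-scientific-skills/scientific-skills", selected={"biopython", "gget"}) would return only the two SKILL.md files named in the text, assuming the repository has been cloned as described in the setup step.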
Setup:
# Clone the skills repo into data/
git clone https://github.com/kdense/claude-scientific-skills.git data/claude-scientific-skills

Example runs:
# Load biopython + gget skills (~40K chars added to system prompt)
python src/benchmark_claude.py --test \
--kdense data/claude-scientific-skills/scientific-skills \
--kdense-skills biopython,gget
# Include full reference docs and example scripts (~195K chars)
python src/benchmark_claude.py --test \
--kdense data/claude-scientific-skills/scientific-skills \
  --kdense-skills biopython,gget --kdense-refs

Output files include a _kd suffix when K-Dense skills are enabled (e.g. claude_benchmark_summary_kd_20260220_120000.csv). When combined with gget virus mode the suffix is _gv_kd.
Each benchmark run produces two files in its results subdirectory:
- CSV summary (*_benchmark_summary_YYYYMMDD_HHMMSS.csv): one row per run with query_id, run_number, expected_count, retrieved_count, is_correct, error, and duration_seconds
- JSON report (*_benchmark_report_YYYYMMDD_HHMMSS.json): full report including raw responses and running accuracy statistics
Files with _gv in the name were run with gget virus mode enabled.
When computing metrics (accuracy, stability, MAE, duration), there are two categories of incomplete runs:
- Explicit errors: runs where the error column in the CSV contains a non-empty value (e.g. a timeout, API error, or exception traceback). These are always excluded from all metric computations (treated as NaN).
- None results: runs where retrieved_count is missing/empty but the error column is also empty. These represent cases where the agent completed without raising an exception but did not return a parseable integer count.
The analysis notebook (notebooks/main_comparisons.ipynb) provides a toggle to control how None results are handled:
NONE_TREATED_AS_ZERO = True # None results count as 0 (included in metrics)
NONE_TREATED_AS_ZERO = False # None results are treated as errors (excluded from metrics)

When NONE_TREATED_AS_ZERO = True, a missing retrieved_count with no error is interpreted as "the agent found no matching sequences" (i.e. 0). When False, it is treated the same as an explicit error and excluded from all metric computations.
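The effect of the toggle on a metric such as accuracy can be shown with a small sketch. The helper names (effective_count, accuracy) and the dictionary-based run records are hypothetical; the notebook's actual computation may differ, for instance by operating on a DataFrame.

```python
import math


def effective_count(retrieved_count, error, none_treated_as_zero):
    """Map one run to the value used in metrics; NaN means 'excluded'."""
    if error:
        return math.nan  # explicit errors are always excluded
    if retrieved_count is None:
        return 0.0 if none_treated_as_zero else math.nan
    return float(retrieved_count)


def accuracy(runs, none_treated_as_zero):
    """Fraction of non-excluded runs whose count equals the expected count."""
    scored = []
    for run in runs:
        value = effective_count(run["retrieved_count"], run["error"],
                                none_treated_as_zero)
        if not math.isnan(value):
            scored.append(value == run["expected_count"])
    return sum(scored) / len(scored) if scored else math.nan


if __name__ == "__main__":
    runs = [
        {"expected_count": 10, "retrieved_count": 10, "error": ""},
        {"expected_count": 10, "retrieved_count": None, "error": ""},
        {"expected_count": 10, "retrieved_count": None, "error": "timeout"},
    ]
    print(accuracy(runs, none_treated_as_zero=True))   # 0.5
    print(accuracy(runs, none_treated_as_zero=False))  # 1.0
```

In this toy data the None result counts as a wrong answer when treated as zero, halving the accuracy, while the explicit timeout is excluded under both settings.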
The src/rerun_errors.py script identifies errored runs across all benchmark reports and reruns only those queries. It produces updated result files (with a _rerun_ suffix) and a consolidated report at results/rerun_errors_report_<timestamp>.json summarizing errors before and after the rerun.
# Rerun all errored queries across all technologies
python src/rerun_errors.py
# Rerun errors for a specific report
python src/rerun_errors.py --report results/gpt/gpt-5.2-pro/gpt_benchmark_report_20260226_211117.json

Key exports of the shared utilities module (src/utils.py):
| Export | Description |
|---|---|
| QueryConfig | Dataclass holding query_id, pathogen, expected_count, tax_id, and a filters dict |
| BenchmarkResult | Dataclass for a single run's outcome (counts, correctness, timing, errors) |
| parse_csv(path) | Parses the benchmark CSV into a list of QueryConfig objects |
| build_query(config, use_gget_virus) | Converts a QueryConfig into a natural-language prompt |
| extract_count_from_response(text) | Uses Claude to extract an integer count from free-text agent responses |
| NUM_RUNS | Number of independent trials per query (default: 3) |
| GGET_VIRUS_DOC_MD_PATH | Path to the gget virus documentation file |
