A Query.Farm VGI worker for DuckDB.
vgi-yara · a Query.Farm VGI worker · powered by yara-x
A VGI worker (Rust, a compiled binary) that brings
YARA malware scanning to DuckDB / SQL over Apache Arrow. DuckDB launches the
worker and talks to it over Arrow IPC; the functions appear under the catalog
yara, schema main. This is a defensive security tool: it scans
data/files against YARA rules for malware
detection.
Rule compilation and scanning are powered by
yara-x, VirusTotal's official pure-Rust
rewrite of YARA — no native libyara/C dependency.
LOAD vgi;
ATTACH 'yara' (TYPE vgi, LOCATION './target/release/yara-worker');
SET search_path = 'yara.main';
-- Per-row predicate over a column of blobs/files.
SELECT path
FROM files
WHERE yara_matches(content, 'rule eicar { strings: $a = "EICAR" condition: $a }');
-- First matching rule / how many rules matched.
SELECT yara_first_rule(content, $rules) FROM files; -- VARCHAR (NULL if none)
SELECT yara_match_count(content, $rules) FROM files; -- INT
-- Validate a ruleset compiles.
SELECT yara_check('rule r { condition: true }'); -- → true
-- Fan one constant blob into its matches (table functions).
SELECT * FROM yara_scan(read_blob('sample.bin'), $rules);
-- rule | namespace | tags
SELECT * FROM yara_string_matches(read_blob('sample.bin'), $rules);
-- rule | identifier | offset | matched

| Function | Returns | Description |
|---|---|---|
| yara_matches(data, rules) | BOOLEAN | Does data match any rule? |
| yara_first_rule(data, rules) | VARCHAR | Identifier of the first matching rule (NULL if none). |
| yara_match_count(data, rules) | INT | Number of matching rules. |
| yara_check(rules) | BOOLEAN | Do the rules compile? (validation; never errors). |
| yara_version() | VARCHAR | Worker version string. |

data is a BLOB or VARCHAR (the bytes/text to scan); rules is a YARA rule
source string.
| Function | Columns | Description |
|---|---|---|
| yara_scan(data, rules) | rule VARCHAR, namespace VARCHAR, tags VARCHAR[] | One row per matching rule. |
| yara_string_matches(data, rules) | rule VARCHAR, identifier VARCHAR, "offset" BIGINT, matched VARCHAR | One row per pattern (string) hit. |

DuckDB table functions take constant arguments (no subqueries), so the
data and rules passed to yara_scan / yara_string_matches must be
constant-foldable expressions (literals, read_blob('…'), etc.).

matched is the matched bytes rendered as UTF-8 text when printable, else a
lowercase hex string.
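
The rendering rule for matched can be sketched in a few lines of Rust. This is an illustration only, not the worker's actual code: the function name render_matched is hypothetical, and "printable" is assumed here to mean valid UTF-8 with no control characters, which may differ from the rule the worker applies.

```rust
/// Render matched bytes for display: text when printable, else lowercase hex.
///
/// ASSUMPTION: "printable" is taken to mean valid UTF-8 that contains no
/// control characters. The worker's real definition may differ.
pub fn render_matched(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8 without control characters: return it as text.
        Ok(text) if !text.chars().any(char::is_control) => text.to_owned(),
        // Anything else: two lowercase hex digits per byte.
        _ => bytes.iter().map(|b| format!("{b:02x}")).collect(),
    }
}
```

Under these assumptions a hit on the ASCII string EICAR is shown as text, while a hit on raw bytes such as 0xde 0xad 0xbe 0xef is shown as a hex string.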
The scanned data is untrusted — by definition it may be live malware:
- A malformed, truncated, binary, or hostile blob never crashes the worker.
  Scanning is total: it yields no matches (false / NULL / 0 / no rows), never
  an error. A bad blob beside a good one still produces the good one's matches.
- Scanned data is bounded (64 MiB): an oversized blob is truncated to the cap
  before scanning so it cannot exhaust memory.
- NULL input → NULL output / no rows.
- An invalid rule source (a user mistake) surfaces a clear DuckDB error from
  the scan functions, carrying the compiler diagnostic. yara_check instead
  returns false for a non-compiling source (it is the "does it compile?"
  predicate).
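
The guarantees above can be pictured with a small Rust sketch. It is a model of the stated behaviour under assumptions, not the worker's implementation: cap_input, check, and match_count are hypothetical names, the compiler and scanner are passed in as stand-in closures rather than calling yara-x, and the choice to return NULL before compiling the rules is a guess.

```rust
/// Cap on scanned data stated in the list above: 64 MiB.
pub const MAX_SCAN_BYTES: usize = 64 * 1024 * 1024;

/// Truncate an oversized input to the cap; smaller inputs pass through as-is.
pub fn cap_input(data: &[u8]) -> &[u8] {
    &data[..data.len().min(MAX_SCAN_BYTES)]
}

/// yara_check-style predicate: a compile failure becomes `false`, not an error.
/// `compile` is a hypothetical stand-in for the real rule compiler.
pub fn check<R>(compile: impl Fn(&str) -> Result<R, String>, rules: &str) -> bool {
    compile(rules).is_ok()
}

/// yara_match_count-style scan: NULL in gives NULL out, and a compile error
/// is propagated to the caller together with its diagnostic.
/// ASSUMPTION: the NULL check happens before the rules are compiled.
pub fn match_count<R>(
    compile: impl Fn(&str) -> Result<R, String>,
    scan: impl Fn(&R, &[u8]) -> usize,
    data: Option<&[u8]>,
    rules: &str,
) -> Result<Option<usize>, String> {
    // NULL input: no scan, NULL result.
    let Some(data) = data else { return Ok(None) };
    // Invalid rules: surface the diagnostic as an error.
    let compiled = compile(rules)?;
    // The scan itself is total: it always yields a count, possibly zero.
    Ok(Some(scan(&compiled, cap_input(data))))
}
```

The split between check and match_count mirrors the distinction drawn in the last bullet: the same compile failure is a false in the predicate and an error in the scan functions.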
cargo build --release # build the worker
cargo test --workspace --all-features # unit + integration tests
cargo clippy --all-targets --all-features -- -D warnings # lint
make test-sql # DuckDB SQL end-to-end

make test-sql builds the release worker, points VGI_YARA_WORKER at it, and
runs the haybarn-unittest sqllogictest suite under test/sql/. Install the
runner once with uv tool install haybarn-unittest.
- This worker: MIT (see LICENSE).
- yara-x (the scanning engine): BSD-3-Clause.
- vgi / vgi-rpc (the worker SDK) and arrow-*: Apache-2.0.
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
