A Query.Farm VGI worker for DuckDB.
vgi-chem · a Query.Farm VGI worker · powered by RDKit
A VGI worker that brings cheminformatics into DuckDB/SQL. It computes molecular descriptors, fingerprints + similarity, substructure search, and InChI from SMILES strings as plain SQL functions, backed by RDKit (BSD-3-Clause) — the standard open-source cheminformatics toolkit.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'chem' (TYPE vgi, LOCATION 'uv run chem_worker.py');
SELECT chem.is_valid_smiles('CCO'); -- true
SELECT chem.canonical_smiles('OCC'); -- 'CCO'
SELECT chem.mol_formula('CC(=O)OC1=CC=CC=C1C(=O)O'); -- 'C9H8O4'
SELECT chem.mol_weight('CC(=O)OC1=CC=CC=C1C(=O)O'); -- ~180.16
SELECT chem.logp('CCO'), chem.tpsa('CCO'); -- Crippen logP, TPSA
SELECT chem.inchikey('CC(=O)OC1=CC=CC=C1C(=O)O'); -- 'BSYNRYMUTXBXSQ-UHFFFAOYSA-N'
SELECT chem.tanimoto('CCO', 'CCO'); -- 1.0
SELECT chem.substructure_match('c1ccccc1O', 'c1ccccc1'); -- true
SELECT * FROM chem.lipinski('CC(=O)OC1=CC=CC=C1C(=O)O'); -- rule-of-five rowsEverything runs offline and deterministically — RDKit computes locally with no network calls, so the same input always gives the same answer.
The catalog already ships a ducksmiles worker for basic SMILES handling
(validate / canonicalize). vgi-chem is scoped beyond that: it is the
descriptors / fingerprints / substructure / InChI worker. It does include
is_valid_smiles and canonical_smiles for convenience (so you can do
everything from one ATTACH), but its reason to exist is the chemistry that
ducksmiles does not cover — molecular weight/formula/exact mass, H-bond and
ring/rotatable-bond counts, Crippen logP and TPSA, Morgan fingerprints and
Tanimoto similarity, SMARTS substructure matching, InChI/InChIKey, and the
Lipinski rule-of-five breakdown. Reach for ducksmiles for lightweight SMILES
plumbing; reach for vgi-chem when you need real cheminformatics.
The split follows what the VGI SDK allows for each function shape:
-
Scalars take positional arguments only and resolve overloads by arity (DuckDB's
name := valuesyntax is a table-function/macro feature, not a scalar one). Every per-row answer is a scalar, so it works inline in any projection or predicate. Where a function takes optional parameters, those are extra positional arity overloads:SELECT morgan_fingerprint(smiles) FROM compounds; -- radius=2, nbits=2048 SELECT morgan_fingerprint(smiles, 3, 1024) FROM compounds; -- explicit radius/nbits SELECT tanimoto(a.smiles, b.smiles) FROM a, b; -- radius=2 SELECT tanimoto(a.smiles, b.smiles, 3) FROM a, b; -- explicit radius
-
Table functions return many rows.
lipinski(smiles)expands one molecule into one row per rule:SELECT * FROM chem.lipinski('CC(=O)OC1=CC=CC=C1C(=O)O');
NULL / invalid semantics. A NULL input yields NULL output for every function.
An invalid SMILES yields NULL for every value function (formula, weight,
fingerprint, …) and false for is_valid_smiles; lipinski returns no rows.
For substructure_match, an invalid SMILES or an invalid SMARTS
pattern yields NULL (a bad pattern is reported as NULL rather than aborting the
query). Nothing here ever raises a SQL error on bad chemical input.
| Function | Form | Signature | Returns |
|---|---|---|---|
is_valid_smiles |
scalar | (smiles) |
BOOLEAN |
canonical_smiles |
scalar | (smiles) |
VARCHAR (NULL if invalid) |
mol_formula |
scalar | (smiles) |
VARCHAR (Hill, NULL if invalid) |
mol_weight |
scalar | (smiles) |
DOUBLE (avg MW, g/mol) |
exact_mass |
scalar | (smiles) |
DOUBLE (monoisotopic mass) |
num_atoms |
scalar | (smiles) |
INT (heavy atoms) |
num_rings |
scalar | (smiles) |
INT |
num_rotatable_bonds |
scalar | (smiles) |
INT |
num_h_donors |
scalar | (smiles) |
INT |
num_h_acceptors |
scalar | (smiles) |
INT |
logp |
scalar | (smiles) |
DOUBLE (Crippen MolLogP) |
tpsa |
scalar | (smiles) |
DOUBLE (topological polar surface area) |
inchi |
scalar | (smiles) |
VARCHAR (standard InChI) |
inchikey |
scalar | (smiles) |
VARCHAR (standard InChIKey) |
morgan_fingerprint |
scalar | (smiles[, radius, nbits]) |
VARCHAR (hex bit-string) |
tanimoto |
scalar | (smiles_a, smiles_b[, radius]) |
DOUBLE in [0, 1] |
substructure_match |
scalar | (smiles, smarts) |
BOOLEAN (NULL on bad SMARTS) |
lipinski |
table | (smiles) |
(rule VARCHAR, value DOUBLE, passes BOOLEAN) |
mol_weight is the average molecular weight; exact_mass is the monoisotopic
mass. num_atoms counts heavy (non-hydrogen) atoms. logp is RDKit's
Crippen MolLogP; tpsa is the topological polar surface area. All return
NULL for an unparseable SMILES.
morgan_fingerprint(smiles) returns a Morgan (ECFP-like) fingerprint as a hex
bit-string (nbits / 4 hex chars; defaults radius=2, nbits=2048); the
(smiles, radius, nbits) overload sets both explicitly. tanimoto(a, b) is the
Morgan/Tanimoto similarity in [0, 1] (self-similarity is exactly 1.0); the
(a, b, radius) overload sets the Morgan radius.
substructure_match(smiles, smarts) is true when the molecule contains the
SMARTS pattern. An invalid SMILES or invalid SMARTS yields NULL (documented
behaviour) — a malformed pattern in one row never aborts the whole query.
lipinski(smiles) returns one row per rule — molecular_weight (≤ 500),
logp (≤ 5), h_bond_donors (≤ 5), h_bond_acceptors (≤ 10) — each with its
computed value and whether it passes. Roll it up to a single drug-likeness
flag:
SELECT bool_and(passes) AS drug_like
FROM chem.lipinski('CC(=O)OC1=CC=CC=C1C(=O)O'); -- trueAn invalid SMILES yields no rows.
| Component | License | Notes |
|---|---|---|
vgi-chem (this worker) |
MIT | This repository's own code. |
rdkit |
BSD-3-Clause | Cheminformatics engine (permissive). |
vgi-python |
Query Farm Source-Available | The VGI SDK. |
RDKit is BSD-3-Clause (permissive, commercial-use-friendly) and is used as an unmodified, separately-installed pip dependency — imported, never vendored or patched. Descriptor / fingerprint definitions are exactly those RDKit computes; consult the RDKit docs for their precise definitions.
uv sync --all-extras # create .venv with vgi-python + rdkit + dev tools
make test # pytest (unit + integration) + SQL end-to-end
make test-unit # pytest only
make test-sql # DuckDB sqllogictest files via haybarn-unittest
uv run ruff check . # lint
uv run mypy vgi_chem/tests/test_chem.py covers the pure cheminformatics logic with known reference
values (aspirin formula/MW/InChIKey, benzene ring/atom counts, ethanol H-bond
donors, Tanimoto self-similarity, substructure matching, and a battery of
invalid-input edge cases). tests/test_tables.py drives the lipinski table
function through the real bind→init→process lifecycle in-process;
tests/test_scalars.py spawns chem_worker.py over the VGI client/RPC stack
exactly as DuckDB would after ATTACH. The test/sql/*.test files are DuckDB
sqllogictest cases run by
haybarn-unittest
(uv tool install haybarn-unittest) against a real ATTACH + SELECT.
chem_worker.py entry point; assembles the `chem` catalog (inline uv script metadata)
Makefile test / test-unit / test-sql targets
vgi_chem/
chem.py pure cheminformatics logic over RDKit (no Arrow/VGI); imports RDKit once
scalars.py per-row scalars (arity overloads for fingerprint radius/nbits, tanimoto radius)
tables.py the lipinski(smiles) rule-of-five table function
schema_utils.py Arrow field/comment helper
tests/
harness.py in-process bind→init→process driver
test_chem.py pure-logic unit + edge tests with known reference values
test_tables.py lipinski table-function integration tests
test_scalars.py per-row scalar overloads via vgi.client.Client
test/sql/
*.test DuckDB sqllogictest end-to-end cases (haybarn-unittest)
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
