A VGI worker that brings XGBoost into DuckDB/SQL: train gradient-boosted models, persist them in a registry, predict over SQL tables, and interpret them (feature importance + SHAP contributions) — all as SQL functions.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'xgboost' (TYPE vgi, LOCATION 'uv run xgboost_worker.py');
-- train + persist a model (every non-id/target column becomes a feature)
SELECT * FROM xgboost.fit(
(SELECT * EXCLUDE (target_name) FROM xgboost.iris()),
model_name := 'iris_clf', estimator := 'xgb_classifier', target := 'target', id := 'sample_id');
-- predict later (input needs the same feature columns; selected by name)
SELECT * FROM xgboost.predict((SELECT * FROM new_flowers), model_name := 'iris_clf', id := 'id');XGBoost is built around stateful fit / predict estimators; SQL is set-oriented. Each piece is mapped to the VGI primitive that fits its data flow:
| Area | SQL surface | VGI primitive |
|---|---|---|
| Datasets | SELECT * FROM xgboost.iris() |
table function (source) |
| Fit | xgboost.fit((SELECT ...), model_name := 'm', ...) |
table-buffering → registry + BLOB |
| Typed fit | xgboost.fit_xgb_classifier((SELECT ...), n_estimators := 300, ...) |
table-buffering → registry + BLOB |
| Predict | xgboost.predict((SELECT ...), model_name := 'm') |
streaming table-in-out |
| Cross-val | xgboost.cross_val_predict / cross_val_score((SELECT ...), ...) |
table-buffering (no persistence) |
| Tuning | xgboost.grid_search / randomized_search((SELECT ...), estimator := union_value(...)) |
table-buffering (CV leaderboard) |
| Importance | xgboost.feature_importance('m') / permutation_importance(...) |
table function / buffering |
| Explain (SHAP) | xgboost.explain((SELECT ...), model_name := 'm') |
streaming table-in-out |
| Inspect | xgboost.partial_dependence((SELECT ...), model_name := 'm', feature := 'x') |
table-buffering |
Conventions for the fit / predict / explain functions:
- The input relation is the feature matrix
X, passed as a(SELECT ...)subquery. Named arguments use DuckDB'sname := value(or=>) syntax. idnames a passthrough column: it is excluded from the features and copied unchanged onto each output row, so you can join results back to the source. It is optional.target(required forfit/cross_val_predict) names the label column, also excluded from features. Classification targets may be any label dtype (string, int, …): they are label-encoded to integer codes for XGBoost, the original ordered labels are stored in the model, andpredictdecodes back to them (so a string target predicts string labels, andwith_probayieldsproba_<label>columns).- Every remaining column is a feature. Numeric columns are used directly; string columns become native categorical features (no one-hot needed — this is XGBoost's edge), and NULLs flow through as missing values. Only genuinely unsupported types (e.g. blobs, structs) raise a clear error.
- Hyperparameters can be passed as a JSON string:
params := '{"n_estimators": 300, "max_depth": 6}', or — better — as native typed arguments via thefit_<estimator>functions (see below). Unknown hyperparameters are rejected with the list of valid ones. fit/predictalign features by name, not position:predictselects the model's fitted feature columns by name (input order is irrelevant, extra columns are ignored) and errors if a required feature column is missing. Unseen categories at predict time are treated as missing rather than raising.fitalways returns the model as amodelBLOB (estimator + metadata, serialized with XGBoost's native format). It also persists to the registry only ifmodel_nameis given (somodel_nameis optional).predict,explain,feature_importance,permutation_importance, andpartial_dependencetake eithermodel_name :=ormodel :=(a BLOB).
Bundled (via scikit-learn) so demos and tests are self-contained: iris, wine,
breast_cancer (classification), diabetes, california_housing (regression),
and generators make_classification, make_regression.
SELECT target_name, avg(petal_length_cm) FROM xgboost.iris() GROUP BY target_name;
SELECT * FROM xgboost.make_classification(n_samples := 500, n_features := 8, n_classes := 3);fit, predict, cross_val_predict, cross_val_score, list_models,
model_info, drop_model.
Estimators: xgb_classifier, xgb_regressor, xgb_rf_classifier,
xgb_rf_regressor (the random-forest-flavoured boosters).
-- train + persist (params as JSON)
SELECT * FROM xgboost.fit(
(SELECT sample_id, sepal_length_cm, sepal_width_cm, petal_length_cm, petal_width_cm, target FROM xgboost.iris()),
model_name := 'iris_clf', estimator := 'xgb_classifier', target := 'target', id := 'sample_id',
params := '{"n_estimators": 200, "max_depth": 4}');
-- predict later (optionally with per-class probabilities)
SELECT * FROM xgboost.predict((SELECT * FROM new_flowers), model_name := 'iris_clf', id := 'id', with_proba := true);
-- predict output modes: raw margin or per-tree leaf indices
SELECT * FROM xgboost.predict((SELECT * FROM new_flowers), model_name := 'iris_clf', id := 'id', output_margin := true);
SELECT * FROM xgboost.predict((SELECT * FROM new_flowers), model_name := 'iris_clf', id := 'id', pred_leaf := true);
-- evaluate without persisting: out-of-fold predictions, or per-fold held-out scores
SELECT count(*) FROM xgboost.cross_val_predict(
(SELECT * FROM iris_xy), estimator := 'xgb_classifier', target := 'target', id := 'sample_id', cv := 5);
SELECT fold, score FROM xgboost.cross_val_score(
(SELECT * FROM iris_xy), estimator := 'xgb_classifier', target := 'target', cv := 5);
SELECT * FROM xgboost.list_models();
SELECT * FROM xgboost.drop_model('iris_clf');Each estimator also has a fit_<estimator> form that exposes XGBoost's most-tuned
hyperparameters as native typed SQL arguments — discoverable in autocomplete,
type-checked, no JSON: n_estimators, max_depth, learning_rate, subsample,
colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda,
objective, booster (gbtree/gblinear/dart), tree_method, random_state.
SELECT * FROM xgboost.fit_xgb_classifier(
(SELECT * FROM iris_xy), model_name := 'iris_clf', target := 'target', id := 'sample_id',
n_estimators := 300, max_depth := 6, learning_rate := 0.1);grid_search and randomized_search run cross-validated tuning and return the
leaderboard (one row per combination) with the refit best model BLOB on the best
(rank-1) row. The estimator and its grid are a single tagged-union argument:
union_value(<estimator> := {param: [values], ...}) — the union tag picks the
estimator and each member exposes only that estimator's hyperparameters. Only the
listed hyperparameters are searched; the rest stay at the estimator's defaults.
Grab the best model with WHERE model IS NOT NULL.
SELECT params, mean_test_score, rank
FROM xgboost.grid_search((SELECT * FROM iris_xy),
target := 'target', id := 'sample_id',
estimator := union_value(xgb_classifier := {'n_estimators': [100, 300], 'max_depth': [3, 5, 8]}))
ORDER BY rank;
SELECT params, mean_test_score FROM xgboost.randomized_search((SELECT * FROM iris_xy),
target := 'target', n_iter := 10,
estimator := union_value(xgb_classifier := {'learning_rate': [0.05, 0.1, 0.2], 'max_depth': [3, 5, 8]}));feature_importance, explain, permutation_importance, and
partial_dependence.
-- ranked per-feature importance for a stored model (weight/gain/cover/total_*)
SELECT * FROM xgboost.feature_importance('iris_clf', importance_type := 'gain');
-- SHAP contributions in LONG format: one row per (input row, [class], feature),
-- with shap_value + base_value. base_value + sum(shap_value) reconstructs the
-- model's raw-margin prediction; multiclass models add a `class` column.
SELECT * FROM xgboost.explain((SELECT * FROM xgboost.diabetes()), model_name := 'diab_reg', id := 'sample_id');
-- model-agnostic importance, ranked: the drop in score when each feature is shuffled.
SELECT * FROM xgboost.permutation_importance((SELECT * FROM xgboost.diabetes()),
model_name := 'diab_reg', target := 'target') ORDER BY rank;
-- partial dependence: how the average prediction moves as one numeric feature
-- varies over a grid (one curve per class for multiclass classifiers).
SELECT * FROM xgboost.partial_dependence((SELECT * FROM xgboost.diabetes()),
model_name := 'diab_reg', feature := 'bmi') ORDER BY feature_value;The booster deliberately ships no metric functions: score its predict output
with vgi-sklearn's metric aggregates instead of duplicating them. Attach both
workers and join the predictions to the ground truth:
ATTACH 'sklearn' (TYPE vgi, LOCATION 'uv run sklearn_worker.py');
ATTACH 'xgboost' (TYPE vgi, LOCATION 'uv run xgboost_worker.py');
-- accuracy of an xgboost model, computed by sklearn.accuracy_score
SELECT sklearn.accuracy_score(t.target, p.prediction)
FROM xgboost.predict((SELECT * FROM test_x), model_name := 'iris_clf', id := 'sample_id') p
JOIN test_labels t USING (sample_id);
-- ROC AUC from per-class probabilities (with_proba := true)
SELECT sklearn.roc_auc_score(t.target, p.proba_1)
FROM xgboost.predict((SELECT * FROM test_x), model_name := 'churn', id := 'id', with_proba := true) p
JOIN test_labels t USING (id);String columns are used as categorical features directly (no one-hot), and NULLs are XGBoost's native missing value — both at fit and predict, and unseen categories at predict map to missing rather than erroring:
SELECT * FROM xgboost.fit((SELECT id, color, score, churned FROM customers),
model_name := 'churn', estimator := 'xgb_classifier', target := 'churned', id := 'id');Fitted models are serialized with XGBoost's native save_model (UBJSON, not
pickle) plus a JSON metadata sidecar — so they are forward-compatible across
library upgrades and load without arbitrary code execution. The same packing
flows through SQL as a self-contained model BLOB. The store is chosen behind the
ModelStore interface in vgi_xgboost/registry.py:
- Local disk (default):
XGBOOST_MODELS_DIR(default./models). - S3 / Cloudflare R2: not yet implemented —
get_store()is the single seam where anS3Storedrops in.
On Fly.io the local store is backed by a mounted volume (see fly.toml) so models
survive machine restarts. predict records the XGBoost version used to fit and
logs a warning (visible in duckdb_logs()) if the worker's version differs.
make venv # create .venv with vgi + xgboost + scikit-learn (from PyPI)
make lint # ruff + mypy
make pytest # unit tests
make test-sql # SQL tests in-process via haybarn (no custom DuckDB build needed)
make test-stdio # SQL tests with the worker as a subprocess (custom unittest runner)
make test-http # SQL tests against a local HTTP serverThe test/sql/*.test files are the integration suite. test-stdio/test-http
run them with DuckDB's unittest runner built with the VGI extension
(VGI_BUILD_DIR) and are the local authority. test-sql replays the same
files in-process against the haybarn DuckDB distribution (which can
INSTALL vgi FROM community), so they also run on a stock CI runner.
.github/workflows/ci.yml runs ruff, mypy, the unit tests, the haybarn SQL
suite, and a Docker build + /health smoke test on every push and PR.
Dependabot (.github/dependabot.yml) keeps the Python deps, GitHub Actions, and
the Docker base image up to date weekly.

