Skip to content

Add vector_search_indexes resource (direct engine)#5123

Open
janniklasrose wants to merge 45 commits intomainfrom
janniklasrose/vs-index
Open

Add vector_search_indexes resource (direct engine)#5123
janniklasrose wants to merge 45 commits intomainfrom
janniklasrose/vs-index

Conversation

@janniklasrose
Copy link
Copy Markdown
Contributor

@janniklasrose janniklasrose commented Apr 29, 2026

Changes

Adds vector_search_indexes as a first-class DABs resource on the direct engine, alongside the existing vector_search_endpoints. Direct engine only — vector search has no Terraform provider.

resources:
  vector_search_endpoints:
    my_endpoint:
      name: my-endpoint
      endpoint_type: STANDARD
  vector_search_indexes:
    my_index:
      name: main.default.my_index
      endpoint_name: ${resources.vector_search_endpoints.my_endpoint.name}
      primary_key: id
      index_type: DELTA_SYNC
      delta_sync_index_spec:
        source_table: main.default.source
        pipeline_type: TRIGGERED
      grants:
        - principal: data-engineers
          privileges: [SELECT]

What's included:

  • Resource model in bundle/config/resources/vector_search_index.go (with grants) and bundle/direct/dresources/vector_search_index.go (state, lifecycle, drift classification). RemapState round-trips index_subtype so a populated remote subtype isn't classified as drift on the next plan.
  • UC grants wired through the generic grants path with securable type table.
  • recreate_on_changes for immutable spec fields (name, endpoint_name, index_type, index_subtype, primary_key, delta_sync_index_spec, direct_access_index_spec); delta_sync_index_spec.columns_to_sync marked ignore_remote_changes (request-only field — see follow-up note below). The index API has no rename or update path, so any config-side change has to round-trip through delete + create.
  • Index orphaning detection: index state persists the endpoint_uuid of the endpoint it was created against. DoRead looks up the current endpoint UUID by name; if the endpoint was deleted out-of-band the lookup returns "" and OverrideChangeDesc classifies the saved-vs-remote mismatch as Recreate. Builds on the endpoint UUID persistence merged in Persist endpoint UUID for vector_search_endpoints drift detection #5127.
  • Async delete handling: new optional WaitAfterDelete adapter method (sibling to WaitAfterCreate / WaitAfterUpdate). For VS indexes it polls GetIndex until 404 (15-minute cap). apply.Recreate runs DoDelete → DeleteState → WaitAfterDelete → DoCreate → SaveState → WaitAfterCreate, so a wait-time failure leaves the bundle consistent. Replaces the prior SaveState("", nil, nil) placeholder that produced invalid state: empty id planning failures on partial recreate.
  • Destructive-action prompt for VS indexes in bundle/phases/. The message intentionally covers both Delta Sync ("re-runs the embedding pipeline") and Direct Access ("upserted vectors lost") in one paragraph — picking a type-specific message from the bundle config would be wrong on type changes (DELTA_SYNCDIRECT_ACCESS recreates would describe the destination type while the actual teardown is of the source type).
  • Dev-mode name prefixing for indexes prefixes only the leaf component of catalog.schema.name, since catalog and schema are external references (the previous behavior produced invalid names like dev_jan_main.default.my_index). The mutator skips names that still carry literal ${...} tokens, since the leaf split would otherwise inject the prefix inside the trailing ref expression itself.
  • Testserver enforces endpoint existence on index create. Index status returns Ready: true immediately, matching the convention used by every other slow resource the testserver fakes (endpoints → ONLINE, database instances → AVAILABLE, apps → RUNNING).

index_type / spec-block consistency is intentionally not validated client-side — the CreateIndex API rejects mismatched combinations at deploy time, and replicating that check in DABs would just duplicate backend logic.

Why

The direct engine recently gained vector_search_endpoints (#4887). This PR extends the support to indexes, which were the missing half. Along the way it surfaces and fixes a number of issues:

  • Without persisted endpoint UUIDs, identity drift was undetectable. An index pointing at a deleted-and-recreated endpoint would appear live by name but its backing endpoint was gone, leading to confusing "index already exists" errors on subsequent deploys. Persist endpoint UUID for vector_search_endpoints drift detection #5127 added the same UUID tracking on the endpoint side; this PR mirrors it on the index side so the orphan is caught.
  • The async deletion model isn't documented in the SDK, but recreate deploys hit it every time. Without a wait, every recreate failed on the immediate Create.
  • apply.Recreate was writing a malformed empty-ID state entry as its "delete state" step, which then poisoned the next plan with invalid state: empty id.
  • Recreating a VS index is genuinely expensive — Delta Sync re-runs the full embedding pipeline; Direct Access loses every upserted vector. The destructive-action prompt now reflects that.

Follow-ups

  • delta_sync_index_spec.columns_to_sync is request-only in the SDK today: the field is accepted on Create but the Get response doesn't echo it back, which is why we mark it ignore_remote_changes here. There's an open backend PR to expose columns_to_sync on the read path; once the SDK is regenerated against that, we can drop the ignore_remote_changes entry and let normal drift detection handle the field.
  • vector_search_endpoints.budget_policy_id drift (effective vs. requested) and the SDK doc-comment for vector_search_endpoints.usage_policy_id are intentionally not in this PR — both will be addressed by the next SDK bump and the corresponding ./task generate-schema regen.

Tests

  • ./task fmt, ./task checks, ./task lint — all clean.
  • ./task test — unit tests green across bundle/....
  • New unit test TestVectorSearchIndexNameWithUnresolvedRefsLeftAlone in apply_target_mode_test.go exercises the leaf-prefix skip on ${var.catalog}.${var.schema}.${var.index}.
  • New acceptance directories under acceptance/bundle/resources/vector_search_indexes/: basic, drift/columns_to_sync, drift/deleted_remotely, drift/orphaned_endpoint, recreate/index_type, recreate/mixed_types, grants/select.
  • The recreate request log (recreate/index_type/out.requests.recreate.direct.json) captures GET → DELETE → GET → POST with --get enabled in print_requests.py. The middle GET is the WaitAfterDelete poll; if a future change drops the wait the regenerated capture loses that line and the test fails.
  • acceptance/bundle/validate/presets_name_prefix covers the leaf-only name prefix on a 3-part index name.
  • acceptance/bundle/invariant/configs/vector_search_index.yml.tmpl exercises the resource through the invariant matrix; the testserver enforces endpoint existence on index create.
  • Live tested with --profile tmp against staging across initial deploy / drift / recreate / destroy.

This PR was written by Claude Code.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 29, 2026

Approval status: pending

/acceptance/bundle/ - needs approval

64 files changed
Suggested: @denik
Also eligible: @andrewnester, @shreyas-goenka, @pietern, @anton-107, @lennartkats-db

/bundle/ - needs approval

31 files changed
Suggested: @denik
Also eligible: @andrewnester, @shreyas-goenka, @pietern, @anton-107, @lennartkats-db

General files (require maintainer)

4 files changed
Based on git history:

  • @denik -- recent work in bundle/direct/dresources/, bundle/direct/, libs/testserver/

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

@janniklasrose janniklasrose marked this pull request as draft April 29, 2026 13:12
@janniklasrose janniklasrose force-pushed the janniklasrose/vs-index branch from 1943af9 to 87018ce Compare April 29, 2026 13:19
janniklasrose added a commit that referenced this pull request Apr 30, 2026
)

## Changes

Persist `endpoint_uuid` in state and detect identity drift on
`vector_search_endpoints`.

The endpoint name is stable but its UUID changes if the endpoint is
deleted and recreated by name (e.g. via the workspace UI). Without
persisting the UUID:

- The bundle silently rebound permissions to a different backing
endpoint without recreating the endpoint resource.
- Anything else referencing `endpoint_uuid` (most importantly the
permissions object_id, but also indexes added on top in the next PR)
raced the recreate.

`VectorSearchEndpointState` now embeds `vectorsearch.CreateEndpoint` and
adds `EndpointUuid`. `DoCreate` records the UUID from the create
response; `DoUpdate` copies it from `entry.RemoteState` so unrelated
updates (e.g. `min_qps`) don't blank it out. `OverrideChangeDesc`
classifies `endpoint_uuid` drift as `Recreate` when saved differs from
remote, `Skip` otherwise.

`drift/recreated_same_name` flips from a "badness snapshot" (which
captured the old behavior of permissions silently rebinding) to the
recreate behavior, with a permissions block on the endpoint to verify
the cascade rebinds correctly.

`drift/min_qps/out.plan.direct.json` regenerates to include the new
`endpoint_uuid` skip entry in the detailed plan.

## Why

Splitting this out of the larger `vector_search_indexes` PR
([#5123](#5123)) so it can land
independently. The index PR builds on the persisted UUID for orphan
detection, but the endpoint UUID work stands on its own and is useful
regardless.

## Tests

- `make fmtfull`, `make checks`, `make lintfull` — clean.
- `make test` — green (`libs/apps/runlocal` needed `NODE_OPTIONS=` for
the harness leak; unrelated). `bundle/internal/schema
TestRequiredAnnotationsForNewFields` panics, which is failing on `main`
for unrelated reasons.
- `go test ./acceptance -run
'TestAccept/bundle/resources/vector_search_endpoints'` — all green,
including the flipped `drift/recreated_same_name`.

_This PR was written by Claude Code._
- invariant vector_search_index config now creates the required endpoint
  in-bundle so the testserver's endpoint-existence check passes
- drift/budget_policy now asserts that the remote drift is ignored
  (matches ignore_remote_changes for budget_policy_id: effective value
  may include inherited workspace policy, not user-set)
- vector_search_indexes/basic URL updated to match the current
  catalog.schema.name-based InitializeURL behavior

Co-authored-by: Isaac
Covers the case where an index is deleted outside the bundle: the next
plan should propose creation and the next deploy restores it.

Mirrors acceptance/bundle/resources/volumes/remote-delete and closes
the missing drift coverage for indexes (endpoints have
drift/budget_policy and drift/min_qps; indexes had none).

Co-authored-by: Isaac
…ectorIndexRequest

make lint surfaces exhaustruct issues on changed packages. Explicitly
initialize the zero-value fields introduced by the prior two commits
(EndpointUuid on the new state type, IndexSubtype on the existing
RemapState literal).

Co-authored-by: Isaac
The name is a 3-part UC identifier (catalog.schema.index) where catalog
and schema are external references the bundle does not own. Prefixing
the whole name turned "main.default.my_index" into
"dev_user_main.default.my_index", which is not a valid catalog.

Prefix only the leaf component, matching how references work:
"main.default.my_index" -> "main.default.dev_user_my_index".

Co-authored-by: Isaac
Exercises the fix from the previous commit: for
vector_search_indexes.vs_index.name = my_catalog.${resources.schemas.my_schema.name}.my_index
only the leaf (my_index) is prefixed, the catalog is preserved as-is,
and the schema reference stays intact — it resolves at deploy time to
the already-prefixed schema, yielding the expected 3-part name.

Co-authored-by: Isaac
Indexes are UC-only (delta sync against a UC table, UC grants as ACL).
Mirrors the flag already set on vector_search_endpoints so these tests
are skipped on non-UC cloud runs instead of failing on missing catalog.

Co-authored-by: Isaac
If the endpoint an index is attached to is deleted out-of-band, the
index still exists by name but its backing endpoint is gone. The
planner saw the index's remote state as valid and produced "skip",
then deploy recreated the endpoint (new UUID) and left the index
orphaned against the old UUID — on the next deploy attempt the user
saw "index already exists" with no way to resolve short of manual
deletion.

VectorSearchIndexState now persists endpoint_uuid alongside the create
request. DoRead looks it up from the endpoint service (the index API
doesn't return it). OverrideChangeDesc classifies saved vs remote
endpoint UUID drift as Recreate so the plan correctly shows
"recreate vector_search_indexes.*" when the endpoint has been
replaced, and the apply path delete+creates the index against the new
endpoint.

Covered by acceptance/.../vector_search_indexes/drift/orphaned_endpoint.

Co-authored-by: Isaac
Recreate of a Vector Search index re-runs the full embedding pipeline
(Delta Sync) or drops every upserted vector (Direct Access). Both can
take significant time and cost. Bring vector_search_indexes into the
same destructive-action prompt path used for schemas, pipelines,
volumes, dashboards, and the Lakebase resources, with a message
explaining the cost.

Acceptance scripts that recreate indexes now pass --auto-approve, and
all VS index test outputs pick up the prompt message at destroy time.

Co-authored-by: Isaac
DELETE on a Vector Search index is asynchronous: the backend tears
down the embedding pipeline over a few minutes, and any operation on
the same name during that window returns "Index ... is currently
pending deletion". Recreate (delete+create on the same name) hit this
on every deploy.

DoDelete now polls GetIndex until it returns 404, so a follow-up
DoCreate sees a clean slate. Timeout is 15 minutes, matching observed
worst-case teardown.

Also harden recovery from a partial recreate: if Create after Delete
fails, apply.Recreate now drops the state entry instead of leaving it
with an empty ID. Pre-fix state files that already carry an empty ID
are tolerated by the planner — empty ID is treated the same as a
missing entry, so the next plan re-creates the resource cleanly
instead of failing with "invalid state: empty id".

Testserver now models the async deletion window: DELETE moves the
index into a pending buffer, CREATE during pending returns the
backend's exact error, and the next GET (the wait loop's poll) clears
the buffer. The recreate/index_type acceptance test exercises the
full path.

Co-authored-by: Isaac
Previously lookupEndpointUuid swallowed all non-404 errors and returned "",
which would feed empty remoteUuid into OverrideChangeDesc and propose a
destructive Recreate ("endpoint replaced out-of-band") on transient or
permission errors. The Recreate is dangerous: Delta Sync re-runs the
embedding pipeline, and Direct Access loses all upserted vectors.

Now the helper returns (string, error): 404 maps to ("", nil) — the orphan
signal — and any other error is propagated through DoRead/DoCreate so the
plan fails loudly instead of misclassifying it as drift.

Document the OverrideChangeDesc divergence from vector_search_endpoint
(which requires remoteUuid != ""): for indexes, an empty remoteUuid is the
orphan signal, and the lookup contract guarantees that case is unambiguous.
Add a Badness-marked test that deploys a bundle with both a
vector_search_endpoint and a vector_search_index referencing it, then
changes the endpoint_type to trigger an endpoint Recreate. The plan
correctly recreates the endpoint but leaves the dependent index
unchanged, so on a real workspace the endpoint delete would either
fail (indexes still attached) or orphan the index.

Root cause is in the planner (bundle/direct/bundle_plan.go): there is
no logic to propagate Recreate from a dependency to its dependents.
This is a framework-level concern that affects more than just VS,
so it's deferred to a follow-up. The Badness entry documents the gap.
Add a Badness-marked validate test showing that the name_prefix preset
does not rewrite a vector_search_indexes.*.endpoint_name literal that
points at a bundle-managed (and therefore prefixed) endpoint. The output
shows vs_endpoint -> prefix_vs_endpoint while vs_index_literal still
targets the unprefixed name vs_endpoint.

The DABs idiom is to use ${resources.vector_search_endpoints.X.name}
(captured by vs_index_ref in the same fixture). That form resolves
correctly to the prefixed name at plan/deploy time, so users have a
working pattern. The literal form silently breaks though, and the
preset has enough information to rewrite it; tracked as Badness for a
follow-up fix in apply_presets.go.
Mirror the existing vector_search_endpoint bind test: pre-create both
endpoint and index, bind the index into the bundle, deploy, unbind, and
destroy. Verifies the index survives unbind+destroy as expected.

Required by bundle/direct/dresources/README.md for new resource types.
The placeholder substitutes index_type, not primary_key — the original
name was misleading.
Drop the Terraform-provider justification (already implied by
"direct engine only") and the long list of internal mechanics.
Keep the entry focused on what customers see.
CreateIndex returns immediately with metadata of an index whose embedding
pipeline is still provisioning; queries against an index that isn't ready
fail. Implement WaitAfterCreate so dependent resources (and the next plan)
see a usable index. 75-minute timeout matches the terraform provider.

Co-authored-by: Isaac
The SaveState->DeleteState change in apply.Recreate and the empty-id
tolerance in bundle_plan.go were extracted to a separate PR (#5173).
Reverting them here so this branch and #5173 can land independently;
once #5173 merges, a rebase on main brings the same fix back in.

Co-authored-by: Isaac
@janniklasrose janniklasrose marked this pull request as ready for review May 6, 2026 10:51
Comment on lines +535 to +537
# columns_to_sync is request-only in the create spec and not returned by the read API.
- field: delta_sync_index_spec.columns_to_sync
reason: input_only
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR (under review by VS team) to return this field

// createIndexTimeout caps the wait for an index to become ready after creation.
// Delta sync indexes do an initial sync from the source table, which can stretch
// out for large tables. Matches the terraform provider's defaultIndexProvisionTimeout.
// https://github.com/databricks/terraform-provider-databricks/blob/c61a32300445f84efb2bb6827dee35e6e523f4ff/vectorsearch/resource_vector_search_index.go#L19
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update reference after databricks/terraform-provider-databricks#5598 is merged


trace $CLI bundle summary

trace $CLI vector-search-endpoints create-endpoint "${endpoint_name}" STANDARD | jq '{id, name, endpoint_type}'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create it as part of the same bundle and refer to it in the index?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong feeling either way.

  • pro: they are strongly coupled so should be declared together
  • con: precedent is to have resources/RESOURCE/*/databricks.yml.tmpl only contain the resource

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precedent is to have resources/RESOURCE/*/databricks.yml.tmpl only contain the resource

I don't think we are strict about it, we have already resources combining both (see catalogs or database_catalogs for example).

Having them together helps us to test the actual use case users will have + cleaning up is a bit easier too

// provisioning; queries against an index that isn't ready fail. Blocking here
// lets dependent resources (and the next plan) see a usable index.
func (r *ResourceVectorSearchIndex) WaitAfterCreate(ctx context.Context, config *VectorSearchIndexState) (*VectorSearchIndexRemote, error) {
index, err := retries.Poll(ctx, createIndexTimeout, func() (*vectorsearch.VectorIndex, *retries.Err) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a very long operation, right? Shall we maybe not wait for index to be ready and then later introduce support for lifecycle.started here which will wait?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is following terraform's 75min timeout (see databricks/terraform-provider-databricks#5598 that I made by request from VS team). Don't we always wait for creation for every resource? We can also keep WaitAfterCreate but make it opt-in (maybe use an experimental flag)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't for clusters and sql warehouses for example (in both direct and TF mode) and we now make it opt in with lifecycle.started option #5150

My concern with waiting is that if it's a very long operation users don't have a way out of it right now
But implementation wise I'm fine with both to be honest, we can iterate on it any time

}

func (r *ResourceVectorSearchIndex) DoUpdate(ctx context.Context, id string, config *VectorSearchIndexState, entry *PlanEntry) (*VectorSearchIndexRemote, error) {
// Vector search indexes have no update API; all field changes trigger recreation via resources.yml.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if the new field is added but it can be updatable, how do we not leave it unnoticed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants