Org-wide Meet transcript ETL + vector store + read-only MCP server by hhff · Pull Request #115 · sanctuarycomputer/stacks

hhff · 2026-06-29T07:39:53Z

What this is

The first slice of the MCP & agent-features effort: a durable, org-wide vector store fed by an ETL pipeline, with Google Meet transcripts as the first source, exposed to an agent through a read-only MCP server. Built inside Stacks (Rails 6.1) so it reuses the existing Postgres, Google Workspace auth, and ActiveAdmin.

Design spec: docs/superpowers/specs/2026-06-28-org-vector-store-and-etl-design.md · Plan: docs/superpowers/plans/2026-06-28-org-vector-store-and-etl.md · Deploy runbook: docs/meet-etl-deploy.md

How it works

Generic core (documents → chunks, polymorphic embeddings side-table, mentions, document_contacts, source_syncs) with pgvector (neighbor) + Postgres full-text. Search is source-agnostic, so future connectors light up automatically.
Connector/ETL framework (Stacks::Etl::Connector): extract → classify → resolve contacts → chunk → embed → load, idempotent upserts, per-source watermark, SystemTask-wrapped rake tasks.
Meet connector (source #1): hybrid Meet REST API (ongoing, rich) + Drive transcript-Doc sweep (90-day backfill, since the API only retains ~30 days). Rich meetings/participants/segments project into the generic core.
Calendar enrichment: matches each meeting to its Calendar event (precisely — by Meet code / exact title, never by time alone) to recover the real title (re-enabling title-based exclusion) and attendee emails (resolved to Contacts).
Org-wide multi-user sweep: impersonates every active Workspace user via domain-wide delegation, error-isolated; dedup by global IDs.
Local embeddings (informers + quantized mxbai-embed-large-v1, 1024-dim) — no API key, no chunk text leaves our infra.
Read-only MCP server (official mcp gem, Streamable HTTP at /api/mcp, stateless) — search / get_document / list_documents / list_sources, authenticated with the existing X-Api-Key private key.
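The connector stages listed above can be sketched in plain Ruby. This is a minimal illustration, not the Stacks::Etl::Connector code: `SketchConnector`, its method names, the fixed-width chunker, and the hash-derived placeholder vector are all assumptions made for the example, and the contact-resolution stage is left out for brevity.

```ruby
require "digest"

# Illustrative only: SketchConnector and its method names are hypothetical,
# not the Stacks::Etl::Connector API.
class SketchConnector
  attr_reader :store, :watermark

  # excluded_if: callable deciding whether a record is sensitive.
  def initialize(excluded_if: ->(_record) { false })
    @excluded_if = excluded_if
    @store = {}       # source_id => row, standing in for the documents table
    @watermark = nil  # highest updated_at loaded so far, standing in for source_syncs
  end

  # Processes only records newer than the watermark; returns how many were loaded.
  def sync(records)
    fresh = records.select { |r| @watermark.nil? || r[:updated_at] > @watermark }
    fresh.each do |record|
      excluded = @excluded_if.call(record)                        # classify
      chunks = excluded ? [] : chunk(record[:text])               # chunk (skipped when excluded)
      rows = chunks.map { |c| { text: c, embedding: embed(c) } }  # embed
      upsert(record, excluded, rows)                              # load
    end
    @watermark = [@watermark, *fresh.map { |r| r[:updated_at] }].compact.max
    fresh.size
  end

  private

  # Fixed-width chunking keeps the sketch short; a real chunker would respect segments.
  def chunk(text, size: 40)
    text.scan(/.{1,#{size}}/m)
  end

  # Placeholder vector derived from a hash, NOT a real embedding model.
  def embed(text)
    Digest::SHA256.digest(text).bytes.first(4).map { |b| b / 255.0 }
  end

  # Keyed on source_id, so loading the same record twice overwrites rather than duplicates.
  def upsert(record, excluded, rows)
    @store[record[:source_id]] = { title: record[:title], excluded: excluded, chunks: rows }
  end
end
```

Calling `sync` twice with the same records loads nothing the second time because of the watermark, and a newer version of a record replaces its row rather than adding one. Those are the two properties the framework bullet describes as "idempotent upserts" and "per-source watermark".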

Privacy / governance

Exclusion wall: sensitive meetings (1:1s, performance review, comp, HR, …) are auto-classified excluded and never chunked, embedded, or returned by any MCP tool or search mode — verified end-to-end. They keep their full transcript for the human record (ActiveAdmin only).
Reversible: excluded docs retain segments; an ActiveAdmin Include & index action chunks+embeds from stored segments (no re-fetch).
Identity is anchored on the existing contacts table.
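A rough sketch of the exclusion wall and its reversal, under stated assumptions: the title patterns, `SketchDocument`, and the helper names below are hypothetical, and the PR's real classifier may use different signals. The point is only that excluded documents keep their segments, get no chunks, are filtered out of search, and can be indexed later from what is already stored.

```ruby
# Hypothetical patterns; the PR's actual classifier rules are not shown in this description.
SENSITIVE_TITLE_RULES = {
  "one_on_one" => /\b1:1\b|\bone[- ]on[- ]one\b/i,
  "performance_review" => /\bperformance review\b/i,
  "compensation" => /\bcomp(ensation)?\b/i,
  "hr" => /\bHR\b/
}.freeze

SketchDocument = Struct.new(:title, :segments, :excluded, :excluded_reason, :chunks,
                            keyword_init: true)

# Chunks from segments already stored on the document (no re-fetch from the source).
def index!(doc)
  doc.chunks = doc.segments.each_slice(2).map { |pair| pair.join(" ") }
  doc
end

# Classifies on ingest; excluded documents keep segments but never get chunks.
def ingest(title, segments)
  reason, = SENSITIVE_TITLE_RULES.find { |_name, pattern| title.match?(pattern) }
  doc = SketchDocument.new(title: title, segments: segments, excluded: !reason.nil?,
                           excluded_reason: reason, chunks: [])
  doc.excluded ? doc : index!(doc)
end

# Stand-in for the admin "Include & index" action.
def include_and_index!(doc)
  doc.excluded = false
  doc.excluded_reason = nil
  index!(doc)
end

# Search never sees excluded documents, regardless of their content.
def search(docs, term)
  docs.reject(&:excluded)
      .select { |doc| doc.chunks.any? { |chunk| chunk.downcase.include?(term.downcase) } }
end
```

Filtering in `search` as well as skipping chunking at ingest is deliberate in this sketch: either check alone would hide the document, and keeping both mirrors the "never chunked, embedded, or returned" wording above.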

Admin

New top-level MCP menu → ETL subpages (Meetings, Documents, Source syncs); Chunks + Mentions are reached by drilling into a Document. Document show has Include & index / Exclude actions.

Testing

~60 new tests / 150+ assertions, plus the full suite green (313 runs, 0 failures). Schema verified to load cleanly from scratch (db:test:prepare). Enrichment + ingest verified against the live Workspace.

Deploy notes (see runbook)

Heroku Postgres must be Standard+ tier (pgvector); the enable_extension "vector" migration fails otherwise.
DWD scopes meetings.space.readonly + drive.readonly and the Meet API must be enabled (already done).
Overnight 90-day backfill: heroku run:detached --size=performance-l "rake 'stacks:etl:backfill_meet_all[90]'"; nightly stacks:etl:sync_meet_all on a Performance dyno.

Known follow-ups (non-blocking, in the runbook)

Cross-source dedup (Drive backfill vs Meet API sync), semantic-search IN (...) scaling, baking the embedding model into the slug, and the intelligence layer (decisions/commitments/tasks/opportunities) as a separate spec.

Out of scope

Intelligence layer, connectors beyond Meet, multi-tenant/productization — all deliberately deferred.

🤖 Generated with Claude Code

Foundation spec for the MCP & agent-features effort: ingest Meet transcripts org-wide (hybrid Meet API + Drive backfill), store them permanently in Postgres with full-text + pgvector semantic search, and expose them through a read-only MCP server. Includes an excluded/excluded_reason exclusion layer that walls sensitive meetings (1:1s, reviews, comp, HR) off from the agent entirely. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

Generalize the design from a Meet-specific transcript corpus into a source-agnostic vector store (documents/document_people/document_chunks with pgvector + full-text + facets) plus a connector/ETL framework with incremental sync watermarks. The MCP server now searches across all sources at once, so every future connector (Notion, Gmail, etc.) is searchable the moment it loads. Meet transcripts become the first connector, keeping rich meetings/segments/participants tables that project into the generic core. Exclusion layer and identity resolution (people -> AdminUser/Contributor) generalize to any source. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…e target)
Fold in the institutional-memory design Hugh shared:
- Index is a synthesis layer, not a mirror: full-ingest Meet (no MCP), promote only high-signal slices from MCP-having sources later.
- Embeddings become a versioned polymorphic side-table keyed by (owner_type, owner_id, model) instead of a vector column on chunks.
- Add document_versions (history for changing docs) and a mentions table + unresolved-mention queue for transcript display-name -> AdminUser resolution; chunks carry stable spans for later evidence citation.
- MCP: official mcp Ruby gem over Streamable HTTP, per-client scoped tokens, audit logging, retrieved content treated as untrusted.
- Record the full 16-table north star and which tables this foundation builds vs. the later intelligence layer.
Decisions locked: foundation-first scope, keep rake + Heroku Scheduler (no GoodJob, embed inline), internal single-tenant.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
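To make the (owner_type, owner_id, model) keying concrete, here is a small in-memory sketch. `SketchEmbeddingTable` and the model names in the usage are hypothetical; the real side-table is a Postgres table with a pgvector column, which this does not attempt to reproduce.

```ruby
# Hypothetical in-memory stand-in for the embeddings side-table.
class SketchEmbeddingTable
  Row = Struct.new(:owner_type, :owner_id, :model, :vector)

  def initialize
    @rows = {}
  end

  # One row per (owner_type, owner_id, model); re-embedding with the same model overwrites.
  def upsert(owner_type:, owner_id:, model:, vector:)
    @rows[[owner_type, owner_id, model]] = Row.new(owner_type, owner_id, model, vector)
  end

  # Vectors from different models are never mixed in one query.
  def for_model(model)
    @rows.values.select { |row| row.model == model }
  end

  def size
    @rows.size
  end
end
```

Because the model is part of the key, a chunk can carry vectors from an old and a new model side by side during a migration, which is presumably what "versioned" refers to here.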

Use Contact as the canonical person spine (FK target everywhere) rather than AdminUser/Contributor or a new person table: contacts is unique on email, Apollo-enriched, and already aims to cover everyone — including external @gmail/client guests who are neither workspace logins nor Forecast contributors. Extend contacts with nullable admin_user_id / contributor_id bridge links (populated via the AdminUser cross-domain matcher) so internal org info is reachable and domain-variant rows regroup. mentions/document_people/Meet participants+segments now FK to contact_id; display-name resolution targets Contact with the unresolved queue. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…min UI
- Contact is the identity outright: resolve every person by email (create_or_find_by, made if missing, whether workspace or external), with no AdminUser/Contributor reconciliation and no bridge columns. Contributor org-info is joinable later by email if the intelligence layer wants it.
- Drop the document_versions table (Meet transcripts are immutable): content_hash moves onto documents, chunks belong to documents, and per-fetch versioning returns when the first mutable source (Notion) lands.
- mentions are chunk-scoped; chunks/segments carry speaker_contact_id.
- Add a top-level ActiveAdmin "MCP" menu with an "ETL" subpage (meetings/documents/chunks/mentions queue/source_syncs) for visual debugging and exclusion review.
- Add open item to scope sync_contacts (Apollo) away from meet-only contacts.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

The existing stacks:sync_contacts already enriches every contact via Apollo, so meet-sourced contacts flow through as desired. No special scoping needed. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

23 TDD tasks: pgvector/gems, generic core schema (documents/chunks/embeddings side-table/mentions/document_contacts/source_syncs), Meet tables, Voyage embedder, chunker, mention resolver, ETL connector base, Meet auth/classifier/api+drive sources/connector, SystemTask rake tasks, hybrid search, read-only MCP tools + streamable-HTTP endpoint with bearer auth, and the ActiveAdmin MCP -> ETL menu. Pins Ruby-3.1-compatible gem versions and flags the Heroku pgvector tier prerequisite. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>

…ey for MCP
- Embeddings now run locally via the informers gem (ONNX), model mixedbread-ai/mxbai-embed-large-v1 quantized, 1024-dim. No API key, and no chunk text leaves our infra (privacy win). Search queries get the model's retrieval prefix; stored chunks do not.
- MCP endpoint reuses the existing private API key (X-Api-Key header / config[:stacks][:private_api_key], via ApiController#check_private_api_key!) instead of a new bearer token; unauthorized -> 403.
- Plan: Task 10 rewritten for informers; Task 1 adds the gem; Task 5/20 model strings updated; Task 22 auth rewritten.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
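The asymmetric prefixing can be sketched as a thin wrapper around whatever callable produces vectors. `PrefixingEmbedder` is a hypothetical name, and the model is injected as a plain callable so the sketch runs without downloading anything. The prefix string is the query prompt as published on the mxbai-embed-large-v1 model card, to the best of my knowledge; treat it as an assumption and check it against the card.

```ruby
# Query prompt for mxbai-embed-large-v1 (assumed from the model card; verify before relying on it).
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

# Hypothetical wrapper: `model` is any callable mapping a String to an Array of Floats.
class PrefixingEmbedder
  def initialize(model)
    @model = model
  end

  # Stored chunks are embedded verbatim.
  def embed_document(text)
    @model.call(text)
  end

  # Search queries get the retrieval prefix before embedding.
  def embed_query(text)
    @model.call(QUERY_PREFIX + text)
  end
end
```

Keeping the prefix in one place means ingestion and search cannot silently disagree about which side is prefixed. A mismatch there would not raise an error, it would only lower retrieval quality.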

- Adds neighbor ~> 0.4.3, informers 1.2.1, mcp 0.22.0, google-apis-meet_v2 0.13.0, google-apis-drive_v3 0.81.0 gems
- Updates google-apis-core 0.5.0 → 0.15.1 (required by meet_v2), cascading to googleauth 1.9.2, signet 0.16.1, addressable 2.8.7
- Adds rexml explicitly (dropped from google-apis-core transitive chain)
- Enables the vector Postgres extension via migration
- Smoke-tests pg_extension presence with PgvectorSmokeTest
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Adds a display_name column to the contacts table and implements the Contact.resolve_email(email, name: nil) class method, which finds or creates a contact, tags it with the 'meet' source, and sets display_name when it is blank. Also fixes schema.rb by removing three incorrect composite foreign key definitions that were preventing tests from running. The real FKs are defined in the migration using raw SQL. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
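The find-or-create behaviour described for resolve_email can be sketched without ActiveRecord. `SketchContact` is a hypothetical in-memory stand-in for the Contact model, and the email normalization (strip + downcase) and the nil return for blank input are assumptions of this sketch, not confirmed behaviour of the real method.

```ruby
# Hypothetical in-memory stand-in for the Contact model.
class SketchContact
  attr_reader :email, :sources
  attr_accessor :display_name

  @registry = {} # email => contact, standing in for the unique index on contacts.email

  class << self
    def resolve_email(email, name: nil)
      key = email.to_s.strip.downcase # normalization is an assumption of this sketch
      return nil if key.empty?

      contact = (@registry[key] ||= new(key))
      contact.sources << "meet" unless contact.sources.include?("meet")
      # Only fill display_name when it is blank; never overwrite an existing one.
      contact.display_name = name if name && contact.display_name.to_s.empty?
      contact
    end

    def count
      @registry.size
    end
  end

  def initialize(email)
    @email = email
    @sources = []
    @display_name = nil
  end
end
```

Resolving the same address twice returns the same contact, keeps the first display name, and records the 'meet' source only once.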

Task 2's migration regenerated db/schema.rb without the three intentional composite-FK documentation comments (fk_adhoc_invoice_trackers_qbo_invoice, fk_contributor_adjustments_qbo_invoice, fk_invoice_trackers_qbo_invoice) that exist in the branch base. Restore them so schema.rb matches canonical. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>