Org-wide Meet transcript ETL + vector store + read-only MCP server #115
hhff wants to merge 51 commits into
Foundation spec for the MCP & agent-features effort: ingest Meet transcripts org-wide (hybrid Meet API + Drive backfill), store them permanently in Postgres with full-text + pgvector semantic search, and expose them through a read-only MCP server. Includes an excluded/ excluded_reason exclusion layer that walls sensitive meetings (1:1s, reviews, comp, HR) off from the agent entirely. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Generalize the design from a Meet-specific transcript corpus into a source-agnostic vector store (documents/document_people/document_chunks with pgvector + full-text + facets) plus a connector/ETL framework with incremental sync watermarks. The MCP server now searches across all sources at once, so every future connector (Notion, Gmail, etc.) is searchable the moment it loads. Meet transcripts become the first connector, keeping rich meetings/segments/participants tables that project into the generic core. Exclusion layer and identity resolution (people -> AdminUser/Contributor) generalize to any source. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…e target)
Fold in the institutional-memory design Hugh shared:
- Index is a synthesis layer, not a mirror: full-ingest Meet (no MCP), promote
  only high-signal slices from MCP-having sources later.
- Embeddings become a versioned polymorphic side-table keyed by (owner_type,
  owner_id, model) instead of a vector column on chunks.
- Add document_versions (history for changing docs) and a mentions table +
  unresolved-mention queue for transcript display-name -> AdminUser resolution;
  chunks carry stable spans for later evidence citation.
- MCP: official mcp Ruby gem over Streamable HTTP, per-client scoped tokens,
  audit logging, retrieved content treated as untrusted.
- Record the full 16-table north star and which tables this foundation builds
  vs. the later intelligence layer.
Decisions locked: foundation-first scope, keep rake + Heroku Scheduler (no GoodJob, embed inline), internal single-tenant.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Use Contact as the canonical person spine (FK target everywhere) rather than AdminUser/Contributor or a new person table: contacts is unique on email, Apollo-enriched, and already aims to cover everyone — including external @gmail/client guests who are neither workspace logins nor Forecast contributors. Extend contacts with nullable admin_user_id / contributor_id bridge links (populated via the AdminUser cross-domain matcher) so internal org info is reachable and domain-variant rows regroup. mentions/document_people/Meet participants+segments now FK to contact_id; display-name resolution targets Contact with the unresolved queue. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…min UI
- Contact is the identity outright: resolve every person by email
  (create_or_find_by, made if missing, whether workspace or external), with no
  AdminUser/Contributor reconciliation and no bridge columns. Contributor
  org-info is joinable later by email if the intelligence layer wants it.
- Drop the document_versions table (Meet transcripts are immutable):
  content_hash moves onto documents, chunks belong to documents, and per-fetch
  versioning returns when the first mutable source (Notion) lands.
- mentions are chunk-scoped; chunks/segments carry speaker_contact_id.
- Add a top-level ActiveAdmin "MCP" menu with an "ETL" subpage
  (meetings/documents/chunks/mentions queue/source_syncs) for visual debugging
  and exclusion review.
- Add open item to scope sync_contacts (Apollo) away from meet-only contacts.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The existing stacks:sync_contacts already enriches every contact via Apollo, so meet-sourced contacts flow through as desired. No special scoping needed. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
23 TDD tasks: pgvector/gems, generic core schema (documents/chunks/ embeddings side-table/mentions/document_contacts/source_syncs), Meet tables, Voyage embedder, chunker, mention resolver, ETL connector base, Meet auth/classifier/api+drive sources/connector, SystemTask rake tasks, hybrid search, read-only MCP tools + streamable-HTTP endpoint with bearer auth, and the ActiveAdmin MCP -> ETL menu. Pins Ruby-3.1-compatible gem versions and flags the Heroku pgvector tier prerequisite. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…ey for MCP
- Embeddings now run locally via the informers gem (ONNX), model
  mixedbread-ai/mxbai-embed-large-v1 quantized, 1024-dim. No API key, and no
  chunk text leaves our infra (privacy win). Search queries get the model's
  retrieval prefix; stored chunks do not.
- MCP endpoint reuses the existing private API key (X-Api-Key header /
  config[:stacks][:private_api_key], via ApiController#check_private_api_key!)
  instead of a new bearer token; unauthorized -> 403.
- Plan: Task 10 rewritten for informers; Task 1 adds the gem; Task 5/20 model
  strings updated; Task 22 auth rewritten.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Adds neighbor ~> 0.4.3, informers 1.2.1, mcp 0.22.0, google-apis-meet_v2
  0.13.0, google-apis-drive_v3 0.81.0 gems
- Updates google-apis-core 0.5.0 → 0.15.1 (required by meet_v2), cascading to
  googleauth 1.9.2, signet 0.16.1, addressable 2.8.7
- Adds rexml explicitly (dropped from google-apis-core transitive chain)
- Enables the vector Postgres extension via migration
- Smoke-tests pg_extension presence with PgvectorSmokeTest
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Adds display_name column to contacts table and implements Contact.resolve_email(email, name: nil) class method for creating or finding contacts, tagging them with 'meet' source, and setting display_name when blank. Also fix schema.rb by removing three incorrect composite foreign key definitions that were preventing tests from running. The real FKs are defined in the migration using raw SQL. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Task 2's migration regenerated db/schema.rb without the three intentional composite-FK documentation comments (fk_adhoc_invoice_trackers_qbo_invoice, fk_contributor_adjustments_qbo_invoice, fk_invoice_trackers_qbo_invoice) that exist in the branch base. Restore them so schema.rb matches canonical. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Implement the Document model with source and exclusion enums, corpus_eligible scope for non-excluded and manually-included documents, and predicates for querying document status. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Rails 6.1 dumps PostgreSQL GENERATED columns as DEFAULT expressions that PostgreSQL rejects on schema:load (cannot use column reference in DEFAULT). The content_tsv tsvector column and its GIN index are omitted from schema.rb with an explanatory comment; test_helper.rb recreates them idempotently after schema:load using ADD COLUMN IF NOT EXISTS and CREATE INDEX IF NOT EXISTS, mirroring the existing trigger workaround. Also restores the three composite-FK comment lines that db:schema:dump had overwritten with unloadable add_foreign_key calls. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Also fix schema.rb tsvector generated column to use GENERATED ALWAYS AS instead of unsupported DEFAULT column-reference expression. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Task 12 (mention resolver) ran an unnecessary db:migrate that re-dumped db/schema.rb with spurious content_tsv execute lines, unrelated ledgers columns (DB drift), and stripped composite-FK comments. Restore schema.rb to the last-known-good state; the resolver code/test are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…anch Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ivacy wall Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Add Mcp::Server.build wrapping all four Mcp::*Tool classes
- Add Api::McpController dispatching to StreamableHTTPTransport (stateless,
  enable_json_response)
- Route POST/GET/DELETE /api/mcp to mcp#handle inside :api namespace
- Add explicit tool_name to all four tools (search, list_documents,
  list_sources, get_document)
- Fix $LOAD_PATH shadowing: classic autoloader puts app/services ahead of mcp
  gem lib, so app/services/mcp/server.rb would shadow mcp/server.rb; pre-load
  MCP::Server via absolute gem path before defining Mcp::Server
- Integration test proves: missing key → 403; valid key + tools/list → 200
  with search tool
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
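For reviewers who want to poke the endpoint by hand, here is a minimal sketch of what a `tools/list` call could look like from a plain Ruby client. Only the `/api/mcp` path, the `X-Api-Key` header, and the `tools/list` method come from this PR; the host name and the `build_mcp_request` helper are hypothetical, and the `Accept` header follows my reading of the MCP Streamable HTTP transport rather than anything verified against this server. The sketch builds the request without sending it.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical host; only the /api/mcp path is taken from the PR.
MCP_URI = URI('https://stacks.example.com/api/mcp')

# Build (but do not send) a JSON-RPC 2.0 request for an MCP method.
def build_mcp_request(api_key, method_name, params = {}, id: 1)
  request = Net::HTTP::Post.new(MCP_URI)
  request['Content-Type'] = 'application/json'
  # Assumption: Streamable HTTP clients advertise both response content types.
  request['Accept'] = 'application/json, text/event-stream'
  # The PR reuses the existing private API key in this header.
  request['X-Api-Key'] = api_key
  request.body = JSON.generate(jsonrpc: '2.0', id: id, method: method_name, params: params)
  request
end

request = build_mcp_request('replace-with-private-key', 'tools/list')
puts request.path
puts request.body
```

Sending it would be a matter of `Net::HTTP.start(MCP_URI.host, MCP_URI.port, use_ssl: true) { |http| http.request(request) }`; per the integration test described above, a missing key should come back as 403.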
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Register Meeting, Document, Chunk, Mention, SourceSync under a top-level MCP > ETL menu in ActiveAdmin. Meeting show page lists transcript segments; Mention exposes a resolve member_action (PUT) that assigns a Contact and sets status :resolved. Integration test covers index render + resolve. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…rce since filter, MentionResolver nil-contact guard, MCP key-configured check, DriveSource string-since coercion, plainto_tsquery language fix Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… menu) Chunk and Mention are no longer top-level MCP menu items (menu false); they're reached by drilling into a Document, whose show page now lists its chunks and mentions. MCP menu is now Meetings / Documents / Source syncs. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…user listing
- CalendarEnricher matches a meeting to its Calendar event PRECISELY (by Meet
  code for the API path, by exact title for the Drive path) to recover the real
  title (re-enabling title-based exclusion) and attendee emails (resolved to
  Contacts). No time-only fallback, to avoid mis-assigning nearby events.
- Auth gains a calendar_service using the full 'calendar' scope already
  authorized in the org's domain-wide delegation (calendar.readonly is NOT
  granted).
- MeetApiSource + DriveSource use enrichment for title + attendee contacts;
  DriveSource also cleans the transcript doc name into a real title.
- Workspace.all_active_user_emails lists every active user org-wide (customer:
  my_customer) for the multi-user sweep.
Verified against the live Workspace: real titles + attendees resolve correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… from stored segments)
- Stacks::Etl::Meet.sweep_all_users! impersonates every active Workspace user,
  error-isolated (one user's failure never aborts the run), SystemTask-wrapped.
  New rake tasks: stacks:etl:backfill_meet_all[days] (Drive, all users, 90d) and
  stacks:etl:sync_meet_all[days] (Meet API, all users).
- Reversible exclusion: excluded docs already retain their full Meeting +
  segments. Extracted Connector.index_chunks! (class method) so a Reindexer can
  chunk+embed a re-included doc from its STORED segments (no Google re-fetch).
  Connector also self-heals: a corpus-eligible doc missing chunks gets indexed
  on the next sweep. ActiveAdmin Document gains 'Include & index' and 'Exclude'
  actions.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Add x86_64-linux to Gemfile.lock PLATFORMS so onnxruntime's Linux native gem
  installs on Heroku (the local embedding model dep).
- docs/meet-etl-deploy.md: pgvector tier prereq, deploy + migrate steps, MCP
  connection (X-Api-Key), the overnight 90-day Drive backfill command
  (performance-l detached dyno), and the nightly Performance-dyno Scheduler job.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
    ingest(normalized)
    count += 1
  end
  sync.advance!(cursor: { 'since' => Time.current.iso8601 }, stats: { 'documents' => count })
🟠 Important (data loss) — Late-finalizing transcripts are permanently missed
The watermark advances to Time.current and the next Meet-API run filters start_time >= cursor. A meeting whose transcript is still generating when the run sees it gets stored empty, and is then permanently excluded from re-fetch. Fix: advance the cursor with a lookback (e.g. run_started_at - 2.days) or re-scan a trailing window so not-yet-final transcripts get re-pulled.
(automated self-review, code-review workflow)
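A minimal sketch of the suggested lookback, in plain Ruby so it runs outside Rails. The helper names (`next_cursor`, `transcript_ready?`) and the constant are hypothetical; only the 2-day figure and the `'since'` cursor key come from the comment and the diff above.

```ruby
require 'time'

# Two days, matching the lookback suggested in the review comment.
LOOKBACK_SECONDS = 2 * 24 * 60 * 60

# Hypothetical helper: compute the watermark from when the run STARTED,
# minus a lookback, so meetings seen mid-generation are re-scanned next run.
def next_cursor(run_started_at, lookback: LOOKBACK_SECONDS)
  { 'since' => (run_started_at - lookback).utc.iso8601 }
end

# Hypothetical guard: do not store a meeting whose transcript is still empty.
def transcript_ready?(segments)
  !segments.nil? && !segments.empty?
end

started = Time.utc(2026, 7, 10, 3, 0, 0)
puts next_cursor(started)['since']
puts transcript_ready?([])
```

The trade-off is that each run re-reads a trailing window, which is only cheap if ingest is idempotent (the PR describes idempotent upserts, so that should hold).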
  end_offset: nil,
  speaker_name: seg[:speaker_name],
  speaker_email: seg[:speaker_email],
  occurred_at: seg[:started_at]
🟠 Important (data loss) — Drive-sourced chunks get occurred_at=nil → excluded from date-scoped search
DriveSource#parse_segments sets started_at: nil, so every Drive (backfill) chunk has occurred_at = nil, and Search.filtered date_range predicate excludes them all. Since the 90-day backfill is Drive-based, date-bounded search misses the entire backfilled history. Fix: in DriveSource, fall back to the doc's created_time for segment/chunk occurred_at.
(automated self-review, code-review workflow)
module Etl
  class Connector
    def run(since: nil)
      sync = SourceSync.for(source)
🟠 Important — All ingestion paths share one SourceSync cursor (clobbering)
sync_meet, the Drive backfill, and every per-user sweep run share SourceSync.for(:meet), overwriting each other's cursor['since'] and stats. A sweep that runs between incremental syncs moves the watermark to a wrong time → the next sync silently skips a window. list_sources also reports only the last writer. Fix: key SourceSync per path/user, or don't advance the shared cursor from multi-user/backfill runs (they already pass an explicit since).
(automated self-review, code-review workflow)
Rails.logger.info("[#{task_name}] #{ok}/#{emails.size} users ok, #{failed.size} failed")
failed.first(25).each { |f| Rails.logger.warn("[#{task_name}] FAIL #{f}") }
system_task.mark_as_success
🟠 Important (ops) — sweep_all_users! reports success even when every user failed
mark_as_success runs unconditionally after the loop. If org-wide auth breaks, every per-user run is swallowed into failed but the SystemTask is green — a total ingestion outage goes undetected. Fix: mark_as_error (or raise) when ok.zero? && emails.any?, and surface the failed count on the task.
(automated self-review, code-review workflow)
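One way the fix could look, sketched without Rails or SystemTask. `SweepResult`, `sweep`, and `sweep_status` are hypothetical names for illustration; the real code would call `mark_as_error` / `mark_as_success` on the task instead of returning a symbol.

```ruby
# Hypothetical result holder for an error-isolated sweep.
SweepResult = Struct.new(:ok, :failed, keyword_init: true)

# Run the block once per email; one user's failure never aborts the loop.
def sweep(emails)
  ok = 0
  failed = []
  emails.each do |email|
    begin
      yield email
      ok += 1
    rescue StandardError => e
      failed << "#{email}: #{e.message}"
    end
  end
  SweepResult.new(ok: ok, failed: failed)
end

# Report an error only when there were users to sweep and none succeeded.
def sweep_status(result, total)
  return :error if total.positive? && result.ok.zero?

  :success
end

emails = %w[a@example.com b@example.com]
outage = sweep(emails) { |_email| raise 'auth failed' }
puts sweep_status(outage, emails.size)
```

A partial failure still reports `:success` here, which mirrors the comment's `ok.zero? && emails.any?` condition; surfacing `result.failed.size` on the task is left out of the sketch.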
def self.resolve_email(email, name: nil)
  normalized = email.to_s.downcase.strip
  contact = create_or_find_by!(email: normalized)
🟠 Important (backfill robustness) — An invalid attendee email aborts the meeting's ingest
resolve_email → create_or_find_by! under validates :email, format: Devise.email_regexp. A Calendar attendee/speaker email that fails the regexp raises RecordInvalid inside the un-rescued ingest transaction, killing that meeting (and, in a sweep, the user's remaining meetings). Fix: validate/normalize the email and skip (or null-contact) when it can't be a Contact, rather than raising.
(automated self-review, code-review workflow)
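A sketch of the skip-instead-of-raise behaviour, assuming an in-memory hash in place of the `contacts` table. It swaps in Ruby's stdlib `URI::MailTo::EMAIL_REGEXP` for `Devise.email_regexp` so the example runs standalone; the two patterns are not identical. `normalize_email` and `resolve_contact` are hypothetical names.

```ruby
require 'uri'

# Return a normalized email, or nil when the input cannot be a Contact email.
# Note: uses the stdlib pattern, not Devise.email_regexp as in the real model.
def normalize_email(raw)
  email = raw.to_s.strip.downcase
  email.match?(URI::MailTo::EMAIL_REGEXP) ? email : nil
end

# Hypothetical stand-in for Contact.resolve_email: find or create by email,
# returning nil (instead of raising) for malformed input.
def resolve_contact(store, raw)
  email = normalize_email(raw)
  return nil unless email

  store[email] ||= { email: email }
end

contacts = {}
resolve_contact(contacts, ' Ada@Example.com ')
resolve_contact(contacts, 'conference-room')
puts contacts.keys.inspect
```

Callers then need to tolerate a nil contact (for example by storing the speaker name without a `contact_id`), which is the "null-contact" option mentioned in the comment.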
scope = scope.where(occurred_at: date_range) if date_range
if contact
  c = contact.is_a?(Contact) ? contact : Contact.find_by(email: contact.to_s.downcase)
  scope = scope.where(speaker_contact_id: c&.id)
🟡 Correctness — contact filter with an unknown email returns wrong results
When contact: is given but no Contact matches, c&.id is nil so the scope becomes WHERE speaker_contact_id IS NULL — returning unrelated unattributed chunks as if they were that person's. Fix: if a contact filter is supplied but resolves to nil, return empty (scope.none).
(automated self-review, code-review workflow)
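The intended semantics can be sketched with arrays standing in for the ActiveRecord scope. `Chunk` here is a plain Struct, and `filter_by_contact` and the lookup hash are hypothetical; in the real code the empty case would be `scope.none`.

```ruby
# Plain stand-in for the Chunk model.
Chunk = Struct.new(:id, :speaker_contact_id)

# Filter chunks by speaker. Three distinct cases:
#   no filter given        -> everything
#   filter, known email    -> that contact's chunks
#   filter, unknown email  -> nothing (never the unattributed chunks)
def filter_by_contact(chunks, contact_ids_by_email, contact_email)
  return chunks if contact_email.nil?

  contact_id = contact_ids_by_email[contact_email.to_s.downcase]
  return [] if contact_id.nil?

  chunks.select { |chunk| chunk.speaker_contact_id == contact_id }
end

chunks = [Chunk.new(1, 10), Chunk.new(2, nil), Chunk.new(3, 10)]
lookup = { 'ada@example.com' => 10 }
puts filter_by_contact(chunks, lookup, 'Ada@example.com').map(&:id).inspect
puts filter_by_contact(chunks, lookup, 'nobody@example.com').map(&:id).inspect
```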
  segments.map { |s| { email: nil, name: s[:speaker_name], role: 'speaker' } }.uniq
end
{
  external_id: file.id,
🟡 Correctness — No cross-source dedup: same meeting via Drive + Meet API = two documents
Drive keys on file.id, Meet API on cr.name → the same meeting ingested by both sweeps becomes two corpus-eligible Documents, and the agent sees/cites it twice. The spec calls for reconciling onto one row. Fix: unify via the transcript's Drive doc id (the Meet API transcript exposes docs_destination.document) as a shared dedup key.
(automated self-review, code-review workflow)
  events.find { |e| e.conference_data&.conference_id == meeting_code }
elsif title_hint.present?
  hint = normalize_title(title_hint)
  events.find { |e| normalize_title(e.summary) == hint }
🟡 Correctness — Drive title-match can cross-assign attendees between same-title meetings
The Drive path matches the first Calendar event with an equal normalized title in the ±2h window. Two different 'Weekly Sync' meetings in that window → attendees from the wrong one. Fix: when matching by title, also pick the closest start time, and/or skip if more than one event shares the title.
(automated self-review, code-review workflow)
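Both options from the fix can be sketched side by side. `Event` is a plain Struct rather than the Calendar API object, this `normalize_title` is a guess at the real one, and `closest_by_title` / `unambiguous_by_title` are hypothetical names.

```ruby
# Plain stand-in for a Calendar event.
Event = Struct.new(:summary, :start_time)

# Assumed normalization: lowercase, drop punctuation, squeeze whitespace.
def normalize_title(title)
  title.to_s.downcase.gsub(/[^[:alnum:]\s]/, '').split.join(' ')
end

def same_title(events, title_hint)
  hint = normalize_title(title_hint)
  events.select { |event| normalize_title(event.summary) == hint }
end

# Option A: among same-title events, take the one closest to the transcript time.
def closest_by_title(events, title_hint, transcript_time)
  same_title(events, title_hint).min_by { |event| (event.start_time - transcript_time).abs }
end

# Option B: refuse to match when more than one event shares the title.
def unambiguous_by_title(events, title_hint)
  candidates = same_title(events, title_hint)
  candidates.size == 1 ? candidates.first : nil
end

base = Time.utc(2026, 7, 10, 15, 0, 0)
events = [Event.new('Weekly Sync', base - 3600), Event.new('Weekly sync!', base + 600)]
puts closest_by_title(events, 'Weekly Sync', base).start_time
puts unambiguous_by_title(events, 'Weekly Sync').inspect
```

Option A always returns something when a title matches, so it can still pick the wrong meeting; option B gives up attendee enrichment in the ambiguous case but never cross-assigns.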
t.datetime "occurred_at"
t.datetime "created_at", precision: 6, null: false
t.datetime "updated_at", precision: 6, null: false
# content_tsv (tsvector GENERATED ALWAYS AS) and its GIN index are intentionally
🟡 Correctness (non-migrate bootstrap) — schema.rb omits chunks.content_tsv → keyword/hybrid search breaks on db:schema:load
The generated content_tsv column + GIN index aren't dumpable, so they're omitted from schema.rb and only recreated in test_helper.rb. Prod (migrations) is fine, but any db:setup/db:reset/review-app built via db:schema:load lacks the column → Chunk.keyword_search 500s. Fix: recreate it in an idempotent post-load path (a seeds/initializer or structure.sql), or document that schema:load isn't supported.
(automated self-review, code-review workflow)
owner_ids = scope.pluck(:id)
return [] if owner_ids.empty?
vector = Embedder.embed([query], input_type: 'query')[:vectors].first
Embedding.where(model: Embedder::MODEL, owner_type: 'Chunk', owner_id: owner_ids)
🔵 Minor (scaling) — Semantic search materializes all eligible chunk ids into an IN(...) list
The corpus wall is enforced by plucking every eligible chunk id and passing an unbounded owner_id IN (...), which defeats the HNSW index as the corpus grows. Fix: enforce eligibility via a JOIN to chunks/documents in SQL rather than a materialized id list.
(automated self-review, code-review workflow)
Self-review (code-review workflow, high effort: 39 agents, every finding independently verified)
The architecture, the exclusion/privacy wall (verified airtight across keyword/semantic/hybrid + all MCP tools), and MCP auth (X-Api-Key fail-closed) hold up. The 10 verified findings are all correctness/robustness, not structural, but 5 would bite the overnight org-wide backfill specifically, so I'd fix those before kicking it off.
Fix before the overnight backfill (🟠): findings #1–#5.
Correctness cleanups (🟡): contact-filter-with-unknown-email returns wrong rows (…)
Scaling (🔵): semantic search materializes all eligible chunk ids into an IN (...) list.
Details + suggested fixes are in the inline comments. None block merge of the foundation; #1–#5 are worth landing before the production backfill run.
- #1 transcripts: skip meetings with no transcript yet; cursor advances with a
  2-day LOOKBACK so late-finalizing transcripts get re-pulled.
- #3 cursor: multi-user sweeps/backfills run with track:false so they no longer
  clobber the ongoing single-user :meet SourceSync cursor.
- #4 sweep: mark the SystemTask errored (not green) when ALL users fail.
- #5 contacts: Contact.resolve_email returns nil for malformed emails instead of
  raising RecordInvalid mid-ingest.
- #6 search: a contact filter with an unknown email returns none (not
  unattributed chunks via speaker_contact_id IS NULL).
- #7 dedup: key the document + meeting on the transcript's Drive doc id so a
  meeting ingested via both the Meet API and the Drive backfill reconciles onto
  one row.
- #2 Drive: stamp segments with the doc's created_time so chunks get occurred_at
  (date-scoped search no longer misses Drive-backfilled history).
- #9 enricher: among same-title Calendar events, pick the one closest in time.
- #10 search: constrain semantic search to eligible chunks via a SUBQUERY, not a
  materialized IN(...) id list.
- #8 schema: recreate content_tsv idempotently in db/seeds.rb for schema:load
  builds.
- Runbook: web-dyno sizing note for semantic-search query embedding.
68 ETL/MCP tests green (171 assertions), fresh db:test:prepare clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… classify/parse/chunk
Root cause of the worst findings was unifying Drive + Meet API onto one Document by the Drive doc id (iter-1 #7). Reverted that; replaced with safe TIME-PARTITIONING:
- Meet API keys on the conference record again (drive_doc_id kept in metadata
  only); no Meeting re-keying. Kills the privacy-wall breach (#1: a benign
  source title could un-exclude a sensitive transcript), the RecordNotUnique
  sweep crash (#2), the re-embed/segment ping-pong thrash (#3), and the
  lag-duplicate (#4).
- Drive backfill now takes an until_time and covers only the OLDER window (>7d);
  sync_meet_all owns the recent window, partitioned so neither double-ingests.
Independent fixes:
- #5 classify on real participant_count (Meet participants / distinct Drive
  speakers), not contacts.size, so a big meeting with few speakers isn't
  mis-flagged 1:1.
- #6 Drive build_meeting persists segment started_at -> Reindexer yields real
  occurred_at.
- #7 enricher title match strips emoji/punctuation so doc-name vs
  calendar-summary drift doesn't drop attendee emails.
- #8 tighter speaker-line regex: URLs/timestamps/labels no longer parsed as
  speakers.
- #9 chunker stops at the slice that reaches the end -> no trailing
  near-duplicate chunk.
69 ETL/MCP tests green (177 assertions), fresh db:test:prepare clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
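The time-partitioning idea can be illustrated with half-open windows. This is only a sketch: `partition_windows` and `owner_for` are hypothetical helpers, and the only numbers taken from the commit are the 7-day boundary and the `until_time` name.

```ruby
DAY = 24 * 60 * 60
RECENT_WINDOW_DAYS = 7 # boundary between the Drive backfill and the Meet API sync

# Build two non-overlapping, half-open windows [since, until_time).
def partition_windows(now, backfill_days:)
  boundary = now - RECENT_WINDOW_DAYS * DAY
  {
    drive: { since: now - backfill_days * DAY, until_time: boundary },
    meet_api: { since: boundary, until_time: now }
  }
end

# Which source (if any) is responsible for a meeting at this time?
def owner_for(occurred_at, windows)
  windows.find { |_source, w| occurred_at >= w[:since] && occurred_at < w[:until_time] }&.first
end

now = Time.utc(2026, 7, 10)
windows = partition_windows(now, backfill_days: 90)
puts owner_for(now - 30 * DAY, windows).inspect
puts owner_for(now - 2 * DAY, windows).inspect
```

Because the windows share a single boundary instant and are half-open, a meeting exactly at the boundary belongs to one source only. A transcript whose Drive doc lags past the boundary can still land in both windows, which is presumably why later commits add an explicit dedup check on top.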
- DriveSource: loosen SPEAKER_LINE to "Name: <text>" (colon+space), so
initials/parenthetical speakers ("J.R.:", "John Doe (Guest):") keep their
text; still excludes URLs/timestamps (colon not followed by whitespace).
- DriveSource: only strip Meet's trailing date-stamp parenthetical from titles
(keep real ones like "(Q3 2026)"); participant_count uses Calendar attendee
count, not the speaker-name fallback.
- CalendarEnricher: ambiguous same-title events attach NO attendees (avoid
cross-assigning the wrong meeting's attendees).
- MeetApiSource: defer to an existing Drive Document keyed on the shared Drive
doc id (read-only existence check, no merge) so the Drive/API partition
overlap can't produce duplicates.
- Connector: stream extract() via Enumerator so an org-wide sweep holds one
transcript in memory at a time, not the whole org.
- Chunker: single chunk when only slightly over MAX (avoid near-dup tail).
- rake backfill_meet: pass since to run + track:false so the explicit window
isn't overridden by / written into the ongoing sync cursor.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
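The streaming change can be sketched with a plain `Enumerator`: nothing is fetched until the consumer asks for the next item, so only one transcript is live at a time. `extract`, `run_sweep`, and the `fetcher` lambda are hypothetical stand-ins for the connector's real methods and Google calls.

```ruby
# Lazily yield one fetched transcript at a time instead of building an array.
def extract(ids, fetcher)
  Enumerator.new do |yielder|
    ids.each { |id| yielder << fetcher.call(id) }
  end
end

# Consume the stream, handing each transcript to the caller's block.
def run_sweep(ids, fetcher)
  count = 0
  extract(ids, fetcher).each do |transcript|
    yield transcript
    count += 1
  end
  count
end

fetched = []
fetcher = lambda do |id|
  fetched << id
  "transcript-#{id}"
end

stream = extract([1, 2, 3], fetcher)
puts fetched.inspect        # nothing fetched yet
puts stream.first
puts fetched.inspect        # only the first id was fetched
```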
Critical (privacy / data integrity):
- Classifier: a participant_count of 0 ("unknown", e.g. empty participants
endpoint) no longer counts as a 1:1 — only a KNOWN 1 or 2 does; title rules
still apply. Connector uses max(participant_count, contacts.size).
- DriveSource SPEAKER_LINE now requires a name-shaped prefix, so a spoken
sentence with a colon ("I think the answer is: yes") can't become a phantom
speaker and inflate the 1:1 fallback count past the exclusion threshold.
- DriveSource: reverse Drive<->API dedup — skip a transcript the Meet API sync
already ingested (matched on raw_metadata.drive_doc_id). Closes the duplicate
the one-directional check left in the overlap window / single-user backfill.
Correctness:
- MentionResolver: partial speaker match on whole name tokens, never raw
substrings ("Chris" no longer resolves to "Christine").
- DriveSource clean_title: strip only Meet's real date/time stamp, keeping
legitimate parentheticals like "(3 items)" / "(Q3 2026)" (title is the
Calendar-match key).
- DriveSource parse: append wrapped continuation lines to the current turn
instead of dropping them.
- Connector: dedup attendees that resolve to the same Contact before insert,
so a duplicated invite can't trip the unique index and roll back the meeting.
Quality / safety:
- Chunker: coalesce consecutive same-speaker segments so Meet-API per-utterance
transcripts produce paragraph-sized chunks (better embeddings), matching Drive.
- Connector: embed chunks in batches of 32 to bound peak memory on long meetings.
- Deploy doc: require db:migrate (not schema:load) for the content_tsv column.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
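To make the whole-token rule concrete, here is a small sketch of token-based name matching. It assumes names are tokenized on runs of Unicode letters and that an ambiguous match stays unresolved; both are my assumptions, and `name_tokens` / `resolve_mention` are hypothetical names, not the MentionResolver API.

```ruby
# Split a display name into lowercase letter-only tokens.
def name_tokens(name)
  name.to_s.downcase.scan(/\p{L}+/)
end

# Resolve a mention to a full name only when every mention token is a WHOLE
# token of exactly one candidate. Substrings never match.
def resolve_mention(mention, contact_names)
  wanted = name_tokens(mention)
  return nil if wanted.empty?

  matches = contact_names.select { |full| (wanted - name_tokens(full)).empty? }
  matches.size == 1 ? matches.first : nil
end

contacts = ['Christine Lee', 'Chris Park', 'Anne-Marie Smith']
puts resolve_mention('Chris', contacts).inspect
puts resolve_mention('Chr', contacts).inspect
```

Returning nil for zero or multiple matches fits the unresolved-mention queue described earlier in the PR, where a human picks the Contact.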
Privacy (resolve the participant-count oscillation with one documented policy):
- The 1:1 head-count must reflect ACTUAL attendance, never invite counts. Revert
the iteration-4 max(participant_count, contacts.size) — it let a 1:1 with an
extra (declined) Calendar invitee escape exclusion. exclusion_for now uses
participant_count, falling back to contacts only when that signal is absent (nil).
- Revert the Classifier positive?-guard: a head-count of 0 ("couldn't confirm a
group", e.g. empty participants endpoint) is conservatively excluded as a
probable 1:1, not indexed.
- DriveSource participant_count now = distinct speakers (actual), not Calendar
attendees (invites over-count and would leak a Drive 1:1).
Parsing (fix regressions from the iteration-4 rewrite):
- Revert continuation-line appending — it misattributed system/footer lines
("Recording stopped", "X left the call") to the previous speaker. Drop
non-speaker lines again (Google Docs turns are one paragraph/line).
- SPEAKER_LINE: first token must be an uppercase OR caseless-script (CJK) letter;
later tokens may be numeric, so Meet's anonymous "Speaker 1:" labels parse and
accented/non-Latin names are kept. Token cap raised to 6.
- clean_title: stop stripping a bare clock-time parenthetical ("Retro (5:00
format)" survives); Meet's date stamp always carries a date and/or GMT.
Correctness / maintainability:
- MentionResolver: tokenize names on non-letter boundaries so "Anne" resolves to
"Anne-Marie Smith" without resurrecting substring matching.
- Drive<->API dedup: a Document.for_drive_doc scope owns the key across both
sources; the Drive side excludes its OWN doc so a re-scan re-ingests corrected/
re-included transcripts instead of skipping them. Add an expression index on
raw_metadata->>'drive_doc_id' so the per-file backfill lookup doesn't seq-scan.
- Chunker: drop dead ended_at bookkeeping in coalesce.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
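The head-count policy above reads roughly as the following decision function. This is a sketch under assumptions: the title pattern is a placeholder (the real title rules are not shown in this PR), and `exclusion_for` here takes plain keyword arguments rather than the connector's actual signature.

```ruby
# Placeholder title rule; the real Classifier's patterns are not shown here.
SENSITIVE_TITLE = /1:1|performance review|compensation/i
ONE_ON_ONE_MAX = 2

# Return an exclusion reason, or nil when the meeting may be indexed.
#   participant_count: actual attendance; nil means "signal absent"
#   contact_count:     resolved contacts, used ONLY when the signal is absent
def exclusion_for(title:, participant_count:, contact_count:)
  return :sensitive_title if title.to_s.match?(SENSITIVE_TITLE)

  head_count = participant_count.nil? ? contact_count : participant_count
  # A head-count of 0 means a group could not be confirmed: exclude conservatively.
  return :one_on_one if head_count <= ONE_ON_ONE_MAX

  nil
end

puts exclusion_for(title: 'Design sync', participant_count: 2, contact_count: 3).inspect
puts exclusion_for(title: 'Design sync', participant_count: nil, contact_count: 5).inspect
```

The first call shows the case the commit calls out: two people actually attended, a third was only invited, and the meeting is still treated as a 1:1.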
What this is
The first slice of the MCP & agent-features effort: a durable, org-wide vector store fed by an ETL pipeline, with Google Meet transcripts as the first source, exposed to an agent through a read-only MCP server. Built inside Stacks (Rails 6.1) so it reuses the existing Postgres, Google Workspace auth, and ActiveAdmin.
Design spec: docs/superpowers/specs/2026-06-28-org-vector-store-and-etl-design.md · Plan: docs/superpowers/plans/2026-06-28-org-vector-store-and-etl.md · Deploy runbook: docs/meet-etl-deploy.md
How it works
- … documents → chunks, polymorphic embeddings side-table, mentions, document_contacts, source_syncs) with pgvector (neighbor) + Postgres full-text. Search is source-agnostic, so future connectors light up automatically.
- … Stacks::Etl::Connector): extract → classify → resolve contacts → chunk → embed → load, idempotent upserts, per-source watermark, SystemTask-wrapped rake tasks.
- … meetings/participants/segments project into the generic core.
- … Contacts).
- … informers + quantized mxbai-embed-large-v1, 1024-dim): no API key, no chunk text leaves our infra.
- … mcp gem, Streamable HTTP at /api/mcp, stateless): search/get_document/list_documents/list_sources, authenticated with the existing X-Api-Key private key.
Privacy / governance
- … excluded and never chunked, embedded, or returned by any MCP tool or search mode; verified end-to-end. They keep their full transcript for the human record (ActiveAdmin only).
- … contacts table.
Admin
New top-level MCP menu → ETL subpages (Meetings, Documents, Source syncs); Chunks + Mentions are reached by drilling into a Document. Document show has Include & index / Exclude actions.
Testing
~60 new tests / 150+ assertions, plus the full suite green (313 runs, 0 failures). Schema verified to load cleanly from scratch (db:test:prepare). Enrichment + ingest verified against the live Workspace.
Deploy notes (see runbook)
- … enable_extension "vector" migration fails otherwise.
- … meetings.space.readonly + drive.readonly and the Meet API must be enabled (already done).
- … heroku run:detached --size=performance-l "rake 'stacks:etl:backfill_meet_all[90]'"; nightly stacks:etl:sync_meet_all on a Performance dyno.
Known follow-ups (non-blocking, in the runbook)
Cross-source dedup (Drive backfill vs Meet API sync), semantic-search IN (...) scaling, baking the embedding model into the slug, and the intelligence layer (decisions/commitments/tasks/opportunities) as a separate spec.
Out of scope
Intelligence layer, connectors beyond Meet, multi-tenant/productization — all deliberately deferred.
🤖 Generated with Claude Code