Org-wide Meet transcript ETL + vector store + read-only MCP server#115

Open · hhff wants to merge 51 commits into main from worktree-mcp-agent-features

hhff commented Jun 29, 2026

What this is

The first slice of the MCP & agent-features effort: a durable, org-wide vector store fed by an ETL pipeline, with Google Meet transcripts as the first source, exposed to an agent through a read-only MCP server. Built inside Stacks (Rails 6.1) so it reuses the existing Postgres, Google Workspace auth, and ActiveAdmin.

Design spec: docs/superpowers/specs/2026-06-28-org-vector-store-and-etl-design.md · Plan: docs/superpowers/plans/2026-06-28-org-vector-store-and-etl.md · Deploy runbook: docs/meet-etl-deploy.md

How it works

  • Generic core (documents, chunks, polymorphic embeddings side-table, mentions, document_contacts, source_syncs) with pgvector (neighbor) + Postgres full-text. Search is source-agnostic, so future connectors light up automatically.
  • Connector/ETL framework (Stacks::Etl::Connector): extract → classify → resolve contacts → chunk → embed → load, idempotent upserts, per-source watermark, SystemTask-wrapped rake tasks.
  • Meet connector (source #1): hybrid Meet REST API (ongoing, rich) + Drive transcript-Doc sweep (90-day backfill, since the API only retains ~30 days). Rich meetings/participants/segments project into the generic core.
  • Calendar enrichment: matches each meeting to its Calendar event (precisely — by Meet code / exact title, never by time alone) to recover the real title (re-enabling title-based exclusion) and attendee emails (resolved to Contacts).
  • Org-wide multi-user sweep: impersonates every active Workspace user via domain-wide delegation, error-isolated; dedup by global IDs.
  • Local embeddings (informers + quantized mxbai-embed-large-v1, 1024-dim) — no API key, no chunk text leaves our infra.
  • Read-only MCP server (official mcp gem, Streamable HTTP at /api/mcp, stateless) — search / get_document / list_documents / list_sources, authenticated with the existing X-Api-Key private key.
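
A minimal sketch of the connector stages above, in plain Ruby with an in-memory hash standing in for Postgres. SketchConnector, its store, the title pattern, and the record shape are all hypothetical and are not the actual Stacks::Etl::Connector API; contact resolution and embedding are left out. It only illustrates why upserting on (source, external_id) makes re-runs idempotent and how an excluded document ends up with no chunks.

```ruby
require 'digest'

# Hypothetical stand-in for Stacks::Etl::Connector; names are illustrative only.
class SketchConnector
  attr_reader :store

  def initialize(source)
    @source = source
    @store = {} # keyed by [source, external_id]; stands in for the documents table
  end

  # records: array of hashes with :external_id, :title, :text
  # Returns the number of stored documents after the run.
  def run(records)
    records.each do |rec|
      excluded = classify(rec)
      chunks = excluded ? [] : chunk(rec[:text]) # excluded docs are never chunked
      load!(rec, excluded: excluded, chunks: chunks)
    end
    @store.size
  end

  private

  # Assumed title rule; the real classifier has more signals.
  def classify(rec)
    rec[:title].to_s.match?(/1:1|performance review/i)
  end

  # Naive fixed-width chunking, just to have something to store.
  def chunk(text, size: 40)
    text.scan(/.{1,#{size}}/m)
  end

  # Upsert: writing the same key twice replaces the row instead of duplicating it.
  def load!(rec, excluded:, chunks:)
    @store[[@source, rec[:external_id]]] = {
      title: rec[:title],
      content_hash: Digest::SHA256.hexdigest(rec[:text]),
      excluded: excluded,
      chunks: chunks
    }
  end
end
```

Running the same batch twice leaves the store at the same size, which is the property the per-source watermark and the re-scan windows rely on.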

Privacy / governance

  • Exclusion wall: sensitive meetings (1:1s, performance review, comp, HR, …) are auto-classified excluded and never chunked, embedded, or returned by any MCP tool or search mode — verified end-to-end. They keep their full transcript for the human record (ActiveAdmin only).
  • Reversible: excluded docs retain segments; an ActiveAdmin Include & index action chunks+embeds from stored segments (no re-fetch).
  • Identity is anchored on the existing contacts table.

Admin

New top-level MCP menu → ETL subpages (Meetings, Documents, Source syncs); Chunks + Mentions are reached by drilling into a Document. Document show has Include & index / Exclude actions.

Testing

~60 new tests / 150+ assertions, plus the full suite green (313 runs, 0 failures). Schema verified to load cleanly from scratch (db:test:prepare). Enrichment + ingest verified against the live Workspace.

Deploy notes (see runbook)

  • Heroku Postgres must be Standard+ tier (pgvector); the enable_extension "vector" migration fails otherwise.
  • DWD scopes meetings.space.readonly + drive.readonly and the Meet API must be enabled (already done).
  • Overnight 90-day backfill: heroku run:detached --size=performance-l "rake 'stacks:etl:backfill_meet_all[90]'"; nightly stacks:etl:sync_meet_all on a Performance dyno.

Known follow-ups (non-blocking, in the runbook)

Cross-source dedup (Drive backfill vs Meet API sync), semantic-search IN (...) scaling, baking the embedding model into the slug, and the intelligence layer (decisions/commitments/tasks/opportunities) as a separate spec.

Out of scope

Intelligence layer, connectors beyond Meet, multi-tenant/productization — all deliberately deferred.

🤖 Generated with Claude Code

hhff and others added 30 commits June 28, 2026 18:34
Foundation spec for the MCP & agent-features effort: ingest Meet
transcripts org-wide (hybrid Meet API + Drive backfill), store them
permanently in Postgres with full-text + pgvector semantic search, and
expose them through a read-only MCP server. Includes an excluded/
excluded_reason exclusion layer that walls sensitive meetings (1:1s,
reviews, comp, HR) off from the agent entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Generalize the design from a Meet-specific transcript corpus into a
source-agnostic vector store (documents/document_people/document_chunks
with pgvector + full-text + facets) plus a connector/ETL framework with
incremental sync watermarks. The MCP server now searches across all
sources at once, so every future connector (Notion, Gmail, etc.) is
searchable the moment it loads. Meet transcripts become the first
connector, keeping rich meetings/segments/participants tables that
project into the generic core. Exclusion layer and identity resolution
(people -> AdminUser/Contributor) generalize to any source.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…e target)

Fold in the institutional-memory design Hugh shared:
- Index is a synthesis layer, not a mirror: full-ingest Meet (no MCP),
  promote only high-signal slices from MCP-having sources later.
- Embeddings become a versioned polymorphic side-table keyed by
  (owner_type, owner_id, model) instead of a vector column on chunks.
- Add document_versions (history for changing docs) and a mentions table
  + unresolved-mention queue for transcript display-name -> AdminUser
  resolution; chunks carry stable spans for later evidence citation.
- MCP: official mcp Ruby gem over Streamable HTTP, per-client scoped
  tokens, audit logging, retrieved content treated as untrusted.
- Record the full 16-table north star and which tables this foundation
  builds vs. the later intelligence layer.

Decisions locked: foundation-first scope, keep rake + Heroku Scheduler
(no GoodJob, embed inline), internal single-tenant.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Use Contact as the canonical person spine (FK target everywhere) rather
than AdminUser/Contributor or a new person table: contacts is unique on
email, Apollo-enriched, and already aims to cover everyone — including
external @gmail/client guests who are neither workspace logins nor
Forecast contributors. Extend contacts with nullable admin_user_id /
contributor_id bridge links (populated via the AdminUser cross-domain
matcher) so internal org info is reachable and domain-variant rows
regroup. mentions/document_people/Meet participants+segments now FK to
contact_id; display-name resolution targets Contact with the unresolved
queue.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…min UI

- Contact is the identity outright: resolve every person by email
  (create_or_find_by, made if missing — workspace or external), with no
  AdminUser/Contributor reconciliation and no bridge columns. Contributor
  org-info is joinable later by email if the intelligence layer wants it.
- Drop the document_versions table (Meet transcripts are immutable):
  content_hash moves onto documents, chunks belong to documents, and
  per-fetch versioning returns when the first mutable source (Notion) lands.
- mentions are chunk-scoped; chunks/segments carry speaker_contact_id.
- Add a top-level ActiveAdmin "MCP" menu with an "ETL" subpage
  (meetings/documents/chunks/mentions queue/source_syncs) for visual
  debugging and exclusion review.
- Add open item to scope sync_contacts (Apollo) away from meet-only contacts.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
The existing stacks:sync_contacts already enriches every contact via
Apollo, so meet-sourced contacts flow through as desired. No special
scoping needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
23 TDD tasks: pgvector/gems, generic core schema (documents/chunks/
embeddings side-table/mentions/document_contacts/source_syncs), Meet
tables, Voyage embedder, chunker, mention resolver, ETL connector base,
Meet auth/classifier/api+drive sources/connector, SystemTask rake tasks,
hybrid search, read-only MCP tools + streamable-HTTP endpoint with bearer
auth, and the ActiveAdmin MCP -> ETL menu. Pins Ruby-3.1-compatible gem
versions and flags the Heroku pgvector tier prerequisite.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…ey for MCP

- Embeddings now run locally via the informers gem (ONNX), model
  mixedbread-ai/mxbai-embed-large-v1 quantized, 1024-dim. No API key, and
  no chunk text leaves our infra (privacy win). Search queries get the
  model's retrieval prefix; stored chunks do not.
- MCP endpoint reuses the existing private API key (X-Api-Key header /
  config[:stacks][:private_api_key], via ApiController#check_private_api_key!)
  instead of a new bearer token; unauthorized -> 403.
- Plan: Task 10 rewritten for informers; Task 1 adds the gem; Task 5/20
  model strings updated; Task 22 auth rewritten.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Adds neighbor ~> 0.4.3, informers 1.2.1, mcp 0.22.0,
  google-apis-meet_v2 0.13.0, google-apis-drive_v3 0.81.0 gems
- Updates google-apis-core 0.5.0 → 0.15.1 (required by meet_v2),
  cascading to googleauth 1.9.2, signet 0.16.1, addressable 2.8.7
- Adds rexml explicitly (dropped from google-apis-core transitive chain)
- Enables the vector Postgres extension via migration
- Smoke-tests pg_extension presence with PgvectorSmokeTest

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Adds display_name column to contacts table and implements
Contact.resolve_email(email, name: nil) class method for creating
or finding contacts, tagging them with 'meet' source, and setting
display_name when blank.

Also fix schema.rb by removing three incorrect composite foreign key
definitions that were preventing tests from running. The real FKs
are defined in the migration using raw SQL.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Task 2's migration regenerated db/schema.rb without the three intentional
composite-FK documentation comments (fk_adhoc_invoice_trackers_qbo_invoice,
fk_contributor_adjustments_qbo_invoice, fk_invoice_trackers_qbo_invoice) that
exist in the branch base. Restore them so schema.rb matches canonical.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Implement the Document model with source and exclusion enums, corpus_eligible
scope for non-excluded and manually-included documents, and predicates for
querying document status.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Rails 6.1 dumps PostgreSQL GENERATED columns as DEFAULT expressions
that PostgreSQL rejects on schema:load (cannot use column reference in
DEFAULT). The content_tsv tsvector column and its GIN index are omitted
from schema.rb with an explanatory comment; test_helper.rb recreates
them idempotently after schema:load using ADD COLUMN IF NOT EXISTS and
CREATE INDEX IF NOT EXISTS, mirroring the existing trigger workaround.

Also restores the three composite-FK comment lines that db:schema:dump
had overwritten with unloadable add_foreign_key calls.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Also fix schema.rb tsvector generated column to use GENERATED ALWAYS AS
instead of unsupported DEFAULT column-reference expression.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Task 12 (mention resolver) ran an unnecessary db:migrate that re-dumped
db/schema.rb with spurious content_tsv execute lines, unrelated ledgers
columns (DB drift), and stripped composite-FK comments. Restore schema.rb
to the last-known-good state; the resolver code/test are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
hhff and others added 14 commits June 28, 2026 22:38
…ivacy wall

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Add Mcp::Server.build wrapping all four Mcp::*Tool classes
- Add Api::McpController dispatching to StreamableHTTPTransport (stateless, enable_json_response)
- Route POST/GET/DELETE /api/mcp to mcp#handle inside :api namespace
- Add explicit tool_name to all four tools (search, list_documents, list_sources, get_document)
- Fix $LOAD_PATH shadowing: classic autoloader puts app/services ahead of mcp gem lib,
  so app/services/mcp/server.rb would shadow mcp/server.rb; pre-load MCP::Server via
  absolute gem path before defining Mcp::Server
- Integration test proves: missing key → 403; valid key + tools/list → 200 with search tool
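
A rough sketch of the fail-closed key check that the integration test exercises, assuming the configured key and the X-Api-Key header value arrive as plain strings. mcp_status and secure_equal? are hypothetical helpers, not the real ApiController#check_private_api_key!; the constant-time comparison is an addition of this sketch, not something the commit states.

```ruby
# Constant-time comparison for equal-length strings (an addition of this sketch).
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# Returns the HTTP status the endpoint would answer with.
def mcp_status(provided_key, configured_key)
  return 403 if configured_key.to_s.empty? # fail closed when no key is configured
  return 403 if provided_key.to_s.empty?   # missing header

  secure_equal?(provided_key.to_s, configured_key.to_s) ? 200 : 403
end
```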

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Register Meeting, Document, Chunk, Mention, SourceSync under a top-level
MCP > ETL menu in ActiveAdmin. Meeting show page lists transcript segments;
Mention exposes a resolve member_action (PUT) that assigns a Contact and
sets status :resolved. Integration test covers index render + resolve.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…rce since filter, MentionResolver nil-contact guard, MCP key-configured check, DriveSource string-since coercion, plainto_tsquery language fix

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… menu)

Chunk and Mention are no longer top-level MCP menu items (menu false); they're
reached by drilling into a Document, whose show page now lists its chunks and
mentions. MCP menu is now Meetings / Documents / Source syncs.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
…user listing

- CalendarEnricher matches a meeting to its Calendar event PRECISELY (by Meet
  code for the API path, by exact title for the Drive path) to recover the real
  title (re-enabling title-based exclusion) and attendee emails (resolved to
  Contacts). No time-only fallback, to avoid mis-assigning nearby events.
- Auth gains a calendar_service using the full 'calendar' scope already authorized
  in the org's domain-wide delegation (calendar.readonly is NOT granted).
- MeetApiSource + DriveSource use enrichment for title + attendee contacts;
  DriveSource also cleans the transcript doc name into a real title.
- Workspace.all_active_user_emails lists every active user org-wide (customer:
  my_customer) for the multi-user sweep.
Verified against the live Workspace: real titles + attendees resolve correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… from stored segments)

- Stacks::Etl::Meet.sweep_all_users! impersonates every active Workspace user,
  error-isolated (one user's failure never aborts the run), SystemTask-wrapped.
  New rake tasks: stacks:etl:backfill_meet_all[days] (Drive, all users, 90d) and
  stacks:etl:sync_meet_all[days] (Meet API, all users).
- Reversible exclusion: excluded docs already retain their full Meeting + segments.
  Extracted Connector.index_chunks! (class method) so a Reindexer can chunk+embed a
  re-included doc from its STORED segments (no Google re-fetch). Connector also
  self-heals: a corpus-eligible doc missing chunks gets indexed on the next sweep.
  ActiveAdmin Document gains 'Include & index' and 'Exclude' actions.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- Add x86_64-linux to Gemfile.lock PLATFORMS so onnxruntime's Linux native gem
  installs on Heroku (the local embedding model dep).
- docs/meet-etl-deploy.md: pgvector tier prereq, deploy + migrate steps, MCP
  connection (X-Api-Key), the overnight 90-day Drive backfill command
  (performance-l detached dyno), and the nightly Performance-dyno Scheduler job.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Comment thread lib/stacks/etl/connector.rb Outdated
ingest(normalized)
count += 1
end
sync.advance!(cursor: { 'since' => Time.current.iso8601 }, stats: { 'documents' => count })

🟠 Important (data loss) — Late-finalizing transcripts are permanently missed

The watermark advances to Time.current and the next Meet-API run filters start_time >= cursor. A meeting whose transcript is still generating when the run sees it gets stored empty, and is then permanently excluded from re-fetch. Fix: advance the cursor with a lookback (e.g. run_started_at - 2.days) or re-scan a trailing window so not-yet-final transcripts get re-pulled.

(automated self-review, code-review workflow)
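
The suggested lookback could look roughly like this. next_cursor and LOOKBACK_SECONDS are hypothetical names; the 2-day window is the value proposed in the comment above, and the sketch assumes re-ingesting an already stored meeting is harmless because the upserts are idempotent.

```ruby
require 'time'

LOOKBACK_SECONDS = 2 * 24 * 60 * 60 # 2 days, per the suggested fix

# Returns the watermark to persist after a run that started at run_started_at.
# The next run re-scans the trailing window, so transcripts that were still
# generating during this run get pulled again.
def next_cursor(run_started_at)
  { 'since' => (run_started_at - LOOKBACK_SECONDS).utc.iso8601 }
end
```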

Comment thread lib/stacks/etl/chunker.rb
end_offset: nil,
speaker_name: seg[:speaker_name],
speaker_email: seg[:speaker_email],
occurred_at: seg[:started_at]

🟠 Important (data loss) — Drive-sourced chunks get occurred_at=nil → excluded from date-scoped search

DriveSource#parse_segments sets started_at: nil, so every Drive (backfill) chunk has occurred_at = nil, and Search.filtered date_range predicate excludes them all. Since the 90-day backfill is Drive-based, date-bounded search misses the entire backfilled history. Fix: in DriveSource, fall back to the doc's created_time for segment/chunk occurred_at.

(automated self-review, code-review workflow)

Comment thread lib/stacks/etl/connector.rb Outdated
module Etl
class Connector
def run(since: nil)
sync = SourceSync.for(source)

🟠 Important — All ingestion paths share one SourceSync cursor (clobbering)

sync_meet, the Drive backfill, and every per-user sweep run share SourceSync.for(:meet), overwriting each other's cursor['since'] and stats. A sweep that runs between incremental syncs moves the watermark to a wrong time → the next sync silently skips a window. list_sources also reports only the last writer. Fix: key SourceSync per path/user, or don't advance the shared cursor from multi-user/backfill runs (they already pass an explicit since).

(automated self-review, code-review workflow)

Comment thread lib/stacks/etl/meet.rb Outdated

Rails.logger.info("[#{task_name}] #{ok}/#{emails.size} users ok, #{failed.size} failed")
failed.first(25).each { |f| Rails.logger.warn("[#{task_name}] FAIL #{f}") }
system_task.mark_as_success

🟠 Important (ops) — sweep_all_users! reports success even when every user failed

mark_as_success runs unconditionally after the loop. If org-wide auth breaks, every per-user run is swallowed into failed but the SystemTask is green — a total ingestion outage goes undetected. Fix: mark_as_error (or raise) when ok.zero? && emails.any?, and surface the failed count on the task.

(automated self-review, code-review workflow)
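
A sketch of the proposed outcome rule, with the per-user work passed in as a block. The sweep helper and its return hash are hypothetical; the real method wraps a SystemTask and calls mark_as_success or mark_as_error instead of returning a status symbol.

```ruby
# Runs the block once per email, isolating failures, and reports an overall status.
def sweep(emails)
  failed = []
  emails.each do |email|
    begin
      yield email
    rescue StandardError => e
      failed << "#{email}: #{e.class}" # one user's failure never aborts the run
    end
  end

  ok = emails.size - failed.size
  # A run where every user failed is an outage, not a success.
  status = emails.any? && ok.zero? ? :error : :success
  { status: status, ok: ok, failed: failed }
end
```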

Comment thread app/models/contact.rb

def self.resolve_email(email, name: nil)
normalized = email.to_s.downcase.strip
contact = create_or_find_by!(email: normalized)

🟠 Important (backfill robustness) — An invalid attendee email aborts the meeting's ingest

resolve_email → create_or_find_by! under validates :email, format: Devise.email_regexp. A Calendar attendee/speaker email that fails the regexp raises RecordInvalid inside the un-rescued ingest transaction, killing that meeting (and, in a sweep, the user's remaining meetings). Fix: validate/normalize the email and skip (or null-contact) when it can't be a Contact, rather than raising.

(automated self-review, code-review workflow)
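
One way to pre-validate before touching the database, sketched in plain Ruby. normalize_contact_email is a hypothetical helper, and URI::MailTo::EMAIL_REGEXP is only a stand-in for Devise.email_regexp, which differs in detail.

```ruby
require 'uri'

# Stand-in for Devise.email_regexp (assumption: the real pattern is not identical).
EMAIL_FORMAT = URI::MailTo::EMAIL_REGEXP

# Returns the normalized address, or nil when it could never be a valid Contact.
# Callers skip (or null-contact) on nil instead of raising mid-transaction.
def normalize_contact_email(email)
  normalized = email.to_s.downcase.strip
  normalized.match?(EMAIL_FORMAT) ? normalized : nil
end
```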

Comment thread lib/stacks/etl/search.rb Outdated
scope = scope.where(occurred_at: date_range) if date_range
if contact
c = contact.is_a?(Contact) ? contact : Contact.find_by(email: contact.to_s.downcase)
scope = scope.where(speaker_contact_id: c&.id)

🟡 Correctness — contact filter with an unknown email returns wrong results

When contact: is given but no Contact matches, c&.id is nil so the scope becomes WHERE speaker_contact_id IS NULL — returning unrelated unattributed chunks as if they were that person's. Fix: if a contact filter is supplied but resolves to nil, return empty (scope.none).

(automated self-review, code-review workflow)
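
The intended behavior, sketched over plain arrays rather than an ActiveRecord scope. SketchChunk, filter_by_contact, and the email-to-id hash are hypothetical stand-ins for Chunk, the search scope, and Contact.find_by.

```ruby
SketchChunk = Struct.new(:id, :speaker_contact_id)

# contacts_by_email: hash of lowercase email => contact id.
def filter_by_contact(chunks, contacts_by_email, contact_email)
  return chunks if contact_email.nil? # no filter supplied

  contact_id = contacts_by_email[contact_email.to_s.downcase]
  return [] if contact_id.nil? # filter supplied but unknown: match nothing

  chunks.select { |c| c.speaker_contact_id == contact_id }
end
```

The early return is the array equivalent of scope.none: an unknown email can no longer fall through to the unattributed (nil speaker) chunks.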

segments.map { |s| { email: nil, name: s[:speaker_name], role: 'speaker' } }.uniq
end
{
external_id: file.id,

🟡 Correctness — No cross-source dedup: same meeting via Drive + Meet API = two documents

Drive keys on file.id, Meet API on cr.name → the same meeting ingested by both sweeps becomes two corpus-eligible Documents, and the agent sees/cites it twice. The spec calls for reconciling onto one row. Fix: unify via the transcript's Drive doc id (the Meet API transcript exposes docs_destination.document) as a shared dedup key.

(automated self-review, code-review workflow)

events.find { |e| e.conference_data&.conference_id == meeting_code }
elsif title_hint.present?
hint = normalize_title(title_hint)
events.find { |e| normalize_title(e.summary) == hint }

🟡 Correctness — Drive title-match can cross-assign attendees between same-title meetings

The Drive path matches the first Calendar event with an equal normalized title in the ±2h window. Two different 'Weekly Sync' meetings in that window → attendees from the wrong one. Fix: when matching by title, also pick the closest start time, and/or skip if more than one event shares the title.

(automated self-review, code-review workflow)
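
A sketch of the second option (skip when the title is ambiguous). SketchEvent and match_by_title are hypothetical, and this normalize_title body is an assumed ASCII-only implementation, not the one in calendar_enricher.rb.

```ruby
SketchEvent = Struct.new(:summary, :attendees)

# Assumed normalization: lowercase, drop punctuation, collapse spaces.
def normalize_title(title)
  title.to_s.downcase.gsub(/[^a-z0-9\s]/, '').squeeze(' ').strip
end

# Returns the single event whose title matches, or nil when zero or several match.
def match_by_title(events, title_hint)
  hint = normalize_title(title_hint)
  candidates = events.select { |e| normalize_title(e.summary) == hint }
  candidates.size == 1 ? candidates.first : nil
end
```

Returning nil for several candidates trades some recall (no attendees attached) for never attaching the wrong meeting's attendees.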

Comment thread db/schema.rb
t.datetime "occurred_at"
t.datetime "created_at", precision: 6, null: false
t.datetime "updated_at", precision: 6, null: false
# content_tsv (tsvector GENERATED ALWAYS AS) and its GIN index are intentionally

🟡 Correctness (non-migrate bootstrap) — schema.rb omits chunks.content_tsv → keyword/hybrid search breaks on db:schema:load

The generated content_tsv column + GIN index aren't dumpable, so they're omitted from schema.rb and only recreated in test_helper.rb. Prod (migrations) is fine, but any db:setup/db:reset/review-app built via db:schema:load lacks the column → Chunk.keyword_search 500s. Fix: recreate it in an idempotent post-load path (a seeds/initializer or structure.sql), or document that schema:load isn't supported.

(automated self-review, code-review workflow)

Comment thread lib/stacks/etl/search.rb Outdated
owner_ids = scope.pluck(:id)
return [] if owner_ids.empty?
vector = Embedder.embed([query], input_type: 'query')[:vectors].first
Embedding.where(model: Embedder::MODEL, owner_type: 'Chunk', owner_id: owner_ids)

🔵 Minor (scaling) — Semantic search materializes all eligible chunk ids into an IN(...) list

The corpus wall is enforced by plucking every eligible chunk id and passing an unbounded owner_id IN (...), which defeats the HNSW index as the corpus grows. Fix: enforce eligibility via a JOIN to chunks/documents in SQL rather than a materialized id list.

(automated self-review, code-review workflow)

hhff commented Jun 29, 2026

Self-review (code-review workflow, high effort — 39 agents, every finding independently verified)

The architecture, the exclusion/privacy wall (verified airtight across keyword/semantic/hybrid + all MCP tools), and MCP auth (X-Api-Key fail-closed) hold up. The 10 verified findings are all correctness/robustness, not structural — but 5 would bite the overnight org-wide backfill specifically, so I'd fix those before kicking it off:

Fix before the overnight backfill (🟠):

  1. connector.rb — incremental cursor advances to now, so transcripts that finalize after a run are permanently missed (data loss).
  2. chunker.rb/drive_source.rb — Drive chunks get occurred_at = nil, so date-scoped search misses the entire 90-day backfill.
  3. connector.rb/meet.rb — all ingestion paths share one SourceSync cursor and clobber each other's watermark.
  4. meet.rb: sweep_all_users! reports green even if every user failed (silent total-outage).
  5. contact.rb — an invalid attendee email raises mid-transaction and aborts that meeting's (and, in a sweep, the user's remaining) ingest.

Correctness cleanups (🟡): contact-filter-with-unknown-email returns wrong rows (search.rb); no cross-source dedup → duplicate docs when Drive + API overlap (drive_source.rb); same-title Calendar match can cross-assign attendees (calendar_enricher.rb); schema.rb omits content_tsv so db:schema:load-built DBs break keyword search.

Scaling (🔵): semantic search materializes all eligible chunk ids into an IN (...) list (search.rb).

Details + suggested fixes are in the inline comments. None block merge of the foundation; #1–#5 are worth landing before the production backfill run.

hhff and others added 5 commits June 29, 2026 10:04
- #1 transcripts: skip meetings with no transcript yet; cursor advances with a
  2-day LOOKBACK so late-finalizing transcripts get re-pulled.
- #3 cursor: multi-user sweeps/backfills run with track:false so they no longer
  clobber the ongoing single-user :meet SourceSync cursor.
- #4 sweep: mark the SystemTask errored (not green) when ALL users fail.
- #5 contacts: Contact.resolve_email returns nil for malformed emails instead of
  raising RecordInvalid mid-ingest.
- #6 search: a contact filter with an unknown email returns none (not unattributed
  chunks via speaker_contact_id IS NULL).
- #7 dedup: key the document + meeting on the transcript's Drive doc id so a meeting
  ingested via both the Meet API and the Drive backfill reconciles onto one row.
- #2 Drive: stamp segments with the doc's created_time so chunks get occurred_at
  (date-scoped search no longer misses Drive-backfilled history).
- #9 enricher: among same-title Calendar events, pick the one closest in time.
- #10 search: constrain semantic search to eligible chunks via a SUBQUERY, not a
  materialized IN(...) id list.
- #8 schema: recreate content_tsv idempotently in db/seeds.rb for schema:load builds.
- Runbook: web-dyno sizing note for semantic-search query embedding.

68 ETL/MCP tests green (171 assertions), fresh db:test:prepare clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
… classify/parse/chunk

Root cause of the worst findings was unifying Drive + Meet API onto one Document by the
Drive doc id (iter-1 #7). Reverted that; replaced with safe TIME-PARTITIONING:
- Meet API keys on the conference record again (drive_doc_id kept in metadata only);
  no Meeting re-keying. Kills the privacy-wall breach (#1: a benign source title could
  un-exclude a sensitive transcript), the RecordNotUnique sweep crash (#2), the
  re-embed/segment ping-pong thrash (#3), and the lag-duplicate (#4).
- Drive backfill now takes an until_time and covers only the OLDER window (>7d);
  sync_meet_all owns the recent window — partitioned so neither double-ingests.
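
The partition can be pictured as a single cut-off time. ingest_owner and PARTITION_SECONDS are hypothetical names, and which side owns a meeting exactly on the boundary is an assumption of this sketch.

```ruby
PARTITION_SECONDS = 7 * 24 * 60 * 60 # the 7-day split described above

# Decides which path is responsible for a meeting at meeting_time.
def ingest_owner(meeting_time, now)
  until_time = now - PARTITION_SECONDS
  meeting_time < until_time ? :drive_backfill : :meet_api_sync
end
```
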
Independent fixes:
- #5 classify on real participant_count (Meet participants / distinct Drive speakers),
  not contacts.size, so a big meeting with few speakers isn't mis-flagged 1:1.
- #6 Drive build_meeting persists segment started_at -> Reindexer yields real occurred_at.
- #7 enricher title match strips emoji/punctuation so doc-name vs calendar-summary drift
  doesn't drop attendee emails.
- #8 tighter speaker-line regex: URLs/timestamps/labels no longer parsed as speakers.
- #9 chunker stops at the slice that reaches the end -> no trailing near-duplicate chunk.

69 ETL/MCP tests green (177 assertions), fresh db:test:prepare clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
- DriveSource: loosen SPEAKER_LINE to "Name: <text>" (colon+space), so
  initials/parenthetical speakers ("J.R.:", "John Doe (Guest):") keep their
  text; still excludes URLs/timestamps (colon not followed by whitespace).
- DriveSource: only strip Meet's trailing date-stamp parenthetical from titles
  (keep real ones like "(Q3 2026)"); participant_count uses Calendar attendee
  count, not the speaker-name fallback.
- CalendarEnricher: ambiguous same-title events attach NO attendees (avoid
  cross-assigning the wrong meeting's attendees).
- MeetApiSource: defer to an existing Drive Document keyed on the shared Drive
  doc id (read-only existence check, no merge) so the Drive/API partition
  overlap can't produce duplicates.
- Connector: stream extract() via Enumerator so an org-wide sweep holds one
  transcript in memory at a time, not the whole org.
- Chunker: single chunk when only slightly over MAX (avoid near-dup tail).
- rake backfill_meet: pass since to run + track:false so the explicit window
  isn't overridden by / written into the ongoing sync cursor.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Critical (privacy / data integrity):
- Classifier: a participant_count of 0 ("unknown", e.g. empty participants
  endpoint) no longer counts as a 1:1 — only a KNOWN 1 or 2 does; title rules
  still apply. Connector uses max(participant_count, contacts.size).
- DriveSource SPEAKER_LINE now requires a name-shaped prefix, so a spoken
  sentence with a colon ("I think the answer is: yes") can't become a phantom
  speaker and inflate the 1:1 fallback count past the exclusion threshold.
- DriveSource: reverse Drive<->API dedup — skip a transcript the Meet API sync
  already ingested (matched on raw_metadata.drive_doc_id). Closes the duplicate
  the one-directional check left in the overlap window / single-user backfill.

Correctness:
- MentionResolver: partial speaker match on whole name tokens, never raw
  substrings ("Chris" no longer resolves to "Christine").
- DriveSource clean_title: strip only Meet's real date/time stamp, keeping
  legitimate parentheticals like "(3 items)" / "(Q3 2026)" (title is the
  Calendar-match key).
- DriveSource parse: append wrapped continuation lines to the current turn
  instead of dropping them.
- Connector: dedup attendees that resolve to the same Contact before insert,
  so a duplicated invite can't trip the unique index and roll back the meeting.

Quality / safety:
- Chunker: coalesce consecutive same-speaker segments so Meet-API per-utterance
  transcripts produce paragraph-sized chunks (better embeddings), matching Drive.
- Connector: embed chunks in batches of 32 to bound peak memory on long meetings.
- Deploy doc: require db:migrate (not schema:load) for the content_tsv column.
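
A sketch of the two Quality / safety changes to chunking and embedding, over plain hashes. coalesce and embed_in_batches here are hypothetical simplifications; the embedder is passed in as a block so the sketch does not depend on the informers gem.

```ruby
EMBED_BATCH_SIZE = 32

# Merge consecutive segments by the same speaker into one paragraph-sized turn.
def coalesce(segments)
  segments.each_with_object([]) do |seg, turns|
    last = turns.last
    if last && last[:speaker_name] == seg[:speaker_name]
      last[:text] = "#{last[:text]} #{seg[:text]}"
    else
      turns << seg.dup # copy so the caller's segments are not mutated
    end
  end
end

# Yields texts to the block in slices, so at most batch_size are embedded at once.
def embed_in_batches(texts, batch_size: EMBED_BATCH_SIZE)
  texts.each_slice(batch_size).flat_map { |batch| yield batch }
end
```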

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Privacy (resolve the participant-count oscillation with one documented policy):
- The 1:1 head-count must reflect ACTUAL attendance, never invite counts. Revert
  the iteration-4 max(participant_count, contacts.size) — it let a 1:1 with an
  extra (declined) Calendar invitee escape exclusion. exclusion_for now uses
  participant_count, falling back to contacts only when that signal is absent (nil).
- Revert the Classifier positive?-guard: a head-count of 0 ("couldn't confirm a
  group", e.g. empty participants endpoint) is conservatively excluded as a
  probable 1:1, not indexed.
- DriveSource participant_count now = distinct speakers (actual), not Calendar
  attendees (invites over-count and would leak a Drive 1:1).
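
The head-count policy above, as a small decision function. The signature, the reason symbols, and the SENSITIVE_TITLE pattern are hypothetical; only the ordering of signals (title rules, then actual participant_count, then contacts when the count is nil) is taken from this commit message.

```ruby
# Assumed title rules; the real classifier's list is not shown in this PR text.
SENSITIVE_TITLE = /1:1|performance review|compensation|\bHR\b/i

# Returns an exclusion reason, or nil when the meeting may be indexed.
def exclusion_for(title:, participant_count:, contacts_size:)
  return :sensitive_title if title.to_s.match?(SENSITIVE_TITLE)

  # Actual attendance wins; invite-derived contacts are used only when it is absent.
  head_count = participant_count.nil? ? contacts_size : participant_count
  # 0 means "couldn't confirm a group" and is treated like a 1:1.
  return :probable_one_on_one if head_count <= 2

  nil
end
```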

Parsing (fix regressions from the iteration-4 rewrite):
- Revert continuation-line appending — it misattributed system/footer lines
  ("Recording stopped", "X left the call") to the previous speaker. Drop
  non-speaker lines again (Google Docs turns are one paragraph/line).
- SPEAKER_LINE: first token must be an uppercase OR caseless-script (CJK) letter;
  later tokens may be numeric, so Meet's anonymous "Speaker 1:" labels parse and
  accented/non-Latin names are kept. Token cap raised to 6.
- clean_title: stop stripping a bare clock-time parenthetical ("Retro (5:00
  format)" survives); Meet's date stamp always carries a date and/or GMT.

Correctness / maintainability:
- MentionResolver: tokenize names on non-letter boundaries so "Anne" resolves to
  "Anne-Marie Smith" without resurrecting substring matching.
- Drive<->API dedup: a Document.for_drive_doc scope owns the key across both
  sources; the Drive side excludes its OWN doc so a re-scan re-ingests corrected/
  re-included transcripts instead of skipping them. Add an expression index on
  raw_metadata->>'drive_doc_id' so the per-file backfill lookup doesn't seq-scan.
- Chunker: drop dead ended_at bookkeeping in coalesce.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
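
For the MentionResolver change in this last commit, whole-token matching can be sketched as below. name_tokens and token_match? are hypothetical helpers; the real resolver also has to choose between several matching Contacts, which is not shown.

```ruby
# Split a display name into lowercase letter-only tokens,
# e.g. "Anne-Marie Smith" becomes anne, marie, smith.
def name_tokens(name)
  name.to_s.downcase.scan(/\p{L}+/)
end

# True when every token of the mention is a whole token of the candidate name.
def token_match?(mention, full_name)
  wanted = name_tokens(mention)
  return false if wanted.empty?

  (wanted - name_tokens(full_name)).empty?
end
```

Splitting on non-letter boundaries lets "Anne" match "Anne-Marie Smith", while "Chris" still does not match "Christine" because no substring comparison is involved.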