Skip to content

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)#129

Merged
zxqfd555 merged 2 commits into
pathwaycom:mainfrom
mohit-twelvelabs:feat/twelvelabs-integration
Jun 29, 2026
Merged

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)#129
zxqfd555 merged 2 commits into
pathwaycom:mainfrom
mohit-twelvelabs:feat/twelvelabs-integration

Conversation

@mohit-twelvelabs

Copy link
Copy Markdown
Contributor

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

Process note: Per CONTRIBUTING.md and the PR template, this change ideally starts with an issue/Discord discussion and requires signing the CLA. I'm opening this PR as a concrete proposal to make the discussion easier — happy to file an issue, adjust scope, or iterate however the maintainers prefer, and I'll sign the CLA when prompted.

Introduction

This adds a new, fully opt-in application template: Video RAG with TwelveLabs (templates/video_rag_twelvelabs/). It lets a Pathway pipeline do RAG over video by bringing in two TwelveLabs models:

  • Pegasus (video understanding) — a Pathway parser (TwelveLabsVideoParser, a pw.UDF) that uploads each video as a TwelveLabs asset and turns it into a rich text description (what happens on screen, who/what appears, spoken and on-screen text, the overall topic). Pathway then indexes that text exactly like it indexes a PDF.
  • Marengo (multimodal embeddings, 512-dim) — a Pathway embedder (MarengoEmbedder, a BaseEmbedder subclass) used as the retriever embedder.

Both components live in a local pathway_twelvelabs package and are wired in entirely through app.yaml (mirroring the multimodal_rag and slides_ai_search templates), so models, prompts, the data source, and the LLM can all be swapped without touching Python.

Context

The existing templates handle documents (PDF/DOCX/slides) but not video. Video is hard to drop into RAG because most stacks only transcribe the audio and discard everything visual. Pegasus captures the whole video as text, and Marengo gives a shared multimodal embedding space. This extends Pathway's live-sync + in-memory-index story to a new modality with zero new infrastructure.

How has this been tested?

  • templates/video_rag_twelvelabs/test_twelvelabs.py: 4 no-network unit tests (stubbed SDK; run without credentials) covering the embedder vector shape, the Pegasus upload-then-analyze flow, failed-asset handling, and an embedding-dimension regression test. 2 of these are dimension/default checks; a 5th test is a live smoke test that's skipped unless TWELVELABS_API_KEY is set.
  • Live, end to end against the real TwelveLabs API:
    • Marengo text embedding returns a 512-dim vector, and MarengoEmbedder.get_embedding_dimension() correctly reports 512 (this required overriding the base probe, which assumes a single-vector return).
    • Pegasus analyzed a short public sample video end to end (asset upload → analyze → text) in ~7s, returning a correct description.
  • Linters matching the repo's CI (black, isort --profile black, flake8) pass on all new files. The new module type-checks cleanly under mypy; the template dir is added to the existing [tool.mypy] exclude list, consistent with the other RAG templates.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature or improvement (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

This is purely additive: a new template directory plus one row in the main README table and one entry in the mypy exclude list. No existing template, default, or behavior is changed.

Related issue(s):

  1. (none yet — happy to open one if preferred)

Checklist:

  • My code follows the code style of this project,
  • My change requires a change to the documentation (added a template README and a row in the main README),
  • I described the modification in the CHANGELOG.md file. (No CHANGELOG.md exists in this repo.)

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

@CLAassistant

CLAassistant commented Jun 25, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@zxqfd555 zxqfd555 self-assigned this Jun 26, 2026
@zxqfd555

Copy link
Copy Markdown
Contributor

Thank you for your interest in Pathway! Could you please sign the CLA, which is required for the merge? Thank you.

@mohit-twelvelabs

Copy link
Copy Markdown
Contributor Author

Thanks @zxqfd555 — glad it's of interest! I'll get the CLA signed so we can move ahead with the merge. Appreciate you taking a look at the template.

@zxqfd555 zxqfd555 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I have tested the integration with an actual API key, and I can confirm it works. I have several questions on the implementation. Could you please address them?


def _upload_asset(self, contents: bytes) -> str:
"""Upload video bytes and return the asset id once it is ready."""
asset = self.client.assets.create(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested the integration with a real API key, and it works!
However, I have noted that a run creates a new asset, which is not removed afterwards. I suspect it may flood the assets list if there are many runs. Would it be possible to clear the produced assets on completion?

return len(self._embed_one("."))

def _embed_one(self, text: str) -> np.ndarray:
response = self.client.embed.create(model_name=self.model, text=text)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is a blocking call.

Returns:
A list of 512-dimensional ``numpy`` arrays, one per input string.
"""
return [self._embed_one(text) for text in inputs]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _embed_one is synchronous, this call is potentially inefficient.
It would be better to replace it with an async version if possible. If it's not provided by the current API, please document this behavior.

…t path

- TwelveLabsVideoParser: delete the per-run uploaded asset after analysis via
  client.assets.delete() inside a try/finally (cleaned up even if analyze()
  raises). Add a `delete_assets: bool = True` flag so the asset list isn't
  flooded by default; pass False to keep assets for reuse/inspection. When
  deletion is on, omit `twelvelabs_asset_id` from metadata since the id no
  longer resolves. Docstrings updated accordingly.
- MarengoEmbedder.get_embedding_dimension: document that it is a one-time,
  setup-time synchronous probe (Pathway calls it once when building the index),
  not on the hot path, so the single blocking call there is intentional.
- MarengoEmbedder: move the embedding hot path to the async client. Add a lazily
  built AsyncTwelveLabs `aclient` property and `_aembed_one`, and make
  __wrapped__ run requests concurrently via asyncio.gather instead of calling
  the sync client serially. Keep sync `_embed_one` for the setup probe only.
- Tests: cover asset deletion (default + disabled + delete-on-analyze-failure)
  and the new async, concurrent __wrapped__ path.
@mohit-twelvelabs

Copy link
Copy Markdown
Contributor Author

Thanks so much for testing this with a real key, @zxqfd555 — really appreciate it. I've pushed 7846519 addressing all three comments:

1. Assets flooding the list (_upload_asset / __wrapped__) — Good catch. The parser now deletes the per-run asset after analysis via client.assets.delete(asset_id), wrapped in a try/finally so cleanup happens even if analyze() raises. I made it the default (delete_assets: bool = True) so repeated runs no longer accumulate assets, with an opt-out (delete_assets=False) for anyone who wants to keep assets around for reuse/inspection. Since the id no longer resolves once the asset is gone, I now omit twelvelabs_asset_id from the emitted metadata when deletion is on (and keep it when delete_assets=False); the docstrings reflect this. Verified live end to end: create → assets.delete → a follow-up retrieve confirms the asset is gone.

2. get_embedding_dimension is a blocking call — Correct, it is blocking, and that's intentional here: Pathway calls it exactly once at index-build time, so it's a one-time setup-time probe, not on the per-document hot path. I added a docstring note making that explicit so the behavior is documented rather than surprising.

3. MarengoEmbedder.__wrapped__ calling the sync _embed_one — Fixed properly. The SDK does ship an async client, so I moved the hot path onto it: there's now a lazily-built AsyncTwelveLabs aclient (mirroring the existing sync client property) and an async def _aembed_one, and __wrapped__ now runs the requests concurrently via asyncio.gather instead of serial sync calls — so the embedding path no longer blocks. The sync _embed_one is kept solely for the one-time dimension probe above.

I also added tests for the asset-deletion behavior (default, disabled, and cleanup-on-analyze-failure) and for the new concurrent async __wrapped__ path. Verified live against the API: dimension probe returns 512, and the async path returns two (512,) float32 vectors. Let me know if you'd like any tweaks!

@zxqfd555 zxqfd555 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes. Merging.

@zxqfd555 zxqfd555 merged commit 1cfa84a into pathwaycom:main Jun 29, 2026
2 checks passed
@mohit-twelvelabs

Copy link
Copy Markdown
Contributor Author

Thank you @zxqfd555 and the Pathway team for the thorough review and the merge! 🎉 Really appreciate you testing against the live API and the sharp catches on asset cleanup and the async embedding path — the template's better for it. Excited to see TwelveLabs video RAG in Pathway; happy to help with any follow-ups down the line.

@zxqfd555

Copy link
Copy Markdown
Contributor

As a follow-up, I think we can port the embedder to the Pathway package so it's available out of the box for anyone who pip-installs it. I've described the details in the linked issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants