Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder) by mohit-twelvelabs · Pull Request #129 · pathwaycom/llm-app

mohit-twelvelabs · 2026-06-25T10:34:32Z

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

Process note: Per CONTRIBUTING.md and the PR template, this change ideally starts with an issue/Discord discussion and requires signing the CLA. I'm opening this PR as a concrete proposal to make the discussion easier — happy to file an issue, adjust scope, or iterate however the maintainers prefer, and I'll sign the CLA when prompted.

Introduction

This adds a new, fully opt-in application template: Video RAG with TwelveLabs (templates/video_rag_twelvelabs/). It lets a Pathway pipeline do RAG over video by bringing in two TwelveLabs models:

Pegasus (video understanding) — a Pathway parser (TwelveLabsVideoParser, a pw.UDF) that uploads each video as a TwelveLabs asset and turns it into a rich text description (what happens on screen, who/what appears, spoken and on-screen text, the overall topic). Pathway then indexes that text exactly like it indexes a PDF.
Marengo (multimodal embeddings, 512-dim) — a Pathway embedder (MarengoEmbedder, a BaseEmbedder subclass) used as the retriever embedder.

Both components live in a local pathway_twelvelabs package and are wired in entirely through app.yaml (mirroring the multimodal_rag and slides_ai_search templates), so models, prompts, the data source, and the LLM can all be swapped without touching Python.

Context

The existing templates handle documents (PDF/DOCX/slides) but not video. Video is hard to drop into RAG because most stacks only transcribe the audio and discard everything visual. Pegasus captures the whole video as text, and Marengo gives a shared multimodal embedding space. This extends Pathway's live-sync + in-memory-index story to a new modality with zero new infrastructure.

How has this been tested?

templates/video_rag_twelvelabs/test_twelvelabs.py: 4 no-network unit tests (stubbed SDK; run without credentials) covering the embedder vector shape, the Pegasus upload-then-analyze flow, failed-asset handling, and an embedding-dimension regression test. 2 of these are dimension/default checks; a 5th test is a live smoke test that's skipped unless TWELVELABS_API_KEY is set.
Live, end to end against the real TwelveLabs API:
- Marengo text embedding returns a 512-dim vector, and MarengoEmbedder.get_embedding_dimension() correctly reports 512 (this required overriding the base probe, which assumes a single-vector return).
- Pegasus analyzed a short public sample video end to end (asset upload → analyze → text) in ~7s, returning a correct description.
Linters matching the repo's CI (black, isort --profile black, flake8) pass on all new files. The new module type-checks cleanly under mypy; the template dir is added to the existing [tool.mypy] exclude list, consistent with the other RAG templates.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature or improvement (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

This is purely additive: a new template directory plus one row in the main README table and one entry in the mypy exclude list. No existing template, default, or behavior is changed.

Related issue(s):

(none yet — happy to open one if preferred)

Checklist:

My code follows the code style of this project,
My change requires a change to the documentation (added a template README and a row in the main README),
I described the modification in the CHANGELOG.md file. (No CHANGELOG.md exists in this repo.)

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

CLAassistant · 2026-06-25T10:34:46Z

All committers have signed the CLA.

zxqfd555 · 2026-06-28T22:52:31Z

Thank you for your interest in Pathway! Could you please sign the CLA, which is required for the merge? Thank you.

mohit-twelvelabs · 2026-06-28T23:35:47Z

Thanks @zxqfd555 — glad it's of interest! I'll get the CLA signed so we can move ahead with the merge. Appreciate you taking a look at the template.

zxqfd555

Hi!

I have tested the integration with an actual API key, and I can confirm it works. I have several questions on the implementation. Could you please address them?

zxqfd555 · 2026-06-29T20:14:00Z

+
+    def _upload_asset(self, contents: bytes) -> str:
+        """Upload video bytes and return the asset id once it is ready."""
+        asset = self.client.assets.create(


I have tested the integration with a real API key, and it works!
However, I have noted that a run creates a new asset, which is not removed afterwards. I suspect it may flood the assets list if there are many runs. Would it be possible to clear the produced assets on completion?

zxqfd555 · 2026-06-29T20:22:52Z

+        return len(self._embed_one("."))
+
+    def _embed_one(self, text: str) -> np.ndarray:
+        response = self.client.embed.create(model_name=self.model, text=text)


If I understand correctly, this is a blocking call.

zxqfd555 · 2026-06-29T20:24:18Z

+        Returns:
+            A list of 512-dimensional ``numpy`` arrays, one per input string.
+        """
+        return [self._embed_one(text) for text in inputs]


If _embed_one is synchronous, this call is potentially inefficient.
It would be better to replace it with an async version if possible. If it's not provided by the current API, please document this behavior.

…t path - TwelveLabsVideoParser: delete the per-run uploaded asset after analysis via client.assets.delete() inside a try/finally (cleaned up even if analyze() raises). Add a `delete_assets: bool = True` flag so the asset list isn't flooded by default; pass False to keep assets for reuse/inspection. When deletion is on, omit `twelvelabs_asset_id` from metadata since the id no longer resolves. Docstrings updated accordingly. - MarengoEmbedder.get_embedding_dimension: document that it is a one-time, setup-time synchronous probe (Pathway calls it once when building the index), not on the hot path, so the single blocking call there is intentional. - MarengoEmbedder: move the embedding hot path to the async client. Add a lazily built AsyncTwelveLabs `aclient` property and `_aembed_one`, and make __wrapped__ run requests concurrently via asyncio.gather instead of calling the sync client serially. Keep sync `_embed_one` for the setup probe only. - Tests: cover asset deletion (default + disabled + delete-on-analyze-failure) and the new async, concurrent __wrapped__ path.

mohit-twelvelabs · 2026-06-29T20:43:53Z

Thanks so much for testing this with a real key, @zxqfd555 — really appreciate it. I've pushed 7846519 addressing all three comments:

1. Assets flooding the list (_upload_asset / __wrapped__) — Good catch. The parser now deletes the per-run asset after analysis via client.assets.delete(asset_id), wrapped in a try/finally so cleanup happens even if analyze() raises. I made it the default (delete_assets: bool = True) so repeated runs no longer accumulate assets, with an opt-out (delete_assets=False) for anyone who wants to keep assets around for reuse/inspection. Since the id no longer resolves once the asset is gone, I now omit twelvelabs_asset_id from the emitted metadata when deletion is on (and keep it when delete_assets=False); the docstrings reflect this. Verified live end to end: create → assets.delete → a follow-up retrieve confirms the asset is gone.

2. get_embedding_dimension is a blocking call — Correct, it is blocking, and that's intentional here: Pathway calls it exactly once at index-build time, so it's a one-time setup-time probe, not on the per-document hot path. I added a docstring note making that explicit so the behavior is documented rather than surprising.

3. MarengoEmbedder.__wrapped__ calling the sync _embed_one — Fixed properly. The SDK does ship an async client, so I moved the hot path onto it: there's now a lazily-built AsyncTwelveLabs aclient (mirroring the existing sync client property) and an async def _aembed_one, and __wrapped__ now runs the requests concurrently via asyncio.gather instead of serial sync calls — so the embedding path no longer blocks. The sync _embed_one is kept solely for the one-time dimension probe above.

I also added tests for the asset-deletion behavior (default, disabled, and cleanup-on-analyze-failure) and for the new concurrent async __wrapped__ path. Verified live against the API: dimension probe returns 512, and the async path returns two (512,) float32 vectors. Let me know if you'd like any tweaks!

zxqfd555

Thank you for the changes. Merging.

mohit-twelvelabs · 2026-06-29T21:36:25Z

Thank you @zxqfd555 and the Pathway team for the thorough review and the merge! 🎉 Really appreciate you testing against the live API and the sharp catches on asset cleanup and the async embedding path — the template's better for it. Excited to see TwelveLabs video RAG in Pathway; happy to help with any follow-ups down the line.

zxqfd555 · 2026-06-30T13:49:25Z

As a follow-up, I think we can port the embedder to the Pathway package so it's available out of the box for anyone who pip-installs it. I've described the details in the linked issue.

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)

b0083d0

zxqfd555 self-assigned this Jun 26, 2026

zxqfd555 reviewed Jun 29, 2026

View reviewed changes

zxqfd555 approved these changes Jun 29, 2026

View reviewed changes

zxqfd555 merged commit 1cfa84a into pathwaycom:main Jun 29, 2026
2 checks passed

zxqfd555 mentioned this pull request Jun 30, 2026

Promote TwelveLabsVideoParser and MarengoEmbedder from the Video RAG template into pathway.xpacks.llm pathwaycom/pathway#255

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)#129

Add TwelveLabs video RAG template (Pegasus parser + Marengo embedder)#129
zxqfd555 merged 2 commits into
pathwaycom:mainfrom
mohit-twelvelabs:feat/twelvelabs-integration

mohit-twelvelabs commented Jun 25, 2026

Uh oh!

CLAassistant commented Jun 25, 2026 •

edited

Loading

Uh oh!

zxqfd555 commented Jun 28, 2026

Uh oh!

mohit-twelvelabs commented Jun 28, 2026

Uh oh!

zxqfd555 left a comment

Uh oh!

zxqfd555 Jun 29, 2026

Uh oh!

zxqfd555 Jun 29, 2026

Uh oh!

zxqfd555 Jun 29, 2026

Uh oh!

mohit-twelvelabs commented Jun 29, 2026

Uh oh!

zxqfd555 left a comment

Uh oh!

Uh oh!

mohit-twelvelabs commented Jun 29, 2026

Uh oh!

zxqfd555 commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mohit-twelvelabs commented Jun 25, 2026

Introduction

Context

How has this been tested?

Types of changes

Related issue(s):

Checklist:

Uh oh!

CLAassistant commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zxqfd555 commented Jun 28, 2026

Uh oh!

mohit-twelvelabs commented Jun 28, 2026

Uh oh!

zxqfd555 left a comment

Choose a reason for hiding this comment

Uh oh!

zxqfd555 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

zxqfd555 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

zxqfd555 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

mohit-twelvelabs commented Jun 29, 2026

Uh oh!

zxqfd555 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mohit-twelvelabs commented Jun 29, 2026

Uh oh!

zxqfd555 commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CLAassistant commented Jun 25, 2026 •

edited

Loading