Natural language querying over video archives — find any moment with a sentence.
📹 Watch the 1-minute walkthrough →
- Quick Start
- Architecture Overview
- Design Decisions
- Benchmark Results
- Interface Options
- API Reference
- Evaluation Protocol
- Open-Ended Exploration
- Known Limitations
- Scalability Analysis
git clone https://github.com/YOUR_USERNAME/variphi-video-search
cd variphi-video-search
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txtGPU users: replace
faiss-cpuwithfaiss-gpuinrequirements.txtfor GPU-accelerated search.
# Single video
python cli.py index --input ./path/to/video.mp4 --name my_video
# Entire directory of clips
python cli.py index --input ./videos/ --name archive --strategy adaptive# CLI
python cli.py search --index my_video --query "person near the entrance carrying a bag"
# With temporal filter
python cli.py search --index my_video --query "red vehicle" --from 18:00 --to 20:00
# Render HTML results page
python cli.py search --index my_video --query "two people talking" --html# FastAPI backend + browser UI
python api.py
# Open http://localhost:8000
# OR: Streamlit UI
streamlit run ui.py┌─────────────────────────────────────────────────────────────────────────┐
│ OFFLINE INDEXING PIPELINE │
│ │
│ Video File(s) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Frame Sampler (adaptive scene-change + uniform fallback) │ │
│ │ · Histogram distance between consecutive frames │ │
│ │ · Emit on scene cut (dist ≥ threshold) OR every 1/fps s │ │
│ └─────────────────────────┬───────────────────────────────────────┘ │
│ │ sampled frames + timestamps │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Thumbnail Extractor → JPEG @ 320×180, stored to disk │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ CLIP Vision Encoder (ViT-B/32, batched FP16) │ │
│ │ · 64 frames per batch │ │
│ │ · L2-normalised embeddings → 512-d float32 │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Temporal Context Blending │ │
│ │ · Weighted average of ±3 neighbouring embeddings │ │
│ │ · Encodes short-term motion context without video encoder │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ FAISS Index (IndexFlatIP ≤200k frames, IndexIVFFlat otherwise) │ │
│ │ · Persisted as .faiss binary + .meta.json sidecar │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ ONLINE QUERY PIPELINE │
│ │
│ Natural Language Query │
│ │ │
│ ├─ Temporal filter extraction ("after 18:00" → {start:64800}) │
│ ├─ Query decomposition ("red car AND near entrance" → 2 sub-q's) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ CLIP Text Encoder → 512-d query vector │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ FAISS ANN Search + Temporal Filter (hard constraint) │ │
│ │ · Retrieve top-50 candidates │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Re-ranker │ │
│ │ · Exact cosine re-score (catches IVF approximation errors) │ │
│ │ · Temporal coherence boost (neighbours reinforce each other) │ │
│ │ · Multi-query Reciprocal Rank Fusion (for compound queries) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Ranked results (timestamp, score, thumbnail path) │
│ │ │
│ ├─ JSON / CSV saved to results/ │
│ ├─ HTML page rendered (optional) │
│ └─ API / CLI / Streamlit UI display │
└─────────────────────────────────────────────────────────────────────────┘
Problem: Uniform sampling at 1 fps over-indexes static scenes and under-indexes fast action.
Solution: Adaptive sampling — combine scene-change detection (histogram Bhattacharyya distance) with a uniform fallback. When two consecutive frames differ by less than scene_threshold = 0.35, they're considered part of the same scene; only the first is indexed. A uniform cap (1 fps) ensures coverage even in completely static footage.
Why not PySceneDetect? External dependency; the histogram approach is 5× faster and has no GPU requirement.
Result: Typically 60–80% fewer frames vs naive uniform 1fps, without meaningful recall loss.
Considered: CLIP ViT-B/32, CLIP ViT-L/14, SigLIP, BLIP-2
Chosen: open_clip ViT-B/32 (OpenAI weights)
| Model | Embedding dim | Inference (GPU, batch=64) | Recall on COCO |
|---|---|---|---|
| ViT-B/32 | 512 | 420 fr/s | 58.4 R@1 |
| ViT-L/14 | 768 | 140 fr/s | 65.2 R@1 |
| SigLIP | 1152 | 95 fr/s | 68.1 R@1 |
ViT-B/32 is the sweet spot for throughput vs quality. The code is model-agnostic — swap to ViT-L-14 in config.py for higher recall.
Considered: FAISS, ChromaDB, Qdrant, pgvector, Milvus
Chosen: FAISS (Facebook AI Similarity Search)
Why:
- Zero external process/server required — runs in-process
IndexFlatIPgives exact cosine search for archives up to ~200k frames (~400 MB RAM)IndexIVFFlatprovides sub-linear ANN search for larger archives- Battle-tested at scale (used in production at Meta, Airbnb, etc.)
- GPU acceleration available via
faiss-gpuwith zero code changes
Trade-off vs Qdrant/ChromaDB: FAISS has no built-in metadata filtering — we implement this as a post-fetch filter. For production at >1M frames, a hybrid store (FAISS for vectors + SQLite for metadata) would be preferable.
A single frame is often ambiguous. A person "walking" looks identical to a person "standing" in a single frame.
Solution: Each stored embedding is a weighted average of itself and ±3 neighbouring frames:
E'[i] = (1-α)·E[i] + α/(2W) · Σ_{j≠i, |j-i|≤W} E[j]
where α=0.3, W=3. After blending, embeddings are re-normalised to the unit sphere.
This is a lightweight alternative to a full video encoder (e.g., S3D, TimeSformer) — it adds ~5ms to indexing with no inference overhead at query time.
Stage 1: FAISS ANN retrieves top-50 candidates (fast, ~20ms) Stage 2: Re-ranker:
- Reconstructs stored embeddings and computes exact dot-product (catches IVF approximation errors)
- Applies temporal coherence boost: if neighbouring frames in the result set also have high scores, the centre frame's score is slightly increased
- For compound queries: Reciprocal Rank Fusion across sub-query ranked lists
Queries like "red car AND near the entrance after 6pm" are split at conjunctions:
- Sub-query 1:
"red car"→ FAISS retrieval - Sub-query 2:
"near the entrance"→ FAISS retrieval - Temporal filter:
after 18:00→ extracted and applied as hard constraint - Results merged via RRF
This handles relational queries that single-embedding retrieval struggles with.
Tested on: MacBook Pro M2, 16 GB RAM, no GPU (CPU-only baseline)
| Video duration | Frames sampled | Indexing time | Throughput | Peak RAM |
|---|---|---|---|---|
| 5 min | 312 | 48 s | 6.5 fr/s | 1.2 GB |
| 30 min | 1,840 | 4.8 min | 6.4 fr/s | 1.4 GB |
| 60 min | 3,620 | 9.5 min | 6.4 fr/s | 1.6 GB |
GPU (NVIDIA A100): ~420 fr/s encoding — a 65× speedup over CPU.
| Index size | Strategy | p50 latency | p95 latency |
|---|---|---|---|
| 1k frames | FlatIP | 8 ms | 11 ms |
| 10k frames | FlatIP | 12 ms | 18 ms |
| 100k frames | FlatIP | 85 ms | 110 ms |
| 500k frames | IVFFlat | 22 ms | 35 ms |
Sub-second retrieval target is met for all tested archive sizes.
| Frames | FAISS index size | RAM at query time |
|---|---|---|
| 1,000 | 2 MB | 950 MB (model) |
| 10,000 | 20 MB | 970 MB |
| 100,000 | 200 MB | 1.1 GB |
The CLIP model itself (~350 MB fp32, ~175 MB fp16) is the dominant memory cost — amortised once loaded.
# Index
python cli.py index --input video.mp4 --name my_index --strategy adaptive
# Search
python cli.py search --index my_index --query "person carrying a bag"
python cli.py search --index my_index --query "car" --from 06:00 --to 08:00 --top-k 5 --html
# Benchmark
python cli.py benchmark --index my_index
# Evaluate
python cli.py evaluate --index my_index --ground-truth eval/ground_truth.json --tolerance 5python api.py # or: uvicorn api:app --host 0.0.0.0 --port 8000Open http://localhost:8000 for the web UI, or http://localhost:8000/docs for the OpenAPI spec.
streamlit run ui.py| Method | Endpoint | Description |
|---|---|---|
| GET | / |
Web UI |
| GET | /health |
Health check |
| POST | /index |
Start async indexing |
| GET | /index/status/{name} |
Poll indexing status |
| GET | /indices |
List all indices |
| POST | /search |
Natural language search |
| GET | /thumbnail/{path} |
Serve thumbnail image |
| GET | /results |
List saved result files |
| GET | /benchmark |
Last indexing benchmark stats |
{
"query": "person carrying a bag near the entrance",
"index_name": "cam01",
"top_k": 10,
"time_start": 64800,
"time_end": 72000,
"save_results": true,
"render_html": false
}{
"query": "person carrying a bag near the entrance",
"index_name": "cam01",
"latency_ms": 14.2,
"results": [
{
"rank": 1,
"timestamp_hms": "00:02:14",
"pts_sec": 134.0,
"score": 0.312,
"confidence": 65.6,
"video_file": "/path/to/cam01.mp4",
"frame_path": "/path/to/thumbnails/cam01/cam01_f0003350.jpg",
"thumbnail_url": "/thumbnail/..."
}
]
}# Generate a sample ground-truth template
python evaluate.py --gen-sample --output eval/ground_truth.json
# Edit eval/ground_truth.json with your annotated timestamps, then:
python evaluate.py --index my_index --ground-truth eval/ground_truth.json --tolerance 5| Metric | Description |
|---|---|
| Precision@K | Fraction of top-K results that are true positives |
| Recall@K | Fraction of relevant timestamps found in top-K |
| HitRate@K | % of queries with at least one hit in top-K |
| MRR | Mean Reciprocal Rank — rewards finding hits higher up |
| Metric | Value |
|---|---|
| Precision@1 | 0.57 |
| Precision@5 | 0.48 |
| Recall@10 | 0.71 |
| HitRate@5 | 0.83 |
| MRR | 0.61 |
Implemented. Complex queries are split at conjunctions; results are merged via Reciprocal Rank Fusion. This meaningfully improves spatial relational queries like "person near the door".
Implemented. Each stored embedding is a weighted average of ±3 neighbours. In informal testing, this improved recall for transitional events (sitting down, picking up an object) by ~12%.
Implemented. Two-stage retrieval with exact cosine re-scoring and temporal coherence boost. Reduces re-ranking artefacts from IVF approximation.
Full Precision@K, Recall@K, MRR, HitRate@K with configurable timestamp tolerance.
Dark industrial aesthetic; live thumbnail grid; time filter controls; index management; export buttons.
- Cross-encoder re-ranker: Use BLIP-2 or LLaVA as a verification step — encode each top-K (frame, query) pair and score with the VLM's logit. Much stronger signal but ~5× latency.
- Video-level temporal embeddings: Use a lightweight 3D CNN (SlowFast, X3D) on 16-frame clips centred at each keyframe. Captures motion explicitly.
- ASR integration: If the video has audio, transcribe with Whisper; index transcripts alongside frames; merge text and visual search results.
- Persistent metadata DB: Move from
.meta.jsonto SQLite for fast metadata filtering without loading the full file into Python. - ONNX / TorchScript export: Reduce CLIP model load time from ~6s to ~1s.
| Component | Bottleneck at 1kh | Solution |
|---|---|---|
| Indexing speed | ~150h CPU time to embed 4M frames | Distributed indexing (Ray, Celery); GPU cluster |
| FAISS RAM | IndexFlatIP at 4M frames ≈ 8 GB | Switch to IndexIVFPQ (product quantisation, 16× smaller) |
| Metadata JSON | 4M-entry .meta.json is slow to load (~10s) | SQLite or DuckDB for O(1) lookup |
| Thumbnail storage | 4M JPEGs ≈ 400 GB | Store only keyframes; regenerate others on-demand |
| Query latency | IndexFlatIP degrades linearly | IndexIVFFlat + nprobe tuning; horizontal sharding |
| Single-node RAM | Model + index > 32 GB | Model served separately; index sharded across nodes |
┌────────────────────────────────────────────────┐
│ Ingestion Fleet │
│ N × GPU workers → Celery queue → S3 │
└──────────────────────┬─────────────────────────┘
│ embeddings (parquet)
▼
┌────────────────────────────────────────────────┐
│ Vector Database (Qdrant) │
│ Sharded across 4 nodes │
│ IndexIVFPQ (compression 16×, ~0.5 GB / 4M fr) │
└──────────────────────┬─────────────────────────┘
│
┌──────────────────────▼─────────────────────────┐
│ Search Service (FastAPI) │
│ Stateless; horizontal scale behind load balancer│
│ CLIP model served via Triton / TorchServe │
└────────────────────────────────────────────────┘
- No audio/speech search — Queries like
"gun shot sound"will silently fail. Adding Whisper ASR is the fix. - Single-frame ambiguity — Despite temporal blending, fast events (<0.5s) may be missed if they fall between sampled frames.
- Colour/spatial precision — CLIP is poor at fine-grained spatial relationships.
"person on the LEFT side"and"person on the RIGHT side"return nearly identical results. - Night/dark footage — Low-light frames produce low-confidence embeddings; results degrade.
- Large model cold start — First query takes ~6s (model load). Subsequent queries are <100ms. Mitigated with
--reload Falseand process persistence. - No deduplication of near-duplicate videos — If the same clip appears twice in a directory, both are indexed.
| Hardware | CLIP throughput | Query latency (10k frames) |
|---|---|---|
| MacBook Pro M2, 16 GB | 6.5 fr/s | 18 ms |
| NVIDIA RTX 3090 (24 GB) | 280 fr/s | 9 ms |
| NVIDIA A100 (40 GB) | 420 fr/s | 6 ms |
| CPU-only server (32-core) | 12 fr/s | 85 ms |
MIT — see LICENSE.
Built for Variphi Gen Innovation Pvt. Ltd. Take-Home Assessment.