vancyland/DataClaw0

DataClaw

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Actively refining and structuring raw multimodal data to align with diverse user and downstream intents.

Note

The code, model weights, dataset, and the DataClaw-val benchmark will be released upon paper acceptance. In the meantime, read the method in the paper and explore the qualitative cases on the project page.


📰 News

  • 2026-06 — DataClaw v1 paper released on arXiv 📄
  • 2026-06 — Project page with qualitative cases across five domains is live 🌐
  • Upcoming — Code, weights, dataset, and the DataClaw-val benchmark will be released upon acceptance

🐾 Overview

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms — heavily reliant on heuristic rules or general VLMs — are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data.

DataClaw elevates data processing to a learnable, high-order capability. We propose a paradigm shift towards Agentic Data Tailoring: given a user intent or downstream objective, a 9B tailoring agent filters redundant signal from long videos, GUI traces, embodied trajectories, and editing sequences, then reorganizes the residual into dense, verifiable, application-specific supervision.

Key ideas:

  • Bottom-up Factual Anchors → Top-down Semantic Synthesis. A two-stage pipeline grounds generative semantic synthesis in deterministic factual anchors, yielding a large-scale dataset spanning five core physical and digital domains.
  • SFT + rule-driven GRPO. DataClaw-9B is trained with Supervised Fine-Tuning (SFT) combined with rule-driven Group Relative Policy Optimization (GRPO) to align with complex refinement and tailoring intents.
  • Two deployment paradigms. A single unified Omni model (DataClaw-O) or a panel of domain Experts (DataClaw-E).
  • DataClaw-val. The first benchmark dedicated to data refinement, scoring outputs by JSON validity and schema-aware Field / Semantic / Sequence metrics.
  • Downstream post-training as the ultimate touchstone. Validated on video generation, real-world VQA, and GUI navigation under volume-aligned training budgets.

Figure: DataClaw method overview. The pipeline runs bottom-up factual anchor extraction and top-down semantic synthesis, followed by training under the Omni and Expert paradigms, inference, and downstream utilization.
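
The rule-driven GRPO idea listed under the key ideas can be illustrated with a minimal sketch. It assumes a hypothetical rule-based reward (1.0 for a parseable JSON object containing all required keys, 0.5 for a parseable object missing some keys, 0.0 otherwise) and computes the standard group-relative advantage, (r_i - mean) / (std + eps), over a group of outputs sampled for one prompt. The reward rules, key names, and function names are illustrative assumptions, not the project's training code, which has not been released.

```python
import json
import statistics

# Hypothetical schema used only for this illustration.
REQUIRED_KEYS = {"action", "start", "end"}


def rule_reward(output: str, required_keys=REQUIRED_KEYS) -> float:
    """Assumed rule-based reward: JSON validity plus required-key coverage."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not parseable at all
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object
    return 1.0 if required_keys.issubset(parsed) else 0.5


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as GRPO does."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, three sampled completions of differing quality.
samples = [
    '{"action": "open menu", "start": 0.0, "end": 1.5}',
    '{"action": "open menu"}',
    "not json",
]
rewards = [rule_reward(s) for s in samples]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to rank completions within a group.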

🎬 Qualitative Cases

Interactive case replays across five domains (daily life, education, GUI agents, embodied, and AIGC), with input videos and the agent's reasoning, are available on the project page.


📊 Results (from the v1 paper)

DataClaw-E is the routed expert configuration; DataClaw-O is the unified omni model.

DataClaw-val — structured-output quality (Field / Semantic / Sequence)

| Model | Field | Semantic | Sequence |
|---|---|---|---|
| Gemini-3.1-Pro | 98.12 | 73.85 | 58.50 |
| GPT-4o | 97.27 | 75.15 | 49.43 |
| DataClaw-E (Ours) | 97.53 | 74.94 | 48.86 |
| DataClaw-O (Ours) | 87.65 | 62.46 | 44.82 |
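
To make the metric families concrete, here is a minimal sketch of how a schema-aware scorer could be organized. The benchmark's actual definitions are not released yet, so everything here is an assumption for illustration: Field is taken as the fraction of required keys present in a parseable output, Sequence as a longest-common-subsequence ratio between predicted and reference step lists, and the `steps` key and all function names are hypothetical. Semantic scoring, which would need a text-similarity model or judge, is omitted.

```python
import json


def parse_json(output: str):
    """Return the parsed dict if the output is a valid JSON object, else None."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def field_score(pred: dict, required_keys) -> float:
    """Assumed Field metric: share of required keys present in the prediction."""
    required = list(required_keys)
    return sum(k in pred for k in required) / len(required)


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_score(pred_steps, ref_steps) -> float:
    """Assumed Sequence metric: LCS length over the longer of the two lists."""
    if not ref_steps:
        return 1.0 if not pred_steps else 0.0
    return lcs_length(pred_steps, ref_steps) / max(len(pred_steps), len(ref_steps))


reference = {"task": "make tea", "steps": ["boil water", "add tea", "pour"]}
prediction = parse_json('{"task": "make tea", "steps": ["boil water", "pour"]}')

field = field_score(prediction, reference.keys())
sequence = sequence_score(prediction["steps"], reference["steps"])
```

In this toy case the prediction has every required key but drops one of three ordered steps, so the assumed Field score is perfect while the assumed Sequence score is penalized.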

Targeted Refinement — downstream SFT (same raw streams, same budget, only the annotator changes)

| Downstream task | Metric | SFT on Gemini-3.1-Pro | SFT on DataClaw |
|---|---|---|---|
| GUI navigation (AgentNet) | SSR ↑ / TSR | 39.5 / 14.2 | 38.2 / 15.6 |
| Action video gen (Ego4D) | FVD ↓ / Contact mAP | 295.4 / 48.5 | 288.6 / 51.2 |
| Spatio-temporal VQA (ReMoT) | Partial / Overall | 53.4 / 31.5 | 52.1 / 33.2 |

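On DataClaw-val, DataClaw-E is within one point of both GPT-4o and Gemini-3.1-Pro on Field and Semantic, but trails Gemini-3.1-Pro on Sequence (48.86 vs. 58.50). In downstream SFT, training on DataClaw annotations gives the better TSR, FVD, Contact mAP, and Overall scores, while Gemini-3.1-Pro annotations give slightly better SSR and Partial scores. Full tables, ablations, scaling curves, and t-SNE diversity analysis are in the paper.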


🗺️ Release Roadmap

The paper and project page are already live; everything else below ships upon paper acceptance (v2).

  • 📄 Paper (v1) on arXiv
  • 🌐 Project page with qualitative cases & demos
  • 🧩 Code — training (SFT + GRPO) and inference
  • 🏋️ Model weights — DataClaw-O (Omni) & DataClaw-E (Experts)
  • 📦 Dataset — five domains (daily life, education, embodied, GUI agents, AIGC)
  • 📐 DataClaw-val benchmark + evaluation scripts
  • 📝 Reproduction recipes for downstream SFT tasks

Star / watch this repo to be notified the moment the code and data drop.


📌 Citation

If you find DataClaw useful for your research, please consider citing:

@article{wan2026dataclaw,
  title   = {DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams},
  author  = {Wan, Cong and Guo, Zeyu and Cai, Zijian and Li, Jiangyang and
             Dong, SongLin and Peng, Lin and Luo, Xiangyang and Ma, Zhiheng and Gong, Yihong},
  journal = {arXiv preprint arXiv:2606.21337},
  year    = {2026}
}

📬 Contact

Questions, collaboration, or follow-up? Open an issue or reach the authors via the contacts listed on the paper.

License

The license for the code and released artifacts will be announced together with the open-source release upon acceptance.
