wayyoungboy/skill_graph_search

Skill Graph Search

Dependency-Aware Skill Retrieval Engine — Using SeekDB to solve the "cliff effect" in AI skill systems

Python 3.11+ SeekDB License

Problem

Inspiration: the article The Truth About Skill Systems argues that giving an AI a thousand skills should, in theory, make it omnipotent, yet in practice the benefit is near zero, because semantic retrieval finds skills that only seem relevant and misses the set that actually works together.

Root cause: Semantic retrieval finds skills that "look relevant" but misses the dependencies needed to execute them. A top-level skill (e.g., "video analysis") requires underlying dependencies (frame extraction, object counting, format conversion) to function. Pure vector search returns a list of skills missing these dependency chains, leaving the AI with an empty shell.

Semantic similarity ≠ Execution sufficiency.
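
A toy sketch of this gap, assuming made-up 3-dimensional embeddings (real models use hundreds of dimensions) and two hypothetical skill ids, video_summarizer and format_converter, alongside ids from this README's own examples. Ranking by cosine similarity returns the skills that resemble the query, while the dependencies the top hit needs score low and fall outside the top-k:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings; axes loosely mean (video, counting, file formats).
EMBEDDINGS = {
    "video_traffic_analyzer": [0.9, 0.4, 0.1],
    "video_summarizer": [0.95, 0.1, 0.2],   # hypothetical look-alike skill
    "frame_extractor": [0.5, 0.0, 0.8],
    "object_counter": [0.1, 0.9, 0.1],
    "format_converter": [0.1, 0.0, 0.99],   # hypothetical skill
}

# What the task actually needs: the top-level skill plus its dependencies.
REQUIRED = ["video_traffic_analyzer", "frame_extractor", "object_counter"]

query = [0.9, 0.35, 0.05]  # stand-in for "analyze video pedestrian traffic"
ranked = sorted(EMBEDDINGS, key=lambda s: cosine(query, EMBEDDINGS[s]), reverse=True)
top2 = ranked[:2]
missing = sorted(set(REQUIRED) - set(top2))

print(top2)     # ['video_traffic_analyzer', 'video_summarizer']
print(missing)  # ['frame_extractor', 'object_counter']
```

The look-alike skill takes the second slot, and both dependencies are absent even though the right top-level skill was found.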

Why SeekDB?

Dependency-aware retrieval requires two capabilities working together: vector search to find seed skills, and relational filtering to expand dependency chains. No single traditional database provides both.

Architecture Comparison

Approach                  Vector Search                             Dependency Expansion            Systems Required       Complexity
────────────────────────  ────────────────────────────────────────  ──────────────────────────────  ─────────────────────  ────────────────────────────────────────────────────────
Milvus / Pinecone alone   Yes                                       No (missing precise ID lookup)  1 + MySQL              High: two databases, data sync, distributed transactions
MySQL / PostgreSQL alone  No (requires external embedding service)  Yes                             1 + embedding service  High: manual vector management, no HNSW
Redis + pgvector          Yes                                       Yes                             2 + Redis              High: separate systems, cross-database joins impossible
SeekDB                    Yes                                       Yes                             1                      Low: single database, unified API, embedded mode

Of the approaches compared above, SeekDB is the only one that provides vector semantic search, BM25 full-text search, and relational filtering in a single database, which removes the need for a multi-system architecture, data synchronization, and cross-database joins.

Five Capabilities in One System

Capability         Implementation                           What It Does
─────────────────  ───────────────────────────────────────  ────────────────────────────────────────────
Vector search      HNSW index, cosine distance              Finds "seemingly relevant" seed skills
BM25 full-text     Built-in inverted index                  Exact match for technical terms and names
Relational filter  SQL on metadata fields                   Precise dependency expansion by ID
Hybrid search      RRF fusion of vector + BM25              Best of both worlds in a single query
Embedded mode      Local file storage, no external service  Zero deployment, works in any Python project

How This Project Uses SeekDB

User query ("analyze video pedestrian traffic")
    │
    ▼
┌─────────────────────────────────────────┐
│  Phase 1: Vector semantic search        │  ← SeekDB HNSW vector
│            → seed skills                 │     single cosine query
├─────────────────────────────────────────┤
│  Phase 2: Recursive dependency expansion │  ← SeekDB metadata filter
│            → full skill closure          │     collection.get(ids=[...])
├─────────────────────────────────────────┤
│  Phase 3: Topological sort              │  ← in-memory DAG
│            → executable order            │     (client-side, lightweight)
└─────────────────────────────────────────┘

Phases 1 and 2 both run against the same SeekDB instance — no cross-database calls, no data sync, no separate vector service.
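
Phases 2 and 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in src/retriever.py or src/graph.py: the DEPENDENCIES map and the lookup function are hypothetical stand-ins for the ID lookup against SeekDB shown in the diagram, and the topological sort uses Python's standard graphlib module.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical dependency map: skill id -> ids it directly depends on.
# In the real project this information would come from the database.
DEPENDENCIES = {
    "video_traffic_analyzer": ["frame_extractor", "object_counter"],
    "frame_extractor": ["format_converter"],
    "object_counter": [],
    "format_converter": [],
}

def lookup(skill_id):
    """Stand-in for an exact-ID fetch of one skill's direct dependencies."""
    return DEPENDENCIES.get(skill_id, [])

def expand_closure(seeds, fetch_dependencies):
    """Phase 2: breadth-first walk collecting every skill reachable from the seeds."""
    closure = set(seeds)
    queue = deque(seeds)
    while queue:
        skill_id = queue.popleft()
        for dep in fetch_dependencies(skill_id):
            if dep not in closure:  # the visited check also guards against cycles
                closure.add(dep)
                queue.append(dep)
    return closure

def execution_order(closure, fetch_dependencies):
    """Phase 3: order the closure so every skill comes after its dependencies."""
    graph = {s: [d for d in fetch_dependencies(s) if d in closure] for s in closure}
    # TopologicalSorter takes node -> predecessors and raises CycleError on a cycle.
    return list(TopologicalSorter(graph).static_order())

closure = expand_closure(["video_traffic_analyzer"], lookup)
order = execution_order(closure, lookup)
print(order)  # dependencies first, "video_traffic_analyzer" last
```

The relative order of independent skills (here frame_extractor and object_counter) is not fixed; only the "dependency before dependent" constraint is guaranteed.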

Hybrid mode (vector + BM25 + RRF) is also available via retrieve_hybrid() for cases where keyword matching complements semantic search.
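
For readers unfamiliar with RRF (Reciprocal Rank Fusion), the idea can be sketched as follows. This is a generic illustration, not SeekDB's internal implementation: the constant k=60 is the value commonly used in the literature and is an assumption here, and the hit lists and the ids crowd_density_estimator and ffmpeg_wrapper are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists: score(id) = sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, skill_id in enumerate(ranking, start=1):
            scores[skill_id] = scores.get(skill_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))

# Hypothetical results for the same query from the two retrievers.
vector_hits = ["video_traffic_analyzer", "crowd_density_estimator", "frame_extractor"]
bm25_hits = ["frame_extractor", "video_traffic_analyzer", "ffmpeg_wrapper"]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
# ['video_traffic_analyzer', 'frame_extractor', 'crowd_density_estimator', 'ffmpeg_wrapper']
```

Skills that appear in both lists rise to the top because their reciprocal ranks add up, which is why a keyword match can rescue a skill that vector search ranked low.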

Benchmarks

Key Findings

Pure vector retrieval misses 30%–40% of execution dependencies. Switching to an embedding with 4× the dimensions (1536 vs. 384) yields the same miss rate on the script-generated data. The problem is not embedding quality: semantic similarity as a paradigm cannot cover dependency relationships. SeekDB finds seeds via vector search and dependencies via relational lookup, achieving 100% dependency completeness while loading only the skills a query needs (~100% token savings compared with loading every skill).

Three Experiments Overview

Dataset            Scale                    Embedding                                  Vector Miss Rate  SeekDB Completeness  Token Ratio
─────────────────  ───────────────────────  ─────────────────────────────────────────  ────────────────  ───────────────────  ───────────
Script-generated   1000 skills, 43 domains  all-MiniLM-L6-v2 (384-dim, ONNX)           30%               100%                 ~1:450
ClawSkills (real)  1054 skills              all-MiniLM-L6-v2 (384-dim, ONNX)           40%               100%                 ~1:500
Script-generated   1000 skills, 43 domains  text-embedding-3-small (1536-dim, OpenAI)  30%               100%                 ~1:450

Key evidence: the 1536-dim OpenAI embedding and the 384-dim ONNX model produce identical miss rates (30%) on the script-generated dataset, which indicates the problem is structural rather than a matter of embedding quality.

Detailed Reports

  • Script-generated (ONNX): REPORT_generated.md
  • ClawSkills (real data): REPORT_clawskills.md
  • Script-generated (OpenAI): REPORT_embedding.md

Script-generated (ONNX) — Selected Results

Query               Vector  Missing  Vanilla   Tokens   Ours   Complete  Savings
─────────────────── ──────  ──────   ──────    ──────── ────── ──────    ────────
Image Pipeline      5       0        1000      500K     2      ✓         ~100%
ML Training         5       2        1000      500K     2      ✓         ~100%
ETL Pipeline        5       3        1000      500K     4      ✓         ~100%
Security Scan       5       0        1000      500K     2      ✓         ~100%
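
The Savings column can be reproduced with simple arithmetic. The sketch below assumes a flat cost of about 500 tokens per skill (inferred from the table: 500K tokens for 1000 skills) and assumes the "Ours" column counts the skills in the returned bundle; neither assumption is stated by the reports themselves.

```python
TOKENS_PER_SKILL = 500  # inferred: 500K tokens / 1000 skills in the "Vanilla" column

def token_savings(total_skills, loaded_skills, tokens_per_skill=TOKENS_PER_SKILL):
    """Return (vanilla tokens, bundle tokens, fraction of tokens saved)."""
    vanilla = total_skills * tokens_per_skill
    ours = loaded_skills * tokens_per_skill
    return vanilla, ours, 1 - ours / vanilla

# ETL Pipeline row: 1000 skills loaded naively vs. a 4-skill bundle.
vanilla, ours, savings = token_savings(1000, 4)
print(f"{vanilla:,} vs {ours:,} tokens, savings {savings:.1%}")
# 500,000 vs 2,000 tokens, savings 99.6%
```

Under these assumptions a 2-skill bundle saves 99.8% and a 4-skill bundle 99.6%, which the table rounds to ~100%.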

ClawSkills (Real Data) — Typical Cases

Query                              Vector  Missing  Vanilla   Tokens   Ours   Complete
────────────────────────────────── ──────  ──────   ──────    ──────── ────── ──────
workspace governance               5       3        1054      527K     4      ✓
security vulnerability scan        5       3        1054      527K     5      ✓
parse document PDF                 5       1        1054      527K     3      ✓

The security query misses 3 binary dependencies: bin_curl, bin_jq, bin_openssl. These skills have descriptions like "Git version control" and "JSON processor", which are semantically unrelated to "security vulnerability scan" and therefore invisible to vector search. SeekDB's relational filtering fills this gap: once a seed skill is found, its dependencies are retrieved by exact ID lookup, not by semantic similarity.
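
One way to express the "Missing" and "Complete" columns is as set arithmetic over skill ids. The sketch below is an assumed definition, not necessarily the one used in the benchmark scripts; the bin_* ids come from the case above, and every other id is hypothetical.

```python
def missing_dependencies(retrieved, required):
    """Required skill ids that the retrieved set does not contain."""
    return sorted(set(required) - set(retrieved))

def completeness(retrieved, required):
    """Fraction of required skill ids present in the retrieved set."""
    required = set(required)
    if not required:
        return 1.0
    return len(required & set(retrieved)) / len(required)

# Seed skill plus its three binary dependencies (seed id is hypothetical).
required = ["security_scanner", "bin_curl", "bin_jq", "bin_openssl"]

# Hypothetical top-5 from pure vector search: all "look" security-related.
vector_top5 = ["security_scanner", "cve_lookup", "port_scanner", "audit_logger", "vuln_report"]

# After expanding the seed's dependencies by exact ID.
after_expansion = ["security_scanner", "bin_curl", "bin_jq", "bin_openssl"]

print(missing_dependencies(vector_top5, required))  # ['bin_curl', 'bin_jq', 'bin_openssl']
print(completeness(vector_top5, required))          # 0.25
print(completeness(after_expansion, required))      # 1.0
```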

Datasets

Script-Generated Data

Run the generator — no external download needed:

python demo/generate_1000_skills.py

Generates 1000 skills across 43 domains with 369 dependency edges.

ClawSkills (Real Data)

26,502 public skills from the ClawHub/OpenClaw ecosystem:

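# Option 1: Clone from GitHub
git clone https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis.git
mkdir -p test_data_v2/ClawSkills_dataset   # test_data_v2/ is not committed, so create the target first
cp Clawhub-Skills-Analysis/test_data_v2/ClawSkills_dataset/* test_data_v2/ClawSkills_dataset/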

# Option 2: Download manually
# https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis
# Extract to test_data_v2/ClawSkills_dataset/

After downloading, convert the dataset and then run the benchmark:

python demo/convert_clawskills.py
python demo/benchmark_clawskills.py

Quick Start

Installation

git clone https://github.com/wayyoungboy/skill_graph_search.git
cd skill_graph_search
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run Demo

# Full 8-step demo
python demo/demo.py

# Generate 1000 skills
python demo/generate_1000_skills.py

Run Benchmarks

# Script-generated data (ONNX embedding)
python demo/benchmark.py

# ClawSkills real data
python demo/convert_clawskills.py   # convert first
python demo/benchmark_clawskills.py # then benchmark

Run Tests

python -m pytest tests/ -v

Project Structure

skill_graph_search/
├── .gitignore
├── LICENSE
├── README.md                      # English documentation
├── README_zh.md                   # Chinese documentation
├── requirements.txt
├── REPORT_generated.md            # Script data (ONNX) benchmark report
├── REPORT_clawskills.md           # ClawSkills real data benchmark report
├── REPORT_embedding.md            # Script data (OpenAI) benchmark report
├── src/
│   ├── __init__.py
│   ├── models.py                  # Skill / SkillBundle data models
│   ├── graph.py                   # Dependency graph and topological sort
│   ├── registry.py                # SeekDB CRUD wrapper
│   └── retriever.py               # Two-phase retrieval engine + recursive expansion
├── demo/
│   ├── seed_skills.json           # Seed skill data
│   ├── generate_1000_skills.py    # 1000-skill generator
│   ├── convert_clawskills.py      # ClawSkills JSONL → project format
│   ├── demo.py                    # 8-step full demo
│   ├── benchmark.py               # Script data benchmark
│   └── benchmark_clawskills.py    # ClawSkills benchmark
├── tests/
│   └── test_retriever.py          # 8 unit tests
└── test_data_v2/                  # Large datasets (not committed)
    └── ClawSkills_dataset/        # ClawHub 26,502 skills

Core API

Data Model

from src.models import Skill

skill = Skill(
    id="video_traffic_analyzer",
    name="Video Traffic Analyzer",
    description="Analyze pedestrian traffic in videos",
    category="video_analysis",
    dependencies=["frame_extractor", "object_counter"],
    provides=["traffic_count"],
    alternatives=["lite_video_analyzer"],
)

Register Skills

from src.registry import SkillRegistry

registry = SkillRegistry(path="./seekdb_data")
registry.seed_from_json("demo/seed_skills.json")

Dependency-Aware Retrieval

from src.graph import SkillGraph
from src.retriever import SkillRetriever

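# Build the in-memory dependency graph from all registered skills
# (`registry` is the SkillRegistry created under "Register Skills" above)
graph = SkillGraph()
for s in registry.list_all():
    graph.add_skill(s)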

retriever = SkillRetriever(registry, graph)

# Method 1: Semantic search + dependency expansion
bundle = retriever.retrieve("video pedestrian traffic analysis", n_results=2)

# Method 2: HybridSearch (Vector + BM25 + RRF) + dependency expansion
bundle = retriever.retrieve_hybrid("video pedestrian traffic analysis", n_results=2)

print(f"Seeds: {[s.name for s in bundle.seeds]}")
print(f"Dependencies: {[s.name for s in bundle.dependencies]}")
print(f"Execution order: {bundle.execution_order}")
print(f"Estimated tokens: ~{bundle.total_tokens_estimate}")
