Skill Graph Search

Dependency-Aware Skill Retrieval Engine — Using SeekDB to solve the "cliff effect" in AI skill systems

Problem

Inspiration: the article The Truth About Skill Systems, which observes that giving an AI a thousand skills should in theory make it omnipotent, yet in practice the benefit is near zero, because semantic retrieval finds only skills that seem relevant and misses the set that actually works together.

Root cause: Semantic retrieval finds skills that "look relevant" but misses the dependencies needed to execute them. A top-level skill (e.g., "video analysis") requires underlying dependencies (frame extraction, object counting, format conversion) to function. Pure vector search returns a list of skills missing these dependency chains, leaving the AI with an empty shell.

Semantic similarity ≠ Execution sufficiency.
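
To make the gap concrete, here is a minimal sketch built on the video example above. The skill IDs, the `DEPENDS_ON` mapping, and the `missing_dependencies` helper are illustrative assumptions and not part of this project's code.

```python
# Hypothetical dependency map for the "video analysis" example.
DEPENDS_ON = {
    "video_analysis": ["frame_extraction", "object_counting", "format_conversion"],
    "frame_extraction": [],
    "object_counting": [],
    "format_conversion": [],
}


def missing_dependencies(retrieved: list[str]) -> set[str]:
    """Return direct dependencies of the retrieved skills that were not retrieved."""
    have = set(retrieved)
    needed = {dep for skill in retrieved for dep in DEPENDS_ON.get(skill, [])}
    return needed - have


# Suppose vector search returns the top-level skill plus one of its dependencies.
top_k = ["video_analysis", "frame_extraction"]
missing = missing_dependencies(top_k)
print(sorted(missing))  # ['format_conversion', 'object_counting']
```

A result list can look relevant and still be unusable: the two missing skills never appear unless something follows the dependency edges explicitly.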

Why SeekDB?

Dependency-aware retrieval requires two capabilities working together: vector search to find seed skills, and relational filtering to expand dependency chains. Conventional stacks usually split these two capabilities across separate systems.

Architecture Comparison

Approach	Vector Search	Dependency Expansion	Systems Required	Complexity
Milvus / Pinecone	Yes	No: missing precise ID lookup	2 (vector DB + MySQL)	High: two databases, data sync, distributed transactions
MySQL / PostgreSQL	No: requires external embedding service	Yes	2 (database + embedding service)	High: manual vector management, no HNSW
Redis + pgvector	Yes	Yes	2 (Redis + PostgreSQL)	High: separate systems, no cross-database joins
SeekDB	Yes	Yes	1	Low: single database, unified API, embedded mode

Of the options compared above, only SeekDB provides vector semantic search, BM25 full-text search, and relational filtering in a single database, which removes the need for a multi-system architecture, data synchronization, and cross-database joins.

Five Capabilities in One System

Capability	Implementation	What It Does
Vector search	HNSW index, cosine distance	Finds "seemingly relevant" seed skills
BM25 full-text	Built-in inverted index	Exact match for technical terms and names
Relational filter	SQL on metadata fields	Precise dependency expansion by ID
Hybrid search	RRF fusion of vector + BM25	Best of both worlds in a single query
Embedded mode	Local file storage, no external service	Zero deployment, works in any Python project
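
For readers unfamiliar with RRF, the following is a minimal sketch of standard Reciprocal Rank Fusion over two ranked ID lists. It illustrates the general technique only: the function name `rrf_fuse`, the sample IDs, and the constant `k = 60` are assumptions, and the parameters SeekDB uses internally are not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k = 60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector query and a BM25 query.
vector_hits = ["video_traffic_analyzer", "frame_extractor", "object_counter"]
bm25_hits = ["object_counter", "video_traffic_analyzer", "format_converter"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between cosine distances and BM25 scores; IDs that rank well in both lists rise to the top.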

How This Project Uses SeekDB

User query ("analyze video pedestrian traffic")
    │
    ▼
┌─────────────────────────────────────────┐
│  Phase 1: Vector semantic search        │  ← SeekDB HNSW vector
│            → seed skills                 │     single cosine query
├─────────────────────────────────────────┤
│  Phase 2: Recursive dependency expansion │  ← SeekDB metadata filter
│            → full skill closure          │     collection.get(ids=[...])
├─────────────────────────────────────────┤
│  Phase 3: Topological sort              │  ← in-memory DAG
│            → executable order            │     (client-side, lightweight)
└─────────────────────────────────────────┘

Phases 1 and 2 both run against the same SeekDB instance — no cross-database calls, no data sync, no separate vector service.
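
Phases 2 and 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `src/retriever.py`: the in-memory `SKILLS` table and `fetch_by_ids` stand in for the exact-ID lookup against SeekDB, and all skill IDs are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical skill table: skill ID -> IDs of its direct dependencies.
SKILLS = {
    "video_traffic_analyzer": ["frame_extractor", "object_counter"],
    "object_counter": ["frame_extractor"],
    "frame_extractor": ["format_converter"],
    "format_converter": [],
    "pdf_parser": [],  # unrelated skill; should never enter the closure
}


def fetch_by_ids(ids: list[str]) -> dict[str, list[str]]:
    """Stand-in for an exact-ID lookup against the skill store."""
    return {skill_id: SKILLS[skill_id] for skill_id in ids if skill_id in SKILLS}


def expand_closure(seeds: list[str]) -> dict[str, list[str]]:
    """Phase 2: follow dependency IDs level by level until nothing new appears."""
    closure: dict[str, list[str]] = {}
    frontier = list(seeds)
    while frontier:
        found = fetch_by_ids(frontier)  # one batched lookup per level
        closure.update(found)
        referenced = {dep for deps in found.values() for dep in deps}
        frontier = sorted(referenced - set(closure))
    return closure


def execution_order(closure: dict[str, list[str]]) -> list[str]:
    """Phase 3: order skills so that every dependency precedes its dependents."""
    # TopologicalSorter expects node -> predecessors, which is exactly
    # skill -> dependencies; it raises CycleError on circular dependencies.
    return list(TopologicalSorter(closure).static_order())


closure = expand_closure(["video_traffic_analyzer"])
order = execution_order(closure)
print(order)
```

Batching one lookup per dependency level keeps the number of round trips proportional to the depth of the chain rather than to the number of skills in it.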

Hybrid mode (vector + BM25 + RRF) is also available via retrieve_hybrid() for cases where keyword matching complements semantic search.

Benchmarks

Key Findings

Pure vector retrieval misses 30%–40% of execution dependencies. Switching to an embedding with 4× the dimensions (1536 vs. 384) yields the same miss rate on the script-generated dataset. The problem is therefore not embedding quality: semantic similarity as a paradigm cannot capture dependency relationships. SeekDB finds seeds via vector search and dependencies via relational lookup, achieving 100% dependency completeness while loading roughly 1/450 to 1/500 of the tokens (about 99.8% savings).

Three Experiments Overview

Dataset	Scale	Embedding	Vector Miss Rate	SeekDB Completeness	Token Ratio
Script-generated	1000 skills, 43 domains	all-MiniLM-L6-v2 (384-dim, ONNX)	30%	100%	~1:450
ClawSkills (real)	1054 skills	all-MiniLM-L6-v2 (384-dim, ONNX)	40%	100%	~1:500
Script-generated	1000 skills, 43 domains	text-embedding-3-small (1536-dim, OpenAI)	30%	100%	~1:450

Key evidence: on the script-generated dataset, the 1536-dim OpenAI embedding and the 384-dim ONNX model produce identical miss rates (30%), which indicates that the problem is structural rather than a matter of embedding quality.

Detailed Reports

Script-generated (ONNX): REPORT_generated.md
ClawSkills (real data): REPORT_clawskills.md
Script-generated (OpenAI): REPORT_embedding.md

Script-generated (ONNX) — Selected Results

Query               Vector  Missing  Vanilla   Tokens   Ours   Complete  Savings
─────────────────── ──────  ──────   ──────    ──────── ────── ──────    ────────
Image Pipeline      5       0        1000      500K     2      ✓         ~100%
ML Training         5       2        1000      500K     2      ✓         ~100%
ETL Pipeline        5       3        1000      500K     4      ✓         ~100%
Security Scan       5       0        1000      500K     2      ✓         ~100%
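
The "~100%" savings figures can be reproduced with simple arithmetic. The sketch below assumes that the "Ours" column counts the skills loaded into context and that every skill costs the same number of tokens (500, derived from 500K tokens for 1000 skills); the detailed reports may estimate tokens differently.

```python
VANILLA_SKILLS = 1000
VANILLA_TOKENS = 500_000
# Assumed uniform cost per skill, derived from the table: 500K / 1000 = 500.
TOKENS_PER_SKILL = VANILLA_TOKENS / VANILLA_SKILLS


def token_savings(skills_loaded: int) -> float:
    """Fraction of tokens saved versus loading every skill."""
    return 1 - (skills_loaded * TOKENS_PER_SKILL) / VANILLA_TOKENS


for query, ours in [("Image Pipeline", 2), ("ETL Pipeline", 4)]:
    print(f"{query}: {token_savings(ours):.1%}")
# Image Pipeline: 99.8%
# ETL Pipeline: 99.6%
```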

ClawSkills (Real Data) — Typical Cases

Query                              Vector  Missing  Vanilla   Tokens   Ours   Complete
────────────────────────────────── ──────  ──────   ──────    ──────── ────── ──────
workspace governance               5       3        1054      527K     4      ✓
security vulnerability scan        5       3        1054      527K     5      ✓
parse document PDF                 5       1        1054      527K     3      ✓

The security vulnerability scan query misses three binary dependencies: bin_curl, bin_jq, and bin_openssl. These skills have descriptions like "Git version control" and "JSON processor", which are semantically unrelated to "security vulnerability scan" and therefore invisible to vector search. SeekDB's relational filtering fills this gap: once a seed skill is found, its dependencies are retrieved by exact ID lookup, not by semantic similarity.

Datasets

Script-Generated Data

Run the generator — no external download needed:

python demo/generate_1000_skills.py

Generates 1000 skills across 43 domains with 369 dependency edges.
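
For orientation, here is one way a synthetic dataset of this shape could be produced while guaranteeing that the dependency graph has no cycles. This is a hypothetical sketch, not the logic of `demo/generate_1000_skills.py`; the function name, ID scheme, and record fields are assumptions.

```python
import random


def generate_skills(n_skills: int, n_domains: int, n_edges: int, seed: int = 0) -> list[dict]:
    """Create skill records with random, acyclic dependency edges."""
    rng = random.Random(seed)  # fixed seed for reproducible output
    skills = [
        {
            "id": f"skill_{i:04d}",
            "category": f"domain_{i % n_domains:02d}",
            "dependencies": [],
        }
        for i in range(n_skills)
    ]
    edges: set[tuple[int, int]] = set()
    while len(edges) < n_edges:
        a, b = rng.sample(range(n_skills), 2)
        # A skill may only depend on a lower-numbered skill, so no cycle can form.
        edges.add((max(a, b), min(a, b)))
    for dependent, dependency in sorted(edges):
        skills[dependent]["dependencies"].append(skills[dependency]["id"])
    return skills


skills = generate_skills(n_skills=1000, n_domains=43, n_edges=369)
print(len(skills), sum(len(s["dependencies"]) for s in skills))  # 1000 369
```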

ClawSkills (Real Data)

26,502 public skills from the ClawHub/OpenClaw ecosystem:

# Option 1: Clone from GitHub
git clone https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis.git
mkdir -p test_data_v2/ClawSkills_dataset   # target directory is not committed
cp Clawhub-Skills-Analysis/test_data_v2/ClawSkills_dataset/* test_data_v2/ClawSkills_dataset/

# Option 2: Download manually
# https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis
# Extract to test_data_v2/ClawSkills_dataset/

After downloading, convert the dataset and then run the benchmark:

python demo/convert_clawskills.py
python demo/benchmark_clawskills.py

Quick Start

Installation

git clone https://github.com/wayyoungboy/skill_graph_search.git
cd skill_graph_search
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run Demo

# Full 8-step demo
python demo/demo.py

# Generate 1000 skills
python demo/generate_1000_skills.py

Run Benchmarks

# Script-generated data (ONNX embedding)
python demo/benchmark.py

# ClawSkills real data
python demo/convert_clawskills.py   # convert first
python demo/benchmark_clawskills.py # then benchmark

Run Tests

python -m pytest tests/ -v

Project Structure

skill_graph_search/
├── .gitignore
├── LICENSE
├── README.md                      # English documentation
├── README_zh.md                   # Chinese documentation
├── requirements.txt
├── REPORT_generated.md            # Script data (ONNX) benchmark report
├── REPORT_clawskills.md           # ClawSkills real data benchmark report
├── REPORT_embedding.md            # Script data (OpenAI) benchmark report
├── src/
│   ├── __init__.py
│   ├── models.py                  # Skill / SkillBundle data models
│   ├── graph.py                   # Dependency graph and topological sort
│   ├── registry.py                # SeekDB CRUD wrapper
│   └── retriever.py               # Two-phase retrieval engine + recursive expansion
├── demo/
│   ├── seed_skills.json           # Seed skill data
│   ├── generate_1000_skills.py    # 1000-skill generator
│   ├── convert_clawskills.py      # ClawSkills JSONL → project format
│   ├── demo.py                    # 8-step full demo
│   ├── benchmark.py               # Script data benchmark
│   └── benchmark_clawskills.py    # ClawSkills benchmark
├── tests/
│   └── test_retriever.py          # 8 unit tests
└── test_data_v2/                  # Large datasets (not committed)
    └── ClawSkills_dataset/        # ClawHub 26,502 skills

Core API

Data Model

from src.models import Skill

skill = Skill(
    id="video_traffic_analyzer",
    name="Video Traffic Analyzer",
    description="Analyze pedestrian traffic in videos",
    category="video_analysis",
    dependencies=["frame_extractor", "object_counter"],
    provides=["traffic_count"],
    alternatives=["lite_video_analyzer"],
)
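
The fields used above suggest a simple record type. The following is a minimal sketch of such a model, assuming a dataclass with list fields defaulting to empty; the actual `Skill` class in `src/models.py` may be defined differently, so the sketch uses the distinct name `SkillSketch`.

```python
from dataclasses import dataclass, field


@dataclass
class SkillSketch:
    """Illustrative stand-in for a skill record; not the project's Skill class."""

    id: str
    name: str
    description: str
    category: str = ""
    # default_factory gives each instance its own list instead of a shared one.
    dependencies: list[str] = field(default_factory=list)
    provides: list[str] = field(default_factory=list)
    alternatives: list[str] = field(default_factory=list)


sketch = SkillSketch(
    id="video_traffic_analyzer",
    name="Video Traffic Analyzer",
    description="Analyze pedestrian traffic in videos",
    category="video_analysis",
    dependencies=["frame_extractor", "object_counter"],
)
print(sketch.dependencies)  # ['frame_extractor', 'object_counter']
```

Storing dependencies as explicit IDs on each record is what makes exact-ID expansion possible; the description field alone would leave only semantic similarity to work with.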

Register Skills

from src.registry import SkillRegistry

registry = SkillRegistry(path="./seekdb_data")
registry.seed_from_json("demo/seed_skills.json")

Dependency-Aware Retrieval

from src.graph import SkillGraph
from src.retriever import SkillRetriever

graph = SkillGraph()
for s in registry.list_all():
    graph.add_skill(s)

retriever = SkillRetriever(registry, graph)

# Method 1: Semantic search + dependency expansion
bundle = retriever.retrieve("video pedestrian traffic analysis", n_results=2)

# Method 2: HybridSearch (Vector + BM25 + RRF) + dependency expansion
bundle = retriever.retrieve_hybrid("video pedestrian traffic analysis", n_results=2)

print(f"Seeds: {[s.name for s in bundle.seeds]}")
print(f"Dependencies: {[s.name for s in bundle.dependencies]}")
print(f"Execution order: {bundle.execution_order}")
print(f"Estimated tokens: ~{bundle.total_tokens_estimate}")

References

Inspiration: The Truth About Skill Systems
SeekDB: oceanbase/seekdb
ClawSkills: ClawHub/OpenClaw Dataset
