Dependency-Aware Skill Retrieval Engine — Using SeekDB to solve the "cliff effect" in AI skill systems
Inspiration: Based on the article The Truth About Skill Systems, which points out that giving AI a thousand skills should make it omnipotent in theory, but in practice the benefit is near zero — because semantic retrieval only finds "seemingly relevant" skills and misses "the set that actually works together".
Give AI a thousand skills and it should be omnipotent. But in reality, the benefit is near zero.
Root cause: Semantic retrieval finds skills that "look relevant" but misses the dependencies needed to execute them. A top-level skill (e.g., "video analysis") requires underlying dependencies (frame extraction, object counting, format conversion) to function. Pure vector search returns a list of skills missing these dependency chains, leaving the AI with an empty shell.
Semantic similarity ≠ Execution sufficiency.
Dependency-aware retrieval requires two capabilities working together: vector search to find seed skills, and relational filtering to expand dependency chains. No single traditional database provides both.
| Approach | Vector Search | Dependency Expansion | Systems Required | Complexity |
|---|---|---|---|---|
| Milvus / Pinecone alone | Yes | No — missing precise ID lookup | 1 + MySQL | High: two databases, data sync, distributed transactions |
| MySQL / PostgreSQL alone | No — requires external embedding service | Yes | 1 + embedding service | High: manual vector management, no HNSW |
| Redis + pgvector | Yes | Yes | 2 + Redis | High: separate systems, cross-database joins impossible |
| SeekDB | Yes | Yes | 1 | Low: single database, unified API, embedded mode |
SeekDB is the only option that provides vector semantic search, BM25 full-text search, and relational filtering in a single database — eliminating the need for multi-system architecture, data synchronization, and cross-database joins.
| Capability | Implementation | What It Does |
|---|---|---|
| Vector search | HNSW index, cosine distance | Finds "seemingly relevant" seed skills |
| BM25 full-text | Built-in inverted index | Exact match for technical terms and names |
| Relational filter | SQL on metadata fields | Precise dependency expansion by ID |
| Hybrid search | RRF fusion of vector + BM25 | Best of both worlds in a single query |
| Embedded mode | Local file storage, no external service | Zero deployment, works in any Python project |
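The hybrid row relies on Reciprocal Rank Fusion (RRF), which merges ranked lists by summing `1 / (k + rank)` for each item. The sketch below is a generic illustration of that formula, not SeekDB's internal implementation. The function name `rrf_fuse`, the constant `k = 60` (a commonly used default), and the skill ID `format_converter` are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, skill_id in enumerate(ranking, start=1):
            scores[skill_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))


# Hypothetical result lists from a vector query and a BM25 query.
vector_hits = ["video_traffic_analyzer", "object_counter", "frame_extractor"]
bm25_hits = ["frame_extractor", "video_traffic_analyzer", "format_converter"]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Skills that rank well in both lists rise to the top, while a skill found by only one retriever still survives in the fused list.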
User query ("analyze video pedestrian traffic")
│
▼
┌─────────────────────────────────────────┐
│ Phase 1: Vector semantic search │ ← SeekDB HNSW vector
│ → seed skills │ single cosine query
├─────────────────────────────────────────┤
│ Phase 2: Recursive dependency expansion │ ← SeekDB metadata filter
│ → full skill closure │ collection.get(ids=[...])
├─────────────────────────────────────────┤
│ Phase 3: Topological sort │ ← in-memory DAG
│ → executable order │ (client-side, lightweight)
└─────────────────────────────────────────┘
Phases 1 and 2 both run against the same SeekDB instance — no cross-database calls, no data sync, no separate vector service.
Hybrid mode (vector + BM25 + RRF) is also available via retrieve_hybrid() for cases where keyword matching complements semantic search.
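Phases 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in `src/retriever.py` or `src/graph.py`: the `DEPENDENCIES` dictionary stands in for the registry, the function names are hypothetical, and in the real engine each expansion step would be an ID lookup against SeekDB rather than a dictionary access.

```python
from collections import deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy registry: skill ID -> IDs of its direct dependencies.
DEPENDENCIES = {
    "video_traffic_analyzer": ["frame_extractor", "object_counter"],
    "object_counter": ["frame_extractor"],
    "frame_extractor": ["format_converter"],
    "format_converter": [],
    "audio_transcriber": [],  # unrelated skill, should never be pulled in
}


def expand_closure(seeds, deps):
    """Phase 2: collect the seeds plus everything they transitively need."""
    closure, queue = set(seeds), deque(seeds)
    while queue:
        current = queue.popleft()
        for dep in deps.get(current, []):
            if dep not in closure:
                closure.add(dep)
                queue.append(dep)
    return closure


def execution_order(closure, deps):
    """Phase 3: order the closure so every skill follows its dependencies."""
    graph = {sid: [d for d in deps.get(sid, []) if d in closure]
             for sid in sorted(closure)}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


seeds = ["video_traffic_analyzer"]  # pretend Phase 1 returned this seed
closure = expand_closure(seeds, DEPENDENCIES)
order = execution_order(closure, DEPENDENCIES)
print(order)
```

The breadth-first expansion visits each skill once, so the cost grows with the size of the closure rather than with the size of the full registry.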
Pure vector retrieval misses 30%–40% of execution dependencies. Switching to an embedding with 4× the dimensions (384 → 1536) yields the same miss rate on the same dataset. The problem is not embedding quality: semantic similarity as a paradigm cannot cover dependency relationships. SeekDB finds seeds via vector search and dependencies via relational lookup, achieving 100% dependency completeness with ~100% token savings.
| Dataset | Scale | Embedding | Vector Miss Rate | SeekDB Completeness | Token Ratio |
|---|---|---|---|---|---|
| Script-generated | 1000 skills, 43 domains | all-MiniLM-L6-v2 (384-dim, ONNX) | 30% | 100% | ~1:450 |
| ClawSkills (real) | 1054 skills | all-MiniLM-L6-v2 (384-dim, ONNX) | 40% | 100% | ~1:500 |
| Script-generated | 1000 skills, 43 domains | text-embedding-3-small (1536-dim, OpenAI) | 30% | 100% | ~1:450 |
Key evidence: on the script-generated dataset, the 1536-dim OpenAI embedding and the 384-dim ONNX model produce identical miss rates (30%), which indicates that the problem is structural rather than a matter of embedding quality.
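The "~100% token savings" figure can be sanity-checked with simple arithmetic. The sketch below takes the benchmark's vanilla baseline for the script-generated dataset (1000 skills, 500K tokens) and assumes, purely for illustration, that every skill costs the same number of tokens. The real per-skill cost will vary, so treat the output as an approximation.

```python
VANILLA_SKILLS = 1000      # all skills loaded into the prompt
VANILLA_TOKENS = 500_000   # reported token cost of that baseline

# Assumption: uniform cost per skill (500 tokens each).
TOKENS_PER_SKILL = VANILLA_TOKENS / VANILLA_SKILLS


def savings_percent(bundle_size):
    """Token savings of a dependency-complete bundle vs. loading everything."""
    bundle_tokens = bundle_size * TOKENS_PER_SKILL
    return round((1 - bundle_tokens / VANILLA_TOKENS) * 100, 1)


# Bundle sizes of 2 and 4 skills match the "Ours" column of the benchmark.
for size in (2, 4):
    print(f"{size} skills -> {savings_percent(size)}% saved")
```

Even the larger four-skill bundle stays above 99%, which is why the reports round the savings to ~100%.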
- Script-generated (ONNX): REPORT_generated.md
- ClawSkills (real data): REPORT_clawskills.md
- Script-generated (OpenAI): REPORT_embedding.md
| Query | Vector | Missing | Vanilla | Tokens | Ours | Complete | Savings |
|---|---|---|---|---|---|---|---|
| Image Pipeline | 5 | 0 | 1000 | 500K | 2 | ✓ | ~100% |
| ML Training | 5 | 2 | 1000 | 500K | 2 | ✓ | ~100% |
| ETL Pipeline | 5 | 3 | 1000 | 500K | 4 | ✓ | ~100% |
| Security Scan | 5 | 0 | 1000 | 500K | 2 | ✓ | ~100% |

| Query | Vector | Missing | Vanilla | Tokens | Ours | Complete |
|---|---|---|---|---|---|---|
| workspace governance | 5 | 3 | 1054 | 527K | 4 | ✓ |
| security vulnerability scan | 5 | 3 | 1054 | 527K | 5 | ✓ |
| parse document PDF | 5 | 1 | 1054 | 527K | 3 | ✓ |

Security queries miss 3 binary dependencies: bin_curl, bin_jq, bin_openssl. These skills have descriptions like "Git version control" and "JSON processor" — completely unrelated semantically to "security vulnerability scan", making them invisible to vector search. SeekDB's relational filtering fills this gap: once a seed skill is found, its dependencies are retrieved by exact ID lookup, not by semantic similarity.
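The gap described above can be made concrete with a small check that compares what a vector-only result contains against what the seed skill declares as dependencies. This is a hedged sketch: the seed ID `security_scanner` and the other vector hits are hypothetical, and only the three `bin_*` IDs come from the benchmark.

```python
# Hypothetical registry keyed by skill ID, listing declared dependencies.
REGISTRY = {
    "security_scanner": ["bin_curl", "bin_jq", "bin_openssl"],
    "bin_curl": [],
    "bin_jq": [],
    "bin_openssl": [],
}


def missing_dependencies(retrieved_ids, seed_id, registry):
    """Return declared dependencies of the seed that were not retrieved."""
    required = set(registry.get(seed_id, []))
    return sorted(required - set(retrieved_ids))


# Top-5 from a vector-only search: all semantically close, none are the bins.
vector_top5 = ["security_scanner", "vuln_report", "cve_lookup",
               "port_audit", "secret_detector"]
before = missing_dependencies(vector_top5, "security_scanner", REGISTRY)

# Exact ID lookup: fetch the seed's dependencies by ID, ignoring similarity.
bundle = ["security_scanner"] + REGISTRY["security_scanner"]
after = missing_dependencies(bundle, "security_scanner", REGISTRY)

print("vector only:", before)
print("with ID lookup:", after)
```

The second call returns an empty list because the dependencies are fetched by their IDs, so their descriptions never need to resemble the query.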
Run the generator — no external download needed:
python demo/generate_1000_skills.py

Generates 1000 skills across 43 domains with 369 dependency edges.
26,502 public skills from the ClawHub/OpenClaw ecosystem:
# Option 1: Clone from GitHub
git clone https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis.git
cp Clawhub-Skills-Analysis/test_data_v2/ClawSkills_dataset/* test_data_v2/ClawSkills_dataset/
# Option 2: Download manually
# https://github.com/Tomsawyerhu/Clawhub-Skills-Analysis
# Extract to test_data_v2/ClawSkills_dataset/

After downloading, run the conversion:
python demo/convert_clawskills.py
python demo/benchmark_clawskills.py

git clone https://github.com/wayyoungboy/skill_graph_search.git
cd skill_graph_search
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Full 8-step demo
python demo/demo.py
# Generate 1000 skills
python demo/generate_1000_skills.py

# Script-generated data (ONNX embedding)
python demo/benchmark.py
# ClawSkills real data
python demo/convert_clawskills.py # convert first
python demo/benchmark_clawskills.py # then benchmark

python -m pytest tests/ -v

skill_graph_search/
├── .gitignore
├── LICENSE
├── README.md # English documentation
├── README_zh.md # Chinese documentation
├── requirements.txt
├── REPORT_generated.md # Script data (ONNX) benchmark report
├── REPORT_clawskills.md # ClawSkills real data benchmark report
├── REPORT_embedding.md # Script data (OpenAI) benchmark report
├── src/
│ ├── __init__.py
│ ├── models.py # Skill / SkillBundle data models
│ ├── graph.py # Dependency graph and topological sort
│ ├── registry.py # SeekDB CRUD wrapper
│ └── retriever.py # Two-phase retrieval engine + recursive expansion
├── demo/
│ ├── seed_skills.json # Seed skill data
│ ├── generate_1000_skills.py # 1000-skill generator
│ ├── convert_clawskills.py # ClawSkills JSONL → project format
│ ├── demo.py # 8-step full demo
│ ├── benchmark.py # Script data benchmark
│ └── benchmark_clawskills.py # ClawSkills benchmark
├── tests/
│ └── test_retriever.py # 8 unit tests
└── test_data_v2/ # Large datasets (not committed)
    └── ClawSkills_dataset/ # ClawHub 26,502 skills

from src.models import Skill

skill = Skill(
    id="video_traffic_analyzer",
    name="Video Traffic Analyzer",
    description="Analyze pedestrian traffic in videos",
    category="video_analysis",
    dependencies=["frame_extractor", "object_counter"],
    provides=["traffic_count"],
    alternatives=["lite_video_analyzer"],
)

from src.registry import SkillRegistry

registry = SkillRegistry(path="./seekdb_data")
registry.seed_from_json("demo/seed_skills.json")

from src.graph import SkillGraph
from src.retriever import SkillRetriever
graph = SkillGraph()
# Load every registered skill into the in-memory dependency graph
for s in registry.list_all():
    graph.add_skill(s)
retriever = SkillRetriever(registry, graph)
# Method 1: Semantic search + dependency expansion
bundle = retriever.retrieve("video pedestrian traffic analysis", n_results=2)
# Method 2: HybridSearch (Vector + BM25 + RRF) + dependency expansion
bundle = retriever.retrieve_hybrid("video pedestrian traffic analysis", n_results=2)
print(f"Seeds: {[s.name for s in bundle.seeds]}")
print(f"Dependencies: {[s.name for s in bundle.dependencies]}")
print(f"Execution order: {bundle.execution_order}")
print(f"Estimated tokens: ~{bundle.total_tokens_estimate}")

- Inspiration: The Truth About Skill Systems
- SeekDB: oceanbase/seekdb
- ClawSkills: ClawHub/OpenClaw Dataset