Curate and index documentation from any website into collections like tailwind/, horses/, etc. Reference collection indexes in your AI chats (e.g. @tailwind/INDEX.xml what's a utility?) so that only relevant docs are analysed. Much cleaner than a web-fetch and more focussed than a web-search. Keep your AI context sharp.
Available collections in this repo:
| Collection | Collection Index | Description | Scraped | Source |
|---|---|---|---|---|
| 📦 biome/ | 📄 INDEX.xml | Fast linter/formatter | 2025-11-04 | Official |
| 📦 claudecode/ | 📄 INDEX.xml | Anthropic Claude Code | 2026-02-05 | Official |
| 📦 claudeplat/ | 📄 INDEX.xml | Anthropic Claude Platform | 2026-01-07 | Official |
| 📦 clerk/ | 📄 INDEX.xml | Authentication | 2025-12-03 | Official |
| 📦 convex/ | 📄 INDEX.xml | Reactive database | 2026-01-07 | Official |
| 🪝 lefthook/ | 📄 INDEX.xml | Git hooks manager | 2025-11-24 | Official |
| 📦 marimo/ | 📄 INDEX.xml | Reactive Python notebooks | 2025-11-11 | Official |
| 📦 nextjs/ | 📄 INDEX.xml | React framework | 2025-12-02 | Official |
| 📦 playwright/ | 📄 INDEX.xml | Browser testing | 2025-11-07 | Official |
| 📦 shadcn/ | 📄 INDEX.xml | React UI components | 2025-12-16 | Official, Guide |
| 📦 shiny/ | 📄 INDEX.xml | Python web apps | 2025-11-02 | Official |
| 📦 tailwind/ | 📄 INDEX.xml | CSS framework | 2025-10-15 | Official |
| 📦 tailwindplus/ | 📄 INDEX.xml | Paid UI Components | 2025-11-16 | Official |
| 📦 uv/ | 📄 INDEX.xml | Python projects | 2026-05-19 | Official |
| 📦 vercel/ | 📄 INDEX.xml | Deployment platform | 2025-10-20 | Official |
| 📦 vitest/ | 📄 INDEX.xml | Testing framework | 2025-11-05 | Official |
| 📦 zustand/ | 📄 INDEX.xml | State management | 2026-01-03 | Official |
Curate your own collections. The lefthook collection is non-standard — docs are downloaded directly from GitHub. For Anthropic docs use this tool.
# 1. Install UV
# 👉 https://docs.astral.sh/uv/getting-started/installation/
# 2. Clone repository
git clone https://github.com/michellepace/docs-for-ai.git
cd docs-for-ai
# 3. Get free FireCrawl API key (Only GitHub sources are downloaded directly)
# Visit: https://www.firecrawl.dev/app/api-keys
# 4. Add to your shell profile
echo 'export API_KEY_MCP_FIRECRAWL=your-api-key-here' >> ~/.zshrc
source ~/.zshrc # Use ~/.bashrc if that's your shell
Important: Edit the paths in .claude/commands/ask-docs.md to match your local setup. To use from anywhere, move it to ~/.claude/commands/.
| Slash Command | Purpose | .md Files | INDEX <source> |
|---|---|---|---|
| /curate-doc <collection> <url> | Add new or re-scrape | ✅ Write | ✅ Add/update INDEX.xml |
| /rescrape-docs <collection> | Re-scrape all docs | ✅ Write all | ✅ Selective update INDEX.xml |
| /improve-index-xml <collection> | Batch improve descriptions | 📖 Read | ✅ Update INDEX.xml |
| /ask-docs <collection> <question> | Query any collection | Docs analysed | Relevant docs identified |
Assume tailwind was not already a collection in this repo:
# Start a new collection
/curate-doc tailwind https://tailwindcss.com/docs/customizing-colors
# → Creates tailwind/ collection directory, with README.md + INDEX.xml, and first curated doc
# Re-scrape existing doc (refresh content from same URL)
/curate-doc tailwind https://tailwindcss.com/docs/customizing-colors
# → Re-scrapes, writes .md file, replaces source in INDEX.xml
# Curate a new doc into collection
/curate-doc tailwind https://tailwindcss.com/docs/styling-with-utility-classes
# → Scrapes page into collection, writes .md file, adds source to INDEX.xml
# Re-scrape all docs in collection
/rescrape-docs tailwind
# → Re-scrapes all URLs in INDEX.xml, writes all .md files, updates descriptions for changed content
# ✨ Use the docs
/ask-docs tailwind Please evaluate my project for correct usage of utility classes?
# → Searches tailwind/INDEX.xml for relevant docs, analyses these, gives you an answer
Workflow: Python script fetches from source URL → writes .md file → creates INDEX.xml entry with PLACEHOLDER description → Claude Code generates semantic description.
The /curate-doc command always regenerates the description, whereas /rescrape-docs only regenerates descriptions for files with content changes.
Source routing: If the source URL is on GitHub, a direct fetch is used instead of FireCrawl.
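The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's actual script: the function name `pick_fetcher` and the host list are assumptions, since the README only says "on GitHub" without naming the hosts that are checked.

```python
from urllib.parse import urlparse

# Hypothetical host list: which hosts count as "GitHub" is an assumption.
GITHUB_HOSTS = {"github.com", "raw.githubusercontent.com"}


def pick_fetcher(source_url: str) -> str:
    """Return "direct" for GitHub-hosted URLs, otherwise "firecrawl"."""
    # urlparse().hostname is already lower-cased, or None for malformed URLs.
    host = urlparse(source_url).hostname or ""
    if host in GITHUB_HOSTS or host.endswith(".github.com"):
        return "direct"
    return "firecrawl"


print(pick_fetcher("https://github.com/michellepace/docs-for-ai"))
print(pick_fetcher("https://tailwindcss.com/docs/customizing-colors"))
```

Matching on the parsed hostname rather than on a substring avoids misrouting URLs that merely contain "github.com" somewhere in their path or in a lookalike domain.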
Directory Structure:
uv/
├── INDEX.xml # Index of all docs
├── README.md
├── api-reference.md # Scraped doc
├── getting-started.md # Scraped doc
└── ...
INDEX.xml Schema:
<docs_index>
<source>
<title>Hello Document Title</title>
<description>20-30 word dense summary optimised for semantic search...</description>
<source_url>https://docs.example.com/hello</source_url>
<local_file>hello-document-title.md</local_file>
<scraped_at>2025-10-15</scraped_at>
</source>
<!-- Multiple <source> entries, one per .md file -->
</docs_index>
Scripts use the FireCrawl Python SDK for general web sources and Python stdlib (urllib.request) for GitHub raw markdown.
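As a rough illustration of how the INDEX.xml schema above can be consumed, the sketch below parses an index with the standard library and returns one dictionary per `<source>` entry. The helper name `read_index` and the inline sample are illustrative assumptions; the repo's own scripts may read the file differently.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the schema shown above (placeholder values).
SAMPLE_INDEX = """<docs_index>
  <source>
    <title>Hello Document Title</title>
    <description>20-30 word dense summary...</description>
    <source_url>https://docs.example.com/hello</source_url>
    <local_file>hello-document-title.md</local_file>
    <scraped_at>2025-10-15</scraped_at>
  </source>
</docs_index>"""


def read_index(xml_text: str) -> list[dict[str, str]]:
    """Return one dict per <source>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in source}
        for source in root.findall("source")
    ]


entries = read_index(SAMPLE_INDEX)
for entry in entries:
    print(entry["local_file"], "<-", entry["source_url"])
```

To read a real collection, pass the file contents instead of the sample, for example `read_index(Path("uv/INDEX.xml").read_text(encoding="utf-8"))`.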
"Semantic search" isn't the right term. The examples also need improving — very keyword-heavy, with redundant starting words. Index should say <summary> instead of <description>.
Each description is a routing signal for an LLM reader. Claude reads INDEX.xml to pick which files answer a question. Optimise for that, not human readability:
Discriminative descriptions
Bottom line: keep the descriptions, drop the "semantic search" framing. What you want is discriminative descriptions — meaningful and keyword-anchored and written to stand apart from their neighbours — read by an LLM, not matched by vectors.
Also, extract all duplicated examples into one .claude/commands/references/examples.md file.
Then regenerate all the PLACEHOLDER descriptions.
Remove the scripts in scripts/ that I don't need. For example, what about a bash for-loop over curate_doc.py, then allocate to subagents (run wc --chars *.md | sort -n or token counts). Read the diff on each local file and assess if the description needs refining (using head).
Pick a cap so no agent is overloaded, on two axes:
- Input size — a target like "~X total words per agent" so context stays comfortable.
- File count — a soft cap (e.g. ≤5 files) so no single agent does too much serial reading, which hurts both speed and summary quality.
The honest rule: isolation matters in proportion to file size and similarity. Give a large or highly-similar file its own agent (where contamination/dilution is real). Batch small, distinct files 3–5 together — the interference there is negligible, and you save the overhead. It's a quality-vs-efficiency trade-off, not a flat "always isolate".
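One way to turn the two caps into a plan is a greedy pass over files sorted by size. The sketch below is an assumption-laden illustration: `plan_batches` is a hypothetical helper, the default caps are placeholders to tune, and it handles only size, so highly-similar files would still need to be separated by hand.

```python
def plan_batches(
    word_counts: dict[str, int],
    max_files: int = 5,        # soft file-count cap (placeholder default)
    max_words: int = 12_000,   # per-agent word budget (placeholder default)
) -> list[list[str]]:
    """Greedily group files so each batch respects both caps."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_words = 0
    # Smallest first: small files pack together, large files end up alone.
    for name, words in sorted(word_counts.items(), key=lambda kv: kv[1]):
        too_many_files = len(current) >= max_files
        too_many_words = bool(current) and current_words + words > max_words
        if too_many_files or too_many_words:
            batches.append(current)
            current, current_words = [], 0
        current.append(name)
        current_words += words
    if current:
        batches.append(current)
    return batches


counts = {"a.md": 1000, "b.md": 2000, "c.md": 3000, "d.md": 9000, "e.md": 15000}
print(plan_batches(counts))
# → [['a.md', 'b.md', 'c.md'], ['d.md'], ['e.md']]
```

A file that exceeds the word budget on its own is never split; it simply gets an agent to itself, which matches the rule of isolating large files.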
For this task I'd set the budget as "≤5 files AND ≤~12k words per agent, whichever binds first". I'd rather use the anthropic tokeniser script in
~/projects/python/TEMP-token-counts/.
But if there's an API cost to running these, use a multiplier (approximated across curated files): tokens ≈ characters × 0.37 (I don't think my sampling was wonky across 49 files).
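The multiplier fallback is simple enough to sketch directly. The function names below are hypothetical, and the 0.37 factor is the approximation quoted above rather than a real tokeniser, so treat the results as rough estimates for budgeting only.

```python
from pathlib import Path

# Multiplier quoted above; an approximation, not an actual tokeniser.
CHARS_TO_TOKENS = 0.37


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count."""
    return round(len(text) * CHARS_TO_TOKENS)


def estimate_collection(directory: str) -> dict[str, int]:
    """Estimate tokens for every .md file directly inside a collection directory."""
    return {
        path.name: estimate_tokens(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.md"))
    }


print(estimate_tokens("a" * 1000))
# → 370
```

Calling `estimate_collection("uv")` from the repo root would give a per-file estimate that can feed a per-agent budget without any API calls.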
