michellepace/docs-for-ai


Curate Docs For AI (with Claude Code)

Curate and index documentation from any website into collections like tailwind/, horses/, etc. Reference collection indexes in your AI chats (e.g. @tailwind/INDEX.xml what's a utility?) so that only relevant docs are analysed. Much cleaner than a web-fetch and more focussed than a web-search. Keep your AI context sharp.

Terminal showing three-step workflow: (1) Running /curate-doc biome command, (2) Curation success output showing scraped documentation and generated INDEX.xml entry, (3) Use /ask-docs to query docs. Handwritten annotations highlight each step.

Complete workflow: curate → auto scrape → "/ask-docs biome Validate my config file please"

📦 Repo Collections

Available collections in this repo:

| Collection | Collection Index | Description | Scraped | Source |
|---|---|---|---|---|
| 📦 biome/ | 📄 INDEX.xml | Fast linter/formatter | 2025-11-04 | Official |
| 📦 claudecode/ | 📄 INDEX.xml | Anthropic Claude Code | 2026-02-05 | Official |
| 📦 claudeplat/ | 📄 INDEX.xml | Anthropic Claude Platform | 2026-01-07 | Official |
| 📦 clerk/ | 📄 INDEX.xml | Authentication | 2025-12-03 | Official |
| 📦 convex/ | 📄 INDEX.xml | Reactive database | 2026-01-07 | Official |
| 🪝 lefthook/ | 📄 INDEX.xml | Git hooks manager | 2025-11-24 | Official |
| 📦 marimo/ | 📄 INDEX.xml | Reactive Python notebooks | 2025-11-11 | Official |
| 📦 nextjs/ | 📄 INDEX.xml | React framework | 2025-12-02 | Official |
| 📦 playwright/ | 📄 INDEX.xml | Browser testing | 2025-11-07 | Official |
| 📦 shadcn/ | 📄 INDEX.xml | React UI components | 2025-12-16 | Official, Guide |
| 📦 shiny/ | 📄 INDEX.xml | Python web apps | 2025-11-02 | Official |
| 📦 tailwind/ | 📄 INDEX.xml | CSS framework | 2025-10-15 | Official |
| 📦 tailwindplus/ | 📄 INDEX.xml | Paid UI Components | 2025-11-16 | Official |
| 📦 uv/ | 📄 INDEX.xml | Python projects | 2026-05-19 | Official |
| 📦 vercel/ | 📄 INDEX.xml | Deployment platform | 2025-10-20 | Official |
| 📦 vitest/ | 📄 INDEX.xml | Testing framework | 2025-11-05 | Official |
| 📦 zustand/ | 📄 INDEX.xml | State management | 2026-01-03 | Official |

Curate your own collections. The lefthook collection is non-standard — docs are downloaded directly from GitHub. For Anthropic docs use this tool.


🚀 Setup

# 1. Install UV
# 👉 https://docs.astral.sh/uv/getting-started/installation/

# 2. Clone repository
git clone https://github.com/michellepace/docs-for-ai.git
cd docs-for-ai

# 3. Get a free FireCrawl API key (needed for all non-GitHub sources; GitHub sources are downloaded directly)
# Visit: https://www.firecrawl.dev/app/api-keys

# 4. Add to your shell profile
echo 'export API_KEY_MCP_FIRECRAWL=your-api-key-here' >> ~/.zshrc
source ~/.zshrc  # Use ~/.bashrc if that's your shell

📖 Usage via Slash Commands

Important

Edit the paths in .claude/commands/ask-docs.md to match your local setup. To use from anywhere, move it to ~/.claude/commands/.

| Slash Command | Purpose | .md Files | INDEX `<source>` |
|---|---|---|---|
| `/curate-doc <collection> <url>` | Add new or re-scrape | ✅ Write | ✅ Add/update INDEX.xml |
| `/rescrape-docs <collection>` | Re-scrape all docs | ✅ Write all | ✅ Selective update INDEX.xml |
| `/improve-index-xml <collection>` | Batch improve descriptions | 📖 Read | ✅ Update INDEX.xml |
| `/ask-docs <collection> <question>` | Query any collection | Docs analysed | Relevant docs identified |

💡 Usage Example

Assume tailwind was not already a collection in this repo:

# Start a new collection
/curate-doc tailwind https://tailwindcss.com/docs/customizing-colors
# → Creates tailwind/ collection directory, with README.md + INDEX.xml, and first curated doc

# Re-scrape existing doc (refresh content from same URL)
/curate-doc tailwind https://tailwindcss.com/docs/customizing-colors
# → Re-scrapes, writes .md file, replaces source in INDEX.xml

# Curate a new doc into collection
/curate-doc tailwind https://tailwindcss.com/docs/styling-with-utility-classes
# → Scrapes page into collection, writes .md file, adds source to INDEX.xml

# Re-scrape all docs in collection
/rescrape-docs tailwind
# → Re-scrapes all URLs in INDEX.xml, writes all .md files, updates descriptions for changed content

# ✨ Use the docs
/ask-docs tailwind Please evaluate my project for correct usage of utility classes?
# → Searches tailwind/INDEX.xml for relevant docs, analyses these, gives you an answer

🏗️ How This Repo Works

Workflow: Python script fetches from source URL → writes .md file → creates INDEX.xml entry with PLACEHOLDER description → Claude Code generates semantic description. The /curate-doc command always regenerates the description, whereas /rescrape-docs only regenerates descriptions for files with content changes.
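
The "only regenerate on content changes" rule can be illustrated with a small check. This is a minimal sketch, not the repo's actual logic: the function name `content_changed` and the whitespace normalisation are assumptions made for illustration.

```python
import hashlib


def content_changed(old_markdown: str, new_markdown: str) -> bool:
    """Return True when freshly scraped text differs from the stored .md file.

    Hypothetical helper: the repo's real change detection may work differently.
    """

    def digest(text: str) -> str:
        # Normalise line endings and trailing whitespace so that purely
        # cosmetic differences do not trigger a new description.
        lines = [line.rstrip() for line in text.strip().splitlines()]
        return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

    return digest(old_markdown) != digest(new_markdown)


# A re-scrape that only changed line endings is treated as unchanged.
print(content_changed("# Colors\r\nUse utilities.  \n", "# Colors\nUse utilities."))
# A re-scrape with new wording is treated as changed.
print(content_changed("# Colors\nUse utilities.", "# Colors\nUse theme variables."))
```

Under this sketch, `/rescrape-docs` would leave the existing description alone whenever the check returns `False`.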

Source routing: If the source URL is on GitHub, a direct fetch is used instead of FireCrawl.
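
As a rough illustration of the routing rule above, the decision could be made on the URL's host alone. The function name `choose_fetcher`, the host list, and the returned labels are assumptions for this sketch, not names taken from the repo's scripts.

```python
from urllib.parse import urlparse

# Hosts treated as "GitHub" in this sketch (an assumption, not the repo's list).
GITHUB_HOSTS = {"github.com", "www.github.com", "raw.githubusercontent.com"}


def choose_fetcher(source_url: str) -> str:
    """Return 'github' for a direct fetch, or 'firecrawl' for everything else."""
    host = (urlparse(source_url).hostname or "").lower()
    return "github" if host in GITHUB_HOSTS else "firecrawl"


print(choose_fetcher("https://github.com/example/repo/blob/main/docs/usage.md"))
print(choose_fetcher("https://tailwindcss.com/docs/customizing-colors"))
```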

Directory Structure:

uv/
├── INDEX.xml               # Index of all docs
├── README.md
├── api-reference.md        # Scraped doc
├── getting-started.md      # Scraped doc
└── ...

INDEX.xml Schema:

<docs_index>
  <source>
    <title>Hello Document Title</title>
    <description>20-30 word dense summary optimised for semantic search...</description>
    <source_url>https://docs.example.com/hello</source_url>
    <local_file>hello-document-title.md</local_file>
    <scraped_at>2025-10-15</scraped_at>
  </source>
  <!-- Multiple <source> entries, one per .md file -->
</docs_index>
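
An entry following this schema could be added or refreshed roughly as follows. This is a sketch using the standard library's `xml.etree.ElementTree`; the helpers `slugify` and `upsert_source` are hypothetical, and the filename rule is only inferred from the `hello-document-title.md` example above. It always resets the description to `PLACEHOLDER`, which mirrors the `/curate-doc` behaviour described earlier.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date


def slugify(title: str) -> str:
    """Turn 'Hello Document Title' into 'hello-document-title.md' (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") + ".md"


def upsert_source(root: ET.Element, title: str, source_url: str,
                  scraped_at: date) -> ET.Element:
    """Add a <source> entry, or refresh the one with the same source_url."""
    fields = {
        "title": title,
        "description": "PLACEHOLDER",  # replaced later by the generated summary
        "source_url": source_url,
        "local_file": slugify(title),
        "scraped_at": scraped_at.isoformat(),
    }
    entry = next(
        (s for s in root.findall("source")
         if s.findtext("source_url") == source_url),
        None,
    )
    if entry is None:
        entry = ET.SubElement(root, "source")
    for tag, text in fields.items():
        child = entry.find(tag)
        if child is None:
            child = ET.SubElement(entry, tag)
        child.text = text
    return entry


root = ET.Element("docs_index")
upsert_source(root, "Hello Document Title",
              "https://docs.example.com/hello", date(2025, 10, 15))
ET.indent(root)  # pretty-printing helper, Python 3.9+
print(ET.tostring(root, encoding="unicode"))
```

Matching on `source_url` keeps one entry per page, so re-scraping the same URL updates the entry in place instead of duplicating it.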

Scripts use the FireCrawl Python SDK for general web sources and Python stdlib (urllib.request) for GitHub raw markdown.
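
For the GitHub path, a stdlib-only fetch might look like the sketch below. It assumes the source is given as a `github.com/.../blob/...` URL that must first be mapped to its `raw.githubusercontent.com` form; the names `to_raw_url` and `fetch_markdown` are illustrative, not taken from the repo's scripts.

```python
import urllib.request
from urllib.parse import urlparse


def to_raw_url(url: str) -> str:
    """Map a github.com 'blob' URL to its raw.githubusercontent.com equivalent."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: owner/repo/blob/ref/path/to/file.md
    # Caveat: a ref containing '/' (e.g. a branch named 'feature/x') is ambiguous here.
    if parts.hostname == "github.com" and len(segments) >= 5 and segments[2] == "blob":
        owner, repo, _, ref, *file_path = segments
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(file_path)}"
    return url  # already a raw URL, or not a blob URL


def fetch_markdown(url: str, timeout: float = 30.0) -> str:
    """Download raw markdown with urllib.request (performs a network call)."""
    with urllib.request.urlopen(to_raw_url(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(to_raw_url("https://github.com/example/repo/blob/main/docs/usage.md"))
```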


👉 Notes to Improve later

LLM Routing (2026-05-22)

"Semantic search" isn't the right term, since nothing here is matched by vectors. The example descriptions also need improving: they are very keyword-heavy and open with redundant words. The index should use <summary> instead of <description>.

Each description is a routing signal for an LLM reader: Claude reads INDEX.xml to pick which files answer a question. Optimise for that, not for human readability.

Discriminative descriptions

Bottom line: keep the descriptions but drop the "semantic search" framing. The goal is discriminative descriptions: meaningful, keyword-anchored, and written to stand apart from their neighbours, because they are read by an LLM, not matched by vectors.

Also, extract all duplicated examples into one .claude/commands/references/examples.md file.

Then regenerate all the PLACEHOLDER descriptions.

Remove Scripts? (2026-05-27)

Remove the scripts in scripts/ that I don't need. For example, what about a bash for-loop over curate_doc.py, then allocating the files to subagents by size (run wc -m *.md | sort -n, or use token counts)? Each subagent would read the diff on each of its local files and assess whether the description needs refining (using head).

Pick a cap so no agent is overloaded, on two axes:

  • Input size — a target like "~X total words per agent" so context stays comfortable.
  • File count — a soft cap (e.g. ≤5 files) so no single agent does too much serial reading, which hurts both speed and summary quality.

The honest rule: isolation matters in proportion to file size and similarity. Give a large or highly-similar file its own agent (where contamination/dilution is real). Batch small, distinct files 3–5 together — the interference there is negligible, and you save the overhead. It's a quality-vs-efficiency trade-off, not a flat "always isolate".

For this task I'd set the budget as "≤5 files AND ≤~12k words per agent, whichever binds first". Rather than counting words, I'd prefer to measure size with the Anthropic tokeniser script in ~/projects/python/TEMP-token-counts/.
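
The budget above can be sketched as a simple greedy packer. This is one possible reading of the rule, not a settled design: `plan_batches` and `Batch` are hypothetical names, size is measured in words, and the similarity axis mentioned earlier is not modelled. A file larger than the word cap still gets a batch of its own.

```python
from dataclasses import dataclass, field

MAX_FILES = 5        # soft cap on files per agent
MAX_WORDS = 12_000   # approximate cap on total words per agent


@dataclass
class Batch:
    files: list[str] = field(default_factory=list)  # Python 3.9+ syntax
    words: int = 0


def plan_batches(word_counts: dict[str, int]) -> list[Batch]:
    """Group files so each batch stays within both caps, whichever binds first."""
    batches: list[Batch] = []
    current = Batch()
    # Smallest first, so small files share an agent and large ones end up alone.
    for name, words in sorted(word_counts.items(), key=lambda item: item[1]):
        too_many = len(current.files) + 1 > MAX_FILES
        too_long = current.words + words > MAX_WORDS
        if current.files and (too_many or too_long):
            batches.append(current)
            current = Batch()
        current.files.append(name)
        current.words += words
    if current.files:
        batches.append(current)
    return batches


counts = {"a.md": 1000, "b.md": 2000, "c.md": 3000, "d.md": 15000,
          "e.md": 500, "f.md": 700, "g.md": 900}
for batch in plan_batches(counts):
    print(batch.words, batch.files)
```

Sorting by size is a deliberate simplification: it keeps the example short, at the cost of not producing the tightest possible packing.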

If running the tokeniser incurs an API cost, use a multiplier instead, approximated across the curated files: tokens ≈ characters × 0.37. I sampled 49 files and don't think the sampling was skewed.
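
Applied in code, the multiplier is a one-line estimate. The function name `estimate_tokens` is hypothetical; the 0.37 ratio is the approximate figure from the note above, so results are rough estimates rather than tokeniser output.

```python
CHARS_TO_TOKENS = 0.37  # approximate ratio from the note above


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character count, with no API call."""
    return round(len(text) * CHARS_TO_TOKENS)


# A 1,000-character file is estimated at about 370 tokens.
print(estimate_tokens("x" * 1000))
```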

About

Curate and index clean docs for clean AI context to ask questions against docs.
