# WordPress Hooks Crawler

A fast, fully-typed Python crawler that collects every hook documented at developer.wordpress.org/reference/hooks and exports the data in five formats: JSON, YAML, Markdown, HTML, and plain text.

Pre-built output (all hooks, ready to use) is published separately at github.com/BaseMax/wordpress-hooks.


## Features

- Crawls all 49+ paginated listing pages automatically
- Extracts per-hook: name, URL, type, inferred kind (action / filter), description, used-by count, uses count, source file & line, GitHub source link, since-version(s), packages
- SQLite-backed HTTP cache via requests-cache; re-runs are near-instant, and the polite delay is skipped for cache hits
- Automatic retry with exponential back-off on network errors
- Five export formats out of the box
- Strict Python type annotations throughout (mypy --strict clean)
- Single dependency group, managed with uv
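The retry behavior listed above can be sketched as follows. This is an illustrative, stdlib-only sketch, not the crawler's actual code; the function name `fetch_with_backoff` and its parameters are hypothetical:

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=0.5):
    """Call fetch(), retrying with exponential back-off on network errors.

    Waits base_delay * 2**attempt seconds between attempts, so delays
    grow 0.5 s, 1 s, 2 s, ... before giving up and re-raising.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

The real crawler delegates caching to requests-cache, which handles the "skip the polite delay on cache hits" behavior at the session level.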

## Output formats

| File | Format | Best for |
| --- | --- | --- |
| `hooks.json` | JSON (pretty-printed) | Programmatic processing, APIs |
| `hooks.yaml` | YAML | Config files, readable diffs |
| `hooks.md` | Markdown | GitHub rendering, wikis |
| `hooks.html` | Self-contained HTML | Browser viewing, searchable table |
| `hooks.txt` | Plain text | Terminal paging, grep |

## Requirements

- Python ≥ 3.11
- uv (recommended) or pip

## Installation

### With uv (recommended)

```bash
git clone https://github.com/BaseMax/wordpress-crawler-hooks.git
cd wordpress-crawler-hooks
uv sync
```

### With pip

```bash
git clone https://github.com/BaseMax/wordpress-crawler-hooks.git
cd wordpress-crawler-hooks
pip install -r requirements.txt
```

## Usage

```bash
# uv
uv run python crawler.py

# plain Python
python crawler.py
```

Output files are written to `./output/` by default.

### All CLI options

```text
usage: crawler.py [-h] [--output-dir DIR] [--delay SECONDS]
                  [--cache-dir DIR] [--cache-ttl SECONDS] [--no-cache]

options:
  --output-dir DIR      Directory where output files are written. (default: output)
  --delay SECONDS       Pause between HTTP requests. (default: 0.5)
  --cache-dir DIR       Directory for the SQLite HTTP cache database. (default: .cache)
  --cache-ttl SECONDS   How long a cached response stays fresh. (default: 86400 = 24 h)
  --no-cache            Disable the HTTP cache and always fetch live pages.
```

### Examples

```bash
# Custom output directory
python crawler.py --output-dir data/

# Shorter cache lifetime (1 hour)
python crawler.py --cache-ttl 3600

# Always fetch live, no cache
python crawler.py --no-cache

# Faster scraping (smaller delay) with custom cache location
python crawler.py --delay 0.25 --cache-dir /tmp/wp-cache
```

## Project structure

```text
wordpress-crawler-hooks/
├── crawler.py          # main script
├── pyproject.toml      # project metadata & dependencies (PEP 621)
├── requirements.txt    # pip-compatible pin file
├── output/             # generated output (git-ignored)
│   ├── hooks.json
│   ├── hooks.yaml
│   ├── hooks.md
│   ├── hooks.html
│   └── hooks.txt
└── .cache/             # SQLite HTTP cache (git-ignored)
```

## Data schema

Each hook object contains:

| Field | Type | Description |
| --- | --- | --- |
| `post_id` | `int` | WordPress post ID |
| `name` | `str` | Hook name, e.g. `admin_init` |
| `url` | `str` | Full URL on developer.wordpress.org |
| `hook_type` | `str` | Label from the site (`hook`) |
| `hook_kind` | `str` | Inferred: `action`, `filter`, or `unknown` |
| `description` | `str` | Short description |
| `used_by_count` | `int` | Number of functions that use this hook |
| `uses_count` | `int` | Number of functions this hook calls |
| `source_file` | `str` | WordPress source file path |
| `source_line` | `str` | Line number in that file |
| `source_github_url` | `str` | Direct GitHub link to the line |
| `since_versions` | `list[str]` | WordPress version(s) in which this hook was introduced |
| `packages` | `list[str]` | WordPress package(s) the hook belongs to |
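As a consumer-side sketch, the JSON export can be tallied by inferred kind. This assumes `hooks.json` holds a flat list of objects with the fields above; check the actual file layout before relying on it:

```python
import json
from collections import Counter

def kind_counts(hooks: list[dict]) -> Counter:
    """Tally hook objects by their inferred kind (action / filter / unknown)."""
    return Counter(h["hook_kind"] for h in hooks)

# Typical use, with the default --output-dir:
# with open("output/hooks.json", encoding="utf-8") as f:
#     print(kind_counts(json.load(f)))
```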

## Development

```bash
# Install dev dependencies
uv sync --group dev

# Lint & format
uv run ruff check crawler.py
uv run ruff format crawler.py

# Type-check
uv run mypy crawler.py
```

## Pre-built hook data

The crawled output (JSON, YAML, Markdown, HTML, TXT) for all WordPress hooks is published and kept up to date in the companion repository:

github.com/BaseMax/wordpress-hooks


## License

MIT License

Copyright © 2026 Seyyed Ali Mohammadiyeh (Max Base)
