High-concurrency, fully automated public proxy collector, purpose-built for GitHub Actions.
Scrapes 100+ publicly shared proxy sources, validates them across 5 dimensions, enriches metadata, and exports clean, deduplicated, ready-to-use lists every 3 hours, with zero servers to maintain.
- Why this project
- Features
- Architecture
- The 5-Dimensional Validation Engine
- Quick Start
- Outputs
- Direct download links
- Configuration
- GitHub Actions Workflows
- Project Structure
- Quality Scoring
- Local Development
- Docker
- FAQ
- Responsible Use
- Contributing
- License
Most public proxy lists are noisy, duplicated, and full of dead entries. This project addresses that with a fully automated pipeline that:
- Collects from many sources concurrently (no slow sequential scraping).
- Removes duplicates with a memory-efficient Bloom filter + set.
- Verifies every proxy across 5 independent dimensions, so proxies that fail any check are not exported.
- Runs entirely on GitHub Actions: no VPS, no cost, no maintenance.
- Writes outputs atomically, so a crash mid-run cannot leave a half-written list.
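The Bloom filter + set combination from the list above can be sketched as follows. This is a minimal illustration, not the project's own `bloom_filter` module: the class name `BloomFrontedSet`, the bit-array size, and the number of hash functions are assumptions chosen for the example.

```python
import hashlib


class BloomFrontedSet:
    """Exact de-duplication with a Bloom filter as a cheap first check (hypothetical helper)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.exact: set[str] = set()

    def _positions(self, item: str) -> list[int]:
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item: str) -> bool:
        """Record item and return True if it was not seen before."""
        positions = self._positions(item)
        maybe_seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        # A Bloom "no" is definitive; a "maybe" is confirmed against the exact set.
        if maybe_seen and item in self.exact:
            return False
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        self.exact.add(item)
        return True


candidates = ["1.2.3.4:8080", "5.6.7.8:1080", "1.2.3.4:8080"]
dedup = BloomFrontedSet()
unique = [c for c in candidates if dedup.add_if_new(c)]
```

The exact set is what guarantees correctness; the Bloom filter only lets the common "never seen" case skip the exact lookup, which matters most when that lookup is expensive (for example, a remote store rather than an in-process set).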
| Category | Highlights |
|---|---|
| Concurrency | asyncio + aiohttp with bounded Semaphore pools; check thousands of proxies in parallel |
| Anti-ban | Token-bucket rate limiter + rotating real-browser User-Agents |
| Scrapers | HTML tables, JSON APIs, GitHub raw text, and a generic regex fallback |
| Validation | TCP liveliness, protocol detection (HTTP/HTTPS/SOCKS4/SOCKS5), anonymity scoring, latency, geolocation |
| Deduplication | Bloom filter (probabilistic) fronting an exact set for correctness |
| Enrichment | ASN/ISP resolution, DNSBL/Spamhaus blacklist checks, 0-100 quality score |
| Data integrity | Atomic file writes (temp file + os.replace), so output files are never left half-written |
| Outputs | Master list + segmented by country / protocol / anonymity + JSON manifest |
| Type safety | Strict Pydantic v2 models, full mypy --strict, ruff linting |
| Automation | 5 GitHub Actions: orchestrator, security audit, health monitor, auto-release, cache cleanup |
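The "temp file + os.replace" pattern in the Data integrity row works roughly like this. The sketch below is an assumption about the general technique, not the project's `atomic_writer`; the function name `atomic_write_text` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, content: str) -> None:
    """Write content so readers see either the old file or the new one, never a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so os.replace
    # stays on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # swaps the new file in as a single step
    except BaseException:
        # Remove the partial temp file; the original target is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before `os.replace`, the previous `proxies.txt` is still intact and only a stray `.tmp` file is left behind.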
┌────────────────────────────────────────────────────────────────────────────────┐
│                                PipelineManager                                 │
└────────────────────────────────────────────────────────────────────────────────┘
                                        │
┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
│  COLLECT   │──▶│   DEDUP    │──▶│   VERIFY   │──▶│   ENRICH   │──▶│   EXPORT   │
└────────────┘   └────────────┘   └────────────┘   └────────────┘   └────────────┘
      │                │                │                │                │
async scrapers    Bloom + Set       5-D engine      ASN + DNSBL     atomic writes
(rate-limited)                     (concurrent)      + scoring     master/segmented
The pipeline is linear and fully asynchronous. Every stage is a dependency-injected
component, which keeps the system testable and each layer independently swappable.
See docs/ARCHITECTURE.md for a deep dive.
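To make "linear, asynchronous, dependency-injected" concrete, here is a minimal sketch. It borrows the `PipelineManager` name from the diagram, but the constructor, the `Stage` type alias, and the two stand-in stages are assumptions for illustration; the real component interfaces are described in docs/ARCHITECTURE.md.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical stage signature: takes the current items, returns the next batch.
Stage = Callable[[list[str]], Awaitable[list[str]]]


class PipelineManager:
    """Runs injected async stages in order, feeding each the previous output."""

    def __init__(self, stages: Sequence[Stage]) -> None:
        self._stages = list(stages)

    async def run(self, seed: list[str]) -> list[str]:
        items = seed
        for stage in self._stages:
            items = await stage(items)
        return items


async def dedup(items: list[str]) -> list[str]:
    return list(dict.fromkeys(items))  # order-preserving de-duplication


async def verify(items: list[str]) -> list[str]:
    # Stand-in check: keep entries that look like host:port.
    return [i for i in items if i.count(":") == 1 and i.rsplit(":", 1)[1].isdigit()]


manager = PipelineManager([dedup, verify])
result = asyncio.run(manager.run(["1.2.3.4:80", "1.2.3.4:80", "bad-entry"]))
```

Because the stages are passed in rather than constructed inside the manager, a test can swap any stage for a stub without touching the others.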
Every candidate proxy must survive all of these checks before it is exported:
| # | Dimension | Module | What it does |
|---|---|---|---|
| 1 | Liveliness | 01_liveliness_tcp.py | Raw TCP handshake; the cheap gate that drops most dead proxies first |
| 2 | Protocol | 02_protocol_detector.py | Probes SOCKS5/SOCKS4 handshakes, falls back to HTTP(S) |
| 3 | Anonymity | 03_anonymity_check.py | Classifies as Elite / Anonymous / Transparent by header leakage |
| 4 | Latency | 04_latency_tester.py | Measures real round-trip time through the proxy |
| 5 | Geolocation | 05_geo_locator.py | Resolves country via offline MaxMind GeoLite2 .mmdb |
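As an illustration of the first dimension, a TCP liveliness gate with bounded concurrency might look like the sketch below. It is not the contents of `01_liveliness_tcp.py`; the function names `is_alive` and `check_all` are hypothetical, and the defaults simply mirror the documented `tcp_timeout` (5.0 s) and `validate_concurrency` (500) settings.

```python
import asyncio


async def is_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout
        )
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        pass  # the peer may reset the connection while we close it
    return True


async def check_all(targets: list[tuple[str, int]], concurrency: int = 500) -> list[bool]:
    """Check many targets, with at most `concurrency` handshakes in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(host: str, port: int) -> bool:
        async with semaphore:
            return await is_alive(host, port)

    return await asyncio.gather(*(bounded(h, p) for h, p in targets))
```

Running this gate first keeps the more expensive protocol, anonymity, and latency probes for proxies that at least accept a connection.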
The dimension files are loaded dynamically with importlib (their numeric names aren't valid Python import identifiers) and orchestrated by src/validators/engine.py.
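One common way to load a file such as `01_liveliness_tcp.py` is `importlib.util.spec_from_file_location`. The sketch below shows that approach on a throwaway file; whether `engine.py` uses exactly this call is an assumption, and `load_module_from_path` plus the `demo_check` module name are hypothetical.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path, module_name: str) -> ModuleType:
    """Import a source file whose filename is not a valid identifier."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway file; `import 01_demo_check` would be a SyntaxError.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "01_demo_check.py"
    source.write_text("NAME = 'demo'\n\ndef check(x):\n    return x > 0\n", encoding="utf-8")
    demo = load_module_from_path(source, "demo_check")
```

The numeric prefix only affects the filename; once loaded, the module is registered under whatever valid name the loader chooses.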
# 1. Install dependencies
pip install -r requirements.txt
# 2. Run the full pipeline
python -m src.main run
# ...or use the Makefile
make run

That's it. Validated proxies land in outputs/.
On GitHub: just enable Actions and the 01 - Main Orchestrator workflow runs automatically every 3 hours, committing fresh proxies back to the repo.
outputs/
├── proxies.txt                    # Master list: best proxies, highest score first
├── by_country/
│   ├── BD_proxies.txt
│   ├── US_proxies.txt
│   └── <ISO>_proxies.txt
├── by_protocol/
│   ├── http.txt
│   ├── socks4.txt
│   └── socks5.txt
├── by_anonymity/
│   ├── elite.txt
│   └── anonymous.txt
└── metadata/
    ├── manifest.json              # Totals, distribution, average latency
    └── source_health_report.json  # Which sources performed well
Once running on GitHub, you can consume lists directly via raw URLs
(replace USER/REPO):
https://raw.githubusercontent.com/USER/REPO/main/outputs/proxies.txt
https://raw.githubusercontent.com/USER/REPO/main/outputs/by_protocol/socks5.txt
https://raw.githubusercontent.com/USER/REPO/main/outputs/by_anonymity/elite.txt
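A consumer could fetch and parse one of these lists with the standard library alone. The sketch below assumes each line holds one `host:port` entry, which this README does not actually specify, so check a generated file before relying on it. The function names are hypothetical.

```python
from urllib.request import urlopen


def parse_proxy_lines(text: str) -> list[tuple[str, int]]:
    """Parse 'host:port' lines, skipping blanks, comments, and malformed entries."""
    proxies: list[tuple[str, int]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, port = line.rpartition(":")
        if sep and host and port.isdigit():
            proxies.append((host, int(port)))
    return proxies


def fetch_proxy_list(url: str, timeout: float = 10.0) -> list[tuple[str, int]]:
    """Download a raw list (for example the proxies.txt URL above) and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_proxy_lines(response.read().decode("utf-8"))
```

Public proxies go stale quickly, so re-fetching shortly before use is safer than caching a list for long.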
All behaviour is driven by the files in config/ (mostly YAML); no code changes needed.
config/settings.yaml (core knobs):
| Key | Default | Description |
|---|---|---|
| scrape_concurrency | 50 | Parallel source fetches |
| validate_concurrency | 500 | Parallel proxy validations |
| tcp_timeout | 5.0 | TCP handshake timeout (s) |
| validate_timeout | 10.0 | HTTP validation timeout (s) |
| max_retries | 3 | Retries per request |
| rate_limit_per_sec | 20.0 | Token-bucket refill rate |
| rate_limit_burst | 40 | Token-bucket capacity |
| max_latency_ms | 8000 | Drop proxies slower than this |
| max_alive_output | 50000 | Cap on exported proxies |
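To show how `rate_limit_per_sec` and `rate_limit_burst` interact, here is a minimal token-bucket sketch using the default values from the table. It is not the project's `rate_limiter`; the class name, the `try_acquire` method, and the injectable `clock` parameter are assumptions made so the behaviour is easy to follow.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `burst` requests, refilling at `rate_per_sec` tokens per second."""

    def __init__(
        self,
        rate_per_sec: float = 20.0,
        burst: int = 40,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so the first burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of waiting."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With the defaults, a cold start can send 40 requests immediately, after which throughput settles at about 20 requests per second.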
Other config files:
- proxy_sources_registry.yaml: the central database of sources.
- validation_rules.yaml: latency buckets, scoring weights, anonymity threshold.
- country_mapping.json: ISO 3166 codes, names, and flag emojis.
Add a new source interactively:
python scripts/add_new_source.py

| Workflow | Schedule | Purpose |
|---|---|---|
| 01_main_orchestrator.yml | every 3 hours | Run pipeline, commit fresh proxy lists |
| 02_security_audit.yml | push / weekly | Trivy + pip-audit: secret and dependency scan |
| 03_health_monitor.yml | every 6 hours | Fail if >50% of sources are down |
| 04_auto_release.yml | weekly | Tag + GitHub release with proxy snapshot |
| 05_cleanup_cache.yml | daily | Purge old Actions caches |
⚠️ Setup: On GitHub, go to Settings → Actions → General → Workflow permissions and enable Read and write permissions so the orchestrator can push outputs.
src/
├── main.py            # CLI entry point
├── core/              # pipeline_manager, exceptions, constants
├── models/            # Pydantic schemas (proxy, source, stats)
├── collectors/        # base_scraper, factory, extractors/, rotators/
├── deduplication/     # bloom_filter, early_aggregator, redis_state_manager
├── validators/        # 01..05 dimensions + engine.py
├── enrichment/        # asn_resolver, spam_blacklist_check, scoring_engine
├── exporters/         # atomic_writer, master/segmented/manifest builders
└── utils/             # rate_limiter, async_semaphore_pool, http_client, logger
Full tree and API docs: docs/API_REFERENCE.md.
Each proxy gets a 0-100 score so the master list is sorted best-first:
| Factor | Max points | Best case |
|---|---|---|
| Latency | 40 | < 500 ms |
| Anonymity | 35 | Elite |
| Protocol | 15 | SOCKS5 |
| Clean (not blacklisted) | 10 | Not on any DNSBL |
Weights are configurable in config/validation_rules.yaml.
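The table fixes only the maximum points and the best case for each factor. The sketch below shows one way the four factors could combine into a single score; everything between the best case and zero (the linear latency decay and the intermediate anonymity and protocol values) is an assumption for illustration, as is the function name. The real weights come from config/validation_rules.yaml.

```python
def quality_score(latency_ms: float, anonymity: str, protocol: str, blacklisted: bool) -> int:
    """Combine the four documented factors into a 0-100 score (illustrative weights)."""
    # Latency: full 40 points under 500 ms (documented best case), then an
    # ASSUMED linear decay to 0 at 8000 ms, the default max_latency_ms.
    if latency_ms < 500:
        latency_pts = 40.0
    else:
        latency_pts = max(0.0, 40.0 * (8000 - latency_ms) / (8000 - 500))

    # Only "elite" = 35 and "socks5" = 15 are documented; other values are assumed.
    anonymity_pts = {"elite": 35, "anonymous": 20, "transparent": 0}.get(anonymity, 0)
    protocol_pts = {"socks5": 15, "socks4": 10, "https": 8, "http": 5}.get(protocol, 0)

    clean_pts = 0 if blacklisted else 10
    return round(latency_pts + anonymity_pts + protocol_pts + clean_pts)


best = quality_score(120, "elite", "socks5", blacklisted=False)
worst = quality_score(8000, "transparent", "unknown", blacklisted=True)
```

Sorting the master list best-first is then a matter of ordering by this score in descending order.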
make install # install dependencies
make test # run unit + integration tests (pytest)
make lint # ruff check + mypy --strict
make format # auto-format with ruff
make clean # remove caches
make run # run the pipeline locally

Benchmark dedup throughput on synthetic data:
python scripts/local_benchmark.py

Run the full local stack (collector + Redis for cross-run state):
docker compose -f docker/docker-compose.yml up --build

The image is based on the lightweight python:3.11-alpine image.
Do I need the MaxMind databases?
No. Geolocation and ASN data are optional. Without the .mmdb files the pipeline
still runs; country/ASN fields are simply left empty. To enable them, set a
MAXMIND_LICENSE_KEY secret and run python scripts/update_geoip_db.py.
Why are outputs/ files empty in the repo?
They are generated on each run. The first GitHub Actions run will populate them.
Why do validator files start with numbers (01_...)?
It mirrors the documented architecture order. Since 01_foo isn't a valid Python
import name, engine.py loads them dynamically via importlib, and they behave like any other module once loaded.
Is Redis required?
No. Redis only adds optional cross-run dedup state for local/Docker use. On GitHub Actions (stateless) it gracefully degrades to a no-op.
This tool only collects proxies that are publicly and freely shared. The built-in token-bucket rate limiter and User-Agent rotation exist to be a polite client, not to bypass protections. Always respect each source's terms of service and applicable laws. You are responsible for how you use the collected proxies.
Contributions are welcome! To suggest a new source, open a New source request issue
(template provided) or run scripts/add_new_source.py and submit a PR. Please ensure
make lint and make test pass before opening a pull request.
Released under the MIT License; see LICENSE.