
mrgusux/automatic-proxy


🛰️ Ultimate God-Tier Automated Proxy Collector

High-concurrency, fully automated public proxy collector, purpose-built for GitHub Actions.

Scrapes 100+ publicly shared proxy sources, validates them across 5 dimensions, enriches metadata, and exports clean, deduplicated, ready-to-use lists every 3 hours, with zero servers to maintain.




💡 Why this project

Most public proxy lists are noisy, duplicated, and full of dead entries. This project addresses that with a fully automated pipeline that:

  • ✅ Collects from many sources concurrently (no slow sequential scraping).
  • ✅ Removes duplicates with a memory-efficient Bloom filter + set.
  • ✅ Verifies every proxy across 5 independent dimensions, so proxies that fail a check are dropped before export.
  • ✅ Runs entirely on GitHub Actions: no VPS, no cost, no maintenance.
  • ✅ Writes outputs atomically, so a crash mid-run cannot leave a half-written list.
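
The Bloom filter + set combination can be sketched as follows. This is a minimal illustration and not the project's code: the `BloomFilter` class, its sizing defaults, and the `deduplicate` helper are hypothetical names chosen for the example. The filter answers "definitely new" cheaply, and the exact set settles the "maybe seen" cases so a false positive never discards a real proxy.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical sketch): k bit positions from one SHA-256 digest."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        # Double hashing: position_i = h1 + i * h2 (mod size_bits).
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def deduplicate(candidates: list[str]) -> list[str]:
    """Return candidates in first-seen order with exact duplicates removed."""
    bloom = BloomFilter()
    seen: set[str] = set()
    unique: list[str] = []
    for proxy in candidates:
        # A Bloom "no" is definitive; a Bloom "maybe" is confirmed against the set.
        if bloom.might_contain(proxy) and proxy in seen:
            continue
        bloom.add(proxy)
        seen.add(proxy)
        unique.append(proxy)
    return unique
```

Calling `deduplicate(["1.1.1.1:80", "2.2.2.2:8080", "1.1.1.1:80"])` keeps the first two entries and drops the repeat.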

✨ Features

| Category | Highlights |
| --- | --- |
| Concurrency | asyncio + aiohttp with bounded Semaphore pools; checks thousands of proxies in parallel |
| Anti-ban | Token-bucket rate limiter + rotating real-browser User-Agents |
| Scrapers | HTML tables, JSON APIs, GitHub raw text, and a generic regex fallback |
| Validation | TCP liveness, protocol detection (HTTP/HTTPS/SOCKS4/SOCKS5), anonymity scoring, latency, geolocation |
| Deduplication | Bloom filter (probabilistic) fronting an exact set for correctness |
| Enrichment | ASN/ISP resolution, DNSBL/Spamhaus blacklist checks, 0–100 quality score |
| Data integrity | Atomic file writes (temp file + os.replace), so an interrupted run cannot leave a half-written file |
| Outputs | Master list + segmented by country / protocol / anonymity + JSON manifest |
| Type safety | Strict Pydantic v2 models, full mypy --strict, ruff linting |
| Automation | 5 GitHub Actions: orchestrator, security audit, health monitor, auto-release, cache cleanup |
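
The "temp file + os.replace" pattern from the Data integrity row might look roughly like this. It is a sketch under assumptions: `atomic_write_text` is a hypothetical helper name, and the project's own atomic_writer may differ in details such as fsync behaviour.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, content: str) -> None:
    """Write content to a temp file in the same directory, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because the final step is a single rename, a reader of the target file sees either the previous complete version or the new complete version, never a partial write.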

🏗️ Architecture

          ┌──────────────────────────────────────────────────────────────┐
          │                      PipelineManager                         │
          └──────────────────────────────────────────────────────────────┘
                                       │
   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
   │ COLLECT  │──▶│  DEDUP   │──▶│  VERIFY  │──▶│  ENRICH  │──▶│  EXPORT  │
   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
        │              │              │              │              │
  async scrapers  Bloom + Set   5-D engine     ASN + DNSBL    atomic writes
  (rate-limited)               (concurrent)     + scoring     master/segmented

The pipeline is linear and fully asynchronous. Every stage is a dependency-injected component, which keeps the system testable and each layer independently swappable. See docs/ARCHITECTURE.md for a deep dive.
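
One way to picture a linear pipeline of injected async stages is the sketch below. All names here (`SketchPipeline`, `Stage`, and the two toy stages) are hypothetical and exist only for illustration; the real PipelineManager interface is described in docs/ARCHITECTURE.md.

```python
import asyncio
from typing import Awaitable, Callable

# A stage is any async callable that maps a list of proxies to a new list.
Stage = Callable[[list[str]], Awaitable[list[str]]]


class SketchPipeline:
    """Runs injected async stages in order, feeding each output to the next stage."""

    def __init__(self, stages: list[Stage]) -> None:
        self._stages = stages

    async def run(self, seed: list[str]) -> list[str]:
        items = seed
        for stage in self._stages:
            items = await stage(items)
        return items


async def dedup(items: list[str]) -> list[str]:
    # dict.fromkeys preserves first-seen order while dropping repeats.
    return list(dict.fromkeys(items))


async def verify(items: list[str]) -> list[str]:
    # Placeholder check: keep only entries shaped like host:port.
    return [p for p in items if p.count(":") == 1 and p.rsplit(":", 1)[1].isdigit()]
```

Since each stage is passed in rather than constructed internally, a test can swap `verify` for a stub, for example `asyncio.run(SketchPipeline([dedup, verify]).run(candidates))` with any stage replaced.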


🔬 The 5-Dimensional Validation Engine

Every candidate proxy must survive all of these checks before it is exported:

| # | Dimension | Module | What it does |
| --- | --- | --- | --- |
| 1 | Liveness | 01_liveliness_tcp.py | Raw TCP handshake: the cheap gate that drops most dead proxies first |
| 2 | Protocol | 02_protocol_detector.py | Probes SOCKS5/SOCKS4 handshakes, falls back to HTTP(S) |
| 3 | Anonymity | 03_anonymity_check.py | Classifies as Elite / Anonymous / Transparent by header leakage |
| 4 | Latency | 04_latency_tester.py | Measures real round-trip time through the proxy |
| 5 | Geolocation | 05_geo_locator.py | Resolves country via offline MaxMind GeoLite2 .mmdb |
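
As an illustration of the first dimension, a raw TCP liveness gate with bounded concurrency could be sketched like this. The function names are hypothetical, and the defaults simply mirror the tcp_timeout and validate_concurrency values listed under Configuration; the actual module may be structured differently.

```python
import asyncio
import time


async def tcp_is_alive(host: str, port: int, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (alive, elapsed_seconds) for a plain TCP connect attempt."""
    start = time.perf_counter()
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        # Refused, unreachable, or too slow: treat the proxy as dead.
        return False, time.perf_counter() - start
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        pass
    return True, time.perf_counter() - start


async def check_many(
    targets: list[tuple[str, int]], limit: int = 500, timeout: float = 5.0
) -> list[bool]:
    """Check all targets concurrently, with at most `limit` connects in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(host: str, port: int) -> bool:
        async with semaphore:
            alive, _elapsed = await tcp_is_alive(host, port, timeout)
            return alive

    return await asyncio.gather(*(guarded(h, p) for h, p in targets))
```

Running this gate first is cheap because it opens a socket and closes it without speaking any proxy protocol, so the more expensive checks only see hosts that accept connections.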

The dimension files are loaded dynamically with importlib (their numeric names aren't valid Python import identifiers), orchestrated by src/validators/engine.py.
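
Loading modules whose filenames start with digits can be done with the standard importlib.util functions, roughly as sketched here. The helper names and the `dimension_` module-name prefix are assumptions for the example, not taken from src/validators/engine.py.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path, module_name: str) -> ModuleType:
    """Import a file by path, so its filename need not be a valid identifier."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def load_dimensions(directory: Path) -> list[ModuleType]:
    """Load every NN_*.py file in the directory, ordered by its numeric prefix."""
    paths = sorted(directory.glob("[0-9][0-9]_*.py"))
    return [load_module_from_path(p, f"dimension_{p.stem}") for p in paths]
```

Sorting by filename is what makes the `01_` to `05_` prefixes double as the execution order.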


🚀 Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the full pipeline
python -m src.main run

# ...or use the Makefile
make run

That's it. Validated proxies land in outputs/.

On GitHub: just enable Actions and the 01 - Main Orchestrator workflow runs automatically every 3 hours, committing fresh proxies back to the repo.


📦 Outputs

outputs/
├── proxies.txt                     # Master list: best proxies, highest score first
├── by_country/
│   ├── BD_proxies.txt
│   ├── US_proxies.txt
│   └── <ISO>_proxies.txt
├── by_protocol/
│   ├── http.txt
│   ├── socks4.txt
│   └── socks5.txt
├── by_anonymity/
│   ├── elite.txt
│   └── anonymous.txt
└── metadata/
    ├── manifest.json               # Totals, distribution, average latency
    └── source_health_report.json   # Which sources performed well

🔗 Direct download links

Once running on GitHub, you can consume lists directly via raw URLs (replace USER/REPO):

https://raw.githubusercontent.com/USER/REPO/main/outputs/proxies.txt
https://raw.githubusercontent.com/USER/REPO/main/outputs/by_protocol/socks5.txt
https://raw.githubusercontent.com/USER/REPO/main/outputs/by_anonymity/elite.txt
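
A consumer script for these raw URLs might look like the sketch below. It assumes each line holds one `host:port` entry, optionally prefixed with a scheme such as `socks5://`; the README does not specify the line format, so adjust the parser to whatever the generated files actually contain. Both function names are hypothetical.

```python
import urllib.request


def parse_proxy_lines(text: str) -> list[tuple[str, int]]:
    """Parse host:port lines, skipping blanks, comments, and malformed entries."""
    proxies: list[tuple[str, int]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Drop an optional scheme prefix such as "socks5://" (assumed format).
        address = line.split("://", 1)[-1]
        host, sep, port = address.rpartition(":")
        if sep and host and port.isdigit() and 0 < int(port) < 65536:
            proxies.append((host, int(port)))
    return proxies


def fetch_proxy_list(url: str, timeout: float = 10.0) -> list[tuple[str, int]]:
    """Download one of the raw list URLs and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return parse_proxy_lines(body)
```

Pass `fetch_proxy_list` one of the URLs above with USER/REPO filled in; the parser can also be used on a file that was downloaded separately.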

⚙️ Configuration

All behaviour is driven by YAML files in config/, so no code changes are needed.

config/settings.yaml (core knobs):

| Key | Default | Description |
| --- | --- | --- |
| scrape_concurrency | 50 | Parallel source fetches |
| validate_concurrency | 500 | Parallel proxy validations |
| tcp_timeout | 5.0 | TCP handshake timeout (s) |
| validate_timeout | 10.0 | HTTP validation timeout (s) |
| max_retries | 3 | Retries per request |
| rate_limit_per_sec | 20.0 | Token-bucket refill rate |
| rate_limit_burst | 40 | Token-bucket capacity |
| max_latency_ms | 8000 | Drop proxies slower than this |
| max_alive_output | 50000 | Cap on exported proxies |
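
To show how rate_limit_per_sec and rate_limit_burst interact, here is a minimal token-bucket sketch using the defaults above. The `TokenBucket` class and its injectable `clock` parameter are hypothetical and may not match the project's rate_limiter.

```python
import asyncio
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at rate_per_sec, holds at most `burst` tokens."""

    def __init__(
        self,
        rate_per_sec: float = 20.0,
        burst: int = 40,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full, so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available, without waiting."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    async def acquire(self) -> None:
        """Wait until a token is available, then take it."""
        while not self.try_acquire():
            # Sleep roughly as long as it takes to accumulate the missing fraction.
            await asyncio.sleep((1.0 - self.tokens) / self.rate)
```

With these defaults a client can send 40 requests back to back, after which it is held to about 20 requests per second until the bucket refills.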

Other config files:

  • proxy_sources_registry.yaml: the central database of sources.
  • validation_rules.yaml: latency buckets, scoring weights, anonymity threshold.
  • country_mapping.json: ISO 3166 codes, names, and flag emojis.

Add a new source interactively:

python scripts/add_new_source.py

🤖 GitHub Actions Workflows

| Workflow | Schedule | Purpose |
| --- | --- | --- |
| 01_main_orchestrator.yml | every 3 hours | Run pipeline, commit fresh proxy lists |
| 02_security_audit.yml | push / weekly | Trivy + pip-audit secret & dependency scan |
| 03_health_monitor.yml | every 6 hours | Fail if >50% of sources are down |
| 04_auto_release.yml | weekly | Tag + GitHub release with proxy snapshot |
| 05_cleanup_cache.yml | daily | Purge old Actions caches |

⚠️ Setup: On GitHub, go to Settings → Actions → General → Workflow permissions and enable Read and write permissions so the orchestrator can push outputs.


🗂️ Project Structure

src/
├── main.py                 # CLI entry point
├── core/                   # pipeline_manager, exceptions, constants
├── models/                 # Pydantic schemas (proxy, source, stats)
├── collectors/             # base_scraper, factory, extractors/, rotators/
├── deduplication/          # bloom_filter, early_aggregator, redis_state_manager
├── validators/             # 01..05 dimensions + engine.py
├── enrichment/             # asn_resolver, spam_blacklist_check, scoring_engine
├── exporters/              # atomic_writer, master/segmented/manifest builders
└── utils/                  # rate_limiter, async_semaphore_pool, http_client, logger

Full tree and API docs: docs/API_REFERENCE.md.


🏆 Quality Scoring

Each proxy gets a 0–100 score so the master list is sorted best-first:

| Factor | Max points | Best case |
| --- | --- | --- |
| Latency | 40 | < 500 ms |
| Anonymity | 35 | Elite |
| Protocol | 15 | SOCKS5 |
| Clean (not blacklisted) | 10 | Not on any DNSBL |

Weights are configurable in config/validation_rules.yaml.
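
A scoring function consistent with the table could be sketched as follows. Only the four maxima and their best cases come from the table. Everything else is an assumption made for illustration: the linear latency decay between 500 ms and the 8000 ms cutoff, the points for Anonymous, Transparent, SOCKS4, HTTPS, and HTTP, and all the names used.

```python
# Only "elite": 35 and "socks5": 15 come from the table; other values are assumed.
ANONYMITY_POINTS = {"elite": 35, "anonymous": 20, "transparent": 0}
PROTOCOL_POINTS = {"socks5": 15, "socks4": 10, "https": 8, "http": 5}


def latency_points(
    latency_ms: float, best_ms: float = 500.0, worst_ms: float = 8000.0
) -> float:
    """Full 40 points under best_ms, 0 at worst_ms, linear in between (assumed curve)."""
    if latency_ms < best_ms:
        return 40.0
    if latency_ms >= worst_ms:
        return 0.0
    return 40.0 * (worst_ms - latency_ms) / (worst_ms - best_ms)


def quality_score(
    latency_ms: float, anonymity: str, protocol: str, blacklisted: bool
) -> float:
    """Combine the four factors into a 0-100 score; unknown labels earn 0 points."""
    score = latency_points(latency_ms)
    score += ANONYMITY_POINTS.get(anonymity, 0)
    score += PROTOCOL_POINTS.get(protocol, 0)
    score += 0 if blacklisted else 10
    return round(score, 1)
```

Under these assumptions a fast, Elite, SOCKS5, non-blacklisted proxy reaches the full 100 points, and sorting by `quality_score` in descending order gives the best-first ordering used for the master list.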


🧪 Local Development

make install     # install dependencies
make test        # run unit + integration tests (pytest)
make lint        # ruff check + mypy --strict
make format      # auto-format with ruff
make clean       # remove caches
make run         # run the pipeline locally

Benchmark dedup throughput on synthetic data:

python scripts/local_benchmark.py

🐳 Docker

Run the full local stack (collector + Redis for cross-run state):

docker compose -f docker/docker-compose.yml up --build

The image is based on the lightweight python:3.11-alpine base image.


❓ FAQ

Do I need the MaxMind databases? No. Geolocation and ASN data are optional. Without the .mmdb files the pipeline still runs; country/ASN fields are simply left empty. To enable them, set a MAXMIND_LICENSE_KEY secret and run python scripts/update_geoip_db.py.

Why are outputs/ files empty in the repo? They are generated on each run. The first GitHub Actions run will populate them.

Why do validator files start with numbers (01_...)? It mirrors the documented architecture order. Since 01_foo isn't a valid Python import name, engine.py loads them dynamically via importlib, and they behave like any other module once loaded.

Is Redis required? No. Redis only adds optional cross-run dedup state for local/Docker use. On GitHub Actions (stateless) it gracefully degrades to a no-op.


🛡️ Responsible Use

This tool only collects proxies that are publicly and freely shared. The built-in token-bucket rate limiter and User-Agent rotation exist to be a polite client, not to bypass protections. Always respect each source's terms of service and applicable laws. You are responsible for how you use the collected proxies.


🤝 Contributing

Contributions are welcome! To suggest a new source, open a New source request issue (template provided) or run scripts/add_new_source.py and submit a PR. Please ensure make lint and make test pass before opening a pull request.


📄 License

Released under the MIT License; see LICENSE.

Built with ⚡ asyncio, 🧬 Pydantic, and ❤️ for the open-source community.
