feat: reCAPTCHA data flywheel (opt-in collector + training tooling) by smowtion · Pull Request #8 · DannyLuna17/VisionAIRecaptchaSolver

smowtion · 2026-06-13T16:13:06Z

Summary

Adds an active-learning data flywheel: the solver collects hard tiles, which then flow through human review → retrain → publish → auto-download. Implemented across 4 phases (TDD). Plan: `plans/260613-1719-recaptcha-suite-data-flywheel/`.

What changed

Runtime (shipped in wheel)

New `collection/` module: `DataCollector` — opt-in (`SolverConfig.collect_data=False` by default → zero I/O, PyPI users unaffected). Captures `uncertain` (min_confidence_threshold ≤ conf < conf_threshold) / `failed` / `unknown_keyword` tiles as PNG + `metadata.jsonl`.
Wired symmetrically into both `RecaptchaSolver` and `AsyncRecaptchaSolver`; tile hook lives in `YOLODetector.classify_tiles_with_confidence` (reuses already-cropped tiles, DRY); async disk writes offloaded via `_run_in_executor`.
`SolverConfig`: new `collect_data` / `collect_dir` (kept separate from `download_dir`; sentinel pattern intact).
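
A minimal sketch of the `uncertain` band described above, assuming the two thresholds are plain floats. The `Thresholds` container, the default values, and `collection_reason` are hypothetical names for illustration only; the PR keeps the real settings on `SolverConfig`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical container; both default values are assumptions.
    min_confidence_threshold: float = 0.2
    conf_threshold: float = 0.7


def collection_reason(
    conf: float, t: Thresholds, collect_data: bool = False
) -> Optional[str]:
    """Return "uncertain" when a tile falls in the collection band, else None."""
    if not collect_data:
        # Opt-in: with the flag off nothing is collected and no I/O happens.
        return None
    if t.min_confidence_threshold <= conf < t.conf_threshold:
        return "uncertain"
    return None
```

The band is half-open, so a tile scoring exactly `conf_threshold` counts as confident and is not collected.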

Training tooling (NOT shipped in wheel — outside `src/`)

`training/class_mapping.py` — single source of truth folder↔class_id↔label (14 classes, validated vs `types.CLASS_NAMES`).
`training/review_cli.py`, `prepare_dataset.py`, `train.py`, `export_onnx.py`, `compute_sha256.py`; `train_model/` merged into `training/`.
`docs/training-and-flywheel.md` — full retrain→publish flow + checklist.
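
To illustrate the "single source of truth" idea, here is a toy sketch with three placeholder classes (the real mapping has 14 and is validated against `types.CLASS_NAMES`). `_REGISTRY`, `FOLDER_TO_ID`, `ID_TO_LABEL`, and `validate_against_runtime` are assumed names, not the contents of `training/class_mapping.py`.

```python
from typing import Dict, List, Sequence, Tuple

# Toy registry of (folder name, label) pairs; the list index is the class_id.
# These three entries are placeholders, not the project's real classes.
_REGISTRY: List[Tuple[str, str]] = [
    ("bicycle", "bicycle"),
    ("bus", "bus"),
    ("fire_hydrant", "fire hydrant"),
]

# Both lookups are derived from the one registry, so they cannot drift apart.
FOLDER_TO_ID: Dict[str, int] = {folder: i for i, (folder, _) in enumerate(_REGISTRY)}
ID_TO_LABEL: Dict[int, str] = {i: label for i, (_, label) in enumerate(_REGISTRY)}


def validate_against_runtime(runtime_names: Sequence[str]) -> None:
    """Raise ValueError if the runtime class list disagrees with the registry."""
    expected = [ID_TO_LABEL[i] for i in range(len(ID_TO_LABEL))]
    if list(runtime_names) != expected:
        raise ValueError(f"class mismatch: expected {expected}, got {list(runtime_names)}")
```

Running the validation at import time or in a unit test turns a silent label shift into an immediate failure.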

CI

Build job now asserts the wheel excludes `training/` and includes `collection/`.
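
One way such a check could look, as a sketch only: a wheel is a zip archive, so the file list can be inspected with the standard `zipfile` module. The package directory name `pkg` and both helper functions are hypothetical; the actual CI step may be written differently.

```python
import io
import zipfile
from typing import Iterable, List


def wheel_names(wheel_bytes: bytes) -> List[str]:
    """List the archive members of a wheel given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        return zf.namelist()


def wheel_violations(names: Iterable[str], package: str = "pkg") -> List[str]:
    """Return human-readable problems; an empty list means the wheel is fine."""
    names = list(names)
    problems: List[str] = []
    # Training tooling lives outside src/ and must never be packaged.
    if any(n.split("/")[0] == "training" for n in names):
        problems.append("training/ must not be shipped")
    # The runtime collector must be present inside the package.
    if not any(n.startswith(f"{package}/collection/") for n in names):
        problems.append("collection/ missing")
    return problems
```

In CI the bytes would come from the built `.whl` under `dist/`, and a non-empty result would fail the job.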

Testing

107 passed, 1 deselected (integration); +38 new tests.
ruff `src/`+`training/` clean; mypy `src/` no new errors (4 pre-existing).
Public API `__all__` unchanged; `DataCollector` is internal.

Notes

A code-review pass caught a real bug (fixed): collector must `except Exception`, not just `OSError` — `cv2.imwrite` raises `cv2.error` (not an OSError subclass), which would otherwise abort the solve loop. Telemetry is best-effort and must never break solving.
Real cloud-GPU training (Phase 4) is intentionally out of scope here; training scripts have dry tests only (no GPU).
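
The `except Exception` lesson above can be sketched without OpenCV. `EncoderError` is a stand-in for `cv2.error` (an exception that is not an `OSError` subclass), and `best_effort` is a hypothetical helper, not the collector's actual code.

```python
import logging
from typing import Callable

logger = logging.getLogger("collector_sketch")


class EncoderError(Exception):
    """Stand-in for cv2.error: a failure that does not derive from OSError."""


def best_effort(write: Callable[[], None]) -> bool:
    """Run a telemetry write; report success without ever raising."""
    try:
        write()
        return True
    except Exception:
        # Deliberately broad: catching only OSError would let EncoderError
        # propagate and abort the caller's solve loop.
        logger.debug("tile collection failed; continuing", exc_info=True)
        return False
```

Catching `Exception` rather than using a bare `except:` still lets `KeyboardInterrupt` and `SystemExit` through.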

Add opt-in active-learning collector to capture uncertain/failed samples:

- New collection/ module with DataCollector (writes tiles/failures to collect_dir)
- SolverConfig.collect_data flag (default False, zero I/O overhead)
- Integration in RecaptchaSolver and AsyncRecaptchaSolver solve() pipelines
- Collector records uncertain tiles (confidence between thresholds) and failures
- tests/ scaffold: test_collector_scaffold.py, test_data_collector.py

…ll 4x4 classification fallback

- Add YOLODetector.is_supported() to classify challenges into supported (COCO) vs. unsupported (fallback-capable) groups; skip unsupported challenges with minimal latency using short _reload_challenge delay
- Refactor solve loop (sync+async symmetric) with separate attempts/skips budgets and solved flag; on classification failure, drop to short token-wait instead of hanging full timeout
- Remove _get_target_class; move logic inline to clarify control flow
- Implement per-cell 4x4 classification fallback in SquareCaptchaHandler: when COCO model lacks target (e.g., stairs, bridges), classify each cell independently using the 57k classification model as a last resort
- Add test_is_supported.py (device/model coverage); test_square_handler_fallback.py (cell classification, missing-class recovery)
- Add integration test retry-until-solvable (bounded N=3, wall-clock 180s, still asserts token); now PASSES in 279s (vs. previous fail after 578s)
- Update codebase-summary.md with solve robustness + square handler fallback notes
- Fix test_class_mapping.py SIM300 ruff violation
- Add executed plan + brainstorm report to plans/

Verified: 117 unit tests pass; integration test now stable. Sync+async symmetric; public API unchanged.
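
A rough sketch of a solve loop with separate attempts/skips budgets, under the assumption that a skipped (unsupported) challenge does not consume an attempt. The function name, parameters, and default budgets are hypothetical and do not mirror the solver's real signature.

```python
from typing import Callable, Iterable, Tuple


def solve_loop(
    challenges: Iterable[str],
    is_supported: Callable[[str], bool],
    attempt: Callable[[str], bool],
    max_attempts: int = 3,
    max_skips: int = 5,
) -> Tuple[bool, int, int]:
    """Return (solved, attempts_used, skips_used)."""
    attempts = skips = 0
    for target in challenges:
        if not is_supported(target):
            if skips >= max_skips:
                break  # skip budget exhausted
            skips += 1
            continue  # reload quickly; no attempt is spent
        if attempts >= max_attempts:
            break  # attempt budget exhausted
        attempts += 1
        if attempt(target):
            return True, attempts, skips
    return False, attempts, skips
```

Keeping the two counters apart means a run of unsupported challenges cannot use up the real solving attempts.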

Implement full-image collection, cell-bbox annotation, dataset builder, and 3-tier 4x4 detection priority (COCO → custom detection → per-cell fallback). Phase 3 (GPU training) deferred. Custom model gated behind optional config; default off preserves existing behavior.

- DataCollector.record_challenge_image: save 4x4 full images to collected/full/
- SquareCaptchaHandler: hook full-image collection (detector.collector injection)
- YOLODetector: load/verify custom detection model (SHA256), detection+mapping
- SolverConfig: custom_detection_model_path parameter
- annotate_detection_cli.py: interactive cell→bbox annotation CLI
- prepare_detection_dataset.py: YOLO detection dataset builder (images/labels/data.yaml)
- CUSTOM_DETECTION_CLASSES/MAPPINGS: runtime detection class sync
- DETECTION_CLASSES: training/class_mapping dataset builder class registry
- 3-tier priority solve handler with per-cell fallback
- 13 new tests; ruff/mypy clean; 138 tests passing
- docs/training-and-flywheel.md: Tier B documentation

Public API unchanged. No confidential data or credentials.
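
The 3-tier priority (COCO → custom detection → per-cell fallback) can be sketched as an ordered list of strategies, each returning `None` when it cannot handle the target. All tier functions, cell values, and names below are toy placeholders, not the handler's real logic.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Tier = Tuple[str, Callable[[str], Optional[List[int]]]]


def coco_tier(target: str) -> Optional[List[int]]:
    return [0, 1] if target == "bus" else None  # toy: COCO knows "bus" only


def custom_tier(target: str) -> Optional[List[int]]:
    return [5, 9] if target == "stairs" else None  # toy custom detector


def per_cell_tier(target: str) -> Optional[List[int]]:
    return [2]  # toy last resort: always produces some answer


def build_tiers(custom_enabled: bool = False) -> List[Tier]:
    """The custom tier is included only when explicitly enabled (default off)."""
    tiers: List[Tier] = [("coco", coco_tier)]
    if custom_enabled:
        tiers.append(("custom_detection", custom_tier))
    tiers.append(("per_cell_fallback", per_cell_tier))
    return tiers


def solve_4x4(target: str, tiers: Sequence[Tier]) -> Tuple[str, List[int]]:
    """Return the first tier that yields cells, in priority order."""
    for name, solve in tiers:
        cells = solve(target)
        if cells is not None:
            return name, cells
    return "unsolved", []
```

With the custom tier disabled, `"stairs"` falls straight through to the per-cell fallback, which matches the "default off preserves existing behavior" note.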

- training/device_utils.resolve_device: CUDA > MPS > CPU auto-detect
- train.py: --device defaults to auto, add --amp/--no-amp (use --no-amp on flaky MPS)
- Apple Silicon (M-series) trains via Metal/MPS; no cloud GPU required for small datasets
- docs + Tier B plan notes: Mac MPS option vs Colab; train_detection.py to reuse device_utils
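
The CUDA > MPS > CPU preference can be shown as a pure function. This sketch takes availability flags as arguments so it runs without PyTorch; in a real trainer they would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. The signature is an assumption and may not match `training/device_utils.resolve_device`.

```python
def resolve_device(
    requested: str = "auto",
    *,
    cuda_available: bool = False,
    mps_available: bool = False,
) -> str:
    """Pick a device string, honouring an explicit request over auto-detect."""
    if requested != "auto":
        return requested  # an explicit --device value wins
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

Passing the flags in makes the priority order easy to unit test on a machine with no GPU at all.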

- train_detection.py: YOLO detect trainer (auto CUDA/MPS/CPU, --amp/--no-amp, --resume)
- write_model_card.py: model_card.json sidecar (date/task/classes/sha256)
- collect.py: loop driver to accumulate per-cell + full-4x4 data with progress counts
- class_mapping: enforce DETECTION_CLASSES == types.CUSTOM_DETECTION_CLASSES contract
- pyproject: add onnx/onnxslim to dev extras (export-time only; runtime keeps onnxruntime)
- smoke-verified train_detection on MPS (synthetic data -> best.pt, task=detect)

Real model still needs human-annotated 4x4 data (no auto pseudo-label).
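
A sketch of what a `model_card.json` sidecar writer might look like, using the four fields named above (date/task/classes/sha256). The function names and the exact JSON layout are assumptions, not the contents of `write_model_card.py`.

```python
import hashlib
import json
from datetime import date
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_model_card(
    model_path: Path, task: str, classes: List[str], today: Optional[date] = None
) -> Path:
    """Write model_card.json next to the model and return its path."""
    card = {
        "date": (today or date.today()).isoformat(),
        "task": task,
        "classes": list(classes),
        "sha256": sha256_of(model_path),
    }
    out = model_path.with_name("model_card.json")
    out.write_text(json.dumps(card, indent=2), encoding="utf-8")
    return out
```

The recorded hash is the value a loader would compare against before trusting a downloaded model file.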

- auto_annotate_capmonster.py: label collected 4x4 images via CapMonster image mode (ComplexImageTask/recaptcha, ~$0.04/1000) -> annotations.jsonl, 0->1-indexed cells
- only the 7 detection classes labeled; resilient per-image (network errors skipped)
- API key via --api-key or env CAPMONSTER_API_KEY (no secret in repo)
- docs: manual vs auto annotation; cap.guru can't do image mode (token only)

solution.answer is a 16-element bool mask (True=cell has target), not a cell-index list. Previous code coerced bools to ints producing wrong cells ([1,2] for all). Now map True flags -> 0-indexed cells. Verified end-to-end: collected images -> auto-labels -> YOLO detection dataset. Caught by spot-checking identical outputs.
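
The fix can be illustrated with a small sketch. `mask_to_cells` maps True flags to 0-indexed cells; `buggy_cells` is an assumed reconstruction of the old behaviour (bools treated as cell indices, then shifted to 1-indexed), included only to show why every image produced `[1, 2]`. Neither function is copied from the repository.

```python
from typing import List, Sequence


def mask_to_cells(mask: Sequence[bool], grid_cells: int = 16) -> List[int]:
    """Return the 0-indexed positions whose flag is True."""
    if len(mask) != grid_cells:
        raise ValueError(f"expected {grid_cells} flags, got {len(mask)}")
    return [i for i, flag in enumerate(mask) if flag]


def buggy_cells(mask: Sequence[bool]) -> List[int]:
    """Assumed reconstruction of the bug: int(True)=1 and int(False)=0
    read as cell indices, then converted to 1-indexed."""
    return sorted({int(flag) + 1 for flag in mask})


mask = [False] * 16
mask[3] = mask[7] = True
print(mask_to_cells(mask))  # [3, 7]
print(buggy_cells(mask))    # [1, 2]
```

Any mask containing both True and False values collapses to `[1, 2]` under the buggy version, which is the kind of identical output a spot check would catch.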

smowtion added 2 commits June 13, 2026 23:03

ci: verify training/ excluded from wheel; add data-flywheel journal

ab950f5

- CI build job asserts the wheel ships collection/ but not training/ tooling
- Document the 4-phase data flywheel implementation + the cv2.error/OSError lesson

smowtion mentioned this pull request Jun 13, 2026

CI: data flywheel (fork verification) smowtion/VisionAIRecaptchaSolver#1

Open

smowtion added 8 commits June 14, 2026 09:00

docs: journal — Tier B detection pipeline + MPS training support

410d944

docs: collection findings (Google demo throughput, class skew, rate limiting)

99136bf

smowtion wants to merge 10 commits into DannyLuna17:main from smowtion:feat/data-flywheel
