feat: reCAPTCHA data flywheel (opt-in collector + training tooling) #8

Open
smowtion wants to merge 10 commits into DannyLuna17:main from smowtion:feat/data-flywheel

@smowtion

Summary

Adds an active-learning data flywheel: the solver collects hard tiles → human review → retrain → publish → auto-download. Implemented in 4 phases (TDD). Plan: `plans/260613-1719-recaptcha-suite-data-flywheel/`.

What changed

Runtime (shipped in wheel)

  • New `collection/` module: `DataCollector` — opt-in (`SolverConfig.collect_data=False` by default → zero I/O, PyPI users unaffected). Captures `uncertain` (min_confidence_threshold ≤ conf < conf_threshold) / `failed` / `unknown_keyword` tiles as PNG + `metadata.jsonl`.
  • Wired symmetrically into both `RecaptchaSolver` and `AsyncRecaptchaSolver`; tile hook lives in `YOLODetector.classify_tiles_with_confidence` (reuses already-cropped tiles, DRY); async disk writes offloaded via `_run_in_executor`.
  • `SolverConfig`: new `collect_data` / `collect_dir` (kept separate from `download_dir`; sentinel pattern intact).
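
The capture rule for tiles can be sketched as below. This is a minimal illustration, not the `DataCollector` code itself: the function name `collection_bucket` and the default threshold values are assumptions; only the band `min_confidence_threshold ≤ conf < conf_threshold` and the bucket names come from the description above. The `failed` bucket is recorded at the solve level and is not modelled here.

```python
from typing import Optional


def collection_bucket(
    conf: float,
    keyword_known: bool,
    min_confidence_threshold: float = 0.2,  # assumed value, for illustration only
    conf_threshold: float = 0.7,            # assumed value, for illustration only
) -> Optional[str]:
    """Return the bucket a tile would be saved under, or None to skip it."""
    if not keyword_known:
        # The challenge keyword has no known class: keep the tile for labelling.
        return "unknown_keyword"
    if min_confidence_threshold <= conf < conf_threshold:
        # Half-open band: the lower bound is included, the upper bound is not.
        return "uncertain"
    # Confident hits and near-zero scores are not collected.
    return None
```

With these assumed thresholds, a tile scored 0.5 lands in `uncertain`, while a tile scored exactly 0.7 is treated as confident and skipped.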

Training tooling (NOT shipped in wheel — outside `src/`)

  • `training/class_mapping.py` — single source of truth folder↔class_id↔label (14 classes, validated vs `types.CLASS_NAMES`).
  • `training/review_cli.py`, `prepare_dataset.py`, `train.py`, `export_onnx.py`, `compute_sha256.py`; `train_model/` merged into `training/`.
  • `docs/training-and-flywheel.md` — full retrain→publish flow + checklist.
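
The checksum step in the publish flow can be sketched roughly as follows. The function names and chunk size are assumptions for illustration and do not describe the actual interface of `compute_sha256.py`.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large model files are not loaded at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_hex: str) -> bool:
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```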

CI

  • Build job now asserts the wheel excludes `training/` and includes `collection/`.
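
A wheel-content assertion of this kind could look like the sketch below. It is written in Python for illustration; the real CI step may well be a shell one-liner, and the helper names are hypothetical. A wheel is a zip archive, so the standard `zipfile` module is enough to list its contents.

```python
import zipfile


def wheel_names(wheel):
    """List archive member paths; `wheel` is a path or a binary file-like object."""
    with zipfile.ZipFile(wheel) as zf:
        return zf.namelist()


def wheel_layout_ok(names) -> bool:
    """True if some file lives under a collection/ directory and none under training/."""
    # Drop the final component (the file name) and keep only directory parts.
    dirs = [name.split("/")[:-1] for name in names]
    has_collection = any("collection" in d for d in dirs)
    has_training = any("training" in d for d in dirs)
    return has_collection and not has_training
```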

Testing

  • 107 passed, 1 deselected (integration); +38 new tests.
  • ruff `src/`+`training/` clean; mypy `src/` no new errors (4 pre-existing).
  • Public API `__all__` unchanged; `DataCollector` internal.

Notes

  • A code-review pass caught a real bug (fixed): collector must `except Exception`, not just `OSError` — `cv2.imwrite` raises `cv2.error` (not an OSError subclass), which would otherwise abort the solve loop. Telemetry is best-effort and must never break solving.
  • Real cloud-GPU training (Phase 4) is intentionally out of scope here; training scripts have dry tests only (no GPU).
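
The best-effort rule from the first note can be sketched as follows. The helper `record_best_effort` and its `write_image` callback are hypothetical stand-ins, not the collector's real API; the point is only that the handler catches `Exception`, so an error type outside the `OSError` hierarchy cannot escape into the solve loop.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("collector_sketch")


def record_best_effort(write_image, out_dir: Path, name: str, meta: dict) -> bool:
    """Save one sample; report failure via the return value instead of raising."""
    try:
        out_dir.mkdir(parents=True, exist_ok=True)
        # write_image stands in for the real image writer and may raise any error type.
        write_image(out_dir / f"{name}.png")
        with (out_dir / "metadata.jsonl").open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(meta) + "\n")
        return True
    except Exception:
        # Deliberately broad: catching only OSError would let other errors propagate.
        logger.warning("collector write failed; continuing", exc_info=True)
        return False
```

Returning a boolean keeps the caller's control flow unchanged whether or not the sample was saved.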

smowtion added 2 commits June 13, 2026 23:03
Add opt-in active-learning collector to capture uncertain/failed samples:
- New collection/ module with DataCollector (writes tiles/failures to collect_dir)
- SolverConfig.collect_data flag (default False, zero I/O overhead)
- Integration in RecaptchaSolver and AsyncRecaptchaSolver solve() pipelines
- Collector records uncertain tiles (confidence between thresholds) and failures
- tests/ scaffold: test_collector_scaffold.py, test_data_collector.py
- CI build job asserts the wheel ships collection/ but not training/ tooling
- Document the 4-phase data flywheel implementation + the cv2.error/OSError lesson
smowtion added 8 commits June 14, 2026 09:00
…ll 4x4 classification fallback

- Add YOLODetector.is_supported() to classify challenges into supported (COCO) vs. unsupported (fallback-capable) groups; skip unsupported challenges with minimal latency using short _reload_challenge delay
- Refactor solve loop (sync+async symmetric) with separate attempts/skips budgets and solved flag; on classification failure, drop to short token-wait instead of hanging full timeout
- Remove _get_target_class; move logic inline to clarify control flow
- Implement per-cell 4x4 classification fallback in SquareCaptchaHandler: when COCO model lacks target (e.g., stairs, bridges), classify each cell independently using the 57k classification model as a last resort
- Add test_is_supported.py (device/model coverage); test_square_handler_fallback.py (cell classification, missing-class recovery)
- Add integration test retry-until-solvable (bounded N=3, wall-clock 180s, still asserts token); now PASSES in 279s (vs. previous fail after 578s)
- Update codebase-summary.md with solve robustness + square handler fallback notes
- Fix test_class_mapping.py SIM300 ruff violation
- Add executed plan + brainstorm report to plans/

Verified: 117 unit tests pass; integration test now stable. Sync+async symmetric; public API unchanged.
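
The separate attempts/skips budgets described in this commit can be sketched as a plain loop. This is an assumption-laden simplification: `solve_with_budgets`, `try_solve`, and the default limits are hypothetical, and the real loop also handles reloads and token waits.

```python
from typing import Callable, Iterable, Tuple


def solve_with_budgets(
    challenges: Iterable[str],
    is_supported: Callable[[str], bool],
    try_solve: Callable[[str], bool],
    max_attempts: int = 3,  # assumed default
    max_skips: int = 5,     # assumed default
) -> Tuple[bool, int, int]:
    """Return (solved, attempts_used, skips_used)."""
    attempts = skips = 0
    solved = False
    for keyword in challenges:
        if not is_supported(keyword):
            # Skipping an unsupported challenge spends the skip budget only.
            if skips >= max_skips:
                break
            skips += 1
            continue
        if attempts >= max_attempts:
            break
        attempts += 1
        if try_solve(keyword):
            solved = True
            break
    return solved, attempts, skips
```

Keeping the two counters apart means a run of unsupported challenges does not consume the attempts reserved for solvable ones.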
Implement full-image collection, cell-bbox annotation, dataset builder, and
3-tier 4x4 detection priority (COCO → custom detection → per-cell fallback).
Phase 3 (GPU training) deferred. Custom model gated behind optional config;
default off preserves existing behavior.

- DataCollector.record_challenge_image: save 4x4 full images to collected/full/
- SquareCaptchaHandler: hook full-image collection (detector.collector injection)
- YOLODetector: load/verify custom detection model (SHA256), detection+mapping
- SolverConfig: custom_detection_model_path parameter
- annotate_detection_cli.py: interactive cell→bbox annotation CLI
- prepare_detection_dataset.py: YOLO detection dataset builder (images/labels/data.yaml)
- CUSTOM_DETECTION_CLASSES/MAPPINGS: runtime detection class sync
- DETECTION_CLASSES: training/class_mapping dataset builder class registry
- 3-tier priority solve handler with per-cell fallback
- 13 new tests; ruff/mypy clean; 138 tests passing
- docs/training-and-flywheel.md: Tier B documentation

Public API unchanged. No confidential data or credentials.
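
The 3-tier priority (COCO → custom detection → per-cell fallback) amounts to a small selection function. The sketch below uses hypothetical names and tier labels; the class sets passed in are illustrative, not the project's actual registries.

```python
from typing import Collection


def pick_tier(
    target: str,
    coco_classes: Collection[str],
    custom_classes: Collection[str],
    custom_model_loaded: bool,
) -> str:
    """Choose which 4x4 strategy to use for a target class."""
    if target in coco_classes:
        return "coco"
    # The custom model is optional and off by default, so check that it loaded.
    if custom_model_loaded and target in custom_classes:
        return "custom_detection"
    # Last resort: classify each of the 16 cells independently.
    return "per_cell_fallback"
```

With no custom model configured, the second branch is never taken, which matches the stated default of preserving existing behavior.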
- training/device_utils.resolve_device: CUDA > MPS > CPU auto-detect
- train.py: --device defaults to auto, add --amp/--no-amp (use --no-amp on flaky MPS)
- Apple Silicon (M-series) trains via Metal/MPS; no cloud GPU required for small datasets
- docs + Tier B plan notes: Mac MPS option vs Colab; train_detection.py to reuse device_utils
- train_detection.py: YOLO detect trainer (auto CUDA/MPS/CPU, --amp/--no-amp, --resume)
- write_model_card.py: model_card.json sidecar (date/task/classes/sha256)
- collect.py: loop driver to accumulate per-cell + full-4x4 data with progress counts
- class_mapping: enforce DETECTION_CLASSES == types.CUSTOM_DETECTION_CLASSES contract
- pyproject: add onnx/onnxslim to dev extras (export-time only; runtime keeps onnxruntime)
- smoke-verified train_detection on MPS (synthetic data -> best.pt, task=detect)

Real model still needs human-annotated 4x4 data (no auto pseudo-label).
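
The CUDA > MPS > CPU preference can be sketched without importing PyTorch by passing availability flags in. The signature below is an assumption and is not claimed to match `training/device_utils.resolve_device`; in practice the flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(
    requested: str = "auto",
    *,
    cuda_available: bool = False,
    mps_available: bool = False,
) -> str:
    """Map a --device value to a concrete device string."""
    if requested != "auto":
        # An explicit choice is passed through untouched.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```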
- auto_annotate_capmonster.py: label collected 4x4 images via CapMonster image mode
  (ComplexImageTask/recaptcha, ~$0.04/1000) -> annotations.jsonl, 0->1-indexed cells
- only the 7 detection classes labeled; resilient per-image (network errors skipped)
- API key via --api-key or env CAPMONSTER_API_KEY (no secret in repo)
- docs: manual vs auto annotation; cap.guru can't do image mode (token only)
solution.answer is a 16-element bool mask (True=cell has target), not a cell-index
list. Previous code coerced bools to ints producing wrong cells ([1,2] for all).
Now map True flags -> 0-indexed cells. Verified end-to-end: collected images ->
auto-labels -> YOLO detection dataset. Caught by spot-checking identical outputs.
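
The corrected mapping from the 16-element bool mask to cell indices can be sketched as follows. The function name and the `one_indexed` switch are assumptions for illustration; the fix described above produces 0-indexed cells.

```python
from typing import List, Sequence


def mask_to_cells(answer: Sequence[bool], one_indexed: bool = False) -> List[int]:
    """Convert a 4x4 mask (True = cell contains the target) into cell indices."""
    if len(answer) != 16:
        raise ValueError(f"expected 16 flags for a 4x4 grid, got {len(answer)}")
    offset = 1 if one_indexed else 0
    # Use each flag's position, not its value: int(True) is always 1.
    return [i + offset for i, flag in enumerate(answer) if flag]
```

Treating the flags themselves as indices can only ever yield the values 0 and 1, which is why every image appeared to produce the same cells.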