feat: implement word-click solver using YOLOv8 and Siamese similarity #3
Merged
- Replaced the LLM vision path with faster YOLOv8 detection and Siamese similarity matching for character recognition in word-click challenges.
- Introduced a new HTTP server for handling solve requests, improving usability for integrations and concurrent solves.
- Added ONNX models for YOLOv8 detection and Siamese matching, along with a bundled font for character rendering.
- Implemented a fallback to legacy ddddocr detection when the primary path is unavailable.
- Enhanced dependency management for ONNX Runtime and OpenCV, ensuring compatibility across Python versions.
Reviewer's Guide
Replaces the word_click LLM-vision solver with a local YOLOv8 + Siamese ONNX pipeline, adds a long-running HTTP server with ONNX warmup, bundles required models/fonts into the wheel via a new word-click extra, and updates docs/CLI/config to reflect the new architecture and runtime behavior.
Sequence diagram for updated word_click YOLO+Siamese solver pipeline
sequenceDiagram
actor User
participant CLI as cli_main
participant Core as solve
participant Pipeline as pipelines_word_click
participant WordOCR as solvers_word_ocr
participant Legacy as legacy_ddddocr
User->>CLI: run crack-tcaptcha solve
CLI->>CLI: start word_click_warmup thread
CLI->>Core: solve(appid, max_retries, entry_url)
Core->>Pipeline: solve_one_attempt(dyn_show_info)
Pipeline->>WordOCR: locate_chars_by_siamese(bg_bytes, targets)
activate WordOCR
WordOCR->>WordOCR: _bytes_to_bgr(bg_bytes)
WordOCR->>WordOCR: _get_yolo_session()
WordOCR->>WordOCR: _yolo_detect(bg_bgr)
alt yolo_bboxes_found
WordOCR->>WordOCR: _render_char(target)
WordOCR->>WordOCR: _siamese_score_batch(crops, ref_img)
WordOCR-->>Pipeline: click_coords[(cx, cy)...]
Pipeline->>Core: pow + trajectory + verify
Core-->>CLI: SolveResult(ok, ...)
CLI-->>User: print or JSON output
else yolo_error_or_zero_bboxes
WordOCR-->>Pipeline: raise SolveError
Pipeline->>Legacy: _fallback_ddddocr(bg_bytes, targets)
Legacy-->>Pipeline: click_coords[(cx, cy)...]
Pipeline->>Core: pow + trajectory + verify
Core-->>CLI: SolveResult(ok_or_false,...)
CLI-->>User: print or JSON output
end
deactivate WordOCR
Sequence diagram for HTTP /solve in the new server
sequenceDiagram
actor Client
participant SRV as server_HTTP
participant H as HttpHandler
participant EXEC as ThreadPoolExecutor
participant Core as solve
participant WC as pipelines_word_click
participant WO as solvers_word_ocr
Client->>SRV: POST /solve {appid, retries, entry_url}
SRV->>H: dispatch request
H->>H: _check_auth(X-SK)
alt auth_ok
H->>EXEC: submit(solve, appid, max_retries, entry_url)
EXEC->>Core: solve(...)
Core->>WC: solve_one_attempt(...)
WC->>WO: locate_chars_by_siamese(bg_bytes, targets)
WO-->>WC: click_coords
WC-->>Core: SolveResult
Core-->>EXEC: SolveResult
EXEC-->>H: SolveResult
H-->>Client: 200 JSON(SolveResult + _cost_s)
else unauthorized
H-->>Client: 401 {status:error}
end
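The request flow in the diagram above (auth check, hand-off to the thread pool, timing field in the response) can be sketched without a real socket. This is an illustration under assumptions, not the PR's `HttpHandler`: `handle_solve` is a hypothetical helper, the `"unauthorized"` message text is invented, and treating an empty `sk` as "auth disabled" is a guess. The `X-SK` header, the `appid` / `retries` / `entry_url` fields, the `missing appid` 400 response and the `_cost_s` key come from the diagram and the reviewed code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServerState:
    executor: ThreadPoolExecutor
    sk: str  # shared secret expected in X-SK; empty disables auth (assumption)


def handle_solve(
    headers: dict[str, str],
    body: dict,
    state: ServerState,
    solve: Callable[..., dict],
) -> tuple[int, dict]:
    """Return (status_code, json_payload) for one POST /solve request."""
    if state.sk and headers.get("X-SK") != state.sk:
        return 401, {"status": "error", "msg": "unauthorized"}

    appid = body.get("appid")
    if not appid:
        return 400, {"status": "error", "msg": "missing appid"}

    started = time.perf_counter()
    future = state.executor.submit(
        solve, appid, body.get("retries", 3), body.get("entry_url", "")
    )
    # result() blocks only the thread handling this request.
    result = dict(future.result())
    result["_cost_s"] = round(time.perf_counter() - started, 3)
    return 200, result
```

Keeping the logic in a plain function that returns a status code and payload makes it testable without starting a server; the real handler would pass the pair to `_send_json`.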
Class diagram for new solver and server modules
classDiagram
class WordOcrSolver {
<<module>>
+locate_chars_by_siamese(bg_bytes: bytes, targets: list~str~) list~tuple~int,int~~
+warmup() void
-_bytes_to_bgr(byte_data: bytes) ndarray
-_render_char(char: str) ndarray
-_yolo_detect(bg_bgr: ndarray) list~tuple~int,int,int,int~~
-_siamese_score_batch(crops: list~ndarray~, ref: ndarray) list~float~
}
class OrtProvider {
<<module>>
+resolve_providers() list~str~
-_BACKEND_MAP dict
-_AUTO_PRIORITY tuple
}
class WordClickPipeline {
+solve_one_attempt(dyn_show_info: dict) SolveResult
-_fallback_ddddocr(bg_bytes: bytes, targets: list~str~) list~tuple~int,int~~
}
class ServerState {
<<internal>>
+executor: ThreadPoolExecutor
+sk: str
+providers: list~str~
+started_at: float
}
class HttpHandler {
<<BaseHTTPRequestHandler>>
+do_GET() void
+do_POST() void
-_send_json(code: int, payload: dict) void
-_check_auth() bool
-state: ServerState
}
class ServerModule {
<<module>>
+run(host: str, port: int, workers: int, sk: str) void
+main(argv: list~str~) void
-_warmup_all() list~str~
}
class CliModule {
<<module>>
+main(argv: list~str~) void
-_warmup_word_click() void
}
WordClickPipeline --> WordOcrSolver : uses
WordOcrSolver --> OrtProvider : uses
ServerModule --> ServerState : creates
ServerModule --> HttpHandler : configures
ServerModule --> WordOcrSolver : calls warmup
CliModule --> WordOcrSolver : calls warmup
CliModule --> ServerModule : dispatch_serve_command
CliModule --> WordClickPipeline : indirect_via_solve
Tests pass, but in practice it is still not very fast; issue #2 needs to be considered.
Hey - I've found 4 issues, and left some high level feedback:
- The `word_ocr.py` module docstring still refers to `yolo_word.onnx` / `siamese_word.onnx`, but the actual bundled files are `word_click_detector.onnx` / `word_click_matcher.onnx`; aligning these names will avoid confusion when debugging model issues.
- `_render_char` recreates the `ImageFont.truetype` font object on every call, which can be a noticeable overhead when there are many targets; consider caching the loaded font at module scope and reusing it.
- In `server._warmup_all` you import and call `_get_yolo_session` / `_get_siamese_session`, which are private helpers; it would be more robust to either expose public accessors for provider info or rely solely on the public `warmup()` API so refactors of internals don't break the server.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `word_ocr.py` module docstring still refers to `yolo_word.onnx` / `siamese_word.onnx`, but the actual bundled files are `word_click_detector.onnx` / `word_click_matcher.onnx`; aligning these names will avoid confusion when debugging model issues.
- `_render_char` recreates the `ImageFont.truetype` font object on every call, which can be a noticeable overhead when there are many targets; consider caching the loaded font at module scope and reusing it.
- In `server._warmup_all` you import and call `_get_yolo_session` / `_get_siamese_session`, which are private helpers; it would be more robust to either expose public accessors for provider info or rely solely on the public `warmup()` API so refactors of internals don’t break the server.
## Individual Comments
### Comment 1
<location path="src/crack_tcaptcha/solvers/word_ocr.py" line_range="171-179" />
<code_context>
+    return img
+
+
+def _render_char(char: str) -> np.ndarray:
+    """Render one CJK char to a 52×52 BGR image using the bundled font."""
+    from PIL import Image, ImageDraw, ImageFont
+
+    if not _FONT_PATH.is_file():
+        raise SolveError(f"word_click: missing font at {_FONT_PATH}")
+    img = Image.new("RGB", (_CHAR_RENDER_SIZE, _CHAR_RENDER_SIZE), color="white")
+    draw = ImageDraw.Draw(img)
+    font = ImageFont.truetype(str(_FONT_PATH), _CHAR_RENDER_FONT_SIZE)
+    bbox = font.getbbox(char)
+    text_w = bbox[2] - bbox[0]
</code_context>
<issue_to_address>
**suggestion (performance):** Avoid reloading the TTF font on every character render to reduce per-request latency.
`_render_char` creates a new `ImageFont.truetype` on every call, which is costly when `locate_chars_by_siamese` runs often or over many targets. Cache the font once (e.g., a module-level `_FONT` with locking if needed) and reuse it. You might also cache rendered glyphs per character to avoid repeated rendering across calls.
Suggested implementation:
```python
# Cache the TrueType font and rendered glyphs to avoid per-call overhead.
_CHAR_FONT = None
_CHAR_FONT_LOCK = threading.Lock()
_CHAR_GLYPH_CACHE: dict[str, np.ndarray] = {}
_CHAR_GLYPH_CACHE_LOCK = threading.Lock()


def _render_char(char: str) -> np.ndarray:
    """Render one CJK char to a 52×52 BGR image using the bundled font."""
    from PIL import Image, ImageDraw, ImageFont

    global _CHAR_FONT

    if not _FONT_PATH.is_file():
        raise SolveError(f"word_click: missing font at {_FONT_PATH}")

    # Fast path: return cached glyph if available.
    with _CHAR_GLYPH_CACHE_LOCK:
        cached = _CHAR_GLYPH_CACHE.get(char)
        if cached is not None:
            # Return a copy so callers can't mutate the cached image.
            return cached.copy()

    # Lazily initialize and cache the font once.
    with _CHAR_FONT_LOCK:
        if _CHAR_FONT is None:
            _CHAR_FONT = ImageFont.truetype(str(_FONT_PATH), _CHAR_RENDER_FONT_SIZE)
        font = _CHAR_FONT

    img = Image.new("RGB", (_CHAR_RENDER_SIZE, _CHAR_RENDER_SIZE), color="white")
    draw = ImageDraw.Draw(img)
    bbox = font.getbbox(char)
    text_w = bbox[2] - bbox[0]
    x = (_CHAR_RENDER_SIZE - text_w) // 2
    y = -3
    draw.text((x, y), char, fill=_CHAR_RENDER_COLOR, font=font)

    cv2 = _import_cv2()
    bgr_img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)

    # Store in cache for subsequent calls.
    with _CHAR_GLYPH_CACHE_LOCK:
        _CHAR_GLYPH_CACHE[char] = bgr_img

    return bgr_img.copy()
```
1. At the top of `src/crack_tcaptcha/solvers/word_ocr.py`, ensure `threading` is imported:
- Add `import threading` if it is not already present.
2. If the project targets Python versions earlier than 3.9 and does not support `dict[str, np.ndarray]` type hints, change `_CHAR_GLYPH_CACHE: dict[str, np.ndarray] = {}` to a compatible annotation such as `from typing import Dict` and `Dict[str, np.ndarray]`.
</issue_to_address>
### Comment 2
<location path="src/crack_tcaptcha/server.py" line_range="136" />
<code_context>
+        if not appid:
+            self._send_json(400, {"status": "error", "msg": "missing appid"})
+            return
+        retries = int(body.get("retries", body.get("max_retries", 3)))
+        entry_url = body.get("entry_url", "")
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Validate `retries` from the request body to avoid 500s on bad input.
Parsing `retries` with `int(...)` raises `ValueError` for non-numeric strings (e.g. "many" or "2.5") and `TypeError` for values such as `null` or a list, while a JSON float like 2.5 is silently truncated. Each exception turns a client error into a 500. Please validate `retries` first and return a 400 with a clear message when it's not a positive integer or is out of the allowed range.
</issue_to_address>
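One way to do the validation described in this comment is a small parser that never raises and returns either a value or an error message. This is a sketch under assumptions: `parse_retries` is a hypothetical helper, and the upper bound `MAX_RETRIES = 10` is invented here because the PR does not state an allowed range. The `retries` / `max_retries` keys and the default of 3 come from the reviewed line.

```python
from typing import Any, Optional

MAX_RETRIES = 10  # assumed upper bound; the PR does not specify one


def parse_retries(
    body: dict[str, Any], default: int = 3
) -> tuple[Optional[int], Optional[str]]:
    """Return (retries, None) on success or (None, error_message) on bad input."""
    raw = body.get("retries", body.get("max_retries", default))
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool):
        return None, "retries must be an integer"
    if isinstance(raw, str):
        try:
            raw = int(raw.strip())
        except ValueError:
            return None, "retries must be an integer"
    # Floats, None, lists and dicts all end up here.
    if not isinstance(raw, int):
        return None, "retries must be an integer"
    if not 1 <= raw <= MAX_RETRIES:
        return None, f"retries must be between 1 and {MAX_RETRIES}"
    return raw, None
```

In the handler, a non-`None` error message would be sent back with `self._send_json(400, {"status": "error", "msg": err})` before any work is submitted to the executor.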
### Comment 3
<location path="AGENTS.md" line_range="28" />
<code_context>
# Install with optional extras
-uv sync --extra icon-click # adds ddddocr + onnxruntime (needed for icon_click and word_click)
+uv sync --extra icon-click # ddddocr + onnxruntime (icon_click pipeline)
+uv sync --extra word-click # onnxruntime + opencv-headless + ddddocr (word_click pipeline, local YOLO+Siamese)
uv sync --extra dev # pytest, respx, ruff, hypothesis
uv sync --extra docs # mkdocs-material
</code_context>
<issue_to_address>
**issue (typo):** The package name here likely should be `opencv-python-headless`, and you may want spacing around `YOLO + Siamese`.
In other docs (README extras table and `docs/word-click.md`), this extra is documented as `opencv-python-headless`, so using `opencv-headless` here is inconsistent and could mislead users installing the extra. Also, for consistency with other references, consider `local YOLO + Siamese` instead of `YOLO+Siamese`.
```suggestion
uv sync --extra word-click # onnxruntime + opencv-python-headless + ddddocr (word_click pipeline, local YOLO + Siamese)
```
</issue_to_address>
### Comment 4
<location path="src/crack_tcaptcha/solvers/word_ocr.py" line_range="268" />
<code_context>
+    return arr[None, ...]
+
+
+def _siamese_score_batch(crops: list[np.ndarray], ref: np.ndarray) -> list[float]:
+    """Score every crop against the ref in one (or as few as possible) ORT calls.
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `_siamese_score_batch` and the greedy assignment loop to centralize preprocessing, isolate the batch-detection logic, and use more declarative index selection for clearer, less error-prone control flow.
You can keep all current behavior but simplify two of the more complex areas: `_siamese_score_batch` and the greedy assignment.
---
### 1) Simplify `_siamese_score_batch` structure
Right now the function both:
* Detects whether batching is supported.
* Prepares crops twice (once in the batch `try`, once in the per‑pair path).
You can reduce branching and duplicate work by:
* Preprocessing all crops once up front.
* Moving the “try batch once, then cache” logic into a small helper.
* Keeping `_siamese_batch_supported` behavior exactly the same.
This keeps the feature (dynamic detection + fallback) but makes the main flow easier to read and cheaper.
```python
def _siamese_score_batch(crops: list[np.ndarray], ref: np.ndarray) -> list[float]:
    global _siamese_batch_supported

    if not crops:
        return []

    sess = _get_siamese_session()
    assert _siamese_input_names is not None
    n0, n1 = _siamese_input_names

    ref_prepped = _prep_siamese(ref)  # (1, 3, 52, 52)
    prepped = [_prep_siamese(c) for c in crops]  # list of (1, 3, 52, 52)

    # try batched once; cache decision
    if _siamese_batch_supported is not False:
        try:
            batch = np.concatenate(prepped, axis=0)  # (N, 3, 52, 52)
            refs = np.repeat(ref_prepped, len(prepped), 0)  # (N, 3, 52, 52)
            pred = sess.run(None, {n0: batch, n1: refs})[0]
            arr = np.asarray(pred).reshape(-1)
            if arr.size == len(prepped):
                _siamese_batch_supported = True
                return [float(v) for v in arr]
        except Exception as e:
            log.info("word_click siamese batch not supported, using per-pair: %s", e)
            _siamese_batch_supported = False

    # per-pair fallback (same semantics as current code)
    out: list[float] = []
    for p in prepped:
        pred = sess.run(None, {n0: p, n1: ref_prepped})[0]
        out.append(float(np.asarray(pred).reshape(-1)[0]))
    return out
```
Benefits:
* Only one `_prep_siamese` loop.
* The control flow is linear with a clearly isolated “batch probe” block.
* Maintains the same logging and `_siamese_batch_supported` behavior.
---
### 2) Make greedy assignment more declarative
The greedy assignment currently manually tracks `best_idx` / `best_score` and then re‑scans when everything is used. You can make this more readable by:
* Maintaining a set of unused indices.
* Precomputing the “global best” index per target for the reuse case.
* Using `max(..., key=...)` instead of hand‑rolled loops.
This preserves exactly the same behavior (best unused per target, otherwise reuse best overall), but shrinks the imperative logic.
```python
# Full score matrix: rows = targets, cols = crop indices.
score_matrix: list[list[float]] = []
for ch in targets:
    ref = _render_char(ch)
    score_matrix.append(_siamese_score_batch(crops, ref))

# Precompute global best index per target for the "reuse best overall" case.
global_best_idx: list[int] = []
for scores in score_matrix:
    if not scores:
        global_best_idx.append(-1)
        continue
    global_best_idx.append(max(range(len(scores)), key=scores.__getitem__))

result: list[tuple[int, int]] = []
used: set[int] = set(range(len(crops)))  # start with all, then flip logic?
```
Better with unused set:
```python
result: list[tuple[int, int]] = []
unused: set[int] = set(range(len(crops)))

for ti, ch in enumerate(targets):
    scores = score_matrix[ti]

    # best among unused, if any
    if unused:
        best_idx = max(unused, key=scores.__getitem__)
        best_score = scores[best_idx]
    else:
        best_idx = global_best_idx[ti]
        if best_idx < 0:
            raise SolveError(f"word_click: no candidate for target {ch!r}")
        best_score = scores[best_idx]

    if best_idx in unused:
        unused.remove(best_idx)

    result.append(centers[best_idx])
    log.info("word_click: %r → %s (score=%.3f)", ch, centers[best_idx], best_score)
```
This keeps the same greedy strategy and reuse semantics, but is easier to follow and modify.
---
If you’re happy with the current behavior, these two focused refactors remove some of the “manual” bookkeeping and branching without dropping any of the robustness or performance tuning you’ve added.
</issue_to_address>
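The greedy strategy discussed in Comment 4 can be tried in isolation on a plain score matrix, without models or images. The sketch below is a standalone illustration, not the PR's code: `greedy_assign` is a hypothetical function that returns crop indices rather than click centers, and the scores are made up.

```python
def greedy_assign(score_matrix: list[list[float]]) -> list[int]:
    """Pick one crop index per target, in target order.

    Each target takes its best-scoring unused crop; once every crop is
    taken, later targets reuse their overall best crop.
    """
    n_crops = len(score_matrix[0]) if score_matrix else 0
    unused = set(range(n_crops))
    picks: list[int] = []
    for scores in score_matrix:
        if not scores:
            raise ValueError("no candidate crops to assign")
        pool = unused if unused else range(n_crops)
        best = max(pool, key=scores.__getitem__)
        unused.discard(best)
        picks.append(best)
    return picks


# Two targets both prefer crop 0; the second one settles for crop 1.
scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
]
print(greedy_assign(scores))  # prints [0, 1]
```

Because targets are processed in order, an early target can take a crop that a later target matches better; a globally optimal assignment would need something like the Hungarian algorithm, which the review does not ask for.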
lifefloating (Owner, Author) commented on Apr 24, 2026
lifefloating left a comment:
Consider porting tdc to Rust; the tdc step is not fast enough.
Summary by Sourcery
Replace the word_click LLM-based solver with a local YOLOv8 + Siamese ONNX pipeline and add a long-running HTTP server for reuse of loaded models.
New Features:
- … `word_click` verification challenges.
- … `/solve` and `/health` endpoints, supporting concurrent, low-latency solving over HTTP.
- … `word-click` optional dependency extra, which installs ONNX Runtime, OpenCV, and ddddocr for the new solving path.
Enhancements:
- … `word_click` pipeline to prefer the local Siamese path, falling back to the legacy ddddocr-based implementation when models or dependencies are missing or detection fails.
Build:
- … optional dependencies in `pyproject`, adding the `word-click` extra and using hatch configuration to ensure the ONNX models and font assets are included in build artifacts.
Documentation:
- … `word_click` implementation, HTTP server mode, configuration environment variables, and the updated dependency extras.