Auto-improve content extraction with OpenAI evaluation by paskal · Pull Request #74 · ukeeper/ukeeper-readability

paskal · 2026-03-29T20:42:30Z

Summary

Adds automatic extraction quality evaluation using OpenAI during content extraction
When OpenAI is configured (--openai-api-key) and no rule exists for the domain, GPT evaluates the extraction result and suggests CSS selectors if the result is poor
Iterates up to 3 times (configurable via --openai-max-iter), saves the best selector as a rule for future use
ExtractAndImprove() force mode ignores existing rules — for when a user reports bad extraction
Protected POST /api/content-parsed-wrong?url=... endpoint for force mode
Fail-open: GPT errors never break extraction — original result returned unchanged
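
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual evaluateAndImprove(): the `EvalResult` fields, the `evalFunc` and `extractFunc` types, and the choice to keep the last successful re-extraction as "best" are all assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// EvalResult is a hypothetical verdict shape; the PR does not list its fields.
type EvalResult struct {
	Good     bool   // extraction judged acceptable
	Selector string // CSS selector suggested when Good is false
}

// evalFunc and extractFunc are hypothetical stand-ins for the AI call and
// for re-extraction with a CSS selector.
type evalFunc func(ctx context.Context, text, prevSelector string) (*EvalResult, error)
type extractFunc func(selector string) (string, error)

// improve asks the evaluator up to maxIter times. It returns the text to serve
// and the selector worth saving as a rule ("" means keep the original).
func improve(ctx context.Context, original string, maxIter int, eval evalFunc, extract extractFunc) (string, string) {
	text, selector, prev := original, "", ""
	for i := 0; i < maxIter; i++ {
		res, err := eval(ctx, text, prev)
		if err != nil || res == nil {
			return original, "" // fail-open: evaluator errors never break extraction
		}
		if res.Good {
			return text, selector
		}
		prev = res.Selector // lets the evaluator avoid repeating this suggestion
		improved, err := extract(res.Selector)
		if err != nil || improved == "" {
			continue // selector matched nothing, ask again
		}
		text, selector = improved, res.Selector
	}
	return text, selector
}

// runExample wires improve to canned fakes so the sketch runs offline.
func runExample() (text, selector string, calls int) {
	eval := func(_ context.Context, current, _ string) (*EvalResult, error) {
		calls++
		if current == "full article" {
			return &EvalResult{Good: true}, nil
		}
		return &EvalResult{Selector: "article.post"}, nil
	}
	extract := func(sel string) (string, error) {
		if sel == "article.post" {
			return "full article", nil
		}
		return "", errors.New("selector matched nothing")
	}
	text, selector = improve(context.Background(), "nav junk", 3, eval, extract)
	return text, selector, calls
}

func main() {
	text, selector, calls := runExample()
	fmt.Println(text, selector, calls) // full article article.post 2
}
```

A non-empty selector returned by `improve` is what a caller would persist as the rule for the domain.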

Depends on #73 (modularise-retrieval).

Key design decisions

Inline evaluation: runs during Extract(), not async. First request to a new domain may take longer (GPT calls), but always returns the best result. Subsequent requests use the saved rule
GPT sees URL + extracted text + truncated HTML: enough context for accurate selector suggestions without excessive token cost
Force mode: passes nil rule to general parser (ignores stored rules), then re-evaluates from scratch
Image extraction deferred: extractPics (which downloads images via HTTP) runs once on the final result, not on every evaluation iteration
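
The order of these decisions can be illustrated with a toy pipeline. Everything here is hypothetical (the `pipeline` type, its fields, the canned strings, the fixed three iterations); it only shows that force mode skips the stored rule, that evaluation runs only when no rule applies, and that the image step runs once after the loop.

```go
package main

import "fmt"

// Rule is a hypothetical stored rule: a CSS selector keyed by domain.
type Rule struct{ Selector string }

// pipeline is a toy stand-in for the extractor; every field is an assumption.
type pipeline struct {
	rules       map[string]Rule
	evaluatorOn bool // true when an OpenAI key is configured
	evalCalls   int  // evaluation iterations run
	picsRuns    int  // image extraction runs
}

// lookupRule returns nil in force mode so stored rules are ignored.
func (p *pipeline) lookupRule(domain string, force bool) *Rule {
	if force {
		return nil
	}
	if r, ok := p.rules[domain]; ok {
		return &r
	}
	return nil
}

// extract shows the order of steps: rule lookup, parse, optional evaluation,
// and one image pass on the final result.
func (p *pipeline) extract(domain string, force bool) string {
	rule := p.lookupRule(domain, force)
	content := "general parser output"
	if rule != nil {
		content = "rule output for " + rule.Selector
	}
	if p.evaluatorOn && rule == nil {
		for i := 0; i < 3; i++ { // stand-in for the evaluation loop
			p.evalCalls++
			content = fmt.Sprintf("improved output (iteration %d)", i+1)
		}
	}
	p.picsRuns++ // stand-in for extractPics: once per request, after the loop
	return content
}

func main() {
	p := &pipeline{rules: map[string]Rule{"example.com": {Selector: "div.post"}}, evaluatorOn: true}
	fmt.Println(p.extract("example.com", false)) // stored rule is used, no evaluation
	fmt.Println(p.extract("example.com", true))  // force mode: rule ignored, evaluation runs
	fmt.Println(p.evalCalls, p.picsRuns)         // 3 2
}
```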

New configuration

| Flag | Env | Default | Description |
|---|---|---|---|
| `--openai-api-key` | `OPENAI_API_KEY` | none | Enables auto-evaluation when set |
| `--openai-model` | `OPENAI_MODEL` | `gpt-5.4-mini` | Model for evaluation |
| `--openai-max-iter` | `OPENAI_MAX_ITER` | `3` | Max evaluation iterations |
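
For illustration, the flag/env/default precedence in the table could be reproduced with the standard library alone. This sketch assumes the stdlib `flag` package, which may not be what the project uses; `openAIOptions`, `envOr`, and `parseOpenAIOptions` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// openAIOptions mirrors the three settings in the table (hypothetical struct).
type openAIOptions struct {
	APIKey  string
	Model   string
	MaxIter int
}

// envOr returns the environment value for key, or def when unset or empty.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// parseOpenAIOptions applies the precedence: flag, then env, then default.
func parseOpenAIOptions(args []string) (openAIOptions, error) {
	maxIter := 3
	if v := os.Getenv("OPENAI_MAX_ITER"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return openAIOptions{}, fmt.Errorf("OPENAI_MAX_ITER: %w", err)
		}
		maxIter = n
	}
	var o openAIOptions
	fs := flag.NewFlagSet("readability", flag.ContinueOnError)
	fs.StringVar(&o.APIKey, "openai-api-key", envOr("OPENAI_API_KEY", ""), "enables auto-evaluation when set")
	fs.StringVar(&o.Model, "openai-model", envOr("OPENAI_MODEL", "gpt-5.4-mini"), "model for evaluation")
	fs.IntVar(&o.MaxIter, "openai-max-iter", maxIter, "max evaluation iterations")
	if err := fs.Parse(args); err != nil {
		return openAIOptions{}, err
	}
	return o, nil
}

func main() {
	o, err := parseOpenAIOptions([]string{"--openai-max-iter=5"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Auto-evaluation is enabled only when an API key is present.
	fmt.Printf("enabled=%v model=%s maxIter=%d\n", o.APIKey != "", o.Model, o.MaxIter)
}
```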

New interface

type AIEvaluator interface {
    Evaluate(ctx context.Context, url, extractedText, htmlBody, prevSelector string) (*EvalResult, error)
}
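
A small offline implementation shows how the interface might be satisfied in tests. The `EvalResult` fields, the `lengthEvaluator` heuristic, and the `suggest` helper are hypothetical; only the method signature follows the PR, and the 60s timeout echoes the one mentioned in the commit notes.

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// EvalResult fields are assumptions; the PR does not show the struct.
type EvalResult struct {
	Good     bool
	Selector string
}

type AIEvaluator interface {
	Evaluate(ctx context.Context, url, extractedText, htmlBody, prevSelector string) (*EvalResult, error)
}

// lengthEvaluator is a toy evaluator: text shorter than minLen counts as poor.
type lengthEvaluator struct{ minLen int }

func (e lengthEvaluator) Evaluate(ctx context.Context, url, extractedText, htmlBody, prevSelector string) (*EvalResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation and deadlines
	}
	if len(strings.TrimSpace(extractedText)) >= e.minLen {
		return &EvalResult{Good: true}, nil
	}
	selector := "article"
	if prevSelector == selector {
		selector = "main" // do not repeat a selector that already failed
	}
	return &EvalResult{Selector: selector}, nil
}

// Compile-time check that the toy type satisfies the interface.
var _ AIEvaluator = lengthEvaluator{}

// suggest calls the evaluator under a 60s deadline with a placeholder URL and body.
func suggest(ev AIEvaluator, text, prevSelector string) (*EvalResult, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	return ev.Evaluate(ctx, "https://example.com/post", text, "<html>...</html>", prevSelector)
}

func main() {
	ev := lengthEvaluator{minLen: 20}
	first, _ := suggest(ev, "too short", "")
	second, _ := suggest(ev, "too short", first.Selector)
	fmt.Println(first.Selector, second.Selector) // article main
}
```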

Extract URL fetching abstraction from the inline HTTP logic in extractWithRules. Defines Retriever interface, RetrieveResult struct, and HTTPRetriever with Safari user-agent, redirect following, and timeout support. Includes moq generate directive and comprehensive tests.
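
A rough sketch of what such an abstraction might look like, exercised against a local httptest server. The field names in `RetrieveResult`, the `Retrieve` signature, the user-agent string, and `fetchDemo` are assumptions for illustration and are not taken from the PR's code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// RetrieveResult fields are assumptions about what a fetch needs to report.
type RetrieveResult struct {
	Body   []byte
	URL    string // final URL after redirects
	Header http.Header
}

type Retriever interface {
	Retrieve(ctx context.Context, url string) (*RetrieveResult, error)
}

// safariUserAgent is an illustrative value, not the string used by the project.
const safariUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"

type HTTPRetriever struct{ Timeout time.Duration }

func (h HTTPRetriever) Retrieve(ctx context.Context, url string) (res *RetrieveResult, err error) {
	client := http.Client{Timeout: h.Timeout} // follows redirects by default
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("User-Agent", safariUserAgent)
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("fetch %s: %w", url, err)
	}
	defer func() {
		// closeErr keeps the close failure from shadowing the named err.
		if closeErr := resp.Body.Close(); closeErr != nil && err == nil {
			err = closeErr
		}
	}()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read body: %w", err)
	}
	return &RetrieveResult{Body: body, URL: resp.Request.URL.String(), Header: resp.Header}, nil
}

// fetchDemo serves /old -> /new locally and fetches /old through the retriever.
func fetchDemo() (body, finalPath string, err error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/new", http.StatusFound)
	})
	mux.HandleFunc("/new", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ua="+r.Header.Get("User-Agent"))
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	var r Retriever = HTTPRetriever{Timeout: 5 * time.Second}
	res, err := r.Retrieve(context.Background(), srv.URL+"/old")
	if err != nil {
		return "", "", err
	}
	return string(res.Body), res.URL[len(srv.URL):], nil
}

func main() {
	body, path, err := fetchDemo()
	fmt.Println(path, err)
	fmt.Println(body)
}
```

Keeping the fetch behind an interface is what allows a generated mock or an alternative retriever to be swapped in without touching the extraction code.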

Generate moq mock for Retriever interface as a test-only file (retriever_mock_test.go) instead of mocks/ subpackage to avoid import cycle (mocks/retriever.go would import extractor, cycling with readability_test.go). Run gofmt on all modified files, zero lint issues.

- fix err shadowing in deferred Body.Close() in both retrievers (use closeErr)
- handle Cloudflare API success=false response explicitly instead of treating JSON error as HTML
- truncate CF API error body to 512 bytes in error messages
- add comment documenting CF retriever URL limitation (no final URL after JS redirects)
- fix pre-existing %b format verb in text.go logging (should be %v)
- replace network-dependent TestCloudflareRetriever_DefaultBaseURL with local httptest
- add TestCloudflareRetriever_SuccessFalse for the new success=false handling
- add TestExtractWithCustomRetriever integration test using RetrieverMock
- remove duplicate plan file from docs/plans/ (already in completed/)
- update README.md with new CF CLI flags and feature description
- update CLAUDE.md CI bullet to reflect split docker.yml workflow
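
The success=false handling and the 512-byte truncation could look roughly like this. The JSON envelope with `success` and `result` fields is an assumption about the Cloudflare response shape, and `cfResponse`, `truncate`, and `parseCFResponse` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxErrBody caps how much of a response body is echoed into error messages.
const maxErrBody = 512

// cfResponse is an assumed envelope: a success flag plus the rendered HTML.
type cfResponse struct {
	Success bool   `json:"success"`
	Result  string `json:"result"`
}

// truncate cuts b to at most n bytes. It may split a multi-byte character,
// which is tolerable for diagnostics but not for content.
func truncate(b []byte, n int) string {
	if len(b) <= n {
		return string(b)
	}
	return string(b[:n]) + "..."
}

// parseCFResponse returns the HTML on success and an explicit error otherwise,
// so a JSON error payload is never mistaken for page content.
func parseCFResponse(body []byte) (string, error) {
	var r cfResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", fmt.Errorf("decode response: %w (body: %s)", err, truncate(body, maxErrBody))
	}
	if !r.Success {
		return "", fmt.Errorf("api reported failure: %s", truncate(body, maxErrBody))
	}
	return r.Result, nil
}

func main() {
	html, err := parseCFResponse([]byte(`{"success":true,"result":"<p>hi</p>"}`))
	fmt.Println(html, err)

	_, err = parseCFResponse([]byte(`{"success":false,"errors":[{"message":"quota exceeded"}]}`))
	fmt.Println(err)
}
```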


Add AIEvaluator and MaxGPTIter fields to UReadability, implement evaluateAndImprove() loop that iterates with AI to find better CSS selectors, and add ExtractAndImprove() force mode that bypasses stored rules for re-evaluation.


- fix UTF-8 truncation in buildUserPrompt (rune-safe slicing for multi-byte content)
- pass prevSelector through evaluation loop so AI avoids repeating failed selectors
- fix double getText processing in extractWithSelector (return raw HTML, process once)
- add normalizeLinks and extractPics to AI-improved content
- change /api/content-parsed-wrong from GET to POST (mutating operation)
- add context timeout (60s) for OpenAI API calls
- return sentinel errInvalidJSON instead of nil,nil anti-pattern
- return error on double invalid JSON instead of silent fail-open
- create OpenAI client once via sync.Once instead of per-call
- consolidate duplicate genParser closure with getContentGeneral
- add test for retry succeeding after initial invalid JSON
- add test for Rules.Save failure not propagating
- fix CLAUDE.md mock location description
- add /api/content-parsed-wrong to README API section
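
Two of the fixes above, rune-safe truncation and the errInvalidJSON sentinel with a single retry, can be sketched as follows. The `verdict` JSON shape and the helper names `truncateRunes`, `parseVerdict`, and `evaluateWithRetry` are hypothetical; only `errInvalidJSON` is named in the commit notes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errInvalidJSON is a sentinel so callers never see a nil result with a nil error.
var errInvalidJSON = errors.New("evaluator returned invalid JSON")

// verdict is an assumed JSON shape for the model's reply.
type verdict struct {
	Good     bool   `json:"good"`
	Selector string `json:"selector"`
}

// truncateRunes keeps at most max runes and always cuts on a rune boundary.
func truncateRunes(s string, max int) string {
	if max <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset of each rune start
		if count == max {
			return s[:i]
		}
		count++
	}
	return s
}

// parseVerdict wraps decode failures in the sentinel error.
func parseVerdict(raw string) (*verdict, error) {
	var v verdict
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return nil, fmt.Errorf("%w: %v", errInvalidJSON, err)
	}
	return &v, nil
}

// evaluateWithRetry retries once on invalid JSON and reports an error if the
// second reply is also invalid.
func evaluateWithRetry(call func() (string, error)) (*verdict, error) {
	var lastErr error
	for attempt := 0; attempt < 2; attempt++ {
		raw, err := call()
		if err != nil {
			return nil, err
		}
		v, err := parseVerdict(raw)
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 4)) // héll

	replies := []string{"not json", `{"good":false,"selector":"article"}`}
	i := 0
	v, err := evaluateWithRetry(func() (string, error) { r := replies[i]; i++; return r, nil })
	fmt.Println(v.Selector, err) // article <nil>

	_, err = evaluateWithRetry(func() (string, error) { return "still not json", nil })
	fmt.Println(errors.Is(err, errInvalidJSON)) // true
}
```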

- customParser now delegates to extractWithSelector (eliminates duplicated goquery parse+find+html loop)
- image extraction moved out of evaluation loop: runs once on final result instead of every iteration
- extract "ai-evaluator" to aiEvaluatorUser constant
- fix incorrect doc comment on callAPI
- remove unused getAuth test helper
- remove redundant cancel() call and restating comments

paskal added 27 commits March 29, 2026 20:28

feat: implement CloudflareRetriever for Browser Rendering API

5bcd16d

feat: wire Retriever interface into UReadability extraction pipeline

4b932a3

feat: add CLI flags and wire Cloudflare retriever in main.go

6128691

feat: verify acceptance criteria for Retriever interface

8d560a0

feat: update documentation for Retriever interface and CLI flags

a7ff6d0

fix: address code review findings

7119f29

fix: address code review findings

ef56b96

fix: address codex review findings

298fa03

fix: address code review findings

e0562ed

fix: address code review findings

60f1a07

fix: address code review findings

38255d5

fix: cache default retriever, add defensive timeouts, extract constants

cd8ab64

fix: revert token auth addition to POST /api/extract

4bdfd51

POST /api/extract never had token auth in the original code. The checkToken refactoring should only apply to the legacy /content/v1/parser endpoint which always had it.

docs: add OpenAI auto-extraction improvement plan

1ef736b

feat: add AIEvaluator interface and OpenAI implementation for extract…

9a71639

…ion quality evaluation

feat: add OpenAI CLI flags and wire evaluator into main.go

e1be5f6

feat: add REST endpoint for force-mode re-extraction with AI evaluation

600bd8c

Add GET /api/content-parsed-wrong protected endpoint that triggers ExtractAndImprove() to re-evaluate and improve extraction for a URL.

feat: run linter and final code quality checks

02fa8fd

feat: verify all acceptance criteria for OpenAI auto-extraction

93aebe5

feat: update documentation for OpenAI auto-extraction feature

6a5e14d

fix: address code review findings

4df81fa

paskal mentioned this pull request Mar 29, 2026

Add OpenAI-powered content parsing improvement feature #34

Closed

umputun deleted the branch modularise-retrieval April 12, 2026 20:49

umputun closed this Apr 12, 2026

paskal reopened this May 13, 2026

paskal wants to merge 27 commits into modularise-retrieval from openai-auto-extraction
