Named entity extraction and resolution from digitised historical documents.
semant-ne processes a single document at a time: it reads OCR output (PAGE XML or ALTO XML), prepares clean text blocks, sends them to an LLM for named entity extraction, and optionally links the discovered entities to an existing Weaviate database and updates the records directly.
The tool is designed for large-scale historical digitisation pipelines where many documents are processed in parallel. Cross-document consistency is maintained by Weaviate and the durable per-document JSON output — not by shared memory.
- Architecture
- Quick start
- Installation
- Preparing a document
- Configuration
- CLI reference
- Output format
- Weaviate integration
- Example configurations
- Code structure
- Testing
- Known limitations
flowchart TD
A["Document directory\n(.jpg, .xml, document.json)"] --> B["Document loader\nsemant_ne.document_loader"]
B --> C["PAGE XML / ALTO parser\npagexml_parser · alto_parser"]
C --> D["Text block builder\nblock_preparation.simple_ocr_blocks"]
D --> E{"--extract?"}
E -- no --> OUT["DocumentResult JSON\n(blocks only)"]
E -- yes --> F["LLM page extractor\nllm.page_extractor"]
F --> G["Entity ledger\nledger.EntityLedger"]
G --> H["Claim validator\nclaim_validator"]
H --> I["Entity consolidation\nconsolidation"]
I --> J{"--update-weaviate?"}
J -- none --> OUT2["DocumentResult JSON\n(entities + mentions)"]
J -- evidence or direct --> K["Weaviate resolver\nweaviate.resolver"]
K --> L["Evidence writer\nweaviate.evidence_writer"]
K --> M["Record updater\nweaviate.record_updater"]
L --> OUT3["DocumentResult JSON\n(+ Weaviate update log)"]
M --> OUT3
erDiagram
EntityRecord {
string db_entity_id PK
string entity_type
string canonical_label
string[] aliases
string search_text
string short_description
string markdown_text
string[] evidence_document_ids
int semant_revision
}
EntityDocumentEvidence {
string evidence_id PK
string document_id
string local_entity_id
string db_entity_id FK
string entity_type
int mention_count
float confidence
string run_id
}
EntityRecord ||--o{ EntityDocumentEvidence : "has evidence"
flowchart TD
A["LocalEntity from ledger"] --> B["BM25 search in Weaviate\n(search_text + canonical_label)"]
B --> C{"candidates found?"}
C -- no --> D["status: new_entity_candidate"]
C -- yes --> E["Score each candidate\n_score_candidate()"]
E --> F{"top score ≥\nauto_link_threshold\nAND margin ≥ min_margin?"}
F -- yes --> G["status: linked"]
F -- no --> H{"top score ≥\nreview_threshold?"}
H -- no --> I["status: new_entity_candidate"]
H -- yes --> J{"use_llm_resolver?"}
J -- no --> K["status: ambiguous"]
J -- yes --> L["LLM resolver call"]
L --> M{"LLM decision"}
M -- link --> G
M -- ambiguous --> K
M -- new --> I
# 1. Install
pip install -e ".[dev]"
# 2. Block preparation only (no LLM needed)
semant-ne process-document \
--doc-root ./document_2 \
--output ./output/doc2_blocks.json
# 3. Full extraction with local vLLM
semant-ne process-document \
--doc-root ./document_2 \
--output ./output/doc2.json \
--config ./configurations/doc2_vllm.yaml \
--extract
# 4. Validate the output
semant-ne validate-output --input ./output/doc2.json
# 5. Inspect prepared blocks
semant-ne inspect-document-blocks --input ./output/doc2.json --page 1
- Python ≥ 3.11
- An OpenAI-compatible LLM endpoint (vLLM, Ollama, …) for --extract
- Weaviate v4 for --update-weaviate evidence|direct
git clone <repo-url>
cd semant-ne
# Create and activate a virtual environment
python -m venv .venv && source .venv/bin/activate
# Install the package (core dependencies)
pip install -e .
# Install optional extras
pip install -e ".[weaviate]" # Weaviate client
pip install -e ".[dev]" # pytest, ruff, mypy
pip install vllm
vllm serve google/gemma-4-E2B-it \
--max-model-len 24000 \
--reasoning-parser gemma4 \
--port 26001
Any OpenAI-compatible model works; adjust llm.text_model in the config accordingly.
docker run -d \
--name weaviate \
-p 8080:8080 \
cr.weaviate.io/semitechnologies/weaviate:latest
# Create required collections (idempotent)
semant-ne init-weaviate --url http://localhost:8080
A document directory contains:
| File pattern | Required | Purpose |
|---|---|---|
| *.page.xml or *.xml (PAGE XML) | either/or | OCR text with bounding boxes |
| *.alto.xml or *.xml (ALTO) | either/or | OCR text with bounding boxes |
| *.jpg / *.png / *.tif | optional | Page images (for vision models) |
| document.json | optional | Document-level metadata |
| pages.json | optional | Explicit page ordering |
Bare .xml files are auto-detected as PAGE XML or ALTO by namespace sniffing.
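Namespace sniffing can be sketched as follows. This is a minimal illustration and not the loader's actual code: the function name `sniff_ocr_format` and its return values are hypothetical, and the only facts relied on are the public namespace URIs of the PAGE and ALTO schemas.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Fragments of the public schema namespaces used by the two OCR formats.
PAGE_NS_MARKER = "primaresearch.org/PAGE"
ALTO_NS_MARKER = "loc.gov/standards/alto"


def sniff_ocr_format(xml_bytes: bytes) -> str:
    """Return 'pagexml', 'alto', or 'unknown' based on the root element namespace.

    Hypothetical helper: the name and return values are illustrative only.
    """
    # The first 'start' event is the root element, so we can stop right there
    # instead of parsing the whole OCR file.
    for _event, elem in ET.iterparse(BytesIO(xml_bytes), events=("start",)):
        tag = elem.tag  # namespaced tags look like '{namespace}LocalName'
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        if PAGE_NS_MARKER in namespace:
            return "pagexml"
        if ALTO_NS_MARKER in namespace:
            return "alto"
        return "unknown"
    return "unknown"
```

Matching on a namespace fragment rather than the full URI keeps the check tolerant of the older PAGE namespaces and of the different ALTO versions.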
{
"document_id": "mc_abe045-00b4qk",
"title": "Sokol: časopis pro tělesnou výchovu",
"date": "1940",
"language": ["cs"],
"authors": []
}
All fields are optional. If omitted, the document ID defaults to the directory name.
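The fallback rule can be illustrated with a short sketch. The helper name `load_document_metadata` is an assumption made for this example; the real loader lives in semant_ne.document_loader and may behave differently in edge cases.

```python
import json
from pathlib import Path


def load_document_metadata(doc_root: Path) -> dict:
    """Read document.json if present and apply the document ID fallback.

    Hypothetical helper, shown only to illustrate the rule described above.
    """
    metadata: dict = {}
    meta_path = doc_root / "document.json"
    if meta_path.is_file():
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    # When no document_id is given, fall back to the directory name.
    metadata.setdefault("document_id", doc_root.resolve().name)
    return metadata
```

For a directory ./document_2 without a document.json, this sketch returns {"document_id": "document_2"}.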
| Format | Support |
|---|---|
| PAGE XML 2019-07-15 | Full |
| PAGE XML (older namespaces) | Graceful fallback |
| ALTO 2.x / 3.x | Full |
| Mixed (PAGE XML + ALTO for same page) | PAGE XML preferred |
All configuration is in a YAML file passed via --config. If omitted, defaults are used.
# LLM connection and extraction settings
llm:
  base_url: "http://localhost:26001/v1" # OpenAI-compatible endpoint
  api_key: "token-abc123" # any non-empty string for vLLM
  text_model: "google/gemma-4-E2B-it"
  max_tokens_per_request: 4096
  thinking_budget_tokens: null # for models with reasoning mode
  include_page_image: false # pass page image to vision model
  max_blocks_per_batch: 30 # pages with more blocks are split
# Document-local entity deduplication
merging:
  mode: "python_conservative" # python_conservative | llm_page | final_llm_consolidation
  final_consolidation: false
  python_merge_threshold: 0.92
# Block preparation tuning
blocks:
  target_chars: 1200
  max_chars: 2500
  overlap_chars: 150
# Weaviate connection (needed for --update-weaviate evidence|direct)
weaviate:
  url: "http://localhost:8080"
  api_key: null
  entity_record_collection: "EntityRecord"
  evidence_collection: "EntityDocumentEvidence"
# Entity resolution thresholds
resolution:
  auto_link_threshold: 0.92 # score above this → auto-link
  review_threshold: 0.75 # score above this → ambiguous / LLM check
  min_margin_for_auto_link: 0.08 # min gap between top-2 candidates
  use_llm_resolver: true
  max_candidates: 5
# Weaviate write policy
updates:
  update_weaviate: "none" # none | evidence | direct
  write_entity_document_evidence: true
  direct_update_entity_record: true
  add_aliases: true
  fill_empty_short_description: true
  overwrite_short_description: false
  append_markdown_evidence: true
  add_external_ids: false
  add_facts: false
  tolerate_parallel_update_loss: true
language_hints:
  - "cs"
See configurations/ for ready-to-use examples.
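As an illustration of the llm.max_blocks_per_batch setting, the sketch below splits a page's blocks into consecutive batches of at most that size. It is a simplified assumption of how batching could work; the function name `batch_blocks` is hypothetical, and the real page extractor may also account for token budgets.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def batch_blocks(blocks: Sequence[T], max_blocks_per_batch: int = 30) -> list[list[T]]:
    """Split blocks into consecutive batches, preserving reading order.

    Hypothetical helper: only the batch size limit comes from the config above.
    """
    if max_blocks_per_batch < 1:
        raise ValueError("max_blocks_per_batch must be at least 1")
    return [
        list(blocks[start:start + max_blocks_per_batch])
        for start in range(0, len(blocks), max_blocks_per_batch)
    ]
```

Under this sketch, a page with 65 blocks and the default limit of 30 is sent as three requests of 30, 30, and 5 blocks.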
The main command. Prepares text blocks and optionally runs LLM extraction and Weaviate integration.
semant-ne process-document
--doc-root PATH Document directory [required]
--output PATH Output JSON path [required]
--doc-id TEXT Override document ID
--config PATH Config YAML (defaults used if omitted)
--extract / --no-extract Run LLM entity extraction (default: --no-extract)
--update-weaviate TEXT none|evidence|direct (default: none)
--verbose / -v
Check a previously produced JSON file for schema and cross-reference errors.
semant-ne validate-output --input PATH
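A cross-reference check of this kind might look like the sketch below. It works on the parsed output JSON as a plain dict and covers only two rules (mentions must point to a known entity and to an existing block); the function name and the error messages are assumptions, and the actual checks in semant_ne.validation may be broader.

```python
def find_cross_reference_errors(result: dict) -> list[str]:
    """Return human-readable errors for dangling references in an output dict.

    Hypothetical helper: field names follow the output format of this README.
    """
    errors: list[str] = []
    # Every (page_id, block_id) pair that actually exists in the document.
    block_keys = {
        (page["page_id"], block["block_id"])
        for page in result.get("prepared_pages", [])
        for block in page.get("blocks", [])
    }
    entity_ids = {e["local_entity_id"] for e in result.get("unique_entities", [])}

    for mention in result.get("mentions", []):
        mention_id = mention.get("mention_id", "<no id>")
        if mention.get("local_entity_id") not in entity_ids:
            errors.append(f"{mention_id}: unknown local_entity_id")
        anchor = mention.get("anchor") or {}
        if (anchor.get("page_id"), anchor.get("block_id")) not in block_keys:
            errors.append(f"{mention_id}: anchor points to a missing block")
    return errors
```

An empty list means every mention resolves to an entity and a block that are both present in the same file.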
Pretty-print the prepared text blocks from a JSON output file.
semant-ne inspect-document-blocks --input PATH [--page N]
Create the required Weaviate collections (idempotent).
semant-ne init-weaviate
[--url TEXT] default: http://localhost:8080
[--api-key TEXT]
[--entity-record-collection TEXT] default: EntityRecord
[--evidence-collection TEXT] default: EntityDocumentEvidence
[--force] drop and recreate
Re-run all Weaviate writes for a previously produced JSON (useful after failures or schema changes).
semant-ne reapply-weaviate-updates
--input PATH
[--config PATH]
[--evidence-only]
The output is a single JSON file following schema_version: "semant-ne-document-v2".
{
"schema_version": "semant-ne-document-v2",
"document_id": "mc_abe045-00b4qk",
"run": {
"run_id": "2026-05-28T02:13:00Z_mc_abe045-00b4qk_abcd1234",
"started_at": "...",
"finished_at": "...",
"tool_version": "0.1.0",
"weaviate_update_mode": "none"
},
"document_metadata": { "title": "...", "date": "1940", "language": ["cs"] },
"prepared_pages": [
{
"page_id": "mc_abe045-00b4qk_0003",
"page_index": 1,
"blocks": [
{
"block_id": "mc_abe045-00b4qk_0003_b0001",
"text": "SPLŇUJE SE TO, PO ČEM SOKOLSTVO VOLALO PŘED PĚTI LÉTY.",
"bbox": { "x": 282, "y": 136, "w": 1020, "h": 270 },
"confidence": 1.0
}
]
}
],
"unique_entities": [
{
"local_entity_id": "le_000001",
"entity_type": "organization",
"canonical_label": "Sokolstvo",
"surface_forms": ["Sokolstvo", "Sokol"],
"resolution": {
"status": "new_entity_candidate",
"confidence": 0.0,
"explanation": "No database candidates found."
}
}
],
"mentions": [
{
"mention_id": "m_a1b2c3d4e5f6",
"local_entity_id": "le_000001",
"surface_text": "Sokolstvo",
"anchor": {
"page_id": "mc_abe045-00b4qk_0003",
"block_id": "mc_abe045-00b4qk_0003_b0001",
"char_start": 42,
"char_end": 51
}
}
],
"weaviate_updates": [],
"warnings": []
}
EntityRecord: one object per canonical named entity in your database. The search_text field is BM25-indexed for fast retrieval.
Key fields: db_entity_id, entity_type, canonical_label, aliases, search_text, short_description, markdown_text, evidence_document_ids, semant_revision.
EntityDocumentEvidence: one object per document × entity link. Created or updated by --update-weaviate evidence.
Key fields: evidence_id (ev_{document_id}_{local_entity_id}_{db_entity_id}), document_id, db_entity_id, surface_forms, mention_count, confidence, run_id.
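The evidence_id pattern can be written as a one-line helper. The function name is hypothetical and the database ID in the usage note is a made-up value; only the ev_{document_id}_{local_entity_id}_{db_entity_id} layout comes from the description above.

```python
def build_evidence_id(document_id: str, local_entity_id: str, db_entity_id: str) -> str:
    """Compose an evidence ID following the documented pattern.

    Hypothetical helper: the same three inputs always give the same ID,
    which is what lets a re-run update an object instead of duplicating it.
    """
    return f"ev_{document_id}_{local_entity_id}_{db_entity_id}"
```

For example, build_evidence_id("mc_abe045-00b4qk", "le_000001", "ent_42") gives "ev_mc_abe045-00b4qk_le_000001_ent_42", where "ent_42" stands in for a real db_entity_id.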
Candidates are ranked with a weighted formula (spec §16.2):
| Component | Weight |
|---|---|
| Label similarity (rapidfuzz token_set_ratio) | 0.25 |
| Alias similarity | 0.15 |
| Entity type match | 0.15 |
| Context similarity (short description vs document summary) | 0.15 |
| Date compatibility | 0.10 |
| Claim compatibility | 0.10 |
| Language match | 0.05 |
| Authority link match | 0.05 |
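The weighted formula can be sketched as a plain weighted sum. The weights are taken from the table above; the component key names, the clamping to [0, 1], and treating missing components as 0 are assumptions for this example and need not match _score_candidate(). Note that rapidfuzz's token_set_ratio returns values from 0 to 100, so it would need dividing by 100 before being used here.

```python
# Weights from the table above; the key names are illustrative only.
WEIGHTS = {
    "label_similarity": 0.25,
    "alias_similarity": 0.15,
    "entity_type_match": 0.15,
    "context_similarity": 0.15,
    "date_compatibility": 0.10,
    "claim_compatibility": 0.10,
    "language_match": 0.05,
    "authority_link_match": 0.05,
}


def weighted_score(components: dict[str, float]) -> float:
    """Combine per-component scores in [0, 1] into one candidate score.

    Assumption: a missing component counts as 0.0 and values are clamped.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return total
```

Because the weights sum to 1.0, the result stays in [0, 1] and can be compared directly with the thresholds.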
Thresholds (configurable):
| Threshold | Default | Meaning |
|---|---|---|
| auto_link_threshold | 0.92 | Auto-link without LLM review |
| review_threshold | 0.75 | Pass to LLM resolver if use_llm_resolver: true |
| min_margin_for_auto_link | 0.08 | Top-2 score gap required for auto-link |
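The decision flow from the resolution flowchart and these thresholds can be sketched as one function. The function name, the callable-based LLM hook, and the handling of a single candidate (its margin is treated as satisfied) are assumptions made for illustration; the real logic lives in semant_ne.weaviate.resolver.

```python
from typing import Callable, Optional, Sequence


def decide_resolution(
    scores: Sequence[float],
    auto_link_threshold: float = 0.92,
    review_threshold: float = 0.75,
    min_margin: float = 0.08,
    llm_resolver: Optional[Callable[[], str]] = None,
) -> str:
    """Map candidate scores to a resolution status.

    llm_resolver is a hypothetical hook returning 'link', 'ambiguous' or 'new';
    passing None corresponds to use_llm_resolver: false.
    """
    if not scores:
        return "new_entity_candidate"
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    # Assumption: with a single candidate the margin requirement is met.
    margin = top - ranked[1] if len(ranked) > 1 else float("inf")

    if top >= auto_link_threshold and margin >= min_margin:
        return "linked"
    if top < review_threshold:
        return "new_entity_candidate"
    if llm_resolver is None:
        return "ambiguous"
    decision = llm_resolver()
    # Unknown LLM answers fall back to 'ambiguous' (assumption).
    return {"link": "linked", "new": "new_entity_candidate"}.get(decision, "ambiguous")
```

With the defaults, scores of 0.95 and 0.93 do not auto-link because the gap of 0.02 is below min_margin_for_auto_link, so the entity goes to review instead.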
When update_weaviate: direct, the following fields may be updated on existing EntityRecord objects:
| Field | Rule |
|---|---|
| aliases | Add attested surface forms and LLM-proposed variants |
| short_description | Fill if empty (optionally overwrite) |
| markdown_text | Append evidence section (idempotent) |
| evidence_document_ids | Add current document ID |
| evidence_refs | Add/update evidence reference |
| search_text | Recomputed from canonical label + aliases + description |
| semant_revision | Incremented by 1 |
| updated_at | Always set |
The write is best-effort idempotent; repeated runs produce the same result.
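A few of these rules are sketched below on a plain dict standing in for an EntityRecord. This is an assumption-laden illustration, not the record updater: the function name is hypothetical, markdown_text, evidence_refs and updated_at are left out, and the revision is only bumped when a tracked field changes so that a repeated run leaves the record untouched.

```python
TRACKED_FIELDS = ("aliases", "short_description", "evidence_document_ids", "search_text")


def apply_direct_update(
    record: dict,
    document_id: str,
    surface_forms: list[str],
    description: str = "",
    overwrite_short_description: bool = False,
) -> dict:
    """Return an updated copy of record; the input dict is not modified."""
    updated = dict(record)

    # aliases: add new surface forms (assumption: skip the canonical label itself).
    aliases = list(record.get("aliases", []))
    for form in surface_forms:
        if form != record.get("canonical_label") and form not in aliases:
            aliases.append(form)
    updated["aliases"] = aliases

    # short_description: fill if empty, overwrite only when explicitly allowed.
    if description and (overwrite_short_description or not record.get("short_description")):
        updated["short_description"] = description

    # evidence_document_ids: add the current document once.
    doc_ids = list(record.get("evidence_document_ids", []))
    if document_id not in doc_ids:
        doc_ids.append(document_id)
    updated["evidence_document_ids"] = doc_ids

    # search_text: recomputed from canonical label + aliases + description.
    parts = [updated.get("canonical_label", ""), *aliases, updated.get("short_description", "")]
    updated["search_text"] = " ".join(part for part in parts if part)

    # semant_revision: bumped only if something changed (assumption).
    if any(updated.get(name) != record.get(name) for name in TRACKED_FIELDS):
        updated["semant_revision"] = record.get("semant_revision", 0) + 1
    return updated
```

Applying the function twice with the same inputs returns the same record the second time, which is the property the idempotency note above refers to.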
Pre-built configurations are in configurations/:
| File | Use case |
|---|---|
| minimal_ollama.yaml | Local Ollama, no Weaviate |
| doc2_vllm.yaml | vLLM with Czech language hints |
| vllm_with_weaviate_evidence.yaml | vLLM + Weaviate evidence writes |
| full_direct_update.yaml | Full pipeline with direct EntityRecord updates |
semant_ne/
├── cli.py # typer application, all CLI commands
├── config.py # Pydantic config models + YAML loader
├── models.py # All domain data models (Pydantic v2)
│
├── document_loader.py # Auto-discovers pages from a directory
├── pagexml_parser.py # PAGE XML 2019-07-15 parser
├── alto_parser.py # ALTO 2.x/3.x parser
├── block_preparation/
│ └── simple_ocr_blocks.py # Mode A: OCR regions → PreparedTextBlock list
│
├── llm/
│ ├── client.py # LLMClient (OpenAI-compatible, retry, JSON extract)
│ ├── schemas.py # Pydantic schemas for LLM I/O
│ ├── prompts.py # Prompt builder (build_page_extraction_messages)
│ └── page_extractor.py # Orchestrates extraction; handles block batching
│
├── anchoring.py # Fuzzy mention → text block anchoring
├── entity_policy.py # EntityType normalisation (PERSON → person)
├── ledger.py # EntityLedger: accumulates entities across pages
├── claim_validator.py # Validates and normalises LLM-extracted claims
├── consolidation.py # python_conservative_merge; FinalEntityConsolidator
│
├── weaviate/
│ ├── client.py # open_weaviate() context manager
│ ├── resolver.py # BM25 search + weighted scoring + LLM resolver
│ ├── evidence_writer.py # Upserts EntityDocumentEvidence
│ ├── record_updater.py # Direct updates to EntityRecord
│ └── init_collections.py # Creates Weaviate collections (idempotent)
│
├── output_writer.py # Atomic JSON write; load_document_result
└── validation.py # Cross-reference validation of DocumentResult
scripts/
└── init_weaviate.py # Standalone CLI for collection initialisation
configurations/
├── minimal_ollama.yaml
├── doc2_vllm.yaml
├── vllm_with_weaviate_evidence.yaml
└── full_direct_update.yaml
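The "atomic JSON write" noted for output_writer.py is commonly done by writing to a temporary file in the target directory and renaming it over the destination. The sketch below shows that general technique with the standard library; it is an assumption about the approach, not a copy of the module, and the function name is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload to path so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must be on the same filesystem for os.replace to be atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This matters when many documents are processed in parallel: an interrupted run leaves either the previous output or none, never a truncated JSON file.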
classDiagram
class DocumentResult {
document_id: str
run: RunInfo
document_metadata: DocumentMetadata
prepared_pages: list[PreparedPage]
unique_entities: list[LocalEntity]
mentions: list[Mention]
weaviate_updates: list[WeaviateUpdateRecord]
warnings: list[str]
}
class LocalEntity {
local_entity_id: str
entity_type: EntityType
canonical_label: str
surface_forms: list[str]
extracted_claims: list[Claim]
resolution: ResolutionDecision
}
class Mention {
mention_id: str
local_entity_id: str
surface_text: str
anchor: TextAnchor
}
class TextAnchor {
page_id: str
block_id: str
char_start: int
char_end: int
alignment_method: str
}
class ResolutionDecision {
status: linked|ambiguous|new_entity_candidate|unresolved
db_entity_id: str
confidence: float
candidate_scores: list[CandidateScore]
}
class Claim {
claim_id: str
predicate: str
value_text: str
update_eligibility: str
}
DocumentResult "1" --> "many" LocalEntity : unique_entities
DocumentResult "1" --> "many" Mention : mentions
LocalEntity "1" --> "many" Mention : via local_entity_id
Mention "1" --> "1" TextAnchor : anchor
LocalEntity "1" --> "0..1" ResolutionDecision : resolution
LocalEntity "1" --> "many" Claim : extracted_claims
# Run all tests
pytest
# With coverage
pytest --cov=semant_ne --cov-report=term-missing
# Specific module
pytest tests/test_ledger.py -v
219 tests across 15 modules. Integration tests against ./document/ are skipped if the directory is not present.
| Test module | What it covers |
|---|---|
| test_pagexml_parser | PAGE XML parsing, reading order, confidence |
| test_alto_parser | ALTO parsing, normalisation |
| test_block_preparation | Block splitting, merging, char offsets |
| test_document_loader | Directory discovery, sniffing, metadata |
| test_output_schema | Serialisation round-trip, schema_version |
| test_entity_policy | Type normalisation, aliases |
| test_anchoring | Exact/fuzzy/context-only/unanchored strategy |
| test_ledger | attach_or_create, merge, claims, summaries |
| test_llm_schemas | JSON extraction, permissive parsing |
| test_claim_validator | Predicate mapping, subject validation |
| test_consolidation | python_conservative_merge, FinalEntityConsolidator |
| test_weaviate_resolver | Scoring formula, auto-link, mocked Weaviate |
| test_evidence_writer | Create/update evidence objects, mocked Weaviate |
| test_record_updater | Alias/description/markdown updates, mocked Weaviate |
| test_integration_real_document | Full pipeline on ./document/ |
| Limitation | Workaround |
|---|---|
| Pages with very dense OCR (single block > ~18 000 chars + entity memory) may overflow max_model_len even after batching | Lower max_blocks_per_batch or split the page manually |
| max_tokens_per_request: 4096 is too small for very rich pages | Increase to 8192 if your model supports it |
| Weaviate BM25 search only returns up to max_candidates=5 results; rare entities may be missed | Increase max_candidates |
| llm_page and llm_suggest_then_python merge modes are parsed but not yet active | Use python_conservative or final_llm_consolidation |
| No multi-document batch command | Process documents in a shell loop |
| The entity ID counter is module-level, so reusing the Python process across documents yields monotonically increasing IDs | This is intentional; IDs remain unique within a session |
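As an alternative to a shell loop, the per-document command can be driven from Python. The sketch below uses only flags listed in the CLI reference; the helper names, the output naming scheme ({directory name}.json), and the worker count are assumptions. Each document runs in its own process, so the module-level entity ID counter starts fresh for every document.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, Optional


def build_command(
    doc_root: Path,
    output_dir: Path,
    config: Optional[Path] = None,
    extract: bool = False,
) -> list[str]:
    """Build one semant-ne process-document invocation (hypothetical helper)."""
    cmd = [
        "semant-ne", "process-document",
        "--doc-root", str(doc_root),
        "--output", str(output_dir / f"{doc_root.name}.json"),
    ]
    if config is not None:
        cmd += ["--config", str(config)]
    if extract:
        cmd.append("--extract")
    return cmd


def process_all(
    doc_roots: Iterable[Path],
    output_dir: Path,
    max_workers: int = 4,
    config: Optional[Path] = None,
    extract: bool = False,
) -> dict[str, int]:
    """Run the CLI once per document and return exit codes by directory name."""
    def run(doc_root: Path) -> tuple[str, int]:
        completed = subprocess.run(build_command(doc_root, output_dir, config, extract))
        return doc_root.name, completed.returncode

    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, doc_roots))
```

A non-zero exit code in the returned mapping marks a document that should be re-run.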
MIT — see LICENSE.