semant-ne

Named entity extraction and resolution from digitised historical documents.

semant-ne processes a single document at a time: it reads OCR output (PAGE XML or ALTO XML), prepares clean text blocks, sends them to an LLM for named entity extraction, and optionally links the discovered entities to an existing Weaviate database and updates the records directly.

The tool is designed for large-scale historical digitisation pipelines where many documents are processed in parallel. Cross-document consistency is maintained by Weaviate and the durable per-document JSON output — not by shared memory.

Architecture

End-to-end pipeline

flowchart TD
    A["Document directory\n(.jpg, .xml, document.json)"] --> B["Document loader\nsemant_ne.document_loader"]
    B --> C["PAGE XML / ALTO parser\npagexml_parser · alto_parser"]
    C --> D["Text block builder\nblock_preparation.simple_ocr_blocks"]
    D --> E{"--extract?"}
    E -- no --> OUT["DocumentResult JSON\n(blocks only)"]
    E -- yes --> F["LLM page extractor\nllm.page_extractor"]
    F --> G["Entity ledger\nledger.EntityLedger"]
    G --> H["Claim validator\nclaim_validator"]
    H --> I["Entity consolidation\nconsolidation"]
    I --> J{"--update-weaviate?"}
    J -- none --> OUT2["DocumentResult JSON\n(entities + mentions)"]
    J -- evidence or direct --> K["Weaviate resolver\nweaviate.resolver"]
    K --> L["Evidence writer\nweaviate.evidence_writer"]
    K --> M["Record updater\nweaviate.record_updater"]
    L --> OUT3["DocumentResult JSON\n(+ Weaviate update log)"]
    M --> OUT3

Weaviate data model

erDiagram
    EntityRecord {
        string db_entity_id PK
        string entity_type
        string canonical_label
        string[] aliases
        string search_text
        string short_description
        string markdown_text
        string[] evidence_document_ids
        int semant_revision
    }
    EntityDocumentEvidence {
        string evidence_id PK
        string document_id
        string local_entity_id
        string db_entity_id FK
        string entity_type
        int mention_count
        float confidence
        string run_id
    }
    EntityRecord ||--o{ EntityDocumentEvidence : "has evidence"

Entity resolution flow

flowchart TD
    A["LocalEntity from ledger"] --> B["BM25 search in Weaviate\n(search_text + canonical_label)"]
    B --> C{"candidates found?"}
    C -- no --> D["status: new_entity_candidate"]
    C -- yes --> E["Score each candidate\n_score_candidate()"]
    E --> F{"top score ≥\nauto_link_threshold\nAND margin ≥ min_margin?"}
    F -- yes --> G["status: linked"]
    F -- no --> H{"top score ≥\nreview_threshold?"}
    H -- no --> I["status: new_entity_candidate"]
    H -- yes --> J{"use_llm_resolver?"}
    J -- no --> K["status: ambiguous"]
    J -- yes --> L["LLM resolver call"]
    L --> M{"LLM decision"}
    M -- link --> G
    M -- ambiguous --> K
    M -- new --> I

Quick start

# 1. Install
pip install -e ".[dev]"

# 2. Block preparation only (no LLM needed)
semant-ne process-document \
  --doc-root ./document_2 \
  --output ./output/doc2_blocks.json

# 3. Full extraction with local vLLM
semant-ne process-document \
  --doc-root ./document_2 \
  --output ./output/doc2.json \
  --config ./configurations/doc2_vllm.yaml \
  --extract

# 4. Validate the output
semant-ne validate-output --input ./output/doc2.json

# 5. Inspect prepared blocks
semant-ne inspect-document-blocks --input ./output/doc2.json --page 1

Installation

Requirements

Python ≥ 3.11
An OpenAI-compatible LLM endpoint (vLLM, Ollama, …) for --extract
A Weaviate instance reachable through the v4 Python client, for --update-weaviate evidence|direct

Install from source

git clone <repo-url>
cd semant-ne

# Create and activate a virtual environment
python -m venv .venv && source .venv/bin/activate

# Install the package (core dependencies)
pip install -e .

# Install optional extras
pip install -e ".[weaviate]"   # Weaviate client
pip install -e ".[dev]"        # pytest, ruff, mypy

Start a local vLLM server

pip install vllm

vllm serve google/gemma-4-E2B-it \
  --max-model-len 24000 \
  --reasoning-parser gemma4 \
  --port 26001

Any OpenAI-compatible model works; adjust llm.text_model in config accordingly.

Start Weaviate (optional)

docker run -d \
  --name weaviate \
  -p 8080:8080 \
  cr.weaviate.io/semitechnologies/weaviate:latest

# Create required collections (idempotent)
semant-ne init-weaviate --url http://localhost:8080

Preparing a document

A document directory contains:

File pattern	Required	Purpose
`*.page.xml` or `*.xml` (PAGE XML)	either/or	OCR text with bounding boxes
`*.alto.xml` or `*.xml` (ALTO)	either/or	OCR text with bounding boxes
`*.jpg` / `*.png` / `*.tif`	optional	Page images (for vision models)
`document.json`	optional	Document-level metadata
`pages.json`	optional	Explicit page ordering

Bare .xml files are auto-detected as PAGE XML or ALTO by namespace sniffing.
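
The sniffing step could look roughly like the sketch below. It is not taken from the project: the function name `sniff_ocr_format` and the substring checks are assumptions, based on the published PAGE (`schema.primaresearch.org/PAGE/...`) and ALTO (`www.loc.gov/standards/alto/...`) namespace URIs.

```python
import io
import xml.etree.ElementTree as ET


def sniff_ocr_format(source) -> str:
    """Return 'pagexml', 'alto' or 'unknown' based on the root element's namespace.

    `source` may be a file path or a binary file object.
    """
    # The first "start" event belongs to the root element, so we return
    # right away instead of walking the whole document.
    for _event, element in ET.iterparse(source, events=("start",)):
        tag = element.tag  # namespaced tags look like "{namespace-uri}LocalName"
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        if "primaresearch.org/PAGE" in namespace:
            return "pagexml"
        if "loc.gov/standards/alto" in namespace:
            return "alto"
        return "unknown"
    return "unknown"


sample = io.BytesIO(b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"/>')
print(sniff_ocr_format(sample))  # prints "alto"
```

Matching on a namespace substring rather than an exact URI is one way to accept several schema versions with a single check.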

`document.json`

{
  "document_id": "mc_abe045-00b4qk",
  "title": "Sokol: časopis pro tělesnou výchovu",
  "date": "1940",
  "language": ["cs"],
  "authors": []
}

All fields are optional. If document_id is omitted, the document ID defaults to the directory name.

Supported OCR formats

Format	Support
PAGE XML 2019-07-15	Full
PAGE XML (older namespaces)	Graceful fallback
ALTO 2.x / 3.x	Full
Mixed (PAGE XML + ALTO for same page)	PAGE XML preferred

Configuration

All configuration is in a YAML file passed via --config. If omitted, defaults are used.

# LLM connection and extraction settings
llm:
  base_url: "http://localhost:26001/v1"   # OpenAI-compatible endpoint
  api_key: "token-abc123"                 # any non-empty string for vLLM
  text_model: "google/gemma-4-E2B-it"
  max_tokens_per_request: 4096
  thinking_budget_tokens: null            # for models with reasoning mode
  include_page_image: false               # pass page image to vision model
  max_blocks_per_batch: 30               # pages with more blocks are split

# Document-local entity deduplication
merging:
  mode: "python_conservative"            # python_conservative | llm_page | final_llm_consolidation
  final_consolidation: false
  python_merge_threshold: 0.92

# Block preparation tuning
blocks:
  target_chars: 1200
  max_chars: 2500
  overlap_chars: 150

# Weaviate connection (needed for --update-weaviate evidence|direct)
weaviate:
  url: "http://localhost:8080"
  api_key: null
  entity_record_collection: "EntityRecord"
  evidence_collection: "EntityDocumentEvidence"

# Entity resolution thresholds
resolution:
  auto_link_threshold: 0.92             # top score ≥ this (and margin met) → auto-link
  review_threshold: 0.75                # top score ≥ this → ambiguous / LLM check
  min_margin_for_auto_link: 0.08        # min gap between top-2 candidates
  use_llm_resolver: true
  max_candidates: 5

# Weaviate write policy
updates:
  update_weaviate: "none"               # none | evidence | direct
  write_entity_document_evidence: true
  direct_update_entity_record: true
  add_aliases: true
  fill_empty_short_description: true
  overwrite_short_description: false
  append_markdown_evidence: true
  add_external_ids: false
  add_facts: false
  tolerate_parallel_update_loss: true

language_hints:
  - "cs"

See configurations/ for ready-to-use examples.

CLI reference

`process-document`

The main command. Prepares text blocks and optionally runs LLM extraction and Weaviate integration.

semant-ne process-document
  --doc-root PATH         Document directory  [required]
  --output PATH           Output JSON path    [required]
  --doc-id TEXT           Override document ID
  --config PATH           Config YAML (defaults used if omitted)
  --extract / --no-extract  Run LLM entity extraction (default: --no-extract)
  --update-weaviate TEXT  none|evidence|direct  (default: none)
  --verbose / -v

`validate-output`

Check a previously produced JSON file for schema and cross-reference errors.

semant-ne validate-output --input PATH

`inspect-document-blocks`

Pretty-print the prepared text blocks from a JSON output file.

semant-ne inspect-document-blocks --input PATH [--page N]

`init-weaviate`

Create the required Weaviate collections (idempotent).

semant-ne init-weaviate
  [--url TEXT]                              default: http://localhost:8080
  [--api-key TEXT]
  [--entity-record-collection TEXT]         default: EntityRecord
  [--evidence-collection TEXT]              default: EntityDocumentEvidence
  [--force]                                 drop and recreate

`reapply-weaviate-updates`

Re-run all Weaviate writes for a previously produced JSON (useful after failures or schema changes).

semant-ne reapply-weaviate-updates
  --input PATH
  [--config PATH]
  [--evidence-only]

Output format

The output is a single JSON file following schema_version: "semant-ne-document-v2".

{
  "schema_version": "semant-ne-document-v2",
  "document_id": "mc_abe045-00b4qk",

  "run": {
    "run_id": "2026-05-28T02:13:00Z_mc_abe045-00b4qk_abcd1234",
    "started_at": "...",
    "finished_at": "...",
    "tool_version": "0.1.0",
    "weaviate_update_mode": "none"
  },

  "document_metadata": { "title": "...", "date": "1940", "language": ["cs"] },

  "prepared_pages": [
    {
      "page_id": "mc_abe045-00b4qk_0003",
      "page_index": 1,
      "blocks": [
        {
          "block_id": "mc_abe045-00b4qk_0003_b0001",
          "text": "SPLŇUJE SE TO, PO ČEM SOKOLSTVO VOLALO PŘED PĚTI LÉTY.",
          "bbox": { "x": 282, "y": 136, "w": 1020, "h": 270 },
          "confidence": 1.0
        }
      ]
    }
  ],

  "unique_entities": [
    {
      "local_entity_id": "le_000001",
      "entity_type": "organization",
      "canonical_label": "Sokolstvo",
      "surface_forms": ["Sokolstvo", "Sokol"],
      "resolution": {
        "status": "new_entity_candidate",
        "confidence": 0.0,
        "explanation": "No database candidates found."
      }
    }
  ],

  "mentions": [
    {
      "mention_id": "m_a1b2c3d4e5f6",
      "local_entity_id": "le_000001",
      "surface_text": "Sokolstvo",
      "anchor": {
        "page_id": "mc_abe045-00b4qk_0003",
        "block_id": "mc_abe045-00b4qk_0003_b0001",
        "char_start": 22,
        "char_end": 31
      }
    }
  ],

  "weaviate_updates": [],
  "warnings": []
}
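
To illustrate how the three top-level lists reference each other, here is a small consumer-side check. It is a sketch under assumptions, not the logic behind `validate-output`: the function name `check_cross_references` is hypothetical, and it only relies on the field names shown in the example above.

```python
def check_cross_references(result: dict) -> list[str]:
    """Return human-readable problems found in a parsed output document."""
    problems: list[str] = []
    entity_ids = {e["local_entity_id"] for e in result.get("unique_entities", [])}
    block_ids = {
        block["block_id"]
        for page in result.get("prepared_pages", [])
        for block in page.get("blocks", [])
    }
    for mention in result.get("mentions", []):
        mention_id = mention.get("mention_id", "<no id>")
        # Every mention should point at an entity listed in unique_entities.
        if mention.get("local_entity_id") not in entity_ids:
            problems.append(f"{mention_id}: unknown local_entity_id")
        # Every anchor should point at a block listed in prepared_pages.
        anchor = mention.get("anchor") or {}
        if anchor.get("block_id") not in block_ids:
            problems.append(f"{mention_id}: unknown block_id")
        start, end = anchor.get("char_start"), anchor.get("char_end")
        if isinstance(start, int) and isinstance(end, int) and start >= end:
            problems.append(f"{mention_id}: empty or inverted character span")
    return problems
```

Pass it the dictionary obtained from `json.load` on a file written by `process-document`; an empty list means none of these particular checks failed.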

Weaviate integration

Collections

`EntityRecord`

One object per canonical named entity in your database. The search_text field is BM25-indexed for fast retrieval.

Key fields: db_entity_id, entity_type, canonical_label, aliases, search_text, short_description, markdown_text, evidence_document_ids, semant_revision.

`EntityDocumentEvidence`

One object per document × entity link. Created or updated by --update-weaviate evidence.

Key fields: evidence_id (ev_{document_id}_{local_entity_id}_{db_entity_id}), document_id, db_entity_id, surface_forms, mention_count, confidence, run_id.
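
Because the `evidence_id` pattern above is built only from the three linked identifiers, a repeated write for the same document and entity pair can target the same object. The sketch below shows that idea with a plain dictionary standing in for the Weaviate collection; `make_evidence_id`, `upsert_evidence` and the sample `db_entity_id` value are hypothetical, and the real writer may derive its object keys differently.

```python
def make_evidence_id(document_id: str, local_entity_id: str, db_entity_id: str) -> str:
    """Build the ID following the ev_{document_id}_{local_entity_id}_{db_entity_id} pattern."""
    return f"ev_{document_id}_{local_entity_id}_{db_entity_id}"


def upsert_evidence(store: dict[str, dict], evidence: dict) -> str:
    """Create or overwrite the evidence object keyed by its deterministic ID."""
    evidence_id = make_evidence_id(
        evidence["document_id"], evidence["local_entity_id"], evidence["db_entity_id"]
    )
    store[evidence_id] = {**evidence, "evidence_id": evidence_id}
    return evidence_id


store: dict[str, dict] = {}  # in-memory stand-in for the evidence collection
link = {
    "document_id": "mc_abe045-00b4qk",
    "local_entity_id": "le_000001",
    "db_entity_id": "ent_123",  # hypothetical database ID
    "mention_count": 1,
}
upsert_evidence(store, link)
upsert_evidence(store, {**link, "mention_count": 2})  # second run updates in place
```

After both calls the store still holds a single object, now with `mention_count` 2.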

Resolution scoring

Candidates are ranked with a weighted formula (spec §16.2):

Component	Weight
Label similarity (rapidfuzz token_set_ratio)	0.25
Alias similarity	0.15
Entity type match	0.15
Context similarity (short description vs document summary)	0.15
Date compatibility	0.10
Claim compatibility	0.10
Language match	0.05
Authority link match	0.05
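
As a worked illustration of the table above, the weights can be combined as a plain weighted sum. This is a sketch, not the body of `_score_candidate()`: the component key names are hypothetical, and treating a missing component as 0 is an assumption.

```python
WEIGHTS = {
    "label_similarity": 0.25,
    "alias_similarity": 0.15,
    "entity_type_match": 0.15,
    "context_similarity": 0.15,
    "date_compatibility": 0.10,
    "claim_compatibility": 0.10,
    "language_match": 0.05,
    "authority_link_match": 0.05,
}


def weighted_score(components: dict[str, float]) -> float:
    """Combine per-component scores in [0, 1] into one score in [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)  # missing component counts as 0 (assumption)
        total += weight * min(max(value, 0.0), 1.0)  # clamp to [0, 1] before weighting
    return total
```

Since the weights sum to 1.0, a candidate that scores 1.0 on every component reaches the maximum score of 1.0, while a perfect label match alone contributes only 0.25.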

Thresholds (configurable):

Threshold	Default	Meaning
`auto_link_threshold`	0.92	Auto-link without LLM review
`review_threshold`	0.75	Pass to LLM resolver if `use_llm_resolver: true`
`min_margin_for_auto_link`	0.08	Top-2 score gap required for auto-link
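
How the three thresholds interact can be sketched as a small decision function that follows the entity resolution flowchart. The function name `decide` and the `needs_llm_review` return value are hypothetical (the latter marks the point where the LLM resolver would be called and would itself return link, ambiguous or new), and the margin used for a single candidate is an assumption.

```python
def decide(
    scores: list[float],
    auto_link_threshold: float = 0.92,
    review_threshold: float = 0.75,
    min_margin: float = 0.08,
    use_llm_resolver: bool = True,
) -> str:
    """Map candidate scores to a resolution outcome."""
    if not scores:
        return "new_entity_candidate"  # no database candidates found
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    # With a single candidate there is no runner-up; use the top score as margin (assumption).
    margin = top - ranked[1] if len(ranked) > 1 else top
    if top >= auto_link_threshold and margin >= min_margin:
        return "linked"
    if top < review_threshold:
        return "new_entity_candidate"
    # Plausible but not decisive: hand over to the LLM resolver, or mark as ambiguous.
    return "needs_llm_review" if use_llm_resolver else "ambiguous"


print(decide([0.95, 0.80]))  # prints "linked"
print(decide([0.95, 0.93], use_llm_resolver=False))  # prints "ambiguous"
```

The second call shows why the margin matters: a top score above `auto_link_threshold` is still not auto-linked when the runner-up is only 0.02 behind.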

Direct update policy

When update_weaviate: direct, the following fields may be updated on existing EntityRecord objects:

Field	Rule
`aliases`	Add attested surface forms and LLM-proposed variants
`short_description`	Fill if empty (optionally overwrite)
`markdown_text`	Append evidence section (idempotent)
`evidence_document_ids`	Add current document ID
`evidence_refs`	Add/update evidence reference
`search_text`	Recomputed from canonical label + aliases + description
`semant_revision`	Incremented by 1
`updated_at`	Always set

Writes are idempotent on a best-effort basis: repeated runs for the same document are intended to produce the same result.

Example configurations

Pre-built configurations are in configurations/:

File	Use case
`minimal_ollama.yaml`	Local Ollama, no Weaviate
`doc2_vllm.yaml`	vLLM with Czech language hints
`vllm_with_weaviate_evidence.yaml`	vLLM + Weaviate evidence writes
`full_direct_update.yaml`	Full pipeline with direct EntityRecord updates

Code structure

semant_ne/
├── cli.py                    # typer application, all CLI commands
├── config.py                 # Pydantic config models + YAML loader
├── models.py                 # All domain data models (Pydantic v2)
│
├── document_loader.py        # Auto-discovers pages from a directory
├── pagexml_parser.py         # PAGE XML 2019-07-15 parser
├── alto_parser.py            # ALTO 2.x/3.x parser
├── block_preparation/
│   └── simple_ocr_blocks.py  # Mode A: OCR regions → PreparedTextBlock list
│
├── llm/
│   ├── client.py             # LLMClient (OpenAI-compatible, retry, JSON extract)
│   ├── schemas.py            # Pydantic schemas for LLM I/O
│   ├── prompts.py            # Prompt builder (build_page_extraction_messages)
│   └── page_extractor.py     # Orchestrates extraction; handles block batching
│
├── anchoring.py              # Fuzzy mention → text block anchoring
├── entity_policy.py          # EntityType normalisation (PERSON → person)
├── ledger.py                 # EntityLedger: accumulates entities across pages
├── claim_validator.py        # Validates and normalises LLM-extracted claims
├── consolidation.py          # python_conservative_merge; FinalEntityConsolidator
│
├── weaviate/
│   ├── client.py             # open_weaviate() context manager
│   ├── resolver.py           # BM25 search + weighted scoring + LLM resolver
│   ├── evidence_writer.py    # Upserts EntityDocumentEvidence
│   ├── record_updater.py     # Direct updates to EntityRecord
│   └── init_collections.py   # Creates Weaviate collections (idempotent)
│
├── output_writer.py          # Atomic JSON write; load_document_result
└── validation.py             # Cross-reference validation of DocumentResult

scripts/
└── init_weaviate.py          # Standalone CLI for collection initialisation

configurations/
├── minimal_ollama.yaml
├── doc2_vllm.yaml
├── vllm_with_weaviate_evidence.yaml
└── full_direct_update.yaml

Key data model relationships

classDiagram
    class DocumentResult {
        document_id: str
        run: RunInfo
        document_metadata: DocumentMetadata
        prepared_pages: list[PreparedPage]
        unique_entities: list[LocalEntity]
        mentions: list[Mention]
        weaviate_updates: list[WeaviateUpdateRecord]
        warnings: list[str]
    }
    class LocalEntity {
        local_entity_id: str
        entity_type: EntityType
        canonical_label: str
        surface_forms: list[str]
        extracted_claims: list[Claim]
        resolution: ResolutionDecision
    }
    class Mention {
        mention_id: str
        local_entity_id: str
        surface_text: str
        anchor: TextAnchor
    }
    class TextAnchor {
        page_id: str
        block_id: str
        char_start: int
        char_end: int
        alignment_method: str
    }
    class ResolutionDecision {
        status: linked|ambiguous|new_entity_candidate|unresolved
        db_entity_id: str
        confidence: float
        candidate_scores: list[CandidateScore]
    }
    class Claim {
        claim_id: str
        predicate: str
        value_text: str
        update_eligibility: str
    }
    DocumentResult "1" --> "many" LocalEntity : unique_entities
    DocumentResult "1" --> "many" Mention : mentions
    LocalEntity "1" --> "many" Mention : via local_entity_id
    Mention "1" --> "1" TextAnchor : anchor
    LocalEntity "1" --> "0..1" ResolutionDecision : resolution
    LocalEntity "1" --> "many" Claim : extracted_claims

Testing

# Run all tests
pytest

# With coverage
pytest --cov=semant_ne --cov-report=term-missing

# Specific module
pytest tests/test_ledger.py -v

219 tests across 15 modules. Integration tests against ./document/ are skipped if the directory is not present.

Test module	What it covers
`test_pagexml_parser`	PAGE XML parsing, reading order, confidence
`test_alto_parser`	ALTO parsing, normalisation
`test_block_preparation`	Block splitting, merging, char offsets
`test_document_loader`	Directory discovery, sniffing, metadata
`test_output_schema`	Serialisation round-trip, schema_version
`test_entity_policy`	Type normalisation, aliases
`test_anchoring`	Exact/fuzzy/context-only/unanchored strategy
`test_ledger`	attach_or_create, merge, claims, summaries
`test_llm_schemas`	JSON extraction, permissive parsing
`test_claim_validator`	Predicate mapping, subject validation
`test_consolidation`	python_conservative_merge, FinalEntityConsolidator
`test_weaviate_resolver`	Scoring formula, auto-link, mocked Weaviate
`test_evidence_writer`	Create/update evidence objects, mocked Weaviate
`test_record_updater`	Alias/description/markdown updates, mocked Weaviate
`test_integration_real_document`	Full pipeline on `./document/`

Known limitations

Limitation	Workaround
Pages with very dense OCR (single block > ~18 000 chars + entity memory) may overflow `max_model_len` even after batching	Lower `max_blocks_per_batch` or split the page manually
`max_tokens_per_request: 4096` is too small for very rich pages	Increase to 8192 if your model supports it
Weaviate BM25 search only returns up to `max_candidates=5` results; rare entities may be missed	Increase `max_candidates`
`llm_page` and `llm_suggest_then_python` merge modes are parsed but not yet active	Use `python_conservative` or `final_llm_consolidation`
No multi-document batch command	Process documents in a shell loop
Entity ID counter is module-level; reusing the Python process across documents yields monotonically increasing IDs	This is intentional; IDs remain unique within a session

Licence

MIT — see LICENSE.
