DCGM/semant-ner

semant-ne

Named entity extraction and resolution from digitised historical documents.

semant-ne processes a single document at a time: it reads OCR output (PAGE XML or ALTO XML), prepares clean text blocks, sends them to an LLM for named entity extraction, and optionally links the discovered entities to an existing Weaviate database and updates the records directly.

The tool is designed for large-scale historical digitisation pipelines where many documents are processed in parallel. Cross-document consistency is maintained by Weaviate and the durable per-document JSON output — not by shared memory.


Architecture

End-to-end pipeline

flowchart TD
    A["Document directory\n(.jpg, .xml, document.json)"] --> B["Document loader\nsemant_ne.document_loader"]
    B --> C["PAGE XML / ALTO parser\npagexml_parser · alto_parser"]
    C --> D["Text block builder\nblock_preparation.simple_ocr_blocks"]
    D --> E{"--extract?"}
    E -- no --> OUT["DocumentResult JSON\n(blocks only)"]
    E -- yes --> F["LLM page extractor\nllm.page_extractor"]
    F --> G["Entity ledger\nledger.EntityLedger"]
    G --> H["Claim validator\nclaim_validator"]
    H --> I["Entity consolidation\nconsolidation"]
    I --> J{"--update-weaviate?"}
    J -- none --> OUT2["DocumentResult JSON\n(entities + mentions)"]
    J -- evidence or direct --> K["Weaviate resolver\nweaviate.resolver"]
    K --> L["Evidence writer\nweaviate.evidence_writer"]
    K --> M["Record updater\nweaviate.record_updater"]
    L --> OUT3["DocumentResult JSON\n(+ Weaviate update log)"]
    M --> OUT3

Weaviate data model

erDiagram
    EntityRecord {
        string db_entity_id PK
        string entity_type
        string canonical_label
        string[] aliases
        string search_text
        string short_description
        string markdown_text
        string[] evidence_document_ids
        int semant_revision
    }
    EntityDocumentEvidence {
        string evidence_id PK
        string document_id
        string local_entity_id
        string db_entity_id FK
        string entity_type
        int mention_count
        float confidence
        string run_id
    }
    EntityRecord ||--o{ EntityDocumentEvidence : "has evidence"

Entity resolution flow

flowchart TD
    A["LocalEntity from ledger"] --> B["BM25 search in Weaviate\n(search_text + canonical_label)"]
    B --> C{"candidates found?"}
    C -- no --> D["status: new_entity_candidate"]
    C -- yes --> E["Score each candidate\n_score_candidate()"]
    E --> F{"top score ≥\nauto_link_threshold\nAND margin ≥ min_margin?"}
    F -- yes --> G["status: linked"]
    F -- no --> H{"top score ≥\nreview_threshold?"}
    H -- no --> I["status: new_entity_candidate"]
    H -- yes --> J{"use_llm_resolver?"}
    J -- no --> K["status: ambiguous"]
    J -- yes --> L["LLM resolver call"]
    L --> M{"LLM decision"}
    M -- link --> G
    M -- ambiguous --> K
    M -- new --> I

Quick start

# 1. Install
pip install -e ".[dev]"

# 2. Block preparation only (no LLM needed)
semant-ne process-document \
  --doc-root ./document_2 \
  --output ./output/doc2_blocks.json

# 3. Full extraction with local vLLM
semant-ne process-document \
  --doc-root ./document_2 \
  --output ./output/doc2.json \
  --config ./configurations/doc2_vllm.yaml \
  --extract

# 4. Validate the output
semant-ne validate-output --input ./output/doc2.json

# 5. Inspect prepared blocks
semant-ne inspect-document-blocks --input ./output/doc2.json --page 1

Installation

Requirements

  • Python ≥ 3.11
  • An OpenAI-compatible LLM endpoint (vLLM, Ollama, …) for --extract
  • Weaviate v4 for --update-weaviate evidence|direct

Install from source

git clone <repo-url>
cd semant-ne

# Create and activate a virtual environment
python -m venv .venv && source .venv/bin/activate

# Install the package (core dependencies)
pip install -e .

# Install optional extras
pip install -e ".[weaviate]"   # Weaviate client
pip install -e ".[dev]"        # pytest, ruff, mypy

Start a local vLLM server

pip install vllm

vllm serve google/gemma-4-E2B-it \
  --max-model-len 24000 \
  --reasoning-parser gemma4 \
  --port 26001

Any model served through an OpenAI-compatible endpoint works; adjust llm.text_model in the config accordingly.

Start Weaviate (optional)

docker run -d \
  --name weaviate \
  -p 8080:8080 \
  cr.weaviate.io/semitechnologies/weaviate:latest

# Create required collections (idempotent)
semant-ne init-weaviate --url http://localhost:8080

Preparing a document

A document directory contains:

| File pattern | Required | Purpose |
| --- | --- | --- |
| *.page.xml or *.xml (PAGE XML) | either/or | OCR text with bounding boxes |
| *.alto.xml or *.xml (ALTO) | either/or | OCR text with bounding boxes |
| *.jpg / *.png / *.tif | optional | Page images (for vision models) |
| document.json | optional | Document-level metadata |
| pages.json | optional | Explicit page ordering |

Bare .xml files are auto-detected as PAGE XML or ALTO by namespace sniffing.
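As a rough illustration of namespace sniffing, the sketch below looks only at the root element of an XML file. The function name `sniff_ocr_format` and the matching rules are assumptions made for this example, not the logic in semant_ne.document_loader.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def sniff_ocr_format(xml_bytes: bytes) -> str:
    """Return "page", "alto", or "unknown" based on the root element."""
    try:
        # The first "start" event is always the root element, so the
        # function returns without inspecting any child elements.
        for _event, elem in ET.iterparse(BytesIO(xml_bytes), events=("start",)):
            tag = elem.tag  # "{namespace}LocalName" or plain "LocalName"
            namespace = tag[1:].split("}", 1)[0].lower() if tag.startswith("{") else ""
            local_name = tag.rsplit("}", 1)[-1].lower()
            # Assumed heuristics: PAGE roots are PcGts in a primaresearch.org
            # namespace; ALTO roots are alto in a loc.gov/standards/alto one.
            if "primaresearch.org/page" in namespace or local_name == "pcgts":
                return "page"
            if "standards/alto" in namespace or local_name == "alto":
                return "alto"
            return "unknown"
    except ET.ParseError:
        # Not well-formed XML at all.
        return "unknown"
    return "unknown"
```

Matching on a namespace substring rather than an exact URI keeps older PAGE and ALTO schema versions detectable in this sketch.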

document.json

{
  "document_id": "mc_abe045-00b4qk",
  "title": "Sokol: časopis pro tělesnou výchovu",
  "date": "1940",
  "language": ["cs"],
  "authors": []
}

All fields are optional. If omitted, the document ID defaults to the directory name.
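The fallback rule can be pictured with a small sketch. `load_document_metadata` is a hypothetical helper written for this README, not the loader's real API; it only shows the "directory name as default ID" behaviour described above.

```python
import json
from pathlib import Path


def load_document_metadata(doc_root: Path) -> dict:
    """Read document.json if present and guarantee a document_id."""
    metadata: dict = {}
    meta_path = doc_root / "document.json"
    if meta_path.is_file():
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))
    # Every field is optional; only the ID gets a derived default.
    if not metadata.get("document_id"):
        metadata["document_id"] = doc_root.resolve().name
    return metadata
```

For example, `load_document_metadata(Path("./document_2"))` would yield an ID of `document_2` when the directory has no document.json.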

Supported OCR formats

Format Support
PAGE XML 2019-07-15 Full
PAGE XML (older namespaces) Graceful fallback
ALTO 2.x / 3.x Full
Mixed (PAGE XML + ALTO for same page) PAGE XML preferred

Configuration

All configuration is in a YAML file passed via --config. If omitted, defaults are used.

# LLM connection and extraction settings
llm:
  base_url: "http://localhost:26001/v1"   # OpenAI-compatible endpoint
  api_key: "token-abc123"                 # any non-empty string for vLLM
  text_model: "google/gemma-4-E2B-it"
  max_tokens_per_request: 4096
  thinking_budget_tokens: null            # for models with reasoning mode
  include_page_image: false               # pass page image to vision model
  max_blocks_per_batch: 30               # pages with more blocks are split

# Document-local entity deduplication
merging:
  mode: "python_conservative"            # python_conservative | llm_page | final_llm_consolidation
  final_consolidation: false
  python_merge_threshold: 0.92

# Block preparation tuning
blocks:
  target_chars: 1200
  max_chars: 2500
  overlap_chars: 150

# Weaviate connection (needed for --update-weaviate evidence|direct)
weaviate:
  url: "http://localhost:8080"
  api_key: null
  entity_record_collection: "EntityRecord"
  evidence_collection: "EntityDocumentEvidence"

# Entity resolution thresholds
resolution:
  auto_link_threshold: 0.92             # score ≥ this (and margin met) → auto-link
  review_threshold: 0.75                # score ≥ this → LLM check, or ambiguous without it
  min_margin_for_auto_link: 0.08        # min gap between the top two candidates
  use_llm_resolver: true
  max_candidates: 5

# Weaviate write policy
updates:
  update_weaviate: "none"               # none | evidence | direct
  write_entity_document_evidence: true
  direct_update_entity_record: true
  add_aliases: true
  fill_empty_short_description: true
  overwrite_short_description: false
  append_markdown_evidence: true
  add_external_ids: false
  add_facts: false
  tolerate_parallel_update_loss: true

language_hints:
  - "cs"

See configurations/ for ready-to-use examples.
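One way to think about "defaults are used" is a per-key merge of a partial config over the built-in values. The sketch below assumes that behaviour; the real loader in config.py uses Pydantic models, and `DEFAULTS` here is a truncated, illustrative subset of the settings shown above.

```python
from copy import deepcopy

# Illustrative subset of the defaults listed in this section.
DEFAULTS = {
    "llm": {"base_url": "http://localhost:26001/v1", "max_blocks_per_batch": 30},
    "resolution": {"auto_link_threshold": 0.92, "review_threshold": 0.75},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied key by key (hypothetical helper)."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged, so unspecified keys keep their defaults.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

Under this assumption, a config file that sets only `llm.max_blocks_per_batch` would still pick up the default `llm.base_url`.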


CLI reference

process-document

The main command. Prepares text blocks and optionally runs LLM extraction and Weaviate integration.

semant-ne process-document
  --doc-root PATH         Document directory  [required]
  --output PATH           Output JSON path    [required]
  --doc-id TEXT           Override document ID
  --config PATH           Config YAML (defaults used if omitted)
  --extract / --no-extract  Run LLM entity extraction (default: --no-extract)
  --update-weaviate TEXT  none|evidence|direct  (default: none)
  --verbose / -v

validate-output

Check a previously produced JSON file for schema and cross-reference errors.

semant-ne validate-output --input PATH

inspect-document-blocks

Pretty-print the prepared text blocks from a JSON output file.

semant-ne inspect-document-blocks --input PATH [--page N]

init-weaviate

Create the required Weaviate collections (idempotent).

semant-ne init-weaviate
  [--url TEXT]                              default: http://localhost:8080
  [--api-key TEXT]
  [--entity-record-collection TEXT]         default: EntityRecord
  [--evidence-collection TEXT]              default: EntityDocumentEvidence
  [--force]                                 drop and recreate

reapply-weaviate-updates

Re-run all Weaviate writes for a previously produced JSON (useful after failures or schema changes).

semant-ne reapply-weaviate-updates
  --input PATH
  [--config PATH]
  [--evidence-only]

Output format

The output is a single JSON file following schema_version: "semant-ne-document-v2".

{
  "schema_version": "semant-ne-document-v2",
  "document_id": "mc_abe045-00b4qk",

  "run": {
    "run_id": "2026-05-28T02:13:00Z_mc_abe045-00b4qk_abcd1234",
    "started_at": "...",
    "finished_at": "...",
    "tool_version": "0.1.0",
    "weaviate_update_mode": "none"
  },

  "document_metadata": { "title": "...", "date": "1940", "language": ["cs"] },

  "prepared_pages": [
    {
      "page_id": "mc_abe045-00b4qk_0003",
      "page_index": 1,
      "blocks": [
        {
          "block_id": "mc_abe045-00b4qk_0003_b0001",
          "text": "SPLŇUJE SE TO, PO ČEM SOKOLSTVO VOLALO PŘED PĚTI LÉTY.",
          "bbox": { "x": 282, "y": 136, "w": 1020, "h": 270 },
          "confidence": 1.0
        }
      ]
    }
  ],

  "unique_entities": [
    {
      "local_entity_id": "le_000001",
      "entity_type": "organization",
      "canonical_label": "Sokolstvo",
      "surface_forms": ["Sokolstvo", "Sokol"],
      "resolution": {
        "status": "new_entity_candidate",
        "confidence": 0.0,
        "explanation": "No database candidates found."
      }
    }
  ],

  "mentions": [
    {
      "mention_id": "m_a1b2c3d4e5f6",
      "local_entity_id": "le_000001",
      "surface_text": "Sokolstvo",
      "anchor": {
        "page_id": "mc_abe045-00b4qk_0003",
        "block_id": "mc_abe045-00b4qk_0003_b0001",
        "char_start": 42,
        "char_end": 51
      }
    }
  ],

  "weaviate_updates": [],
  "warnings": []
}
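Because mentions point at entities and blocks by ID, a consumer may want to check those references before trusting a file. The following is a minimal sketch of such a check, assuming only the fields shown in the example above; it is not the validate-output implementation, and `find_cross_reference_errors` is a hypothetical name.

```python
def find_cross_reference_errors(result: dict) -> list[str]:
    """List mentions whose entity or anchor does not resolve inside the file."""
    errors: list[str] = []
    entity_ids = {e["local_entity_id"] for e in result.get("unique_entities", [])}

    # Index block text by (page_id, block_id) for anchor lookups.
    blocks: dict[tuple, str] = {}
    for page in result.get("prepared_pages", []):
        for block in page.get("blocks", []):
            blocks[(page["page_id"], block["block_id"])] = block["text"]

    for mention in result.get("mentions", []):
        mention_id = mention.get("mention_id", "<no id>")
        if mention.get("local_entity_id") not in entity_ids:
            errors.append(f"{mention_id}: unknown local_entity_id")
        anchor = mention.get("anchor", {})
        key = (anchor.get("page_id"), anchor.get("block_id"))
        if key not in blocks:
            errors.append(f"{mention_id}: anchor points at a missing block")
            continue
        start, end = anchor.get("char_start", 0), anchor.get("char_end", 0)
        if not 0 <= start <= end <= len(blocks[key]):
            errors.append(f"{mention_id}: character range outside block text")
    return errors
```

An empty list means every mention resolves to a known entity and a character range inside an existing block.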

Weaviate integration

Collections

EntityRecord

One object per canonical named entity in your database. The search_text field is BM25-indexed for fast retrieval.

Key fields: db_entity_id, entity_type, canonical_label, aliases, search_text, short_description, markdown_text, evidence_document_ids, semant_revision.

EntityDocumentEvidence

One object per document × entity link. Created or updated by --update-weaviate evidence.

Key fields: evidence_id (ev_{document_id}_{local_entity_id}_{db_entity_id}), document_id, db_entity_id, surface_forms, mention_count, confidence, run_id.
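The evidence_id pattern is deterministic, which is what makes a create-or-update write possible. A minimal sketch follows, assuming the three parts are joined exactly as in the pattern above. Weaviate object IDs are UUIDs, so the sketch also derives a name-based UUID from the key; whether evidence_writer.py does this is an assumption, and both function names are hypothetical.

```python
import uuid


def make_evidence_id(document_id: str, local_entity_id: str, db_entity_id: str) -> str:
    """Build the evidence key from the three identifiers in the pattern."""
    return f"ev_{document_id}_{local_entity_id}_{db_entity_id}"


def evidence_uuid(evidence_id: str) -> uuid.UUID:
    # Assumption: a UUIDv5 of the key gives a stable object ID, so a rerun
    # addresses the same object instead of creating a duplicate.
    return uuid.uuid5(uuid.NAMESPACE_URL, evidence_id)
```

Here `make_evidence_id("mc_abe045-00b4qk", "le_000001", "ent_42")` gives `ev_mc_abe045-00b4qk_le_000001_ent_42`, where `ent_42` is a made-up database ID.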

Resolution scoring

Candidates are ranked with a weighted formula (spec §16.2):

| Component | Weight |
| --- | --- |
| Label similarity (rapidfuzz token_set_ratio) | 0.25 |
| Alias similarity | 0.15 |
| Entity type match | 0.15 |
| Context similarity (short description vs document summary) | 0.15 |
| Date compatibility | 0.10 |
| Claim compatibility | 0.10 |
| Language match | 0.05 |
| Authority link match | 0.05 |
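The weights sum to 1.0, so the combined score stays in the range 0 to 1 when each component does. The sketch below shows the arithmetic only. The component key names are invented for this example, and treating a missing component as 0.0 is an assumption; the real _score_candidate() may handle absent signals differently.

```python
# Weights copied from the component list above; key names are hypothetical.
WEIGHTS = {
    "label_similarity": 0.25,
    "alias_similarity": 0.15,
    "entity_type_match": 0.15,
    "context_similarity": 0.15,
    "date_compatibility": 0.10,
    "claim_compatibility": 0.10,
    "language_match": 0.05,
    "authority_link_match": 0.05,
}


def score_candidate(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return total
```

Under these assumptions, a candidate that matches perfectly on label, alias, and entity type but has no other signal scores 0.25 + 0.15 + 0.15 = 0.55, so it would need support from the remaining components to reach either threshold.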

Thresholds (configurable):

| Threshold | Default | Meaning |
| --- | --- | --- |
| auto_link_threshold | 0.92 | Auto-link without LLM review |
| review_threshold | 0.75 | Pass to LLM resolver if use_llm_resolver: true |
| min_margin_for_auto_link | 0.08 | Top-2 score gap required for auto-link |
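Read together with the entity resolution flow, the thresholds amount to a small decision procedure. The sketch below is one possible reading of it, not the resolver's code. The function name `decide`, the interim label `llm_review`, and the handling of a single candidate (runner-up score taken as 0.0) are assumptions.

```python
def decide(
    scores: list[float],
    *,
    auto_link_threshold: float = 0.92,
    review_threshold: float = 0.75,
    min_margin: float = 0.08,
    use_llm_resolver: bool = True,
) -> str:
    """Map candidate scores to a resolution status."""
    if not scores:
        return "new_entity_candidate"  # no database candidates found
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    # Assumption: with a single candidate the runner-up counts as 0.0.
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    if top >= auto_link_threshold and top - runner_up >= min_margin:
        return "linked"
    if top < review_threshold:
        return "new_entity_candidate"
    # In the review band the LLM resolver, if enabled, makes the final call
    # (link, ambiguous, or new); "llm_review" is a placeholder for that step.
    return "llm_review" if use_llm_resolver else "ambiguous"
```

Note that a high top score alone is not enough in this reading: two candidates at 0.95 and 0.93 fall through to review because the gap is below the required margin.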

Direct update policy

When update_weaviate: direct, the following fields may be updated on existing EntityRecord objects:

| Field | Rule |
| --- | --- |
| aliases | Add attested surface forms and LLM-proposed variants |
| short_description | Fill if empty (optionally overwrite) |
| markdown_text | Append evidence section (idempotent) |
| evidence_document_ids | Add current document ID |
| evidence_refs | Add/update evidence reference |
| search_text | Recomputed from canonical label + aliases + description |
| semant_revision | Incremented by 1 |
| updated_at | Always set |

The write is best-effort idempotent; repeated runs produce the same result.
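To make the field rules concrete, here is a minimal sketch that applies them to a plain dict. It is not record_updater.py: the function name, the HTML-comment marker used to detect an existing evidence section, and the space-joined search_text are all assumptions, and evidence_refs and updated_at are left out for brevity.

```python
def apply_direct_update(
    record: dict,
    *,
    document_id: str,
    surface_forms: list[str],
    description: str,
    evidence_markdown: str,
    overwrite_description: bool = False,
) -> dict:
    """Return an updated copy of an EntityRecord-like dict."""
    updated = dict(record)

    # aliases: add surface forms that are not already known.
    aliases = list(record.get("aliases", []))
    known = {record.get("canonical_label", ""), *aliases}
    for form in surface_forms:
        if form and form not in known:
            aliases.append(form)
            known.add(form)
    updated["aliases"] = aliases

    # short_description: fill if empty, overwrite only when asked to.
    if description and (overwrite_description or not record.get("short_description")):
        updated["short_description"] = description

    # markdown_text: append once per document (marker format is hypothetical).
    marker = f"<!-- evidence:{document_id} -->"
    markdown = record.get("markdown_text", "")
    if marker not in markdown:
        markdown = f"{markdown}\n\n{marker}\n{evidence_markdown}".strip()
    updated["markdown_text"] = markdown

    # evidence_document_ids: add the current document once.
    doc_ids = list(record.get("evidence_document_ids", []))
    if document_id not in doc_ids:
        doc_ids.append(document_id)
    updated["evidence_document_ids"] = doc_ids

    # search_text: recomputed from label, aliases, and description.
    parts = [updated.get("canonical_label", ""), *aliases, updated.get("short_description", "")]
    updated["search_text"] = " ".join(part for part in parts if part)

    updated["semant_revision"] = record.get("semant_revision", 0) + 1
    return updated
```

In this sketch a second call with the same arguments leaves the content fields unchanged, while semant_revision still increases on every call; the actual tool may treat a no-op write differently.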


Example configurations

Pre-built configurations are in configurations/:

| File | Use case |
| --- | --- |
| minimal_ollama.yaml | Local Ollama, no Weaviate |
| doc2_vllm.yaml | vLLM with Czech language hints |
| vllm_with_weaviate_evidence.yaml | vLLM + Weaviate evidence writes |
| full_direct_update.yaml | Full pipeline with direct EntityRecord updates |

Code structure

semant_ne/
├── cli.py                    # typer application, all CLI commands
├── config.py                 # Pydantic config models + YAML loader
├── models.py                 # All domain data models (Pydantic v2)
│
├── document_loader.py        # Auto-discovers pages from a directory
├── pagexml_parser.py         # PAGE XML 2019-07-15 parser
├── alto_parser.py            # ALTO 2.x/3.x parser
├── block_preparation/
│   └── simple_ocr_blocks.py  # Mode A: OCR regions → PreparedTextBlock list
│
├── llm/
│   ├── client.py             # LLMClient (OpenAI-compatible, retry, JSON extract)
│   ├── schemas.py            # Pydantic schemas for LLM I/O
│   ├── prompts.py            # Prompt builder (build_page_extraction_messages)
│   └── page_extractor.py     # Orchestrates extraction; handles block batching
│
├── anchoring.py              # Fuzzy mention → text block anchoring
├── entity_policy.py          # EntityType normalisation (PERSON → person)
├── ledger.py                 # EntityLedger: accumulates entities across pages
├── claim_validator.py        # Validates and normalises LLM-extracted claims
├── consolidation.py          # python_conservative_merge; FinalEntityConsolidator
│
├── weaviate/
│   ├── client.py             # open_weaviate() context manager
│   ├── resolver.py           # BM25 search + weighted scoring + LLM resolver
│   ├── evidence_writer.py    # Upserts EntityDocumentEvidence
│   ├── record_updater.py     # Direct updates to EntityRecord
│   └── init_collections.py   # Creates Weaviate collections (idempotent)
│
├── output_writer.py          # Atomic JSON write; load_document_result
└── validation.py             # Cross-reference validation of DocumentResult

scripts/
└── init_weaviate.py          # Standalone CLI for collection initialisation

configurations/
├── minimal_ollama.yaml
├── doc2_vllm.yaml
├── vllm_with_weaviate_evidence.yaml
└── full_direct_update.yaml

Key data model relationships

classDiagram
    class DocumentResult {
        document_id: str
        run: RunInfo
        document_metadata: DocumentMetadata
        prepared_pages: list[PreparedPage]
        unique_entities: list[LocalEntity]
        mentions: list[Mention]
        weaviate_updates: list[WeaviateUpdateRecord]
        warnings: list[str]
    }
    class LocalEntity {
        local_entity_id: str
        entity_type: EntityType
        canonical_label: str
        surface_forms: list[str]
        extracted_claims: list[Claim]
        resolution: ResolutionDecision
    }
    class Mention {
        mention_id: str
        local_entity_id: str
        surface_text: str
        anchor: TextAnchor
    }
    class TextAnchor {
        page_id: str
        block_id: str
        char_start: int
        char_end: int
        alignment_method: str
    }
    class ResolutionDecision {
        status: linked|ambiguous|new_entity_candidate|unresolved
        db_entity_id: str
        confidence: float
        candidate_scores: list[CandidateScore]
    }
    class Claim {
        claim_id: str
        predicate: str
        value_text: str
        update_eligibility: str
    }
    DocumentResult "1" --> "many" LocalEntity : unique_entities
    DocumentResult "1" --> "many" Mention : mentions
    LocalEntity "1" --> "many" Mention : via local_entity_id
    Mention "1" --> "1" TextAnchor : anchor
    LocalEntity "1" --> "0..1" ResolutionDecision : resolution
    LocalEntity "1" --> "many" Claim : extracted_claims

Testing

# Run all tests
pytest

# With coverage
pytest --cov=semant_ne --cov-report=term-missing

# Specific module
pytest tests/test_ledger.py -v

219 tests across 15 modules. Integration tests against ./document/ are skipped if the directory is not present.

| Test module | What it covers |
| --- | --- |
| test_pagexml_parser | PAGE XML parsing, reading order, confidence |
| test_alto_parser | ALTO parsing, normalisation |
| test_block_preparation | Block splitting, merging, char offsets |
| test_document_loader | Directory discovery, sniffing, metadata |
| test_output_schema | Serialisation round-trip, schema_version |
| test_entity_policy | Type normalisation, aliases |
| test_anchoring | Exact/fuzzy/context-only/unanchored strategy |
| test_ledger | attach_or_create, merge, claims, summaries |
| test_llm_schemas | JSON extraction, permissive parsing |
| test_claim_validator | Predicate mapping, subject validation |
| test_consolidation | python_conservative_merge, FinalEntityConsolidator |
| test_weaviate_resolver | Scoring formula, auto-link, mocked Weaviate |
| test_evidence_writer | Create/update evidence objects, mocked Weaviate |
| test_record_updater | Alias/description/markdown updates, mocked Weaviate |
| test_integration_real_document | Full pipeline on ./document/ |

Known limitations

| Limitation | Workaround |
| --- | --- |
| Pages with very dense OCR (single block > ~18 000 chars + entity memory) may overflow max_model_len even after batching | Lower max_blocks_per_batch or split the page manually |
| max_tokens_per_request: 4096 is too small for very rich pages | Increase to 8192 if your model supports it |
| Weaviate BM25 search only returns up to max_candidates=5 results; rare entities may be missed | Increase max_candidates |
| llm_page and llm_suggest_then_python merge modes are parsed but not yet active | Use python_conservative or final_llm_consolidation |
| No multi-document batch command | Process documents in a shell loop |
| The entity ID counter is module-level, so reusing one Python process across documents yields monotonically increasing IDs | This is intentional; IDs remain unique within a session |
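Until a batch command exists, the loop can also be driven from Python. The sketch below only uses the process-document flags documented in the CLI reference; `build_commands` and `run_all` are hypothetical helpers, and the layout of one sub-directory per document is an assumption.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_commands(
    docs_root: Path,
    output_dir: Path,
    config: Optional[Path] = None,
    extract: bool = False,
) -> list[list[str]]:
    """Build one process-document invocation per document directory."""
    commands = []
    for doc_dir in sorted(p for p in docs_root.iterdir() if p.is_dir()):
        cmd = [
            "semant-ne", "process-document",
            "--doc-root", str(doc_dir),
            "--output", str(output_dir / f"{doc_dir.name}.json"),
        ]
        if config is not None:
            cmd += ["--config", str(config)]
        if extract:
            cmd.append("--extract")
        commands.append(cmd)
    return commands


def run_all(commands: list[list[str]]) -> list[list[str]]:
    """Run each command in its own process and return the ones that failed."""
    failed = []
    for cmd in commands:
        if subprocess.run(cmd, check=False).returncode != 0:
            failed.append(cmd)
    return failed
```

Launching a separate process per document also means each run starts with a fresh interpreter, which should sidestep the shared entity ID counter described in the last row above.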

Licence

MIT — see LICENSE.
