Impresso Text Preparation provides OCR format conversion pipelines for newspaper and radio historical data, from a variety of source formats into Impresso's unified representations: Canonical and Rebuilt formats.
- The canonical format focuses on maintaining the logical structure and hierarchy of the source historical media content, defining Issues and Physical Support objects (pages, audio recordings).
- The rebuilt format defines a uniform document representation of Content Items (articles, ads, images...) designed for large-scale downstream applications such as Machine Learning fine-tuning and inference and Information Retrieval.
The repository is fully documented in the Impresso Text Preparation docs, which include detailed descriptions of the architecture, pipelines, and usage instructions.
Install from PyPI:
pip install impresso-text-preparation
Install from the repository root:
pip install -e .
- Overall logic: ingest raw OCR/XML source collections, detect and select individual issues from a local file system, convert them into Impresso canonical JSON, and upload the result to S3. Optionally, a data manifest versioning the contents of the created data can be computed on the fly and pushed to the same location on S3 (for more information, see the Impresso documentation on Data Versioning Manifests).
- Main folder: text_preparation/importers/
- Main launcher: format-specific wrapper scripts under text_preparation/importer_scripts/, which call the generic importer engine in text_preparation/importers/generic_importer.py.
Supported importers by format:
Base modules for formats represented in multiple sources:
- METS/ALTO: widely used format supporting both OCR-only (ALTO only) and OCR+OLR (ALTO + METS) variants
  - Each source has its own METS/ALTO flavor with specific file and metadata organization logic, requiring custom processing, but they all share some core parsing and formatting logic.
  - Wrapper: mets_alto_importer.py | Package: text_preparation/importers/mets_alto
- TETML: OCR format from the TET (Text Extraction Toolkit) software, used for PDF-embedded OCR in some sources (e.g. FedGaz). No OLR, but it has a unique structure and metadata organization.
  - Wrapper: tetmlimporter.py | Package: text_preparation/importers/tetml
Per-provider submodules:
- SNL: Swiss National Library + LeTemps
  - Olive XML format (with OLR), first set of shared data
    - Wrapper: oliveimporter.py | Package: text_preparation/importers/olive
  - METS/ALTO, RERO variant (with OLR), second and third sets of shared data
    - Wrapper: reroimporter.py | Package: text_preparation/importers/rero
- LUX: National Library of Luxembourg
  - METS/ALTO, BNL variant (with OLR)
    - Wrapper: luximporter.py | Package: text_preparation/importers/lux
- BNF: Bibliothèque Nationale de France
  - METS/ALTO, BNF variant (with OLR)
    - Wrapper: bnfimporter.py | Package: text_preparation/importers/bnf
  - METS/ALTO, BNF-Europeana variant (with OLR)
    - Wrapper: bnfen_importer.py | Package: text_preparation/importers/bnf_en
- BCUL: Bibliothèque Cantonale Universitaire de Lausanne
  - ABBYY format (no OLR)
    - Wrapper: bculimporter.py | Package: text_preparation/importers/bcul
- BL: British Library
  - METS/ALTO format, OmniPage variant (with OLR)
    - Wrapper: blomnimporter.py | Package: text_preparation/importers/bl/omni
- KBR: Royal Library of Belgium
  - ALTO, KBR format (no OLR)
    - Wrapper: kbrimporter.py | Package: text_preparation/importers/kbr
- SUB: Hamburg State Library
  - ALTO, SUB format (no OLR)
    - Wrapper: subimporter.py | Package: text_preparation/importers/sub
- SWA: Schweizerisches Wirtschaftsarchiv
  - ALTO, SWA format (no OLR)
    - Wrapper: swaimporter.py | Package: text_preparation/importers/swa
- FedGaz: Swiss Federal Gazette
  - PDF-embedded OCR format converted to TETML (no OLR)
    - Wrapper: fedgazimporter.py | Package: text_preparation/importers/fedgaz
- Coming soon
  - ONB: Austrian National Library
  - KB: National Library of the Netherlands
  - SBB: Berlin State Library
- SWISSINFO: Swissinfo radio bulletin format
  - Originally PDF-embedded OCR, extracted and converted to a custom format for import (no OLR)
    - Wrapper: swissinfoimporter.py | Package: text_preparation/importers/swissinfo
- RTS: Radio Télévision Suisse
  - AudioDoc ASR format
    - Wrapper: rtsimporter.py | Package: text_preparation/importers/rts
- INA: Institut National de l'Audiovisuel (France)
  - Custom JSON-based format extracted from Whisper outputs
    - Wrapper: inaimporter.py | Package: text_preparation/importers/ina
Importers are executed through the format-specific wrapper scripts in text_preparation/importer_scripts/, which all call the generic importer engine in text_preparation/importers/generic_importer.py and text_preparation/importers/core.py.
Example command for an import of BNL data:
python -m text_preparation.importer_scripts.bnlimporter \
  --input-dir /path/to/BNL-source-data \
  --output-dir /path/to/canonical-output \
  --config-file /path/to/text_preparation/config/importer_config/import_BNL.json \
  --provider BNL \
  --chunk-size 10 \
  --log-file /path/to/log_file.log \
  --s3-bucket my-canonical-bucket \
  --clear \
  --verbose
Replace --clear with --incremental to skip issues that are already present in the output directory; --verbose enables debug-level logging.
Importer script options:
- --input-dir: directory containing the source data.
- --output-dir: directory to write canonical JSON output.
- --temp-dir: temporary directory for archive extraction.
- --image-dirs: optional image metadata directories (only for Olive format).
- --config-file: JSON config file for selective import.
- --s3-bucket: upload output to the given S3 bucket.
- --scheduler: Dask scheduler address.
- --num-workers: number of local Dask workers.
- --clear: remove the output directory before running.
- --incremental: skip issues already present in the output directory.
- --dont-push-manifest: do not push the generated manifest to git.
- --is-audio: enable audio import mode.
- --verbose: enable detailed logging.
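When several imports need to be launched in sequence, the command line can be assembled programmatically. The following is a minimal sketch and not part of the package: `build_importer_command` and its `mode` parameter are hypothetical names introduced here, and only flags from the option list above are used.

```python
import shlex
import sys
from typing import List, Optional


def build_importer_command(
    module: str,
    input_dir: str,
    output_dir: str,
    config_file: Optional[str] = None,
    s3_bucket: Optional[str] = None,
    mode: Optional[str] = None,  # "clear", "incremental", or None (hypothetical parameter)
    verbose: bool = False,
) -> List[str]:
    """Assemble the argument list for one importer run (hypothetical helper)."""
    if mode not in (None, "clear", "incremental"):
        raise ValueError("mode must be 'clear', 'incremental', or None")
    cmd = [sys.executable, "-m", module,
           "--input-dir", input_dir,
           "--output-dir", output_dir]
    if config_file:
        cmd += ["--config-file", config_file]
    if s3_bucket:
        cmd += ["--s3-bucket", s3_bucket]
    if mode:
        # --clear and --incremental are alternatives, so at most one is added.
        cmd.append(f"--{mode}")
    if verbose:
        cmd.append("--verbose")
    return cmd


command = build_importer_command(
    module="text_preparation.importer_scripts.bnlimporter",
    input_dir="/path/to/BNL-source-data",
    output_dir="/path/to/canonical-output",
    config_file="/path/to/text_preparation/config/importer_config/import_BNL.json",
    mode="incremental",
    verbose=True,
)
# Print the shell-quoted command for inspection before running it.
print(shlex.join(command))
# To launch the import: subprocess.run(command, check=True)
```

Building the arguments as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.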
The selection of the actual newspaper data to be imported can be controlled by means of a configuration file (JSON format). The path to this file is passed via the --config-file CLI parameter.
This JSON file contains three properties:
- aliases: a dictionary containing the media aliases to be imported (e.g. GDL);
- exclude_aliases: a list of the media aliases to be excluded;
- year_only: a boolean flag indicating whether date ranges are expressed in years or in more granular dates (in the format YYYY/MM/DD).
When ingesting large amounts of data, this configuration makes it possible to organise imports into batches or homogeneous collections.
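To illustrate how these three properties could interact, here is a minimal sketch of a selection check. It is not the importer's actual logic. It assumes that a range is written as two dates joined by a hyphen (as in the GDL example in this section), that an empty value means "no date restriction", and that with year_only set a range looks like 1950-1960. The function names and the alias XYZ are hypothetical.

```python
import json
from datetime import date, datetime
from typing import Dict, Optional, Tuple

DateRange = Optional[Tuple[date, date]]


def parse_range(value, year_only: bool) -> DateRange:
    """Turn an alias value into (start, end); an empty value means no restriction (assumption)."""
    if not value:
        return None
    start_text, end_text = value.split("-")
    if year_only:
        # Assumed year-only syntax: "1950-1960" covers both years in full.
        return date(int(start_text), 1, 1), date(int(end_text), 12, 31)
    fmt = "%Y/%m/%d"
    return (datetime.strptime(start_text, fmt).date(),
            datetime.strptime(end_text, fmt).date())


def load_selection(config_text: str) -> Dict[str, DateRange]:
    """Map each alias that is not excluded to its optional date range."""
    config = json.loads(config_text)
    excluded = set(config.get("exclude_aliases", []))
    year_only = config.get("year_only", False)
    return {alias: parse_range(value, year_only)
            for alias, value in config["aliases"].items()
            if alias not in excluded}


def is_selected(selection: Dict[str, DateRange], alias: str, issue_date: date) -> bool:
    """Return True if an issue of `alias` published on `issue_date` should be imported."""
    if alias not in selection:
        return False
    date_range = selection[alias]
    return date_range is None or date_range[0] <= issue_date <= date_range[1]


example = """
{
  "aliases": {"GDL": "1950/01/01-1960/12/31", "XYZ": []},
  "exclude_aliases": [],
  "year_only": false
}
"""
selection = load_selection(example)
print(is_selected(selection, "GDL", date(1955, 6, 1)))
print(is_selected(selection, "GDL", date(1961, 1, 1)))
```

Splitting on the hyphen is safe under these assumptions because the dates themselves use slashes as separators.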
Here is a simple configuration file:
{
  "aliases": {
    "GDL": []
  },
  "exclude_aliases": [],
  "year_only": false
}
Here is a more complex configuration file (only GDL content from 1950 to 1960 is processed):
{
  "aliases": {
    "GDL": "1950/01/01-1960/12/31"
  },
  "exclude_aliases": [],
  "year_only": false
}
- Overall logic: read canonical Impresso JSON from S3, join it with page/audio support data, rebuild content items, and write downstream output for Solr or Passim.
- Main folder: text_preparation/rebuilders/
- Main launcher: text_preparation/rebuilders/rebuilder.py
- Core code:
  - core orchestrator: rebuilder.py
  - helper functions: helpers.py
  - reconstruction logic specific to paper-based content items: paper_rebuilders.py
  - reconstruction logic specific to audio-based content items: audio_rebuilders.py
The rebuilder reads canonical Impresso JSON from S3 and produces rebuilt files for downstream use.
Example command:
python -m text_preparation.rebuilders.rebuilder rebuild_articles \
  --input-bucket my-canonical-bucket \
  --output-dir /path/to/rebuilt-output \
  --filter-config /path/to/text_preparation/config/rebuilt_config/filter_config.json \
  --log-file /path/to/log_file.log \
  --format solr \
  --nworkers 32 \
  --temp-dir /path/to/temp_dir \
  --verbose \
  --compute-mft
Common rebuilder options:
- rebuild_articles: rebuilds canonical issues into articles.
- --input-bucket: S3 bucket containing canonical JSON input.
- --output-dir: local directory for rebuilt output.
- --output-bucket: optional S3 bucket to upload rebuilt files.
- --filter-config: JSON file that defines issue batches to rebuild.
- --format: solr or passim.
- --languages: comma-separated languages to filter.
- --nworkers: number of local Dask workers.
- --compute-mft: compute the rebuilt manifest on the fly during processing.
- --prev-manifest: optional path to a previous manifest.
- --git-repo: local path to the impresso-text-acquisition repository.
- --temp-dir: temporary directory for repository cloning or temporary work.
- --scheduler: Dask scheduler address.
- --clear: remove the output directory before rebuilding.
- --verbose: enable detailed logging.
- Overall logic: explore, analyze and prepare raw or messy source data before import, e.g. reorganize original files, extract OCR from PDFs, generate index metadata, and run format-specific preprocessing steps. Each script or notebook is customized for specific data and preprocessing tasks.
- Main entry points:
  - text_preparation/importer_scripts/preprocessing/bl_reorganize_original_data.py
  - text_preparation/importer_scripts/preprocessing/swissinfo_extract_ocr_from_pdfs.py
  - Jupyter notebooks such as bcul_preprocess_issues.ipynb, bnl_preprocess_issues.ipynb, and index_generator.ipynb.
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2024 The Impresso team.
This project is available under the GNU Affero General Public License v3 or later.