KCAC Research Monorepo

This repository collects the local KCAC OCR and document-understanding projects into one GitHub repository.

The main project at the repository root is the KCAC OCR Pipeline v0.1. Additional related tools live under projects/.

Projects

Path	Purpose	Main documentation
`.`	OCR pipeline for formatting datasets, testing OCR/VLM engines, consensus, PAGE XML, eScriptorium export, reports, and Hugging Face JSONL export.	`docs/how_to_run_and_use.md`
`projects/tesseract-ocr-training/`	Tesseract, PaddleOCR, and OCR baseline training/evaluation scripts for Kurdish OCR experiments.	`projects/tesseract-ocr-training/README.md`
`projects/kcac-pdf-fetcher/`	KCAC archive downloader/client for page images, PDFs, OCR text extraction, and validation.	`projects/kcac-pdf-fetcher/README.md`
`projects/surya-kurdish-region-detection/`	Surya-based Kurdish text-line and layout-region detection workflow.	`projects/surya-kurdish-region-detection/README.md`

Large generated artifacts are intentionally not part of git: PDFs, page-image datasets, caches, trained weights, model checkpoints, output folders, and annotation outputs. Keep those locally or publish them separately as dataset/model releases.

OCR Pipeline Quick Start

Run from the repository root:

(& conda shell.powershell hook) | Out-String | Invoke-Expression
conda activate surya
python -m pip install -r requirements.txt
python -m pipeline --config config.yaml.example doctor

For a local GPU setup (for example an RTX 3090) with a Qwen model served through Ollama:

ollama list
ollama run qwen25vl-sorani-ocr:latest
python -m pipeline --config config.ollama.example doctor
python -m pipeline --config config.ollama.example bootstrap --limit 1
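
Before running doctor against the Ollama config, it can help to confirm that the model is actually registered with the local server. The sketch below is not part of the pipeline. It assumes Ollama is listening on its default address (http://localhost:11434) and uses its /api/tags model-listing endpoint; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Return True if `wanted` (e.g. 'qwen25vl-sorani-ocr:latest') is listed."""
    return wanted in model_names(tags_payload)


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Query the local Ollama server; raises OSError if it is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling has_model(fetch_tags(), "qwen25vl-sorani-ocr:latest") should return True once the model appears in ollama list. An OSError from fetch_tags usually means the server is not running.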

The full setup guide, file map, and reasoning are in docs/how_to_run_and_use.md.

OCR Pipeline Commands

python -m pipeline --config config.yaml bootstrap --limit 1
python -m pipeline --config config.yaml consensus --limit 1
python -m pipeline --config config.yaml pagexml --limit 1
python -m pipeline --config config.yaml escriptorium --limit 1
python -m pipeline --config config.yaml queue
python -m pipeline --config config.yaml reports
python -m pipeline --config config.yaml benchmark
python -m pipeline --config config.yaml hf-export

run-all runs the same stages in order. Existing per-engine/page outputs are skipped unless --force is passed.
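
To make the ordering and the skip rule concrete, here is a minimal sketch. It is not the pipeline's own implementation: the stage order is assumed to follow the command list above, and the helper names are hypothetical.

```python
import sys
from pathlib import Path

# Assumption: run-all follows the order of the commands listed above.
STAGES = [
    "bootstrap", "consensus", "pagexml", "escriptorium",
    "queue", "reports", "benchmark", "hf-export",
]


def stage_command(stage: str, config: str = "config.yaml") -> list:
    """Build the argv for one stage: python -m pipeline --config <config> <stage>."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return [sys.executable, "-m", "pipeline", "--config", config, stage]


def should_run(output: Path, force: bool = False) -> bool:
    """Skip work whose output already exists, unless force is set."""
    return force or not output.exists()
```

Passing each stage_command(stage) to subprocess.run(..., check=True) in STAGES order would stop at the first failing stage, which is one reasonable way to run the stages by hand.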

Model And Data Notes

Kraken is an optional in-process legacy OCR dependency. It is kept in requirements-ocr-py310.txt because it cannot be installed on Python 3.13. Calamari is split into requirements-calamari-py310.txt and should be installed in a separate environment because its python-bidi dependency conflicts with Kraken.

Use config.full.example only after you place real checkpoints under models/kraken/ and models/calamari/ or edit those paths. doctor is expected to fail for any enabled engine whose binary, checkpoint, Ollama server/model, or API key is missing.
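
The kind of check that doctor performs can be approximated by hand before enabling an engine. The sketch below is an illustration only, not doctor's actual code: the function name, its parameters, and the example paths are hypothetical. It covers the binary, checkpoint, and API key cases and leaves out the Ollama server check.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional


def missing_prerequisites(
    binary: Optional[str] = None,
    checkpoint: Optional[str] = None,
    api_key_env: Optional[str] = None,
) -> List[str]:
    """Return readable problems for one engine; an empty list means ready."""
    problems = []
    # A binary counts as available only if it resolves on PATH.
    if binary and shutil.which(binary) is None:
        problems.append(f"binary not on PATH: {binary}")
    # A checkpoint must exist on disk at the configured path.
    if checkpoint and not Path(checkpoint).exists():
        problems.append(f"checkpoint not found: {checkpoint}")
    # An API key is read from the environment and must be non-empty.
    if api_key_env and not os.environ.get(api_key_env):
        problems.append(f"environment variable not set: {api_key_env}")
    return problems
```

For example, missing_prerequisites(checkpoint="models/kraken/<your-checkpoint>") returns a one-item list until a real checkpoint is placed at that path.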

KCAC Web Page

🚀 Open the KCAC interactive page: docs/home1.html
