Invoice Engine

An AI-powered ETL pipeline that digitizes and cleans 25 years of legacy business invoices stored as binary .doc files.

Getting Started

Prerequisites

Python 3.14+
uv package manager
System Tools: antiword (for document parsing), libwmf (for thumbnail extraction).

Setup

Clone the repository.
Initialize the environment:
```
uv sync
```

Configure your environment variables by creating a .env file:

FOLDER_ID_SOURCE_DOCS=...
SERVICE_ACCOUNT_CREDENTIALS_DRIVE_READER=drive-reader-service-account.json
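
As a rough illustration of how these two variables might be read at startup, here is a minimal standard-library sketch. The helper names `parse_env` and `load_settings` are hypothetical and do not come from the project, which may well use a dedicated dotenv library instead.

```python
import os
from pathlib import Path

# Variable names taken from the .env example above.
REQUIRED = ("FOLDER_ID_SOURCE_DOCS", "SERVICE_ACCOUNT_CREDENTIALS_DRIVE_READER")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path: Path = Path(".env")) -> dict[str, str]:
    """Merge .env values with the process environment (the environment wins)."""
    file_values = parse_env(env_path.read_text()) if env_path.exists() else {}
    settings = {name: os.environ.get(name, file_values.get(name, "")) for name in REQUIRED}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return settings
```

Failing fast on a missing variable keeps a misconfigured run from starting a long Drive sync with no credentials.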

Usage

The project exposes its CLI entry points as uv project scripts, so every command runs through uv run.

Running the Pipeline

The pipeline is stage-gated: each stage must complete for the entire archive before the next stage starts.

# Run the full pipeline (Discovery -> Conversion -> Scoping)
uv run run-pipeline

# Run specific stages
uv run run-pipeline --stages discovery
uv run run-pipeline --stages conversion
uv run run-pipeline --stages scoping
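
The stage gate described above can be sketched as a nested loop with the stage on the outside and the files on the inside. This is an illustration only: `run_stages` and the per-file callables are hypothetical, and only the three stage names come from this README.

```python
from typing import Callable, Iterable

# Stage order taken from the README; the callables passed in are placeholders.
STAGE_ORDER = ("discovery", "conversion", "scoping")


def run_stages(
    items: Iterable[str],
    stages: dict[str, Callable[[str], None]],
    selected: Iterable[str] = STAGE_ORDER,
) -> list[str]:
    """Run each selected stage over every item before starting the next stage."""
    items = list(items)
    wanted = set(selected)
    unknown = wanted - set(STAGE_ORDER)
    if unknown:
        raise ValueError(f"Unknown stages: {sorted(unknown)}")
    log = []
    for name in STAGE_ORDER:  # fixed order, regardless of how stages were requested
        if name not in wanted:
            continue
        for item in items:
            # An exception here aborts the run, so later stages never start.
            stages[name](item)
            log.append(f"{name}:{item}")
    return log
```

Putting the stage in the outer loop is what makes the gate: no file reaches conversion until discovery has finished for all of them.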

Exploration & Diagnostics

Use the inspector to manually triage specific files or batches:

uv run explore-doc-inspector --output-format bracketed --batch 5

Project Evolution

V2: Python Pipeline (Current)

V2 migrates the project to a modular Python architecture that handles high-volume processing (~2500 files) with stateful resumption and parallel execution.
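
One way to combine stateful resumption with parallel execution is a checkpoint file listing completed ids, consulted before work is submitted to a thread pool. The sketch below assumes a JSON checkpoint and a hypothetical `process_with_resume` helper; the project's actual state format and concurrency model are not documented here.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def process_with_resume(
    file_ids: list[str],
    work: Callable[[str], None],
    state_path: Path,
    max_workers: int = 8,
) -> set[str]:
    """Process file_ids in parallel, skipping ids already recorded in state_path."""
    done = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    pending = [fid for fid in file_ids if fid not in done]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, fid): fid for fid in pending}
        for future in as_completed(futures):
            if future.exception() is None:  # failed ids stay pending for the next run
                done.add(futures[future])
                # Checkpoint from the main thread only, so writes never interleave.
                state_path.write_text(json.dumps(sorted(done)))
    return done
```

Rerunning the function with the same `state_path` picks up only the ids that failed or were never reached.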

Discovery: High-speed parallel mirroring of remote Drive files with atomic write safety.
Conversion: Structural parsing using antiword to produce "Bracketed Text" format with <br> table cell fidelity and OLE2 metadata headers.
Scoping: Forensic date extraction and dual-branch bucketing to define the active working set.
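
The "atomic write safety" mentioned for Discovery is commonly achieved by writing to a temporary file and renaming it into place. The following is a generic sketch of that pattern, not the project's code; `atomic_write` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(dest: Path, data: bytes) -> None:
    """Write data to dest so readers see either the old file or the complete new one."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, dest)  # replaces dest in a single step
    except BaseException:
        # Remove the partial file so an interrupted download leaves no debris.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this pattern, an interrupted mirror run never leaves a truncated .doc file that a later stage could mistake for a complete download.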

V1: Apps Script Prototype (Legacy)

A Google Apps Script-based prototype that proved the viability of Gemini-powered extraction.

Limited by Google Apps Script execution timeouts.
Utilized Google Docs API for initial structural extraction.
Stored in the /v1 directory for historical reference.

Documentation

/memory-bank: Durable store of design thinking, project progress, and architectural patterns.
/v2: Core Python implementation.
