Table Rust EXtractor (TREX): extract tables from PDFs with zero native dependencies. Built in Rust, usable from Node.js and Python.
Two packages are available; choose the one that fits your use case:
| Package | Install | How it works |
|---|---|---|
| @dreamyoungs/trex | npm i @dreamyoungs/trex | CLI wrapper; auto-downloads the TREX binary |
| @dreamyoungs/trex-node | npm i @dreamyoungs/trex-node | Native NAPI-RS binding; no subprocess |
// Both packages share the same API
const { extract } = require("@dreamyoungs/trex"); // CLI wrapper
// const { extract } = require("@dreamyoungs/trex-node"); // or native binding

// CommonJS has no top-level await, so run inside an async function
(async () => {
  const tables = await extract("invoice.pdf", {
    pages: [1, 2],
    mode: "auto"
  });
  console.log(tables[0].headers); // ["Item", "Qty", "Unit Price", "Amount"]
  console.log(tables[0].rows); // [["A4 Paper", "10", "5,000", "50,000"], ...]
})();

CLI:
trex extract invoice.pdf --format json
trex extract invoice.pdf --format csv > output.csv
trex extract invoice.pdf --pages 3,5,7 --mode lattice

Docker REST API:
docker build -t trex .
docker run --rm -p 8080:8080 trex
curl -X POST http://localhost:8080/extract \
-F "[email protected]" \
-F "mode=auto" \
-F "format=json"

PDF table extraction has long been dominated by the Python ecosystem. Tools such as Camelot, Tabula, and pdfplumber often pull in heavy runtimes or dependencies (OpenCV, Ghostscript, Java) and can hit memory limits in serverless environments.
TREX takes a different approach:
| | Python tools | TREX |
|---|---|---|
| Runtime | Python + OpenCV + Ghostscript | Single Rust binary |
| Memory | 200–500 MB+ | ~30 MB |
| Container size | 500 MB+ | ~15 MB |
| Language support | Python only | Rust, Node.js, Python, Docker |
| Improvement loop | Manual | DL Router + ML training scripts |
- Lightweight & Fast: Single binary, no native dependencies. Starts quickly in serverless containers (Cloud Run, Lambda) without OOM issues.
- Improvable with DL: An optional DL Router can be retrained on extraction failures to improve table detection accuracy. You run the training pipeline manually or via your own scheduler (e.g. GitHub Actions cron).
- Multi-Runtime: Use TREX from Node.js (npm install), Python (pip install), the Docker REST API, or the CLI. The same Rust core powers all of them.
- Production-Ready Telemetry: Built-in event logging (--event-log) captures extraction metrics for production monitoring. Collected events can be fed into the ML training pipeline to retrain the router model.
TREX detects tables using three strategies:
- Lattice: For tables with visible gridlines. Detects line segments and computes cell regions from their intersections. No OpenCV required.
- Stream: For tables without gridlines. Clusters text box coordinates to infer columns and rows.
- DL Router (optional): A lightweight ONNX model analyzes page features and routes each page to the best-suited strategy (Lattice / Stream / Blend). When no model is provided, a built-in heuristic router is used instead.
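
To make the Stream idea concrete, here is a minimal Python sketch of coordinate clustering. It is an illustration under stated assumptions, not TREX's Rust implementation: the `TextBox` type, the gap thresholds, and the convention that `y` grows down the page are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A positioned text fragment (hypothetical type, not a TREX structure)."""
    text: str
    x0: float  # left edge
    y: float   # vertical position; assumed to grow down the page


def cluster_1d(values, gap):
    """Group sorted values, starting a new cluster when the jump exceeds `gap`."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    # Represent each cluster by its mean position.
    return [sum(c) / len(c) for c in clusters]


def nearest(centers, value):
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def boxes_to_grid(boxes, col_gap=20.0, row_gap=5.0):
    """Infer a 2D array of cell strings from text box positions."""
    cols = cluster_1d([b.x0 for b in boxes], col_gap)
    rows = cluster_1d([b.y for b in boxes], row_gap)
    grid = [["" for _ in cols] for _ in rows]
    for b in boxes:
        r, c = nearest(rows, b.y), nearest(cols, b.x0)
        grid[r][c] = (grid[r][c] + " " + b.text).strip()
    return grid


boxes = [
    TextBox("Item", 50, 100), TextBox("Qty", 200, 100),
    TextBox("A4 Paper", 50, 120), TextBox("10", 201, 121),
]
grid = boxes_to_grid(boxes)
print(grid)
```

Slightly misaligned boxes (here `x0` of 200 vs 201, `y` of 120 vs 121) fall into the same column and row because the clustering tolerates small gaps. Real documents usually need adaptive thresholds rather than the fixed ones assumed here.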
graph LR
A[PDF] --> B{DL Router}
B -->|gridlines| C[Lattice]
B -->|no lines| D[Stream]
B -->|mixed| G[Blend]
C --> E[Cell Merge]
D --> E
G --> E
E --> F[JSON / CSV]
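
The routing step in the diagram can be sketched in Python as follows. This is a hedged illustration only: the `line_count` feature, the heuristic rule, and the choice to fall back to the heuristic when the model is unsure are assumptions, not the documented behavior. The 0.55 threshold mirrors the value shown for `--dl-min-confidence`.

```python
def heuristic_route(features):
    """Hypothetical fallback rule: many ruling lines suggest a gridded table."""
    return "lattice" if features["line_count"] >= 4 else "stream"


def route_page(features, model_prediction=None, min_confidence=0.55):
    """Pick a strategy for one page.

    `model_prediction` is an optional (label, confidence) pair from a router
    model. It is used only when its confidence reaches `min_confidence`;
    otherwise (assumption) the heuristic router decides.
    """
    if model_prediction is not None:
        label, confidence = model_prediction
        if confidence >= min_confidence:
            return label
    return heuristic_route(features)


print(route_page({"line_count": 12}))                    # no model available
print(route_page({"line_count": 0}, ("blend", 0.90)))    # confident model
print(route_page({"line_count": 0}, ("lattice", 0.30)))  # unsure model
```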
Collect extraction events in production and retrain the router model in batch:
# 1. Run TREX with event logging enabled
trex extract report.pdf \
--event-log logs/extraction_events.ndjson \
--event-document-key "doc-123" \
--event-training-opt-in
# 2. Retrain the router model
python3 ml/update_router.py \
--events logs/extraction_events.ndjson \
--work-dir ml/artifacts/update

The retraining pipeline is not an always-on server; run it manually or via a scheduler (e.g. GitHub Actions cron).
See ml/README.md for the full pipeline and ml/MODEL_CONTRACT.md for model I/O specs.
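
Before retraining, it can help to inspect the event log. The sketch below reads NDJSON (one JSON object per line) in Python. The field names `document_key` and `training_opt_in` are placeholders chosen to echo the CLI flags; the actual event schema is not shown here, so check ml/README.md before relying on them.

```python
import io
import json


def read_ndjson(stream):
    """Yield one dict per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data; these field names are NOT the confirmed TREX schema.
sample = io.StringIO(
    '{"document_key": "doc-123", "training_opt_in": true}\n'
    '\n'
    '{"document_key": "doc-456", "training_opt_in": false}\n'
)

events = list(read_ndjson(sample))
usable = [e for e in events if e.get("training_opt_in")]
print(len(events), "events,", len(usable), "opted in for training")
```

For a real log, replace the `io.StringIO` sample with `open("logs/extraction_events.ndjson", encoding="utf-8")`.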
trex extract <file.pdf> [OPTIONS]
Options:
--pages <1,3,5 | 1-10> Pages to process
--mode <auto|lattice|stream|dl> Parsing mode (default: auto)
--format <json|csv> Output format (default: json)
--dl-model <path.onnx> DL router model path (requires --features dl)
--dl-min-confidence <0.55> Min confidence for DL routing
--event-log <path.ndjson> Write extraction events for feedback loop
--event-document-key <key> Document identifier for events
--event-tenant-id <id> Tenant identifier
--event-training-opt-in Allow this data for model training

Output language follows the system locale (LC_ALL, LANG). Override it with TREX_LANG=ko or TREX_LANG=en.
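
The options above can also be driven from a script. This Python sketch builds the argument list from the documented flags and runs the CLI with `subprocess`. The helper names are hypothetical, and it assumes that `trex` is on PATH and writes JSON to standard output.

```python
import json
import subprocess


def build_trex_command(pdf_path, pages=None, mode="auto", fmt="json", binary="trex"):
    """Assemble the argument list for `trex extract` from documented flags."""
    cmd = [binary, "extract", pdf_path, "--mode", mode, "--format", fmt]
    if pages:
        cmd += ["--pages", ",".join(str(p) for p in pages)]
    return cmd


def extract_tables(pdf_path, pages=None, mode="auto"):
    """Run the CLI and parse its output (assumes JSON arrives on stdout)."""
    cmd = build_trex_command(pdf_path, pages=pages, mode=mode, fmt="json")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


cmd = build_trex_command("invoice.pdf", pages=[3, 5, 7], mode="lattice")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.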
Node.js, CLI wrapper:
npm install @dreamyoungs/trex

The install step auto-downloads the TREX binary for your platform. If the download fails, set TREX_BIN or pass binPath.
const { extract, extractCsv, extractFromBuffer } = require("@dreamyoungs/trex");

// CommonJS has no top-level await, so run inside an async function
(async () => {
  const tables = await extract("invoice.pdf", {
    pages: [1, 2],
    mode: "auto"
    // binPath: "/usr/local/bin/trex", // optional: override binary path
  });
})();

Node.js, native binding:
npm install @dreamyoungs/trex-node

NAPI-RS native binding: calls Rust directly with no subprocess overhead. It exposes the same extract API as the CLI wrapper; note that the call below is synchronous.
const { extract } = require("@dreamyoungs/trex-node");
const tables = extract("invoice.pdf", { mode: "Auto" }); // synchronous

Python:
import trex
tables = trex.extract("invoice.pdf", pages=[1, 2])
print(tables[0].rows)

Docker:
docker build -t trex .
docker run --rm -p 8080:8080 trex

To set the language per request, send an Accept-Language: ko-KR header.
TREX does one thing: it converts the physical table layout on a page into a 2D array.
Things it intentionally does not do: LLM-based analysis, cross-page table merging, header normalization, or data type inference. These belong in the application layer consuming TREX's output.
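
As an example of what that application layer might look like, here is a small Python sketch that adds data type inference on top of the `headers` and `rows` shape shown earlier. The helper names are hypothetical, and treating commas as thousands separators is a locale assumption that will not hold for every document.

```python
def parse_number(cell):
    """Return an int or float for numeric-looking cells, else the original string."""
    cleaned = cell.replace(",", "").strip()  # assumes "," is a thousands separator
    try:
        return int(cleaned)
    except ValueError:
        pass
    try:
        return float(cleaned)
    except ValueError:
        return cell


def rows_to_records(headers, rows):
    """Turn a 2D array of strings into a list of typed dicts keyed by header."""
    return [dict(zip(headers, (parse_number(c) for c in row))) for row in rows]


headers = ["Item", "Qty", "Unit Price", "Amount"]
rows = [["A4 Paper", "10", "5,000", "50,000"]]
records = rows_to_records(headers, rows)
print(records)
```

Keeping this step outside TREX means each application can apply its own rules for currencies, dates, and locale-specific number formats.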
| Area | Choice | Note |
|---|---|---|
| Language | Rust | Core engine |
| PDF Parser | lopdf / pdf-extract | Low-level PDF access |
| DL Runtime | tract-onnx (optional) | ONNX model inference |
| HTTP Server | Axum | Docker REST API |
| Node.js | CLI wrapper + NAPI-RS | npm/trex, bindings/node |
| Python Bindings | PyO3 + maturin | pip install support |
- Lattice mode (gridline-based extraction)
- Stream mode (coordinate-based inference)
- DL Router with feedback pipeline
- CLI interface
- Docker REST API server
- Node.js npm wrapper + NAPI-RS bindings
- PyO3 Python bindings
- WebAssembly build (in-browser)
- Benchmark suite with real-world comparisons
MIT OR Apache-2.0