Releases: scribeocr/scribe.js
v0.12.0
This release makes it possible to open and work with multiple documents at the same time. This required updating the interface, so this is a breaking release that likely requires changes to your code. See Migration below. The documentation now includes a new Guide, API Reference, and CLI Reference.
Highlights
- ScribeDoc API. scribe.openDocument(files) returns a document object you operate on directly. Multiple documents can be open at once. (Guide)
- Settings off globals. Per-operation options moved from scribe.opt onto the method or ScribeDoc that uses them. (Configuration)
- Faster bundled OCR. The updated internal OCR model runs significantly faster (0-30% depending on corpus).
Migration
// Before
await scribe.importFiles(files);
await scribe.recognize({ langs: ['eng'] });
await scribe.download('pdf', 'out.pdf');
// After
const doc = await scribe.openDocument(files);
await doc.recognize({ langs: ['eng'] });
await doc.download('pdf', 'out.pdf');
await doc.terminate();

The removed module-level functions (importFiles, recognize, download, exportData, addHighlights, clear, compareOCR, convertOCRPage, evalOCRPage, extractInternalPDFText, …) all have ScribeDoc equivalents. scribe.extractText() is unchanged.
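Because each call to scribe.openDocument(files) returns its own document object, several documents can be held open together. The following is a minimal sketch of that pattern, not an official recipe. It uses only the calls shown in the migration snippet above (openDocument, recognize, download, terminate). The helper name processBatch, its parameters, and the out-N.pdf naming are hypothetical, and scribe is passed in as an argument so the sketch makes no assumption about how the module is imported.

```javascript
// Sketch only: `processBatch`, `fileGroups`, and the output names are
// hypothetical and not part of scribe.js. `scribe` is the imported module.
async function processBatch(scribe, fileGroups, langs = ['eng']) {
  const docs = [];
  const outputs = [];
  try {
    // Open one document per group of input files; all stay open together.
    for (const files of fileGroups) {
      docs.push(await scribe.openDocument(files));
    }
    // Recognize and export each document through its own handle.
    for (let i = 0; i < docs.length; i++) {
      const outName = `out-${i + 1}.pdf`;
      await docs[i].recognize({ langs });
      await docs[i].download('pdf', outName);
      outputs.push(outName);
    }
  } finally {
    // Release every document that was opened, even if a step above threw.
    for (const doc of docs) {
      await doc.terminate();
    }
  }
  return outputs;
}
```

A call might look like processBatch(scribe, [['a.pdf'], ['b.png']]). The documents are processed one after another here; whether recognition on several documents can safely run concurrently is not stated in these notes, so the sketch does not assume it.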
Full Changelog: v0.11.10...v0.12.0
v0.11.3
What's Changed
- Fixed a significant number of bugs in the new PDF codebase released in v0.11.0.
Full Changelog: v0.11.0...v0.11.3
v0.11.0
What's Changed
- Implemented new JavaScript-native PDF parsing + rendering code.
- Switched Node.js canvas library from canvaskit-wasm to new library.
  - Significantly improved performance for image rendering on Node.js.
- Many other minor changes. This release is a major refactor. See changelog for details.
Full Changelog: v0.10.1...v0.11.0
v0.10.1
What's Changed
- Fixed bug with .pdf export where existing invisible text layer was included alongside new invisible text layer.
- Highlight annotations are omitted when rendering pages for recognition and re-added upon export.
  - This should produce a small improvement to recognition accuracy in highlighted documents.
  - Adding new highlight annotations will be supported in a future version.
- Improvements to support for various third party OCR formats.
- Misc minor changes and bug fixes.
v0.10.0
What's Changed
- Added import/export support for ALTO XML
- Improved recognition speed for internal OCR model
- Many small bug fixes and performance improvements.
Full Changelog: v0.9.3...v0.10.0
v0.9.3
What's Changed
- Fixed bug causing text layer in PDF exports to be broken (#58)
  - This issue impacts all PDFs created with two patch releases from the last ~week (0.9.1 and 0.9.2). Anybody using those versions should update ASAP.
Full Changelog: v0.9.2...v0.9.3
v0.9.2
v0.9.1
What's Changed
- Various updates to experimental and debugging-related features.
- None of the documented features should change with this release.
Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed
- Added URW Gothic font
- Added Deno support
- Updated .html export format
  - This format contains a .html file that should closely resemble the original document.
  - This should be useful for converting .pdf files to a format that can be displayed natively in the browser.
- Added experimental .txt import format
  - For obvious reasons, importing .txt files will not work with most operations.
  - This mode is currently exclusively useful for development/debugging purposes and making basic .pdf files from .txt files.
- Performance improvements to PDF exports
- Various refactoring and minor updates.
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's Changed
- Added scribe CLI command
  - If scribe.js is installed globally (npm i -g scribe.js-ocr), the scribe command can be used to process documents from the command line.
    - For example, scribe recognize analyst_report.png runs OCR on an image and saves the result as a PDF.
  - This feature is still experimental and command/argument names and features may change without warning.
- Added new intermediate data format .scribe for storing and loading document data.
  - Given OCR is computationally expensive, it is often desirable to save results for later use without losing data.
  - By saving results to .scribe files, results can be re-loaded later (e.g. to export with slightly different settings).
    - While several other output formats can be re-loaded later (notably .hocr and .pdf), only .scribe can be re-loaded without any data being lost in the export/import process.
    - .scribe files only contain the text layer; they do not contain embedded images or PDF files.
    - .scribe files can be loaded alongside image/PDF files to restore both image and text data.
Full Changelog: v0.7.4...v0.8.0