Releases: scribeocr/scribe.js

v0.12.0

27 May 07:37

This release makes it possible to open and work with multiple documents at the same time. Supporting this required changing the interface, so this is a breaking release that will likely require changes to your code; see Migration below. The release also comes with a new Guide, API Reference, and CLI Reference.

Highlights

  • ScribeDoc API. scribe.openDocument(files) returns a document object you operate on directly. Multiple documents can be open at once. (Guide)
  • Settings off globals. Per-operation options moved from scribe.opt onto the method or ScribeDoc that uses them. (Configuration)
  • Faster bundled OCR. The updated internal OCR model runs up to 30% faster; the gain ranges from 0% to 30% depending on the corpus.

Migration

// Before
await scribe.importFiles(files);
await scribe.recognize({ langs: ['eng'] });
await scribe.download('pdf', 'out.pdf');

// After
const doc = await scribe.openDocument(files);
await doc.recognize({ langs: ['eng'] });
await doc.download('pdf', 'out.pdf');
await doc.terminate();

The removed module-level functions (importFiles, recognize, download, exportData, addHighlights, clear, compareOCR, convertOCRPage, evalOCRPage, extractInternalPDFText, …) all have ScribeDoc equivalents. scribe.extractText() is unchanged.
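
The After example ends with doc.terminate(), which is easy to skip if an earlier step throws. The following is a minimal sketch of one way to guard against that, not part of scribe.js: withDocument is a hypothetical helper name, and it assumes only the scribe.openDocument(files) and doc.terminate() calls shown in the migration example. The scribe module is passed in as an argument so the helper itself needs no imports.

```javascript
// Hypothetical helper (not part of scribe.js): open a document, run a
// callback on it, and always call terminate(), even if the callback throws.
// Assumes only scribe.openDocument(files) and doc.terminate(), as shown in
// the migration example.
async function withDocument(scribe, files, fn) {
  const doc = await scribe.openDocument(files);
  try {
    // Return whatever the callback produces so callers can pass results out.
    return await fn(doc);
  } finally {
    await doc.terminate();
  }
}

// Example use (file names are placeholders):
// await withDocument(scribe, ['scan.png'], async (doc) => {
//   await doc.recognize({ langs: ['eng'] });
//   await doc.download('pdf', 'out.pdf');
// });
```

Since multiple documents can be open at once, the same helper could be called once per input, with each call owning and releasing its own document.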

Full Changelog: v0.11.10...v0.12.0

v0.11.3

06 May 03:43

What's Changed

  • Fixed a significant number of bugs in the new PDF codebase released in v0.11.0.

Full Changelog: v0.11.0...v0.11.3

v0.11.0

04 May 06:37

What's Changed

  • Implemented new JavaScript-native PDF parsing + rendering code.
  • Switched Node.js canvas library from canvaskit-wasm to new library.
    • Significantly improved performance for image rendering on Node.js.
  • Many other minor changes. This release is a major refactor. See changelog for details.

Full Changelog: v0.10.1...v0.11.0

v0.10.1

14 Mar 22:30

What's Changed

  • Fixed bug with .pdf export where existing invisible text layer was included alongside new invisible text layer.
  • Highlight annotations are omitted when rendering pages for recognition and re-added upon export.
    • This should produce a small improvement to recognition accuracy in highlighted documents.
    • Adding new highlight annotations will be supported in a future version.
  • Improved support for various third-party OCR formats.
  • Misc minor changes and bug fixes.

v0.10.0

08 Feb 02:31

What's Changed

  • Added import/export support for ALTO XML.
  • Improved recognition speed for internal OCR model.
  • Many small bug fixes and performance improvements.

Full Changelog: v0.9.3...v0.10.0

v0.9.3

15 Nov 07:28

What's Changed

  • Fixed bug causing text layer in PDF exports to be broken (#58)
    • This issue affects all PDFs created with the two patch releases from the past week or so (0.9.1 and 0.9.2). Anyone using those versions should update as soon as possible.

Full Changelog: v0.9.2...v0.9.3

v0.9.2

14 Nov 04:42

What's Changed

  • Fixed bug causing crash on single-core systems (#56)
  • Updated scribe.opt.workerN option to cap workers created for PDF rendering

Full Changelog: v0.9.1...v0.9.2

v0.9.1

07 Nov 07:28

What's Changed

  • Various updates to experimental and debugging-related features.
    • None of the documented features should change with this release.

Full Changelog: v0.9.0...v0.9.1

v0.9.0

08 Sep 08:15

What's Changed

  • Added URW Gothic font
  • Added Deno support
  • Updated .html export format
    • This format contains a .html file that should closely resemble the original document.
    • This should be useful for converting .pdf files to a format that can be displayed natively in the browser.
  • Added experimental .txt import format
    • For obvious reasons, importing .txt files will not work with most operations.
    • This mode is currently exclusively useful for development/debugging purposes and making basic .pdf files from .txt files.
  • Performance improvements to PDF exports
  • Various refactoring and minor updates.

Full Changelog: v0.8.0...v0.9.0

v0.8.0

09 Mar 09:39

What's Changed

  • Added scribe CLI command
    • If scribe.js is installed globally (npm i -g scribe.js-ocr), the scribe command can be used to process documents from the command line.
      • For example, scribe recognize analyst_report.png runs OCR on an image and saves the result as a PDF.
    • This feature is still experimental and command/argument names and features may change without warning.
  • Added new intermediate data format .scribe for storing and loading document data.
    • Because OCR is computationally expensive, it is often desirable to save results for later use without losing data.
    • By saving results to .scribe files, results can be re-loaded later (e.g. to export with slightly different settings).
      • While several other output formats can be re-loaded later (notably .hocr and .pdf), only .scribe can be re-loaded without any data being lost in the export/import process.
      • .scribe files only contain the text layer; they do not contain embedded images or PDF files.
        • .scribe files can be loaded alongside image/PDF files to restore both image and text data.

Full Changelog: v0.7.4...v0.8.0