Byte-accurate source positions in TreeSink #734

@OGKevin

Relates to #48 and #492.

Cadmus is reader software for e-readers.

The original HTML parsing implementation was handwritten; it is not spec-complete and contains some minor bugs.
While working on OGKevin/cadmus#343, I decided to give html5ever a try to replace at least the parsing bit.

It turns out that html5ever does not provide a way to determine the exact position of a node within the source document.
Cadmus needs such positions to:

  • Save and restore reading positions across sessions
  • Persist bookmarks and annotations
  • Resolve #anchor-id URI fragment links

For all of these to work correctly across re-parses, the offset stored on each node
must be the byte position of that node's opening token in the source string. It
needs to be stable and comparable to the raw byte sizes of the EPUB spine entries.
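
To make the requirement concrete, here is a minimal sketch of what "byte position of the opening token" means. It does not use html5ever at all; `TagOffset`, `opening_tag_offsets`, and `node_at_or_before` are hypothetical names made up for this illustration, and the scanner is deliberately naive (it ignores comments, CDATA, script contents, and `<` inside attribute values).

```rust
/// Byte offset and lowercased name of one opening tag in the source.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
pub struct TagOffset {
    pub offset: usize,
    pub name: String,
}

/// Naive scan that records the byte offset of every `<` that begins an
/// opening tag. This is NOT a spec-compliant tokenizer.
pub fn opening_tag_offsets(src: &str) -> Vec<TagOffset> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        // An opening tag starts with `<` followed by an ASCII letter.
        if bytes[i] == b'<' && i + 1 < bytes.len() && bytes[i + 1].is_ascii_alphabetic() {
            let start = i + 1;
            let mut end = start;
            while end < bytes.len() && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'-') {
                end += 1;
            }
            out.push(TagOffset {
                offset: i, // byte offset, not a character index
                name: src[start..end].to_ascii_lowercase(),
            });
            i = end;
        } else {
            i += 1;
        }
    }
    out
}

/// Restores a saved position: returns the last tag whose byte offset is at
/// or before `saved`. Relies on `tags` being sorted by offset, which holds
/// for the output of `opening_tag_offsets`.
pub fn node_at_or_before(tags: &[TagOffset], saved: usize) -> Option<&TagOffset> {
    let idx = tags.partition_point(|t| t.offset <= saved);
    idx.checked_sub(1).map(|i| &tags[i])
}
```

For a source such as `<p>é</p><b id="x">hi</b>`, the `<b` tag is the ninth character but starts at byte 9 rather than 8, because `é` occupies two bytes in UTF-8. That difference is why the offset has to be defined in bytes if it is to be compared against the byte sizes of spine entries.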

For rendering dictionary HTML, html5ever covers the use case, as there is no need for position tracking,
but without source positions it can't be used for EPUB rendering.

Would there be interest in adding the ability to record byte offsets while parsing a document? Byte offsets
would be the most stable way to refer to a position within a document, and they do not depend on
which parser is being used.
