OpenPecha/basic-rtf

basic-rtf

A minimal Python RTF stream extractor. Parses an RTF file into a flat list of streams — plain-text runs paired with font and position metadata. No dependencies beyond the Python standard library.


Install

pip install -e .

Or, once published:

pip install basic-rtf

Requires Python 3.9+.


Quick start

from basic_rtf import BasicRTF

parser = BasicRTF()
parser.parse_file("document.rtf")

for stream in parser.get_streams():
    # skip structural blocks you don't need
    if stream.get("type") == "pict":
        continue
    font = stream["font"]
    print(f"[{font['name']} {font['size']}pt] {stream['text']!r}")

Command-line

Print plain text:

python -m basic_rtf document.rtf

Print streams as JSON:

python -m basic_rtf document.rtf --json

Stream schema

get_streams() returns a list of dicts. Every stream has:

| Key | Type | Description |
|---|---|---|
| text | str | Decoded text content of the run |
| font | dict | Font info: id (int), name (str), size (int, points) |
| char_start | int | Byte offset in original RTF where this run begins |
| char_end | int | Byte offset where this run ends |
| type | str (optional) | Present for structural streams only (see below) |

Typed streams

When a stream has a type key it represents an RTF structural block:

| type value | RTF keyword | Meaning |
|---|---|---|
| "header" | \header | Page header text |
| "footer" | \footer | Page footer text |
| "footnote" | \footnote | Footnote text |
| "pict" | \pict | Image data (raw binary, usually skipped) |

Normal text runs have no type key.
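As a rough sketch of how the type key might be used, the snippet below separates body text from structural blocks. The sample list is hand-written to match the schema above and is not real parser output; split_streams is a hypothetical helper, not part of this library.

```python
from collections import defaultdict

# Hand-written sample shaped like the schema above (not real parser output).
sample_streams = [
    {"text": "Chapter 1", "font": {"id": 0, "name": "Arial", "size": 14},
     "char_start": 120, "char_end": 129},
    {"text": "Page 1", "font": {"id": 0, "name": "Arial", "size": 9},
     "char_start": 60, "char_end": 66, "type": "footer"},
    {"text": "", "font": {"id": 0, "name": "Arial", "size": 9},
     "char_start": 200, "char_end": 900, "type": "pict"},
    {"text": "Body text.\n", "font": {"id": 1, "name": "Times New Roman", "size": 12},
     "char_start": 901, "char_end": 912},
]


def split_streams(streams):
    """Return (body_text, typed), where typed maps each type to its texts."""
    body_parts = []
    typed = defaultdict(list)
    for stream in streams:
        kind = stream.get("type")  # None for normal text runs
        if kind is None:
            body_parts.append(stream["text"])
        elif kind != "pict":       # image data is rarely useful as text
            typed[kind].append(stream["text"])
    return "".join(body_parts), dict(typed)


body, typed = split_streams(sample_streams)
print(repr(body))
print(typed)
```

Using stream.get("type") rather than stream["type"] matters here, because normal text runs carry no type key at all.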

Font dict

stream["font"] == {
    "id":   3,           # \fN index from \fonttbl
    "name": "Arial",     # font name string
    "size": 12,          # in points (RTF half-points ÷ 2)
}
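To make the size arithmetic concrete: RTF's \fs argument is in half-points, so \fs24 corresponds to 12 pt. The sketch below shows that conversion and one possible use of the font dict, counting characters per font. Both function names are hypothetical, the sample streams are hand-written and trimmed to the keys the code reads, and how the library itself treats odd half-point values such as \fs25 is not stated here, so the helper simply keeps the fraction.

```python
from collections import Counter


def half_points_to_points(fs_value):
    """Convert the numeric argument of an RTF fs control word to points."""
    return fs_value / 2


def chars_per_font(streams):
    """Count characters of text per (font name, size) pair."""
    counts = Counter()
    for stream in streams:
        font = stream["font"]
        counts[(font["name"], font["size"])] += len(stream["text"])
    return counts


# Hand-written sample, trimmed to the keys used above.
sample = [
    {"text": "Title", "font": {"id": 0, "name": "Arial", "size": 14}},
    {"text": "Body ", "font": {"id": 1, "name": "Times New Roman", "size": 12}},
    {"text": "text", "font": {"id": 1, "name": "Times New Roman", "size": 12}},
]

usage = chars_per_font(sample)
print(half_points_to_points(24))
for (name, size), n in usage.most_common():
    print(f"{name} {size}pt: {n} chars")
```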

API reference

BasicRTF()

Create a parser instance.

parse_file(file_path, show_progress=False)

Parse file_path (str or pathlib.Path). The file is read as UTF-8 with errors="ignore".

If show_progress=True a progress bar is written to stderr.

The instance resets on each call — safe to reuse.

get_streams() → list[dict]

Return the parsed streams (see schema above).

get_fonts() → list[dict]

Return the raw font table entries from \fonttbl: [{"id": 0, "name": "Times New Roman"}, ...]

get_colors() → list[dict]

Return colour table entries from \colortbl: [{"red": 255, "green": 0, "blue": 0}, ...]
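If the colour entries are headed for HTML or CSS, a small conversion like the following may be handy. It assumes every entry has integer red, green, and blue keys in the 0-255 range, as in the example above; color_to_hex is a hypothetical helper, and entries that lack those keys (for instance an "auto" colour, if the parser reports one) are not covered.

```python
def color_to_hex(entry):
    """Format a colour table entry as a #rrggbb string."""
    return "#{red:02x}{green:02x}{blue:02x}".format(**entry)


# Hand-written entries in the shape shown above.
colors = [
    {"red": 255, "green": 0, "blue": 0},
    {"red": 0, "green": 128, "blue": 255},
]
hex_colors = [color_to_hex(c) for c in colors]
print(hex_colors)
```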

print_debug()

Print every stream with its raw RTF source — useful during development.


Supported RTF features

| Feature | Notes |
|---|---|
| \fonttbl | Font id/name mapping |
| \colortbl | Colour entries |
| \uN? | Unicode escapes (16-bit signed) |
| \'hh | Hex bytes decoded as cp1252 |
| \~ | Non-breaking space → U+00A0 |
| \f, \fs | Font / font-size changes flush the current run |
| \par | Paragraph break appended as \n |
| {\*\...} | Destination groups (bookmarks, revisions…) skipped |
| Typed blocks | \header, \footer, \footnote, \pict |
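The two escape rules in the table can be illustrated in a few lines. This is a standalone sketch of the decoding arithmetic, not the library's actual code: \uN carries a signed 16-bit value, so negative arguments are shifted up by 65536, and \'hh is a single byte interpreted as cp1252. Both function names are hypothetical, and surrogate pairs for characters above U+FFFF are not handled.

```python
def decode_unicode_escape(n):
    """Map the signed 16-bit argument of a \\uN escape to a character."""
    if n < 0:
        n += 0x10000  # e.g. -3913 becomes 0xF0B7
    return chr(n)


def decode_hex_escape(hh):
    """Decode the two hex digits of a \\'hh escape as one cp1252 byte."""
    return bytes.fromhex(hh).decode("cp1252")


print(repr(decode_unicode_escape(3904)))   # TIBETAN LETTER KA, U+0F40
print(repr(decode_unicode_escape(-3913)))  # private-use U+F0B7
print(repr(decode_hex_escape("e9")))       # cp1252 byte 0xE9
```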

Known limitations

  • \'hh is cp1252, not font-specific. If you are working with RTF that uses a legacy byte-encoded font (e.g. old Tibetan Dedris fonts where each byte encodes a glyph ID, not a Unicode codepoint), the hex-byte streams will be wrong. You must post-process stream["text"] using your own font/encoding lookup. See the BDRC integration example below for how this is handled in practice.

  • No section/table stream types. Control words \sect, \cell, \row, and \line are consumed as unknown keywords — they do not produce typed streams. If you need "sect_break", "par_break", "cell_break", or "row_break" stream types, see the advanced parser variant.

  • No \* destination group content. Comments, CSS, revision marks, and other {\*\keyword ...} groups are silently skipped.
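For the first limitation, post-processing might look roughly like the sketch below: re-encode the cp1252-decoded text back to bytes, then map each byte through a per-font table. Everything here is illustrative. The table is a toy with an invented font name and is not a real Dedris mapping, remap_legacy_text is a hypothetical helper, and the approach assumes the original bytes survive the cp1252 round trip, which may not hold for the few byte values cp1252 leaves undefined.

```python
# Toy table for illustration only: NOT a real Dedris mapping.
TOY_GLYPH_TABLES = {
    "Hypothetical-Legacy-Font": {0x41: "\u0f40", 0x42: "\u0f41"},
}


def remap_legacy_text(text, font_name, tables=TOY_GLYPH_TABLES):
    """Reinterpret cp1252-decoded text as glyph bytes for a legacy font."""
    table = tables.get(font_name)
    if table is None:
        return text  # not a byte-encoded font: keep the text unchanged
    raw = text.encode("cp1252", errors="replace")
    # Bytes missing from the table become U+FFFD so gaps stay visible.
    return "".join(table.get(byte, "\ufffd") for byte in raw)


print(repr(remap_legacy_text("AB", "Hypothetical-Legacy-Font")))
print(repr(remap_legacy_text("AB", "Arial")))
```

Emitting U+FFFD for unmapped bytes, rather than dropping them, keeps holes in the lookup table easy to spot in the output.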


BDRC integration

This library is the foundation for several BDRC (Buddhist Digital Resource Center) Tibetan e-text conversion projects. A typical consumer looks like tibetan-etext-tools/IE3CN3396/2_convert_rtf_to_xml.py:

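from basic_rtf import BasicRTF

parser = BasicRTF()
parser.parse_file(rtf_path)
streams = parser.get_streams()

# Note: match/case needs Python 3.10 or later.
# "sect_break" is emitted only by the advanced fork described below;
# basic-rtf itself never produces it.
for stream in streams:
    match stream.get("type"):
        case "footer" | "header" | "sect_break":
            # project-specific: insert TEI page break
            ...
        case "pict":
            continue
        case _:
            # project-specific: map font bytes → Unicode via dedris_converter
            # (dedris_to_unicode comes from tibetan-etext-tools, not this library)
            unicode_text = dedris_to_unicode(stream["text"], stream["font"]["name"])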

The Dedris→Unicode mapping, TEI XML generation, and normalization steps live in tibetan-etext-tools and are not part of this library.
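The consumer example above relies on match, which needs Python 3.10 or later, whereas this library states Python 3.9+. On 3.9 the same dispatch can be written with if/elif, as in the sketch below. The convert_streams function, the "page_break" marker, and the to_unicode callback are all placeholders for project-specific code, and the sample streams are hand-written.

```python
def convert_streams(streams, to_unicode):
    """Dispatch on stream type using if/elif (works on Python 3.9)."""
    out = []
    for stream in streams:
        kind = stream.get("type")
        if kind in ("footer", "header", "sect_break"):
            # placeholder for a project-specific page break
            out.append(("page_break", None))
        elif kind == "pict":
            continue
        else:
            text = to_unicode(stream["text"], stream["font"]["name"])
            out.append(("text", text))
    return out


# Hand-written sample; the lambda stands in for a real font converter.
sample = [
    {"text": "p. 1", "font": {"id": 0, "name": "Arial", "size": 9}, "type": "footer"},
    {"text": "", "font": {"id": 0, "name": "Arial", "size": 9}, "type": "pict"},
    {"text": "abc", "font": {"id": 1, "name": "Arial", "size": 12}},
]
result = convert_streams(sample, lambda text, font_name: text.upper())
print(result)
```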


Advanced fork

BDRC projects that handle Word documents with \panose font descriptors or need per-stream "sect_break", "par_break", "cell_break", and "row_break" types use an extended variant of this parser. The key extras in that fork (at tibetan-etext-tools/IE3CN3396/basic_rtf.py) are:

  • detect_rtf_format() — heuristic that distinguishes RTF files where Dedris fonts have {\*\panose} descriptors ("complex") from those that don't ("simple"), and selects the appropriate font-table parser accordingly.
  • \loch / \hich / \dbch — low/high/double-byte charset switches that track which effective font applies to a run.
  • \sect, \cell, \row, \line — section and table control words emitted as typed streams ("sect_break", "cell_break", "row_break", "line_break").
  • Dedris-specific handling of } inside a font run (glyph 125 in the Dedris encoding is the same byte as ASCII }).

This basic-rtf library intentionally omits those features to stay small and dependency-free. A possible future basic-rtf-advanced module may package the extended parser — file an issue if you need it.


Non-goals (v1)

  • TEI XML output
  • Dedris / pytiblegenc font encoding
  • pdf-cmap-fix integration
  • Word COM automation
  • Streaming / incremental parse

License

MIT
