basic-rtf

A minimal Python RTF stream extractor. Parses an RTF file into a flat list of streams — plain-text runs paired with font and position metadata. No dependencies beyond the Python standard library.

Install

pip install -e .

Or, once published:

pip install basic-rtf

Requires Python 3.9+.

Quick start

from basic_rtf import BasicRTF

parser = BasicRTF()
parser.parse_file("document.rtf")

for stream in parser.get_streams():
    # skip structural blocks you don't need
    if stream.get("type") == "pict":
        continue
    font = stream["font"]
    print(f"[{font['name']} {font['size']}pt] {stream['text']!r}")

Command-line

Print plain text:

python -m basic_rtf document.rtf

Print streams as JSON:

python -m basic_rtf document.rtf --json

Stream schema

get_streams() returns a list of dicts. Every stream has:

Key	Type	Description
`text`	`str`	Decoded text content of the run
`font`	`dict`	Font info: `id` (int), `name` (str), `size` (int, points)
`char_start`	`int`	Byte offset in original RTF where this run begins
`char_end`	`int`	Byte offset where this run ends
`type`	`str` (optional)	Present for structural streams only (see below)
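As a rough illustration of the schema, the sketch below groups run text by font name. It works on hand-written sample dicts shaped like the table above, not on real parser output; the sample values and the helper name `text_by_font` are assumptions made for this example.

```python
from collections import defaultdict

# Hand-written sample data shaped like the schema above (not real parser output).
sample_streams = [
    {"text": "Hello ", "font": {"id": 0, "name": "Times New Roman", "size": 12},
     "char_start": 120, "char_end": 126},
    {"text": "world", "font": {"id": 3, "name": "Arial", "size": 12},
     "char_start": 135, "char_end": 140},
    {"text": "!", "font": {"id": 0, "name": "Times New Roman", "size": 12},
     "char_start": 149, "char_end": 150},
]


def text_by_font(streams):
    """Map each font name to the concatenated text of the runs that use it."""
    grouped = defaultdict(str)
    for stream in streams:
        grouped[stream["font"]["name"]] += stream["text"]
    return dict(grouped)


print(text_by_font(sample_streams))
```

With real data, `sample_streams` would be replaced by the list returned from `get_streams()`.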

Typed streams

When a stream has a type key, it represents an RTF structural block:

`type` value	RTF keyword	Meaning
`"header"`	`\header`	Page header text
`"footer"`	`\footer`	Page footer text
`"footnote"`	`\footnote`	Footnote text
`"pict"`	`\pict`	Image data (raw binary — usually skip)

Normal text runs have no type key.
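One way to use the presence or absence of the type key is to separate body text from structural blocks. This is a minimal sketch on hand-written sample dicts; the helper name `split_streams` and the decision to drop `"pict"` streams are assumptions for illustration.

```python
# Hand-written sample data; real streams come from get_streams().
sample_streams = [
    {"text": "Title\n", "font": {"id": 0, "name": "Arial", "size": 14},
     "char_start": 10, "char_end": 20},
    {"text": "Page 1", "font": {"id": 0, "name": "Arial", "size": 10},
     "char_start": 30, "char_end": 36, "type": "footer"},
    {"text": "0100090000", "font": {"id": 0, "name": "Arial", "size": 10},
     "char_start": 40, "char_end": 50, "type": "pict"},
    {"text": "Body text.", "font": {"id": 0, "name": "Arial", "size": 12},
     "char_start": 60, "char_end": 70},
]


def split_streams(streams):
    """Return (body_text, typed), where typed maps each type to its texts.

    Streams of type "pict" are dropped because they carry raw image data.
    """
    body_parts = []
    typed = {}
    for stream in streams:
        kind = stream.get("type")
        if kind is None:
            body_parts.append(stream["text"])  # normal text run
        elif kind != "pict":
            typed.setdefault(kind, []).append(stream["text"])
    return "".join(body_parts), typed


body, typed = split_streams(sample_streams)
print(repr(body))
print(typed)
```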

Font dict

stream["font"] == {
    "id":   3,           # \fN index from \fonttbl
    "name": "Arial",     # font name string
    "size": 12,          # in points (RTF half-points ÷ 2)
}
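The relationship between the RTF source and the font dict can be sketched as follows. `make_font` is a hypothetical helper, not part of the library; the integer division and the empty-string fallback for an unknown font id are assumptions, chosen because the schema lists `size` as an int.

```python
# Font table entries shaped like the output of get_fonts().
font_table = [
    {"id": 0, "name": "Times New Roman"},
    {"id": 3, "name": "Arial"},
]


def make_font(table, font_id, fs_value):
    """Build a font dict from a \\fN index and a \\fsN value in half-points."""
    names = {entry["id"]: entry["name"] for entry in table}
    return {
        "id": font_id,
        # Assumption: an id missing from the table yields an empty name.
        "name": names.get(font_id, ""),
        # RTF stores sizes in half-points; integer division is an assumption
        # and would drop the half point of an odd value such as \fs25.
        "size": fs_value // 2,
    }


# \f3\fs24 selects Arial at 12 points.
print(make_font(font_table, 3, 24))
```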

API reference

`BasicRTF()`

Create a parser instance.

`parse_file(file_path, show_progress=False)`

Parse file_path (str or pathlib.Path). The file is read as UTF-8 with errors="ignore".

If show_progress=True, a progress bar is written to stderr.

The instance resets on each call — safe to reuse.

`get_streams() → list[dict]`

Return the parsed streams (see schema above).

`get_fonts() → list[dict]`

Return the raw font table entries from \fonttbl: [{"id": 0, "name": "Times New Roman"}, ...]

`get_colors() → list[dict]`

Return colour table entries from \colortbl: [{"red": 255, "green": 0, "blue": 0}, ...]
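If you need the colours in another notation, the entries convert easily. The sketch below turns one entry into a CSS-style hex string; `color_to_hex` is a hypothetical helper name and the sample list is hand-written in the shape shown above.

```python
def color_to_hex(color):
    """Format a {"red", "green", "blue"} dict as a "#rrggbb" string."""
    return "#{red:02x}{green:02x}{blue:02x}".format(**color)


# Hand-written entries shaped like the output of get_colors().
sample_colors = [
    {"red": 255, "green": 0, "blue": 0},
    {"red": 0, "green": 128, "blue": 255},
]

print([color_to_hex(c) for c in sample_colors])
```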

`print_debug()`

Print every stream with its raw RTF source — useful during development.

Supported RTF features

Feature	Notes
`\fonttbl`	Font id/name mapping
`\colortbl`	Colour entries
`\uN?`	Unicode escapes (16-bit signed)
`\'hh`	Hex bytes decoded as cp1252
`\~`	Non-breaking space → U+00A0
`\f`, `\fs`	Font / font-size changes flush the current run
`\par`	Paragraph break appended as `\n`
`{\*\...}`	Destination groups (bookmarks, revisions…) skipped
Typed blocks	`\header`, `\footer`, `\footnote`, `\pict`
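The two character-escape rows in the table can be illustrated with a short sketch. This is not the library's internal code; `decode_u` and `decode_hex_byte` are hypothetical names, and the sketch only shows the decoding rules the table describes.

```python
def decode_u(n):
    """Decode the argument of a \\uN control word (a signed 16-bit integer)."""
    # RTF writes code points above 32767 as negative numbers,
    # so a negative value is shifted back into the 0..65535 range.
    if n < 0:
        n += 65536
    return chr(n)


def decode_hex_byte(hh):
    """Decode the two hex digits of a \\'hh escape as a cp1252 byte."""
    return bytes([int(hh, 16)]).decode("cp1252")


print(repr(decode_u(3904)))         # Tibetan letter KA, U+0F40
print(repr(decode_u(-3913)))        # private-use character U+F0B7
print(repr(decode_hex_byte("e9")))  # e with acute accent, U+00E9
```

Note that a few byte values (for example 0x81) are undefined in cp1252 and raise `UnicodeDecodeError` in this sketch; how the library treats them is not covered here.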

Known limitations

\'hh is cp1252, not font-specific. If your RTF uses a legacy byte-encoded font (e.g. old Tibetan Dedris fonts, where each byte encodes a glyph ID rather than a Unicode codepoint), the text decoded from \'hh bytes will be wrong. You must post-process stream["text"] using your own font/encoding lookup. See the BDRC integration example below for how this is handled in practice.
No section/table stream types. Control words \sect, \cell, \row, and \line are consumed as unknown keywords; they do not produce typed streams. If you need "sect_break", "par_break", "cell_break", or "row_break" stream types, see the extended parser described under Advanced fork.
No \* destination group content. Comments, CSS, revision marks, and other {\*\keyword ...} groups are silently skipped.
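The post-processing mentioned in the first limitation could look roughly like this. Everything here is an assumption for illustration: `GLYPH_TABLE`, the font name `ExampleLegacyFont`, and `remap_legacy_text` are hypothetical, and the sketch assumes the decoded text round-trips through cp1252 so the original byte values can be recovered.

```python
# Hypothetical glyph table for illustration only. Real tables (e.g. for
# Dedris fonts) are project-specific and are not part of this library.
GLYPH_TABLE = {
    "ExampleLegacyFont": {0x41: "\u0f40", 0x42: "\u0f41"},
}


def remap_legacy_text(text, font_name, tables=GLYPH_TABLE):
    """Re-interpret cp1252-decoded text as glyph IDs of a legacy font."""
    table = tables.get(font_name)
    if table is None:
        return text  # not a known legacy font: keep the decoded text
    # Undo the cp1252 decoding to get the original byte values back.
    # Characters that cannot be encoded become b"?" (errors="replace").
    raw = text.encode("cp1252", errors="replace")
    # Bytes without a table entry map to U+FFFD so gaps stay visible.
    return "".join(table.get(byte, "\ufffd") for byte in raw)


stream = {"text": "AB", "font": {"id": 5, "name": "ExampleLegacyFont", "size": 12}}
print(repr(remap_legacy_text(stream["text"], stream["font"]["name"])))
```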

BDRC integration

This library is the foundation for several BDRC (Buddhist Digital Resource Center) Tibetan e-text conversion projects. A typical consumer looks like tibetan-etext-tools/IE3CN3396/2_convert_rtf_to_xml.py:

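from basic_rtf import BasicRTF

# rtf_path and dedris_to_unicode are supplied by the consuming project.
# Note: match/case requires Python 3.10+, although basic-rtf itself runs on 3.9+.
parser = BasicRTF()
parser.parse_file(rtf_path)
streams = parser.get_streams()

for stream in streams:
    match stream.get("type"):
        case "footer" | "header" | "sect_break":
            # project-specific: insert TEI page break
            # ("sect_break" is emitted only by the advanced fork, not by basic-rtf)
            ...
        case "pict":
            continue
        case _:
            # project-specific: map font bytes → Unicode via dedris_converter
            unicode_text = dedris_to_unicode(stream["text"], stream["font"]["name"])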

The Dedris→Unicode mapping, TEI XML generation, and normalization steps live in tibetan-etext-tools and are not part of this library.

Advanced fork

BDRC projects that handle Word documents with \panose font descriptors or need per-stream "sect_break", "par_break", "cell_break", and "row_break" types use an extended variant of this parser. The key extras in that fork (at tibetan-etext-tools/IE3CN3396/basic_rtf.py) are:

detect_rtf_format() — heuristic that distinguishes RTF files where Dedris fonts have {\*\panose} descriptors ("complex") from those that don't ("simple"), and selects the appropriate font-table parser accordingly.
\loch / \hich / \dbch — low/high/double-byte charset switches that track which effective font applies to a run.
\sect, \cell, \row, \line — section and table control words emitted as typed streams ("sect_break", "cell_break", "row_break", "line_break").
Dedris-specific handling of } inside a font run (glyph 125 in the Dedris encoding is the same byte as ASCII }).

This basic-rtf library intentionally omits those features to stay small and dependency-free. A possible future basic-rtf-advanced module may package the extended parser — file an issue if you need it.

Non-goals (v1)

TEI XML output
Dedris / pytiblegenc font encoding
pdf-cmap-fix integration
Word COM automation
Streaming / incremental parse

License

MIT
