A minimal Python RTF stream extractor. Parses an RTF file into a flat list of streams — plain-text runs paired with font and position metadata. No dependencies beyond the Python standard library.
pip install -e .

Or, once published:

pip install basic-rtf

Requires Python 3.9+.

from basic_rtf import BasicRTF

parser = BasicRTF()
parser.parse_file("document.rtf")

for stream in parser.get_streams():
    # skip structural blocks you don't need
    if stream.get("type") == "pict":
        continue
    font = stream["font"]
    print(f"[{font['name']} {font['size']}pt] {stream['text']!r}")

Print plain text:

python -m basic_rtf document.rtf

Print streams as JSON:

python -m basic_rtf document.rtf --json

get_streams() returns a list of dicts. Every stream has:

| Key | Type | Description |
|---|---|---|
| `text` | `str` | Decoded text content of the run |
| `font` | `dict` | Font info: `id` (int), `name` (str), `size` (int, points) |
| `char_start` | `int` | Byte offset in original RTF where this run begins |
| `char_end` | `int` | Byte offset where this run ends |
| `type` | `str` (optional) | Present for structural streams only (see below) |

When a stream has a type key it represents an RTF structural block:
| `type` value | RTF keyword | Meaning |
|---|---|---|
| `"header"` | `\header` | Page header text |
| `"footer"` | `\footer` | Page footer text |
| `"footnote"` | `\footnote` | Footnote text |
| `"pict"` | `\pict` | Image data (raw binary; usually skip) |

Normal text runs have no type key.
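
As a rough illustration of the schema above, the sketch below separates normal text runs from typed structural streams. It works on a hand-written list shaped like the documented get_streams() output; the sample data and the helper name `split_streams` are assumptions made for this example and are not part of the library.

```python
from collections import defaultdict


def split_streams(streams):
    """Return (body_text, typed), where typed maps each type to its text runs.

    Hypothetical helper: relies only on the documented "text" and "type" keys.
    """
    body_parts = []
    typed = defaultdict(list)
    for stream in streams:
        kind = stream.get("type")
        if kind is None:
            # Normal text runs have no "type" key.
            body_parts.append(stream["text"])
        elif kind != "pict":
            # Keep header/footer/footnote text; drop raw image data.
            typed[kind].append(stream["text"])
    return "".join(body_parts), dict(typed)


# Sample data invented for illustration; offsets and fonts are arbitrary.
arial = {"id": 3, "name": "Arial", "size": 12}
sample_streams = [
    {"text": "Page 1", "font": arial, "char_start": 120, "char_end": 126, "type": "header"},
    {"text": "Hello, ", "font": arial, "char_start": 140, "char_end": 147},
    {"text": "world\n", "font": arial, "char_start": 152, "char_end": 161},
    {"text": "", "font": arial, "char_start": 170, "char_end": 900, "type": "pict"},
]

body, typed = split_streams(sample_streams)
print(repr(body))
print(typed)
```

Using `stream.get("type")` instead of `stream["type"]` avoids a KeyError on normal runs, which omit the key entirely.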
stream["font"] == {
    "id": 3,          # \fN index from \fonttbl
    "name": "Arial",  # font name string
    "size": 12,       # in points (RTF half-points ÷ 2)
}

Create a parser instance.
Parse file_path (str or pathlib.Path). The file is read as UTF-8 with
errors="ignore".
If show_progress=True a progress bar is written to stderr.
The instance resets on each call — safe to reuse.
Return the parsed streams (see schema above).
Return the raw font table entries from \fonttbl:
[{"id": 0, "name": "Times New Roman"}, ...]
Return colour table entries from \colortbl:
[{"red": 255, "green": 0, "blue": 0}, ...]
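
The two table shapes above lend themselves to simple lookups. The following is a minimal sketch over hand-written sample lists in the documented shapes; the names `fonts_by_id` and `to_hex` are hypothetical and not provided by the library.

```python
# Sample entries in the documented shapes; values are invented for illustration.
font_table = [
    {"id": 0, "name": "Times New Roman"},
    {"id": 3, "name": "Arial"},
]
colour_table = [
    {"red": 255, "green": 0, "blue": 0},
    {"red": 0, "green": 128, "blue": 255},
]

# Map each font id to its font name.
fonts_by_id = {entry["id"]: entry["name"] for entry in font_table}


def to_hex(colour):
    """Format one colour entry as a CSS-style hex string (hypothetical helper)."""
    return "#{red:02x}{green:02x}{blue:02x}".format(**colour)


hex_colours = [to_hex(colour) for colour in colour_table]
print(fonts_by_id)
print(hex_colours)
```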
Print every stream with its raw RTF source — useful during development.
| Feature | Notes |
|---|---|
| `\fonttbl` | Font id/name mapping |
| `\colortbl` | Colour entries |
| `\uN?` | Unicode escapes (16-bit signed) |
| `\'hh` | Hex bytes decoded as cp1252 |
| `\~` | Non-breaking space → U+00A0 |
| `\f`, `\fs` | Font / font-size changes flush the current run |
| `\par` | Paragraph break appended as `\n` |
| `{\*\...}` | Destination groups (bookmarks, revisions…) skipped |
| Typed blocks | `\header`, `\footer`, `\footnote`, `\pict` |
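
To make the two escape rows above concrete, here is a minimal sketch of how a signed 16-bit `\uN` value and a `\'hh` hex byte can be turned into text. It illustrates the decoding rules only and is not the library's internal code; the function names are assumptions, and surrogate pairs are deliberately left unhandled.

```python
def decode_unicode_escape(n: int) -> str:
    # RTF writes the N in \uN as a signed 16-bit integer, so code units
    # above 32767 appear as negative numbers; adding 65536 recovers them.
    code_unit = n + 0x10000 if n < 0 else n
    return chr(code_unit)


def decode_hex_escape(hex_digits: str) -> str:
    # \'hh carries one byte as two hex digits; decode it as Windows-1252.
    # Bytes that cp1252 leaves undefined (e.g. 0x81) become U+FFFD here.
    return bytes.fromhex(hex_digits).decode("cp1252", errors="replace")


print(repr(decode_unicode_escape(233)))          # small positive value
print(hex(ord(decode_unicode_escape(-3913))))    # negative value wraps around
print(repr(decode_hex_escape("e9")))
print(repr(decode_hex_escape("93")))
```

The negative case matters in practice because symbol-font characters in the private use area (U+F000 and above) are written as negative numbers in RTF.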
- `\'hh` is cp1252, not font-specific. If you are working with RTF that uses a legacy byte-encoded font (e.g. old Tibetan Dedris fonts where each byte encodes a glyph ID, not a Unicode codepoint), the decoded text of the hex-byte streams will be wrong. You must post-process `stream["text"]` using your own font/encoding lookup. See the BDRC integration example below for how this is handled in practice.
- No section/table stream types. Control words `\sect`, `\cell`, `\row`, and `\line` are consumed as unknown keywords and do not produce typed streams. If you need `"sect_break"`, `"par_break"`, `"cell_break"`, or `"row_break"` stream types, see the advanced parser variant.
- No `\*` destination group content. Comments, CSS, revision marks, and other `{\*\keyword ...}` groups are silently skipped.
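
The post-processing mentioned in the first limitation can be sketched as follows: re-encode the cp1252-decoded text back to bytes, then look each byte up in a per-font glyph table. The font name, the table contents, and the helper `remap_legacy_text` are all placeholders invented for this example; a real mapping (such as Dedris to Unicode) has to come from your own project.

```python
# Hypothetical glyph table: font name -> {byte value: Unicode string}.
# The entries below are placeholders, not a real legacy-font mapping.
GLYPH_TABLES = {
    "ExampleLegacyFont": {0x41: "\u0f40", 0x42: "\u0f41"},
}


def remap_legacy_text(text, font_name, tables=GLYPH_TABLES):
    """Re-interpret cp1252-decoded text as glyph bytes of a legacy font."""
    table = tables.get(font_name)
    if table is None:
        # Not a legacy font we know about: keep the text unchanged.
        return text
    # Each character maps back to exactly one byte; characters outside
    # cp1252 become b"?" and fall through to the original character below.
    raw = text.encode("cp1252", errors="replace")
    return "".join(table.get(byte, char) for char, byte in zip(text, raw))


print(repr(remap_legacy_text("AB!", "ExampleLegacyFont")))
print(repr(remap_legacy_text("AB", "Arial")))
```

Bytes without a table entry keep their original character, so paragraph breaks (`\n`) and punctuation survive the remapping unless the table says otherwise.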
This library is the foundation for several BDRC (Buddhist Digital Resource
Center) Tibetan e-text conversion projects. A typical consumer looks like
tibetan-etext-tools/IE3CN3396/2_convert_rtf_to_xml.py:
from basic_rtf import BasicRTF

parser = BasicRTF()
parser.parse_file(rtf_path)
streams = parser.get_streams()

# Note: match/case requires Python 3.10+.
for stream in streams:
    match stream.get("type"):
        case "footer" | "header" | "sect_break":
            # project-specific: insert TEI page break
            ...
        case "pict":
            continue
        case _:
            # project-specific: map font bytes → Unicode via dedris_converter
            unicode_text = dedris_to_unicode(stream["text"], stream["font"]["name"])

The Dedris→Unicode mapping, TEI XML generation, and normalization steps live
in tibetan-etext-tools and are not part of this library.
BDRC projects that handle Word documents with \panose font descriptors or
need per-stream "sect_break", "par_break", "cell_break", and
"row_break" types use an extended variant of this parser. The key extras in
that fork (at tibetan-etext-tools/IE3CN3396/basic_rtf.py) are:
- `detect_rtf_format()`: heuristic that distinguishes RTF files where Dedris fonts have `{\*\panose}` descriptors ("complex") from those that don't ("simple"), and selects the appropriate font-table parser accordingly.
- `\loch` / `\hich` / `\dbch`: low/high/double-byte charset switches that track which effective font applies to a run.
- `\sect`, `\cell`, `\row`, `\line`: section and table control words emitted as typed streams (`"sect_break"`, `"cell_break"`, `"row_break"`, `"line_break"`).
- Dedris-specific handling of `}` inside a font run (glyph 125 in the Dedris encoding is the same byte as ASCII `}`).
This basic-rtf library intentionally omits those features to stay small and
dependency-free. A possible future basic-rtf-advanced module may package the
extended parser — file an issue if you need it.
- TEI XML output
- Dedris / pytiblegenc font encoding
- pdf-cmap-fix integration
- Word COM automation
- Streaming / incremental parse
MIT