perf: highlight Markdown/MDX with tree-sitter (diff viewer, file viewer, content search) #148

Merged: matej21 merged 3 commits into main from perf/diff-viewer-treesitter-markdown on Jun 19, 2026

@matej21 matej21 commented Jun 19, 2026

Problem

Opening a git diff for some Markdown/MDX files was painfully slow — e.g. a 715-line .mdx with only a 3-line change took ~1.4 s (release) / ~3 s (dev) to render. The same slow highlighter also backs the file viewer and the content-search preview.

Profiling isolated the layer (it is not git, rendering, or diff size):

  • syntect's Sublime Markdown grammar is pathologically slow — ~1 ms/line. On the same bytes: plain text ~0.3 ms, Rust grammar ~35 ms, Markdown ~700 ms (≈20× slower than Rust, ≈2000× slower than plain text). The diff viewer does it twice (old + new) ⇒ ~1.4 s.
  • ~100 % of that cost is syntect's stateful parse, not styling. So "only style displayed lines" saves nothing — reaching a mid-file hunk still requires parsing the whole prefix line-by-line.
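
To make the second point concrete, here is a toy line-oriented classifier whose result for a line depends on state carried over from earlier lines. It is a deliberately simplified sketch and not syntect's implementation: `LineState`, `LineKind`, `step`, and `classify_line` are hypothetical names, and the only state tracked is whether we are inside a backtick code fence.

```rust
/// Toy parser state carried from one line to the next.
/// (Hypothetical: a real grammar carries a full scope stack.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LineState {
    Prose,
    InFence,
}

/// Classification of a single line.
#[derive(Debug, PartialEq)]
pub enum LineKind {
    Prose,
    FenceDelimiter,
    Code,
}

/// Advance the state by one line and classify that line.
pub fn step(state: LineState, line: &str) -> (LineState, LineKind) {
    // A line counts as a fence if it starts with three or more backticks.
    let ticks = line.trim_start().chars().take_while(|c| *c == '`').count();
    match (state, ticks >= 3) {
        (LineState::Prose, true) => (LineState::InFence, LineKind::FenceDelimiter),
        (LineState::InFence, true) => (LineState::Prose, LineKind::FenceDelimiter),
        (LineState::Prose, false) => (LineState::Prose, LineKind::Prose),
        (LineState::InFence, false) => (LineState::InFence, LineKind::Code),
    }
}

/// Classify line `target` (0-based). Also returns how many lines had to be
/// parsed to get there: always `target + 1`, because the state at `target`
/// is only known after stepping through every earlier line.
pub fn classify_line(text: &str, target: usize) -> Option<(LineKind, usize)> {
    let mut state = LineState::Prose;
    for (index, line) in text.lines().enumerate() {
        let (next, kind) = step(state, line);
        if index == target {
            return Some((kind, index + 1));
        }
        state = next;
    }
    None
}
```

Whether a line is prose or code cannot be decided from the line alone, so the work grows with the hunk's position in the file rather than with the number of lines displayed.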

Fix

Highlight Markdown/MDX with tree-sitter (tree-sitter-md), which parses the whole document in ~30 ms. Because a full parse is that cheap, the "carry parse state to a mid-file hunk" problem disappears: every highlight starts from a complete parse of the file.

Hybrid design so there's no regression on embedded code:

  • tree-sitter handles the Markdown structure: headings, emphasis, links, inline code, fence delimiters. The block and inline highlight queries are painted into a per-byte colour buffer; where captures overlap, the smaller, more specific one wins.
  • Fenced code blocks keep going through the existing syntect path for their embedded language (graphql/json/ts/…). Those grammars are fast — only Markdown was the problem.
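
The per-byte colour buffer can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `Capture`, `Span`, `paint`, and `spans` are hypothetical names, colours are opaque `u8` ids, and captures are assumed to arrive as plain byte ranges rather than tree-sitter query matches.

```rust
/// A highlight capture over a byte range (hypothetical shape).
#[derive(Clone, Copy, Debug)]
pub struct Capture {
    pub start: usize,
    pub end: usize, // exclusive
    pub colour: u8,
}

/// A run of consecutive bytes sharing one colour.
#[derive(Debug, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize, // exclusive
    pub colour: u8,
}

/// Paint captures into a per-byte buffer. Captures are applied widest
/// first, so a narrower (more specific) capture overwrites the wider one
/// it sits inside, regardless of the order they were reported in.
pub fn paint(len: usize, captures: &[Capture], default_colour: u8) -> Vec<u8> {
    let mut buffer = vec![default_colour; len];
    let mut ordered: Vec<Capture> = captures.to_vec();
    // Stable sort by descending width; equal widths keep their input order.
    ordered.sort_by(|a, b| {
        let width_a = a.end.saturating_sub(a.start);
        let width_b = b.end.saturating_sub(b.start);
        width_b.cmp(&width_a)
    });
    for capture in ordered {
        // Clamp to the buffer so a stray range cannot panic.
        let end = capture.end.min(len);
        let start = capture.start.min(end);
        for slot in &mut buffer[start..end] {
            *slot = capture.colour;
        }
    }
    buffer
}

/// Collapse the buffer into maximal same-colour spans.
pub fn spans(buffer: &[u8]) -> Vec<Span> {
    let mut out: Vec<Span> = Vec::new();
    for (index, &colour) in buffer.iter().enumerate() {
        if let Some(last) = out.last_mut() {
            if last.colour == colour {
                last.end = index + 1;
                continue;
            }
        }
        out.push(Span { start: index, end: index + 1, colour });
    }
    out
}
```

With a heading capture over bytes 0..10 and an inline-code capture over bytes 4..7, the inline-code colour ends up on bytes 4..7 and the heading colour on the rest, whichever capture was reported first.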

Unified across all viewers. A shared markdown_line_spans core feeds two thin adapters with the existing output shapes, so callers are unchanged:

  • highlight_markdown_file → diff viewer (line -> spans map, all lines).
  • highlight_markdown_content → file viewer & content-search preview (ordered HighlightedLines, capped at max_lines), wired in via syntax::highlight_content.
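
A rough sketch of the "one core, two thin adapters" shape is below. The function names come from this PR, but the signatures, the `Span` and `HighlightedLine` fields, the 0-based line keys, and the stand-in colouring logic are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// One coloured run of text within a line (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
pub struct Span {
    pub text: String,
    pub colour: u32,
}

/// A line as the file viewer might consume it: plain text plus its spans.
#[derive(Clone, Debug, PartialEq)]
pub struct HighlightedLine {
    pub plain: String,
    pub spans: Vec<Span>,
}

/// Shared core: ordered spans for every line of the document.
/// Stand-in logic only: heading lines get colour 1, everything else 0.
pub fn markdown_line_spans(source: &str) -> Vec<Vec<Span>> {
    source
        .lines()
        .map(|line| {
            let colour = if line.starts_with('#') { 1 } else { 0 };
            vec![Span { text: line.to_string(), colour }]
        })
        .collect()
}

/// Diff-viewer adapter: line index (0-based here, an assumption) -> spans,
/// for all lines.
pub fn highlight_markdown_file(source: &str) -> HashMap<usize, Vec<Span>> {
    markdown_line_spans(source).into_iter().enumerate().collect()
}

/// File-viewer adapter: ordered lines, capped at `max_lines`.
/// The whole document is still processed; only the output is truncated.
pub fn highlight_markdown_content(source: &str, max_lines: usize) -> Vec<HighlightedLine> {
    markdown_line_spans(source)
        .into_iter()
        .take(max_lines)
        .map(|spans| HighlightedLine {
            // Rebuild the plain text by concatenating the span texts.
            plain: spans.iter().map(|span| span.text.as_str()).collect(),
            spans,
        })
        .collect()
}
```

Keeping the adapters this thin means any fix to the core reaches the diff viewer, the file viewer, and the content-search preview at once, while each caller keeps the output shape it already expects.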

Non-Markdown files are completely unaffected (still syntect).

Result

Measured end-to-end on the same file (release), including embedded graphql/json highlighting:

| | before (syntect) | after (tree-sitter) |
|---|---|---|
| single pass | ~700 ms | ~41 ms |
| diff (old + new) | ~1.4 s | ~82 ms |

~17× faster — and now the file viewer / content-search preview get the same win.
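
For anyone re-checking the arithmetic, the helpers below reproduce the headline numbers from the rounded timings quoted above. `speedup` and `per_line_ms` are hypothetical names used only for this check.

```rust
/// How many times faster the new timing is than the old one.
pub fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Average cost per line for a file of `lines` lines.
pub fn per_line_ms(total_ms: f64, lines: u32) -> f64 {
    total_ms / f64::from(lines)
}
```

`speedup(700.0, 41.0)` and `speedup(1400.0, 82.0)` both come out at about 17.07, which is the ~17× figure. For the 715-line file, `per_line_ms(700.0, 715)` is about 0.98 ms per line (the "~1 ms/line" above), against roughly 0.057 ms per line after the change.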

Notes

  • tree-sitter 0.26 and streaming-iterator are already in the workspace (via gpui-component), so this only adds the tree-sitter-md grammar crate.
  • The grammar compiles C via the cc crate — fine on Linux/macOS; on Windows it needs the MSVC toolchain already used for this repo (x64 Native Tools).
  • New module crates/okena-files/src/markdown_highlight.rs with unit tests (text reconstruction, heading colouring, embedded-language colouring, HighlightedLine shape + max_lines, path matching). No unwrap/expect in non-test code (crate lint).

Manual check suggested

Open a large .mdx/.md file (e.g. a Contember docs page with graphql code fences) in both the diff viewer and the file viewer; confirm it renders instantly with headings/links/code coloured.

🤖 Generated with Claude Code

https://claude.ai/code/session_019JZRfqVneyJnKjdrtSp3ro

@matej21 matej21 changed the title from "perf(diff): highlight Markdown/MDX with tree-sitter instead of syntect" to "perf: highlight Markdown/MDX with tree-sitter (diff viewer, file viewer, content search)" on Jun 19, 2026
matej21 and others added 3 commits June 19, 2026 14:23
syntect's Sublime Markdown grammar is pathologically slow — ~1 ms/line,
roughly 20x slower than the Rust grammar on the same bytes and ~2000x
slower than plain text. The diff viewer pre-highlights the full old *and*
new file on every selection, so a 715-line .mdx with a 3-line diff spent
~1.4 s (release) / ~3 s (dev) just colouring, regardless of diff size.
Profiling showed ~100% of that cost is syntect's stateful parse; styling
is free, so "highlight only displayed lines" does not help — reaching a
mid-file hunk still requires parsing the whole prefix.

Switch Markdown/MDX to tree-sitter, which parses the whole document in
~30 ms and sidesteps the mid-file-state problem entirely. Hybrid design:
tree-sitter handles the Markdown structure (headings, emphasis, links,
inline code, fence delimiters) while fenced code blocks keep going
through the existing syntect path for their embedded language
(graphql/json/ts/…) — those grammars are fast; only Markdown was the
problem. Output mirrors the existing `line -> spans` map, so the rest of
the diff pipeline is unchanged.
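
One way to picture the hybrid split is as a per-line decision about which highlighter owns the line. The sketch below is an assumption-laden illustration, not the module's code: `Owner`, `fence_info`, and `assign_owners` are hypothetical names, only backtick fences are recognised, and any fence line closes an open block (CommonMark's fence-length and info-string rules are ignored).

```rust
/// Which highlighter is responsible for a line.
#[derive(Clone, Debug, PartialEq)]
pub enum Owner {
    /// Markdown structure, including the fence delimiter lines themselves.
    Markdown,
    /// Body of a fenced block, handed to the embedded-language highlighter.
    Embedded(String),
}

/// If `line` is a fence delimiter, return its info string (may be empty).
fn fence_info(line: &str) -> Option<&str> {
    let trimmed = line.trim_start();
    let ticks = trimmed.chars().take_while(|c| *c == '`').count();
    // A backtick is one byte, so the char count is a valid byte offset.
    if ticks >= 3 { Some(trimmed[ticks..].trim()) } else { None }
}

/// Assign an owner to every line of `source`.
pub fn assign_owners(source: &str) -> Vec<Owner> {
    let mut owners = Vec::new();
    // Language of the currently open fenced block, if any.
    let mut open: Option<String> = None;
    for line in source.lines() {
        match (fence_info(line), open.take()) {
            // Opening fence: remember the first word of the info string.
            (Some(info), None) => {
                let language = info.split_whitespace().next().unwrap_or("");
                owners.push(Owner::Markdown);
                open = Some(language.to_string());
            }
            // Closing fence: `open` stays cleared.
            (Some(_), Some(_)) => owners.push(Owner::Markdown),
            // Line inside a fenced block: keep the block open.
            (None, Some(language)) => {
                owners.push(Owner::Embedded(language.clone()));
                open = Some(language);
            }
            // Ordinary Markdown line.
            (None, None) => owners.push(Owner::Markdown),
        }
    }
    owners
}
```

Lines tagged `Embedded("graphql")` would go to the existing per-language path, while everything else, fence delimiters included, stays with the Markdown highlighter.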

Measured end-to-end on the same file: ~41 ms (old+new ≈ 82 ms) vs ~1.4 s
before — ~17x faster.

tree-sitter 0.26 and streaming-iterator are already in the workspace via
gpui-component, so this only adds the tree-sitter-md grammar crate.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_019JZRfqVneyJnKjdrtSp3ro
…-sitter too

Unify the highlighting path: `syntax::highlight_content` (used by the file
viewer and the content-search preview) now routes Markdown/MDX through the
same tree-sitter module as the diff viewer, instead of syntect's slow
Markdown grammar.

Extract a shared `markdown_line_spans` core that produces ordered per-line
spans, with two thin adapters: `highlight_markdown_file` (diff viewer,
`line -> spans` map, all lines) and `highlight_markdown_content` (file
viewer, ordered `HighlightedLine`s with plain text, capped at `max_lines`).
The whole document is always parsed (tree-sitter is cheap); only the
emitted line count is capped.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_019JZRfqVneyJnKjdrtSp3ro
…t theme

Markdown element colours were a hand-picked dark/light palette. Resolve
them from the active syntect theme instead (via the TextMate scope each
element maps to), so Markdown matches how that theme renders it elsewhere
and follows the theme if it ever becomes configurable.

Themes vary in coverage: Dracula defines markup.heading/bold/italic/link
etc., while GitHub defines almost no markup.* rules. So each element falls
back to the previous hand-picked colour when the theme has no rule for its
scope (detected by the highlighter returning the default foreground).
Result: Dracula now uses its real Markdown colours; GitHub is unchanged.
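
The fallback rule can be sketched like this. `Theme` here is a minimal stand-in, not syntect's type: it matches scopes by exact name rather than by selector, and `Colour`, `colour_for`, and `resolve` are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// An RGB colour.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Colour(pub u8, pub u8, pub u8);

/// Minimal stand-in for a theme: per-scope rules plus a default foreground.
pub struct Theme {
    pub default_foreground: Colour,
    pub rules: HashMap<&'static str, Colour>,
}

impl Theme {
    /// Colour the theme assigns to `scope`, or the default foreground when
    /// no rule exists (mirroring what a highlighter would return).
    pub fn colour_for(&self, scope: &str) -> Colour {
        self.rules.get(scope).copied().unwrap_or(self.default_foreground)
    }
}

/// Resolve an element's colour from the theme, falling back to the
/// hand-picked colour when the theme yields only its default foreground.
pub fn resolve(theme: &Theme, scope: &str, hand_picked: Colour) -> Colour {
    let themed = theme.colour_for(scope);
    if themed == theme.default_foreground {
        hand_picked
    } else {
        themed
    }
}
```

One consequence of detecting "no rule" this way is that a theme which deliberately styles a scope with its default foreground is indistinguishable from a theme with no rule, so that element also gets the hand-picked colour.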

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_019JZRfqVneyJnKjdrtSp3ro
@matej21 matej21 force-pushed the perf/diff-viewer-treesitter-markdown branch from d9952f6 to dcbb6dd Compare June 19, 2026 12:23
@matej21 matej21 merged commit 77f73b5 into main Jun 19, 2026
8 checks passed
@matej21 matej21 deleted the perf/diff-viewer-treesitter-markdown branch June 19, 2026 12:27