Summary
When converting Excel files, Docling internally captures sheet names as GroupItem objects (e.g., name="sheet: {name}", label=GroupLabel.SECTION). However, these sheet names are not rendered as visible headings in the Markdown output.
Problem
During export_to_markdown(), the logical document structure (sheet grouping) is not reflected in the final Markdown. This makes it difficult to distinguish content originating from different sheets, especially for multi-sheet workbooks.
Expected Behavior
Sheet names should be emitted as Markdown section headers (e.g., ## Sheet Name) during export.
Proposed Solutions
- Automatically render GroupItem names (where label=SECTION) as Markdown headings.
- Introduce an optional flag, e.g.:
  doc.export_to_markdown(include_group_headings=True)
- Allow customization of heading levels (e.g., ##, ###) for better integration into downstream pipelines.
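To make the first and third proposals concrete, here is a minimal sketch of how a section group's name might be mapped to a heading. It does not use Docling's actual classes or serializer: StubGroup, group_heading, the level and prefix parameters, and the "section" label string are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StubGroup:
    """Hypothetical stand-in for a GroupItem; only the fields used here."""
    name: str   # e.g. "sheet: Sales"
    label: str  # e.g. "section" (assumed string form of GroupLabel.SECTION)


def group_heading(group, level=2, prefix="sheet: "):
    """Return a Markdown heading for a section group, or None for other groups."""
    if group.label != "section":
        return None
    if not 1 <= level <= 6:
        raise ValueError("Markdown heading level must be between 1 and 6")
    title = group.name
    # Drop the internal "sheet: " prefix so only the sheet name is shown.
    if title.startswith(prefix):
        title = title[len(prefix):]
    return f"{'#' * level} {title.strip()}"


print(group_heading(StubGroup("sheet: Sales", "section")))           # ## Sales
print(group_heading(StubGroup("sheet: Costs", "section"), level=3))  # ### Costs
```

Whether the "sheet: " prefix should be stripped or kept in the heading is an open design question; the sketch strips it by default and leaves it configurable.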
Current Workarounds
- Manually iterating over pages and injecting headers.
- Using page_break_placeholder to visually separate sheets.
- Parsing GroupItem names programmatically.
These approaches add complexity and require post-processing that could be handled natively.
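As an illustration of the post-processing involved, the sketch below combines the second and third workarounds on plain strings. It assumes the Markdown was exported with a page_break_placeholder that appears exactly once between consecutive sheets, and that the sheet names were collected separately in the same order. The function name inject_sheet_headings and the placeholder text are hypothetical.

```python
def inject_sheet_headings(markdown, sheet_names, placeholder, level=2):
    """Split exported Markdown on the placeholder and prefix each part with a heading.

    Assumes one placeholder between consecutive sheets and that sheet_names
    is in the same order as the sheets in the export.
    """
    parts = markdown.split(placeholder)
    if len(parts) != len(sheet_names):
        raise ValueError(
            f"found {len(parts)} sections but {len(sheet_names)} sheet names"
        )
    marker = "#" * level
    sections = [
        f"{marker} {name}\n\n{part.strip()}"
        for name, part in zip(sheet_names, parts)
    ]
    return "\n\n".join(sections) + "\n"


exported = "| a |\n\n<!-- sheet break -->\n\n| b |"
print(inject_sheet_headings(exported, ["Sales", "Costs"], "<!-- sheet break -->"))
```

The length check is there because the approach silently mislabels content if a sheet is empty or the placeholder also occurs inside a cell, which is part of why native support would be more robust.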
Why This Matters
- Improves readability of exported Markdown.
- Preserves document structure more faithfully.
- Reduces need for custom post-processing in RAG pipelines and document ingestion workflows.
Additional Context
This behavior was observed in the Excel backend implementation where sheets are grouped but not surfaced in Markdown output.
Happy to contribute or test a PR if this direction aligns with the project.