Requested feature
I use Docling in RAG pipelines. One of my input types is HTML content retrieved from a specific HTML endpoint, which returns only the contents of one field on the page and may not start with a heading.
Currently, when the HTML contains headings, the HTML backend assigns everything before the first heading to the Furniture content layer (code). But the whole point of retrieving data through a dedicated endpoint is to select in advance what belongs in the Body, so I don't want Docling to auto-detect the content layer for this input.
I'd like an option to set the default layer for the html_backend to Body, so that all content goes there and then shows up in chunks and in the Markdown export.
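To make the request concrete, here is a rough, stdlib-only sketch of the behavior I mean. It is not Docling's actual code: `assign_layers` and the `default_body` flag are hypothetical names used only for illustration, and the layer logic is a simplified model of what is described above.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collects (enclosing_tag, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag = self._stack[-1] if self._stack else ""
            self.items.append((tag, text))


def assign_layers(html, default_body=False):
    """Simplified model: text before the first heading is 'furniture',
    unless default_body=True (hypothetical option) or there is no heading."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    has_heading = any(tag in HEADINGS for tag, _ in parser.items)
    in_body = default_body or not has_heading
    result = []
    for tag, text in parser.items:
        if tag in HEADINGS:
            in_body = True  # everything from the first heading on is body
        result.append(("body" if in_body else "furniture", text))
    return result


fragment = "<p>Intro paragraph.</p><h2>Details</h2><p>More text.</p>"
print(assign_layers(fragment))                     # intro lands in 'furniture'
print(assign_layers(fragment, default_body=True))  # everything lands in 'body'
```

With the flag off, the intro paragraph is lost from chunks and Markdown export. With it on, the fragment is kept whole, which is what I need for pre-filtered endpoint content.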
Alternatives
Manually prepending a heading to the content isn't always convenient, and a suitable title doesn't always exist.
Prompted by the discussion in #2388