Commit 629b5db

llm docs: ensurePdfTextPositions
1 parent 23d44fd commit 629b5db

1 file changed

Lines changed: 313 additions & 0 deletions

@@ -0,0 +1,313 @@
# ensurePdfTextPositions Test API

A test predicate for verifying the spatial layout of text in rendered PDFs. Use it to assert that elements appear in the correct positions relative to each other.

## Requirements

- **Tagged PDFs only**: This API requires PDFs with PDF 1.4+ structure tree support (marked-content IDs, or MCIDs, linking text to semantic elements)
- **Typst**: Works out of the box (Typst produces tagged PDFs by default)
- **LaTeX**: Not currently supported (tagged output requires `\DocumentMetadata{}` before `\documentclass`)

## Basic Usage

Add assertions to your `.qmd` file's YAML front matter:

```yaml
_quarto:
  tests:
    typst:
      ensurePdfTextPositions:
        - # First array: positive assertions (these must be true)
          - subject: "Chapter 1"
            relation: above
            object: "Introduction text"
          - subject: "Margin note"
            relation: rightOf
            object: "Body paragraph"
        - [] # Second array: negative assertions (these must NOT be true)
      noErrors: default
```

## Text Selectors

A selector identifies text in the PDF. It can be a simple string or an object with options.

### Simple String Selector

```yaml
subject: "Hello World"
```

Searches for text containing "Hello World". The text must appear exactly once in the PDF.

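The "exactly once" rule can be pictured with a small sketch. This is not the actual implementation: `findUniqueMatch` and the flat `string[]` of text items are hypothetical, and the error strings are only modeled on the messages listed under Error Messages.

```typescript
// Hypothetical helper, for illustration only (not the real Quarto code).
// `items` stands in for the text items extracted from the PDF.
function findUniqueMatch(items: string[], needle: string): number {
  const hits: number[] = [];
  items.forEach((item, index) => {
    // Substring match: the item only needs to contain the search text.
    if (item.includes(needle)) hits.push(index);
  });
  if (hits.length === 0) {
    throw new Error(`Text not found in PDF: ${needle}`);
  }
  if (hits.length > 1) {
    throw new Error(
      `Text ${needle} is ambiguous - found ${hits.length} matches`,
    );
  }
  return hits[0];
}
```

Under these assumptions, a string that appears in two text items fails the lookup before any position is compared, which is why a more specific search string is usually the simplest fix.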
### Object Selector

```yaml
subject:
  text: "Hello World"
  role: "H1"
  page: 1
  edge: left
  granularity: "Div"
```

| Field | Description |
|-------|-------------|
| `text` | Text to search for (required unless `role: "Page"`) |
| `role` | Expected PDF structure role (see Roles below) |
| `page` | Page number, 1-indexed (required for `role: "Page"`) |
| `edge` | Override which edge to use: `left`, `right`, `top`, `bottom` |
| `granularity` | Aggregate bounding box to ancestor with this role |

## Roles

Standard PDF structure roles: `P`, `H1`, `H2`, `H3`, `Figure`, `Table`, `Span`, `Div`, etc.

### Special Roles

**`Decoration`** - For untagged page elements like headers, footers, and page numbers:

```yaml
subject:
  text: "Page 1"
  role: "Decoration"
```

- Uses raw text item bounds (no structure tree lookup)
- Allows multiple matches (uses first match)

**`Page`** - Represents the entire page area:

```yaml
subject:
  role: "Page"
  page: 2
```

- The `text` field is ignored
- Useful for negative assertions (e.g., "this text should NOT be on page 1")

## Relations

### Directional Relations

Assert spatial ordering between elements.

| Relation | Meaning | Default Edges |
|----------|---------|---------------|
| `above` | Subject is above object | `subject.bottom < object.top` |
| `below` | Subject is below object | `subject.top > object.bottom` |
| `leftOf` | Subject is left of object | `subject.right < object.left` |
| `rightOf` | Subject is right of object | `subject.left > object.right` |

**Distance constraints** (optional):

```yaml
- subject: "Title"
  relation: above
  object: "Subtitle"
  byMin: 10 # At least 10pt gap
  byMax: 50 # At most 50pt gap
```

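As a rough illustration of how the table and the distance constraints fit together, here is a minimal sketch. It assumes Y increases downward (see Coordinate System), that the gap is measured between the default edges, and that `byMin` and `byMax` are inclusive bounds; the names `directionalGap` and `directionalHolds` are hypothetical and not part of the API.

```typescript
type BBox = { left: number; right: number; top: number; bottom: number };
type Direction = "above" | "below" | "leftOf" | "rightOf";

// Signed gap in points between the default edges.
// Positive exactly when the inequality in the table holds.
function directionalGap(rel: Direction, s: BBox, o: BBox): number {
  switch (rel) {
    case "above":
      return o.top - s.bottom;
    case "below":
      return s.top - o.bottom;
    case "leftOf":
      return o.left - s.right;
    case "rightOf":
      return s.left - o.right;
  }
}

// Assumption: byMin/byMax are inclusive limits on the gap.
function directionalHolds(
  rel: Direction,
  s: BBox,
  o: BBox,
  byMin?: number,
  byMax?: number,
): boolean {
  const gap = directionalGap(rel, s, o);
  if (gap <= 0) return false; // the table uses strict inequalities
  if (byMin !== undefined && gap < byMin) return false;
  if (byMax !== undefined && gap > byMax) return false;
  return true;
}
```

For example, a title whose bottom edge is at y = 120 and a subtitle whose top edge is at y = 150 are 30pt apart, so `above` with `byMin: 10` and `byMax: 50` would hold in this sketch.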
### Alignment Relations

Assert that edges are aligned (within tolerance).

| Relation | Meaning |
|----------|---------|
| `topAligned` | Top edges match |
| `bottomAligned` | Bottom edges match |
| `leftAligned` | Left edges match |
| `rightAligned` | Right edges match |

```yaml
- subject: "Column A"
  relation: leftAligned
  object: "Column B"
  tolerance: 5 # Allow 5pt difference (default: 2pt)
```

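The tolerance check can be sketched as an absolute difference between the two matching edges. This is an illustration only: `isAligned` is a hypothetical name, and treating the tolerance as inclusive (`<=`) is an assumption; the 2pt default comes from the comment above.

```typescript
type Edge = "left" | "right" | "top" | "bottom";
type BBox = Record<Edge, number>;
type AlignRelation =
  | "topAligned"
  | "bottomAligned"
  | "leftAligned"
  | "rightAligned";

// Each alignment relation compares the same edge on both elements.
const ALIGN_EDGE: Record<AlignRelation, Edge> = {
  topAligned: "top",
  bottomAligned: "bottom",
  leftAligned: "left",
  rightAligned: "right",
};

// Assumption: a difference equal to the tolerance still counts as aligned.
function isAligned(
  rel: AlignRelation,
  s: BBox,
  o: BBox,
  tolerance = 2,
): boolean {
  const edge = ALIGN_EDGE[rel];
  return Math.abs(s[edge] - o[edge]) <= tolerance;
}
```

With left edges at 72pt and 75.5pt, this sketch rejects `leftAligned` at the default tolerance and accepts it with `tolerance: 5`.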
### Tag-Only Assertions

Validate that text exists with a specific role, with no position comparison, by omitting `relation` and `object`:

```yaml
- subject:
    text: "Important"
    role: "H1"
```

## Edge Overrides

Override which edge is used for comparison:

```yaml
- subject:
    text: "Left column"
    edge: right # Use right edge of subject
  relation: leftOf
  object:
    text: "Right column"
    edge: left # Use left edge of object
```

Default edges per relation:

| Relation | Subject Edge | Object Edge |
|----------|--------------|-------------|
| `leftOf` | right | left |
| `rightOf` | left | right |
| `above` | bottom | top |
| `below` | top | bottom |
| `leftAligned` | left | left |
| `rightAligned` | right | right |
| `topAligned` | top | top |
| `bottomAligned` | bottom | bottom |

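One way to read the defaults table together with the `edge` field is as a lookup with optional overrides. The sketch below transcribes the table into a constant; `DEFAULT_EDGES` and `resolveEdges` are hypothetical names, not the real implementation.

```typescript
type Edge = "left" | "right" | "top" | "bottom";
type Relation =
  | "leftOf"
  | "rightOf"
  | "above"
  | "below"
  | "leftAligned"
  | "rightAligned"
  | "topAligned"
  | "bottomAligned";

// Transcribed from the "Default edges per relation" table.
const DEFAULT_EDGES: Record<Relation, { subject: Edge; object: Edge }> = {
  leftOf: { subject: "right", object: "left" },
  rightOf: { subject: "left", object: "right" },
  above: { subject: "bottom", object: "top" },
  below: { subject: "top", object: "bottom" },
  leftAligned: { subject: "left", object: "left" },
  rightAligned: { subject: "right", object: "right" },
  topAligned: { subject: "top", object: "top" },
  bottomAligned: { subject: "bottom", object: "bottom" },
};

// An explicit `edge` on either selector replaces that side's default.
function resolveEdges(
  relation: Relation,
  subjectEdge?: Edge,
  objectEdge?: Edge,
): { subject: Edge; object: Edge } {
  const defaults = DEFAULT_EDGES[relation];
  return {
    subject: subjectEdge ?? defaults.subject,
    object: objectEdge ?? defaults.object,
  };
}
```

Note that the YAML example above spells out `right` and `left` for `leftOf`, which are already the defaults, so in this reading it behaves the same as omitting both `edge` fields.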
## Granularity

Aggregate the bounding box to an ancestor element:

```yaml
- subject:
    text: "cell content"
    granularity: "Table" # Use entire table's bbox, not just the cell
  relation: below
  object: "Heading"
```

This walks up the structure tree to find an ancestor with the specified role and computes the bounding box from all its descendant content.

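The walk-up-then-aggregate step might look roughly like the following sketch. The `StructNode` shape and the function names are hypothetical stand-ins for the PDF structure tree, and the code assumes Y increases downward, so `top` is the minimum Y and `bottom` the maximum.

```typescript
type BBox = { left: number; right: number; top: number; bottom: number };

// Hypothetical, simplified structure-tree node.
interface StructNode {
  role: string;
  parent?: StructNode;
  children: StructNode[];
  boxes: BBox[]; // bounds of content attached directly to this node
}

// Walk up from the matched node to the nearest ancestor with `role`.
function findAncestor(node: StructNode, role: string): StructNode | undefined {
  let current = node.parent;
  while (current && current.role !== role) current = current.parent;
  return current;
}

// Gather the boxes of a node and all of its descendants.
function collectBoxes(node: StructNode): BBox[] {
  const result = [...node.boxes];
  for (const child of node.children) result.push(...collectBoxes(child));
  return result;
}

// Smallest box enclosing every input box.
function unionBBox(boxes: BBox[]): BBox {
  if (boxes.length === 0) throw new Error("no content boxes to aggregate");
  return boxes.reduce((acc, b) => ({
    left: Math.min(acc.left, b.left),
    right: Math.max(acc.right, b.right),
    top: Math.min(acc.top, b.top),
    bottom: Math.max(acc.bottom, b.bottom),
  }));
}

function granularBBox(node: StructNode, role: string): BBox {
  const ancestor = findAncestor(node, role);
  if (!ancestor) throw new Error(`no ancestor with role ${role}`);
  return unionBBox(collectBoxes(ancestor));
}
```

In this sketch, matching text in one table cell with `granularity: "Table"` yields a box spanning every cell, so a `below` comparison is made against the top of the whole table rather than the top of the matched cell.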
## Negative Assertions

The second array contains assertions that must NOT be true:

```yaml
_quarto:
  tests:
    typst:
      ensurePdfTextPositions:
        - # Positive (must be true)
          - subject: "Content"
            relation: below
            object: "Title"
        - # Negative (must NOT be true)
          - subject: "Margin note"
            relation: leftOf
            object: "Body text"
```

Negative assertions pass if:

- The elements are on different pages
- The spatial relation does not hold
- Either element is not found

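The three pass conditions can be summarized in a short sketch. `Located` and `negativeAssertionPasses` are hypothetical names, and the relation check is passed in as a callback so the example stays independent of how relations are actually evaluated.

```typescript
type BBox = { left: number; right: number; top: number; bottom: number };

// Hypothetical result of resolving a selector: page number plus bounds.
interface Located {
  page: number;
  bbox: BBox;
}

// A negative assertion passes when the relation cannot be shown to hold.
function negativeAssertionPasses(
  subject: Located | undefined,
  object: Located | undefined,
  relationHolds: (s: BBox, o: BBox) => boolean,
): boolean {
  if (!subject || !object) return true; // either element is not found
  if (subject.page !== object.page) return true; // different pages
  return !relationHolds(subject.bbox, object.bbox); // relation does not hold
}
```

This differs from positive assertions, where elements on different pages produce an error rather than a pass or a fail (see Error Messages).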
## Coordinate System

- Origin: top-left corner of page
- X: increases rightward
- Y: increases downward

## Error Messages

### "Cannot compare positions: X is on page 1, Y is on page 2"

The two elements are on different pages. For a positive assertion this is reported as an error rather than a pass or a fail. If the elements may legitimately land on different pages, use a negative assertion instead.

### "Text X is ambiguous - found N matches"

The search text appears multiple times. Use a more specific string, or use `role: "Decoration"` if this is expected (e.g., repeated headers).

### "Text not found in PDF: X"

The search text doesn't exist in the PDF.

### "Text X has no MCID - PDF may not be tagged"

The text exists but isn't linked to the structure tree. Use `role: "Decoration"` for untagged elements.

## Examples

### Verify heading hierarchy

```yaml
ensurePdfTextPositions:
  - - subject: "Chapter 1"
      relation: above
      object: "Section 1.1"
    - subject: "Section 1.1"
      relation: above
      object: "Section 1.2"
  - []
```

### Verify margin layout

```yaml
ensurePdfTextPositions:
  - - subject: "Margin note text"
      relation: rightOf
      object: "Body paragraph"
    - subject: "Margin note text"
      relation: topAligned
      object: "Body paragraph"
  - []
```

### Verify header/footer positioning

```yaml
ensurePdfTextPositions:
  - - subject:
        text: "HEADER_TEXT"
        role: "Decoration"
      relation: above
      object: "Page content"
    - subject: "Page content"
      relation: above
      object:
        text: "FOOTER_TEXT"
        role: "Decoration"
  - []
```

### Verify elements are NOT on a specific page

```yaml
ensurePdfTextPositions:
  - [] # No positive assertions
  - # Negative: "Secret content" should NOT be anywhere on page 1
    - subject: "Secret content"
      relation: above
      object:
        role: "Page"
        page: 1
```

### Verify minimum and maximum spacing

```yaml
ensurePdfTextPositions:
  - - subject: "Figure 1"
      relation: above
      object: "Figure 1 caption"
      byMin: 5 # At least 5pt gap
      byMax: 20 # At most 20pt gap
  - []
```

### Verify column alignment

```yaml
ensurePdfTextPositions:
  - - subject: "Row 1 Col A"
      relation: leftAligned
      object: "Row 2 Col A"
      tolerance: 1
    - subject: "Row 1 Col B"
      relation: leftAligned
      object: "Row 2 Col B"
      tolerance: 1
  - []
```
