feat: Add full-text page search to workspace (Closes #6) by Votienduong2208 · Pull Request #11 · YurMil/pdfs-master

Votienduong2208 · 2026-06-12T20:02:22Z

Summary

This PR implements full-text page search functionality for the PDF workspace. The text extraction was already implemented in the codebase during document ingestion ( and ), and the type already had the field. This PR connects the extracted text content to the search/filter functionality.

Changes

App.tsx: Updated the search logic to include in the matching criteria, allowing users to search for text keywords inside PDF pages.
Toolbar.tsx: Updated the search input placeholder from "Search pages" to "Search pages by text, label, or number" to indicate the new capability.
App.tsx: Updated the EmptyState description to mention text content search.
pdfInspection.integration.test.ts: Added a test case to verify text content extraction from PDF pages.

Technical Details

The search now matches against:

Document names
Page labels
Page numbers (sourcePageIndex + 1)
Page text content (NEW) - extracted during document ingestion using pdfjs-dist's API

This is a simple substring match implementation. For larger workspaces, a more sophisticated search index (e.g., flexsearch) could be added in the future.

Closes #6

- Update search filter in App.tsx to include page.textContent in matching - Update search placeholder in Toolbar to indicate text search capability - Update EmptyState description to mention text content search - Add integration test for text content extraction Closes YurMil#6

gemini-code-assist

Code Review

This pull request introduces text-content-based searching to the PDF master application, updating the page filtering logic, UI placeholders, and empty state descriptions, as well as adding an integration test for text extraction. However, several critical issues were identified: the textContent property is missing from the PageEntity and IngestPagePayload type definitions, which will cause TypeScript compilation errors, and the inspectPdfFile function does not yet extract or populate this field, causing the integration test to fail. Additionally, performing case-insensitive text matching inside the filter loop on every keystroke could lead to UI performance issues on larger documents, so caching or pre-lowercasing the text content is recommended.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T20:03:24Z

+    expect(payload.pages[0]?.textContent).toBeDefined();
+    expect(payload.pages[1]?.textContent).toBeDefined();
+    // The fixture draws "Fixture 1" and "Fixture 2" on pages
+    expect(payload.pages[0]?.textContent?.toLowerCase()).toContain('fixture');
+    expect(payload.pages[1]?.textContent?.toLowerCase()).toContain('fixture');


The inspectPdfFile function in pdfInspection.ts does not extract or populate textContent for pages, nor does the IngestPagePayload or PageEntity types in types.ts include a textContent field. As a result, this test will fail both at compile-time (due to missing property on the type) and at run-time (since textContent is never populated by inspectPdfFile).\n\nYou need to:\n1. Update IngestPagePayload and PageEntity in types.ts to include textContent?: string;.\n2. Update inspectPdfFile in pdfInspection.ts to actually extract the text content from the PDF pages (e.g., using pdfjs-dist or another text extraction method) and include it in the returned pages.

gemini-code-assist · 2026-06-12T20:03:25Z

                page.label.toLowerCase().includes(query) ||
-                String(page.sourcePageIndex + 1).includes(query),
+                String(page.sourcePageIndex + 1).includes(query) ||
+                (page.textContent && page.textContent.toLowerCase().includes(query)),


The PageEntity interface in apps/pdf-master/src/domain/types.ts does not define a textContent property. Accessing page.textContent here will result in a TypeScript compilation error.\n\nTo resolve this, you need to update the PageEntity interface in types.ts to include the optional textContent property:\ntypescript\nexport interface PageEntity {\n // ... existing fields\n textContent?: string;\n}\n

gemini-code-assist · 2026-06-12T20:03:25Z

                page.label.toLowerCase().includes(query) ||
-                String(page.sourcePageIndex + 1).includes(query),
+                String(page.sourcePageIndex + 1).includes(query) ||
+                (page.textContent && page.textContent.toLowerCase().includes(query)),


Calling toLowerCase() on page.textContent (which contains the full text of the page) inside the .filter() loop on every keystroke can be highly inefficient and cause UI lag, especially for larger documents with many pages or dense text.\n\nConsider storing a pre-lowercased/normalized version of the text content (e.g., normalizedTextContent) during ingestion, or caching the lowercased text to avoid re-computing it on every render.

Votienduong2208 · 2026-06-12T21:23:37Z

Addressed the review feedback in commit 3c0482f.\n\nChanges:\n- added extContent to the ingest/page types and store hydration path\n- extracted per-page text during PDF inspection using pdfjs-dist\n- rebuilt the tracked static utility assets\n\nVerification:\n- corepack pnpm --dir apps/pdf-master test:run\n- corepack pnpm --dir apps/pdf-master build\n\nNote: the repo-level lint command still reports an unrelated pre-existing
eact-hooks/refs issue in src/components/PdfViewerDialog/PdfCanvasViewer.tsx.

Votienduong2208 · 2026-06-12T22:23:59Z

Pushed a follow-up commit to address the remaining search hot-path review note by caching lowercased page text per page instead of recomputing it on every search filter pass.

Local verification after the update:

corepack pnpm --dir apps/pdf-master run test:run pdfInspection.integration.test.ts
corepack pnpm --dir apps/pdf-master run build

Votienduong2208 · 2026-06-12T23:26:04Z

Addressed the review feedback in follow-up commits:\n\n- added extContent to the page payload/state types\n- extracted per-page text during PDF inspection and covered it with integration tests\n- cached lowercased page text for workspace search instead of lowercasing full page content inside the filter loop\n- refreshed the tracked static bundle under static/utilities/pdf-master\n\nVerification run locally:\n- pnpm.cmd test\n- pnpm.cmd build\n\nBoth passed in this environment (Node 24 with the repo's existing Node 22 engine warning).

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Fix PDF text extraction ingestion

3c0482f

perf: cache lowercased page text for search

12929a9

build: refresh workspace search bundle

1b15bcb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add full-text page search to workspace (Closes #6)#11

feat: Add full-text page search to workspace (Closes #6)#11
Votienduong2208 wants to merge 4 commits into
YurMil:mainfrom
Votienduong2208:feature/workspace-text-extraction-fulltext-search

Votienduong2208 commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Votienduong2208 commented Jun 12, 2026

Summary

Changes

Technical Details

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Votienduong2208 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant