26 pdf document summarizer by Franci-343 · Pull Request #39 · Cyborg-Network/Oxtools

Franci-343 · 2026-05-26T19:51:54Z

Oxtools Submission

Project name: [PDF / Document Summarizer]
Contributor: [@Franci-343]

What does this tool do?

The PDF document summarizer processes files in multiple formats (PDF, Word, Excel, CSV, text, or images) and generates a structured report with a summary, key findings, action items, risk factors, and a data table. It works by dividing the document into fragments, summarizing them in parallel, extracting key entities and figures, and finally synthesizing all the information using the Oxlo API.
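
The fragment, summarize, and synthesize flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `split_into_fragments`, `summarize_fragment`, and `summarize_document` are hypothetical names, and the per-fragment summarizer is a local stand-in for whatever Oxlo API call the tool actually makes.

```python
import asyncio


def split_into_fragments(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack paragraphs into fragments of roughly max_chars.

    A single paragraph longer than max_chars becomes its own oversized fragment.
    """
    fragments: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new fragment when adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            fragments.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        fragments.append(current)
    return fragments


async def summarize_fragment(fragment: str) -> str:
    """Hypothetical stand-in for an API call: keeps only the first sentence."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return fragment.split(".")[0].strip() + "."


async def summarize_document(text: str, max_chars: int = 200) -> str:
    """Summarize all fragments concurrently, then join the partial summaries."""
    fragments = split_into_fragments(text, max_chars)
    # asyncio.gather returns results in the same order as its inputs.
    partials = await asyncio.gather(*(summarize_fragment(f) for f in fragments))
    return " ".join(partials)
```

In a real pipeline the final join would presumably be replaced by one more model call that synthesizes the partial summaries into the structured report.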

Submission checklist

Check every box before requesting a review. Unchecked items will result in the PR being sent back.

Structure

My project is in its own directory under projects/[my-project-name]/
I have not placed any files directly in the repository root

Required files

Dockerfile is present and docker build . succeeds
docker-compose.yml is present and docker compose up starts the app
oxlo-manifest.json is present and all fields are filled in
.env.example lists every environment variable the project needs (with empty values)
README.md is present with setup instructions a reviewer can follow exactly

Security

No API keys, private keys, or secrets are hardcoded anywhere in the codebase
My actual .env file is not included in this PR
I have verified my diff with git grep -i "api_key" and found no leaks

Oxlo API

The tool makes at least one functional call to the Oxlo API
The API key is read from the OXLO_API_KEY environment variable

For maintainers

Security scan passed — no secrets in diff
docker build . succeeded locally
docker compose up ran successfully and app is reachable
Oxlo API integration verified
Approved for merge

vercel · 2026-05-26T19:51:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
oxtools	Ready	Preview, Comment	May 27, 2026 9:37pm

oxlo-ai

OxBot Review

This PR introduces a PDF/document summarizer that processes multiple file formats (PDF, Word, Excel, CSV, images) through a Tier 2 Python backend with a custom frontend upload component. While the feature set is ambitious and the Python pipeline shows good modularization, the implementation suffers from significant architectural inconsistencies, mixing core platform modifications with community-project claims. Critical issues span both frontend and backend: the React page is bloated with business logic and security risks, the Python service has incomplete dependencies and API contract mismatches, and shared platform types are polluted with format-specific extensions. Performance bottlenecks appear on both sides, from client-side main-thread processing to repeated spaCy model loading and synchronous blocking in async generators. Collectively, these issues prevent the PR from being mergeable without substantial restructuring.

Notes

Clarify whether this is a core platform Tier 2 tool or a standalone community project under projects/; the current structure mixes both, modifying core app files while claiming to be a community submission.
The frontend-backend contract is inconsistent: the UI invents a pdf-drop input type while the Python service accepts four different document keys, and the FastAPI endpoint must handle both dict errors and async generators.
Heavy document processing should not happen on the main browser thread; either move all parsing to the Python backend or use Web Workers, and avoid dynamic CDN script injection without integrity checks.
The shared platform type system should remain generic; introduce a standard file/files input with optional MIME-type restrictions rather than hardcoding format-specific union members.
The Python service is not runnable as-is due to critically incomplete dependencies and performance anti-patterns such as reloading NLP models per fragment and blocking the async event loop.
Add unit tests for the Python preprocessor and extractor, particularly for edge cases like short pages, empty inputs, and malformed documents.

⚠️ Verdict: Needs Changes | 16 inline comment(s) | 7 file(s) reviewed | ⏱️ 426s

Automated review by OxBot

oxlo-ai · 2026-05-26T19:59:03Z

+		| "loading"
+		| "loading-ocr"
+		| "loading-excel"
+		| "loading-word"


Bug: maxBytes defaults to 20 MB when config.maxSizeMb is undefined, but the error message on the next line defaults to 25 MB. Extract a single local constant (e.g., const maxMb = config.maxSizeMb ?? 20) and use it for both calculations to keep the limit consistent.

oxlo-ai · 2026-05-26T19:59:03Z

+
+	const maxBytes = (config.maxSizeMb || 20) * 1024 * 1024;
+
+	const extractText = useCallback(


Security: loadScript injects third-party scripts from CDNs without Subresource Integrity checks. If the CDN is compromised, arbitrary JavaScript executes in the user's browser. Prefer bundling these libraries via npm. If dynamic loading is required, add s.integrity = 'sha384-...' and s.crossOrigin = 'anonymous'.

oxlo-ai · 2026-05-26T19:59:03Z

+				};
+				const pdfjs = win["pdfjsLib"] as {
+					getDocument: (src: { data: ArrayBuffer }) => {
+						promise: Promise<{ numPages: number; getPage: (n: number) => Promise<PdfPage> }>;


Performance: The PDF OCR fallback creates a high-resolution canvas (scale: 2.5) for every page with no upper bound. A large scanned PDF (e.g., 100+ pages) can allocate gigabytes of memory and crash the tab. Add a maxPages guard, process pages sequentially, and explicitly set canvas.width = canvas.height = 0 after toBlob to help GC.

oxlo-ai · 2026-05-26T19:59:03Z

+					<>
+						<div className="h-6 w-6 animate-spin rounded-full border-2 border-primary border-t-transparent" />
+						<p className="text-sm text-muted-foreground">
+							No text found - running OCR on {fileName}...


Bug: handleDragLeave sets dropState to "done" whenever value is truthy. Because value also contains manually typed text, dragging over the zone and leaving while the textarea has content incorrectly renders the file-upload success UI (checkmark, empty filename). Track whether a file was actually dropped separately from text content.

oxlo-ai · 2026-05-26T19:59:03Z

+					<Textarea
+						value={value}
+						onChange={(e) => {
+							onChange(e.target.value);


Bug: Typing into the textarea sets dropState("done"), which triggers the drop zone's file-upload success UI even though no file was uploaded. Keep manual input state independent of the drop zone; the drop zone should remain in "idle" when the user is typing.

oxlo-ai · 2026-05-26T19:59:03Z

+                        piece = sentence[i : i + CHUNK_MAX_CHARS]
+                        if current and len(current) + len(piece) + 2 > CHUNK_MAX_CHARS:
+                            flush()
+                        if current:


Suggestion: When a sentence exceeds CHUNK_MAX_CHARS and is force-split into pieces, joining them with \n\n breaks semantic continuity because the pieces belong to the same sentence. Consider using a space or empty string as the separator so the LLM still sees contiguous text.
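
A small illustration of the separator choice, assuming fixed-width slicing as in the hunk above. `split_long_sentence` is a hypothetical helper and the sample sentence is invented for the demo.

```python
def split_long_sentence(sentence: str, max_chars: int) -> list[str]:
    """Slice an oversized sentence into pieces of at most max_chars characters."""
    return [sentence[i : i + max_chars] for i in range(0, len(sentence), max_chars)]


sentence = "The quarterly revenue increased substantially compared with last year."
pieces = split_long_sentence(sentence, max_chars=25)

# Joining with a blank line makes each piece look like its own paragraph.
as_paragraphs = "\n\n".join(pieces)

# Joining with an empty string restores the original sentence exactly.
contiguous = "".join(pieces)
```

With raw character slices, a space separator can land inside a word whenever a slice boundary does, so the empty string is the only separator that round-trips the text; a space is the safer choice only if pieces are cut at word boundaries.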

oxlo-ai · 2026-05-26T19:59:03Z

@@ -0,0 +1 @@
+openai>=1.30.0


Bug: This requirements file only lists openai>=1.30.0. For the described tool to function, you must include the complete dependency set. At minimum, add fastapi, uvicorn, python-multipart, and the relevant document parsing libraries (e.g., pymupdf, python-docx, pandas, Pillow). Without these, the Docker build will succeed but the application will fail at runtime.

oxlo-ai · 2026-05-26T19:59:03Z

+
+        yield "[entities] Extracting entities...\n"
+        clean_text = "\n\n".join(chunks)
+        entities = extract_entities(clean_text)


Performance: extract_entities() is invoked synchronously without await. If this helper performs I/O (e.g., Oxlo API calls), it will block the async event loop. If it is CPU-bound, offload it with asyncio.to_thread() to keep the server responsive to other concurrent requests.
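
A minimal sketch of the `asyncio.to_thread()` approach, assuming `extract_entities` is CPU-bound. The extractor body, `run_pipeline`, and `collect` are hypothetical stand-ins; only the log line and the `"\n\n".join(chunks)` step come from the hunk above.

```python
import asyncio
import re
from collections.abc import AsyncIterator


def extract_entities(text: str) -> list[str]:
    """Hypothetical CPU-bound extractor: capitalized words, de-duplicated in order."""
    return list(dict.fromkeys(re.findall(r"\b[A-Z][a-z]+\b", text)))


async def run_pipeline(chunks: list[str]) -> AsyncIterator[str]:
    """Stream progress lines while the extraction runs off the event loop."""
    yield "[entities] Extracting entities...\n"
    clean_text = "\n\n".join(chunks)
    # Run the synchronous helper in a worker thread so other requests keep flowing.
    entities = await asyncio.to_thread(extract_entities, clean_text)
    yield f"[entities] Found {len(entities)}: {', '.join(entities)}\n"


async def collect(chunks: list[str]) -> list[str]:
    """Drain the async generator into a list (handy for tests)."""
    return [line async for line in run_pipeline(chunks)]
```

`asyncio.to_thread()` requires Python 3.9 or later. It keeps the event loop responsive, but pure-Python CPU work still contends for the GIL, so a process pool may suit heavier workloads better.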

Implements a 5-step pipeline for generating structured, validated READMEs:

1. Project Analyzer (analyzer.py)
   - Extracts metadata: language, package_manager, framework, project_type
   - Uses deepseek-v3.2 for JSON parsing
2. Section Planner (planner.py)
   - Deterministic section planning by project_type
   - web-api/cli/library/web-app/other templates
3. README Writer (writer.py)
   - LLM generation with llama-3.3-70b
   - Injects metadata for consistency
   - Enforces shields.io badges and language-tagged code blocks
4. Section Validator (validator.py)
   - Regex validation: required sections, badge URLs, code fences
   - Returns structured issue list
5. Refinement Agent (refiner.py)
   - Max 2 retries to fix detected issues
   - Re-targets only broken sections

Backend: tool.py, analyzer.py, planner.py, writer.py, validator.py, refiner.py, llm_client.py
Tests: test_planner.py, test_validator.py, test_refiner.py, test_e2e_mock.py (13 tests total)
Frontend: auto-readme-generator.ts, registry.ts (+2 lines)
CI: ci-auto-readme-generator.yml (isolated)

- Updated CI workflow to install dependencies from requirements.txt
- Added output format option in ToolDefinition
- Introduced project metadata handling in analyzer
- Sanitized inputs in refine_readme and write_readme functions
- Implemented code block extraction in validator
- Created unit tests for the auto-readme-generator pipeline

- Add improved JSON parsing with relaxed regex patterns in analyzer
- Implement ValidationError handling for metadata parsing fallback
- Strengthen LLM response validation with comprehensive checks
- Update CI workflow to install dev requirements for testing
- Simplify production dependencies while maintaining core functionality

- validator.py: relax FENCE_OPEN regex to accept GFM info strings (Shashank)
- tool.py: wrap plan_sections and validate_readme in try/except
- tool.py: separate refiner and post-refine validator exception handlers
- refiner.py: add early return guard when issues is empty
- refiner.py: fix Awaitable[str] type hint on call_model
- analyzer.py: wrap call_oxlo_chat in try/except with DEFAULT_METADATA fallback
- tests: convert asyncio.run to async def with pytest.mark.asyncio
- tests: add pytest.ini with asyncio_mode = auto

- Added missing dependencies (openai, spacy)
- Standardized a single maxMb constant for the size check and the error message
- handleDragLeave now checks the input source, not the value
- Typing in the textarea resets the source to manual
- Added a 20-page limit for OCR and canvas cleanup to help garbage collection
- extract_entities() now runs via asyncio.to_thread
- Force-split sentence fragments are now joined with spaces

Performance improvement: OCR page limit
Bug fixes: drag-and-drop status, error message consistency

…e max size variable
fix(result-viewer): improve log rendering by removing unnecessary index in key
feat(app-sidebar): replace img with next/image for optimized logo rendering and add button type
fix(usage-counter): remove unused user variable and add button type for login redirect

Franci-343 linked an issue May 26, 2026 that may be closed by this pull request

PDF / Document Summarizer #26

Open

oxlo-ai Bot reviewed May 26, 2026

Franci-343 added 8 commits May 26, 2026 17:30

feat(pdf-summarizer): implement PDF summarization tool with extractio…

336b0b3

…n, preprocessing, summarization, and synthesis functionalities

feat(pdf-summarizer): add PDF drop field for document upload

4c38b4c

feat(pdf-summarizer): enhance document upload support and update desc…

edd223b

…riptions for various file types

feat(pdf-summarizer): enhance PDF drop field functionality and improv…

ebdccc7

…e summarization output formatting

Franci-343 force-pushed the 26-pdf-document-summarizer branch from 1275520 to ebdccc7 Compare May 26, 2026 21:33

Franci-343 requested review from beekay2706 and ms-shashank as code owners May 26, 2026 21:33

vercel Bot deployed to Preview May 26, 2026 21:33 View deployment

fix(pdf-summarizer): satisfy biome checks in tool page

b4afaae

vercel Bot deployed to Preview May 26, 2026 21:36 View deployment

fix(ui): hide <think> blocks in results

c167317

vercel Bot deployed to Preview May 27, 2026 02:07 View deployment

feat(pdf-summarizer): align report format with original prompt

f341d6e

vercel Bot deployed to Preview May 27, 2026 02:20 View deployment

feat(pdf-summarizer): enhance think block handling and improve report…

adcf1a2

… generation logic

vercel Bot deployed to Preview May 27, 2026 04:03 View deployment

fix(result-viewer): improve log rendering and enhance loading indicator

d6234a6

vercel Bot deployed to Preview May 27, 2026 20:05 View deployment

vercel Bot deployed to Preview May 27, 2026 21:03 View deployment

vercel Bot deployed to Preview May 27, 2026 21:37 View deployment

26 pdf document summarizer #39
Franci-343 wants to merge 15 commits into dev from 26-pdf-document-summarizer
