26 pdf document summarizer#39

Open: Franci-343 wants to merge 15 commits into dev from 26-pdf-document-summarizer

@Franci-343 (Collaborator) commented May 26, 2026

Oxtools Submission

Project name: [PDF / Document Summarizer]
Contributor: [@Franci-343]


What does this tool do?

The PDF document summarizer processes files in multiple formats (PDF, Word, Excel, CSV, text, or images) and generates a structured report with a summary, key findings, action items, risk factors, and a data table. It works by dividing the document into fragments, summarizing them in parallel, extracting key entities and figures, and finally synthesizing all the information using the Oxlo API.


Submission checklist

Check every box before requesting a review. Unchecked items will result in the PR being sent back.

Structure

  • My project is in its own directory under projects/[my-project-name]/
  • I have not placed any files directly in the repository root

Required files

  • Dockerfile is present and docker build . succeeds
  • docker-compose.yml is present and docker compose up starts the app
  • oxlo-manifest.json is present and all fields are filled in
  • .env.example lists every environment variable the project needs (with empty values)
  • README.md is present with setup instructions a reviewer can follow exactly

Security

  • No API keys, private keys, or secrets are hardcoded anywhere in the codebase
  • My actual .env file is not included in this PR
  • I have verified my diff with git grep -i "api_key" and found no leaks

Oxlo API

  • The tool makes at least one functional call to the Oxlo API
  • The API key is read from the OXLO_API_KEY environment variable

For maintainers

  • Security scan passed — no secrets in diff
  • docker build . succeeded locally
  • docker compose up ran successfully and app is reachable
  • Oxlo API integration verified
  • Approved for merge

@Franci-343 linked an issue May 26, 2026 that may be closed by this pull request

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| oxtools | Ready | Preview, Comment | May 27, 2026 9:37pm |

@oxlo-ai oxlo-ai Bot left a comment

OxBot Review

This PR introduces a PDF/document summarizer that processes multiple file formats (PDF, Word, Excel, CSV, images) through a Tier 2 Python backend with a custom frontend upload component. While the feature set is ambitious and the Python pipeline shows good modularization, the implementation suffers from significant architectural inconsistencies, mixing core platform modifications with community-project claims. Critical issues span both frontend and backend: the React page is bloated with business logic and security risks, the Python service has incomplete dependencies and API contract mismatches, and shared platform types are polluted with format-specific extensions. Performance bottlenecks appear on both sides, from client-side main-thread processing to repeated spaCy model loading and synchronous blocking in async generators. Collectively, these issues prevent the PR from being mergeable without substantial restructuring.

Notes

  • Clarify whether this is a core platform Tier 2 tool or a standalone community project under projects/; the current structure mixes both, modifying core app files while claiming to be a community submission.
  • The frontend-backend contract is inconsistent: the UI invents a pdf-drop input type while the Python service accepts four different document keys, and the FastAPI endpoint must handle both dict errors and async generators.
  • Heavy document processing should not happen on the main browser thread; either move all parsing to the Python backend or use Web Workers, and avoid dynamic CDN script injection without integrity checks.
  • The shared platform type system should remain generic; introduce a standard file/files input with optional MIME-type restrictions rather than hardcoding format-specific union members.
  • The Python service is not runnable as-is due to critically incomplete dependencies and performance anti-patterns such as reloading NLP models per fragment and blocking the async event loop.
  • Add unit tests for the Python preprocessor and extractor, particularly for edge cases like short pages, empty inputs, and malformed documents.

⚠️ Verdict: Needs Changes | 16 inline comment(s) | 7 file(s) reviewed | ⏱️ 426s


Automated review by OxBot

| "loading"
| "loading-ocr"
| "loading-excel"
| "loading-word"

Bug: maxBytes defaults to 20 MB when config.maxSizeMb is undefined, but the error message on the next line defaults to 25 MB. Extract a single local constant (e.g., const maxMb = config.maxSizeMb ?? 20) and use it for both calculations to keep the limit consistent.


const maxBytes = (config.maxSizeMb || 20) * 1024 * 1024;

const extractText = useCallback(

Security: loadScript injects third-party scripts from CDNs without Subresource Integrity checks. If the CDN is compromised, arbitrary JavaScript executes in the user's browser. Prefer bundling these libraries via npm. If dynamic loading is required, add s.integrity = 'sha384-...' and s.crossOrigin = 'anonymous'.

};
const pdfjs = win["pdfjsLib"] as {
getDocument: (src: { data: ArrayBuffer }) => {
promise: Promise<{ numPages: number; getPage: (n: number) => Promise<PdfPage> }>;

Performance: The PDF OCR fallback creates a high-resolution canvas (scale: 2.5) for every page with no upper bound. A large scanned PDF (e.g., 100+ pages) can allocate gigabytes of memory and crash the tab. Add a maxPages guard, process pages sequentially, and explicitly set canvas.width = canvas.height = 0 after toBlob to help GC.

<>
<div className="h-6 w-6 animate-spin rounded-full border-2 border-primary border-t-transparent" />
<p className="text-sm text-muted-foreground">
No text found - running OCR on {fileName}...

Bug: handleDragLeave sets dropState to "done" whenever value is truthy. Because value also contains manually typed text, dragging over the zone and leaving while the textarea has content incorrectly renders the file-upload success UI (checkmark, empty filename). Track whether a file was actually dropped separately from text content.

<Textarea
value={value}
onChange={(e) => {
onChange(e.target.value);

Bug: Typing into the textarea sets dropState("done"), which triggers the drop zone's file-upload success UI even though no file was uploaded. Keep manual input state independent of the drop zone; the drop zone should remain in "idle" when the user is typing.

piece = sentence[i : i + CHUNK_MAX_CHARS]
if current and len(current) + len(piece) + 2 > CHUNK_MAX_CHARS:
flush()
if current:

Suggestion: When a sentence exceeds CHUNK_MAX_CHARS and is force-split into pieces, joining them with \n\n breaks semantic continuity because the pieces belong to the same sentence. Consider using a space or empty string as the separator so the LLM still sees contiguous text.
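
The separator suggestion above can be sketched as follows. This is an illustration under assumptions, not the preprocessor in this PR: `split_sentence` and `rebuild_text` are hypothetical helpers, and the split is done at word boundaries so that joining the pieces with a single space reproduces the original sentence.

```python
def split_sentence(sentence: str, max_chars: int) -> list[str]:
    """Split an over-long sentence into pieces of at most max_chars characters."""
    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        # Last resort: a single word longer than the limit is hard-cut.
        while len(word) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def rebuild_text(sentences: list[str], max_chars: int) -> str:
    """Join pieces of one sentence with a space; separate sentences with a blank line."""
    rebuilt = [" ".join(split_sentence(s, max_chars)) for s in sentences]
    return "\n\n".join(rebuilt)


if __name__ == "__main__":
    long_sentence = "The quick brown fox jumps over the lazy dog near the river bank"
    print(split_sentence(long_sentence, 20))
    print(rebuild_text([long_sentence, "Short one."], 20))
```

The blank-line separator is kept only between distinct sentences, so the model never sees a paragraph break in the middle of one. A hard-cut word is the one case where a space join is not lossless.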

@@ -0,0 +1 @@
openai>=1.30.0

Bug: This requirements file only lists openai>=1.30.0. For the described tool to function, you must include the complete dependency set. At minimum, add fastapi, uvicorn, python-multipart, and the relevant document parsing libraries (e.g., pymupdf, python-docx, pandas, Pillow). Without these, the Docker build will succeed but the application will fail at runtime.

Comment thread services/python-tools/tools/pdf-summarizer/tool.py
Comment thread services/python-tools/tools/pdf-summarizer/tool.py

yield "[entities] Extracting entities...\n"
clean_text = "\n\n".join(chunks)
entities = extract_entities(clean_text)

Performance: extract_entities() is invoked synchronously without await. If this helper performs I/O (e.g., Oxlo API calls), it will block the async event loop. If it is CPU-bound, offload it with asyncio.to_thread() to keep the server responsive to other concurrent requests.
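
A minimal sketch of the offloading approach, assuming Python 3.9+ (where `asyncio.to_thread` is available). The `extract_entities` body, `run_pipeline`, and `collect` below are hypothetical stand-ins; only the call pattern is the point.

```python
import asyncio
import time
from typing import AsyncIterator


def extract_entities(text: str) -> list[str]:
    """Hypothetical blocking helper; the sleep simulates slow synchronous work."""
    time.sleep(0.05)
    return sorted({w.strip(".,") for w in text.split() if w[:1].isupper()})


async def run_pipeline(chunks: list[str]) -> AsyncIterator[str]:
    yield "[entities] Extracting entities...\n"
    clean_text = "\n\n".join(chunks)
    # Run the blocking call in a worker thread so the event loop stays free.
    entities = await asyncio.to_thread(extract_entities, clean_text)
    yield f"[entities] Found {len(entities)} entities\n"


async def collect(chunks: list[str]) -> list[str]:
    """Drain the async generator into a list (handy for tests)."""
    return [line async for line in run_pipeline(chunks)]


if __name__ == "__main__":
    for line in asyncio.run(collect(["Alice met Bob.", "They visited Paris."])):
        print(line, end="")
```

Other requests can be served while the worker thread runs. Pure-Python CPU-bound work still contends for the GIL, so a process pool may suit heavy extraction better.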

Implements a 5-step pipeline for generating structured, validated READMEs:

1. Project Analyzer (analyzer.py)
   - Extracts metadata: language, package_manager, framework, project_type
   - Uses deepseek-v3.2 for JSON parsing

2. Section Planner (planner.py)
   - Deterministic section planning by project_type
   - web-api/cli/library/web-app/other templates

3. README Writer (writer.py)
   - LLM generation with llama-3.3-70b
   - Injects metadata for consistency
   - Enforces shields.io badges and language-tagged code blocks

4. Section Validator (validator.py)
   - Regex validation: required sections, badge URLs, code fences
   - Returns structured issue list

5. Refinement Agent (refiner.py)
   - Max 2 retries to fix detected issues
   - Re-targets only broken sections

Backend: tool.py, analyzer.py, planner.py, writer.py, validator.py, refiner.py, llm_client.py
Tests: test_planner.py, test_validator.py, test_refiner.py, test_e2e_mock.py (13 tests total)
Frontend: auto-readme-generator.ts, registry.ts (+2 lines)
CI: ci-auto-readme-generator.yml (isolated)
- Updated CI workflow to install dependencies from requirements.txt
- Added output format option in ToolDefinition
- Introduced project metadata handling in analyzer
- Sanitized inputs in refine_readme and write_readme functions
- Implemented code block extraction in validator
- Created unit tests for the auto-readme-generator pipeline
- Add improved JSON parsing with relaxed regex patterns in analyzer
- Implement ValidationError handling for metadata parsing fallback
- Strengthen LLM response validation with comprehensive checks
- Update CI workflow to install dev requirements for testing
- Simplify production dependencies while maintaining core functionality
- validator.py: relax FENCE_OPEN regex to accept GFM info strings (Shashank)
- tool.py: wrap plan_sections and validate_readme in try/except
- tool.py: separate refiner and post-refine validator exception handlers
- refiner.py: add early return guard when issues is empty
- refiner.py: fix Awaitable[str] type hint on call_model
- analyzer.py: wrap call_oxlo_chat in try/except with DEFAULT_METADATA fallback
- tests: convert asyncio.run to async def with pytest.mark.asyncio
- tests: add pytest.ini with asyncio_mode = auto
…n, preprocessing, summarization, and synthesis functionalities
- Missing dependencies (openai, spacy) added
- The maxMb constant is now standardized for size checking/errors
- handleDragLeave now uses the input source, not the value
- Typing in the textarea now resets the source to manual
- OCR is now limited to 20 pages, and canvases are released for garbage collection
- extract_entities() is now executed using asyncio.to_thread
- Force-split sentence fragments are now joined with spaces

Performance Improvement: OCR page limit

Bug Fixes: Drag and drop status, error message consistency.
…e max size variable

fix(result-viewer): improve log rendering by removing unnecessary index in key
feat(app-sidebar): replace img with next/image for optimized logo rendering and add button type
fix(usage-counter): remove unused user variable and add button type for login redirect
Development

Successfully merging this pull request may close these issues.

PDF / Document Summarizer

1 participant