26 pdf document summarizer#39
OxBot Review
This PR introduces a PDF/document summarizer that processes multiple file formats (PDF, Word, Excel, CSV, images) through a Tier 2 Python backend with a custom frontend upload component. While the feature set is ambitious and the Python pipeline shows good modularization, the implementation suffers from significant architectural inconsistencies, mixing core platform modifications with community-project claims. Critical issues span both frontend and backend: the React page is bloated with business logic and security risks, the Python service has incomplete dependencies and API contract mismatches, and shared platform types are polluted with format-specific extensions. Performance bottlenecks appear on both sides, from client-side main-thread processing to repeated spaCy model loading and synchronous blocking in async generators. Collectively, these issues prevent the PR from being mergeable without substantial restructuring.
Notes
- Clarify whether this is a core platform Tier 2 tool or a standalone community project under projects/; the current structure mixes both, modifying core app files while claiming to be a community submission.
- The frontend-backend contract is inconsistent: the UI invents a pdf-drop input type while the Python service accepts four different document keys, and the FastAPI endpoint must handle both dict errors and async generators.
- Heavy document processing should not happen on the main browser thread; either move all parsing to the Python backend or use Web Workers, and avoid dynamic CDN script injection without integrity checks.
- The shared platform type system should remain generic; introduce a standard file/files input with optional MIME-type restrictions rather than hardcoding format-specific union members.
- The Python service is not runnable as-is due to critically incomplete dependencies and performance anti-patterns such as reloading NLP models per fragment and blocking the async event loop.
- Add unit tests for the Python preprocessor and extractor, particularly for edge cases like short pages, empty inputs, and malformed documents.
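One way to avoid reloading an NLP model for every fragment is to cache the loaded model behind a small accessor. The following is only a sketch under stated assumptions: `_load_model` is a hypothetical stand-in for an expensive loader such as `spacy.load`, and `get_model`, `extract_from_fragments`, and the default model name are illustrative, not taken from the PR.

```python
from functools import lru_cache


def _load_model(name):
    """Hypothetical stand-in for an expensive loader such as spacy.load(name)."""
    return {"name": name}


@lru_cache(maxsize=None)
def get_model(name="en_core_web_sm"):
    """Load the model on first use and return the cached object afterwards."""
    return _load_model(name)


def extract_from_fragments(fragments):
    """Process each fragment with the shared model instead of reloading it."""
    results = []
    for fragment in fragments:
        nlp = get_model()  # cache hit on every call after the first
        results.append((nlp["name"], len(fragment.split())))
    return results
```

With this shape the loader runs once per process, however many fragments are processed.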
Automated review by OxBot
| | "loading" | ||
| | "loading-ocr" | ||
| | "loading-excel" | ||
| | "loading-word" |
Bug: maxBytes defaults to 20 MB when config.maxSizeMb is undefined, but the error message on the next line defaults to 25 MB. Extract a single local constant (e.g., const maxMb = config.maxSizeMb ?? 20) and use it for both calculations to keep the limit consistent.
| const maxBytes = (config.maxSizeMb || 20) * 1024 * 1024; | ||
| const extractText = useCallback( |
Security: loadScript injects third-party scripts from CDNs without Subresource Integrity checks. If the CDN is compromised, arbitrary JavaScript executes in the user's browser. Prefer bundling these libraries via npm. If dynamic loading is required, add s.integrity = 'sha384-...' and s.crossOrigin = 'anonymous'.
| }; | ||
| const pdfjs = win["pdfjsLib"] as { | ||
| getDocument: (src: { data: ArrayBuffer }) => { | ||
| promise: Promise<{ numPages: number; getPage: (n: number) => Promise<PdfPage> }>; |
Performance: The PDF OCR fallback creates a high-resolution canvas (scale: 2.5) for every page with no upper bound. A large scanned PDF (e.g., 100+ pages) can allocate gigabytes of memory and crash the tab. Add a maxPages guard, process pages sequentially, and explicitly set canvas.width = canvas.height = 0 after toBlob to help GC.
| <> | ||
| <div className="h-6 w-6 animate-spin rounded-full border-2 border-primary border-t-transparent" /> | ||
| <p className="text-sm text-muted-foreground"> | ||
| No text found - running OCR on {fileName}... |
Bug: handleDragLeave sets dropState to "done" whenever value is truthy. Because value also contains manually typed text, dragging over the zone and leaving while the textarea has content incorrectly renders the file-upload success UI (checkmark, empty filename). Track whether a file was actually dropped separately from text content.
| <Textarea | ||
| value={value} | ||
| onChange={(e) => { | ||
| onChange(e.target.value); |
Bug: Typing into the textarea sets dropState("done"), which triggers the drop zone's file-upload success UI even though no file was uploaded. Keep manual input state independent of the drop zone; the drop zone should remain in "idle" when the user is typing.
| piece = sentence[i : i + CHUNK_MAX_CHARS] | ||
| if current and len(current) + len(piece) + 2 > CHUNK_MAX_CHARS: | ||
| flush() | ||
| if current: |
Suggestion: When a sentence exceeds CHUNK_MAX_CHARS and is force-split into pieces, joining them with \n\n breaks semantic continuity because the pieces belong to the same sentence. Consider using a space or empty string as the separator so the LLM still sees contiguous text.
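One way to keep force-split pieces contiguous is to record which chunks continue the previous one and pick the separator when the text is reassembled. This is a minimal sketch, assuming character-based slicing as in the diff context above; `chunk_sentences`, `join_chunks`, and the small `CHUNK_MAX_CHARS` value are hypothetical and are not the PR's code.

```python
CHUNK_MAX_CHARS = 20  # illustrative value only; the PR's real limit is not shown


def chunk_sentences(sentences, max_chars=CHUNK_MAX_CHARS):
    """Pack sentences into chunks of at most max_chars characters.

    Returns (text, continues_previous) pairs. continues_previous is True
    when the chunk starts in the middle of a force-split sentence.
    """
    chunks = []
    current = ""

    def flush():
        nonlocal current
        if current:
            chunks.append((current, False))
            current = ""

    for sentence in sentences:
        if len(sentence) > max_chars:
            flush()
            for i in range(0, len(sentence), max_chars):
                # Every piece after the first continues the same sentence.
                chunks.append((sentence[i : i + max_chars], i > 0))
            continue
        if current and len(current) + len(sentence) + 2 > max_chars:
            flush()
        current = f"{current}\n\n{sentence}" if current else sentence
    flush()
    return chunks


def join_chunks(chunks):
    """Rebuild text: no separator inside a split sentence, blank line elsewhere."""
    text = ""
    for piece, continues_previous in chunks:
        if text and not continues_previous:
            text += "\n\n"
        text += piece
    return text
```

Because the slices are cut mid-text, an empty separator restores the original sentence exactly; a space would only be appropriate if the split happened on word boundaries.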
| @@ -0,0 +1 @@ | |||
| openai>=1.30.0 | |||
Bug: This requirements file only lists openai>=1.30.0. For the described tool to function, you must include the complete dependency set. At minimum, add fastapi, uvicorn, python-multipart, and the relevant document parsing libraries (e.g., pymupdf, python-docx, pandas, Pillow). Without these, the Docker build will succeed but the application will fail at runtime.
| yield "[entities] Extracting entities...\n" | ||
| clean_text = "\n\n".join(chunks) | ||
| entities = extract_entities(clean_text) |
Performance: extract_entities() is invoked synchronously without await. If this helper performs I/O (e.g., Oxlo API calls), it will block the async event loop. If it is CPU-bound, offload it with asyncio.to_thread() to keep the server responsive to other concurrent requests.
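The offloading suggested above could look roughly like the following. It is a sketch, not the PR's implementation: `extract_entities` here is a stand-in that simulates blocking work, and `run_pipeline` and `collect` are hypothetical names. `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def extract_entities(text):
    """Stand-in for the PR's blocking helper; the real one is not shown."""
    time.sleep(0.05)  # simulate blocking work
    return sorted({word for word in text.split() if word.istitle()})


async def run_pipeline(chunks):
    """Async generator that reports progress without blocking the event loop."""
    yield "[entities] Extracting entities...\n"
    clean_text = "\n\n".join(chunks)
    # Run the blocking call in a worker thread so other requests keep moving.
    entities = await asyncio.to_thread(extract_entities, clean_text)
    yield f"[entities] Found {len(entities)} entities\n"


async def collect(chunks):
    """Helper for the usage example: gather all progress lines into a list."""
    return [line async for line in run_pipeline(chunks)]
```

A call such as `asyncio.run(collect(["Alice met Bob", "in Paris"]))` drives the generator to completion. A worker thread keeps the loop responsive, but it does not make pure-Python CPU-bound work run in parallel; a process pool may be a better fit if extraction is heavy.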
Implements a 5-step pipeline for generating structured, validated READMEs:
1. Project Analyzer (analyzer.py)
   - Extracts metadata: language, package_manager, framework, project_type
   - Uses deepseek-v3.2 for JSON parsing
2. Section Planner (planner.py)
   - Deterministic section planning by project_type
   - web-api/cli/library/web-app/other templates
3. README Writer (writer.py)
   - LLM generation with llama-3.3-70b
   - Injects metadata for consistency
   - Enforces shields.io badges and language-tagged code blocks
4. Section Validator (validator.py)
   - Regex validation: required sections, badge URLs, code fences
   - Returns structured issue list
5. Refinement Agent (refiner.py)
   - Max 2 retries to fix detected issues
   - Re-targets only broken sections
Backend: tool.py, analyzer.py, planner.py, writer.py, validator.py, refiner.py, llm_client.py
Tests: test_planner.py, test_validator.py, test_refiner.py, test_e2e_mock.py (13 tests total)
Frontend: auto-readme-generator.ts, registry.ts (+2 lines)
CI: ci-auto-readme-generator.yml (isolated)

- Updated CI workflow to install dependencies from requirements.txt
- Added output format option in ToolDefinition
- Introduced project metadata handling in analyzer
- Sanitized inputs in refine_readme and write_readme functions
- Implemented code block extraction in validator
- Created unit tests for the auto-readme-generator pipeline

- Add improved JSON parsing with relaxed regex patterns in analyzer
- Implement ValidationError handling for metadata parsing fallback
- Strengthen LLM response validation with comprehensive checks
- Update CI workflow to install dev requirements for testing
- Simplify production dependencies while maintaining core functionality

- validator.py: relax FENCE_OPEN regex to accept GFM info strings (Shashank)
- tool.py: wrap plan_sections and validate_readme in try/except
- tool.py: separate refiner and post-refine validator exception handlers
- refiner.py: add early return guard when issues is empty
- refiner.py: fix Awaitable[str] type hint on call_model
- analyzer.py: wrap call_oxlo_chat in try/except with DEFAULT_METADATA fallback
- tests: convert asyncio.run to async def with pytest.mark.asyncio
- tests: add pytest.ini with asyncio_mode = auto
…n, preprocessing, summarization, and synthesis functionalities
…riptions for various file types
…e summarization output formatting
1275520 to
ebdccc7
… generation logic
- Added the missing dependencies (openai, spacy)
- Standardized the maxMb constant so the size check and the error message use the same limit
- handleDragLeave now uses the source, not the value
- Typing in the text area resets the source to manual
- Added a 20-page limit for OCR, plus canvas cleanup to help garbage collection
- extract_entities() now runs via asyncio.to_thread
- Force-split sentence fragments are now joined with spaces

Performance improvement: OCR page limit.
Bug fixes: drag-and-drop status, error message consistency.
…e max size variable
fix(result-viewer): improve log rendering by removing unnecessary index in key
feat(app-sidebar): replace img with next/image for optimized logo rendering and add button type
fix(usage-counter): remove unused user variable and add button type for login redirect
Oxtools Submission
Project name: [PDF / Document Summarizer]
Contributor: [@Franci-343]
What does this tool do?
The PDF document summarizer processes files in multiple formats (PDF, Word, Excel, CSV, text, or images) and generates a structured report with a summary, key findings, action items, risk factors, and a data table. It works by dividing the document into fragments, summarizing them in parallel, extracting key entities and figures, and finally synthesizing all the information using the Oxlo API.
Submission checklist
Check every box before requesting a review. Unchecked items will result in the PR being sent back.
Structure
- projects/[my-project-name]/
Required files
- Dockerfile is present and docker build . succeeds
- docker-compose.yml is present and docker compose up starts the app
- oxlo-manifest.json is present and all fields are filled in
- .env.example lists every environment variable the project needs (with empty values)
- README.md is present with setup instructions a reviewer can follow exactly
Security
- .env file is not included in this PR
- git grep -i "api_key" and found no leaks
Oxlo API
- OXLO_API_KEY environment variable
For maintainers
- docker build . succeeded locally
- docker compose up ran successfully and app is reachable