Fix #5930: Extract text from PDFs in ReadFileTool instead of returning base64 by devin-ai-integration[bot] · Pull Request #5932 · crewAIInc/crewAI

devin-ai-integration · 2026-05-26T13:14:30Z

Summary

Fixes #5930 — When using input_files with PDFFile, the read_file tool was base64-encoding the entire raw PDF binary and injecting it into the LLM context. This caused massive context bloat, inconsistent responses, and context overflow for even medium-sized documents.

Root cause: ReadFileTool._run() checked if the content type was a text type, and for anything else (including application/pdf) it fell through to a base64 encoding path.
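The fall-through described above can be illustrated with a minimal sketch. This is not the actual ReadFileTool code: the function name `render_for_llm`, the `TEXT_TYPES` set, and the `pdf_to_text` callback are hypothetical names chosen for illustration. It only shows why a PDF ends up base64-encoded unless a dedicated branch intercepts `application/pdf` first.

```python
import base64
from typing import Callable

# Hypothetical set of non-"text/*" types that are still treated as text.
TEXT_TYPES = {"application/json", "application/xml"}


def render_for_llm(
    content_type: str,
    data: bytes,
    pdf_to_text: Callable[[bytes], str],
) -> str:
    """Turn file bytes into a string suitable for an LLM context."""
    if content_type.startswith("text/") or content_type in TEXT_TYPES:
        return data.decode("utf-8", errors="replace")
    if content_type == "application/pdf":
        # Without this branch, PDFs fall through to the base64 path below.
        return pdf_to_text(data)
    # Every other binary type (images, audio, video) is still base64-encoded.
    return base64.b64encode(data).decode("ascii")
```

Passing the extractor in as a callback is purely a convenience for this sketch; it keeps the dispatch logic testable without a PDF library.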

Fix: Added PDF-specific handling in ReadFileTool._run() that uses pypdf (already a dependency via crewai-files) to extract structured text from PDF pages. Each page is labeled with a --- Page N --- header. Graceful fallbacks are provided for:

pypdf not installed → short install instruction message
No extractable text (image-only PDFs) → friendly diagnostic message
Corrupted PDFs → error message (never base64)
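A rough sketch of the extraction path with the three fallbacks listed above might look as follows. It is an illustration under stated assumptions, not the code in this PR: the function name `extract_pdf_text` and the exact message wording are hypothetical, and skipping pages without text is a choice made for this sketch. The pypdf calls used (`PdfReader`, `reader.pages`, `page.extract_text()`) are part of pypdf's public API.

```python
import io


def extract_pdf_text(data: bytes) -> str:
    """Return page-labeled text for a PDF, or a short diagnostic message."""
    try:
        from pypdf import PdfReader
    except ImportError:
        # Fallback 1: pypdf missing, return a short install hint.
        return "PDF text extraction requires pypdf. Install it with: pip install pypdf"

    pages = []
    try:
        reader = PdfReader(io.BytesIO(data))
        for number, page in enumerate(reader.pages, start=1):
            text = (page.extract_text() or "").strip()
            if text:  # assumption: pages with no text are skipped
                pages.append(f"--- Page {number} ---\n{text}")
    except Exception as exc:
        # Fallback 3: corrupted or unreadable PDF, report the error (never base64).
        return f"Could not read PDF: {exc}"

    if not pages:
        # Fallback 2: nothing extractable, e.g. a scanned, image-only PDF.
        return "PDF contains no extractable text (it may be image-only)."
    return "\n\n".join(pages)
```

Importing pypdf inside the function lets the missing-dependency case surface as a readable message at call time rather than as an import error when the module loads.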

The extracted text is dramatically smaller than base64-encoded binary, keeping the LLM context manageable.
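For a sense of scale, base64 alone inflates any binary by about one third, before the PDF's embedded fonts, images, and structural overhead are even considered. The snippet below only demonstrates that arithmetic; the 300,000-byte buffer is placeholder data standing in for a PDF, and `base64_length` is a helper name introduced here.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


raw = bytes(300_000)  # placeholder bytes, not a real PDF
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))  # 300000 400000
```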

Review & Testing Checklist for Human

Verify that ReadFileTool._run() correctly extracts text from a real multi-page PDF (not just the synthetic test PDFs)
Test with an image-only PDF to confirm the "no extractable text" fallback message appears
Confirm that non-PDF binary files (images, audio, video) still return base64 as before
Run a full crew kickoff with input_files={"doc": PDFFile(source="path/to/real.pdf")} and verify the agent receives readable text, not base64

Notes

pypdf is already a dependency of crewai-files, so no new dependencies are introduced
The fix only affects the read_file tool path — files that are auto-injected via multimodal LLM support (e.g., images on GPT-4o) are unaffected
Added 6 new test cases covering: single-page extraction, multi-page extraction, blank PDFs, corrupted PDFs, missing pypdf, and size comparison vs base64
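One common way to exercise a "missing pypdf" case without uninstalling the package is to place `None` in `sys.modules`, which makes any `import pypdf` inside the patched block raise `ImportError`. The sketch below shows that technique in isolation; `pdf_backend_available` is a hypothetical stand-in for the code under test, and the PR's own tests may be written differently.

```python
import sys
from unittest import mock


def pdf_backend_available() -> bool:
    """Hypothetical stand-in: report whether pypdf can be imported."""
    try:
        import pypdf  # noqa: F401
    except ImportError:
        return False
    return True


def test_missing_pypdf() -> None:
    # A None entry in sys.modules forces the import to fail inside this block.
    with mock.patch.dict(sys.modules, {"pypdf": None}):
        assert pdf_backend_available() is False
```

`mock.patch.dict` restores `sys.modules` on exit, so the simulated absence does not leak into other tests.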

Link to Devin session: https://app.devin.ai/sessions/0922c01f2b064a7e8e0a6a2f05e9ab09

…g base64

When using input_files with PDFFile, the read_file tool was returning the entire PDF as base64-encoded binary data. This caused:

- Massive context bloat for the LLM
- Inconsistent responses and context overflow
- The same file being re-processed on each tool call

Now ReadFileTool detects application/pdf content and extracts text using pypdf (already a dependency via crewai-files) instead of base64-encoding the raw bytes. Each page is labeled with a page number header for clarity.

Graceful fallbacks are provided when:

- pypdf is not installed (short install message)
- The PDF contains no extractable text (friendly message)
- The PDF is corrupted (error message, never base64)

Co-Authored-By: João <[email protected]>

devin-ai-integration · 2026-05-26T13:14:32Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


github-actions Bot added the size/M label May 26, 2026

devin-ai-integration[bot] wants to merge 1 commit into main from devin/1779800597-fix-pdf-base64-read-file
