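A document-based question-answering chatbot built with Streamlit. It retrieves relevant passages using SentenceTransformers embeddings and generates answers with Google's FLAN-T5-XL model. Upload documents in various formats (PDF, Word, PowerPoint, Excel, images, text files) and ask questions to get answers grounded in their content.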
- PDF Documents - Text extraction using PyMuPDF
- Word Documents - Support for .doc and .docx files
- PowerPoint Presentations - Extract text from .ppt and .pptx slides
- Excel Spreadsheets - Process data from .xls and .xlsx files
- Images - OCR text extraction from .png, .jpg, .jpeg, .bmp, .tiff
- Text Files - Direct processing of .txt and .csv files
- Semantic Search - Uses SentenceTransformers for intelligent document retrieval
- Context-Aware Q&A - Powered by Google's FLAN-T5-XL model
- Cosine Similarity Matching - Finds most relevant document sections
- Token Management - Efficient text chunking with tiktoken
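The token-based chunking mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_tokens`, the window size, and the overlap are all assumptions, and the function works on any list of token ids so it does not depend on a particular tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], max_tokens: int = 200, overlap: int = 20
) -> List[List[int]]:
    """Split tokens into windows of at most max_tokens that share `overlap` tokens.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far each new window advances
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With tiktoken, the token ids would come from something like `tiktoken.get_encoding("cl100k_base").encode(text)`, and each chunk can be turned back into text with the encoding's `decode` method. Which encoding the project uses is not stated here, so that name is only an example.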
- Real-time Conversations - Chat-style UI with message history
- Document Processing Status - Visual feedback during file processing
- Responsive Design - Clean, modern interface optimized for all devices
- Session Persistence - Maintains chat history during the session
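Session persistence of the chat history could look something like the sketch below. It assumes the history lives under a `"messages"` key in a dict-like state object; both the key and the helper name `add_message` are hypothetical. Streamlit's `st.session_state` behaves like a mapping, so the same function can be called with it inside the app.

```python
from typing import Dict, List, MutableMapping


def add_message(state: MutableMapping, role: str, content: str) -> List[Dict[str, str]]:
    """Append one chat message to state["messages"], creating the list on first use.

    Hypothetical helper: the "messages" key is an assumption, not the app's known layout.
    """
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    if "messages" not in state:
        state["messages"] = []  # first message of this session
    state["messages"].append({"role": role, "content": content})
    return state["messages"]
```

In a Streamlit script this would be called as `add_message(st.session_state, "user", question)`, and the stored messages can be replayed on each rerun, for example with `st.chat_message`.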
- Sentence Transformers - Semantic embeddings and similarity search
- Transformers (Hugging Face) - FLAN-T5-XL for text generation
- PyTorch - Deep learning framework backend
- Scikit-learn - Cosine similarity calculations
- NumPy - Numerical computations and array operations
- PyMuPDF (fitz) - Advanced PDF text extraction
- python-docx - Microsoft Word document processing
- python-pptx - PowerPoint presentation handling
- Pandas - Excel file processing and data manipulation
- Pytesseract - OCR for image text extraction
- Pillow (PIL) - Image processing and manipulation
- Streamlit - Interactive web application framework
- Custom CSS - Responsive chat-style interface design
- tiktoken - OpenAI's tokenizer for text chunking
- OS Operations - File handling and path management
- Python 3.8 or higher
- Tesseract OCR (for image text extraction)
- Clone the repository
    git clone <repository-url>
    cd qa-chatbot
- Install dependencies
    pip install -r requirements.txt
- Install Tesseract OCR (for image processing)
  - Windows: Download from GitHub
  - macOS: brew install tesseract
  - Ubuntu: sudo apt install tesseract-ocr
- Run the application
    streamlit run main.py
- Open your browser to http://localhost:8501
- Click "Choose files" to upload one or multiple documents
- Supported formats: PDF, Word, PowerPoint, Excel, Images, Text files
- Wait for processing confirmation
- Type your question in the chat input
- The AI will search through your uploaded documents
- Get contextually relevant answers based on document content
- Ask follow-up questions
- Reference previous answers in the chat history
- Upload additional documents as needed
qa-chatbot/
├── main.py # Main Streamlit application
├── file_utils.py # Document processing utilities
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Streamlit Interface - Chat UI and file upload handling
- Model Loading - FLAN-T5-XL and SentenceTransformer initialization
- Question Processing - Semantic search and answer generation
- Session Management - Chat history and state persistence
- Multi-format Support - Unified text extraction interface
- Error Handling - Robust processing with fallback mechanisms
- Modular Design - Separate functions for each file type
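One way the "separate function per file type" design could be organized is a dispatch table keyed by extension. The sketch below is an assumption about structure, not the contents of `file_utils.py`: `EXTRACTORS` and `read_plain_text` are hypothetical names, and only plain-text formats are wired up so the example stays free of third-party dependencies.

```python
import os
from typing import Callable, Dict


def read_plain_text(path: str) -> str:
    """Read a text-like file, replacing undecodable bytes instead of failing."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Hypothetical registry: extension -> extraction function.
# PDF, Word, PowerPoint, Excel, and image handlers would be registered the same way.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": read_plain_text,
    ".csv": read_plain_text,
}


def extract_text(file_path: str) -> str:
    """Route a file to the extractor registered for its extension."""
    ext = os.path.splitext(file_path)[1].lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return extractor(file_path)
```

Keeping the mapping in one place means supporting a new format only requires writing its extraction function and adding one entry to the table.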
# Automatic format detection and processing
def extract_text(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    # Route to appropriate extraction function

# Find most relevant document chunks
similarity_scores = cosine_similarity([question_embedding], chunk_embeddings)
best_chunks = get_top_k_chunks(similarity_scores)

# Generate answers using document context
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
answer = qa_pipeline(prompt)

- Model Caching - @st.cache_resource for efficient model loading
- Chunked Processing - Smart text segmentation for large documents
- Memory Management - Efficient embedding storage and retrieval
- Lazy Loading - On-demand model initialization
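The retrieval and prompt-building steps excerpted earlier can be illustrated end to end with a small, dependency-free sketch. The project itself uses scikit-learn's `cosine_similarity` on SentenceTransformers embeddings; here the similarity is computed with the standard library so the example runs anywhere, and `top_k_chunks` and `build_prompt` are hypothetical stand-ins for whatever helpers the app actually defines.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    question_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 3,
) -> List[str]:
    """Return the k chunks whose embeddings are most similar to the question."""
    scored = [(cosine(question_vec, vec), text) for vec, text in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]


def build_prompt(context_chunks: Sequence[str], question: str) -> str:
    """Assemble the prompt in the Context / Question / Answer layout shown above."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```

The value of `k` is a trade-off: more chunks give the model more context but consume more of its input length, so a small value such as 3 is a reasonable starting assumption.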
- Format Validation - Comprehensive file type checking
- Graceful Degradation - Fallback mechanisms for processing failures
- User Feedback - Clear error messages and processing status
- Exception Safety - Robust error catching throughout the pipeline
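The error-handling behavior listed above (format validation, graceful degradation, clear user feedback) might be wrapped around extraction roughly as follows. This is a sketch under assumptions: `safe_extract` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extractor is passed in as a parameter so the wrapper does not assume how the real extraction functions are implemented.

```python
import os
from typing import Callable, Optional, Tuple

# Extensions taken from the supported-format list in this README.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx",
    ".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".txt", ".csv",
}


def safe_extract(
    file_path: str, extractor: Callable[[str], str]
) -> Tuple[str, Optional[str]]:
    """Return (text, error). On failure, text is empty and error is a user-facing message."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return "", f"Unsupported file type: {ext or '(none)'}"
    try:
        return extractor(file_path), None
    except Exception as exc:  # degrade gracefully instead of crashing the app
        return "", f"Could not process {os.path.basename(file_path)}: {exc}"
```

Returning the error as a value rather than raising lets the UI show the message next to the affected file and continue with the remaining uploads.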
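- RAM: 16GB or more recommended (FLAN-T5-XL has roughly 3 billion parameters, so 8GB is unlikely to hold it at 32-bit precision)
- Storage: about 15GB free space for models (the FLAN-T5-XL weights alone are roughly 11GB at 32-bit precision)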
- Python: 3.8+
- Internet: Required for initial model download
- Documents: PDF, DOC, DOCX, TXT, CSV
- Presentations: PPT, PPTX
- Spreadsheets: XLS, XLSX
- Images: PNG, JPG, JPEG, BMP, TIFF
- Multi-language Support - International document processing
- Advanced OCR - Improved image text extraction
- Cloud Integration - Support for cloud storage services
- Batch Processing - Handle multiple documents simultaneously
- Export Features - Save chat history and answers
- Custom Models - Integration with domain-specific AI models
- Fork the repository
- Create a feature branch (
git checkout -b feature/enhancement) - Commit your changes (
git commit -am 'Add new feature') - Push to the branch (
git push origin feature/enhancement) - Create a Pull Request
- Tesseract not found: Ensure OCR software is properly installed and in PATH
- Model loading errors: Check internet connection for initial download
- Memory issues: Close other applications or upgrade RAM for large documents
- File processing fails: Verify file format and try re-uploading
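For the "Tesseract not found" case above, a quick way to check whether the binary is visible on PATH is the standard library's `shutil.which`. The helper name `find_tesseract` is hypothetical; the check itself only assumes the executable is called `tesseract`.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("Tesseract was not found on PATH; install it or add its folder to PATH.")
    else:
        print(f"Tesseract found at: {location}")
```

If the binary is installed somewhere that is not on PATH, pytesseract can also be pointed at it directly by setting `pytesseract.pytesseract.tesseract_cmd` to the executable's full path.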
For issues or questions, please open a GitHub issue with:
- Error message (if any)
- File type and size being processed
- System specifications
Built with ❤️ using Streamlit, Transformers, and advanced NLP techniques