🤖 The Q&A Chatbot

A document-based question-answering chatbot built with Streamlit. It retrieves relevant passages with SentenceTransformers embeddings and generates answers with Google's FLAN-T5-XL model. Upload documents (PDF, Word, PowerPoint, Excel, images, text files) and ask questions about their content.

🚀 Features

📄 Multi-Format Document Support

PDF Documents - Text extraction using PyMuPDF
Word Documents - Support for .doc and .docx files
PowerPoint Presentations - Extract text from .ppt and .pptx slides
Excel Spreadsheets - Process data from .xls and .xlsx files
Images - OCR text extraction from .png, .jpg, .jpeg, .bmp, .tiff
Text Files - Direct processing of .txt and .csv files

🧠 Capabilities

Semantic Search - Uses SentenceTransformers for intelligent document retrieval
Context-Aware Q&A - Powered by Google's FLAN-T5-XL model
Cosine Similarity Matching - Finds most relevant document sections
Token Management - Efficient text chunking with tiktoken
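
The token-based chunking listed above can be pictured as a sliding window over a token sequence. The following is a minimal sketch, not the project's actual code: `chunk_tokens`, `chunk_text`, `max_tokens`, and `overlap` are hypothetical names, and whitespace splitting stands in for a real tokenizer so the example runs without extra dependencies. With tiktoken, the token list would instead come from an encoding's `encode` method, and each window would be turned back into text with `decode`.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 200, overlap: int = 20) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks


def chunk_text(text: str, max_tokens: int = 200, overlap: int = 20) -> List[str]:
    # Assumption: whitespace-separated words stand in for tokenizer output.
    words = text.split()
    return [" ".join(window) for window in chunk_tokens(words, max_tokens, overlap)]
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding a few tokens twice.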

💬 Interactive Chat Interface

Real-time Conversations - Chat-style UI with message history
Document Processing Status - Visual feedback during file processing
Responsive Design - Clean chat layout that adapts to different screen sizes
Session Persistence - Maintains chat history during the session
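
Session persistence of this kind is usually a list of messages kept in per-session state. The sketch below is an illustration under assumptions rather than the code in `main.py`: `get_history` and `add_message` are hypothetical helpers, and a plain dict stands in for Streamlit's `st.session_state`, which supports the same key-based access.

```python
from typing import Dict, List, MutableMapping

Message = Dict[str, str]


def get_history(state: MutableMapping) -> List[Message]:
    """Return the chat history stored in state, creating it on first use."""
    if "messages" not in state:
        state["messages"] = []
    return state["messages"]


def add_message(state: MutableMapping, role: str, content: str) -> None:
    """Append one chat turn; role is 'user' or 'assistant'."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role}")
    get_history(state).append({"role": role, "content": content})


# A plain dict stands in for st.session_state so the sketch runs outside Streamlit.
state: dict = {}
add_message(state, "user", "What does the report conclude?")
add_message(state, "assistant", "It recommends a phased rollout.")
```

Because the history lives in session state, it survives Streamlit's script reruns but is lost when the browser session ends.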

🛠️ Technology Stack

Core AI/ML Libraries

Sentence Transformers - Semantic embeddings and similarity search
Transformers (Hugging Face) - FLAN-T5-XL for text generation
PyTorch - Deep learning framework backend
Scikit-learn - Cosine similarity calculations
NumPy - Numerical computations and array operations

Document Processing

PyMuPDF (fitz) - Advanced PDF text extraction
python-docx - Microsoft Word document processing
python-pptx - PowerPoint presentation handling
Pandas - Excel file processing and data manipulation
Pytesseract - OCR for image text extraction
Pillow (PIL) - Image processing and manipulation

Web Framework & UI

Streamlit - Interactive web application framework
Custom CSS - Responsive chat-style interface design

Utilities

tiktoken - OpenAI's tokenizer for text chunking
os (Python standard library) - File handling and path management

📦 Installation & Setup

Prerequisites

Python 3.8 or higher
Tesseract OCR (for image text extraction)

Installation Steps

Clone the repository

```
git clone <repository-url>
cd qa-chatbot
```

Install dependencies
```
pip install -r requirements.txt
```
Install Tesseract OCR (for image processing)
- Windows: Download from GitHub
- macOS: brew install tesseract
- Ubuntu: sudo apt install tesseract-ocr
Run the application
```
streamlit run main.py
```
Open your browser to http://localhost:8501

🎯 How to Use

Step 1: Upload Documents

Click "Choose files" to upload one or multiple documents
Supported formats: PDF, Word, PowerPoint, Excel, Images, Text files
Wait for processing confirmation

Step 2: Start Asking Questions

Type your question in the chat input
The AI will search through your uploaded documents
Get contextually relevant answers based on document content

Step 3: Continue the Conversation

Ask follow-up questions
Reference previous answers in the chat history
Upload additional documents as needed

🏗️ Project Structure

```
qa-chatbot/
├── main.py              # Main Streamlit application
├── file_utils.py        # Document processing utilities
├── requirements.txt     # Python dependencies
└── README.md            # Project documentation
```

🔧 Key Components

main.py

Streamlit Interface - Chat UI and file upload handling
Model Loading - FLAN-T5-XL and SentenceTransformer initialization
Question Processing - Semantic search and answer generation
Session Management - Chat history and state persistence

file_utils.py

Multi-format Support - Unified text extraction interface
Error Handling - Robust processing with fallback mechanisms
Modular Design - Separate functions for each file type

🎨 Features Deep Dive

Intelligent Document Processing

```
# Automatic format detection and processing
import os

def extract_text(file_path):
    ext = os.path.splitext(file_path)[1].lower()
    # Route to the appropriate extraction function based on ext
    ...
```
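
One way to implement the routing step is a lookup table from extension to extraction function. This is a sketch under assumptions, not the contents of `file_utils.py`: the helper names (`read_txt`, `read_pdf`, `read_docx`) and the `EXTRACTORS` table are hypothetical, and only three formats are shown. The PDF and Word helpers import their libraries lazily, so plain-text files can be processed even if those packages are missing.

```python
import os
from typing import Callable, Dict


def read_txt(path: str) -> str:
    # Plain text needs no parsing; undecodable bytes are replaced rather than raising.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def read_pdf(path: str) -> str:
    import fitz  # PyMuPDF, imported lazily so other formats work without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def read_docx(path: str) -> str:
    from docx import Document  # python-docx, imported lazily

    return "\n".join(paragraph.text for paragraph in Document(path).paragraphs)


# Hypothetical routing table: file extension -> extraction function.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    ".txt": read_txt,
    ".pdf": read_pdf,
    ".docx": read_docx,
}


def extract_text(file_path: str) -> str:
    """Pick an extractor from the (case-insensitive) file extension."""
    ext = os.path.splitext(file_path)[1].lower()
    try:
        extractor = EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return extractor(file_path)
```

Adding a format then means writing one function and registering one extension, without touching `extract_text` itself.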

Advanced Semantic Search

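```
from sklearn.metrics.pairwise import cosine_similarity

# Find most relevant document chunks
# For a single question, cosine_similarity returns an array of shape (1, n_chunks)
similarity_scores = cosine_similarity([question_embedding], chunk_embeddings)
best_chunks = get_top_k_chunks(similarity_scores)
```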
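
To make the ranking step concrete, here is a dependency-free sketch of cosine similarity followed by top-k selection. It is a stand-in for illustration only: the project uses scikit-learn's `cosine_similarity`, and the body of its `get_top_k_chunks` helper is not shown in this README, so `cosine` and `top_k_chunks` below are hypothetical names with assumed behaviour.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    question_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    scored = [(cosine(question_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" purely for illustration.
ranked = top_k_chunks([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], ["a", "b", "c"], k=2)
```

Here the chunk pointing in the same direction as the question ranks first, the diagonal one second, and the orthogonal one is dropped.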

Context-Aware Answer Generation

```
# Generate answers using document context
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
answer = qa_pipeline(prompt)
```
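
The prompt step can be sketched a little further: retrieved chunks have to fit the model's input limit, and the generated text has to be pulled out of the pipeline's return value. The code below is an illustration under assumptions, not the project's implementation. `build_prompt`, `answer_question`, and `max_context_chars` are hypothetical, a character budget stands in for a real token budget, and `generate` is assumed to behave like a Hugging Face `text2text-generation` pipeline, which returns a list of dicts with a `generated_text` key.

```python
from typing import Callable, List


def build_prompt(chunks: List[str], question: str, max_context_chars: int = 2000) -> str:
    """Join retrieved chunks (best first) into a context that fits the budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before overflowing the (approximate) budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n".join(selected)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


def answer_question(generate: Callable[[str], list], chunks: List[str], question: str) -> str:
    # Assumption: generate returns [{"generated_text": "..."}], as a
    # text2text-generation pipeline does.
    outputs = generate(build_prompt(chunks, question))
    return outputs[0]["generated_text"].strip()
```

Passing the generator in as an argument keeps the function testable with a stub, without loading the actual model.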

🚀 Performance Optimizations

Model Caching - @st.cache_resource for efficient model loading
Chunked Processing - Smart text segmentation for large documents
Memory Management - Efficient embedding storage and retrieval
Lazy Loading - On-demand model initialization

🔒 Error Handling & Reliability

Format Validation - Comprehensive file type checking
Graceful Degradation - Fallback mechanisms for processing failures
User Feedback - Clear error messages and processing status
Exception Safety - Robust error catching throughout the pipeline
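
The validation and graceful-degradation points above can be combined in a small wrapper that never raises and always returns something the UI can display. This is a sketch under assumptions, not the project's code: `safe_extract` and `SUPPORTED_EXTENSIONS` are hypothetical names, with the extension set taken from the formats this README lists.

```python
import os
from typing import Callable, Optional, Tuple

SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".txt", ".csv", ".ppt", ".pptx",
    ".xls", ".xlsx", ".png", ".jpg", ".jpeg", ".bmp", ".tiff",
}


def safe_extract(path: str, extractor: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Return (text, error). On failure the text is empty and error explains why."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return "", f"Unsupported file type: {ext or '(none)'}"
    try:
        text = extractor(path)
    except Exception as exc:  # broad on purpose: one bad file should not stop the app
        return "", f"Could not process {os.path.basename(path)}: {exc}"
    if not text.strip():
        return "", f"No text found in {os.path.basename(path)}"
    return text, None
```

The caller can then show the error string to the user and continue with the remaining uploads.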

📋 System Requirements

Minimum Requirements

RAM: 16GB or more (FLAN-T5-XL has roughly 3 billion parameters, about 11GB of weights in 32-bit precision)
Storage: about 12GB free space for model weights
Python: 3.8+
Internet: Required for initial model download

Supported File Formats

Documents: PDF, DOC, DOCX, TXT, CSV
Presentations: PPT, PPTX
Spreadsheets: XLS, XLSX
Images: PNG, JPG, JPEG, BMP, TIFF

🔮 Future Enhancements

Multi-language Support - International document processing
Advanced OCR - Improved image text extraction
Cloud Integration - Support for cloud storage services
Batch Processing - Process multiple documents in parallel
Export Features - Save chat history and answers
Custom Models - Integration with domain-specific AI models

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/enhancement)
Commit your changes (git commit -am 'Add new feature')
Push to the branch (git push origin feature/enhancement)
Create a Pull Request

🆘 Troubleshooting

Common Issues

Tesseract not found: Ensure Tesseract OCR is installed and its executable is on your PATH
Model loading errors: Check internet connection for initial download
Memory issues: Close other applications or upgrade RAM for large documents
File processing fails: Verify file format and try re-uploading
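
For the "Tesseract not found" case, a quick check from Python shows whether the executable is visible on PATH. This is a small diagnostic sketch. `find_tesseract` and `tesseract_status` are hypothetical helper names, and the check only confirms that a `tesseract` executable can be located, not that it works.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def tesseract_status() -> str:
    path = find_tesseract()
    if path is None:
        return "Tesseract not found on PATH; install it or add its folder to PATH."
    return f"Tesseract found at {path}"


print(tesseract_status())
```

If Tesseract is installed in a location that is not on PATH, pytesseract can also be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the executable's full path.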

Support

For issues or questions, please open a GitHub issue with:

Error message (if any)
File type and size being processed
System specifications

Built with ❤️ using Streamlit, Transformers, and advanced NLP techniques
