Data analysis pipeline and interactive LLM assistant for QAC387.
.
├── builds/
│ ├── build0_data_analysis_pipeline_assignment_1.py # Build 0: Data analysis pipeline
│ └── build1_llm_assistant_Assignment_2.py # Build 1: Interactive LLM assistant
├── src/ # Refactored modules from Build 0
│ ├── __init__.py # Re-exports all functions
│ ├── utilities.py # File I/O and directory helpers
│ ├── profiling.py # Dataset profiling and column splitting
│ ├── summaries.py # Numeric and categorical summaries
│ ├── analysis.py # Missingness, regression, correlations
│ ├── plots.py # Visualization functions
│ └── checks.py # JSON validation and target checks
├── data/
│ └── penguins.csv # Palmer Penguins dataset
├── test_models.py # Module import and functionality tests
├── requirements.txt
├── .env.example # Template for API key configuration
└── ASSIGNMENT_README.md # Original Build 0 assignment instructions
- Clone the repository and create a virtual environment:

      python -m venv .venv
      source .venv/bin/activate   # macOS/Linux
      .venv\Scripts\activate      # Windows

- Install dependencies:

      pip install -r requirements.txt

- Configure your API key:

      cp .env.example .env
      # Edit .env and add your API key and base URL

- Verify modules work:

      python test_models.py

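To sanity-check the API key configuration step before launching the assistant, a small standard-library script along these lines may help. It is only a sketch: the helper names `load_env_file` and `missing_settings` are hypothetical, and the variable names `OPENAI_API_KEY` and `OPENAI_BASE_URL` are assumptions, so substitute whatever `.env.example` actually lists. The project itself may load `.env` differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("OPENAI_API_KEY", "OPENAI_BASE_URL"), environ=None):
    """Return the required names that are absent or empty.

    The default names are assumptions, not taken from this repository.
    Values from the real environment override values from the file.
    """
    environ = os.environ if environ is None else environ
    merged = {**values, **environ}
    return [name for name in required if not merged.get(name)]


if __name__ == "__main__":
    missing = missing_settings(load_env_file())
    print("Missing settings:", ", ".join(missing) if missing else "none")
```

Run from the repository root, the script reports which of the assumed settings are still empty in `.env` or the environment.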
Build 0 is an automated pipeline that loads a CSV dataset and generates profiling reports, summary statistics, correlation analysis, regression output, and visualizations.

    python builds/build0_data_analysis_pipeline_assignment_1.py --data data/penguins.csv

Build 1 is an interactive CLI assistant, built with LangChain LCEL, that answers questions about the dataset schema. It uses Claude (Anthropic) as the LLM backend via an OpenAI-compatible proxy and supports three modes:

Run 1 (no memory):

    python builds/build1_llm_assistant_Assignment_2.py --data data/penguins.csv

Run 2 (with memory; retains conversation context):

    python builds/build1_llm_assistant_Assignment_2.py --data data/penguins.csv --memory

Run 3 (memory and streaming; streams responses token-by-token):

    python builds/build1_llm_assistant_Assignment_2.py --data data/penguins.csv --memory --stream

| Flag | Description |
|---|---|
| `--model` | LLM model name (default: `claude-opus-4-6`) |
| `--temperature` | Sampling temperature (default: 0.2) |
| `--quiet_schema` | Suppress schema display at startup |
| `--report_dir` | Output directory (default: `reports`) |
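
The flags above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions, not the script's actual parser: the defaults come from the table, while treating `--data` as required and `--memory`, `--stream`, and `--quiet_schema` as boolean switches is inferred from the example commands.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Interactive LLM assistant (sketch of the documented flags)"
    )
    # Assumption: the dataset path is mandatory, as every example passes it.
    parser.add_argument("--data", required=True, help="Path to the CSV dataset")
    # Assumption: these are on/off switches that take no value.
    parser.add_argument("--memory", action="store_true",
                        help="Retain conversation context")
    parser.add_argument("--stream", action="store_true",
                        help="Stream responses token-by-token")
    parser.add_argument("--quiet_schema", action="store_true",
                        help="Suppress schema display at startup")
    # Defaults taken from the flag table.
    parser.add_argument("--model", default="claude-opus-4-6",
                        help="LLM model name")
    parser.add_argument("--temperature", type=float, default=0.2,
                        help="Sampling temperature")
    parser.add_argument("--report_dir", default="reports",
                        help="Output directory")
    return parser


if __name__ == "__main__":
    # Parse the "Run 2" example instead of sys.argv so the sketch runs standalone.
    args = build_parser().parse_args(["--data", "data/penguins.csv", "--memory"])
    print(args)
```

Declaring the switches with `store_true` means omitting a flag leaves that mode off, which matches how the three run modes differ only in the flags they add.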