A shared workspace for code, experiments, and data pipelines.
- Project 1 — Reading CSV Files with Pandas — load a real aqueous-solubility dataset (AQSolDB) into a pandas DataFrame and explore it with shape, dtypes, summary stats, and filtering.
- Project 2 — Summary Statistics & Outlier Detection — compute quartiles and the IQR and implement Tukey's outlier rule from scratch on the Palmer Penguins dataset, discovering why outliers only surface once you group by species.
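As a rough illustration of the idea behind Project 2, the sketch below applies Tukey's rule (flag values outside Q1 - 1.5·IQR and Q3 + 1.5·IQR) using only the standard library. The numbers are made-up toy values, not Palmer Penguins data, and the group names `small` and `large` are hypothetical. Quartile conventions differ between tools; this sketch assumes the "inclusive" linear-interpolation method, which may not match the one the project uses.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" is an assumption; other quartile conventions give slightly different fences
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Toy measurements for two hypothetical groups (NOT real penguin data)
groups = {
    "small": [3400, 3500, 3600, 3700, 3800, 4800],
    "large": [4700, 4900, 5000, 5100, 5200, 5300],
}

# Pooled: the wide spread across groups inflates the IQR and hides the odd value
pooled = [v for vals in groups.values() for v in vals]
pooled_outliers = tukey_outliers(pooled)

# Per group: each group's own IQR is narrow, so 4800 stands out within "small"
per_group_outliers = {name: tukey_outliers(vals) for name, vals in groups.items()}

print("pooled:", pooled_outliers)
print("per group:", per_group_outliers)
```

With these toy values the pooled data yields no outliers, while the per-group pass flags 4800 in the `small` group, which is the same effect the project explores by grouping on species.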
Homebrew, the macOS package manager, is used to install everything below. If you don't have it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
macOS ships git with Apple's Command Line Tools:
xcode-select --install # installs git + compilers (skip if already present)
git --version # verify
Optional: `brew install git` for a newer version than Apple's.
uv manages the Python interpreter, the virtual environment, and packages — all in one fast tool.
brew install uv
git clone git@github.com:SuperCowPowers/data_engineering.git
cd data_engineering
uv sync # creates .venv, installs the right Python + all dependencies
`uv sync` reads `pyproject.toml` and `.python-version`, downloads Python 3.13 if you don't have it, and builds the environment. That's the whole setup.
uv run python path/to/script.py # run a script
Prefer the classic workflow? Activate the env and use python directly:
source .venv/bin/activate
python path/to/script.py
Point your editor at the project's `.venv` so it uses the right interpreter and finds the installed packages.
PyCharm
- Settings → Project → Python Interpreter → Add Interpreter → Add Local.
- Choose Existing and select `.venv/bin/python` in the project. (PyCharm 2024.2+ also has a native uv option that does this for you.)
VS Code
- Install the Python extension.
- Command Palette (⌘⇧P) → Python: Select Interpreter → pick the one under `.venv`. VS Code usually auto-detects it on open.
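To confirm that an editor or terminal really picked up the project environment, a small check like the following can help. It is only a sketch: the helper names `in_virtualenv` and `describe_interpreter` are hypothetical and not part of this repository. It relies on the standard behaviour that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the base Python installation.
    return sys.prefix != sys.base_prefix


def describe_interpreter() -> str:
    """Summarise which Python is running and where it lives."""
    location = "virtual environment" if in_virtualenv() else "base installation"
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return f"Python {version} from {sys.executable} ({location})"


if __name__ == "__main__":
    print(describe_interpreter())
```

Run from the editor's integrated terminal or run configuration, the printed path should point inside the project's `.venv` if the interpreter was selected correctly.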
uv run pytest # run tests
git checkout -b my-feature
# ... make changes, commit ...
git push -u origin my-feature
Then open a pull request on GitHub for review.
data_engineering/
├── pyproject.toml # project, dependencies, tool config
├── .python-version # pinned Python version
├── uv.lock # exact resolved versions (created by `uv sync`)
├── src/data_engineering/ # importable, shared code
├── tests/ # pytest tests
├── project_1/ # reading CSVs with pandas
└── project_2/ # summary statistics & outlier detection