A relational analytics platform that ingests bibliographic data from DBLP, Kaggle Scimago and iCore26 into a MySQL star schema and serves it through an async FastAPI back-end and a React / Visx front-end. University of Ioannina, Department of Computer Science & Engineering. Coursework MYE030 / ΠΛΕ045, Spring 2026.
flowchart LR
subgraph SOURCES["Raw sources (data/)"]
DBLPJ[DBLP<br/>journals.csv]
DBLPC[DBLP<br/>conferences.csv]
ICORE[iCore26<br/>rankings.csv]
KAGGLE[Kaggle Scimago<br/>rankings.csv]
end
subgraph ETL["Polars ETL (Python 3.13)"]
TRANS[transformer.py<br/>normalise · fuzzy<br/>resolve · validate]
EXPORT[exporter.py<br/>CSVs with N nulls]
end
subgraph WAREHOUSE["MySQL 8 InnoDB / utf8mb4"]
SQL1[01_schema.sql<br/>DDL]
SQL2[02_load.sql<br/>LOAD DATA LOCAL INFILE]
SQL3[03_views.sql<br/>11 views + 6 materialised]
DB[(mye030<br/>star schema)]
end
subgraph API["FastAPI + asyncmy"]
DAO[DAOs<br/>raw parameterised SQL]
ROUTE[28 routers]
DOCS["Swagger / ReDoc<br/>openapi.json"]
end
subgraph FE["React 19 + Vite + Visx"]
UI[13 routes<br/>6 chart families<br/>filters · tables]
end
DBLPJ --> TRANS
DBLPC --> TRANS
ICORE --> TRANS
KAGGLE --> TRANS
TRANS --> EXPORT
EXPORT --> SQL2
SQL1 --> DB
SQL2 --> DB
SQL3 --> DB
DB <-->|asyncmy pool| DAO
DAO --> ROUTE
ROUTE --> DOCS
ROUTE -->|JSON| UI
style DB fill:#E3D5C0,stroke:#14213D,color:#1A1A1A
style TRANS fill:#FFF2A8,stroke:#14213D,color:#1A1A1A
style EXPORT fill:#FFF2A8,stroke:#14213D,color:#1A1A1A
style UI fill:#F5F1E8,stroke:#14213D,color:#1A1A1A
style ROUTE fill:#F5F1E8,stroke:#14213D,color:#1A1A1A
style DAO fill:#F5F1E8,stroke:#14213D,color:#1A1A1A
First-time pull? Read
docs/ONBOARDING.mdfirst — it is the 20-minute walkthrough fromgit cloneto a working app, including where to find the database backup and what to do when ports clash. The rest of this README is the long-form reference.
- Overview
- Prerequisites
- Clone and configure
- Run the platform
- What you can observe
- Feature matrix
- Smoke test (verify it works)
- Running the tests
- Database backup and restore
- Architecture in depth
- Trade-offs at a glance
- Project structure
- Troubleshooting
- Further reading
| Aspect | Number / Choice |
|---|---|
| Pipeline stages | 4 (Extract → Transform → Load → Materialise) |
| Source datasets | 3 (DBLP, Kaggle Scimago, iCore26) |
| Schema tables | 9 (4 lookups + 2 facts + 2 bridges + 1 rejection log) |
| Analytical views | 11 |
| Materialised aggregates | 6 InnoDB tables |
| HTTP endpoints | 28 business + /health + /docs + /redoc + /openapi.json |
| Frontend routes | 13 file-based routes |
| Chart families | 6 (Line, Bar, Scatter, Heatmap, StackedArea, HorizontalBarChart) |
| Backend tests | ~200 pytest tests, 95 % line coverage |
| Frontend tests | ~160 Vitest + ~50 Playwright E2E |
| Hard rules respected | No ORM, no SQL builders, no string-interpolated SQL, no in-Python aggregation |
Required for every run path:
| Tool | Minimum version | Why |
|---|---|---|
| Git | 2.30+ | Clone the repository |
| Docker Desktop / Docker Engine | 24.x+ | Runs the MySQL 8 container |
| Python | 3.13.x (NOT 3.14) | ETL pipeline + FastAPI server |
| uv | 0.4.x+ | Python package manager (replaces pip + venv) |
| Node.js | 20 LTS or 22 LTS | Required by pnpm to build the frontend |
| pnpm | 9.x+ | Frontend package manager |
Recommended (optional):
| Tool | Purpose |
|---|---|
MiKTeX / TeX Live (xelatex) |
Rebuild the LaTeX report from deliverables/report/ |
DataGrip / MySQL Workbench / mysql CLI |
Browse the database, draw ER diagrams |
| VS Code / PyCharm / WebStorm | Edit code with full IntelliSense |
| Playwright | Run the end-to-end suite (pnpm exec playwright install chromium) |
Bare-minimum install: Docker Desktop, Python 3.13 with
uv, Node 22 LTS withpnpm. The per-OS subsections below show the exact commands. The "Recommended" tools above are only needed for deeper inspection.
Open PowerShell (not PowerShell ISE). Most installers below want Run as Administrator the first time.
# Git — https://git-scm.com/download/win (accept defaults)
git --version
# Docker Desktop — https://docker.com/products/docker-desktop/
# Tick "Use WSL 2 instead of Hyper-V" during install, reboot, launch Docker
docker --version
docker compose version
# Python 3.13
winget install --id Python.Python.3.13 --source winget
python --version # expect Python 3.13.x
# uv (Astral)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Close + reopen the terminal so PATH refreshes
uv --version
# Node 22 LTS + pnpm via Corepack
winget install --id OpenJS.NodeJS.LTS --source winget
corepack enable
corepack prepare pnpm@latest --activate
node --version # expect v22.x
pnpm --version # expect 9.x or 10.xsudo apt update
sudo apt install -y git ca-certificates curl
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER # log out / in after this
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt install -y python3.13 python3.13-venv
curl -LsSf https://astral.sh/uv/install.sh | sh
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs
sudo corepack enable && corepack prepare pnpm@latest --activateOther systems (macOS, Arch, Fedora …): the tools below all have official binaries / Homebrew packages. Install Docker Desktop, Python 3.13,
uv, Node 22 andpnpm; commands above are straightforward to translate.
# Pick any folder. The path may contain spaces but not " or '.
git clone https://github.com/JohnAng/data-integration-viz-platform.git
cd data-integration-viz-platformCopy .env.example to .env:
# Windows
Copy-Item .env.example .env
# macOS / Linux
cp .env.example .envThe default values:
MYSQL_ROOT_PASSWORD=root
MYSQL_DATABASE=mye030
MYSQL_USER=Angelakos
MYSQL_PASSWORD=2403
MYSQL_CONTAINER_NAME=mye030_mysql
You can change them freely before the first docker compose up.
After the volume is initialised the credentials are baked in — to
change them you must drop the volume with docker compose down -v
(this destroys all data).
Both Docker Compose and pydantic-settings read .env
automatically.
The repository on GitHub ships with empty data folders because the
raw CSVs (~722 MiB) and the MySQL backup (~172 MiB) exceed GitHub's
100 MiB per-file limit. The orchestrator (python run.py) auto-detects
what is on disk and chooses the fastest path, but you need to place at
least one of the two archives first.
| What you need | Where to put it | Result |
|---|---|---|
deliverables/db_backup.sql.gz (~172 MiB) |
deliverables/ directly |
Auto-restore on first boot (~3 min) |
data/dblp_dataset/*.csv + data/icore26_data/*.csv + data/journal_ranking_data_raw/*.csv |
matching subfolders of data/ |
Full ETL pipeline runs (~5 min) |
| Neither | — | python run.py exits with a banner listing the download sources |
Where to get them:
- For evaluators: links to both archives are in
AM2403_prj.txt(submitted viaturnin). The txt file lists Proton Drive URLs that open without a Proton account, plus the exact placement instructions reproduced above. - For the public: the raw CSVs come from the original sources
(one-time download, no auth required):
- DBLP dump (article + inproceedings CSVs): https://dblp.org/xml/release/
- iCore26 conference rankings: http://portal.core.edu.au/conf-ranks/
- Kaggle Scimago journal ranking: search for "Scimago Journal Ranking" on Kaggle.
If you place both the backup and the raw CSVs, python run.py
prefers the backup (Path A). Use python run.py --etl to override and
force the ETL path.
One command, no host-side Python / Node / uv / pnpm required. Just Docker Desktop. ~3 min on first run (image builds + auto-restore).
The Docker stack ships the frontend as a production build: Vite
compiles, tree-shakes and minifies the React source into
/usr/share/nginx/html inside the mye030_frontend image. Nginx
serves the pre-built bundle, no hot-reload, no source maps.
| Surface | What the user gets in Path 0 | What the user gets in Paths A/B/C |
|---|---|---|
| Build mode | Production (vite build, minified, content-hashed) |
Development (vite dev, HMR, source maps) |
| Frontend server | nginx 1.27 alpine | Vite dev server |
| Edit & reload | Rebuild image | Hot module replacement |
| First request latency | <50 ms (static files) | ~300 ms (on-demand transpile) |
| Bundle size | ~600 KB gzipped | unminified, ~6 MB |
| Backend reload on save | No (immutable image) | Yes (uvicorn --reload) |
Use Path 0 for one-shot deployment and demos. Use Paths A/B/C while actively editing code.
python run.pyThat single command does everything:
- checks that Docker is running;
- copies
.env.exampleto.envon first run; - auto-detects whether the gzipped backup is present, the raw CSVs are present, or neither, and prints a clear hint in the third case;
- builds the images on first run;
- brings up MySQL, restores the backup (or runs the ETL), and waits until all three containers report healthy;
- prints the final URLs.
Typical first-run walltime: ~2.5 minutes with the backup present, ~5 minutes if the ETL has to run.
Other entry points:
| Command | What it does |
|---|---|
python run.py --status |
List container health + /health + /api/meta/totals |
python run.py --verify |
Run the data-quality SQL report |
python run.py --etl |
Force the ETL path even if the backup file is on disk |
python run.py --down |
Stop the stack (volume kept) |
python run.py --reset |
Stop the stack and wipe the data volume |
python run.py --help |
Print the full usage |
If you would rather avoid the wrapper, the underlying Docker commands also work standalone:
Copy-Item .env.example .env # first time only
docker compose up -d --wait # blocks until all 3 healthydocker compose up -d alone returns as soon as containers are
started; on first boot the backup restore still has ~2 minutes
to go. --wait blocks the command until every healthcheck passes.
Three containers are built and started:
| Container | Image | Port | Healthcheck | Notes |
|---|---|---|---|---|
mye030_mysql |
mysql:8.0 |
3306 |
mysqladmin ping |
Auto-restores deliverables/db_backup.sql.gz on first boot |
mye030_backend |
Built from src/backend/Dockerfile |
8000 |
GET /health |
uvicorn + asyncmy, waits for MySQL healthy |
mye030_frontend |
Built from src/frontend/Dockerfile (multi-stage Vite + nginx) |
5173 |
GET / |
nginx proxies /api/*, /docs, /redoc, /openapi.json, /health to backend |
Open http://localhost:5173 — the entire app runs through nginx, no CORS dances. The backend Swagger UI is at http://localhost:5173/docs (proxied) or http://localhost:8000/docs (direct).
First boot auto-restore:
deliverables/db_backup.sql.gzis mounted read-only into the MySQL init folder. The tiny init script atscripts/mysql-init.shgunzip-streams it intoMYSQL_DATABASEthe first time the volume is empty. If the backup file is missing, the init script logs a warning and leaves an empty schema — you would then need Path B.
docker compose up # foreground (Ctrl+C to stop)
docker compose up -d # detached
docker compose logs -f # tail every service
docker compose logs backend # tail one service
docker compose ps # status
docker compose restart backend # restart one
docker compose down # stop (volume persists)
docker compose down -v # stop + drop the MySQL volume (wipes data)The host-installed paths exist for hot-reload development and for running the ETL pipeline on raw CSVs. Reviewers should prefer Path 0 above.
# Windows
.\scripts\setup.ps1
# Debian / Ubuntu
./scripts/setup.shThe script does Path A end-to-end on the host (prereq check + .env +
MySQL container only + backend deps + restore + frontend deps).
Idempotent — safe to re-run. If deliverables/db_backup.sql.gz is
missing it prints the Path B commands instead.
The backup deliverables/db_backup.sql.gz (≈ 172 MiB compressed) is a
full mysqldump of the live database — schema + ~2.5M article rows +
~1.4M author rows + views + materialised tables.
GitHub blocks files > 100 MiB. See the Before you start — get the data section above for where to download
db_backup.sql.gz; place it atdeliverables/db_backup.sql.gzbefore running any of the commands below.
docker compose up -d
cd src\backend ; uv sync ; cd ..\..
cd src\backend ; uv run python -m database.db_restore ; cd ..\..
# Quick sanity check
docker exec mye030_mysql mysql -uroot -proot mye030 -e "SELECT * FROM view_corpus_totals;"Then run the API and the frontend.
Use if you placed the source CSVs in data/ (see the Before you
start — get the data section for
download URLs and the expected subfolder layout).
docker compose up -d
cd src\backend
uv sync
uv run python etl/exporter.py # → 9 CSVs in exports/
cd ..\..
docker exec -i mye030_mysql mysql --local-infile=1 -uroot -proot < sql_scripts/01_schema.sql
docker exec mye030_mysql mysql -uroot -proot -e "GRANT ALL PRIVILEGES ON mye030.* TO 'Angelakos'@'%'; FLUSH PRIVILEGES;"
docker exec -i mye030_mysql mysql --local-infile=1 -uroot -proot mye030 < sql_scripts/02_load.sql
docker exec -i mye030_mysql mysql -uroot -proot mye030 < sql_scripts/03_views.sqlPowerShell quirk: if
<does not pipe, replace each line withGet-Content sql_scripts/01_schema.sql -Raw | docker exec -i mye030_mysql mysql ....
Wipe the volume and start over:
docker compose down -v
docker compose up -d
# ...then Path A or Path B from step 3cd src\backend
uv run uvicorn api.main:application --reload --port 8000The server listens on http://127.0.0.1:8000. Open http://localhost:8000/docs for Swagger and http://localhost:8000/redoc for ReDoc.
cd src\frontend
pnpm install # once per environment
pnpm dev # Vite dev server on :5173The dev server proxies /api/* to http://localhost:8000. Override the
target with the VITE_BACKEND_URL environment variable.
Every claim in this README is verifiable through one of the URLs or commands below. No part of the system is black-box.
| Surface | URL / command | What it proves |
|---|---|---|
| Live app | http://localhost:5173 | Landing → dashboard → profiles → charts |
| Swagger UI | http://localhost:8000/docs | Interactive call of every endpoint |
| ReDoc | http://localhost:8000/redoc | Static, narrative-style API reference |
| OpenAPI JSON (live) | http://localhost:8000/openapi.json | Machine-readable contract |
| OpenAPI JSON (offline) | docs/openapi.json |
Same contract, checked in; no server required |
| Health probe | http://localhost:8000/health | Liveness, returns {"status":"ok"} |
| HTTP playbook | src/backend/api.http |
Click-to-run requests in PyCharm / VS Code REST Client |
| DB CLI | docker exec -it mye030_mysql mysql -uroot -proot mye030 |
Run any SQL directly |
| Data quality report | docker exec -i mye030_mysql mysql -uroot -proot mye030 < scripts/data_quality_report.sql |
Row counts per table, entity-resolution match rates (journals 80.2 %, conferences 19.0 %), rejection-log breakdown, orphan / integrity checks (all 0 after a clean load). This script is the canonical source for every cleaning percentage quoted in the report. |
| DB browser | DataGrip / Workbench at localhost:3306 |
Visualise the schema + ER diagram |
| Backend tests | cd src/backend && uv run pytest tests/ -q |
~200 tests, green |
| Coverage HTML | uv run pytest --cov --cov-report=html then open htmlcov/index.html |
Per-line coverage map |
| Frontend tests | cd src/frontend && pnpm test |
~160 Vitest tests, green |
| E2E suite | pnpm exec playwright test (both servers up) |
~50 specs |
| E2E HTML report | pnpm exec playwright show-report after a run |
Per-test traces + screenshots |
| Lint + types | uv run ruff check . && uv run pyright |
Zero errors |
| PDF report | deliverables/AM2403_projectReport.pdf |
29 pages, full deliverable |
| Video script | deliverables/VIDEO_SCRIPT.md |
Shot-by-shot |
| UML diagrams | diagrams/ |
ETL flow, packages, deployment as PlantUML + PNG |
| Architecture & decisions | docs/ARCHITECTURE.md |
One decision per section with alternatives + trade-offs |
| Design system | docs/design.md |
Quiet-luxury design tokens |
| Glossary | docs/GLOSSARY.md |
DBLP / SJR / iCore / FoR / quartile definitions |
Maps the brief's requirements onto the implementation.
| Requirement | Implementation | File | How to verify |
|---|---|---|---|
| Read 3 heterogeneous sources | Polars LazyFrames over CSV | src/backend/etl/transformer.py |
uv run python etl/transformer.py prints row counts |
| Normalise / canonicalise titles | expand_abbreviations + lowercase + strip |
same | inline unit tests in tests/unit/ |
| Acronym resolution (DBLP ↔ iCore) | case-insensitive exact join on acronym | same | rejection log entries |
| Fuzzy resolution (DBLP ↔ Kaggle) | rapidfuzz.token_set_ratio ≥ 85 |
same | match-rate logs printed at end of ETL |
| Validation | NULL titles, year ∉ [1900, 2030], empty authors → rejected | same | inspect rejection_logs table after load |
| Quarantine, no silent loss | rejection log table populated by 02_load.sql |
sql_scripts/02_load.sql |
SELECT COUNT(*), reason FROM rejection_logs GROUP BY reason; |
Export with \N nulls for LOAD DATA INFILE |
polars.write_csv(null_value="\\N") |
src/backend/etl/exporter.py |
inspect any file in exports/ |
| Bulk load via raw SQL | LOAD DATA LOCAL INFILE |
sql_scripts/02_load.sql |
run script, watch row counts |
| Requirement | Implementation | File | How to verify |
|---|---|---|---|
| Integer PKs (no string PKs) | INT UNSIGNED AUTO_INCREMENT on every lookup PK |
sql_scripts/01_schema.sql |
DESCRIBE lookup_journals; |
| FK integrity | Bridges + facts FK to lookups; ON UPDATE CASCADE |
same | SHOW CREATE TABLE bridge_journal_article_authors; |
| Wide journal dimension | every Kaggle metric column inside lookup_journals |
same | DESCRIBE lookup_journals; shows 16 ranking columns |
| Aggregations inside DBMS (no in-Python) | every aggregate is a SQL view or CTE | sql_scripts/03_views.sql |
SHOW FULL TABLES IN mye030 WHERE Table_type='VIEW'; |
| Heavy aggregates as materialised tables | 6 InnoDB tables created with CREATE TABLE … AS SELECT … |
same | SHOW TABLES LIKE 'materialized_%'; |
| Parameterised queries only | every DAO uses %s placeholders |
src/backend/api/data_access/*.py |
grep -r "f'.*SELECT" src/backend/api → empty |
| Backup that recreates DB automatically | gzipped mysqldump + restore helper |
src/backend/database/db_backup.py & db_restore.py |
uv run python -m database.db_restore |
| Requirement | Implementation | File | How to verify |
|---|---|---|---|
| 28 endpoints | 6 routers under api/routers/ |
listed in docs/API_REFERENCE.md |
http://localhost:8000/docs |
| Async all the way down | every handler async def, asyncmy pool |
src/backend/api/database.py |
grep — no sync DB calls |
| Auto OpenAPI generation | FastAPI + Pydantic | src/backend/api/main.py |
http://localhost:8000/openapi.json |
| RFC 7807 problem-details on errors | custom exception handler | src/backend/api/errors.py |
hit /api/journals/99999999 |
| Server-side pagination | uniform PaginatedResponse[T] envelope |
src/backend/api/schemas/common.py |
inspect any list response |
| Year-range filter recomputes aggregates | start_year / end_year on every profile + chart endpoint |
data_access/* |
call same endpoint with and without range |
| Substring filters with prepopulated dropdowns | /api/meta/options feeds every dropdown |
src/backend/api/data_access/meta.py |
http://localhost:8000/api/meta/options |
| Requirement | Implementation | File | How to verify |
|---|---|---|---|
| List + profile pages per entity | 13 file-based routes | src/frontend/src/routes/ |
navigate the app |
| Year-range filter recomputes stats | YearRangeFilter writes search params; queries recompose |
src/frontend/src/components/filters/YearRangeFilter.tsx |
watch the line chart change on Apply |
| LineCharts, BarCharts, ScatterPlots | three Visx chart components | src/frontend/src/components/charts/ |
/charts page |
| Multi-series filtering & legends | InteractiveLegend toggles series |
src/frontend/src/components/charts/InteractiveLegend.tsx |
click a legend chip |
| Server-side sort + sticky header | PaginatedTable writes sort_by + sort_order |
src/frontend/src/components/tables/PaginatedTable.tsx |
click a column header |
| Heatmap + StackedArea + Cumulative growth | extra chart families | src/frontend/src/components/charts/Heatmap.tsx + StackedArea.tsx + HorizontalBarChart.tsx |
/charts switcher |
| Drag-to-zoom on every line chart | brush implementation in LineChart.tsx |
same | drag across the X axis |
| Voronoi nearest-x hover | hover overlay in LineChart.tsx |
same | hover anywhere over the line chart |
| Debounced text search inputs | useDebounce hook |
src/frontend/src/hooks/useDebounce.ts |
type fast in any search input |
| URL-driven filters (shareable) | TanStack Router search params + Zod | every route file | copy URL, paste in new tab |
| Requirement | Implementation | File | How to verify |
|---|---|---|---|
| Unit tests | tests/unit/ |
src/backend/tests/unit/ |
uv run pytest tests/unit -q |
| Integration tests against real MySQL | tests/integration/ with seed fixture |
src/backend/tests/integration/fixtures/seed.sql |
uv run pytest tests/integration -q |
| Frontend unit + route tests | Vitest + Testing Library + MSW | src/frontend/src/**/*.test.tsx |
pnpm test |
| Frontend E2E | Playwright (chromium) | src/frontend/e2e/*.spec.ts |
pnpm exec playwright test |
| Coverage reports | pytest-cov + Vitest v8 | htmlcov/ and coverage/ |
--cov-report=html / pnpm test:coverage |
| Static analysis | ruff + pyright + eslint + tsc | configs in pyproject.toml + eslint.config.* |
run any of the four |
| Requirement | Implementation | File |
|---|---|---|
| PDF technical report | XeLaTeX, 29 pages, Greek | deliverables/AM2403_projectReport.pdf |
mysqldump backup that can recreate DB automatically |
gzipped + idempotent restore CLI | deliverables/db_backup.sql.gz |
| Demo video | 15-min walkthrough | deliverables/AM2403_demo.mp4 (planned) |
| Turn-in file | UoI turnin payload | deliverables/AM2403_prj.txt |
After any of the three paths above and both servers running:
# Backend liveness
curl http://localhost:8000/health
# → {"status":"ok"}
# Real KPI tile data
curl http://localhost:8000/api/meta/totals
# → {"total_articles":...,"total_authors":...,...}
# Pagination envelope
curl "http://localhost:8000/api/journals?page_size=2"
# → {"items":[...],"page":1,"page_size":2,"total_items":...}
# RFC 7807 error path
curl http://localhost:8000/api/journals/99999999
# → {"type":"about:blank","title":"Not Found","status":404,...}Open http://localhost:5173 and verify:
- 3 KPI tiles on the landing page show real numbers (not
—). /dashboardline chart has data./journalspaginates and filters./chartsswitches between 6 chart families and accepts filters.
# Backend (needs MySQL container; creates mye030_test, loads seed.sql)
cd src\backend
uv run pytest tests/ -q
uv run pytest tests/ --cov --cov-report=html # → htmlcov/index.html
uv run pytest tests/unit -q # no DB needed
# Frontend unit + integration (no DB, no backend; MSW mocks)
cd src\frontend
pnpm test
pnpm test:coverage # → coverage/index.html
pnpm test:watch # watch mode
# Frontend E2E (both servers live)
docker compose up -d
cd src\backend
uv run uvicorn api.main:application --port 8000 &
cd ..\frontend
pnpm dev &
pnpm exec playwright install chromium # first time only
pnpm exec playwright test
pnpm exec playwright show-report # HTML report with tracescd src\backend
# Produce deliverables\db_backup.sql.gz (full mysqldump, gzipped, ~160 MiB)
uv run python -m database.db_backup
# Re-import the dump into the (already existing) target database
uv run python -m database.db_restoreBoth commands read MYSQL_CONTAINER_NAME, MYSQL_DATABASE and
MYSQL_ROOT_PASSWORD from .env via pydantic-settings and stream
through Python's gzip straight into / out of docker exec mysqldump
(or mysql). The raw uncompressed dump never has to land on disk.
Typical output sizes for the full dataset:
| Stage | Size | Notes |
|---|---|---|
Raw mysqldump SQL |
~515 MiB | All tables, indexes, views |
Compressed .sql.gz |
~160 MiB | compresslevel=9 |
The compressed backup exceeds GitHub's 100 MiB hard file-size limit. Ship it via a GitHub Release attachment, not the repo.
The full ADR-style record is in docs/ARCHITECTURE.md.
Highlights:
sequenceDiagram
autonumber
actor User as Browser
participant FE as React + Vite (:5173)
participant TQ as TanStack Query cache
participant API as FastAPI / uvicorn (:8000)
participant DB as MySQL 8 (:3306)
User->>FE: navigate /journals/12345?start_year=2010
FE->>TQ: read cache(key)
TQ-->>FE: stale or miss
FE->>API: GET /api/journals/12345?start_year=2010
API->>API: Pydantic validate query params
API->>DB: SELECT ... WHERE journal_id=%s AND year>=%s
DB-->>API: rows
API-->>FE: JournalProfile JSON
FE->>TQ: hydrate cache
FE-->>User: render profile + chart + table
flowchart TB
subgraph backend["src/backend/api"]
R["routers/<br/>HTTP layer"]
D["data_access/<br/>raw SQL DAOs"]
S["schemas/<br/>Pydantic v2 DTOs"]
DB[("database.py<br/>asyncmy pool")]
end
R --> D
R --> S
D --> S
D --> DB
style R fill:#F5F1E8,stroke:#14213D
style D fill:#F5F1E8,stroke:#14213D
style S fill:#F5F1E8,stroke:#14213D
style DB fill:#E3D5C0,stroke:#14213D
ETL data-flow diagram (UML «activityETL» notation, UoI
Blueprints4ETL style) is in diagrams/etl-dataflow.png.
| Decision | Why | Cost accepted |
|---|---|---|
| Raw SQL via asyncmy, no ORM | Brief requirement + 4× throughput vs aiomysql | Every endpoint is a hand-written DAO method |
| Wide journal dimension | One-row reads, simple indexes | ~170 KB extra storage |
| Materialised aggregates | 11–46 s → < 0.3 s queries | Snapshots are read-only between 03_views.sql runs |
| Hybrid entity resolution (exact + fuzzy) | Best of both worlds: 100 % precision on acronyms, recall on fuzzy titles | 81 % of DBLP conferences unranked (visible via ranked_only filter) |
| FastAPI + Pydantic v2 | Free OpenAPI, async, RFC 7807 errors | Two type systems (Pydantic + TS) kept in sync by hand |
| Visx over Recharts | Pixel-perfect charts with drag-to-zoom + Voronoi hover | More component code per chart |
| URL-driven filters (TanStack Router search params) | Every page state is shareable, browser back works | Filters must be schema-validated (Zod) |
| RFC 7807 problem-details everywhere | Single error envelope, machine-readable | Custom handlers in api/errors.py |
| Polars over pandas | ~10× faster, lower memory on this size | Less mature ecosystem |
| Static post-ETL warehouse pattern | Matches the brief's read-mostly workload | No live ingest; ETL is a re-run, not a tail |
Full reasoning, alternatives considered and consequences are in
docs/ARCHITECTURE.md.
.
├── README.md this file
├── docker-compose.yml MySQL 8 service + volume mounts
├── .env credentials (git-ignored)
├── .env.example template
├── scripts/
│ ├── setup.ps1 one-shot bootstrap (Windows)
│ └── setup.sh one-shot bootstrap (macOS / Linux)
├── data/ raw input CSVs (git-ignored)
├── exports/ ETL outputs (git-ignored)
├── sql_scripts/
│ ├── 01_schema.sql DDL: tables, indexes, FKs
│ ├── 02_load.sql DML: LOAD DATA LOCAL INFILE per table
│ └── 03_views.sql 11 views + 6 materialised tables
├── diagrams/ PlantUML sources + PNG exports
├── docs/
│ ├── API_REFERENCE.md endpoint-by-endpoint reference
│ ├── ARCHITECTURE.md ADR-style decisions and trade-offs
│ ├── GLOSSARY.md domain terms
│ ├── design.md visual design system
│ └── openapi.json static OpenAPI 3.1 document
├── src/
│ ├── backend/ Python 3.13 + FastAPI + asyncmy
│ └── frontend/ React 19 + Vite + TanStack + Visx
└── deliverables/ PDF report + video script + db_backup.sql.gz
| Symptom | Likely cause | Fix |
|---|---|---|
docker: command not found |
Docker not installed or PATH missing | Re-install Docker Desktop, restart terminal |
Cannot connect to the Docker daemon |
Docker Desktop not running | Launch Docker Desktop, wait for the whale icon to stop animating |
Bind for 0.0.0.0:3306 failed: port is already allocated |
A local MySQL is using 3306 | Stop the local server, or set MYSQL_PORT=3307 in .env and docker compose up -d again — see "Port conflicts" below |
Frontend container shows nginx HTML but /docs returns Not Found and you see /@vite/client in the page source |
A host-side pnpm dev is also listening on 5173; the OS routes traffic to it instead of the docker nginx |
Stop the host-side Vite (Ctrl+C in that terminal), or set FRONTEND_PORT=5273 in .env and recreate the stack |
Access denied for user 'Angelakos'@'%' |
Volume initialised with different credentials | docker compose down -v then bring back up; or run the GRANT line shown in Path B |
LOAD DATA LOCAL INFILE … is disabled |
local_infile off |
Already set in docker-compose.yml; ensure you did not edit the command: block |
pyright / ruff reports type errors after uv sync |
Stale .venv |
Remove-Item -Recurse -Force .venv then uv sync again |
pnpm install complains about peer deps |
Old corepack pin | corepack prepare pnpm@latest --activate |
xelatex: command not found |
TeX distribution missing | Install MiKTeX (Windows) / MacTeX (macOS) / texlive-xetex (Linux). Only needed to rebuild the PDF |
Frontend shows — for KPI tiles |
Backend not reachable | Confirm uvicorn is up at :8000; check devtools network tab |
deliverables/db_backup.sql.gz missing |
Stripped because >100 MiB | Download from the GitHub Release; place under deliverables/ |
Float too large for page (LaTeX) |
A figure grew | Reduce width= or add height=0.78\textheight,keepaspectratio |
MiKTeX: cannot retrieve attributes for ...\\X.exe |
MiKTeX iterates PATH and stumbles on a directory whose name ends in .exe |
Temporarily remove the offending entry from PATH before running xelatex |
All three host-facing ports are configurable through .env so a
clash with something else on the host does not require editing
tracked files:
| Variable | Default | Container target |
|---|---|---|
FRONTEND_PORT |
5173 |
mye030_frontend:80 (nginx) |
BACKEND_PORT |
8000 |
mye030_backend:8000 (uvicorn) |
MYSQL_PORT |
3306 |
mye030_mysql:3306 |
If a port is taken, set a free one in .env and bring the stack back
up:
# .env
FRONTEND_PORT=5273 # Vite default 5173 was busy
MYSQL_PORT=3307 # local MySQL service was on 3306
docker compose down
docker compose up -dThe scripts/setup.ps1 / scripts/setup.sh bootstrap script
performs a pre-flight port check and exits with a clear message
identifying the offending process before any container is started.
Common culprits:
- A stale
pnpm devon5173(kill the host-side Vite, or flipFRONTEND_PORT). - A locally installed MySQL service on
3306(Stop-Service MySQL on Windows,sudo systemctl stop mysqlon Linux, or flipMYSQL_PORT). - Another
uvicornon8000.
Why this matters: when both a host-side
pnpm devand the dockermye030_frontendbind to5173, the OS routes external traffic to whichever bound first — usuallypnpm dev. The visitor then sees the dev SPA at the URL instead of the proxied Swagger / ReDoc //openapi.json. The pre-flight check in the setup script prevents that scenario; without the script, use.envto relocate the conflicting port.
| Document | When to read |
|---|---|
docs/ONBOARDING.md |
First-time pull, want the shortest path to a running app |
docs/ARCHITECTURE.md |
Want the why behind each design decision |
docs/API_REFERENCE.md |
Want every endpoint + every parameter + every DTO field |
docs/openapi.json |
Want a machine-readable contract |
docs/design.md |
Want to recreate the visual design |
docs/GLOSSARY.md |
Encountered an unfamiliar term |
src/backend/README.md |
Hacking on Python / FastAPI / SQL |
src/backend/api.http |
Want to fire requests by hand |
src/frontend/README.md |
Hacking on React / Visx / Tailwind |
deliverables/AM2403_projectReport.pdf |
Final 34-page writeup |