Data Scientist based in Poland 🇵🇱
Building production-grade ML systems — from statistical analysis to deployed APIs
I work across the full data science lifecycle — from exploratory analysis and statistical hypothesis testing to training ML models, building REST APIs, and shipping interactive dashboards.
My work focuses on:
- Rigorous statistics — hypothesis testing with proper effect sizes, not just p-values
- Production mindset — models that serve real predictions, not just notebooks
- Clear communication — dashboards and docs that translate findings into decisions
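
As a minimal sketch of what "effect sizes, not just p-values" can look like in practice, the snippet below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d using only the standard library. The function name and the sample numbers are hypothetical and are not taken from any of the projects listed here. The p-value itself would typically come from a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t_and_cohens_d(a, b):
    """Return (t, df, d) for two independent samples (hypothetical helper)."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = mean(a), mean(b)
    v_a, v_b = variance(a), variance(b)  # sample variances (n - 1 denominator)

    # Welch's t statistic: does not assume equal variances.
    se = math.sqrt(v_a / n_a + v_b / n_b)
    t = (m_a - m_b) / se

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v_a / n_a + v_b / n_b) ** 2 / (
        (v_a / n_a) ** 2 / (n_a - 1) + (v_b / n_b) ** 2 / (n_b - 1)
    )

    # Cohen's d: mean difference scaled by the pooled standard deviation.
    pooled_sd = math.sqrt(((n_a - 1) * v_a + (n_b - 1) * v_b) / (n_a + n_b - 2))
    d = (m_a - m_b) / pooled_sd
    return t, df, d


# Made-up numbers, for illustration only.
group_a = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7]
group_b = [11.0, 12.4, 11.9, 13.1, 10.8, 12.2]
t, df, d = welch_t_and_cohens_d(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}, Cohen's d = {d:.2f}")
```

Reporting `d` next to the test result shows how large a difference is, which the p-value alone does not convey.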
Languages & Core
ML & Data Science
Deep Learning & NLP
APIs & Deployment
Statistical Analysis
End-to-end ML system predicting customer churn in real time
| Component | Details |
|---|---|
| Pipeline | Data ingestion → feature engineering → XGBoost (CV AUC 0.75) → SHAP |
| Hypothesis Testing | 5 statistical tests (χ², Welch t-test, Mann-Whitney U) with effect sizes |
| API | FastAPI REST — single & batch prediction, model metrics, live test results |
| Dashboard | Streamlit 4-page app — overview, hypothesis tests, live prediction, model performance |
| Infrastructure | Docker Compose, GitHub Actions CI, pytest (28 tests) |
Production-style ML system predicting Warsaw apartment prices, refreshed weekly with live Otodom data
| Component | Details |
|---|---|
| Data | ~11 months of Polish real estate listings (Aug 2023 – Jun 2024) + weekly live scraping |
| Pipeline | Feature engineering → model training → automated weekly refresh |
| Scope | End-to-end: data collection, preprocessing, modelling, deployment |
| Project | Description | Stack |
|---|---|---|
| ab_testing | Frequentist A/B test for an e-commerce page redesign — Z-test, Welch t-test, power analysis, effect sizes | Python · SciPy · Jupyter |
| esg_risk_ml_project | ESG risk classification using Random Forest | Python · scikit-learn |
| credit_score_classification | Credit score classification with precision/recall analysis | Python · scikit-learn |
| default_rate_calculation | Default rate vs. macroeconomic factors in a hypothetical Polish bank | Python · pandas |
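
For readers unfamiliar with the frequentist A/B workflow named in the table above, here is a rough standard-library sketch of a two-proportion Z-test with Cohen's h as the effect size, plus a normal-approximation sample-size calculation. The function names, default `alpha` and `power` values, and the conversion counts are illustrative assumptions, not code or data from the `ab_testing` repository.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided Z-test for a difference in conversion rates.

    Returns (z, p_value, cohens_h). Hypothetical helper for illustration.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Cohen's h: difference of arcsine-transformed proportions.
    h = 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))
    return z, p_value, h


def sample_size_per_group(h, alpha=0.05, power=0.8):
    """Approximate per-group sample size needed to detect effect size h."""
    if h == 0:
        raise ValueError("effect size h must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / abs(h)) ** 2)


# Made-up counts: 100/1000 conversions on the old page, 130/1000 on the new one.
z, p_value, h = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}, Cohen's h = {h:.3f}")
print("users needed per group:", sample_size_per_group(h))
```

Running the sample-size step before collecting data helps avoid underpowered tests that cannot detect the effect of interest.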
| Project | Description | Stack |
|---|---|---|
| brest_cancer_prediction | Master's thesis — IDC detection from histopathology images with SVC, CNN & EfficientNetB0 | TensorFlow · Keras · scikit-learn |
| smokers_detection | Binary CNN classifier for smoker vs. non-smoker image detection | TensorFlow · Keras |
| reuters | Multi-class neural network classifier for Reuters news articles | TensorFlow · Keras |
| Project | Description | Stack |
|---|---|---|
| roberta_fake_job | Fake job posting detection using RoBERTa | HuggingFace · PyTorch |
| NLP_word_embeddings | Word embedding experiments with Word2Vec and GloVe | Python · Gensim |
| NLP_intro | Sentiment analysis on the Sentimental_Data dataset | Python · NLTK |
| Project | Description | Stack |
|---|---|---|
| vehicles | Polynomial regression + Ridge regularization on a vehicles dataset (R²=0.93) | Python · scikit-learn |
| stroke_predict | Stroke likelihood prediction from patient health features | Python · scikit-learn |
| pointed-gun-at-person_model | ML classifier for Phoenix PD incidents involving drawn firearms | Python · scikit-learn |
| Project | Description | Stack |
|---|---|---|
| vietnam_war_pyspark | PySpark analysis of Vietnam War bombing operations (1955–1975) | PySpark · Python |
| NIST_API | Script for downloading vulnerability reports from the NVD API | Python |
| Gmail_API | Gmail API client for authenticating and downloading email attachments | Python |
Always open to interesting data science problems and collaborations.
