This project implements an end-to-end, production-style User Behavior Anomaly Detection (UBA) system using unsupervised machine learning.
The system learns normal user behavior patterns from historical activity data and identifies anomalous user events that deviate significantly from learned baselines — without relying on labeled anomaly data.
It is designed and implemented as a deployable inference service, not just a notebook experiment.
Live Deployment:
- API Base URL: https://api.alturawing.tech
- Swagger Docs: https://api.alturawing.tech/docs
Modern digital systems generate continuous streams of user activity events such as logins, clicks, session durations, and access times. Rule-based monitoring systems struggle to detect subtle, previously unseen behavioral anomalies, especially when labeled attack data is unavailable.
Goal: Build a data-driven system that learns normal behavior and flags anomalous user activity using unsupervised ML, exposed via a production-ready API.
- Unsupervised anomaly detection (no labels required)
- Behavioral feature engineering (event-level + user-level)
- Isolation Forest–based anomaly scoring
- Percentile-based anomaly thresholding
- REST API for real-time inference
- Fully Dockerized Linux deployment
- HTTPS-enabled reverse proxy (Nginx)
- Clean, modular, industry-grade project structure
User Events (CSV / JSON)
↓
Feature Engineering
↓
StandardScaler
↓
Isolation Forest Model
↓
Anomaly Score
↓
Thresholding Logic
↓
FastAPI Inference Service
↓
Docker Container
↓
Nginx (HTTPS)
- ~70,000 user activity events
- ~500 users
- ~21 days of activity
- ~2% embedded anomalous behavior
- Synthetic but realistic SaaS-style data
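The real generator lives in `src/utils`; as a rough illustration of how a synthetic event log with these proportions could be produced, here is a simplified stand-in (distributions and anomaly injection are assumptions, not the project's actual logic):

```python
import numpy as np
import pandas as pd

# Simplified generator matching the rough proportions quoted above
# (~70,000 events, ~500 users, ~2% embedded anomalies).
rng = np.random.default_rng(0)
n_events, n_users = 70_000, 500

events = pd.DataFrame({
    "user_id": rng.integers(0, n_users, n_events),
    "event_type": rng.choice(["login", "logout", "view", "click", "download"], n_events),
    "session_duration_sec": rng.lognormal(mean=6.0, sigma=0.5, size=n_events),
    "events_in_session": rng.poisson(8, n_events) + 1,
    "hour_of_day": rng.integers(0, 24, n_events),
    "day_of_week": rng.integers(0, 7, n_events),
    "device_type": rng.choice(["desktop", "mobile"], n_events),
})

# Embed ~2% anomalies: extreme session durations at odd hours
anomaly_idx = rng.choice(n_events, size=int(0.02 * n_events), replace=False)
events.loc[anomaly_idx, "session_duration_sec"] *= 20
events.loc[anomaly_idx, "hour_of_day"] = rng.choice([2, 3, 4], size=len(anomaly_idx))
```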
Each row represents a single user activity event:
| Field | Description |
|---|---|
| `user_id` | Unique user identifier |
| `event_type` | login, logout, view, click, download |
| `session_duration_sec` | Duration of user session (seconds) |
| `events_in_session` | Number of actions in the session |
| `hour_of_day` | 0–23 |
| `day_of_week` | 0–6 |
| `device_type` | desktop / mobile |
⚠️ No anomaly labels are used during training.
- Session duration
- Events per session
- Time-of-day
- Day-of-week
- Encoded event type
- Encoded device type
Computed per user and merged back:
- Average session duration
- Session duration variability
- Average events per session
- Typical active hours
- Session duration deviation from user baseline
- Event count deviation
- Time-of-day deviation
These features allow the model to detect contextual anomalies, not just global outliers.
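A minimal sketch of the feature-engineering steps above, assuming a pandas DataFrame with the dataset schema; the derived column names (`duration_dev`, `events_dev`, etc.) are illustrative, not the project's exact identifiers:

```python
import pandas as pd

def engineer_features(events: pd.DataFrame) -> pd.DataFrame:
    """Event-level encoding plus user-level baselines and deviation features."""
    df = events.copy()

    # Event-level: encode categoricals as integer codes
    df["event_type_enc"] = df["event_type"].astype("category").cat.codes
    df["device_type_enc"] = df["device_type"].astype("category").cat.codes

    # User-level baselines, computed per user and merged back
    baselines = df.groupby("user_id").agg(
        avg_session_duration=("session_duration_sec", "mean"),
        std_session_duration=("session_duration_sec", "std"),
        avg_events_in_session=("events_in_session", "mean"),
        typical_hour=("hour_of_day", "mean"),
    ).reset_index()
    df = df.merge(baselines, on="user_id", how="left")

    # Deviation features: how far this event sits from the user's own baseline
    eps = 1e-6
    df["duration_dev"] = (
        df["session_duration_sec"] - df["avg_session_duration"]
    ) / (df["std_session_duration"] + eps)
    df["events_dev"] = df["events_in_session"] - df["avg_events_in_session"]
    df["hour_dev"] = (df["hour_of_day"] - df["typical_hour"]).abs()
    return df
```

Because the deviations are measured against each user's own baseline, a 3 a.m. login is only unusual for users who never work at night.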
- Isolation Forest (unsupervised)
- No labeled anomalies required
- Designed for rare event detection
- Scales well to large datasets
- Widely used in industry UBA systems
- Feature scaling with `StandardScaler`
- ~200 trees
- Contamination ≈ 2%
- Model outputs continuous anomaly scores
Instead of relying directly on model labels, the system uses:
- Percentile-based thresholding on anomaly scores
- Example: top 2% most anomalous events
This mirrors real production systems, where thresholds are configurable based on operational tolerance.
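The thresholding step can be sketched as follows (function name and default are illustrative). Since `decision_function` assigns *lower* scores to more anomalous points, the "top 2% most anomalous" events sit below the 2nd percentile of the score distribution:

```python
import numpy as np

def percentile_threshold(scores: np.ndarray, percentile: float = 2.0):
    """Flag the bottom `percentile`% of anomaly scores as anomalous."""
    threshold = np.percentile(scores, percentile)
    is_anomaly = scores <= threshold
    return threshold, is_anomaly
```

Raising or lowering the percentile trades alert volume against miss rate, which is exactly the operational knob described above.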
POST /predict

Request:

```json
{
  "user_id": 123,
  "event_type": "login",
  "session_duration_sec": 18000,
  "events_in_session": 190,
  "hour_of_day": 2,
  "day_of_week": 1,
  "device_type": "desktop"
}
```

Response:

```json
{
  "anomaly_score": 0.048707,
  "is_anomaly": true,
  "threshold_percentile": 0.98
}
```

GET /health
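A hypothetical client call against the deployed endpoint, using only the standard library (the helper name `predict` and error handling are illustrative):

```python
import json
import urllib.request

# Example payload matching the dataset schema
payload = {
    "user_id": 123,
    "event_type": "login",
    "session_duration_sec": 18000,
    "events_in_session": 190,
    "hour_of_day": 2,
    "day_of_week": 1,
    "device_type": "desktop",
}

def predict(base_url: str = "https://api.alturawing.tech") -> dict:
    """POST the event to /predict and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```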
```bash
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
uvicorn api.main:app --reload
```

Open: http://127.0.0.1:8000/docs

```bash
docker build -t user-behavior-anomaly-api .
docker run -p 8000:8000 user-behavior-anomaly-api
```

Open: http://localhost:8000/docs
- Cloud Provider: DigitalOcean
- OS: Ubuntu LTS
- Containerization: Docker
- Reverse Proxy: Nginx
- TLS: Let’s Encrypt (manual DNS challenge)
- Restart Policy: `unless-stopped`
The service is exposed securely at:
https://api.alturawing.tech
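The Nginx + Let's Encrypt setup described above could look roughly like the following server block; the certificate paths and proxy directives are illustrative assumptions, not the actual droplet configuration:

```nginx
# Illustrative reverse-proxy config: terminate TLS, forward to the container
server {
    listen 443 ssl;
    server_name api.alturawing.tech;

    ssl_certificate     /etc/letsencrypt/live/api.alturawing.tech/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.alturawing.tech/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```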
user-behavior-anomaly-detection/
├── api/ # FastAPI inference service
├── artifacts/ # Trained model & scaler (ignored by Git)
├── data/ # Raw & processed datasets
├── notebooks/ # EDA & training notebooks
├── src/
│ ├── features/ # Feature engineering logic
│ ├── models/ # Training & prediction utilities
│ └── utils/ # Dataset generation
├── Dockerfile
├── requirements.txt
└── README.md
- Synthetic dataset (not real production logs)
- Single-event inference (no batch API yet)
- No real-time streaming ingestion
- Threshold calibrated offline
These are intentional design choices for clarity and focus.
- Batch inference endpoint
- User-adaptive thresholds
- Time-window aggregation
- Streaming ingestion (Kafka)
- Model monitoring & drift detection
- Multi-project hosting strategy
This project demonstrates the ability to:
- Build ML systems without labeled data
- Engineer meaningful behavioral features
- Balance ML performance with operational constraints
- Deploy models as real services
- Think beyond notebooks into production systems
It is designed to be resume-ready, interview-defensible, and extensible.
AlturaWing