This project implements an end-to-end, production-style User Behavior Anomaly Detection (UBA) system using unsupervised machine learning.
The system learns normal user behavior patterns from historical activity data and identifies anomalous user events that deviate significantly from learned baselines — without relying on labeled anomaly data.
It is designed and implemented as a deployable inference service, not just a notebook experiment.
Live Deployment:
- API Base URL: https://api.alturawing.tech
- Swagger Docs: https://api.alturawing.tech/docs
Modern digital systems generate continuous streams of user activity events such as logins, clicks, session durations, and access times. Rule-based monitoring systems struggle to detect subtle, previously unseen behavioral anomalies, especially when labeled attack data is unavailable.
Goal: Build a data-driven system that learns normal behavior and flags anomalous user activity using unsupervised ML, exposed via a production-ready API.
- Unsupervised anomaly detection (no labels required)
- Behavioral feature engineering (event-level + user-level)
- Isolation Forest–based anomaly scoring
- Percentile-based anomaly thresholding
- REST API for real-time inference
- Fully Dockerized Linux deployment
- HTTPS-enabled reverse proxy (Nginx)
- Clean, modular, industry-grade project structure
User Events (CSV / JSON)
↓
Feature Engineering
↓
StandardScaler
↓
Isolation Forest Model
↓
Anomaly Score
↓
Thresholding Logic
↓
FastAPI Inference Service
↓
Docker Container
↓
Nginx (HTTPS)
- ~70,000 user activity events
- ~500 users
- ~21 days of activity
- ~2% embedded anomalous behavior
- Synthetic but realistic SaaS-style data
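The real generator lives in `src/utils`; as a rough illustration of how a synthetic event log with these proportions could be produced, here is a simplified stand-in (distributions and anomaly injection are assumptions, not the project's actual logic):

```python
import numpy as np
import pandas as pd

# Simplified generator matching the rough proportions quoted above
# (~70,000 events, ~500 users, ~2% embedded anomalies).
rng = np.random.default_rng(0)
n_events, n_users = 70_000, 500

events = pd.DataFrame({
    "user_id": rng.integers(0, n_users, n_events),
    "event_type": rng.choice(["login", "logout", "view", "click", "download"], n_events),
    "session_duration_sec": rng.lognormal(mean=6.0, sigma=0.5, size=n_events),
    "events_in_session": rng.poisson(8, n_events) + 1,
    "hour_of_day": rng.integers(0, 24, n_events),
    "day_of_week": rng.integers(0, 7, n_events),
    "device_type": rng.choice(["desktop", "mobile"], n_events),
})

# Embed ~2% anomalies: extreme session durations at odd hours
anomaly_idx = rng.choice(n_events, size=int(0.02 * n_events), replace=False)
events.loc[anomaly_idx, "session_duration_sec"] *= 20
events.loc[anomaly_idx, "hour_of_day"] = rng.choice([2, 3, 4], size=len(anomaly_idx))
```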
Each row represents a single user activity event:
| Field | Description |
|---|---|
| `user_id` | Unique user identifier |
| `event_type` | login, logout, view, click, download |
| `session_duration_sec` | Duration of user session (seconds) |
| `events_in_session` | Number of actions in the session |
| `hour_of_day` | 0–23 |
| `day_of_week` | 0–6 |
| `device_type` | desktop / mobile |
⚠️ No anomaly labels are used during training.
- Session duration
- Events per session
- Time-of-day
- Day-of-week
- Encoded event type
- Encoded device type
Computed per user and merged back:
- Average session duration
- Session duration variability
- Average events per session
- Typical active hours
- Session duration deviation from user baseline
- Event count deviation
- Time-of-day deviation
These features allow the model to detect contextual anomalies, not just global outliers.
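A minimal sketch of the feature-engineering steps above, assuming a pandas DataFrame with the dataset schema; the derived column names (`duration_dev`, `events_dev`, etc.) are illustrative, not the project's exact identifiers:

```python
import pandas as pd

def engineer_features(events: pd.DataFrame) -> pd.DataFrame:
    """Event-level encoding plus user-level baselines and deviation features."""
    df = events.copy()

    # Event-level: encode categoricals as integer codes
    df["event_type_enc"] = df["event_type"].astype("category").cat.codes
    df["device_type_enc"] = df["device_type"].astype("category").cat.codes

    # User-level baselines, computed per user and merged back
    baselines = df.groupby("user_id").agg(
        avg_session_duration=("session_duration_sec", "mean"),
        std_session_duration=("session_duration_sec", "std"),
        avg_events_in_session=("events_in_session", "mean"),
        typical_hour=("hour_of_day", "mean"),
    ).reset_index()
    df = df.merge(baselines, on="user_id", how="left")

    # Deviation features: how far this event sits from the user's own baseline
    eps = 1e-6
    df["duration_dev"] = (
        df["session_duration_sec"] - df["avg_session_duration"]
    ) / (df["std_session_duration"] + eps)
    df["events_dev"] = df["events_in_session"] - df["avg_events_in_session"]
    df["hour_dev"] = (df["hour_of_day"] - df["typical_hour"]).abs()
    return df
```

Because the deviations are measured against each user's own baseline, a 3 a.m. login is only unusual for users who never work at night.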
- Isolation Forest (unsupervised)
- No labeled anomalies required
- Designed for rare event detection
- Scales well to large datasets
- Widely used in industry UBA systems
- Feature scaling with `StandardScaler`
- ~200 trees
- Contamination ≈ 2%
- Model outputs continuous anomaly scores
Instead of relying directly on model labels, the system uses:
- Percentile-based thresholding on anomaly scores
- Example: top 2% most anomalous events
This mirrors real production systems, where thresholds are configurable based on operational tolerance.
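The thresholding step can be sketched as follows (function name and default are illustrative). Since `decision_function` assigns *lower* scores to more anomalous points, the "top 2% most anomalous" events sit below the 2nd percentile of the score distribution:

```python
import numpy as np

def percentile_threshold(scores: np.ndarray, percentile: float = 2.0):
    """Flag the bottom `percentile`% of anomaly scores as anomalous."""
    threshold = np.percentile(scores, percentile)
    is_anomaly = scores <= threshold
    return threshold, is_anomaly
```

Raising or lowering the percentile trades alert volume against miss rate, which is exactly the operational knob described above.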
POST /predict

Request:

```json
{
  "user_id": 123,
  "event_type": "login",
  "session_duration_sec": 18000,
  "events_in_session": 190,
  "hour_of_day": 2,
  "day_of_week": 1,
  "device_type": "desktop"
}
```

Response:

```json
{
  "anomaly_score": 0.048707,
  "is_anomaly": true,
  "threshold_percentile": 0.98
}
```

GET /health
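A hypothetical client call against the deployed endpoint, using only the standard library (the helper name `predict` and error handling are illustrative):

```python
import json
import urllib.request

# Example payload matching the dataset schema
payload = {
    "user_id": 123,
    "event_type": "login",
    "session_duration_sec": 18000,
    "events_in_session": 190,
    "hour_of_day": 2,
    "day_of_week": 1,
    "device_type": "desktop",
}

def predict(base_url: str = "https://api.alturawing.tech") -> dict:
    """POST the event to /predict and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```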
```bash
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
uvicorn api.main:app --reload
```

Open: http://127.0.0.1:8000/docs

```bash
docker build -t user-behavior-anomaly-api .
docker run -p 8000:8000 user-behavior-anomaly-api
```

Open: http://localhost:8000/docs
- Cloud Provider: DigitalOcean
- OS: Ubuntu LTS
- Containerization: Docker
- Reverse Proxy: Nginx
- TLS: Let’s Encrypt (manual DNS challenge)
- Restart Policy: `unless-stopped`
The service is exposed securely at:
https://api.alturawing.tech
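The Nginx + Let's Encrypt setup described above could look roughly like the following server block; the certificate paths and proxy directives are illustrative assumptions, not the actual droplet configuration:

```nginx
# Illustrative reverse-proxy config: terminate TLS, forward to the container
server {
    listen 443 ssl;
    server_name api.alturawing.tech;

    ssl_certificate     /etc/letsencrypt/live/api.alturawing.tech/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.alturawing.tech/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```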
user-behavior-anomaly-detection/
├── api/ # FastAPI inference service
├── artifacts/ # Trained model & scaler (ignored by Git)
├── data/ # Raw & processed datasets
├── notebooks/ # EDA & training notebooks
├── src/
│ ├── features/ # Feature engineering logic
│ ├── models/ # Training & prediction utilities
│ └── utils/ # Dataset generation
├── Dockerfile
├── requirements.txt
└── README.md
- Synthetic dataset (not real production logs)
- Single-event inference (no batch API yet)
- No real-time streaming ingestion
- Threshold calibrated offline
These are intentional design choices for clarity and focus.
- Batch inference endpoint
- User-adaptive thresholds
- Time-window aggregation
- Streaming ingestion (Kafka)
- Model monitoring & drift detection
- Multi-project hosting strategy
This project demonstrates the ability to:
- Build ML systems without labeled data
- Engineer meaningful behavioral features
- Balance ML performance with operational constraints
- Deploy models as real services
- Think beyond notebooks into production systems
It is designed to be resume-ready, interview-defensible, and extensible.
AlturaWing