Fraud detection is a classic problem in financial environments. Here we classify transactions as fraud or not fraud using scikit-learn.
The dataset is obtained from Kaggle here: Digital Payment Fraud Detection
This project applies scikit-learn to binary classification on this dataset and covers:
- Data pre-processing and cleaning
- Exploratory data analysis and data mining
- Modelling and evaluation
- Final insights
Languages used: Python (version 3.14.3), R (version 4.5.2)
Environment: VSCode, RStudio
We followed a real-life data science workflow by separating the dataset into train and test sets before any EDA, so that the test set stands in for new transactions.
We separated our predictor variables and response variable into x and y respectively.
We created a 70/30 train/test split, stratified on y, so that the train and test data have the same proportion of fraud and not fraud.
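As a minimal sketch of this split, assuming a pandas DataFrame with a hypothetical `is_fraud` label column (the toy data and column names below are illustrative and not the Kaggle schema), the stratified 70/30 split could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 1,000 rows with roughly 7% fraud.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "transaction_amount": rng.uniform(10, 5000, size=1000),
    "is_fraud": (rng.random(1000) < 0.07).astype(int),
})

x = df.drop(columns="is_fraud")  # predictor variables
y = df["is_fraud"]               # response variable

# stratify=y keeps the fraud share (almost) identical in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, stratify=y, random_state=42
)

print(f"train fraud rate: {y_train.mean():.4f}")
print(f"test fraud rate:  {y_test.mean():.4f}")
```

Fixing `random_state` is a choice made here for reproducibility; any integer works.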
Our data was relatively clean so there was not too much pre-processing to be done.
To identify how well the data could be modeled, I started with an Exploratory Data Analysis (EDA) in R.
I utilized density plots, bar charts and histograms, Q-Q plots and scatter matrices to test the underlying assumptions of the dataset.
IP Risk Score by Fraud
This density plot shows how likely fraud and non-fraud transactions are to fall in a given IP risk score range (the probability is the area under each curve).
Our density curve can be broken down into 5 main regions/ranges:
- 0.00 ~ 0.09 : fraud and non-fraud transactions have a relatively equal share.
- 0.09 ~ 0.25 : the density of fraud cases is much larger than that of non-fraud.
- 0.25 ~ 0.48 : the density of non-fraud cases is much larger than that of fraud.
- 0.51 ~ 0.885 : the density of non-fraud cases is much larger than that of fraud.
- 0.885 ~ 1.00 : fraud cases are marginally more probable.
Categorical and numeric variables by Fraud
- We see that the hours of 02:00, 03:00, 05:00, 07:00, 08:00, 11:00, 13:00, 18:00, 19:00, and 21:00 have a higher prevalence of fraud than other times.
- Fraud seems to increase slightly after 5 attempted logins.
- Fraud does not seem to vary much with the other variables.
Account days and average transaction amount by Fraud
There is little relationship between transaction amounts and average age of accounts in days.
The fraudulent data points are spread across the entire range of both the x and y axes.

International transactions by Fraud
Local transactions tend to be fraudulent more often than international ones (6.7% vs 4.6%).

Location by Fraud
Hyderabad and Mumbai have a higher prevalence of fraud than the other locations.

To handle the data cleaning and preparation, I developed two ETL systems using custom Python classes that automate the handling of numerical and categorical variables.
These classes inherit from BaseEstimator and TransformerMixin so that they can be used inside scikit-learn pipelines.
This allowed for reproducible fit and transform operations across training and testing sets.
These classes were called ETL_numeric and ETL_categorical.
For the custom ETL pipeline I created a dedicated class to store the transformation logic. This ensured that any cleaning applied to the training data, such as handling missing values or renaming columns, was applied identically to the test data.
This is what allows us to create features based on "historical" user data (data from the train set) that we can use to determine whether transactions from the test set are suspicious.
It is implemented by calling .fit() on the train set, which builds a profile for each user in the train set. We then compare the test data (simulated incoming transactions) against these profiles to judge whether or not they are fraud.
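The fit/transform pattern described above can be sketched as follows. This is a simplified, hypothetical stand-in and not the actual ETL_numeric class: the class name, column names and the single engineered feature are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class UserProfileTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical, simplified illustration of a profile-building transformer."""

    def __init__(self, user_col="user_id", amount_col="transaction_amount"):
        self.user_col = user_col
        self.amount_col = amount_col

    def fit(self, X, y=None):
        # Learn each user's history from the training data only.
        self.user_mean_ = X.groupby(self.user_col)[self.amount_col].mean()
        # Global fallback for users that never appear in the training data.
        self.global_mean_ = X[self.amount_col].mean()
        return self

    def transform(self, X):
        X = X.copy()
        # Look up the stored profile; unseen users get the global value.
        hist_mean = X[self.user_col].map(self.user_mean_).fillna(self.global_mean_)
        X["amount_vs_history"] = X[self.amount_col] / hist_mean
        return X


train = pd.DataFrame({"user_id": [1, 1, 2],
                      "transaction_amount": [100.0, 300.0, 50.0]})
test = pd.DataFrame({"user_id": [1, 3],
                     "transaction_amount": [400.0, 150.0]})

etl = UserProfileTransformer().fit(train)  # profiles come from train only
out = etl.transform(test)                  # applied to "incoming" transactions
print(out)
```

Because the profiles are stored as fitted attributes, calling `transform` on the test set never recomputes statistics from test data.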
Feature engineering involved calculating numeric interactions such as:
Z-scores
We aim to calculate a z-score for each transaction amount. To do this we assume that transaction amounts follow a Gaussian distribution. Using the transaction amount and avg transaction amount fields we calculate z-scores with the following methodology:
We create a weighted standard deviation score made up of two parts: the user standard deviation and the group standard deviation.
Here k is an adjustable parameter that corresponds to the number of users and n is a smoothing parameter.
This gives us a standard deviation that is non-zero, so we never have an undefined z-score.
We can then get each user's z-scores with the formula:
This allows us to consider how unusual the transaction amount is both for the group the user belongs to and for the user, based on historical transactions in the train set.
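One possible way to blend the two standard deviations is sketched below. The weighting is an assumption made for illustration (a shrinkage-style average in which `n_user`, the user's number of train transactions, and a hypothetical smoothing weight `k` set the balance); it is not necessarily the weighting used in the project.

```python
def blended_zscore(amount, user_mean, user_std, group_std, n_user, k=5.0):
    """z-score of a transaction using a blended standard deviation.

    Assumed weighting (illustrative only):
        sigma = (n_user * user_std + k * group_std) / (n_user + k)
    Users with little history lean on the group; users with a long
    history lean on their own spread.
    """
    sigma = (n_user * user_std + k * group_std) / (n_user + k)
    return (amount - user_mean) / sigma


# A user with a single past transaction has user_std = 0, yet the
# z-score is still defined because the group term keeps sigma > 0.
z_new = blended_zscore(amount=900.0, user_mean=500.0, user_std=0.0,
                       group_std=200.0, n_user=1, k=5.0)
print(round(z_new, 3))
```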
Location rarity and Global rarity
Location rarity is a score that we define for each user (users in the test set), based on historical data, that measures how common (or uncommon) transactions from each of the 5 locations are.
This metric has two main interpretations:
- higher score means rarer location
- lower score means more common location
We calculate this rarity metric as:
- $\text{number of location transactions}$ - number of transactions for each user in 1 of the 5 locations.
- $\alpha$ - smoothing parameter to prevent null entries
- $\text{total number of transactions}$ - total number of transactions performed by the user in the train set
- $K$ - number of locations in the data
In addition to the user specific rarity we also have global rarities that will be applied to new users that do not appear in the training set.
- $\text{total number of transactions in location}$ - total number of transactions for every user in 1 of the 5 locations in the train set
- $\alpha$ - smoothing parameter to prevent null entries
- $\text{total number of transactions}$ - total number of transactions performed by all users in the train set
- $K$ - number of locations in the data
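The terms above can be combined in more than one way. The sketch below assumes a Laplace-smoothed frequency turned into a rarity, `1 - (count + alpha) / (total + alpha * K)`, so that a higher score means a rarer location. That formula, the function names and the placeholder location labels are assumptions for illustration only.

```python
from collections import Counter, defaultdict

LOCATIONS = ["A", "B", "C", "D", "E"]  # placeholder labels for the 5 locations


def fit_rarity(train_rows, alpha=1.0):
    """train_rows: iterable of (user_id, location) pairs from the train set."""
    k = len(LOCATIONS)
    user_counts = defaultdict(Counter)
    global_counts = Counter()
    for user, loc in train_rows:
        user_counts[user][loc] += 1
        global_counts[loc] += 1

    def rarity(counts):
        total = sum(counts.values())
        # Smoothed frequency, flipped so that rarer locations score higher.
        return {loc: 1 - (counts[loc] + alpha) / (total + alpha * k)
                for loc in LOCATIONS}

    user_rarity = {user: rarity(c) for user, c in user_counts.items()}
    return user_rarity, rarity(global_counts)


def lookup(user, loc, user_rarity, global_rarity):
    # Users missing from the train set fall back to the global rarity.
    return user_rarity.get(user, global_rarity)[loc]


rows = [(1, "A"), (1, "A"), (1, "B"), (2, "C")]
user_rarity, global_rarity = fit_rarity(rows, alpha=1.0)
print(lookup(1, "A", user_rarity, global_rarity))   # user 1's usual location
print(lookup(1, "C", user_rarity, global_rarity))   # never used by user 1
print(lookup(99, "A", user_rarity, global_rarity))  # new user, global fallback
```

The smoothing term keeps every score strictly between 0 and 1, even for locations a user has never transacted from.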
Once we applied this method to the categorical feature of location, we extended the rationale to other categorical features such as payment mode, device type and transaction type.
After we run our fit function we create a dictionary of user-specific information and global information which consists of:
- How common each location is for each user
- How common each categorical value is for each user, such as "how often has User ID 123 used mobile phones in the past?"
Once we have information about the history of each user with regard to categorical features, we can apply transformations on the test set to make predictions.
We engineer a few more features that can help with the classification problem for each user. These features include:
Login aggression
Tells us the average number of logins per day of account age, which corresponds to how aggressively a user is trying to log into their account.
- Newer accounts as well as hacked accounts are likely to have a larger number of login attempts.
- $\text{aggr smooth}$ - a smoothing factor that accounts for new accounts with an age of 0 so we do not divide by 0.
Failed login aggression
Tells us the average number of failed login attempts per day.
- A higher rate of error is likely to correlate with fraudulent cases.
- $\text{failed aggr smooth}$ - a smoothing factor that accounts for new accounts with an age of 0 so we do not divide by 0.
ATO score
Tells us the average risk score and login attempts per day of account age.
- We expect fraudulent new accounts to have a high number of logins in a short amount of time, which will lead to a low ATO score. This is referred to as our velocity metric.
- $\text{ATO smooth}$ - acts as our smoothing value to prevent division by 0.
Failure rate
Percentage of total login attempts that failed.
- We expect accounts that fraudsters have accessed to have a high failure rate.
Cost per failure
The average amount of money associated with a failed transaction.
- A higher value indicates that an account has large amounts of money being moved around per failure, which could indicate fraudulent activity.
- $\text{cpf smooth}$ - acts as our smoothing value to prevent division by 0.
IP age pressure
Ratio of IP score to account age.
- A high value corresponds to a high-risk new account that may indicate fraud.
- $\text{ip age pressure smooth}$ - acts as our smoothing value to prevent division by 0.
Transaction amount average ratio
The percentage of each user's average transaction amount that any new transaction represents.
- The higher this ratio, the greater the suspicion of fraud in that transaction.
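Several of the ratio features above can be sketched with pandas as shown below. The column names and the exact placement of each smoothing term are assumptions inferred from the descriptions, and the ATO score is left out because its formula cannot be inferred from the description alone.

```python
import pandas as pd


def add_ratio_features(df, aggr_smooth=1.0, failed_aggr_smooth=1.0,
                       cpf_smooth=1.0, ip_age_pressure_smooth=1.0):
    """Add hypothetical versions of the ratio features to a copy of df."""
    out = df.copy()
    age = out["account_age_days"]
    # Smoothing terms keep denominators non-zero for brand-new accounts.
    out["login_aggression"] = out["login_attempts"] / (age + aggr_smooth)
    out["failed_login_aggression"] = out["failed_logins"] / (age + failed_aggr_smooth)
    # Assumes login_attempts is never 0 in the data.
    out["failure_rate"] = out["failed_logins"] / out["login_attempts"]
    out["cost_per_failure"] = out["transaction_amount"] / (out["failed_logins"] + cpf_smooth)
    out["ip_age_pressure"] = out["ip_risk_score"] / (age + ip_age_pressure_smooth)
    out["amount_avg_ratio"] = out["transaction_amount"] / out["avg_transaction_amount"]
    return out


toy = pd.DataFrame({
    "account_age_days": [0, 400],
    "login_attempts": [4, 2],
    "failed_logins": [2, 0],
    "ip_risk_score": [0.8, 0.1],
    "transaction_amount": [300.0, 50.0],
    "avg_transaction_amount": [100.0, 100.0],
})
features = add_ratio_features(toy)
print(features.iloc[0])
```

In this toy data the brand-new account (age 0) gets the largest aggression and IP age pressure values, which is the behaviour the descriptions aim for.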
Finally we scale the raw numerical values and one-hot encode the categorical features into dummy variables.
We implemented OneHotEncoder and StandardScaler using ColumnTransformer to handle categorical and numerical features in a single pass.
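A minimal sketch of such a ColumnTransformer is shown below. The column names and category values are hypothetical, and `handle_unknown="ignore"` is a choice made here so that categories first seen in the test set do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["transaction_amount", "login_aggression"]
categorical_cols = ["payment_mode", "device_type"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

train = pd.DataFrame({
    "transaction_amount": [100.0, 250.0, 40.0],
    "login_aggression": [0.1, 4.0, 0.5],
    "payment_mode": ["card", "upi", "card"],
    "device_type": ["mobile", "desktop", "mobile"],
})
test = pd.DataFrame({
    "transaction_amount": [500.0],
    "login_aggression": [2.0],
    "payment_mode": ["wallet"],  # category not seen during fit
    "device_type": ["mobile"],
})

X_train = preprocess.fit_transform(train)  # learns means, scales, categories
X_test = preprocess.transform(test)        # reuses what was learned on train
print(X_train.shape, X_test.shape)
```

The output has 2 scaled numeric columns plus one dummy column per category seen in training.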
Once we finish engineering the above features, we now need to automate the modelling process to apply the fit and transform methods we designed.
This will be done using the imblearn library and the SMOTE ENN procedure.
Pipeline
We combined the custom transformers into a single pipeline so that all pre-processing steps are applied consistently during both training and inference.
This architecture prevents data leakage by encapsulating the fit() and transform() logic within a single executable object.
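The leakage argument can be illustrated with a small sketch, assuming synthetic data and using StandardScaler as a stand-in for the custom ETL classes. After fitting, the statistics held by the scaler come from the training data alone, even though the test data is deliberately shifted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the test set is shifted so that any leakage would be visible.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (rng.random(200) < 0.1).astype(int)
X_test = rng.normal(loc=5.0, size=(50, 3))

pipe = Pipeline([
    ("scale", StandardScaler()),  # stand-in for the custom transformers
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

pipe.fit(X_train, y_train)         # every step is fitted on train only
test_pred = pipe.predict(X_test)   # test data is only ever transformed

scaler_mean = pipe.named_steps["scale"].mean_
print(scaler_mean, X_train.mean(axis=0))
```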
Pipeline #5
After experimenting with multiple pipelines using Logistic regression and our ETL categorical and numeric classes, the pipeline that detected the most fraud was #5.
| Metric | Pipeline 5 |
|---|---|
| Balanced accuracy | 51.45% |
| Precision | 6.83% |
| Recall | 62.58% |
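For reference, the three metrics relate to the confusion matrix as sketched below. The counts are invented toy numbers chosen only to show how a majority of fraud can be caught while precision stays low and balanced accuracy stays near chance; they are not the project's confusion matrix.

```python
def summarize(tp, fn, fp, tn):
    """Compute the three reported metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of fraud that was caught
    specificity = tn / (tn + fp)   # share of non-fraud left alone
    precision = tp / (tp + fp)     # share of fraud alerts that were right
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
    }


# Toy counts: 100 fraud cases, 1,900 non-fraud cases.
metrics = summarize(tp=60, fn=40, fp=800, tn=1100)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```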
Insights
- We can see that the balanced accuracy is close to 50%, indicating that the model is not much better than random guessing.
- We are able to detect a majority of fraud cases.
- We accept many false positives in an effort to get a high recall, so we have a very low precision.
- Our engineered features failed login aggression and ATO score provided the strongest signal, while others like the z-score provided little predictive power.
- As the failed login aggression increases by 1 failed login per day, the chances of fraud decrease by 0.74% on average, holding other variables constant.
- As the login aggression increases by 1 login per day, the chances of fraud increase by 0.4% on average, holding other variables constant.
- As the ATO score increases by 1 per day, the chances of fraud increase by 0.34% on average, holding other variables constant.
- As the failure rate increases by 1 previous failed attempt, the chances of fraud increase by 0.4%.
Overall, our pipeline seems to have a difficult time finding a strong signal for fraud. Therefore we will investigate whether the imbalanced nature of fraud may be the cause of our struggling model.
SMOTE-ENN
Courtesy of GeeksforGeeks, there is a method we can use for highly unbalanced datasets called SMOTE.
This is a resampling technique that generates synthetic data for our minority fraud class.
It interpolates between existing data points to create completely new data points.
It helps prevent overfitting and allows models to learn patterns that predict the minority class.
We generated data points using k_neighbors = 2, i.e. each synthetic point is based on the two closest neighbors, to ensure that the fraud signal is not diluted by many non-fraud data points.
We also set n_neighbors = 1. In imbalanced-learn this parameter belongs to the ENN cleaning step, where it is the number of neighbors consulted when deciding whether to remove a sample.
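The interpolation idea behind SMOTE can be sketched in plain NumPy as below. This is a simplified illustration and not the imbalanced-learn implementation (which is available as `imblearn.combine.SMOTEENN` and also performs the ENN cleaning step); the function name and toy points are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k_neighbors=2, seed=0):
    """Create synthetic minority points by interpolating toward neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k_neighbors]

    # Pick a base point and one of its k nearest neighbors at random.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k_neighbors, size=n_new)]

    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
synthetic = smote_like_samples(X_fraud, n_new=5, k_neighbors=2)
print(synthetic)
```

A small `k_neighbors` keeps each synthetic point close to genuine fraud cases, which matches the motivation given above.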
GridSearchCV and RandomizedSearchCV
After implementing SMOTE ENN, we aim to identify which parameters help us best detect fraud, both from the custom classes we created and from the SMOTE ENN procedure. We implemented Randomized search to identify the best values of:
- our smoothing parameters such as $\text{ATO smooth}$
- the class weights parameter in our Logistic regression model
- the number of classes to be used to calculate the group-based standard deviation $\sigma_{group}$

We found the best parameters to then be:
Final Pipeline
After implementing SMOTE ENN, the Randomized search parameters, and our ETL classes, we ran the train and test sets through the resulting pipeline and found the following:
| Metric | Final Pipeline |
|---|---|
| Balanced accuracy | 50.92% |
| Precision | 6.79% |
| Recall | 45.58% |
We can see that even after implementing these procedures we have not been able to find much of a signal for fraud.
We can go back to the EDA stage and conduct further analysis to identify whether we lost any details or missed key interactions that will point us in the right direction.
Additional Exploratory Data Analysis
Q-Q Plot
We compared the transaction amounts against a theoretical normal distribution. The result was a perfectly linear relationship, indicating the data follows a Uniform Distribution U(a,b).
- The gradient of the plotted values is greater than that of our y = x trend line, which means that the variance of the sample data is greater than that of a standard normal distribution.
In real-world finance, transaction amounts are typically right-skewed (transactions of smaller amounts are more frequent than larger amounts); this perfect uniformity suggested the data was stochastically generated from a uniform distribution.
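A Q-Q plot is only straight against the reference family the data actually follows, so one way to probe a uniformity claim is to measure how straight the plot is against both a normal and a uniform reference. The sketch below does this with the correlation between sample and theoretical quantiles. The amounts are simulated here, and the helper name and plotting positions are assumptions for illustration.

```python
from statistics import NormalDist

import numpy as np


def qq_correlation(sample, dist="normal"):
    """Correlation between sorted data and theoretical quantiles.

    Values close to 1 mean the Q-Q plot is close to a straight line.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    if dist == "normal":
        q = np.array([NormalDist().inv_cdf(pi) for pi in p])
    elif dist == "uniform":
        q = p  # quantiles of U(0, 1)
    else:
        raise ValueError(f"unknown reference distribution: {dist}")
    return np.corrcoef(q, x)[0, 1]


# Simulated amounts drawn from a uniform distribution.
rng = np.random.default_rng(0)
amounts = rng.uniform(10, 5000, size=5000)

r_normal = qq_correlation(amounts, "normal")
r_uniform = qq_correlation(amounts, "uniform")
print(f"normal reference:  r = {r_normal:.4f}")
print(f"uniform reference: r = {r_uniform:.4f}")
```

For uniformly generated data the uniform reference gives the straighter line, while the normal reference bends at the tails.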
Correlation Plot
The correlation plot was meant to reveal which features may be most important, but all the features were very minimally correlated with each other.
- This suggested that the features may have been generated independently and stitched together, with fraud labels applied randomly.
Therefore, upon further inspection, this seems to have been an issue with the data and not necessarily with our modelling methods.
To be sure, I checked other submissions in the competition to see how other people were able to detect the fraud pattern.
Everyone else also seemed to struggle to find a pattern: the maximum recall I found was 52%, but precision, AUC and balanced accuracy indicated that all the models were not much better than random guessing.
| Metric | Final Pipeline |
|---|---|
| Balanced accuracy | 50.92% |
| Precision | 6.79% |
| Recall | 45.58% |
Despite the modular ETL classes and optimized pipelines, the model performance confirmed the findings of the R-based audit.
Interpretation of Output
- Signal Integrity: My custom transformers and engineering steps were unable to extract a signal because the fraud labels were stochastically independent of the features.
- Performance: Precision and Recall metrics remained consistent with a "Zero-Signal" environment.
Final Verdict
The project was a successful exercise in Adversarial Discovery.
I did learn though that no matter what model you use, the old adage stays true: "Garbage in, garbage out".
Our recall improved drastically after we engineered good features with great predictive power.
After conducting further EDA, particularly the Q-Q plots, I think there may have also been a bottleneck in data quality.
Both transaction amount fields following a uniform distribution, together with such low correlations between all features, suggested the data may have been independently generated column by column with fraud labels added later. It is highly unlikely that every transaction amount range is equally dense, as transaction amounts tend to be right-skewed or log-normal.
The main limitations were therefore:
- data quality
- highly unbalanced dataset
I did learn a lot from this project even with the underwhelming results, mainly:
- feature engineering.
- various `scikit-learn` and `imbalanced-learn` tools like Logistic regression, SMOTE, Pipelines, Grid search, Randomised search, ColumnTransformer and FunctionTransformer to create your own transformers and fitters.
- `numpy` features like arrays that can vectorise calculations.
- `pandas` functionality like groupby and getting dummies.
- basic python functionality like dictionaries.
- overall project structure.
I would say this was a very good learning opportunity and I got to grow a lot for 1 week's work. It showed that while the engineering was sound, the high Bayes Error Rate of the dataset made predictive modeling a lost cause.
Fraud-Detection
│
├──dataset
│   ├──test_data
│   ├──train_data
│   └──whole_dataset
│
├──modelling
│   ├──fraud_detection
│   └──train_test_split
│
├──visualisations
│   ├──account_age_fraud
│   ├──amounts_fraud
│   ├──amount_density_fraud
│   ├──amount_qqplot
│   ├──average_amount_density_fraud
│   ├──average_amount_qqplot
│   ├──categorical_fraud
│   ├──correlation_fraud
│   ├──final_pipeline_metrics
│   ├──international_fraud
│   ├──ip_risk_fraud
│   ├──location_fraud
│   └──r-visualisations
│
├──License
├──R.history
├──requirements
└──README
- Carry out EDA to discover which variables have the most relevance to fraud ✅
- Create custom fit and transform classes ✅
- Create custom pipelines to automate the modelling process ✅
- Create custom column transformer to apply one hot encoding and feature scaling ✅
- Use qq-plots to look into distribution of variables ✅