Israelamat/isrmltoolkit


🚀 isrmltoolkit: Outlier Detection & Treatment for ML

isrmltoolkit is a Python toolkit designed to detect, handle, and visualize outliers in real-world datasets. Built for data scientists and ML practitioners, it combines statistical methods and machine learning algorithms into a simple, unified workflow.

  • ✅ Clean your data
  • ✅ Improve model performance
  • ✅ Understand anomalies

🧠 Core Methods

📊 1. Interquartile Range (IQR)

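A robust statistical method based on the spread of the middle 50% of the data. Because the quartiles barely move when a few extreme values are present, the resulting bounds are resistant to the outliers themselves.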

$$IQR = Q_3 - Q_1$$
$$\text{Lower Bound} = Q_1 - 1.5 \cdot IQR$$
$$\text{Upper Bound} = Q_3 + 1.5 \cdot IQR$$

👉 Values outside these bounds are flagged as outliers.
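The bounds above can be sketched with the Python standard library alone. This is a minimal illustration, not isrmltoolkit's own API: the function names `iqr_bounds` and `iqr_outliers` are hypothetical, and the quartiles are assumed to use linear interpolation (`method="inclusive"`).

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a list of numbers."""
    # quantiles(n=4) returns the three cut points [Q1, median, Q3].
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 95]
lower, upper = iqr_bounds(data)
outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Here the value 95 lies far above the upper fence, while the fences themselves are determined by the bulk of the data.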

📏 2. Z-Score

Measures how far a value deviates from the mean in terms of standard deviations.

$$Z = \frac{X - \mu}{\sigma}$$

  • X: Observed value
  • $\mu$: Mean
  • $\sigma$: Standard deviation

👉 Typically, $|Z| > 3$ indicates an outlier.
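As a rough sketch of the rule above, the following uses only the standard library. The function name `zscore_outliers` is hypothetical (not taken from isrmltoolkit), and the population standard deviation is assumed for $\sigma$.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs((v - mu) / sigma) > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
        11, 10, 9, 10, 11, 10, 10, 9, 11, 10, 50]
print(zscore_outliers(data))
```

Note that the outlier inflates both the mean and the standard deviation it is measured against. With a population standard deviation, no point in a sample of 10 or fewer values can reach $|Z| > 3$, so the threshold only makes sense for larger samples.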

🧮 3. Mahalanobis Distance

Captures multivariate outliers by considering feature correlations.

$$D^2 = (X - \mu)^T \Sigma^{-1} (X - \mu)$$

  • X: Observation vector
  • $\mu$: Mean vector
  • $\Sigma^{-1}$: Inverse covariance matrix

👉 High distance values indicate anomalous observations in multi-dimensional space.
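To make the role of the covariance matrix concrete, here is a small two-feature sketch in plain Python. It is an illustration under stated assumptions rather than the toolkit's implementation: the function name `mahalanobis_sq_2d` is hypothetical, the sample covariance (dividing by n - 1) is assumed, and the 2x2 inverse is written out by hand.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d2 = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) . inverse(covariance) . (dx, dy)^T, expanded for 2x2.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

# Nine points close to the line y = x, plus (3, 7), which breaks the correlation.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
          (6, 5.8), (7, 7.2), (8, 7.9), (9, 9.1), (3, 7)]
d2 = mahalanobis_sq_2d(points)
most_anomalous = max(range(len(points)), key=lambda i: d2[i])
print(points[most_anomalous])
```

The point (3, 7) is unremarkable on each axis taken separately, since both coordinates sit inside the observed ranges. It stands out only because it contradicts the strong correlation between the two features, which is exactly what a per-feature method such as IQR or Z-score would miss.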

🌲 4. Isolation Forest

A tree-based ML algorithm that isolates anomalies instead of profiling normal data.

  • Uses random feature splits.
  • Fewer splits $\rightarrow$ higher anomaly likelihood.
  • 👉 Highly efficient for high-dimensional datasets.
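The "fewer splits" intuition can be illustrated with a deliberately simplified sketch in plain Python. This is not the full Isolation Forest algorithm (it omits subsampling and the normalized anomaly score), and the function names are hypothetical. For real work, a maintained implementation such as scikit-learn's `sklearn.ensemble.IsolationForest` is the safer choice.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary within `data` can be split on.
    candidates = [f for f in range(len(point))
                  if min(r[f] for r in data) != max(r[f] for r in data)]
    if not candidates:
        return depth
    feature = rng.choice(candidates)
    lo = min(r[feature] for r in data)
    hi = max(r[feature] for r in data)
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as `point`.
    if point[feature] < split:
        side = [r for r in data if r[feature] < split]
    else:
        side = [r for r in data if r[feature] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# A 5x5 grid of "normal" points plus one point far away from the grid.
grid = [(x, y) for x in range(5) for y in range(5)]
outlier = (20, 20)
data = grid + [outlier]

outlier_depth = mean_isolation_depth(outlier, data)
inlier_depth = mean_isolation_depth((2, 2), data)
print(outlier_depth, inlier_depth)
```

The distant point is usually cut off by the very first split, while a point in the middle of the grid needs at least four splits before it is alone, so its average depth is noticeably larger.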

🔧 5. Winsorization

Reduces the impact of extreme values without removing them by capping them at specific percentiles.

$$X' = \begin{cases} P_{low} & \text{if } X < P_{low} \\ X & \text{if } P_{low} \le X \le P_{high} \\ P_{high} & \text{if } X > P_{high} \end{cases}$$

👉 Useful for stabilizing distributions and improving model robustness.
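The piecewise rule above amounts to clipping each value to the interval $[P_{low}, P_{high}]$. Below is a minimal standard-library sketch. The function name `winsorize` and its integer percentile parameters are assumptions made for illustration and are not necessarily how isrmltoolkit exposes this step.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values at the given percentiles (integers from 1 to 99)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    p_low, p_high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Clip each value into [p_low, p_high]; nothing is removed.
    return [min(max(v, p_low), p_high) for v in values]

data = list(range(1, 21)) + [500]
capped = winsorize(data)
print(capped)
```

The extreme value 500 is pulled down to the 95th percentile and the smallest value is pulled up to the 5th, while the number of observations stays the same.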


⚡ Why isrmltoolkit?

  • 🧠 Multiple Strategies: Diverse outlier detection methods in one place.
  • ⚡ Real-world Ready: Designed for noisy, complex datasets.
  • 📊 Pipeline Friendly: Built for both exploratory analysis and preprocessing.
  • 🚀 Evolving: Actively updated with new features and algorithms.

🚀 Installation

pip install isrmltoolkit
