This project detects malware using several machine learning techniques: Logistic Regression, k-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), and Random Forest. The goal is to analyze a dataset (Malware_dataset.csv) and build models that predict, from a software sample's features, whether it is malicious or benign.
- Project Overview
- Technologies Used
- Dataset
- Models Implemented
- Feature Extraction
- Model Evaluation
- Usage Instructions
- Dependencies Installation
- References
- Programming Languages: Python
- Libraries:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- tensorflow (for ANN and CNN)
- keras
- Tools:
- Jupyter Notebook for interactive development
- GitHub for version control
The dataset used in this project is Malware_dataset.csv. It contains attributes of software samples, such as byte sequences and file characteristics, together with a label indicating whether each sample is benign or malicious.
- Dataset Source: Kaggle (or specify your dataset source here).
- Features:
- Features might include information like file size, byte-level data, execution time, and others.
- The label indicates whether a software is benign or malicious.
Logistic Regression is a linear model commonly used for binary classification. It outputs the probability that a sample belongs to the positive class (here, malicious), and the class label is obtained by thresholding that probability.
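The probability output can be illustrated with a minimal scikit-learn sketch. The columns of Malware_dataset.csv are not listed here, so the data below is synthetic (generated with `make_classification`) and the variable names are illustrative assumptions, not the project's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and benign/malicious labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(label == 1).
proba_malicious = clf.predict_proba(X_test)[:, 1]

# A 0.5 threshold corresponds to the default decision rule of predict().
y_pred = (proba_malicious > 0.5).astype(int)
```

Lowering the threshold would flag more samples as malicious, trading precision for recall.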
KNN is a simple algorithm that classifies a data point by the majority class among its k nearest neighbors in feature space.
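The majority-vote rule can be sketched in a few lines of plain Python. This is an illustration of the idea only: the function name `knn_predict` and the toy points are made up, and in practice scikit-learn's `KNeighborsClassifier` would likely be used instead.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors (not taken from Malware_dataset.csv).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # benign
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # malicious
```

Because the vote depends on raw distances, features on very different scales should usually be scaled before KNN is applied.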
An ANN is a model built from layers of interconnected units, loosely inspired by the structure of the human brain. It can learn non-linear relationships in high-dimensional inputs, including tabular feature vectors as well as images or sequence data.
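A small fully connected network for binary classification might look like the following Keras sketch. The layer sizes, the 20-feature input, and the synthetic data are assumptions chosen for illustration; they are not taken from the project.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data: 200 samples, 20 numeric features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

ann_model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single output in [0, 1]
])
ann_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A few epochs are enough to show the workflow; real training needs tuning.
ann_model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)

# Each row is the predicted probability of the positive class.
ann_probabilities = ann_model.predict(X[:5], verbose=0)
```

The sigmoid output paired with binary cross-entropy is the usual choice for a two-class problem such as benign versus malicious.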
CNNs are a class of deep learning models commonly used for image classification. They are also applied to sequence data, such as time series or the byte sequences used in malware detection.
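For sequence-like input, a one-dimensional CNN is a common starting point. The sketch below assumes each sample is a length-64 sequence with one channel; that shape, the layer sizes, and the random data are illustrative assumptions rather than the project's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data: 200 sequences of length 64 with a single channel.
rng = np.random.default_rng(0)
X_flat = rng.normal(size=(200, 64)).astype("float32")
y = (X_flat[:, :8].sum(axis=1) > 0).astype("float32")

# Conv1D expects input of shape (samples, steps, channels).
X_seq = X_flat[..., np.newaxis]

cnn_model = keras.Sequential([
    keras.Input(shape=(64, 1)),
    layers.Conv1D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),  # collapses the sequence axis
    layers.Dense(1, activation="sigmoid"),
])
cnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn_model.fit(X_seq, y, epochs=3, batch_size=32, verbose=0)

cnn_probabilities = cnn_model.predict(X_seq[:5], verbose=0)
```

Convolutions only help when neighbouring positions are related, as in byte sequences; for unordered tabular columns a dense network is usually the more natural fit.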
Random Forest is an ensemble learning technique that combines the predictions of many decision trees to improve accuracy and reduce overfitting.
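A minimal scikit-learn sketch of the ensemble follows. The data is synthetic and the hyperparameters (100 trees, fixed seed) are assumptions for illustration, not tuned values from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# The fitted ensemble exposes its individual decision trees.
n_trees = len(forest.estimators_)

# Mean accuracy on the held-out split.
test_accuracy = forest.score(X_test, y_test)

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_
```

The feature importances give a rough view of which inputs the trees rely on, which can help when deciding which features to keep.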
Feature extraction is the process of transforming raw data into a format the machine learning models can use. Common techniques include scaling and normalizing numeric values and encoding categorical values as numbers.
Model performance is evaluated using various metrics such as:
- Accuracy: Measures the percentage of correctly classified samples.
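- Precision and Recall: Precision is the fraction of samples predicted as positive (here, malicious) that are actually positive. Recall is the fraction of actually positive samples that the classifier detects.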
- F1-Score: The harmonic mean of precision and recall.
- Confusion Matrix: Provides a detailed breakdown of the classifier's predictions into true positives, true negatives, false positives, and false negatives.
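The metrics above all follow from the four confusion-matrix counts, as this plain-Python sketch shows. The function names and the example labels are illustrative assumptions; scikit-learn's `sklearn.metrics` module offers equivalent functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the malicious class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))        # (3, 1, 5, 1)
print(classification_metrics(y_true, y_pred))  # accuracy 0.8; precision, recall, f1 all 0.75
```

For malware detection a false negative (missed malware) is often costlier than a false positive, so recall deserves attention alongside accuracy.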
