Spam Text Classifier

This project implements a text classification model trained on SMS/text message data to distinguish between spam and non-spam (ham) messages. The model applies natural language preprocessing and TF-IDF vectorization before training a Support Vector Classifier to make binary predictions on unlabeled message data, outputting results to a predictions.csv file.

Capabilities

Text Preprocessing: Converts text to lowercase, removes punctuation, and strips digits to normalize input data
TF-IDF Vectorization: Converts cleaned text into numerical feature vectors using TfidfVectorizer with English stop-word removal and a 5,000-feature limit
Support Vector Classification: Trains a LinearSVC model to classify messages as spam or ham based on learned text patterns
Prediction Export: Generates a predictions.csv file with binary labels (TRUE for spam, FALSE for ham) for each test message
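The preprocessing step listed above can be sketched as follows. This is a minimal illustration, not necessarily the code in model.py: the function name clean_text is hypothetical, only ASCII punctuation is removed, and the whitespace collapsing at the end is an added assumption so that removed digits do not leave double spaces.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase the text, then remove punctuation and digits.

    Hypothetical helper: the name and the final whitespace collapse
    are assumptions, not taken from the repository.
    """
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)   # remove punctuation
    text = re.sub(r"\d+", "", text)       # strip digits
    return " ".join(text.split())         # collapse leftover whitespace


if __name__ == "__main__":
    print(clean_text("WINNER!! Call 0800-123 now, it's FREE."))
```

Applying the same function to both the training and test text keeps the vocabulary seen by the vectorizer consistent between fitting and prediction.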

Usage

Run Script (macOS Terminal)

bash run.sh

or directly:

python3 model.py

Required input files:

data_train_hw4_problem1.csv — Labeled training data with spam and text columns
data_test_hw4_problem1.csv — Unlabeled test data with text column
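Before training, it can help to confirm that the input files actually contain the expected columns. The sketch below uses only the standard library; the helper name load_rows is hypothetical, and the script in this repository may load its data differently (for example with pandas).

```python
import csv


def load_rows(path, required_columns):
    """Read a CSV file into a list of dicts and check its header.

    Hypothetical helper for illustration; raises ValueError when a
    required column is absent.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = set(required_columns) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path} is missing columns: {sorted(missing)}")
        return list(reader)
```

With the files named above, this would be called as load_rows("data_train_hw4_problem1.csv", ["spam", "text"]) for the training data and load_rows("data_test_hw4_problem1.csv", ["text"]) for the test data.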

Output:

predictions.csv — Binary classification results with spam column (TRUE or FALSE)
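The output format described above (a single spam column holding TRUE or FALSE) can be produced with a few lines of standard-library code. This is a sketch under that assumption only; the function name write_predictions is hypothetical, and the real script may include additional columns or write the file another way.

```python
import csv


def write_predictions(labels, path="predictions.csv"):
    """Write one TRUE/FALSE row per prediction under a 'spam' header.

    `labels` is any iterable of truthy/falsy values (truthy means spam).
    Hypothetical helper; not taken from the repository.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["spam"])
        for is_spam in labels:
            writer.writerow(["TRUE" if is_spam else "FALSE"])
```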

Use Cases

Personal Message Filtering

Evaluate SMS or text message datasets to identify unwanted spam messages for personal or organizational message triage.

Research on Text Classification

Serve as a baseline or reference implementation for studying linear SVM-based text classification pipelines using TF-IDF features.
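For readers who want to reproduce a comparable baseline, the pipeline described in this README (TF-IDF with English stop-word removal and a 5,000-feature cap, followed by LinearSVC) might look roughly like the sketch below. It assumes scikit-learn is installed. The function name build_classifier and the toy messages are invented for illustration, all other hyperparameters are left at library defaults, and the actual model.py may be organized differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_classifier():
    """TF-IDF features feeding a linear SVM (hypothetical helper)."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=5000),
        LinearSVC(),
    )


# Toy data invented for this example: 1 = spam, 0 = ham.
train_texts = [
    "win free prize claim cash now",
    "free cash offer click link",
    "urgent prize winner call now",
    "meeting moved to tomorrow morning",
    "can you pick up groceries tonight",
    "lunch at noon sounds good",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = build_classifier()
model.fit(train_texts, train_labels)
predictions = model.predict(["claim your free prize", "see you at lunch"])
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on training text, so the test messages are transformed with the vocabulary learned during training.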

Research Purposes

This project is intended for research use. The repository includes a dummy dataset for demonstration and testing. Course context: Penn State University (PSU), IST 557 Data Mining, Fall 2025.
