This project implements a text classification model trained on SMS/text message data to distinguish between spam and non-spam (ham) messages. The model applies natural language preprocessing and TF-IDF vectorization before training a Support Vector Classifier to make binary predictions on unlabeled message data, outputting results to a predictions.csv file.
- Text Preprocessing: Converts text to lowercase, removes punctuation, and strips digits to normalize input data
- TF-IDF Vectorization: Converts cleaned text into numerical feature vectors using `TfidfVectorizer` with English stop-word removal and a 5,000-feature limit
- Support Vector Classification: Trains a `LinearSVC` model to classify messages as spam or ham based on learned text patterns
- Prediction Export: Generates a `predictions.csv` file with binary labels (`TRUE` for spam, `FALSE` for ham) for each test message
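
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `model.py`: the helper names `clean_text` and `train_model`, the `random_state` value, and the toy messages are all assumptions made for the example. Only the lowercase/punctuation/digit cleaning, the `TfidfVectorizer` settings (English stop words, 5,000 features), and the use of `LinearSVC` come from the description.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_text(text: str) -> str:
    """Lowercase the text, then remove ASCII punctuation and digits."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\d+", "", text)


def train_model(texts, labels):
    """Fit TF-IDF features and a linear SVM on cleaned messages (hypothetical helper)."""
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=5000),
        LinearSVC(random_state=0),  # fixed seed is an assumption, for repeatability
    )
    model.fit([clean_text(t) for t in texts], labels)
    return model


# Toy data for illustration only (1 = spam, 0 = ham); not the project's dataset.
train_texts = [
    "WIN a FREE prize now, call 0800!",
    "Free cash prize, claim now!!!",
    "See you at lunch tomorrow",
    "Can you send the meeting notes?",
]
train_labels = [1, 1, 0, 0]

model = train_model(train_texts, train_labels)
prediction = model.predict([clean_text("Claim your free prize")])
print(prediction)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary learned on the training set tied to the classifier, so test messages are transformed with the same features.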
Run the pipeline with `bash run.sh`, or call the script directly with `python3 model.py`.

Required input files:
- `data_train_hw4_problem1.csv`: Labeled training data with `spam` and `text` columns
- `data_test_hw4_problem1.csv`: Unlabeled test data with a `text` column
Output:
- `predictions.csv`: Binary classification results with a `spam` column (`TRUE` or `FALSE`)
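
As a rough sketch of the output format, the snippet below writes a single `spam` column containing `TRUE` or `FALSE` for each prediction. The function name `write_predictions` is hypothetical, and the assumption that the file holds only this one column (with no ID or text column) is inferred from the description rather than confirmed by the project code.

```python
import csv
import io


def write_predictions(predictions, handle):
    """Write one TRUE/FALSE row per prediction under a 'spam' header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["spam"])
    for is_spam in predictions:
        writer.writerow(["TRUE" if is_spam else "FALSE"])


# Write to an in-memory buffer here; a real run would open predictions.csv instead.
buffer = io.StringIO()
write_predictions([True, False, True], buffer)
print(buffer.getvalue())
```

To produce the file on disk, the same function can be called with `open("predictions.csv", "w", newline="")` as the handle.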
Evaluate SMS or text message datasets to identify unwanted spam messages for personal or organizational message triage.
Serve as a baseline or reference implementation for studying linear SVM-based text classification pipelines using TF-IDF features.
Designed for research purposes. The repository includes a dummy dataset for demonstration and testing purposes. Penn State University (PSU), IST 557 Data Mining. Fall 2025.