mrhashx/text-classification

Text Classification and Topic Categorization on 20Newsgroups Dataset using NLP

This repository contains an end-to-end Natural Language Processing (NLP) and machine learning pipeline that classifies raw text documents into 20 newsgroup topics. Using the classic 20Newsgroups dataset, it benchmarks a probabilistic classifier (Multinomial Naive Bayes), a linear classifier (Logistic Regression), and a maximum-margin classifier (Linear SVM).

The system incorporates rigorous text cleaning, sublinear term-frequency optimization, bigram feature extraction, and high-dimensional sparse matrix evaluation.


🚀 Pipeline Architecture & Methodology

1. Exploratory Data Analysis (EDA)

  • Analyzed dataset distribution across 20 distinct categories, verifying class balance.
  • Visualized document lengths (word count distributions) before and after preprocessing, which revealed a heavily right-skewed length distribution: most posts are short, with a long tail of very long ones, as is typical of real-world communications.
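
The two checks above can be sketched with the standard library alone. This is a minimal illustration on made-up toy posts, not the repository's EDA code; the `summarize` helper and its field names are hypothetical. A mean word count above the median is a quick signal of right skew.

```python
from collections import Counter
from statistics import mean, median


def summarize(texts, labels):
    """Return class counts and basic word-count statistics (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = [len(t.split()) for t in texts]
    return {
        "class_counts": dict(class_counts),
        # Ratio of largest to smallest class; 1.0 means perfectly balanced.
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
        "mean_words": mean(word_counts),
        "median_words": median(word_counts),
    }


# Toy stand-in data, not the real 20Newsgroups posts.
texts = [
    "the puck went in",
    "great save by the goalie",
    "the shuttle launch was delayed again because of weather at the cape and a sensor fault",
    "orbit achieved",
]
labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.space", "sci.space"]

stats = summarize(texts, labels)
print(stats)
# A mean above the median suggests a right-skewed length distribution.
print("right-skewed:", stats["mean_words"] > stats["median_words"])
```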

2. Advanced NLP Preprocessing

To handle noisy and unformatted string inputs, an optimized textual normalization pipeline was built using:

  • Case Normalization: Universal conversion to lowercase to ensure feature uniformity.
  • Noise Filtering: Regex-based removal of numeric tokens, plus punctuation stripping with a string translation table, which is fast because it runs in a single pass.
  • Stopword Pruning: Elimination of high-frequency non-informative English tokens via NLTK.
  • Morphological Stemming: Applied the Porter Stemmer to reduce inflected or derived words to a common stem (which is not always a dictionary word), shrinking the vocabulary.
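
The four steps above could be combined roughly as follows. This is a sketch, not the repository's code: the `preprocess` function is hypothetical, and the tiny inline `STOPWORDS` set is an assumption standing in for NLTK's English stopword list so that the example runs without downloading NLTK data. `PorterStemmer` is the real NLTK class.

```python
import re
import string

from nltk.stem import PorterStemmer

# Tiny illustrative stopword set (assumption); the pipeline described above
# uses NLTK's full English stopword list instead.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "of", "and", "to"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # deletes punctuation
_DIGITS = re.compile(r"\d+")
_stemmer = PorterStemmer()


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, drop digits and punctuation, remove stopwords, then stem."""
    text = text.lower()
    text = _DIGITS.sub(" ", text)
    text = text.translate(_PUNCT_TABLE)
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(_stemmer.stem(t) for t in tokens)


print(preprocess("The 3 Organizations are writing 42 lines!"))
```

Note that the stems are not always dictionary words: "organizations" reduces to "organ".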

3. Feature Engineering: Sublinear TF-IDF

Text strings were converted into structured numerical vectors using a highly tuned TfidfVectorizer:

  • N-gram Range: Set to (1, 2) to capture both single words (unigrams) and consecutive word pairs (bigrams), preserving local contextual semantics.
  • Vocabulary Constraint: Capped at 20,000 maximum features alongside a minimum document frequency (min_df=3) to discard rare noise tokens.
  • Sublinear TF Scaling: Applied $1 + \log(tf)$ instead of the raw term frequency $tf$, so that a term repeated many times within a document (common in long documents) does not dominate that document's feature vector.
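
A vectorizer with the settings listed above might be configured like this. The parameter values come from the list; the toy corpus is invented for illustration, and any other settings the repository may use are not shown. All other arguments are left at scikit-learn defaults, which is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=20000,   # keep at most the 20,000 most frequent terms
    min_df=3,             # drop terms seen in fewer than 3 documents
    sublinear_tf=True,    # use 1 + log(tf) instead of raw tf
)

# Toy corpus (not the real dataset), built so a few terms reach min_df=3.
docs = [
    "space shuttle launch",
    "space shuttle orbit",
    "space shuttle space shuttle landing",
    "hockey game tonight",
]
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)

# Effect of the sublinear transform on a raw count (natural log):
for tf in (1, 10, 100):
    print(tf, round(1 + np.log(tf), 3))
```

With the natural log, a term occurring 100 times is weighted about 5.6 times as much as a term occurring once, rather than 100 times as much.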

📊 Performance Benchmarking & Evaluation

The processed feature matrix was evaluated using three architectures: Multinomial Naive Bayes (NB), Logistic Regression (LR), and Linear Support Vector Machine (Linear SVM).
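
A comparison of this kind could be structured as in the sketch below. The toy training and test texts are invented, the hyperparameters are scikit-learn defaults (apart from `max_iter`), and the scores it prints say nothing about the results reported for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data (assumption), standing in for the preprocessed newsgroup posts.
train_texts = [
    "hockey puck goalie ice",
    "goalie saves puck hockey game",
    "ice hockey playoffs team",
    "shuttle orbit launch nasa",
    "nasa rocket orbit moon",
    "moon launch shuttle mission",
]
train_labels = ["hockey", "hockey", "hockey", "space", "space", "space"]
test_texts = ["hockey team on the ice", "nasa shuttle to the moon"]
test_labels = ["hockey", "space"]

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "weighted_f1": f1_score(test_labels, pred, average="weighted"),
    }

for name, scores in results.items():
    print(name, scores)
```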

Model Metrics Summary

  • Linear SVM performed best, achieving an accuracy of 85.34%, with weighted F1, precision, and recall at similar levels.
  • Logistic Regression followed closely behind, showing that a log-loss linear model also handles the high-dimensional sparse feature space well.
  • Multinomial Naive Bayes trained quickly and served as a solid probabilistic baseline.

[Figure: Model Evaluation Metrics]


📈 Visualizations & Error Analysis

1. Document Length & Preprocessing Density

The length distributions show how stopword removal and the other preprocessing steps shortened documents and narrowed the spread of document lengths, giving the classifiers cleaner and more uniform input.

[Figures: Document Length Distribution; Preprocessing KDE Effect]

2. High-Frequency Keywords & Confusion Matrix

  • Feature importance analysis via TF-IDF weight summation highlights dominant global tokens such as subject, line, organ, and write. These are most likely the stems of "Subject", "Lines", "Organization", and "writes", which appear in newsgroup headers and quoted replies rather than in topic-specific content.
  • The confusion matrix for the top-performing Linear SVM shows high accuracy on clearly distinct classes (e.g., rec.sport.hockey, sci.space) and some confusion between semantically close classes (such as comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware).
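
Finding the most-confused pair of classes from a confusion matrix can be sketched as follows. The label lists are invented toy predictions, not the model's actual output; `confusion_matrix` is the real scikit-learn function, with rows as true classes and columns as predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

IBM = "comp.sys.ibm.pc.hardware"
MAC = "comp.sys.mac.hardware"
HOCKEY = "rec.sport.hockey"
labels = [IBM, MAC, HOCKEY]

# Toy ground truth and predictions (assumption, for illustration only).
y_true = [IBM, IBM, IBM, IBM, MAC, MAC, MAC, HOCKEY, HOCKEY, HOCKEY]
y_pred = [IBM, IBM, MAC, MAC, MAC, MAC, IBM, HOCKEY, HOCKEY, HOCKEY]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Zero the diagonal so only misclassifications remain, then take the maximum.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
most_confused = (labels[i], labels[j], int(off_diag[i, j]))
print("most confused (true, predicted, count):", most_confused)
```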

[Figures: Top Keywords TF-IDF; Linear SVM Confusion Matrix]


💻 How to Run This Project

Follow these steps to clone the repository and execute the classification scripts locally:

1. Clone the Repository

git clone https://github.com/mrhashx/text-classification-20newsgroups.git
cd text-classification-20newsgroups

2. Install Required Frameworks

pip install numpy pandas matplotlib seaborn nltk scikit-learn

3. Run the Classifiers

To execute the script containing full exploratory plots and matrix visualizations:

python news_classifier_fixed_split.py

To execute the speed-optimized pipeline showcasing progress bars and automated text splitting:

python news_classifier_random_split.py

🛠️ Tech Stack & Libraries

Language: Python 3.x

Core Analytics: Pandas, NumPy

NLP Suite: NLTK (Stopwords, PorterStemmer)

Machine Learning Suite: Scikit-Learn (TfidfVectorizer, LinearSVC, LogisticRegression, MultinomialNB)

Visuals: Matplotlib, Seaborn
