Text Classification and Topic Categorization on 20Newsgroups Dataset using NLP

This repository contains an end-to-end Natural Language Processing (NLP) and machine learning pipeline that classifies raw text documents into 20 newsgroup topics. Using the classic 20Newsgroups dataset, it benchmarks a probabilistic classifier (Multinomial Naive Bayes), a linear classifier (Logistic Regression), and a maximum-margin classifier (Linear SVM).

The pipeline covers text cleaning, sublinear term-frequency scaling, bigram feature extraction, and evaluation on a high-dimensional sparse feature matrix.

🚀 Pipeline Architecture & Methodology

1. Exploratory Data Analysis (EDA)

Analyzed the document distribution across the 20 categories to verify class balance.
Visualized document lengths (word-count distributions) before and after preprocessing, which revealed a heavily right-skewed length distribution typical of real-world communications.
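
A minimal sketch of the document-length check, assuming the documents are available as a list of strings. The function name `length_summary` and the toy `docs` list are hypothetical and are not taken from the repository's scripts. It counts whitespace-separated words per document and compares the mean to the median, which is a quick indicator of right skew.

```python
import numpy as np


def length_summary(docs):
    """Return basic word-count statistics for a list of text documents."""
    counts = np.array([len(doc.split()) for doc in docs])
    return {
        "mean": float(counts.mean()),
        "median": float(np.median(counts)),
        "max": int(counts.max()),
        # A mean well above the median suggests a right-skewed distribution.
        "right_skewed": bool(counts.mean() > np.median(counts)),
    }


# Toy stand-in for the real corpus: mostly short posts plus one long one.
docs = ["short post", "another short post", "a b c", "word " * 200]
summary = length_summary(docs)
print(summary)
```

Running the same function on the raw and the preprocessed text gives the before/after comparison described above.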

2. Advanced NLP Preprocessing

To handle noisy and unformatted string inputs, an optimized textual normalization pipeline was built using:

Case Normalization: Universal conversion to lowercase to ensure feature uniformity.
Noise Filtering: Regex-based removal of numerical tokens, plus fast punctuation stripping via string translation (e.g. str.translate).
Stopword Pruning: Elimination of high-frequency non-informative English tokens via NLTK.
Morphological Stemming: Application of the Porter Stemmer to reduce inflected or derived words to a common stem, which substantially shrinks the vocabulary.
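
The four steps above could be sketched as follows. This is an illustration under assumptions, not the repository's actual code: the function name `clean_text`, the exact regex, and the small fallback stopword set (used only when the NLTK stopword corpus has not been downloaded) are all hypothetical.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

try:
    STOPWORDS = set(stopwords.words("english"))
except LookupError:
    # Corpus not downloaded yet (normally: nltk.download("stopwords")).
    # Tiny fallback list, an assumption made so the sketch still runs.
    STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}

STEMMER = PorterStemmer()
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, drop numeric tokens and punctuation, remove stopwords, stem."""
    text = text.lower()
    text = re.sub(r"\b\d+\b", " ", text)   # standalone numeric tokens
    text = text.translate(PUNCT_TABLE)     # punctuation stripping
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(STEMMER.stem(t) for t in tokens)


print(clean_text("The 3 cats are running!"))  # cat run
```

Building the translation table and the stopword set once at module level, rather than inside the function, avoids repeating that work for every document.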

3. Feature Engineering: Sublinear TF-IDF

Text strings were converted into numerical feature vectors using a TfidfVectorizer configured as follows:

N-gram Range: Set to (1, 2) to capture both single words (unigrams) and consecutive word pairs (bigrams), preserving local contextual semantics.
Vocabulary Constraint: Capped at 20,000 features, with a minimum document frequency (min_df=3) to discard rare noise tokens.
Sublinear TF Scaling: Replaced the raw term frequency $tf$ with $1 + \log(tf)$ so that a term repeated many times in a long document does not dominate its feature vector.
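
A minimal sketch of a vectorizer with the settings listed above, using scikit-learn's TfidfVectorizer. The helper name `build_vectorizer` and the toy corpus are hypothetical, and any parameter not mentioned above is left at its library default, which may differ from the repository's scripts. The second part isolates the sublinear scaling by switching off IDF and normalization, so the stored weight is exactly $1 + \log(tf)$.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer():
    """TF-IDF settings as described above; other parameters are defaults."""
    return TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams and bigrams
        max_features=20000,    # vocabulary cap
        min_df=3,              # drop terms seen in fewer than 3 documents
        sublinear_tf=True,     # use 1 + log(tf) instead of raw tf
    )


# Toy corpus (hypothetical): each document repeated so terms pass min_df=3.
docs = ["space shuttle launch", "hockey game tonight", "space station orbit"] * 3
vectorizer = build_vectorizer()
X = vectorizer.fit_transform(docs)
print(X.shape)
print("space shuttle" in vectorizer.vocabulary_)

# Isolate the sublinear scaling: no IDF weighting, no normalization.
tf_only = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
weights = tf_only.fit_transform(["spam spam spam spam ham"])
spam_weight = weights[0, tf_only.vocabulary_["spam"]]
print(spam_weight, 1 + math.log(4))
```

In the second part, a term that occurs four times receives a weight of about 2.39 rather than 4, which is the damping effect the sublinear option is meant to provide.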

📊 Performance Benchmarking & Evaluation

The processed feature matrix was evaluated with three classifiers: Multinomial Naive Bayes (NB), Logistic Regression (LR), and Linear Support Vector Machine (Linear SVM).

Model Metrics Summary

Linear SVM performed best, reaching an accuracy of 85.34% (with weighted F1, precision, and recall at similar levels).
Logistic Regression followed closely, performing well in the high-dimensional feature space with its smooth log-loss objective.
Multinomial Naive Bayes trained quickly and served as a solid probabilistic baseline.
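
A comparison of this kind could be sketched as below. The function name `benchmark`, the toy documents, and the hyperparameters (library defaults, plus `max_iter=1000` for Logistic Regression) are assumptions for illustration. They are not the repository's settings and will not reproduce the figures reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def benchmark(X_train, y_train, X_test, y_test):
    """Fit the three classifiers and return accuracy and weighted F1 for each."""
    models = {
        "Multinomial NB": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Linear SVM": LinearSVC(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "weighted_f1": f1_score(y_test, pred, average="weighted"),
        }
    return results


# Toy two-class data (hypothetical), standing in for the 20 newsgroups.
train_docs = [
    "hockey game score", "hockey team win", "ice hockey playoff",
    "space shuttle orbit", "space station launch", "rocket orbit launch",
]
y_train = ["hockey", "hockey", "hockey", "space", "space", "space"]
test_docs = ["hockey playoff game", "shuttle launch orbit"]
y_test = ["hockey", "space"]

vec = TfidfVectorizer()
results = benchmark(vec.fit_transform(train_docs), y_train,
                    vec.transform(test_docs), y_test)
for name, scores in results.items():
    print(name, scores)
```

The vectorizer is fitted on the training documents only and then applied to the test documents, so no test-set vocabulary or document frequencies leak into training.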

📈 Visualizations & Error Analysis

1. Document Length & Preprocessing Density

The length distributions show how stopword removal and tokenization shortened the documents and reduced the variance in document length, giving the classifiers cleaner input.

2. High-Frequency Keywords & Confusion Matrix

Summing TF-IDF weights per term across the corpus surfaces the most prominent stemmed tokens, such as subject, line, organ (the Porter stem of "organization"), and write.
The confusion matrix for the best-performing Linear SVM shows high accuracy on clearly distinct classes (e.g., rec.sport.hockey, sci.space) and a small number of misclassifications between semantically close classes (such as comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware).
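
Both analyses can be sketched in a few lines. The function names `top_terms_by_tfidf_sum` and `most_confused_pairs`, as well as the toy inputs, are hypothetical and are not taken from the repository's scripts. The first function sums each column of a TF-IDF matrix, and the second ranks the off-diagonal cells of a confusion matrix to list the class pairs that are confused most often.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def top_terms_by_tfidf_sum(X, feature_names, top_k=4):
    """Rank terms by their TF-IDF weight summed over all documents."""
    totals = np.asarray(X.sum(axis=0)).ravel()  # works for dense and sparse X
    order = np.argsort(totals)[::-1][:top_k]
    return [feature_names[i] for i in order]


def most_confused_pairs(y_true, y_pred, labels, top_k=3):
    """Return (true, predicted, count) for the largest off-diagonal cells."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    pairs = []
    for flat_idx in np.argsort(off_diag, axis=None)[::-1][:top_k]:
        i, j = np.unravel_index(flat_idx, off_diag.shape)
        if off_diag[i, j] > 0:
            pairs.append((labels[i], labels[j], int(off_diag[i, j])))
    return pairs


# Toy inputs (hypothetical), not results from the actual model.
X_toy = np.array([[0.1, 0.9, 0.0], [0.2, 0.8, 0.5]])
top = top_terms_by_tfidf_sum(X_toy, ["line", "subject", "write"], top_k=2)
print(top)

ibm, mac, hockey = ("comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
                    "rec.sport.hockey")
y_true = [ibm, ibm, ibm, mac, mac, hockey, hockey]
y_pred = [ibm, mac, mac, mac, ibm, hockey, hockey]
pairs = most_confused_pairs(y_true, y_pred, labels=[ibm, mac, hockey])
print(pairs)
```

With a fitted TfidfVectorizer, the feature names for the first function would come from `get_feature_names_out()`.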

💻 How to Run This Project

Follow these steps to clone the repository and execute the classification scripts locally:

1. Clone the Repository

git clone https://github.com/mrhashx/text-classification-20newsgroups.git
cd text-classification-20newsgroups

2. Install Required Frameworks

pip install numpy pandas matplotlib seaborn nltk scikit-learn

3. Run the Classifiers

To execute the script containing full exploratory plots and matrix visualizations:

python news_classifier_fixed_split.py

To execute the speed-optimized pipeline showcasing progress bars and automated text splitting:

python news_classifier_random_split.py

🛠️ Tech Stack & Libraries

Language: Python 3.x

Core Analytics: Pandas, NumPy

NLP Suite: NLTK (Stopwords, PorterStemmer)

Machine Learning Suite: Scikit-Learn (TfidfVectorizer, LinearSVC, LogisticRegression, MultinomialNB)

Visuals: Matplotlib, Seaborn
