This repository contains an end-to-end Natural Language Processing (NLP) and Machine Learning pipeline that classifies unformatted text documents into 20 distinct newsgroup topics. Built on the classic 20 Newsgroups dataset, it benchmarks probabilistic, linear, and maximum-margin classifiers.
The pipeline combines text cleaning, sublinear term-frequency scaling, bigram feature extraction, and evaluation on high-dimensional sparse matrices.
- Analyzed dataset distribution across 20 distinct categories, verifying class balance.
- Visualized document lengths (word count distributions) before and after preprocessing, which revealed a heavily right-skewed length distribution typical of real-world communications.
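The two checks above can be approximated without any plotting. The following is a minimal sketch, not the repository's actual code: the helper names and the toy `docs` and `labels` are hypothetical stand-ins for the real corpus. A mean word count well above the median is a simple numeric sign of right skew.

```python
from collections import Counter
from statistics import mean, median


def class_counts(labels):
    """Number of documents per class label."""
    return Counter(labels)


def word_counts(docs):
    """Whitespace-delimited word count for each document."""
    return [len(doc.split()) for doc in docs]


def summarize_lengths(docs):
    """Summary statistics of document length; mean >> median suggests right skew."""
    counts = word_counts(docs)
    return {"mean": mean(counts), "median": median(counts), "max": max(counts)}


# Toy data standing in for the real corpus (hypothetical).
docs = [
    "short post",
    "another short post here",
    "a " * 200 + "very long rambling post",
]
labels = ["sci.space", "rec.sport.hockey", "sci.space"]

print(class_counts(labels))
stats = summarize_lengths(docs)
print(stats)
```

On the real dataset the same counts could be passed to a histogram to reproduce the before and after plots.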
To handle noisy and unformatted string inputs, a text normalization pipeline applies the following steps:
- Case Normalization: Conversion of all text to lowercase to ensure feature uniformity.
- Noise Filtering: Regex-driven removal of numerical tokens, plus punctuation stripping through a string translation table for speed.
- Stopword Pruning: Elimination of high-frequency non-informative English tokens via NLTK.
- Morphological Stemming: Application of the Porter Stemmer to reduce inflected or derived words to a common stem, which substantially shrinks the vocabulary.
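The steps above can be sketched as a single function. This is an illustrative version under stated assumptions, not the repository's implementation: the name `normalize`, the exact regex, and the pluggable `stop_words` and `stem` parameters are hypothetical. In the actual pipeline the stopword list and stemmer would presumably come from NLTK, as noted in the comments.

```python
import re
import string
from typing import Callable, Iterable

_DIGITS = re.compile(r"\d+")  # runs of digits (assumed pattern)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # deletes ASCII punctuation


def normalize(text: str, stop_words: Iterable[str], stem: Callable[[str], str]) -> str:
    """Lowercase, drop digits and punctuation, remove stopwords, then stem."""
    stop = set(stop_words)
    text = text.lower()
    text = _DIGITS.sub(" ", text)        # remove numerical tokens
    text = text.translate(_PUNCT_TABLE)  # strip punctuation in one pass
    tokens = [stem(tok) for tok in text.split() if tok not in stop]
    return " ".join(tokens)


# Assumed NLTK wiring (requires the NLTK stopwords corpus to be downloaded):
#   from nltk.corpus import stopwords
#   from nltk.stem import PorterStemmer
#   stop_words = set(stopwords.words("english"))
#   stem = PorterStemmer().stem

# Stand-ins so the sketch runs without NLTK: a tiny stopword set, no stemming.
demo = normalize("The 3 Cats are RUNNING, fast!", {"the", "are"}, lambda tok: tok)
print(demo)
```

Because punctuation is stripped before stopword filtering, contractions such as "don't" become "dont" and no longer match the stopword list; whether that matters depends on the order used in the real scripts.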
Text strings were converted into numerical vectors using a TfidfVectorizer with the following settings:
- N-gram Range: Set to `(1, 2)` to capture both single words (unigrams) and consecutive word pairs (bigrams), preserving local contextual semantics.
- Vocabulary Constraint: Capped at `20,000` maximum features alongside a minimum document frequency (`min_df=3`) to discard rare noise tokens.
- Sublinear TF Scaling: Applied $1 + \log(tf)$ instead of raw $tf$ frequency to prevent long documents from mathematically dominating the feature vectors.
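These settings map directly onto scikit-learn's `TfidfVectorizer` parameters. The sketch below shows one plausible configuration; the wrapper `build_vectorizer` and the toy corpus are hypothetical, and any parameter not listed above is left at its library default, which may differ from the repository's scripts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer() -> TfidfVectorizer:
    """TF-IDF vectorizer with the settings described above."""
    return TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams and bigrams
        max_features=20000,   # cap the vocabulary size
        min_df=3,             # drop terms seen in fewer than 3 documents
        sublinear_tf=True,    # use 1 + log(tf) instead of raw tf
    )


# Toy corpus (hypothetical): only terms present in at least 3 documents survive.
corpus = [
    "space shuttle launch",
    "space shuttle orbit",
    "space shuttle mission",
    "hockey game",
]
vectorizer = build_vectorizer()
X = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

In this toy corpus `min_df=3` removes every term except those shared by the three space documents, so the last document becomes an all-zero row. On the full dataset the same filter discards rare tokens before the 20,000-feature cap is applied.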
The processed feature matrix was evaluated using three classifiers: Multinomial Naive Bayes (NB), Logistic Regression (LR), and Linear Support Vector Machine (Linear SVM).
- Linear SVM achieved the best results, with an accuracy of 85.34% (weighted F1, precision, and recall tracked it closely).
- Logistic Regression followed closely behind, performing well in the high-dimensional feature space with its smooth log-loss objective.
- Multinomial Naive Bayes trained quickly and served as a solid probabilistic baseline.
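A comparison of this kind can be sketched as a loop over the three estimators. This is a minimal illustration, not the repository's benchmark: the function `benchmark`, the default hyperparameters, and `max_iter=1000` are assumptions, and the toy count matrix stands in for the real TF-IDF features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def benchmark(X_train, y_train, X_test, y_test):
    """Fit each classifier and report accuracy and weighted F1 on the test set."""
    models = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),  # assumed setting
        "LinearSVC": LinearSVC(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "weighted_f1": f1_score(y_test, pred, average="weighted", zero_division=0),
        }
    return results


# Toy non-negative features (hypothetical); MultinomialNB requires values >= 0.
X_train = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
y_train = ["a", "a", "b", "b"]
X_test = np.array([[1, 0], [0, 1]])
y_test = ["a", "b"]

results = benchmark(X_train, y_train, X_test, y_test)
for name, scores in results.items():
    print(name, scores)
```

The scores printed for this toy data say nothing about the 20 Newsgroups results reported above; the sketch only shows the shape of the comparison.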
The word-count distributions show that removing stopwords and normalizing tokens shortened the documents and narrowed the spread of their lengths, giving the classifiers cleaner input.
- Feature importance analysis via TF-IDF weight summation highlights dominant global tokens such as subject, line, organ, and write. These are stemmed forms and most likely reflect header fields and quoting phrases ("Subject:", "Lines:", "Organization:", "writes") that appear in nearly every post.
- The confusion matrix for the top-performing Linear SVM shows high accuracy on clearly distinct classes (e.g., rec.sport.hockey, sci.space) and pinpoints minor confusion between semantically close classes (such as comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware).
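Both analyses reduce to a few array operations. The sketch below is one way to compute them, assuming a TF-IDF matrix `X` (dense or sparse) and predicted labels are available; the helper names `top_terms_by_tfidf` and `most_confused_pairs` and the toy inputs are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def top_terms_by_tfidf(X, feature_names, k=10):
    """Rank terms by their TF-IDF weight summed over all documents."""
    totals = np.asarray(X.sum(axis=0)).ravel()  # works for dense and sparse input
    order = np.argsort(totals)[::-1][:k]
    return [(feature_names[i], float(totals[i])) for i in order]


def most_confused_pairs(y_true, y_pred, labels, k=5):
    """Return up to k (true, predicted, count) pairs with the most errors."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)  # ignore correct predictions
    flat = np.argsort(cm, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, cm.shape)
    return [
        (labels[r], labels[c], int(cm[r, c]))
        for r, c in zip(rows, cols)
        if cm[r, c] > 0
    ]


# Toy inputs (hypothetical) with shortened class names.
X = np.array([[0.5, 0.1], [0.4, 0.0]])
terms = top_terms_by_tfidf(X, ["subject", "rare"])

labels = ["pc", "mac", "hockey"]
y_true = ["pc", "pc", "pc", "mac", "mac", "hockey"]
y_pred = ["mac", "mac", "pc", "pc", "mac", "hockey"]
pairs = most_confused_pairs(y_true, y_pred, labels)

print(terms)
print(pairs)
```

Listing off-diagonal cells in descending order surfaces pairs like the two hardware groups without having to read the full 20 by 20 heatmap.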
Follow these steps to clone the repository and execute the classification scripts locally:

    git clone https://github.com/mrhashx/text-classification-20newsgroups.git
    cd text-classification-20newsgroups

    pip install numpy pandas matplotlib seaborn nltk scikit-learn

To execute the script containing full exploratory plots and matrix visualizations:

    python news_classifier_fixed_split.py

To execute the speed-optimized pipeline showcasing progress bars and automated text splitting:

    python news_classifier_random_split.py

Language: Python 3.x
Core Analytics: Pandas, NumPy
NLP Suite: NLTK (Stopwords, PorterStemmer)
Machine Learning Suite: Scikit-Learn (TfidfVectorizer, LinearSVC, LogisticRegression, MultinomialNB)
Visuals: Matplotlib, Seaborn




