This study investigates the effectiveness of several machine learning models for multi-class classification of Urdu news articles from prominent Pakistani media organizations: ARY, Geo, Jang, Express, and Dunya News.
We scraped 1,500 articles from the websites of these media outlets, then evaluated Multinomial Naïve Bayes (MNB), Neural Networks, Logistic Regression, and Random Forest on their ability to classify Urdu content into distinct categories.
MNB: 96.3% on internal test data and 98% on third-party test data.
Neural Networks: 95.6% on internal test data.
Logistic Regression: 94.6% on internal test data.
Random Forest: 84.2% on internal test data.
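To illustrate the kind of classifier that scored highest above, here is a minimal from-scratch sketch of Multinomial Naïve Bayes with add-one smoothing. It is not the code in Model1_MNB. The class name `TinyMultinomialNB`, the whitespace tokenization, the toy sentences, and the category labels `sports` and `business` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative Multinomial Naive Bayes over whitespace-separated tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing constant

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.split()
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_tokens = {
            c: sum(self.token_counts[c].values()) for c in self.class_counts
        }
        return self

    def _log_score(self, tokens, label):
        # log P(class) + sum over tokens of log P(token | class)
        n_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_docs)
        denom = self.total_tokens[label] + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are skipped
                count = self.token_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        tokens = text.split()
        return max(self.class_counts, key=lambda c: self._log_score(tokens, c))


# Toy data (hypothetical, not taken from the scraped corpus).
train_texts = [
    "کرکٹ ٹیم میچ جیت",
    "ٹیم کھلاڑی میچ",
    "بینک بجٹ روپیہ",
    "روپیہ مارکیٹ بجٹ",
]
train_labels = ["sports", "sports", "business", "business"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("میچ ٹیم"))
```

A real pipeline would typically replace the whitespace split with a proper Urdu tokenizer and a count or TF-IDF vectorizer; the scoring rule stays the same.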
Scraping_NewsArticles: Web scraping code for the media outlets listed above.
Cleaning + EDA: Data cleaning, preprocessing, and exploratory data analysis (EDA).
Model1_MNB: Implementation of Multinomial Naïve Bayes.
Model2_NN: Implementation of the Neural Network model.
Model3_LogisticRegression: Implementation of Logistic Regression.
Model4_RandomForest: Implementation of Random Forest Classifier.
Research Paper: Comprehensive details of our study.
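As a rough idea of what the Cleaning + EDA step might involve, the sketch below strips non-Urdu characters, Urdu punctuation, and digits from scraped text. The actual cleaning rules used in this study are not described here, so the function name `clean_urdu_text` and every rule in it are assumptions.

```python
import re
import unicodedata

# Anything that is neither in the Arabic Unicode block (U+0600-U+06FF,
# which contains the Urdu letters) nor whitespace.
_NON_URDU = re.compile(r"[^\u0600-\u06FF\s]")
# Arabic comma, semicolon, question mark, Urdu full stop, and both sets
# of Arabic-Indic digits, all of which live inside that block.
_PUNCT_AND_DIGITS = re.compile(r"[\u060C\u061B\u061F\u06D4\u0660-\u0669\u06F0-\u06F9]")
_WHITESPACE = re.compile(r"\s+")


def clean_urdu_text(text):
    """Hypothetical cleaning step: keep only Urdu letters and single spaces."""
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms
    text = _NON_URDU.sub(" ", text)            # drop Latin text, digits, symbols
    text = _PUNCT_AND_DIGITS.sub(" ", text)    # drop Urdu punctuation and digits
    return _WHITESPACE.sub(" ", text).strip()  # collapse runs of whitespace


cleaned = clean_urdu_text("Geo News: کرکٹ ٹیم، 2024!")
print(cleaned)
```

Note that this sketch also replaces the zero-width non-joiner (U+200C) with a space, which may split some words; a production cleaner would need to decide how to treat it.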