Discover hidden topics and themes in BBC News articles using two unsupervised NLP techniques: LDA and NMF. Explore insights, visualize topics, and compare model performance! 🚀
We use the BBC News Dataset (Kaggle).
The dataset contains news articles categorized into topics such as sports, politics, business, technology, and entertainment.
- Python 🐍
- Gensim
- Scikit-learn
- NLTK / spaCy
- pyLDAvis
- WordCloud
- Tokenization
- Lowercasing
- Stopword removal
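The three preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `preprocess` helper and the tiny `STOPWORDS` set are assumptions made so the example runs without downloads, whereas a real run would more likely use the stopword list from NLTK or spaCy.

```python
import re

# Assumption: a tiny illustrative stopword set; the project itself
# probably relies on the full NLTK or spaCy list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Lowercase the text, tokenize on alphabetic runs, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("The markets are rising in London and Tokyo"))
# ['markets', 'rising', 'london', 'tokyo']
```

Tokenizing with a regular expression also strips punctuation and digits, which may or may not match what the notebook does.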
We applied two unsupervised NLP techniques:
- Latent Dirichlet Allocation (LDA)
- Non-negative Matrix Factorization (NMF)
Both models extract dominant topics and display the most significant words per topic.
The interactive LDA visualization is saved as:
lda_visualization.html
Open it in a browser to explore topic-term relationships. 🌐
Each word cloud image shows the most significant words for one topic. 🌟
- LDA Coherence Score: 0.3530
- NMF Reconstruction Error: 45.8986
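Note that these two metrics measure different things (topic coherence versus matrix approximation error) and are not directly comparable. To judge which approach captures topics more clearly, compare the top words per topic or compute the same metric for both models. 🔍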
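To make the NMF figure easier to interpret, here is a small sketch of what a reconstruction error is, assuming scikit-learn's `NMF` with its default Frobenius loss (the toy corpus is an assumption, and the project's own setup may differ). The value is the Frobenius norm of the difference between the TF-IDF matrix and its low-rank approximation `W @ H`, so it grows with the size of the matrix and says nothing about coherence on its own.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: a toy corpus standing in for the BBC articles.
docs = [
    "The striker scored a late goal to win the football match",
    "The team won the cup final after a penalty goal",
    "The government passed a new tax bill in parliament",
    "Parliament debated the election and the tax policy",
]

# Dense document-term matrix so the manual check below is straightforward.
X = TfidfVectorizer().fit_transform(docs).toarray()

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

# Frobenius norm of the residual X - WH, computed by hand.
manual_error = np.linalg.norm(X - W @ H)

print(nmf.reconstruction_err_, manual_error)
```

The two printed numbers should agree up to floating-point noise. A lower value means a closer approximation of the input matrix, which is a different question from whether the topics read well.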
- Clone the repo:
git clone https://github.com/your-username/bbc-news-topic-modeling.git
cd bbc-news-topic-modeling
- Install the required packages (Gensim, Scikit-learn, NLTK / spaCy, pyLDAvis, WordCloud).
- Run the notebook or Python script:
python topic_modeling.py
WordClouds and LDA visualizations are saved automatically. ✅
This project is licensed under the MIT License. 📝









