varshavkumar12345/bda

🚀 BDA Kafka Streaming Pipeline: Windows Setup & Run Guide

📌 Overview

End-to-end real-time data pipeline:

Web Scraping (producer.py)
        ↓
      Kafka
        ↓
  Spark Cleaner (spark_cleaner.py)
        ↓
 cleaned_text / rejected_text (Kafka topics)
        ↓
 InfluxDB Consumer (influx_consumer.py)
        ↓
      InfluxDB
        ↓
      Grafana  →  http://localhost:3001

🧰 Prerequisites

1. Java (required for Kafka + Spark)

java -version

Expected: java version "17.x". Java 11 may be enough for Kafka alone, but PySpark 4.x (Step 7) requires Java 17 or later.


2. Python + Dependencies

python -m pip install kafka-python pyspark influxdb-client requests beautifulsoup4

3. Apache Kafka


4. Docker Desktop


▶️ How to Run (Step by Step)

✅ Step 1: Start Docker Desktop

  1. Press Win → search Docker Desktop → open it
  2. Wait for the 🐳 whale icon to appear in the system tray (bottom-right)
  3. Hover over it and wait until it says "Engine running"

⚠️ Docker Desktop must be fully running before Step 4.
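
The tray-icon check above can also be scripted. The following is a minimal sketch, assuming the docker CLI is on PATH and that `docker info` exits with a non-zero code when the engine is unreachable; the function name `docker_engine_running` is hypothetical and not part of this repository.

```python
import subprocess


def docker_engine_running(command=("docker", "info"), timeout=20):
    """Return True if the given command exits with code 0.

    With the default command this means the Docker CLI reached a running engine.
    """
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # CLI missing from PATH, or the engine never answered in time.
        return False
    return result.returncode == 0
```

Calling `docker_engine_running()` before Step 4 gives a quick yes/no answer instead of watching the tray icon.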


✅ Step 2: Start Kafka (Terminal 1)

cd C:\kafka
bin\windows\kafka-server-start.bat config\server.properties

Expected log line:

[BrokerServer id=1] Starting broker

Keep this terminal open. Kafka must stay running.


✅ Step 3: Create Kafka Topics (Terminal 2, run once)

cd C:\kafka
bin\windows\kafka-topics.bat --create --topic raw_text --bootstrap-server localhost:9092
bin\windows\kafka-topics.bat --create --topic cleaned_text --bootstrap-server localhost:9092
bin\windows\kafka-topics.bat --create --topic rejected_text --bootstrap-server localhost:9092

Expected output for each:

Created topic raw_text.

If you see Topic 'X' already exists, that's fine; skip it.
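
To make topic creation safe to re-run, the three commands can be generated from Python. This is a sketch only: it assumes Kafka is installed at C:\kafka as above and that your Kafka release accepts `--if-not-exists` together with `--bootstrap-server` (recent releases do). The helper names `topic_create_command` and `create_all_topics` are hypothetical.

```python
import subprocess
from pathlib import PureWindowsPath

TOPICS = ("raw_text", "cleaned_text", "rejected_text")


def topic_create_command(topic, kafka_home=r"C:\kafka", bootstrap="localhost:9092"):
    """Build the kafka-topics.bat argument list for one topic."""
    script = PureWindowsPath(kafka_home) / "bin" / "windows" / "kafka-topics.bat"
    return [
        str(script),
        "--create",
        "--if-not-exists",  # skip silently when the topic already exists
        "--topic", topic,
        "--bootstrap-server", bootstrap,
    ]


def create_all_topics(run=subprocess.run):
    """Create every pipeline topic; `run` is injectable so it can be dry-run."""
    for topic in TOPICS:
        run(topic_create_command(topic), check=True)
```

Passing `run=print` (with a small wrapper that ignores `check`) shows the commands without executing them.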


✅ Step 4: Start InfluxDB + Grafana via Docker (Terminal 3)

cd C:\Codes\OpenSource\bda
docker-compose up

Wait until you see:

bda_grafana  | logger=settings t=... msg="HTTP Server Listen" address=[::]:3000

Keep this terminal open. InfluxDB runs on localhost:8086 and Grafana on localhost:3001 (the port 3000 in the log line above is presumably the port inside the container).


✅ Step 5: Start the Web Scraping Producer (Terminal 4)

conda activate bda
cd C:\Codes\OpenSource\bda
python producer.py

Expected output (every 5 seconds):

Scraping: scrape_hacker_news …
  Got 30 items from hacker_news
  → Sent [hacker_news]: AI startup raises $500M...
Scraping: scrape_wikipedia_random …
  → Sent [wikipedia]: The Roman Empire was...
Scraping: scrape_bbc_news …
  → Sent [bbc_news]: UK economy grows 0.3%...

Scrapes live from: Hacker News, Wikipedia (random), BBC News
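
For orientation, here is a rough sketch of how a scraped item could be wrapped and sent to raw_text with kafka-python. The message schema (`source`, `text`, `produced_at`) and the function names are assumptions for illustration; producer.py may use a different format.

```python
import json
import time


def build_message(source, text, now=None):
    """Wrap one scraped item in a JSON-serializable envelope (hypothetical schema)."""
    return {
        "source": source,
        "text": text,
        # A producer-side timestamp lets a consumer compute end-to-end latency.
        "produced_at": time.time() if now is None else now,
    }


def encode(message):
    """Serialize a message dict to UTF-8 JSON bytes for Kafka."""
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def send_items(items, topic="raw_text", bootstrap="localhost:9092"):
    """Send (source, text) pairs to Kafka; requires kafka-python and a running broker."""
    from kafka import KafkaProducer  # imported lazily so the helpers work without it

    producer = KafkaProducer(bootstrap_servers=bootstrap, value_serializer=encode)
    for source, text in items:
        producer.send(topic, build_message(source, text))
    producer.flush()
    producer.close()
```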


✅ Step 6: Start the InfluxDB Consumer Bridge (Terminal 5)

If you run spark_cleaner.py (Step 7), use the default mode, which reads the Spark output topics:

conda activate bda
cd C:\Codes\OpenSource\bda
python influx_consumer.py

If you want to skip Spark and still feed Grafana, use the raw-text fallback mode:

conda activate bda
cd C:\Codes\OpenSource\bda
python influx_consumer.py --source raw_text

Expected output:

Starting Kafka → InfluxDB bridge …
[VALID]    topic=cleaned_text   latency=0.021s  rej_rate=12.0%  |  The Roman Empire...
[REJECTED] topic=rejected_text  latency=0.018s  rej_rate=14.0%  |  The Roman Empire...

Note: --source raw_text lets the bridge clean/reject messages in Python and removes the hard Spark dependency for Grafana metrics.
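
The clean/reject decision in the fallback mode can be pictured roughly as follows. This is an illustrative sketch, not the logic in influx_consumer.py or cleaning.py: the regular expressions, the `MIN_WORDS` threshold, and the function names are all assumptions.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # strips HTML tags
SPACE_RE = re.compile(r"\s+")     # collapses runs of whitespace
MIN_WORDS = 3                     # assumed threshold for a usable sentence


def clean(text):
    """Remove tags, decode HTML entities, and normalize whitespace."""
    text = html.unescape(TAG_RE.sub(" ", text))
    return SPACE_RE.sub(" ", text).strip()


def route(text):
    """Return (topic, cleaned_text): too-short results go to rejected_text."""
    cleaned = clean(text)
    if len(cleaned.split()) < MIN_WORDS:
        return "rejected_text", cleaned
    return "cleaned_text", cleaned
```

For example, `route("Tom &amp; Jerry <b>rule</b>")` yields the cleaned_text topic with the text `Tom & Jerry rule`.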


✅ Step 7 (Optional but Recommended): Start Spark Cleaner (Terminal 6)

Requires Java 17 or later and PySpark 4.x (already installed in the bda conda env). One-time setup: winutils.exe must be present at C:\hadoop\bin\winutils.exe (needed by Spark on Windows).

# One-time: download winutils.exe if not already done (PowerShell)
New-Item -ItemType Directory -Force -Path "C:\hadoop\bin"
Invoke-WebRequest -Uri "https://github.com/kontext-tech/winutils/raw/master/hadoop-3.4.0/bin/winutils.exe" -OutFile "C:\hadoop\bin\winutils.exe"
Unblock-File "C:\hadoop\bin\winutils.exe"  # Unblock the downloaded file
# Clear SPARK_HOME so conda PySpark 4.1.1 is used (not any standalone install)
# PowerShell syntax below; in cmd.exe use "set NAME=value" instead
Remove-Item Env:\SPARK_HOME -ErrorAction SilentlyContinue
$env:HADOOP_HOME = "C:\hadoop"
$env:PYSPARK_PYTHON = "C:\Users\varsh\anaconda3\envs\bda\python.exe"
$env:PYSPARK_DRIVER_PYTHON = "C:\Users\varsh\anaconda3\envs\bda\python.exe"

conda activate bda
cd C:\Codes\OpenSource\bda
python spark_cleaner.py

Note: The Kafka connector JAR (org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0) is declared inside spark_cleaner.py via spark.jars.packages and is auto-downloaded on first run (~10 MB), so internet access is required. If you see a NativeIO$Windows.access0 error, winutils.exe is missing or HADOOP_HOME is not set.

Without this step, the cleaned_text and rejected_text topics stay empty, and Grafana shows no data unless influx_consumer.py runs with --source raw_text (Step 6).
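
Before launching spark_cleaner.py, the environment requirements described in this step can be checked with a small script. This is a sketch under the assumptions stated in this guide (SPARK_HOME cleared, HADOOP_HOME pointing at a folder that contains bin\winutils.exe); the function name `spark_env_problems` is hypothetical.

```python
import os
from pathlib import PureWindowsPath


def spark_env_problems(env=None, exists=os.path.exists):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []
    if env.get("SPARK_HOME"):
        problems.append("SPARK_HOME is set; a standalone Spark may shadow the conda PySpark")
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    else:
        winutils = str(PureWindowsPath(hadoop_home) / "bin" / "winutils.exe")
        if not exists(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems
```

Printing the result of `spark_env_problems()` in Terminal 6 narrows down a NativeIO$Windows.access0 failure before Spark even starts.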


✅ Step 8: Open Grafana Dashboard

  1. Open browser → http://localhost:3001
  2. Login: admin / admin
  3. Go to: Dashboards → BDA → BDA Pipeline Monitor
  4. Dashboard auto-refreshes every 5 seconds
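
If the page does not load, Grafana's `/api/health` endpoint offers a quick check from Python. The sketch below assumes the host port 3001 used in this guide; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def is_healthy(payload):
    """Interpret a decoded /api/health response body."""
    return isinstance(payload, dict) and payload.get("database") == "ok"


def grafana_healthy(base_url="http://localhost:3001", timeout=3):
    """Return True if Grafana answers /api/health and reports a working database."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Not reachable yet, or the response body was not valid JSON.
        return False
    return is_healthy(payload)
```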

📊 Grafana Dashboard Panels

| Panel | Description |
| --- | --- |
| 📈 Message Throughput | Valid vs rejected messages over time |
| 🔴 Rejection Rate | Rolling % of rejected messages (gauge) |
| ✅ Total Valid | Cumulative valid message count |
| ❌ Total Rejected | Cumulative rejected message count |
| ⏱️ Processing Latency | Time from producer → InfluxDB (seconds) |
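
The rolling rejection rate and the latency figure can be computed along these lines. This is a sketch only: the window size of 50 messages and the names `RollingRejectionRate` and `latency_seconds` are assumptions, not values taken from influx_consumer.py.

```python
from collections import deque


class RollingRejectionRate:
    """Percentage of rejected messages among the last `window` messages."""

    def __init__(self, window=50):
        self._flags = deque(maxlen=window)  # old entries fall off automatically

    def record(self, rejected):
        """Register one message and return the updated percentage."""
        self._flags.append(bool(rejected))
        return self.percent

    @property
    def percent(self):
        if not self._flags:
            return 0.0
        return 100.0 * sum(self._flags) / len(self._flags)


def latency_seconds(produced_at, received_at):
    """Producer-to-consumer delay in seconds, clamped at zero for clock skew."""
    return max(0.0, received_at - produced_at)
```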

🖥️ Terminal Summary

| Terminal | Command | Keep Open? |
| --- | --- | --- |
| 1 | kafka-server-start.bat | ✅ Yes |
| 2 | Create topics (3 commands) | ❌ Close after |
| 3 | docker-compose up | ✅ Yes (Grafana → port 3001) |
| 4 | python producer.py | ✅ Yes |
| 5 | python influx_consumer.py | ✅ Yes |
| 6 | python spark_cleaner.py | ✅ Yes (optional) |

🧱 Project Structure

bda/
├── producer.py           # Scrapes web data → sends to Kafka raw_text
├── spark_cleaner.py      # Spark: cleans & routes to cleaned_text / rejected_text
├── cleaning.py           # Text cleaning utilities (used by Spark)
├── influx_consumer.py    # Kafka consumer → writes metrics to InfluxDB
├── docker-compose.yml    # Spins up InfluxDB + Grafana
├── requirements.txt      # Python dependencies
└── grafana/
    ├── provisioning/
    │   ├── datasources/influxdb.yml   # Auto-configures InfluxDB datasource
    │   └── dashboards/dashboard.yml  # Dashboard loader config
    └── dashboards/
        └── bda_pipeline.json         # Pre-built Grafana dashboard
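
As an illustration of the "writes metrics to InfluxDB" role of influx_consumer.py, the sketch below builds a dictionary-style record and writes it with influxdb-client. The measurement name, tag and field names, and the `<token>`, `<org>`, and `<bucket>` placeholders are all hypothetical; the real values live in influx_consumer.py and docker-compose.yml.

```python
import time


def metric_record(status, latency_s, rejection_pct, now_ns=None):
    """Build a dictionary-style record for influxdb-client (hypothetical schema)."""
    return {
        "measurement": "pipeline_messages",  # assumed measurement name
        "tags": {"status": status},          # e.g. "valid" or "rejected"
        "fields": {
            "latency_s": float(latency_s),
            "rejection_pct": float(rejection_pct),
            "count": 1,
        },
        "time": time.time_ns() if now_ns is None else now_ns,
    }


def write_metric(record, url="http://localhost:8086",
                 token="<token>", org="<org>", bucket="<bucket>"):
    """Write one record synchronously; requires influxdb-client and a running InfluxDB."""
    from influxdb_client import InfluxDBClient
    from influxdb_client.client.write_api import SYNCHRONOUS

    with InfluxDBClient(url=url, token=token, org=org) as client:
        client.write_api(write_options=SYNCHRONOUS).write(bucket=bucket, record=record)
```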

⚠️ Common Issues

| Error | Fix |
| --- | --- |
| open //./pipe/dockerDesktopLinuxEngine | Docker Desktop is not running; open it first |
| Topic 'X' already exists | Not an error; the topic was already created, continue |
| DEPRECATED: Log4j 1.x | Harmless Kafka warning; ignore it |
| Grafana shows no data | Spark cleaner not running; start Terminal 6, or run influx_consumer.py with --source raw_text |
| influx_consumer exits immediately | InfluxDB not ready yet; wait for the Docker step (Step 4) to finish starting |
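
Several of these issues come down to starting a component before the service it depends on is listening. A simple way to wait is to poll the TCP port, as in this sketch; the function names and the 60-second default deadline are assumptions.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, deadline_s=60.0, interval_s=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until the port accepts connections or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if port_open(host, port):
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

For example, `wait_for_port("localhost", 8086)` before starting influx_consumer.py, or `wait_for_port("localhost", 9092)` before creating topics. An open port only shows that the service is listening, not that it has finished initializing.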

🔍 Verify Kafka Data (Optional Debug)

# Check raw incoming data
cd C:\kafka
bin\windows\kafka-console-consumer.bat --topic raw_text --from-beginning --bootstrap-server localhost:9092

# Check cleaned data
bin\windows\kafka-console-consumer.bat --topic cleaned_text --bootstrap-server localhost:9092

# Check rejected data
bin\windows\kafka-console-consumer.bat --topic rejected_text --bootstrap-server localhost:9092
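
The same check can be done from Python with kafka-python, which is already in the dependency list. This is a minimal sketch; the function names `preview` and `peek`, the 5-message limit, and the 5-second idle timeout are assumptions.

```python
def preview(raw, width=80):
    """Decode message bytes and truncate to a single short line."""
    text = raw.decode("utf-8", errors="replace").replace("\n", " ")
    return text if len(text) <= width else text[: width - 3] + "..."


def peek(topic, limit=5, bootstrap="localhost:9092"):
    """Print up to `limit` messages from the start of a topic; needs a running broker."""
    from kafka import KafkaConsumer  # imported lazily so preview() works without it

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # read from the beginning, like --from-beginning
        enable_auto_commit=False,
        consumer_timeout_ms=5000,      # stop iterating after 5 s without messages
    )
    try:
        for count, record in enumerate(consumer, start=1):
            print(preview(record.value))
            if count >= limit:
                break
    finally:
        consumer.close()
```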

About

Analyzes and cleans continuously streamed data from web scraping. Spark performs the cleaning and analysis through regular-expression matching, and Kafka handles the data streaming. The pipeline keeps only the data relevant for training NLP models.