🚀 BDA Kafka Streaming Pipeline — Windows Setup & Run Guide

📌 Overview

End-to-end real-time data pipeline:

Web Scraping (producer.py)
        ↓
      Kafka
        ↓
  Spark Cleaner (spark_cleaner.py)
        ↓
 cleaned_text / rejected_text (Kafka topics)
        ↓
 InfluxDB Consumer (influx_consumer.py)
        ↓
      InfluxDB
        ↓
      Grafana  →  http://localhost:3001

🧰 Prerequisites

1. Java (required for Kafka + Spark)

java -version

Expected: java version "17.x" or later. PySpark 4.x (used in Step 7) does not run on Java 11.
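
The version check can also be scripted. The sketch below parses the text printed by java -version and returns the major version; the function names are illustrative and not part of this repo.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Return the Java major version found in `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_381" means Java 8.
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def installed_java_major() -> int:
    """Run `java -version` (it writes to stderr) and parse the result."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr + result.stdout)
```

Calling installed_java_major() on a machine with Java on PATH returns an integer such as 17, which can then be compared against the version the pipeline needs.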

2. Python + Dependencies

python -m pip install kafka-python pyspark influxdb-client requests beautifulsoup4
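
To confirm the install worked, a small check like the one below can help. It is a sketch, not part of the repo: it maps each pip package name from the command above to the module name it is imported under and reports any that cannot be found.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import`
REQUIRED = {
    "kafka-python": "kafka",
    "pyspark": "pyspark",
    "influxdb-client": "influxdb_client",
    "requests": "requests",
    "beautifulsoup4": "bs4",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose module cannot be located, sorted by name."""
    return sorted(
        pkg for pkg, module in required.items() if find_spec(module) is None
    )
```

Running print(missing_packages(REQUIRED)) in the same Python environment should print an empty list once everything is installed.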

3. Apache Kafka

Download from: https://kafka.apache.org/downloads
Extract to C:\kafka
Verify: C:\kafka\bin\windows\ folder exists

4. Docker Desktop

Download from: https://www.docker.com/products/docker-desktop
Install and reboot if prompted

▶️ How to Run (Step by Step)

✅ Step 1 — Start Docker Desktop

Press Win → search Docker Desktop → open it
Wait for the 🐳 whale icon to appear in the system tray (bottom-right)
Hover over it — wait until it says "Engine running"

⚠️ Docker Desktop must be fully running before Step 4.

✅ Step 2 — Start Kafka (Terminal 1)

cd C:\kafka
bin\windows\kafka-server-start.bat config\server.properties

Expected log line:

[BrokerServer id=1] Starting broker

Keep this terminal open. Kafka must stay running.

✅ Step 3 — Create Kafka Topics (Terminal 2 — run once)

cd C:\kafka
bin\windows\kafka-topics.bat --create --topic raw_text --bootstrap-server localhost:9092
bin\windows\kafka-topics.bat --create --topic cleaned_text --bootstrap-server localhost:9092
bin\windows\kafka-topics.bat --create --topic rejected_text --bootstrap-server localhost:9092

Expected output for each:

Created topic raw_text.

If you see Topic 'X' already exists — that's fine, skip it.
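
As an alternative to the three CLI commands, the topics could be created from Python with kafka-python's admin client. This is a minimal sketch under assumptions: one partition and replication factor 1 are guesses suited to a single local broker, and ensure_topics is a hypothetical helper, not a function in this repo. It skips topics that already exist, which mirrors the "already exists" case above.

```python
PIPELINE_TOPICS = ["raw_text", "cleaned_text", "rejected_text"]


def missing_topics(existing, wanted):
    """Return the wanted topic names the broker does not have yet, in order."""
    have = set(existing)
    return [name for name in wanted if name not in have]


def ensure_topics(bootstrap="localhost:9092", wanted=PIPELINE_TOPICS):
    """Create any pipeline topic that is missing; return the names created."""
    # Imported lazily so missing_topics() works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    try:
        to_create = missing_topics(admin.list_topics(), wanted)
        if to_create:
            admin.create_topics(
                [
                    # Assumption: 1 partition, replication factor 1 (local broker).
                    NewTopic(name=name, num_partitions=1, replication_factor=1)
                    for name in to_create
                ]
            )
        return to_create
    finally:
        admin.close()
```

ensure_topics() needs the broker from Step 2 to be running; on a second run it should return an empty list.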

✅ Step 4 — Start InfluxDB + Grafana via Docker (Terminal 3)

cd C:\Codes\OpenSource\bda
docker-compose up

Wait until you see:

bda_grafana  | logger=settings t=... msg="HTTP Server Listen" address=[::]:3000

Keep this terminal open. InfluxDB runs on localhost:8086 and Grafana on localhost:3001 (the 3000 in the log line above is the port inside the container).
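
Before moving on, it can be useful to confirm that the services are actually listening. The sketch below only tests whether a TCP connection succeeds on the ports named in this guide; it does not verify that the service behind the port is healthy. The helper names are illustrative.

```python
import socket

# Ports as stated in this guide.
SERVICES = {"Kafka": 9092, "InfluxDB": 8086, "Grafana": 3001}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(services=SERVICES, host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}
```

print(service_status()) gives a quick overview, for example before starting the producer or the consumer bridge.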

✅ Step 5 — Start the Web Scraping Producer (Terminal 4)

conda activate bda
cd C:\Codes\OpenSource\bda
python producer.py

Expected output (every 5 seconds):

Scraping: scrape_hacker_news …
  Got 30 items from hacker_news
  → Sent [hacker_news]: AI startup raises $500M...
Scraping: scrape_wikipedia_random …
  → Sent [wikipedia]: The Roman Empire was...
Scraping: scrape_bbc_news …
  → Sent [bbc_news]: UK economy grows 0.3%...

Scrapes live from: Hacker News, Wikipedia (random), BBC News
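
The message format that producer.py writes to raw_text is not shown in this guide. Purely as an illustration, the sketch below assumes a JSON envelope with source, text and ts fields (all hypothetical); a producer-side timestamp of this kind is what would make a producer-to-InfluxDB latency measurement possible.

```python
import json
import time
from typing import Optional


def build_message(source: str, text: str, now: Optional[float] = None) -> bytes:
    """Encode one scraped item as UTF-8 JSON (hypothetical schema)."""
    payload = {
        "source": source,
        "text": text,
        "ts": time.time() if now is None else now,  # producer-side timestamp
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def send_items(producer, items, topic="raw_text"):
    """Send (source, text) pairs through any object with send() and flush()."""
    for source, text in items:
        producer.send(topic, build_message(source, text))
    producer.flush()
```

With kafka-python, producer would be a KafkaProducer(bootstrap_servers="localhost:9092"); both send() and flush() are part of its API.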

✅ Step 6 — Start the InfluxDB Consumer Bridge (Terminal 5)

If you will run spark_cleaner.py (Step 7), start the bridge in its default mode, which reads the Spark output topics cleaned_text and rejected_text:

conda activate bda
cd C:\Codes\OpenSource\bda
python influx_consumer.py

If you want to skip Spark and still feed Grafana, use the raw-text fallback mode:

conda activate bda
cd C:\Codes\OpenSource\bda
python influx_consumer.py --source raw_text

Expected output:

Starting Kafka → InfluxDB bridge …
[VALID]    topic=cleaned_text   latency=0.021s  rej_rate=12.0%  |  The Roman Empire...
[REJECTED] topic=rejected_text  latency=0.018s  rej_rate=14.0%  |  The Roman Empire...

Note: --source raw_text lets the bridge clean/reject messages in Python and removes the hard Spark dependency for Grafana metrics.
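
The actual cleaning rules live in cleaning.py and influx_consumer.py and are not reproduced in this guide. As a rough idea of what "clean/reject in Python" can look like, here is a sketch with made-up rules: strip HTML tags, collapse whitespace, and reject text that is shorter than a hypothetical MIN_LENGTH or contains no letters.

```python
import re

MIN_LENGTH = 20  # hypothetical threshold, not taken from the repo


def clean_text(raw: str) -> str:
    """Remove HTML tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def route(raw: str, min_length: int = MIN_LENGTH):
    """Return (topic, cleaned_text) for one raw message."""
    cleaned = clean_text(raw)
    too_short = len(cleaned) < min_length
    no_letters = not any(ch.isalpha() for ch in cleaned)
    if too_short or no_letters:
        return "rejected_text", cleaned
    return "cleaned_text", cleaned
```

The returned topic name matches the two output topics created in Step 3, so the same function could decide both the label and the destination.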

✅ Step 7 (Optional but Recommended) — Start Spark Cleaner (Terminal 6)

Requires Java 17 and PySpark 4.x (already installed in the bda conda env). One-time setup: Spark on Windows needs winutils.exe at C:\hadoop\bin\winutils.exe.

# One-time: download winutils.exe if not already done
New-Item -ItemType Directory -Force -Path "C:\hadoop\bin"
Invoke-WebRequest -Uri "https://github.com/kontext-tech/winutils/raw/master/hadoop-3.4.0/bin/winutils.exe" -OutFile "C:\hadoop\bin\winutils.exe"
Unblock-File "C:\hadoop\bin\winutils.exe"  # Unblock the downloaded file

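# Clear SPARK_HOME so the conda PySpark 4.1.1 is used (not any standalone install)
# PowerShell syntax, matching the commands above; in cmd.exe use `set NAME=value` instead
$env:SPARK_HOME = $null
$env:HADOOP_HOME = "C:\hadoop"
# Adjust both paths to the python.exe of your own bda environment
$env:PYSPARK_PYTHON = "C:\Users\varsh\anaconda3\envs\bda\python.exe"
$env:PYSPARK_DRIVER_PYTHON = "C:\Users\varsh\anaconda3\envs\bda\python.exe"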

conda activate bda
cd C:\Codes\OpenSource\bda
python spark_cleaner.py
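
The same environment setup can be done from Python before pyspark is imported, which avoids depending on the shell. This is a sketch, not code from spark_cleaner.py; the helper names are illustrative, and using sys.executable assumes the script is started with the bda environment's interpreter.

```python
import os
import sys
from pathlib import Path


def spark_env(base_env, hadoop_home=r"C:\hadoop", python_exe=sys.executable):
    """Return a copy of base_env adjusted for PySpark on Windows."""
    env = dict(base_env)
    env.pop("SPARK_HOME", None)  # fall back to the pip/conda PySpark
    env["HADOOP_HOME"] = hadoop_home
    env["PYSPARK_PYTHON"] = python_exe
    env["PYSPARK_DRIVER_PYTHON"] = python_exe
    return env


def winutils_present(hadoop_home=r"C:\hadoop") -> bool:
    """Check for winutils.exe under <hadoop_home>/bin."""
    return (Path(hadoop_home) / "bin" / "winutils.exe").is_file()


def apply_spark_env(hadoop_home=r"C:\hadoop"):
    """Apply the adjusted variables to the current process."""
    new_env = spark_env(os.environ, hadoop_home=hadoop_home)
    os.environ.pop("SPARK_HOME", None)
    os.environ.update(new_env)
```

apply_spark_env() would need to run before the first import of pyspark, and winutils_present() gives an early, readable failure instead of the NativeIO error.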

Note: The Kafka connector JAR (org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0) is declared inside spark_cleaner.py via spark.jars.packages and is downloaded automatically on first run (~10 MB), so internet access is required. If you see a NativeIO$Windows.access0 error, winutils.exe is missing or HADOOP_HOME is not set.

Without the Spark cleaner running, the cleaned_text and rejected_text topics stay empty and Grafana shows no data, unless influx_consumer.py was started with --source raw_text (Step 6).
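
For reference, this is roughly how a connector package is declared through spark.jars.packages. The sketch is not copied from spark_cleaner.py: the app name is hypothetical, and the version is a parameter because Spark's Kafka integration guide lists the connector with the same version as Spark itself, so it may need to follow the installed PySpark version.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.13") -> str:
    """Build the Maven coordinate of Spark's Kafka source/sink connector."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


def build_session(app_name="bda-spark-cleaner", spark_version="4.0.0"):
    """Create a SparkSession that pulls the Kafka connector on first use."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)  # app name is an assumption
        .config("spark.jars.packages", kafka_connector_coordinate(spark_version))
        .getOrCreate()
    )
```

If the connector fails to load, passing pyspark.__version__ as spark_version is one thing worth trying.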

✅ Step 8 — Open Grafana Dashboard

Open browser → http://localhost:3001
Login: admin / admin
Go to: Dashboards → BDA → BDA Pipeline Monitor
Dashboard auto-refreshes every 5 seconds

📊 Grafana Dashboard Panels

Panel	Description
📈 Message Throughput	Valid vs Rejected messages over time
🔴 Rejection Rate	Rolling % of rejected messages (gauge)
✅ Total Valid	Cumulative valid message count
❌ Total Rejected	Cumulative rejected message count
⏱️ Processing Latency	Time from producer → InfluxDB (seconds)
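
To make the rejection-rate and latency panels concrete, here is one way such metrics can be computed on the consumer side. It is a sketch under assumptions: the window size of 50 messages is a guess, and the class and function names are not taken from influx_consumer.py.

```python
from collections import deque


class RollingRejectionRate:
    """Percentage of rejected messages over the last `window` messages."""

    def __init__(self, window: int = 50):  # window size is an assumption
        self._outcomes = deque(maxlen=window)  # True means rejected

    def record(self, rejected: bool) -> float:
        """Add one outcome and return the current rate in percent."""
        self._outcomes.append(rejected)
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


def latency_seconds(produced_at: float, written_at: float) -> float:
    """Seconds between the producer timestamp and the InfluxDB write."""
    return max(0.0, written_at - produced_at)
```

Because the deque has a fixed maximum length, old outcomes drop out automatically and the rate reflects recent traffic only.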

🖥️ Terminal Summary

Terminal	Command	Keep Open?
1	`kafka-server-start.bat`	✅ Yes
2	Create topics (3 commands)	❌ Close after
3	`docker-compose up`	✅ Yes (Grafana → port 3001)
4	`python producer.py`	✅ Yes
5	`python influx_consumer.py`	✅ Yes
6	`python spark_cleaner.py`	✅ Yes (optional)

🧱 Project Structure

bda/
├── producer.py           # Scrapes web data → sends to Kafka raw_text
├── spark_cleaner.py      # Spark: cleans & routes to cleaned_text / rejected_text
├── cleaning.py           # Text cleaning utilities (used by Spark)
├── influx_consumer.py    # Kafka consumer → writes metrics to InfluxDB
├── docker-compose.yml    # Spins up InfluxDB + Grafana
├── requirements.txt      # Python dependencies
└── grafana/
    ├── provisioning/
    │   ├── datasources/influxdb.yml   # Auto-configures InfluxDB datasource
    │   └── dashboards/dashboard.yml  # Dashboard loader config
    └── dashboards/
        └── bda_pipeline.json         # Pre-built Grafana dashboard

⚠️ Common Issues

Error	Fix
`open //./pipe/dockerDesktopLinuxEngine`	Docker Desktop not running — open it first
`Topic 'X' already exists`	Not an error — topic already created, continue
`DEPRECATED: Log4j 1.x`	Harmless Kafka warning — ignore it
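Grafana shows no data	No messages are reaching cleaned_text / rejected_text: start the Spark cleaner (Terminal 6), or run `influx_consumer.py` with `--source raw_text`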
`influx_consumer` exits immediately	InfluxDB not ready yet — wait for Docker step

🔍 Verify Kafka Data (Optional Debug)

# Check raw incoming data
cd C:\kafka
bin\windows\kafka-console-consumer.bat --topic raw_text --from-beginning --bootstrap-server localhost:9092

# Check cleaned data
bin\windows\kafka-console-consumer.bat --topic cleaned_text --bootstrap-server localhost:9092

# Check rejected data
bin\windows\kafka-console-consumer.bat --topic rejected_text --bootstrap-server localhost:9092
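
The same check can be done from Python with kafka-python, which is convenient inside the bda environment. This is a sketch: peek and preview are hypothetical helpers, and the 5-second timeout is an arbitrary choice so the call returns even when the topic is empty.

```python
def preview(value: bytes, width: int = 80) -> str:
    """Decode a message value and shorten it to at most `width` characters."""
    text = value.decode("utf-8", errors="replace").replace("\n", " ")
    return text if len(text) <= width else text[: width - 3] + "..."


def peek(topic, limit=5, bootstrap="localhost:9092"):
    """Return previews of up to `limit` messages from the start of a topic."""
    # Imported lazily so preview() works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # read from the beginning
        enable_auto_commit=False,      # do not record any offsets
        consumer_timeout_ms=5000,      # stop iterating after 5 s of silence
    )
    try:
        lines = []
        for record in consumer:
            lines.append(preview(record.value))
            if len(lines) >= limit:
                break
        return lines
    finally:
        consumer.close()
```

For example, peek("cleaned_text") returning an empty list after the timeout suggests that nothing is being written to that topic.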
