A role-centered knowledge graph built from interview-prep articles for:
- Machine Learning Engineer
- AI Engineer
- Data Scientist
- MLOps Engineer
- Data Engineer
- NLP Engineer
- Computer Vision Engineer
The project scrapes interview guides, roadmaps, question banks, and study resources, then turns them into a graph of roles, topics, skills, tools, algorithms, and recommended resources. It also exports a deployable static HTML graph for exploration.
This project answers questions like:
- What should I study for Machine Learning Engineer interviews?
- What overlaps between Data Scientist and Machine Learning Engineer prep?
- Which topics are central to MLOps or AI Engineer preparation?
- Which resources are repeatedly recommended for a given role?
Instead of treating the corpus as plain text, the pipeline builds a structured graph with relationships such as:
- JOB_ROLE -> REQUIRES -> SKILL
- JOB_ROLE -> TESTS -> INTERVIEW_TOPIC
- JOB_ROLE -> USES -> TOOL / ALGORITHM
- RESOURCE -> RECOMMENDED_FOR -> JOB_ROLE
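Once the corpus is stored as relationship triples, a question such as "what overlaps between two roles?" reduces to a set intersection. The following is a minimal sketch under that assumption; the sample triples and the helper names `neighbors_by_role` and `role_overlap` are hypothetical and not taken from the project's notebooks.

```python
from collections import defaultdict

# Hypothetical sample triples in (source, relation, target) form.
# The node names are illustrative, not taken from the project's graph.
TRIPLES = [
    ("Machine Learning Engineer", "REQUIRES", "Python"),
    ("Machine Learning Engineer", "TESTS", "Model Evaluation"),
    ("Machine Learning Engineer", "USES", "Docker"),
    ("Data Scientist", "REQUIRES", "Python"),
    ("Data Scientist", "TESTS", "Model Evaluation"),
    ("Data Scientist", "TESTS", "A/B Testing"),
]


def neighbors_by_role(triples):
    """Map each source node to the set of (relation, target) pairs it points to."""
    out = defaultdict(set)
    for source, relation, target in triples:
        out[source].add((relation, target))
    return out


def role_overlap(triples, role_a, role_b):
    """Return the (relation, target) pairs shared by two roles, sorted."""
    by_role = neighbors_by_role(triples)
    return sorted(by_role[role_a] & by_role[role_b])


overlap = role_overlap(TRIPLES, "Machine Learning Engineer", "Data Scientist")
print(overlap)
# [('REQUIRES', 'Python'), ('TESTS', 'Model Evaluation')]
```

The same lookup would be much harder against plain text, which is the motivation for building the graph first.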
From the current run in this repo:
- 109 candidate URLs collected
- 56 kept after filtering
- 189 graph nodes
- 363 graph edges
- 39 communities after graph tightening
Kept role coverage:
- Machine Learning Engineer: 14
- Computer Vision Engineer: 8
- Data Scientist: 8
- NLP Engineer: 8
- AI Engineer: 7
- Data Engineer: 6
- MLOps Engineer: 5
The scrape notebook uses SerpAPI Google Search with role-focused interview-prep queries, then scrapes article text with trafilatura.
The corpus is filtered to remove:
- blocked domains
- obvious non-article or job-posting pages
- thin pages
- weak interview-signal content
- failed scrapes
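The filtering rules above can be sketched as a single function that returns the reason a page is dropped. This is only an illustration: the blocked domains, the `/jobs/` path check, the word-count threshold, and the signal terms are all assumed values, not the settings used in the scrape notebook.

```python
from urllib.parse import urlparse

# All values below are assumptions for illustration, not the project's settings.
BLOCKED_DOMAINS = {"reddit.com", "linkedin.com"}
MIN_WORDS = 300
SIGNAL_TERMS = ("interview", "question", "prepare", "roadmap")
MIN_SIGNAL_HITS = 2


def rejection_reason(url, text):
    """Return why a page is dropped, or None if the page is kept."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Blocked domains, including their subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return "blocked domain"
    # Crude stand-in for "obvious non-article or job-posting pages".
    if "/jobs/" in parsed.path.lower():
        return "job posting"
    # A scraper that returns None or an empty string counts as a failed scrape.
    if not text:
        return "failed scrape"
    if len(text.split()) < MIN_WORDS:
        return "thin page"
    lowered = text.lower()
    hits = sum(lowered.count(term) for term in SIGNAL_TERMS)
    if hits < MIN_SIGNAL_HITS:
        return "weak interview signal"
    return None
```

Returning a reason rather than a boolean makes it easy to count how many candidate URLs each rule removes.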
The graph notebook uses:
- LlamaIndex PropertyGraphIndex
- a custom GraphRAGExtractor
- a custom GraphRAGStore
Entity types:
- JOB_ROLE
- INTERVIEW_TOPIC
- SKILL
- TOOL
- ALGORITHM
- RESOURCE
Relationship types:
- TESTS
- REQUIRES
- USES
- RELATED_TO
- RECOMMENDED_FOR
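A fixed schema like this makes it possible to reject malformed extractions before they reach the graph. The sketch below assumes the extractor emits a source type, a relation, and a target type for each edge. The `ALLOWED_PATTERNS` table mirrors the example relationships listed earlier in this README, and leaving RELATED_TO unconstrained is an assumption. Whether the custom extractor enforces anything like this is not stated here.

```python
ENTITY_TYPES = {
    "JOB_ROLE", "INTERVIEW_TOPIC", "SKILL", "TOOL", "ALGORITHM", "RESOURCE",
}
RELATION_TYPES = {"TESTS", "REQUIRES", "USES", "RELATED_TO", "RECOMMENDED_FOR"}

# Assumed direction constraints, taken from the example relationships in this
# README. RELATED_TO has no entry, so any pair of known types is accepted.
ALLOWED_PATTERNS = {
    "REQUIRES": {("JOB_ROLE", "SKILL")},
    "TESTS": {("JOB_ROLE", "INTERVIEW_TOPIC")},
    "USES": {("JOB_ROLE", "TOOL"), ("JOB_ROLE", "ALGORITHM")},
    "RECOMMENDED_FOR": {("RESOURCE", "JOB_ROLE")},
}


def is_valid_edge(source_type, relation, target_type):
    """Check an extracted edge against the entity and relationship schema."""
    if source_type not in ENTITY_TYPES or target_type not in ENTITY_TYPES:
        return False
    if relation not in RELATION_TYPES:
        return False
    patterns = ALLOWED_PATTERNS.get(relation)
    return patterns is None or (source_type, target_type) in patterns
```

For example, `is_valid_edge("JOB_ROLE", "REQUIRES", "SKILL")` is accepted, while the reversed edge `is_valid_edge("SKILL", "REQUIRES", "JOB_ROLE")` is rejected.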
The system runs community detection, summarizes each cluster, and uses those summaries for final question answering.
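As a simplified illustration of the last step, the sketch below packs community summaries into one prompt and hands it to a language model. The prompt layout, the function names, and the `llm` callable are all hypothetical. The actual notebook may instead answer per community and then aggregate, which is a common GraphRAG pattern.

```python
def build_answer_prompt(question, summaries):
    """Assemble one prompt from community summaries (the format is hypothetical).

    `summaries` maps a community id to its summary text.
    """
    context = "\n".join(
        f"[Community {cid}] {text}" for cid, text in sorted(summaries.items())
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer_question(question, summaries, llm):
    """`llm` is any callable that maps a prompt string to a response string."""
    return llm(build_answer_prompt(question, summaries))
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library and makes it easy to test with a stub.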
The final outputs are:
.
|-- scrape_interview_prep.ipynb
|-- build_interview_graphrag.ipynb
|-- interview_prep_corpus.csv
|-- interview_graph_data.json
|-- job_skill_graph.html
|-- search.csv
|-- raw.csv
`-- .env
- Python 3.10+
- SerpAPI key
- OpenAI API key
SERP_API_KEY=your_serpapi_key
OPENAI_API_KEY=your_openai_api_key

Models used:
- gpt-4o-mini for entity and relationship extraction
- gpt-4o-mini for community summaries
- gpt-4o for final query synthesis
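A notebook could check for both keys up front and keep the model assignment in one place. This is a sketch under assumptions: it expects the `.env` values to already be present in the process environment (for example, loaded by a dotenv helper), and the `MODELS` dictionary keys and the `missing_keys` helper are hypothetical names.

```python
import os

REQUIRED_KEYS = ("SERP_API_KEY", "OPENAI_API_KEY")

# Model names come from this README; the dictionary keys are hypothetical.
MODELS = {
    "extraction": "gpt-4o-mini",
    "community_summaries": "gpt-4o-mini",
    "query_synthesis": "gpt-4o",
}


def missing_keys(env=None):
    """Return the required keys that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
```

Failing early on a missing key avoids discovering the problem partway through a long scrape or extraction run.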
- Job postings were too noisy for this use case, so the project pivoted to interview-prep articles.
- Many search results from sites like Reddit, LinkedIn, and Medium were hard to scrape reliably.
- Early graph versions were too fragmented, with too many weak nodes and tiny communities.
- Visualization quality mattered a lot: a technically correct graph was still hard to use until the layout became role-centered and interaction-first.
Later iterations addressed these problems with:
- better filtering of scraped pages
- stronger role coverage across the corpus
- graph tightening to remove low-signal structure
- fewer communities and less clutter
- a cleaner HTML graph with:
  - larger role nodes
  - role anchors that never disappear
  - click-to-focus neighborhoods
  - relation and type filters
  - grouped node details
- SerpAPI
- requests, BeautifulSoup, trafilatura
- pandas
- LlamaIndex PropertyGraphIndex
- OpenAI gpt-4o-mini, gpt-4o
- graspologic
- D3.js
This project was built with development assistance from OpenAI's ChatGPT (GPT-5.4). It was used to help with implementation, debugging, and iteration during development.
