The data collection scripts are located under:
chatbot-core/data/collection/
These scripts gather information from four key sources:
- Jenkins official documentation
- Discourse community topics
- StackOverflow threads
- Jenkins Plugins documentation
This doc file describes how to run the individual data collection scripts for each source. The results will be stored in the raw directory.
The script docs_crawler.py recursively crawls the Jenkins official documentation starting from the base URL. It collects the main content from each documentation page and stores it in a structured JSON file.
- Input: No input required — the script starts crawling from the base documentation URL.
- Output: A
jenkins_docs.jsonfile saved underchatbot-core/data/raw/, containing page URLs as keys and their HTML content as values.
To run:
python data/collection/docs_crawler.pyNote: Make sure you're in the chatbot-core directory and your virtual environment is activated before running this or any script.
This data collection pipeline fetches community discussions from the Jenkins Discourse forum under the "Using Jenkins" category. It runs in three steps, using separate scripts:
Script: discourse_topics_retriever.py
Fetches all topics under the "Using Jenkins" category(including sub-category "Ask a question").
- Input: None — it exploits the Discourse API.
- Output:
discourse_topic_list.json(stored inchatbot-core/data/raw/)
To run:
python data/collection/discourse_topics_retriever.pyScript: collection_utils/filter_discourse_threads.py
Filters the previously collected topics, keeping only those with an accepted answer.
- Input:
discourse_topic_list.json - Output:
filtered_discourse_topics.json(stored inchatbot-core/data/raw)
To run:
python data/collection/collection_utils/filter_discourse_threads.pyScript: discourse_fetch_posts.py
Fetches all post content for each filtered topic, including the question and all replies.
- Input:
filtered_discourse_topics.json - Output:
topics_with_posts.json(stored inchatbot-core/data/raw)
To run:
python data/collection/discourse_fetch_posts.pyThis script processes data collected manually from StackOverflow using StackExchange Data Explorer. The query retrieves the top 1000 Jenkins-related threads with accepted answers, positive scores, and created in or after 2020.
Run the following SQL query in the StackExchange Data Explorer to export a list of high-quality Jenkins-related threads:
SELECT TOP 1000
q.Id AS [Question ID],
q.Title AS [Question Title],
q.Body AS [Question Body],
q.Tags AS [Tags],
q.CreationDate,
q.Score AS [Question Score],
a.Id AS [Accepted Answer ID],
a.Score AS [Answer Score],
a.Body AS [Answer Body]
FROM Posts q
INNER JOIN Posts a ON q.AcceptedAnswerId = a.Id
WHERE
q.Tags LIKE '%<jenkins>%'
AND q.Score >= 1
AND q.CreationDate >= '2020-01-01'
ORDER BY q.Score DESCThe result can be downloaded as a CSV file and has to be placed in the following path: chatbot-core/data/raw/QueryResults.csv
Script: collection_utils/convert_stack_threads.py
This script reads the exported CSV and converts it into a JSON format. The resulting JSON file will contain a list of question-answer pairs with metadata.
- Input:
QueryResults.csv - Output:
stack_overflow_threads.json(stored inchatbot-core/data/raw)
To run:
python data/collection/collection_utils/convert_stack_threads.pyThis pipeline fetches the documentation content of the Jenkins plugins hosted on https://plugins.jenkins.io/.
The collection process consists of two main scripts:
Script: fetch_list_plugins.py
Fetches the list of all available plugins from the Jenkins update site and saves their names (without .hpi extension).
- Output:
plugin_names.json(stored inchatbot-core/data/raw)
To run:
python data/collection/fetch_list_plugins.pyScript: jenkins_plugin_fetch.py
Uses the list of plugin names to fetch documentation content from each plugin's page on plugins.jenkins.io. It extracts the main <div class="content"> section from each page, which contains the content we are interested in.
- Input:
plugin_names.json - Output:
plugin_docs.json(stored inchatbot-core/data/raw)
To run:
python data/collection/jenkins_plugin_fetch.py