Issue Details
The current project pipeline includes stages for data collection, preprocessing, and chunking for multiple information sources such as Jenkins core documentation, Discourse threads, and StackOverflow discussions. However, the embedding stage only processes chunks generated from plugin documentation.
Evidence from the Current Code
In embed_chunks.py, the list of chunk files used for embedding includes only chunks_plugin_docs.json, while other chunk files generated by the pipeline are present in the repository but commented out in the CHUNK_FILES configuration.
Since the embedding step processes only plugin documentation chunks, the stored vector index contains information exclusively from that source. Any other datasets generated earlier in the pipeline never reach the vector store.
Behaviour
When users ask questions that should ideally be answered using Jenkins core documentation or community discussions, the system does not retrieve relevant information from those sources.
Instead, one of two outcomes occurs: the retrieval system either returns plugin documentation chunks that are only loosely related to the query, or it returns no relevant results at all.
Proposed Fix
1. Embed All Relevant Corpora
Update the CHUNK_FILES configuration in embed_chunks.py so that it includes all available chunk sources, rather than embedding only plugin documentation.
chunks_docs.json
chunks_plugin_docs.json
chunks_discourse_docs.json
chunks_stackoverflow_threads.json
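As a sketch, the updated configuration could look like the following. The variable name CHUNK_FILES comes from the issue description; the exact paths and surrounding code in embed_chunks.py may differ.

```python
# All chunk sources produced by the pipeline, not just plugin docs.
# Paths are illustrative; adjust to match the repository layout.
CHUNK_FILES = [
    "chunks_docs.json",                   # Jenkins core documentation
    "chunks_plugin_docs.json",            # plugin documentation (currently the only one)
    "chunks_discourse_docs.json",         # Discourse community threads
    "chunks_stackoverflow_threads.json",  # StackOverflow discussions
]
```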
After updating the configuration, re-run the embedding and storage step to regenerate the FAISS index and metadata so that embeddings from all corpora are included.
2. Ensure Metadata Distinguishes Each Corpus
While storing metadata in the .pkl file, ensure that each metadata record includes a consistent discriminator field identifying the source of the chunk.
For example, introduce a field such as:
data_source (or source_type)
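For illustration, the metadata records stored in the .pkl file could carry the discriminator like this; the field name data_source and the sample values are assumptions, not code from the repository.

```python
import pickle

# Each record tags its chunk with the corpus it came from, so retrieval
# results can report provenance and be filtered per source.
metadata = [
    {"text": "Declarative Pipeline syntax overview ...", "data_source": "docs"},
    {"text": "Configuring the Git plugin ...", "data_source": "plugin_docs"},
    {"text": "How do I trigger a downstream job?", "data_source": "stackoverflow"},
]

# Persist alongside the FAISS index (illustrative filename).
with open("chunk_metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)
```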
Contribution
I would like to contribute a fix for this issue.