Non-plugin sources (Jenkins core docs / Discourse / StackOverflow) are not embedded, so retrieval can’t use them #262

@IND-Anshuman

Issue Details

The current project pipeline includes stages for data collection, preprocessing, and chunking for multiple information sources such as Jenkins core documentation, Discourse threads, and StackOverflow discussions. However, the embedding stage only processes chunks generated from plugin documentation.

Evidence from the Current Code

In embed_chunks.py, the CHUNK_FILES configuration lists only chunks_plugin_docs.json for embedding. The other chunk files generated by the pipeline are present in the repository, but their entries in CHUNK_FILES are commented out.
Because the embedding step processes only plugin documentation chunks, the stored vector index contains information exclusively from that source. The other datasets generated earlier in the pipeline never reach the vector store.
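
A quick way to confirm the gap is to compare the chunk files on disk with the ones the embedding step is configured to read. The sketch below is only illustrative: the helper name `find_unembedded_chunk_files`, the `chunks_*.json` naming pattern, and the idea that all chunk files sit in one directory are assumptions, not taken from embed_chunks.py.

```python
from pathlib import Path


def find_unembedded_chunk_files(chunk_dir, configured):
    """Return chunk files present on disk that the embedding config does not list.

    chunk_dir  -- directory assumed to hold the pipeline's chunks_*.json outputs
    configured -- file names currently active in CHUNK_FILES
    """
    on_disk = {p.name for p in Path(chunk_dir).glob("chunks_*.json")}
    return sorted(on_disk - set(configured))
```

Calling it with the directory that holds the chunk outputs and `["chunks_plugin_docs.json"]` should list every corpus that is chunked but never embedded.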

Behaviour

When users ask questions that should ideally be answered using Jenkins core documentation or community discussions, the system does not retrieve relevant information from those sources.
Instead, one of two outcomes occurs. The retrieval system may return irrelevant plugin documentation chunks that are only loosely related to the query, or it may return no relevant results at all.

Proposed Fix

1. Embed All Relevant Corpora

Update the CHUNK_FILES configuration in embed_chunks.py so that it includes all available chunk sources, rather than embedding only plugin documentation.

  • chunks_docs.json
  • chunks_plugin_docs.json
  • chunks_discourse_docs.json
  • chunks_stackoverflow_threads.json

After updating the configuration, re-run the embedding and storage step to regenerate the FAISS index and metadata so that embeddings from all corpora are included.
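
As a rough sketch of the configuration change, the list below enables all four chunk files and loads them into one combined list before embedding. It assumes each file is a JSON array of chunk records; the `load_all_chunks` helper and its `chunk_dir` parameter are hypothetical and may not match how embed_chunks.py actually reads its inputs.

```python
import json
from pathlib import Path

# All four corpora enabled, instead of plugin docs only.
CHUNK_FILES = [
    "chunks_docs.json",
    "chunks_plugin_docs.json",
    "chunks_discourse_docs.json",
    "chunks_stackoverflow_threads.json",
]


def load_all_chunks(chunk_dir, chunk_files=CHUNK_FILES):
    """Load every configured chunk file and report any that are missing.

    Assumes each file contains a JSON array of chunk records.
    Returns (chunks, missing_file_names).
    """
    chunks = []
    missing = []
    for name in chunk_files:
        path = Path(chunk_dir) / name
        if not path.exists():
            # Report the gap rather than silently embedding a partial corpus.
            missing.append(name)
            continue
        with path.open(encoding="utf-8") as fh:
            chunks.extend(json.load(fh))
    return chunks, missing
```

Returning the missing file names makes it visible when a corpus was skipped, which is the failure mode this issue describes.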

2. Ensure Metadata Distinguishes Each Corpus

When storing metadata in the .pkl file, ensure that each metadata record includes a consistent discriminator field identifying the corpus the chunk came from.
For example, introduce a field named data_source (or source_type).
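
One possible way to attach the discriminator is to derive it from the chunk file name at load time. In this sketch the label values (`jenkins_docs`, `plugin_docs`, `discourse`, `stackoverflow`) and the helper names are illustrative assumptions; only the field name data_source comes from the proposal above, and chunk records are assumed to be dictionaries.

```python
from collections import Counter

# Hypothetical mapping from chunk file name to a corpus label.
SOURCE_BY_FILE = {
    "chunks_docs.json": "jenkins_docs",
    "chunks_plugin_docs.json": "plugin_docs",
    "chunks_discourse_docs.json": "discourse",
    "chunks_stackoverflow_threads.json": "stackoverflow",
}


def tag_chunks(chunks, file_name):
    """Return copies of the chunk records with a data_source field added."""
    source = SOURCE_BY_FILE[file_name]
    return [{**chunk, "data_source": source} for chunk in chunks]


def count_by_source(metadata):
    """Count metadata records per corpus, e.g. to verify the rebuilt index."""
    return Counter(record["data_source"] for record in metadata)
```

After the index is regenerated, running `count_by_source` over the stored metadata gives a simple check that every corpus contributed records.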

Contribution

I would like to contribute a fix for this issue.
