Issue Details
The current project pipeline includes stages for data collection, preprocessing, and chunking for multiple information sources such as Jenkins core documentation, Discourse threads, and StackOverflow discussions. However, the embedding stage only processes chunks generated from plugin documentation.
Evidence from the Current Code
In embed_chunks.py, the list of chunk files used for embedding includes only chunks_plugin_docs.json, while other chunk files generated by the pipeline are present in the repository but commented out in the CHUNK_FILES configuration.
Since the embedding step processes only plugin documentation chunks, the stored vector index contains information exclusively from that source. Any other datasets generated earlier in the pipeline never reach the vector store.
Behaviour
When users ask questions that should ideally be answered using Jenkins core documentation or community discussions, the system does not retrieve relevant information from those sources.
Instead, one of two outcomes occurs: the retrieval system either returns plugin documentation chunks that are only loosely related to the query, or it returns no relevant results at all.
Proposed Fix
1. Embed All Relevant Corpora
Update the CHUNK_FILES configuration in embed_chunks.py so that it includes all available chunk sources, rather than embedding only plugin documentation.
chunks_docs.json
chunks_plugin_docs.json
chunks_discourse_docs.json
chunks_stackoverflow_threads.json
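As a sketch, the updated configuration could look like the following. The variable name CHUNK_FILES comes from the issue description; the exact paths and surrounding code in embed_chunks.py may differ.

```python
# All chunk sources produced by the pipeline, not just plugin docs.
# Paths are illustrative; adjust to match the repository layout.
CHUNK_FILES = [
    "chunks_docs.json",                   # Jenkins core documentation
    "chunks_plugin_docs.json",            # plugin documentation (currently the only one)
    "chunks_discourse_docs.json",         # Discourse community threads
    "chunks_stackoverflow_threads.json",  # StackOverflow discussions
]
```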
After updating the configuration, re-run the embedding and storage step to regenerate the FAISS index and metadata so that embeddings from all corpora are included.
2. Ensure Metadata Distinguishes Each Corpus
While storing metadata in the .pkl file, ensure that each metadata record includes a consistent discriminator field identifying the source of the chunk.
For example, introduce a field such as:
data_source (or source_type)
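For illustration, the metadata records stored in the .pkl file could carry the discriminator like this; the field name data_source and the sample values are assumptions, not code from the repository.

```python
import pickle

# Each record tags its chunk with the corpus it came from, so retrieval
# results can report provenance and be filtered per source.
metadata = [
    {"text": "Declarative Pipeline syntax overview ...", "data_source": "docs"},
    {"text": "Configuring the Git plugin ...", "data_source": "plugin_docs"},
    {"text": "How do I trigger a downstream job?", "data_source": "stackoverflow"},
]

# Persist alongside the FAISS index (illustrative filename).
with open("chunk_metadata.pkl", "wb") as f:
    pickle.dump(metadata, f)
```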
Contribution
I would like to contribute a fix for this issue.