Problem
vectorstore_utils.py serializes and deserializes FAISS metadata using pickle.dump() and pickle.load():
import pickle  # module-level import, shown for completeness

# rag/vectorstore/vectorstore_utils.py:57-78
def save_metadata(metadata, path):
    with open(path, "wb") as f:
        pickle.dump(metadata, f)

def load_metadata(path):
    with open(path, "rb") as f:
        return pickle.load(f)  # <-- arbitrary code execution
Unpickling can execute arbitrary code: a crafted pickle stream can make pickle.load() call any importable callable with attacker-chosen arguments. If an attacker can replace the .pkl file on disk - via a supply chain attack on the data pipeline, a compromised scraping source, or a path traversal bug elsewhere - they get remote code execution with the privileges of the server process.
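To make the risk concrete, here is a minimal, harmless sketch of how a pickle payload runs code at load time. The `Payload` class is a hypothetical stand-in for an attacker-crafted object and is not part of this repository; the `eval` expression is deliberately benign, but it could be any expression or any other importable callable.

```python
import pickle


class Payload:
    """Hypothetical stand-in for an attacker-crafted object."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state,
        # and replays that call when the bytes are loaded.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes does not rebuild a Payload; it calls eval() and
# returns whatever that call produced.
result = pickle.loads(blob)
print(result)
```

Nothing in `load_metadata` can inspect the file before this happens, because the call is made inside `pickle.load()` itself.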
The metadata being stored is just a list of dicts with string values (id, text, source, code_blocks). There is no reason to use pickle for this structure.
Impact
- Arbitrary code execution on the server if any .pkl file is modified
- The data pipeline pulls from external URLs (plugin docs, Discourse, etc.) and writes these files - any compromise upstream propagates
- pickle has been on the do-not-use-with-untrusted-data list since Python 2.x
Proposed Fix
Replace pickle with json in vectorstore_utils.py:
import json

def save_metadata(metadata, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)

def load_metadata(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
Then update the embedding pipeline (store_embeddings.py) to write .json instead of .pkl, and update retriever_utils.py to read from the new path. Existing .pkl files need a one-time migration or rebuild.
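The one-time migration could look roughly like the sketch below. `migrate_metadata` is a hypothetical helper, not existing code in the repository, and the file names and sample record are invented for illustration. It still calls `pickle.load()`, so it should only be run against .pkl files that are known to be trusted. If that cannot be guaranteed, rebuilding from source is the safer route.

```python
import json
import pickle
import tempfile
from pathlib import Path


def migrate_metadata(pkl_path, json_path):
    """Convert one trusted .pkl metadata file to JSON (hypothetical helper)."""
    # Still unsafe on untrusted input: run only on files you generated yourself.
    with open(pkl_path, "rb") as f:
        metadata = pickle.load(f)

    # Check the shape the issue describes: a list of dicts.
    if not isinstance(metadata, list) or not all(isinstance(m, dict) for m in metadata):
        raise ValueError(f"unexpected metadata structure in {pkl_path}")

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)
    return len(metadata)


# Round-trip demo on a temporary directory with an invented sample record.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "metadata.pkl"
    json_path = Path(tmp) / "metadata.json"
    sample = [{"id": "1", "text": "hello", "source": "docs", "code_blocks": ""}]
    pkl_path.write_bytes(pickle.dumps(sample))

    count = migrate_metadata(pkl_path, json_path)
    restored = json.loads(json_path.read_text(encoding="utf-8"))
```

Comparing the restored data with the original is a cheap sanity check, since JSON turns tuples into lists and non-string dict keys into strings.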
Acceptance Criteria
- pickle.load and pickle.dump removed from vectorstore_utils.py
- store_embeddings.py writes .json metadata files
- retriever_utils.py reads .json metadata files
- tests/unit/rag/vectorstore/ updated and passing
References
- chatbot-core/rag/vectorstore/vectorstore_utils.py lines 57-78
- chatbot-core/rag/vectorstore/store_embeddings.py lines 14-15
- chatbot-core/rag/retriever/retriever_utils.py (reads the metadata)
- Python docs on pickle security: https://docs.python.org/3/library/pickle.html