A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
Paper · Dataset · Quickstart · Evaluation · Citation
Personalized Deep Research (PDR) integrates user-specific context into every stage of the deep research workflow—from planning and query formulation to iterative retrieval and report generation. Unlike generic "one-size-fits-all" systems, PDR produces outputs that align with each user's preferred style, structure, and topical focus.
This repository provides the official implementation of PDR, including:
- Four core modules: Profile Extraction, Personalized Question Development, Dynamic Dual-Stage Retrieval, and Personalized Report Generation
- PDR Dataset: The first benchmark for personalized deep research across four realistic task categories
- PDR-Eval: A hybrid evaluation framework combining lexical metrics with LLM-as-judge assessment for factuality and personalization
Figure 1: Overview of the Personalized Deep Research (PDR) framework. The four core stages—profile extraction, personalized question development, dynamic dual-stage retrieval, and personalized report generation—seamlessly integrate user context throughout the pipeline.
| Module | Description |
|---|---|
| Profile Extraction | Transforms heterogeneous user data (drafts, emails, reports, logs) into structured profiles capturing demographics, traits, and preferences via an LLM-powered understanding agent. |
| Personalized Question Development | Decomposes user prompts into intent-aligned sub-queries, tailoring research sub-goals to individual profiles without manual intervention. |
| Dynamic Dual-Stage Retrieval | Unifies private (internal) and public (external) search; integrates chunk-filtering, a decision agent for adaptive stopping, and iterative query refinement. |
| Personalized Report Generation | Synthesizes retrieved evidence with profile signals to produce style-consistent, factually grounded reports. |
| PDR-Eval | Combines ROUGE-1, ROUGE-L, METEOR with LLM-as-judge scoring (Comprehensiveness, Readability, Content Personalization, Presentation Personalization). |
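The data flow between these modules can be sketched in a few lines. The helpers below are illustrative stand-ins, **not** the actual `deepsearcher` API — they only show how the profile, sub-queries, evidence, and report feed into one another (see `deepsearcher/personalized_understanding.py` and `deepsearcher/online_query.py` for the real entry points):

```python
# Minimal, self-contained sketch of the four-stage PDR pipeline.
# Every helper here is an illustrative placeholder, not the deepsearcher API.

def extract_profile(docs: list[str]) -> str:
    """Stage 1: distill heterogeneous user data into a structured profile."""
    return f"profile derived from {len(docs)} documents"  # placeholder

def develop_questions(prompt: str, profile: str) -> list[str]:
    """Stage 2: decompose the prompt into intent-aligned sub-queries."""
    return [f"{prompt} (sub-goal {i}, given {profile})" for i in range(3)]

def retrieve(query: str) -> list[str]:
    """Stage 3: dual-stage retrieval over private and public sources."""
    return [f"evidence for: {query}"]  # placeholder for vector DB / web search

def generate_report(prompt: str, profile: str, evidence: list[str]) -> str:
    """Stage 4: synthesize evidence and profile into a style-consistent report."""
    return "\n".join(["# Report", f"Prompt: {prompt}", *evidence])

if __name__ == "__main__":
    profile = extract_profile(["draft.txt", "emails.mbox"])
    queries = develop_questions("Summarize recent RAG advances", profile)
    evidence = [chunk for q in queries for chunk in retrieve(q)]
    print(generate_report("Summarize recent RAG advances", profile, evidence))
```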
```
PDR/
├── deepsearcher/                    # Core PDR library
│   ├── agent/                       # RAG agents (DeepSearch, NaiveRAG)
│   ├── embedding/                   # Embedding providers (FastEmbed, OpenAI, etc.)
│   ├── loader/                      # File loaders & web crawlers
│   ├── llm/                         # LLM providers (DeepSeek, OpenAI, etc.)
│   ├── vector_db/                   # Vector stores (Milvus, Qdrant)
│   ├── personalized_understanding.py
│   ├── online_query.py
│   └── offline_loading.py
├── data/                            # Dataset structure (see Datasets)
│   ├── abstract/
│   ├── report/
│   ├── speech/
│   └── topic/
├── examples/                        # Demo scripts
│   └── demo.py
├── evaluation/                      # PDR-Eval (LLM-as-judge prompts & scripts)
├── local_retriever/                 # Local retrieval server for public search
├── assets/                          # Figures and supplementary materials
├── config.yaml                      # Configuration
└── requirements.txt
```
1. Clone the repository and install dependencies:

   ```bash
   git clone https://github.com/Xiaopengli1/PDR.git
   cd PDR
   pip install -r requirements.txt
   ```

2. Install FastEmbed for local embedding models:

   ```bash
   pip install fastembed
   ```

3. Configure the LLM and embedding providers in `config.yaml` (e.g., DeepSeek API, OpenAI, or local models such as Qwen-3-14B).
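Since PDR builds on the upstream deep-searcher project, providers can likely also be configured programmatically following that project's pattern. The sketch below rests on that assumption; the provider names are illustrative, so check `config.yaml` for the names PDR actually registers:

```python
# Programmatic configuration, assuming PDR inherits deep-searcher's
# Configuration / set_provider_config / init_config interface.
# Provider names below are illustrative -- verify against config.yaml.
from deepsearcher.configuration import Configuration, init_config

config = Configuration()  # loads defaults from config.yaml
config.set_provider_config("llm", "DeepSeek", {"model": "deepseek-chat"})
config.set_provider_config("embedding", "FastEmbedEmbedding",
                           {"model": "BAAI/bge-small-en-v1.5"})
init_config(config=config)
```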
Step 1: Launch the local retriever (for public/external search):

```bash
bash local_retriever/retrieval_launch.sh
```

Step 2: Run Deep Research:

```bash
python examples/demo.py
```

The script produces JSONL output with per-sample metrics (ROUGE, METEOR, F1) and generation results, suitable for downstream PDR-Eval assessment.
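The JSONL schema is defined by `examples/demo.py`; assuming per-sample metric fields such as `rouge_1`, `rouge_l`, and `meteor` (the field names and the `results.jsonl` filename here are assumptions), corpus-level averages can be computed with the standard library:

```python
# Hypothetical aggregation of demo.py's per-sample JSONL output.
# Adjust the filename and field names to match what demo.py actually emits.
import json
from statistics import mean

with open("results.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for key in ("rouge_1", "rouge_l", "meteor"):
    vals = [r[key] for r in records if key in r]
    if vals:
        print(f"{key}: mean {mean(vals):.4f} over {len(vals)} samples")
```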
The PDR Dataset is designed for personalized deep research evaluation. Owing to its size, the dataset is hosted externally.
Dataset download: PDR Dataset on Google Drive
Download the folder and place the contents under data/ to match the expected structure below.
Figure 2: Dataset construction pipeline. Each task provides user queries, personalized files, and ground-truth reports written by the same user.
| Task | Description | Source |
|---|---|---|
| Task 1: Abstract Generation | Emulate author's rhetorical patterns while preserving scientific accuracy | LongLaMP (Citation Network V14) |
| Task 2: Topic Writing | Capture creative style, sarcasm, and subreddit-specific terminology | LongLaMP (Reddit TL;DR) |
| Task 3: Report Generation | Match tone, organization, and formatting conventions | Substack (anonymous authors) |
| Task 4: Speech-Script Generation | Reproduce material selection, logical flow, and sentence structure | TED & Medium |
Each author/task folder contains:
- `knowledge_base/` — Private user documents (drafts, papers, notes)
- `profile.txt` — Extracted profile material
- `input.txt` — Task prompt
- `output.txt` — Ground-truth report written by the same user
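Given this layout, a sample can be loaded with a few lines of standard-library Python (the folder name `data/report/author_01` below is a hypothetical example):

```python
# Load one author/task sample following the folder layout above.
# "data/report/author_01" is a hypothetical example path.
from pathlib import Path

sample = Path("data/report/author_01")
profile = (sample / "profile.txt").read_text(encoding="utf-8")
prompt = (sample / "input.txt").read_text(encoding="utf-8")
reference = (sample / "output.txt").read_text(encoding="utf-8")
knowledge = {p.name: p.read_text(encoding="utf-8")
             for p in (sample / "knowledge_base").glob("*") if p.is_file()}
print(f"{len(knowledge)} private documents; prompt: {prompt[:80]}...")
```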
We provide PDR-Eval, a hybrid evaluation framework assessing both factual quality and personalization.
Figure 3: PDR-Eval combines lexical overlap (ROUGE, METEOR), quality evaluation (Comprehensiveness, Readability), and personalization evaluation (Content & Presentation Personalization).
| Category | Metrics |
|---|---|
| Lexical Overlap | ROUGE-1, ROUGE-L, METEOR |
| Quality (LLM-as-Judge) | Comprehensiveness (Comp.), Readability (Read.) |
| Personalization (LLM-as-Judge) | Content Personalization (C.P.), Presentation Personalization (P.P.) |
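The lexical-overlap metrics follow their standard definitions. The sketch below reproduces them with the `rouge-score` and NLTK packages; this illustrates the metrics themselves and may differ from the tooling used in `evaluation/script.py`:

```python
# Minimal lexical-overlap scoring with rouge-score and NLTK's METEOR.
# Tokenization and stemming choices here may differ from evaluation/script.py.
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet data

def lexical_scores(prediction: str, reference: str) -> dict[str, float]:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    return {
        "rouge_1": rouge["rouge1"].fmeasure,
        "rouge_l": rouge["rougeL"].fmeasure,
        "meteor": meteor_score([reference.split()], prediction.split()),
    }

print(lexical_scores("the cat sat on the mat", "a cat sat on a mat"))
```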
```bash
cd evaluation
# Score the JSONL output produced by examples/demo.py
python script.py
```

Evaluation prompts are provided in `evaluation/evaluation_prompt.py`.
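The four judge dimensions are scored with those prompts. The snippet below is a generic sketch of such a call using the OpenAI SDK as an assumed backend, with a stand-in prompt rather than the actual PDR-Eval prompt:

```python
# Generic LLM-as-judge call. The real prompts live in
# evaluation/evaluation_prompt.py; JUDGE_PROMPT below is a stand-in.
from openai import OpenAI

JUDGE_PROMPT = (
    "Rate the report from 1-10 on Comprehensiveness, Readability, "
    "Content Personalization, and Presentation Personalization.\n"
    "Profile: {profile}\nReport: {report}"
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(profile: str, report: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge model is an assumption, not fixed by PDR-Eval
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(profile=profile,
                                                  report=report)}],
    )
    return resp.choices[0].message.content
```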
If you find PDR useful, please cite our paper:
This project is released under the license specified in LICENSE.txt.
PDR builds upon deep-searcher for retrieval infrastructure. We thank the open-source community and our collaborators at City University of Hong Kong and Huawei Noah's Ark Lab.