pip install -e ".[dev]"
export OPENAI_API_KEY=your-key
# Evaluate 3+ models — oracle is built automatically from predictions
gistbench run -d data.csv -m gpt-4o --results-db results.db
gistbench run -d data.csv -m gpt-4o-mini --results-db results.db
gistbench run -d data.csv -m gpt-4-turbo --results-db results.db # scores all 3 models
See INSTRUCTIONS.md for full documentation.
Validate the full pipeline (extraction → IG → IS → taxonomy → scoring) end to end. Both the bundled mock dataset and the real synthetic split ship with bundled oracles, so a single model run is enough.
export OPENAI_API_KEY=your-key
# Default: both datasets (mock + synthetic), gpt-4o-mini, 5 users, no report
gistbench smoke-test
# Pick a single dataset and write a timestamped report into reports/
gistbench smoke-test --datasets mock --report-dir reports
gistbench smoke-test --datasets synthetic -n 10 --report-dir reports
# Or pin to a specific filename
gistbench smoke-test --datasets mock --report report.md
# Local model via Ollama / vLLM / LM Studio (no API key needed)
gistbench smoke-test --base-url http://localhost:11434/v1 --models llama3
The command exits non-zero if any case fails.
GISTBench evaluates how well LLMs understand users from their engagement history by measuring two complementary axes:
- Interest Groundedness (IG): Are the extracted interests actually supported by the user's engagement data?
- Interest Specificity (IS): Can the model cite the specific items that support each interest?
The final score is the harmonic mean of IG and IS.
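As a rough illustration of how the two axes combine, the harmonic mean can be sketched as below. The function name, the assumption that both scores lie in [0, 1], and the handling of a zero score are illustrative choices here, not taken from the GISTBench code.

```python
def harmonic_mean(ig: float, is_: float) -> float:
    """Harmonic mean of two non-negative scores.

    Assumption: if either score is 0 the combined score is 0, which
    avoids a division by zero and matches the limit of the formula.
    """
    if ig < 0 or is_ < 0:
        raise ValueError("scores must be non-negative")
    if ig == 0 or is_ == 0:
        return 0.0
    return 2 * ig * is_ / (ig + is_)


# Hypothetical scores: strong groundedness, weaker specificity.
score = harmonic_mean(0.8, 0.5)
print(f"{score:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, a model cannot compensate for poor specificity with high groundedness alone, or vice versa.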
- Run 3+ models on the same dataset — each model extracts interests and cites evidence
- Oracle is built automatically from the union of verified interests across all models
- All models are rescored using the cross-model oracle
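The steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the nested-dictionary layout, the function names, and the `coverage` metric are hypothetical stand-ins, and the actual GISTBench rescoring may differ.

```python
from typing import Dict, Set

# Assumed layout: verified[model][user_id] -> set of verified interest labels.
Verified = Dict[str, Dict[str, Set[str]]]


def build_oracle(verified: Verified) -> Dict[str, Set[str]]:
    """Union the verified interests of every model, per user."""
    oracle: Dict[str, Set[str]] = {}
    for per_user in verified.values():
        for user_id, interests in per_user.items():
            oracle.setdefault(user_id, set()).update(interests)
    return oracle


def coverage(found: Set[str], oracle_interests: Set[str]) -> float:
    """Illustrative stand-in metric: share of oracle interests a model found."""
    if not oracle_interests:
        return 0.0
    return len(found & oracle_interests) / len(oracle_interests)


# Made-up example data for three hypothetical models.
verified = {
    "model_a": {"u1": {"cooking", "hiking"}},
    "model_b": {"u1": {"cooking", "jazz"}},
    "model_c": {"u1": {"hiking"}, "u2": {"chess"}},
}

oracle = build_oracle(verified)
for model, per_user in verified.items():
    print(model, round(coverage(per_user.get("u1", set()), oracle["u1"]), 3))
```

Using the union means an interest that only one model surfaces still counts against the others, which is why several models are needed before the oracle is meaningful.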
| Dataset | Content | Signals |
|---|---|---|
| synthetic | Videos | explicit+, implicit+, implicit- |
| kuairec | Videos | explicit+, implicit+, implicit- |
| mind | News | explicit+, implicit- |
| amazon_digital_music | Songs | explicit+, implicit+, explicit- |
| yelp | Stores | explicit+, implicit+, explicit- |
| goodreads | Books | explicit+, implicit+, explicit- |
Works with any OpenAI-compatible API (OpenAI, Ollama, vLLM, etc.) and custom datasets.
The data is released under CC-BY-NC 4.0 and is intended for benchmarking purposes only.
The object_text portion of the data consists of outputs of Llama 3.2 and is subject to the Llama 3.2 license. If you use this portion of the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include "Llama" at the beginning of any such AI model name.
| Field | Description |
|---|---|
| interaction_type | Actual interaction type for the user x item pair from the aggregate content pool, aggregated to 3 types. |
| user_id | Anonymized user identifier (1-N). |
| object_id | Anonymized object identifier (1-N). |
| interaction_time | Anonymized interaction time. |
| object_text | Hashtags created by Llama 3.2 70B from a video summary that a VLM generated frame by frame. |
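To make the field layout concrete, here is a small sketch that groups interactions into per-user histories. It assumes the data is a CSV whose header uses the field names above and that interaction_time parses as an integer. The sample rows are invented for illustration and are not taken from the released data.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the real files may use different values or ordering.
SAMPLE = """user_id,object_id,interaction_type,interaction_time,object_text
1,11,implicit-,105,#cars #racing
1,10,explicit+,100,#cooking #pasta
2,10,implicit+,90,#cooking #pasta
"""


def load_histories(fp):
    """Group rows by user_id and sort each history by interaction_time."""
    histories = defaultdict(list)
    for row in csv.DictReader(fp):
        histories[row["user_id"]].append(row)
    for rows in histories.values():
        # Assumption: interaction_time is an integer-like string.
        rows.sort(key=lambda r: int(r["interaction_time"]))
    return dict(histories)


histories = load_histories(io.StringIO(SAMPLE))
for user_id, rows in histories.items():
    print(user_id, [(r["object_id"], r["interaction_type"]) for r in rows])
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory sample.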
Built with ❤️ at Meta — Recommendation Systems (MRS)