Skip to content

facebookresearch/GISTBench

GitHub Stars Status: Released License: CC-BY-NC 4.0

GISTBench

Evaluating LLM user understanding via evidence-based interest verification

Star on GitHub   Watch on GitHub


Quick Start

pip install -e ".[dev]"
export OPENAI_API_KEY=your-key

# Evaluate 3+ models — oracle is built automatically from predictions
gistbench run -d data.csv -m gpt-4o      --results-db results.db
gistbench run -d data.csv -m gpt-4o-mini --results-db results.db
gistbench run -d data.csv -m gpt-4-turbo --results-db results.db  # scores all 3 models

See INSTRUCTIONS.md for full documentation.

Smoke Test

Validate the full pipeline (extraction → IG → IS → taxonomy → scoring) end to end. Both the bundled mock dataset and the real synthetic split ship with bundled oracles, so a single model run is enough.

export OPENAI_API_KEY=your-key

# Default: both datasets (mock + synthetic), gpt-4o-mini, 5 users, no report
gistbench smoke-test

# Pick a single dataset and write a timestamped report into reports/
gistbench smoke-test --datasets mock --report-dir reports
gistbench smoke-test --datasets synthetic -n 10 --report-dir reports

# Or pin to a specific filename
gistbench smoke-test --datasets mock --report report.md

# Local model via Ollama / vLLM / LM Studio (no API key needed)
gistbench smoke-test --base-url http://localhost:11434/v1 --models llama3

The command exits non-zero if any case fails.

What is GISTBench?

GISTBench evaluates how well LLMs understand users from their engagement history by measuring two complementary axes:

  • Interest Groundedness (IG): Are the extracted interests actually supported by the user's engagement data?
  • Interest Specificity (IS): Can the model cite the specific items that support each interest?

The final score is the harmonic mean of IG and IS.

How It Works

  1. Run 3+ models on the same dataset — each model extracts interests and cites evidence
  2. Oracle is built automatically from the union of verified interests across all models
  3. All models are rescored using the cross-model oracle

GISTBench Pipeline

Supported Datasets

Dataset Content Signals
synthetic Videos explicit+, implicit+, implicit-
kuairec Videos explicit+, implicit+, implicit-
mind News explicit+, implicit-
amazon_digital_music Songs explicit+, implicit+, explicit-
yelp Stores explicit+, implicit+, explicit-
goodreads Books explicit+, implicit+, explicit-

Works with any OpenAI-compatible API (OpenAI, Ollama, vLLM, etc.) and custom datasets.

License

The data is released under CC-BY-NC 4.0 and is intended for benchmarking purposes only.

The object_text portion of the data are outputs of Llama 3.2 and subject to the Llama 3.2 license. If you use this portion of the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include "Llama" at the beginning of any such AI model name.

Data Description

Field Description
interaction_type Actual interaction type of the user x item from the aggregate content pool, aggregated to 3 types.
user_id Anonymized user identifier (1-N).
object_id Anonymized object identifier (1-N).
interaction_time Anonymized interaction time.
object_text Generated by a VLM to summarize the video frame by frame then using Llama 3.2 70B to create hashtags from the video summary.

Built with ❤️ at Meta — Recommendation Systems (MRS)

About

GISTBench Dataset and Evaluation Code

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages