GISTBench

Evaluating LLM user understanding via evidence-based interest verification

Quick Start

pip install -e ".[dev]"
export OPENAI_API_KEY=your-key

# Evaluate 3+ models — oracle is built automatically from predictions
gistbench run -d data.csv -m gpt-4o      --results-db results.db
gistbench run -d data.csv -m gpt-4o-mini --results-db results.db
gistbench run -d data.csv -m gpt-4-turbo --results-db results.db  # scores all 3 models

See INSTRUCTIONS.md for full documentation.

Link to Dataset

Smoke Test

Validate the full pipeline (extraction → IG → IS → taxonomy → scoring) end to end. A full evaluation needs 3+ models to build the oracle, but both the bundled mock dataset and the real synthetic split ship with bundled oracles, so a single model run is enough here.

export OPENAI_API_KEY=your-key

# Default: both datasets (mock + synthetic), gpt-4o-mini, 5 users, no report
gistbench smoke-test

# Pick a single dataset and write a timestamped report into reports/
gistbench smoke-test --datasets mock --report-dir reports
gistbench smoke-test --datasets synthetic -n 10 --report-dir reports

# Or pin to a specific filename
gistbench smoke-test --datasets mock --report report.md

# Local model via Ollama / vLLM / LM Studio (no API key needed)
gistbench smoke-test --base-url http://localhost:11434/v1 --models llama3

The command exits non-zero if any case fails.

What is GISTBench?

GISTBench evaluates how well LLMs understand users from their engagement history by measuring two complementary axes:

- Interest Groundedness (IG): Are the extracted interests actually supported by the user's engagement data?
- Interest Specificity (IS): Can the model cite the specific items that support each interest?

The final score is the harmonic mean of IG and IS.
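As a rough illustration of how the two axes combine, the sketch below computes a harmonic mean of two scores. The function name, the assumption that both scores lie in [0, 1], and the choice to return 0.0 when either score is zero are illustrative and not taken from the GISTBench code.

```python
def harmonic_mean(ig: float, is_score: float) -> float:
    """Harmonic mean of two non-negative scores.

    Assumption: a zero on either axis yields 0.0, which also avoids
    dividing by zero when both scores are zero.
    """
    if ig < 0 or is_score < 0:
        raise ValueError("scores must be non-negative")
    if ig == 0 or is_score == 0:
        return 0.0
    return 2 * ig * is_score / (ig + is_score)


if __name__ == "__main__":
    # A model that is well grounded but less specific: about 0.686.
    print(round(harmonic_mean(0.8, 0.6), 3))
```

The harmonic mean is pulled toward the lower of the two values, so a model cannot reach a high final score by doing well on only one axis.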

How It Works

1. Run 3+ models on the same dataset. Each model extracts interests and cites evidence.
2. The oracle is built automatically from the union of verified interests across all models.
3. All models are rescored against the cross-model oracle.
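The oracle-building step can be sketched as a per-user set union. Everything here is hypothetical: the data layout (model name → user id → set of verified interests), the function names, and especially the `oracle_coverage` helper, which is only one plausible way to compare a model against the oracle and is not the benchmark's actual rescoring metric.

```python
from collections import defaultdict


def build_oracle(verified_by_model):
    """Union each user's verified interests across all models."""
    oracle = defaultdict(set)
    for per_user in verified_by_model.values():
        for user_id, interests in per_user.items():
            oracle[user_id] |= set(interests)
    return dict(oracle)


def oracle_coverage(model_interests, oracle_interests):
    """Hypothetical helper: fraction of oracle interests the model found."""
    if not oracle_interests:
        return 0.0
    return len(set(model_interests) & oracle_interests) / len(oracle_interests)


# Made-up verified interests for three models.
verified = {
    "model_a": {"u1": {"cooking", "hiking"}},
    "model_b": {"u1": {"hiking", "jazz"}},
    "model_c": {"u1": {"cooking"}, "u2": {"chess"}},
}

oracle = build_oracle(verified)

if __name__ == "__main__":
    for model, per_user in verified.items():
        found = per_user.get("u1", set())
        print(model, round(oracle_coverage(found, oracle["u1"]), 3))
```

Because the oracle is a union, adding a model can only grow it, which is why every model is scored again once the last run finishes.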

Supported Datasets

| Dataset | Content | Signals |
|---|---|---|
| `synthetic` | Videos | explicit+, implicit+, implicit- |
| `kuairec` | Videos | explicit+, implicit+, implicit- |
| `mind` | News | explicit+, implicit- |
| `amazon_digital_music` | Songs | explicit+, implicit+, explicit- |
| `yelp` | Stores | explicit+, implicit+, explicit- |
| `goodreads` | Books | explicit+, implicit+, explicit- |

Works with any OpenAI-compatible API (OpenAI, Ollama, vLLM, etc.) and custom datasets.

License

The data is released under CC-BY-NC 4.0 and is intended for benchmarking purposes only.

The `object_text` portion of the data consists of outputs of Llama 3.2 and is subject to the Llama 3.2 license. If you use this portion of the data to create, train, fine-tune, or otherwise improve an AI model that is distributed or made available, you shall also include "Llama" at the beginning of any such AI model name.

Data Description

| Field | Description |
|---|---|
| `interaction_type` | Interaction type for the user-item pair, taken from the aggregate content pool and collapsed into 3 types. |
| `user_id` | Anonymized user identifier (1-N). |
| `object_id` | Anonymized object identifier (1-N). |
| `interaction_time` | Anonymized interaction time. |
| `object_text` | Hashtags describing the video. A VLM summarizes the video frame by frame, then Llama 3.2 70B generates hashtags from that summary. |
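For orientation, here is a minimal sketch of reading rows with these five fields and grouping them by user. The column names come from the field list above. The sample rows, the `interaction_type` labels, the column order, and the `load_history` helper are made up for illustration and may not match the released file.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; values and column order are assumptions.
SAMPLE = """user_id,object_id,interaction_type,interaction_time,object_text
1,10,explicit+,1,#cooking #pasta
1,11,implicit-,2,#cars
2,12,implicit+,1,#chess
"""


def load_history(fileobj):
    """Group interaction rows by user_id (hypothetical helper)."""
    history = defaultdict(list)
    for row in csv.DictReader(fileobj):
        history[row["user_id"]].append(row)
    return dict(history)


history = load_history(io.StringIO(SAMPLE))

if __name__ == "__main__":
    for user_id, rows in history.items():
        print(user_id, [(r["object_id"], r["interaction_type"]) for r in rows])
```

To read a real file, pass an open file handle, for example `open("data.csv", newline="")`, in place of the `io.StringIO` object.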

Built with ❤️ at Meta - Recommendation Systems (MRS)
