feat: persistent baseline cache (CACHEON_BASELINE_CACHE_DIR) #82

Open
ai-hpc wants to merge 1 commit into latent-to:main from ai-hpc:feat/baseline-cache

ai-hpc commented Jun 8, 2026

Summary

  • Adds CACHEON_BASELINE_CACHE_DIR env var to validator/config.py
  • validator/gpu_eval.py: loads BaselineCache JSON before calling run_baseline(); saves after a fresh run
  • tests/baseline_cache/README.md: format docs, usage, and key derivation explanation

Motivation

The vLLM v0.22.0 baseline container takes ~12 minutes per eval round (10 min model load + 2 min inference). With this cache, repeated runs against the same (block_hash, baseline_digest, PROMPT_ENGINE_VERSION) skip the container entirely, which is critical for fast iteration during miner development.
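
To make the keying concrete, here is a minimal sketch of how a cache keyed on (block_hash, baseline_digest, PROMPT_ENGINE_VERSION) could work. It is an illustration under assumptions, not the code in this PR: the function names `cache_key` and `load_or_run` are hypothetical, and the SHA-256 derivation truncated to 16 hex characters is only a guess consistent with the 16-character key in the log line; the README in this PR documents the real derivation.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


def cache_key(block_hash: str, baseline_digest: str, prompt_version: str) -> str:
    """Hypothetical key: first 16 hex chars of SHA-256 over the three fields."""
    material = "\n".join([block_hash, baseline_digest, prompt_version])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def load_or_run(
    cache_dir: Optional[Path],
    key: str,
    run_baseline: Callable[[], dict],
) -> dict:
    """Return the cached baseline if present; otherwise run it and save the result."""
    if cache_dir is None:
        # The cache is opt-in: with no directory configured, always run.
        return run_baseline()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = run_baseline()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Because all three fields feed the key, changing any one of them yields a different file name, so a stale entry is simply never looked up rather than overwritten.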

Usage

python3 scripts/run_validator_eval.py \
  --block-hash 0x<hash> \
  --miner-image <image> \
  --model-volume /models \
  --baseline-cache-dir ./tests/baseline_cache

First run saves tests/baseline_cache/<key>.json. Subsequent runs load it, printing:

INFO Loaded baseline from cache: key=5ea34e493462daee file=... (10 prompts)

Test plan

  • Run eval once with --baseline-cache-dir → verify JSON written + log line
  • Run again → verify baseline container NOT started + same results

Adds opt-in BaselineCache persistence that saves/loads the vLLM baseline
results to disk, eliminating the 12-min baseline container on repeated
runs with the same block_hash + baseline_digest + prompt version.

- validator/config.py: add CACHEON_BASELINE_CACHE_DIR env var
- validator/gpu_eval.py: load cache before run_baseline(), save after
- tests/baseline_cache/README.md: format docs + usage examples

ai-hpc force-pushed the feat/baseline-cache branch from ab985f6 to 40215cb on June 8, 2026 15:20

Run a full validator eval with `--baseline-cache-dir` pointed here:

```bash
python3 scripts/run_validator_eval.py \
```

Collaborator

That script doesn't exist in the repo. gpu_eval is an env-var-driven container entrypoint (python -m validator.gpu_eval).
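
Since the entrypoint is driven by environment variables, the option would presumably be read from CACHEON_BASELINE_CACHE_DIR rather than from a CLI flag. A minimal sketch of that lookup follows; the helper name `baseline_cache_dir` is hypothetical and the actual handling in validator/config.py may differ.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def baseline_cache_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the cache directory, or None when the variable is unset or blank."""
    raw = env.get("CACHEON_BASELINE_CACHE_DIR", "").strip()
    # Treat an empty or whitespace-only value as "cache disabled".
    return Path(raw) if raw else None
```

Under that assumption, a local run would set the variable before invoking `python -m validator.gpu_eval`, and an unset variable would leave the cache off.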

Comment thread validator/gpu_eval.py
_upload_progress(state_dir)
return 4

# ------------------------------------------------------------------

Collaborator

This never hits in production because each block is 12 s, so the block_hash in the key changes between rounds. The inline comment is a bit misleading.

This is fine if the intent is local dev/CI only (the README does say so), but the inline comment in gpu_eval.py reads like a prod optimization:

    # Baseline cache: skip the 12-min vLLM container if a cached result
    # exists for this (block_hash, baseline_digest, prompt_version) key

@xavierlyu

Collaborator

The baseline is the teacher-forcing reference that every challenger is scored against. Loading it from an unsigned JSON file on disk, gated only by a single env var, means anyone who can write that file controls scoring. If CACHEON_BASELINE_CACHE_DIR is accidentally set on a real validator (shared .env, misconfigured deploy), the validator silently loads whatever is in that directory, with no validation that the prompts match the current block. For a subnet where this feeds directly into weight-setting, that's a meaningful risk.
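
One possible mitigation, sketched here under stated assumptions rather than proposed as the fix: store the key fields and a digest of the prompts inside the cache entry, seal the entry with an HMAC, and refuse to load anything whose tag or fields do not match the current round. The names `prompts_digest`, `seal`, and `open_sealed`, and the entry layout, are hypothetical. An HMAC only helps if whoever can write the file cannot also read the secret.

```python
import hashlib
import hmac
import json
from typing import List, Optional


def prompts_digest(prompts: List[str]) -> str:
    """SHA-256 over the JSON-encoded prompt list."""
    return hashlib.sha256(json.dumps(prompts).encode("utf-8")).hexdigest()


def seal(entry: dict, secret: bytes) -> dict:
    """Attach an HMAC-SHA256 tag computed over the canonical JSON of the entry."""
    body = json.dumps(entry, sort_keys=True).encode("utf-8")
    tag = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return {"entry": entry, "hmac": tag}


def open_sealed(sealed: dict, secret: bytes, expected: dict) -> Optional[dict]:
    """Return the entry only if the tag verifies and every expected field matches."""
    entry = sealed.get("entry")
    tag = sealed.get("hmac")
    if not isinstance(entry, dict) or not isinstance(tag, str):
        return None  # malformed file
    body = json.dumps(entry, sort_keys=True).encode("utf-8")
    good = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(good.encode("utf-8"), tag.encode("utf-8")):
        return None  # modified, or written without the secret
    if any(entry.get(name) != value for name, value in expected.items()):
        return None  # stale: different block, digest, version, or prompts
    return entry
```

Here `expected` would hold the current round's block_hash, baseline_digest, prompt version, and prompt digest, so a file left over from another block is rejected even when its tag is valid.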

@xavierlyu

Collaborator

I think tests/baseline_cache/ should be gitignored.
