feat(oracle): unified tier-driven corpus runner + oracle.yml (C3)#177
Codecov Report
✅ All modified and coverable lines are covered by tests.
| Branch | feat/oracle-unified-runner |
| Testbed | ubuntu-latest |
⚠️ WARNING: No Threshold found! Without a Threshold, no Alerts will ever be generated.
To only post results if a Threshold exists, set the `--ci-only-thresholds` flag.
🚨 2 Alerts
| Benchmark | Measure (units) | Result (Result Δ%) | Baseline | Upper Boundary (Limit %) |
|---|---|---|---|---|
| rag_pipeline::pipeline::index cargo_resolver:resolver_config() | Instructions (instructions x 1e6) | 🚨 800.17 x 1e6 (+39.29%) | 574.45 x 1e6 | 585.94 x 1e6 (136.56%) |
| rag_pipeline::pipeline::query_cold cargo_resolver:resolver_built_config() | Instructions (instructions x 1e6) | 🚨 154.66 x 1e6 (+15.59%) | 133.80 x 1e6 | 136.47 x 1e6 (113.33%) |
| Benchmark | Estimated Cycles (cycles x 1e6) | Instructions (instructions x 1e6) (Result Δ%) | Baseline (instructions x 1e6) | Upper Boundary (instructions x 1e6) (Limit %) | L1 Hits (hits x 1e6) | LL Hits (hits x 1e6) | RAM Hits (hits x 1e3) | Total read+write (reads/writes x 1e6) |
|---|---|---|---|---|---|---|---|---|
| rag_pipeline::pipeline::index cargo_resolver:resolver_config() | 1,213.35 | 🚨 800.17 (+39.29%) | 574.45 | 585.94 (136.56%) | 1,115.38 | 17.40 | 312.91 | 1,133.10 |
| rag_pipeline::pipeline::query_cold cargo_resolver:resolver_built_config() | 241.18 | 🚨 154.66 (+15.59%) | 133.80 | 136.47 (113.33%) | 230.99 | 1.96 | 11.51 | 232.96 |
| rag_pipeline::pipeline::query_warm cargo_resolver:resolver_index() | 232.82 | 149.15 (-8.98%) | 163.86 | 167.13 (89.24%) | 223.20 | 1.85 | 10.11 | 225.06 |
| Branch | feat/oracle-unified-runner |
| Testbed | ubuntu-latest |
| Benchmark | Latency, seconds (Result Δ%) | Baseline | Upper Boundary, seconds (Limit %) |
|---|---|---|---|
| index_time/full_rebuild_cargo | 4.96 s (+33.39%) | 3.72 s | 5.77 s (85.99%) |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d99df400c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    timeout "${TIMEOUT_MINUTES}m" \
      "$RAG_RAT_BIN" --json oracle report --corpus "$CORPUS" --corpora "$CORPORA" ) > "$REPORT_OUT"
Apply the corpus timeout to the whole run
The profile timeout is documented as the wall-clock budget for the whole corpus run, but this timeout only starts after cloning, all prepare steps, and rag-rat index --full have already completed. If cargo fetch, the kernel make -j, or indexing hangs or runs far past the corpus budget, the small PR jobs can sit until the default Actions timeout and the heavy job until its 360-minute job timeout instead of failing at the configured 8/45/120 minutes.
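One way to make a budget cover every phase is to share a single deadline across all steps. The following is a minimal Python sketch of that idea, not the runner's code (the runner in this PR is a bash script); `run_with_budget` and the stand-in commands are hypothetical.

```python
import subprocess
import sys
import time


def run_with_budget(steps, budget_seconds):
    """Run each command in order under one shared wall-clock deadline.

    Returns the name of the step that failed or timed out, or None if
    every step finished in time. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + budget_seconds
    for name, argv in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return name  # earlier steps already spent the budget
        try:
            # subprocess.run kills the child and raises when the timeout hits.
            result = subprocess.run(argv, timeout=remaining)
        except subprocess.TimeoutExpired:
            return name
        if result.returncode != 0:
            return name
    return None


# Stand-in commands; a real run would clone, prepare, index, then report.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

ok = run_with_budget([("clone", quick), ("index", quick)], budget_seconds=20)
hung = run_with_budget([("clone", quick), ("index", slow)], budget_seconds=3)
```

Here the slow "index" step is cut off by the shared deadline rather than by a timer that only starts at the report step. Note that `subprocess.run` only kills the direct child, so a shell-level wrapper may still be preferable when steps spawn their own process trees.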
    # Mirror the engine's vacuous-1.0 convention for an empty denominator.
    return 100.0 if total == 0 else 100.0 * numerator / total

    name = f"{report['corpus_id']}/oracle"
Include profile identity in the Bencher series
This benchmark name is only keyed by corpus_id, so when a profile changes while keeping the same id (for example bumping rust-cargo/linux-kernel rev, bindings, or prepare steps), Bencher will append the new measurements to the old series even though tools/oracle-corpora.toml explicitly treats those reports as incomparable via corpus_profile_hash. The deleted scripts at least carried the tag in the benchmark name; this should include the rev/profile hash (and ideally tool version) or otherwise force a new series on corpus changes.
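A follow-up commit in this thread keys the series as `corpus_id@<profile_hash12>+<tool_version>`. The sketch below shows one way such a name could be built and wrapped in Bencher Metric Format. It is an illustration only: the `tool_version` field name, the `resolved-pct` measure, and all report values are assumptions, not taken from `oracle-report-bmf.py`.

```python
import json


def series_name(report):
    """Key the series by corpus id, a 12-char profile hash prefix, and tool version."""
    profile = report["corpus_profile_hash"][:12]
    # "tool_version" is an assumed field name for this sketch.
    return f"{report['corpus_id']}@{profile}+{report['tool_version']}"


def to_bmf(report, measure, value):
    """Wrap one value in Bencher Metric Format: {benchmark: {measure: {"value": v}}}."""
    return {series_name(report): {measure: {"value": float(value)}}}


# Made-up report values, for illustration only.
report = {
    "corpus_id": "rust-semver",
    "corpus_profile_hash": "0123456789abcdef0123456789abcdef",
    "tool_version": "1.0.0",
}
bmf = to_bmf(report, "resolved-pct", 88.6)
print(json.dumps(bmf, indent=2))
```

Because the hash prefix and tool version are part of the benchmark name, any change to the profile or the tool yields a new name and therefore a new series.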
    while IFS= read -r prepare_cmd; do
      [ -n "$prepare_cmd" ] || continue
      echo "oracle-run: prepare> $prepare_cmd" >&2
      ( cd "$CHECKOUT" && bash -c "$prepare_cmd" )
Bound kernel make parallelism
For the linux-kernel profile this executes the declared `make -j` literally; I checked GNU `make --help`, and `-j [N]` means infinite jobs when no argument is supplied. The old kernel oracle script bounded the build with `-j$(nproc)`, so the heavy self-hosted run can now oversubscribe CPU/RAM with unbounded kernel compile jobs before it ever reaches the oracle report.
    while IFS= read -r prepare_cmd; do
      [ -n "$prepare_cmd" ] || continue
      echo "oracle-run: prepare> $prepare_cmd" >&2
      ( cd "$CHECKOUT" && bash -c "$prepare_cmd" )
Preserve Python virtualenv for scip-python
The py-requests profile installs the package into .venv, but each prepare command runs in a child shell and the later oracle subprocess inherits none of that environment. Sourcegraph's scip-python usage notes say to activate the virtualenv before running scip-python index (https://github.com/sourcegraph/scip-python#usage), so this small-tier leg runs against the global Python environment and can emit few or no dependency monikers despite the install step.
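Since a virtualenv activated in one child shell does not carry over to later processes, the environment has to be passed along explicitly. The sketch below shows the general idea in Python, assuming a POSIX `.venv/bin` layout; `venv_env` is a hypothetical helper, not part of the runner.

```python
import os
import tempfile
from pathlib import Path


def venv_env(checkout, base_env=None):
    """Return a copy of the environment with `<checkout>/.venv` activated.

    Mirrors what sourcing an activate script does for child processes:
    set VIRTUAL_ENV and put the venv's bin directory first on PATH.
    Returns the environment unchanged if no .venv exists.
    """
    env = dict(os.environ if base_env is None else base_env)
    venv = Path(checkout) / ".venv"
    bin_dir = venv / "bin"  # POSIX layout; Windows uses "Scripts"
    if not bin_dir.is_dir():
        return env
    env["VIRTUAL_ENV"] = str(venv)
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    env.pop("PYTHONHOME", None)  # activate scripts unset this as well
    return env


# Demo against a throwaway directory standing in for a corpus checkout.
with tempfile.TemporaryDirectory() as checkout:
    (Path(checkout) / ".venv" / "bin").mkdir(parents=True)
    env = venv_env(checkout, base_env={"PATH": "/usr/bin"})
    activated_path = env["PATH"]
    activated_venv = env["VIRTUAL_ENV"]
```

A later subprocess (for example the index or report step) would then be started with `env=venv_env(checkout)` so it resolves against the project's installed packages.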
    on:
      pull_request:
        paths:
          - 'crates/**'
Trigger the oracle on workspace dependency changes
This PR gate is meant to run when the rag-rat binary can change, but the path filter only covers crates/** and the oracle tool files. Root Cargo.toml defines workspace dependencies and Cargo.lock pins the actual dependency versions, so a dependency/profile update that touches only those root files skips the small oracle matrix entirely on pull requests and can merge parser/oracle behavior changes without this health gate.
    # produced by the same indexer build the heavy/Bencher tier uses.
    SCIP_CLANG_VERSION: v0.4.0
    SCIP_PYTHON_VERSION: 0.6.6
    RUST_ANALYZER_URL: https://github.com/rust-lang/rust-analyzer/releases/latest/download/rust-analyzer-x86_64-unknown-linux-gnu.gz
Pin the rust-analyzer download
The small tier is described as a pinned toolchain, but this downloads releases/latest on every run. When rust-analyzer publishes a new build, the rust-semver PR leg can start using a different SCIP emitter than the thresholds and heavy image were validated with, producing failures or report changes unrelated to the PR while the other tools remain version-pinned.
Six P2s:
- Whole-run timeout (oracle-run.sh): the corpus wall-clock budget wrapped only `oracle report`, so a hung clone/prepare/index sat until the Actions/job timeout. The runner now re-execs itself once under `timeout -k 60s <budget>m`, covering clone + prepare + index + report; an EXIT trap still removes the checkout on a timeout/gate-fail.
- Preserve the virtualenv (oracle-run.sh): activate a prepare-created `.venv` (VIRTUAL_ENV + PATH) before index/report so scip-python (pyright) resolves against the project's installed deps, not the global interpreter.
- Bencher series identity (oracle-report-bmf.py): the benchmark name was keyed only by corpus_id, so a profile/tool-version change (which makes reports incomparable) appended to the old series. Now keyed by corpus_id@<profile_hash12>+<tool_version> so an incomparable change starts a fresh series.
- Bound kernel make (oracle-corpora.toml): `make -j` is unlimited jobs; pinned to `make -j$(nproc)` so the heavy run doesn't oversubscribe the box. Recomputed the golden linux-kernel profile hash.
- Trigger on dep changes (oracle.yml): added root Cargo.toml/Cargo.lock to the PR + push path filters so a workspace-dep bump can't skip the gate.
- Pin rust-analyzer (oracle.yml): install it as a rustup component (pinned to the stable toolchain) instead of downloading releases/latest each run.
Addressed in 202cbf0, all six.
- P2: Apply the corpus timeout to the whole run ✅ The runner now re-execs itself once under
- P2: Preserve Python virtualenv for scip-python ✅ After the prepare steps, the runner activates a
- P2: Include profile identity in the Bencher series ✅
- P2: Bound kernel make parallelism ✅
- P2: Trigger the oracle on workspace dependency changes ✅ Added root
- P2: Pin the rust-analyzer download ✅ rust-analyzer is now installed as a rustup component (
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 202cbf0a73
    echo "oracle-run: oracle report --corpus $CORPUS" >&2
    set +e
    ( cd "$CHECKOUT" && RAG_RAT_COMMIT="$RAG_RAT_COMMIT" \
      "$RAG_RAT_BIN" --json oracle report --corpus "$CORPUS" --corpora "$CORPORA" ) > "$REPORT_OUT"
Add the missing oracle report CLI before invoking it
In this tree the oracle CLI only exposes run and status (crates/rag-rat-cli/src/cli.rs), so every small and heavy corpus reaches this line and then exits with an unknown report subcommand after cloning/indexing. The new workflow depends on this runner for all corpora, so the PR gate and Bencher path cannot produce a valid report until the report subcommand is included or this calls an existing command.
    scip-python)
      npm install -g "@sourcegraph/scip-python@${SCIP_PYTHON_VERSION}"
      scip-python --version ;;
Add scip-python support before running py-requests
For the py-requests small-matrix leg, this branch installs scip-python, but the checked-in oracle registry still only accepts rust-analyzer and scip-clang (OracleTool::ALL/from_db_str), so once the report command exists that leg cannot map the profile's tool = "scip-python" to a runnable backend and will fail the required PR matrix. Either exclude the Python corpus here or land the scip-python backend with the workflow that starts scheduling it.
The two new P1s ("add the
Merge order is #175 → #176 → #177: once the first two land on
The earlier P2 round (pin rust-analyzer, trigger on
… C2 resolution report (C2-CLI) (#175)

* feat(oracle): `oracle report --corpus <id>` - run a corpus + emit its C2 resolution report (C2-CLI)

Adds the CLI surface over the C2 report contract: load a corpus profile from tools/oracle-corpora.toml, run the oracle (produce a .scip with the corpus's tool, or consume a pre-built --scip), assemble the typed OracleResolutionReport, and emit it as JSON/TOON. Then apply the per-corpus health gate: a violated threshold exits non-zero even when the oracle command itself succeeded.

- The report is printed to stdout unconditionally, before the gate, so a Δ glue script can consume it even for a failing run; violations go to stderr.
- Unlike `oracle run`, a missing/unrunnable tool is a hard error here (not the exit-0 Blocked UX): this is a measurement runner over a corpus whose tool CI is expected to have installed, so a silent skip must not pass green.
- run + resolution_report run under one write lock; .scip production stays outside it (the #82 P3 lock-free-production posture).
- rag_rat_commit provenance reads $RAG_RAT_COMMIT (CI's git SHA), falling back to the crate version off CI.

* fix(oracle): address Codex review on the report command (#175)

- Roll back an unhealthy run (P2): a corpus that fails its health gate had already committed edge_oracle/monikers/oracle_runs, so it became the authoritative latest run and surfaced untrustworthy Compiler verdicts in later status/query despite the non-zero exit. `oracle report` now rolls the run back (delete verdicts + monikers + the oracle_runs row) atomically inside the same write lock when the gate fails, via new oracle::rollback_run + store::delete_oracle_run + IndexDatabase::rollback_oracle_run. Verified e2e: after a failed gate, edge_oracle/oracle_runs/monikers are empty and status reports no verdicts.
- Validate the checkout matches the corpus before stamping its profile (P2): fail closed unless the active checkout's target bindings (language -> dirs) exactly equal the corpus profile's bindings, so `oracle report --corpus X` can't stamp X's corpus_profile_hash onto a different population. New ensure_checkout_matches_corpus + unit test.

The third comment (honor timeout_minutes in the report run) is already handled one layer up: timeout_minutes is the corpus wall-clock budget the runner owns (documented on CorpusHealth), and tools/oracle-run.sh (C3, #177) wraps the `oracle report` invocation in `timeout ${timeout_minutes}m`.

* fix(oracle): reject custom target filters in oracle report's corpus check (#175)

Codex follow-up: ensure_checkout_matches_corpus compared only language -> directory set, so a [[target]] with the same language+dirs but custom include/exclude filters slipped through. The indexer applies those filters, so the report could stamp the corpus_profile_hash onto a filtered subset/superset. Now also require each target to carry the simple [target_bindings] form's default filters (include = ["**/*.<ext>"], empty exclude); any custom filter fails closed. Extended the unit test with custom-exclude and narrowed-include cases.

* fix(oracle): make oracle report's run provisional, not rollback-after-commit (#175)

Codex round 2 found my post-hoc rollback couldn't restore state: run::run's authoritative clear destroys the prior (tool,version) verdicts + the tool's monikers at the run's START, so deleting the failed run afterward left a prior healthy run with no verdicts/monikers (NoData), and the version-keyed delete could erase a prior healthy same-version run.

Fix: the report path no longer commits-then-maybe-deletes. run::run is split into run() (commit-on-success wrapper) + run_in_tx() (the body). New oracle::run_oracle_report runs run_in_tx + report assembly + the health gate inside ONE transaction and commits ONLY if healthy; an unhealthy run drops the transaction, rolling back the whole pass INCLUDING the authoritative clear, so the previous healthy run's verdicts/monikers/run-row are fully preserved. Removes rollback_run / delete_oracle_run / rollback_oracle_run / the finalize_corpus_report helper.

Verified e2e: a healthy run then a gate-failing run on the same checkout leaves the healthy run's verdicts intact (oracle status still reports them).
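The "provisional run" idea above (clear, write, gate, and commit only if healthy, all in one transaction) can be sketched with SQLite in Python. This is an illustration under assumptions, not the project's Rust implementation: the `verdicts` table, its columns, and `run_provisional` are hypothetical.

```python
import sqlite3


def run_provisional(db, tool, verdicts, min_resolved):
    """Replace a tool's verdicts only if the new pass is healthy.

    The clear, the inserts, and the health gate share one transaction,
    so a failing gate rolls back the clear too and the previous run's
    rows survive. Table and column names are hypothetical.
    """
    db.execute("BEGIN IMMEDIATE")
    try:
        # The "authoritative clear" of the prior run for this tool.
        db.execute("DELETE FROM verdicts WHERE tool = ?", (tool,))
        db.executemany(
            "INSERT INTO verdicts (tool, edge, resolved) VALUES (?, ?, ?)",
            [(tool, edge, int(ok)) for edge, ok in verdicts],
        )
        healthy = sum(1 for _, ok in verdicts if ok) >= min_resolved
        db.execute("COMMIT" if healthy else "ROLLBACK")
        return healthy
    except Exception:
        if db.in_transaction:
            db.execute("ROLLBACK")
        raise


# isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE verdicts (tool TEXT, edge TEXT, resolved INTEGER)")

first = run_provisional(db, "rust-analyzer", [("a", True), ("b", True)], min_resolved=2)
second = run_provisional(db, "rust-analyzer", [("a", False), ("b", False)], min_resolved=2)
kept = db.execute("SELECT edge, resolved FROM verdicts ORDER BY edge").fetchall()
```

The second, gate-failing pass leaves the first pass's rows in place, which is the property a delete-after-commit rollback could not provide once the clear had already been committed.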
Replaces the per-language oracle-rust.yml / oracle-kernel.yml demos with one declarative, tier-driven runner over tools/oracle-corpora.toml (#164, C3).
- tools/oracle-run.sh: run the oracle for ONE corpus end to end: read its profile, shallow-clone the repo at its pinned rev, run its prepare steps, index it with rag-rat, then `rag-rat oracle report` (which runs the oracle + applies the per-corpus health gate). Exits non-zero on a health violation while still writing the report JSON, so a Δ glue script can consume it.
- tools/oracle-corpus.py: stdlib (tomllib) reader the bash runner shells out to for the pre-index fields (repo/rev/prepare/bindings) + tier corpus listing.
- tools/oracle-report-bmf.py: report JSON -> Bencher BMF glue for the heavy tier (rag-rat emits JSON only; presentation/Bencher shaping is a glue concern).
- .github/workflows/oracle.yml: small tier on PRs + main (GitHub-hosted matrix, per-corpus tool install, report artifact, health gate as the PR gate); heavy tier on release/dispatch (self-hosted bigmem, serial, pushed to Bencher).
- Deletes the superseded oracle-rust.yml/oracle-kernel.yml workflows and their rust-scip-oracle.sh/kernel-c-oracle.sh scripts; updates docs/benchmarks.md.

Verified end to end locally on the small tier: rust-semver (rust-analyzer, 1056 edges resolved 412->936) and c-cjson (cmake compdb + scip-clang, 3941 edges resolved 2742->3408) both run clean through the runner and pass the gate.
202cbf0 to e9226e3
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Part of the multi-language SCIP-oracle runner epic (#164). Replaces the two per-language demo workflows with one declarative, tier-driven runner over `tools/oracle-corpora.toml`: the C3 piece that ties the corpus profiles (#171), the report command (#175), and the scip-python backend (#176) into a CI-runnable whole.

What
- `tools/oracle-run.sh`: run the oracle for one corpus end to end: read its profile → shallow-clone the repo at its pinned rev → run its `prepare` steps → render the checkout's `rag-rat.toml` from its `bindings` → `rag-rat index --full` → `rag-rat oracle report`. The health gate lives in `oracle report`; the script propagates its exit code (non-zero on a violation) and always leaves the report JSON behind so a Δ glue script can consume even a failing run.
- `tools/oracle-corpus.py`: stdlib (`tomllib`) reader the bash runner shells out to for the pre-index fields (repo/rev/prepare/bindings) and `--list-tier` for the CI matrix. `oracle report` reads tool/bindings/health from the same file itself; the helper is just bash's TOML eyes.
- `tools/oracle-report-bmf.py`: report JSON → Bencher BMF glue for the heavy tier (rag-rat emits JSON only; Bencher/markdown shaping is a glue concern).
- `.github/workflows/oracle.yml`: tier-driven:
  - `--list-tier small`, installs each corpus's SCIP tool by its `tool` field, runs the runner, uploads the report artifact. The health gate is the PR gate.
  - `max-parallel: 1`), bench image, pushed to Bencher as the headline resolution series.
- `oracle-rust.yml`/`oracle-kernel.yml` workflows and their `rust-scip-oracle.sh`/`kernel-c-oracle.sh` scripts (full migration, no shims); updates `docs/benchmarks.md`.

Verification
End-to-end locally on the small tier (with the #175 `oracle report` binary):
Both clone → prepare → index → report → gate clean; the `corpus_profile_hash` matched the golden. The health gate was separately confirmed to fail non-zero on a violated threshold (in #175). `bash -n`, `py_compile`, and YAML parse all pass.

Stacking
`oracle.yml`'s small tier needs the `oracle report` command (#175) and, for `py-requests`, the scip-python backend (#176). The script + helpers are independent of those at the file level (disjoint paths), but the workflow's `py-requests` leg goes green only once #176 lands and `rust-*`/`c-*` legs only once #175 lands. Merge order: #175 + #176, then this.
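The runner behaviour described above for `tools/oracle-run.sh` (propagate the report command's exit code, but always leave the report JSON behind) can be illustrated in Python. This is a sketch only: `run_and_keep_report` is hypothetical, and the stand-in command merely imitates a report step that prints JSON and then fails its gate.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_keep_report(argv, report_path):
    """Run a report command, always save its stdout, and return its exit code."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, text=True)
    # Written before the exit code is inspected, so a failing gate still
    # leaves a report for downstream glue to consume.
    Path(report_path).write_text(result.stdout)
    return result.returncode


# Stand-in for a report step: prints a JSON document, then exits non-zero.
fake_report = [
    sys.executable,
    "-c",
    "import json, sys; print(json.dumps({'healthy': False})); sys.exit(3)",
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "report.json"
    code = run_and_keep_report(fake_report, out)
    saved = json.loads(out.read_text())
```

The caller can then exit with `code`, so CI still fails on a health violation while the saved report remains available as an artifact.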