CNTRLPLANE-3339: Add promptfoo eval framework for SME agents by enxebre · Pull Request #8419 · openshift/hypershift

enxebre · 2026-05-05T11:26:31Z

Summary

Supersedes #8382

Adds a promptfoo-based eval framework for testing SME agent definitions and AGENTS.md conventions. For now, promptfoo is preferred over a custom Go harness for its wider feature coverage, including skills evaluation, red team security scanning, a built-in web UI, and JUnit XML output for CI.

6 test scenarios: api-sme (patch-based with linter), cloud-provider-sme, control-plane-sme, data-plane-sme, hcp-architect-sme, and a conventions test for Go test style
Git worktree isolation for patch-based tests so evals don't modify the working copy
make eval-agents target with EVAL_FILTER, EVAL_OUTPUT, EVAL_REPEAT, and EVAL_PASS_RATE_THRESHOLD support
Update api-sme agent with mandatory linter instruction
Add field grouping rule and best practices references to api/AGENTS.md

Test plan

make eval-agents EVAL_FILTER=api-sme passes
make eval-agents runs all 6 scenarios
make eval-agents EVAL_OUTPUT=results.xml produces JUnit XML
Patch-based tests use worktree isolation and don't modify working copy
make eval-agents EVAL_REPEAT=3 EVAL_PASS_RATE_THRESHOLD=80 runs 3 trials with 80% threshold
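
As a rough illustration of what the last item implies, the sketch below computes a pass rate over repeated trials and compares it with a percentage threshold. It assumes the pass rate is simply passed runs divided by total runs; `passRate` and `meetsPassRate` are hypothetical helpers, not part of promptfoo or the Makefile.

```javascript
// Hypothetical helpers: not part of promptfoo or this PR's Makefile.
// Assumption: pass rate = passed runs / total runs, expressed as a percentage.
function passRate(passed, total) {
  const valid =
    Number.isInteger(passed) && Number.isInteger(total) &&
    total > 0 && passed >= 0 && passed <= total;
  if (!valid) {
    throw new RangeError(`invalid counts: passed=${passed}, total=${total}`);
  }
  return (passed / total) * 100;
}

function meetsPassRate(passed, total, thresholdPercent) {
  return passRate(passed, total) >= thresholdPercent;
}

// One scenario repeated 3 times with an 80% threshold:
console.log(meetsPassRate(3, 3, 80)); // true  (100%)
console.log(meetsPassRate(2, 3, 80)); // false (about 66.7%)
// Six scenarios repeated 3 times each (18 runs) leave room for some flakiness:
console.log(meetsPassRate(15, 18, 80)); // true (about 83.3%)
```

Under this assumption, a single scenario run with EVAL_REPEAT=3 has to pass all three trials to clear an 80% threshold, while a full run of all scenarios can absorb a few flaky results.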

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Introduced agent evaluation framework with new make eval-agents command for testing agent behavior against predefined scenarios with configurable filtering and output options.
Documentation
- Updated agent development guidelines with API type change best practices and mandatory pre-review steps.
- Added comprehensive evaluation workflow documentation including setup, execution, and result viewing instructions.

- Add mandatory linter instruction to api-sme agent
- Add field grouping rule to api/AGENTS.md
- Add best practices section pointing to etcdbackup and karpenter types

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

openshift-merge-bot · 2026-05-05T11:26:34Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-05T11:26:35Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-05-05T11:26:35Z

@enxebre: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

Add a promptfoo-based eval framework for testing SME agent definitions and AGENTS.md conventions

6 test scenarios: api-sme (patch-based with linter), cloud-provider-sme, control-plane-sme, data-plane-sme, hcp-architect-sme, and a conventions test for Go test style

Git worktree isolation for patch-based tests so evals don't modify the working copy

make eval-agents target with EVAL_FILTER and EVAL_OUTPUT support

Update api-sme agent with mandatory linter instruction

Add field grouping rule and best practices references to api/AGENTS.md

Test plan

make eval-agents EVAL_FILTER=api-sme passes

make eval-agents runs all 6 scenarios

make eval-agents EVAL_OUTPUT=results.xml produces JUnit XML

Patch-based tests use worktree isolation and don't modify working copy

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-05T11:27:06Z

📝 Walkthrough

Walkthrough

Adds a promptfoo-based agent evaluation framework and a Makefile eval-agents target. New files: test/eval/promptfooconfig.yaml, test/eval/README.md, test/eval/hooks.js, and test/eval/run-agent.sh to run agents via the Claude CLI, create per-test git worktrees from patches, and evaluate outputs with llm-rubric. Makefile variables EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION and the .PHONY eval-agents target were added. Documentation updates: api-sme now mandates including make api-lint-fix output and following ../api/AGENTS.md; api/AGENTS.md adds API type-change guidelines and requires listed checks to pass before PRs.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Make as "make eval-agents"
    participant PromptFoo as "promptfoo (npx)"
    participant Hooks as "hooks.js"
    participant Git as "git"
    participant AgentScript as "run-agent.sh"
    participant Claude as "Claude CLI"
    participant Rubric as "llm-rubric"

    User->>Make: run eval-agents (EVAL_* envs)
    Make->>PromptFoo: npx promptfoo eval (test/eval)
    PromptFoo->>Hooks: beforeEach(context)
    Hooks->>Git: create worktree & apply patch
    Git-->>Hooks: worktreePath
    Hooks-->>PromptFoo: context with worktreePath
    PromptFoo->>AgentScript: exec with prompt + context
    AgentScript->>Claude: exec claude --agent --model --allowed-tools
    Claude-->>AgentScript: LLM response
    AgentScript-->>PromptFoo: agent output
    PromptFoo->>Rubric: evaluate assertions (llm-rubric)
    Rubric-->>PromptFoo: results
    PromptFoo->>Hooks: afterEach(context)
    Hooks->>Git: remove worktree
    Git-->>Hooks: removed
    PromptFoo-->>User: report results / optional output file
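
The per-test ordering in the diagram can be sketched as a small synchronous simulation. This is only an illustration: the real flow is driven by promptfoo, hooks.js, and run-agent.sh, and every function name below is a hypothetical stand-in for one of the diagram's participants. Running cleanup in a `finally` block is a choice made for the sketch, not a statement about how promptfoo schedules hooks.

```javascript
// Simulation only: names are hypothetical stand-ins for the diagram's participants.
function runOneTest(test, steps) {
  const context = { test: { vars: { ...test.vars } } };
  steps.beforeEach(context); // Hooks: create worktree and apply patch
  try {
    // run-agent.sh -> Claude CLI
    const output = steps.runAgent(test.prompt, context.test.vars);
    // llm-rubric assertions
    return steps.grade(output, test.assert);
  } finally {
    steps.afterEach(context); // Hooks: remove worktree, even if the agent step throws
  }
}

const calls = [];
const steps = {
  beforeEach: (ctx) => {
    calls.push('beforeEach');
    ctx.test.vars.worktreePath = '/tmp/example-worktree';
  },
  runAgent: (prompt, vars) => {
    calls.push('runAgent');
    return `reviewed ${vars.worktreePath}`;
  },
  grade: (output) => {
    calls.push('grade');
    return { pass: output.includes('/tmp/example-worktree') };
  },
  afterEach: () => {
    calls.push('afterEach');
  },
};

const result = runOneTest({ prompt: 'Review this API change', vars: {}, assert: [] }, steps);
console.log(calls.join(' -> '), result.pass);
```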

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and clearly describes the primary change: adding a promptfoo evaluation framework for SME agents, which aligns with the main objective of implementing a new evaluation system for agent testing.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR adds promptfoo eval framework, not Ginkgo tests. No Ginkgo tests are added or modified. Promptfoo test descriptions are static and deterministic.
Test Structure And Quality	✅ Passed	Custom check requires reviewing Ginkgo test code. This PR adds a promptfoo evaluation framework with JavaScript, YAML, Bash, and Markdown files—no Ginkgo tests present. Check is not applicable.
Microshift Test Compatibility	✅ Passed	This PR does not add Ginkgo e2e tests. All files are documentation, configuration, or helper scripts. The MicroShift compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No Ginkgo e2e tests added. PR adds promptfoo evaluation framework (JavaScript, Bash, YAML) and documentation. SNO check only applies to Ginkgo tests.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds only test/evaluation infrastructure. No deployment manifests, operators, controllers, or scheduling constraints are introduced. Check not applicable.
Ote Binary Stdout Contract	✅ Passed	Check not applicable: PR modifies no OTE binaries or Go test suite code. Changes are documentation, Makefile, and a promptfoo evaluation framework (JavaScript/Bash/YAML).
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR adds an evaluation framework for AI agents using promptfoo, not Ginkgo e2e tests. No Go test files with Ginkgo patterns are present.


openshift-ci · 2026-05-05T11:27:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]
~~api/OWNERS~~ [enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 624-627: The Makefile invocation uses npx promptfoo@latest which
makes test/eval non-deterministic; replace the `@latest` usage by pinning a
known-good version (e.g., promptfoo@<version>) in the Makefile invocation or
alternatively add a package.json in test/eval that lists the pinned promptfoo
version and run npx without `@latest`; update the Makefile line that calls "npx
promptfoo@latest eval" (and any related variables like EVAL_FILTER/EVAL_OUTPUT)
to reference the fixed version or rely on the local package.json so runs are
reproducible.

In `@test/eval/hooks.js`:
- Around line 15-22: The catch block can leave an orphaned worktree if
execSync(`git worktree add ...`) succeeds but execSync(`git apply ...`) fails;
modify the try block to set a flag (e.g., worktreeAdded = true) immediately
after calling git worktree add and before git apply, and in the catch block when
worktreeAdded is true run a cleanup call (execSync(`git worktree remove --force
"${worktreeDir}"`, { cwd: repoRoot })) to remove the created worktree, then log
the error and ensure context.test.vars.worktreePath is not set when cleanup
occurs; update references to worktreeDir, repoRoot,
context.test.vars.worktreePath, and the try/catch around execSync calls
accordingly.
- Around line 13-14: Replace the millisecond-based worktree name generation that
uses Date.now() in test/eval/hooks.js (the worktreeName variable used to build
worktreeDir) with a collision-resistant suffix (e.g., crypto.randomUUID() or a
short random hex from crypto.randomBytes()) so parallel eval workers won’t
produce identical paths; update the construction of worktreeName to include the
UUID/random suffix and ensure any cleanup or lookup logic that uses
worktreeName/worktreeDir continues to use the new value.
- Around line 16-17: Replace the shell-interpolated execSync calls with
execFileSync using argv arrays to avoid shell parsing: change the calls that use
execSync(`git worktree add "${worktreeDir}" HEAD`, { cwd: repoRoot, stdio:
'pipe' }) and execSync(`git apply "${fullPath}"`, { cwd: worktreeDir, stdio:
'pipe' }) to execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], {
cwd: repoRoot, stdio: 'pipe' }) and execFileSync('git', ['apply', fullPath], {
cwd: worktreeDir, stdio: 'pipe' }) respectively (and apply the same pattern to
the other occurrence around line 32); keep the cwd and stdio options intact.
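
Two of the points above (collision-resistant names and argv-based process execution) can be demonstrated in isolation. The sketch below is an assumption-laden illustration rather than the PR's hooks.js: `makeWorktreeName` and `echoViaArgv` are hypothetical helpers, and a child Node process stands in for git so the example runs without a repository.

```javascript
const { execFileSync } = require('child_process');
const crypto = require('crypto');

// Collision-resistant name: the timestamp helps when debugging leftovers,
// the UUID keeps parallel workers from generating the same path.
function makeWorktreeName(prefix = 'eval') {
  return `${prefix}-${Date.now()}-${crypto.randomUUID()}`;
}

// Passes `arg` to a child process as a single argv entry, with no shell in
// between, and returns what the child received. Node itself stands in for git.
function echoViaArgv(arg) {
  const script = 'process.stdout.write(process.argv[1])';
  return execFileSync(process.execPath, ['-e', script, arg], { encoding: 'utf8' });
}

// A path a shell would split, expand, or treat as a command substitution.
const tricky = 'dir with spaces/$(echo injected); "quoted"';
console.log(echoViaArgv(tricky) === tricky);            // true: the child sees the exact string
console.log(makeWorktreeName() !== makeWorktreeName()); // true
```

Because `execFileSync` hands each array element to the child as-is, no quoting of `worktreeDir` or `fullPath` is needed, which is the property the review comment is after.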

In `@test/eval/README.md`:
- Around line 30-31: Update the README line describing "Patch-based tests" to
accurately reflect that patches are applied to an isolated temporary worktree
rather than the main working copy: change the wording around the
`beforeEach`/`afterEach` hooks to say they create a temporary worktree, apply
the patch there, run the test, and then remove/clean up the temp worktree (or
revert the temp worktree) instead of claiming the working copy is patched and
reverted.
- Around line 8-10: The README's prerequisites list is missing python3 which is
required because run-agent.sh invokes python3 to parse context; update
test/eval/README.md prerequisites to include "python3" (or "python3 (required
for run-agent.sh)") so users install/verify Python 3 before running the tests or
run-agent.sh.

In `@test/eval/run-agent.sh`:
- Line 33: Replace the use of echo when piping the PROMPT to the Claude process
to avoid backslash-escape interpretation and shell portability issues: use
printf '%s' to emit the exact bytes of the PROMPT and pipe that into the exec
claude "${ARGS[@]}" invocation (keep the PROMPT variable and ARGS array usage
unchanged).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 38c6bf0f-8388-4e9a-8662-4a8de94adbe7

📥 Commits

Reviewing files that changed from the base of the PR and between 6b39d47 and a30598b.

⛔ Files ignored due to path filters (1)

test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**

📒 Files selected for processing (7)

.claude/agents/api-sme.md
Makefile
api/AGENTS.md
test/eval/README.md
test/eval/hooks.js
test/eval/promptfooconfig.yaml
test/eval/run-agent.sh

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/eval/hooks.js`:
- Around line 29-30: afterEach is reading the wrong scope for the test-scoped
worktree; change the lookup to use context.test.vars.worktreePath (matching
beforeEach) so cleanup runs correctly: in the afterEach handler replace any use
of context.vars with context.test.vars and ensure worktreeDir is assigned from
context.test.vars.worktreePath before attempting removal (refer to the afterEach
function and the worktreeDir variable).
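
The scope mismatch is easy to reproduce with a mock context. The object shape below only mirrors what the comment describes (`context.test.vars` versus `context.vars`); it is not promptfoo's actual context object, and the function names are hypothetical.

```javascript
// Mock context shaped like the one described in the comment: { test: { vars: {} } }.
function beforeEach(context) {
  context.test.vars.worktreePath = '/tmp/hypershift-eval/example';
  return context;
}

// Buggy lookup: reads a top-level `vars` that beforeEach never wrote to.
function worktreeToCleanBuggy(context) {
  return context.vars && context.vars.worktreePath;
}

// Fixed lookup: reads the same test-scoped location that beforeEach wrote to.
function worktreeToCleanFixed(context) {
  return context.test && context.test.vars && context.test.vars.worktreePath;
}

const context = beforeEach({ test: { vars: {} } });
console.log(worktreeToCleanBuggy(context)); // undefined, so cleanup would be skipped
console.log(worktreeToCleanFixed(context)); // /tmp/hypershift-eval/example
```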


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bbfe7311-87d8-44df-a6df-74c3358c9bf6

📥 Commits

Reviewing files that changed from the base of the PR and between a30598b and f104388.

⛔ Files ignored due to path filters (1)

test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**

📒 Files selected for processing (5)

Makefile
test/eval/README.md
test/eval/hooks.js
test/eval/promptfooconfig.yaml
test/eval/run-agent.sh

✅ Files skipped from review due to trivial changes (2)

test/eval/README.md
test/eval/run-agent.sh

🚧 Files skipped from review as they are similar to previous changes (1)

test/eval/promptfooconfig.yaml

codecov · 2026-05-05T11:39:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.57%. Comparing base (5eaee74) to head (ecdf221).
⚠️ Report is 69 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8419      +/-   ##
==========================================
+ Coverage   36.42%   36.57%   +0.15%     
==========================================
  Files         765      770       +5     
  Lines       93302    93483     +181     
==========================================
+ Hits        33981    34195     +214     
- Misses      56606    56647      +41     
+ Partials     2715     2641      -74

see 54 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`30.41% <ø> (+0.03%)`	⬆️
cpo-hostedcontrolplane	`36.50% <ø> (-0.59%)`	⬇️
cpo-other	`37.73% <ø> (+2.03%)`	⬆️
hypershift-operator	`47.85% <ø> (-0.03%)`	⬇️
other	`27.77% <ø> (+0.01%)`	⬆️


openshift-ci-robot · 2026-05-05T12:12:45Z

@enxebre: This pull request references CNTRLPLANE-3339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


- Add promptfoo config with test scenarios for 5 SME agents and conventions
- Add git worktree isolation for patch-based tests
- Add eval-agents Makefile target
- Add testdata with patch for api-sme scenario

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/eval/README.md (1)

41-47: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document all supported eval controls in the env-var table.

The table is missing EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION, which are now part of the Makefile interface.

Suggested doc update

 | Env Var | Default | Description |
 |---------|---------|-------------|
 | `EVAL_MODEL` | `claude-opus-4-6` | Model for agent invocation |
 | `EVAL_FILTER` | (all) | Filter tests by description pattern |
 | `EVAL_OUTPUT` | (none) | Output file (.json, .xml, .html) |
+| `EVAL_REPEAT` | `1` | Number of repeated eval runs |
+| `EVAL_PASS_RATE_THRESHOLD` | `100` | Minimum required pass rate (%) |
+| `PROMPTFOO_VERSION` | `0.121.9` | promptfoo version used by `make eval-agents` |
 | `ANTHROPIC_VERTEX_PROJECT_ID` | - | GCP project for Vertex AI auth |
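
To make the table's semantics concrete, here is a minimal sketch of how these variables could be resolved with their defaults. The default values are copied from the suggested table above; `resolveEvalConfig` is a hypothetical helper, since the PR implements this logic in the Makefile rather than in JavaScript.

```javascript
// Defaults mirror the suggested README table; the resolver is hypothetical.
const DEFAULTS = {
  EVAL_MODEL: 'claude-opus-4-6',
  EVAL_REPEAT: '1',
  EVAL_PASS_RATE_THRESHOLD: '100',
  PROMPTFOO_VERSION: '0.121.9',
};

function resolveEvalConfig(env) {
  // Unset and empty variables both fall back to the default.
  const get = (name) =>
    env[name] !== undefined && env[name] !== '' ? env[name] : DEFAULTS[name];

  const repeat = Number(get('EVAL_REPEAT'));
  if (!Number.isInteger(repeat) || repeat < 1) {
    throw new RangeError(`EVAL_REPEAT must be a positive integer, got "${get('EVAL_REPEAT')}"`);
  }
  const threshold = Number(get('EVAL_PASS_RATE_THRESHOLD'));
  if (!Number.isFinite(threshold) || threshold < 0 || threshold > 100) {
    throw new RangeError(
      `EVAL_PASS_RATE_THRESHOLD must be between 0 and 100, got "${get('EVAL_PASS_RATE_THRESHOLD')}"`
    );
  }

  return {
    model: get('EVAL_MODEL'),
    filter: env.EVAL_FILTER || null, // null means "run all tests"
    output: env.EVAL_OUTPUT || null, // null means "no output file"
    repeat,
    passRateThreshold: threshold,
    // Pinned package spec for npx, as opposed to promptfoo@latest.
    promptfooPackage: `promptfoo@${get('PROMPTFOO_VERSION')}`,
  };
}

console.log(resolveEvalConfig({ EVAL_FILTER: 'api-sme', EVAL_REPEAT: '3', EVAL_PASS_RATE_THRESHOLD: '80' }));
```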

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/eval/README.md` around lines 41 - 47, The env-var table in
test/eval/README.md is missing three supported controls; add entries for
EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION with short defaults
and descriptions so the Makefile interface is fully documented. Specifically,
add rows for `EVAL_REPEAT` (default like `1`, description: number of times each
test runs), `EVAL_PASS_RATE_THRESHOLD` (default like `0.8` or `80%`,
description: minimum pass rate to consider suite successful), and
`PROMPTFOO_VERSION` (default like `latest`, description: version of the prompt
tooling to use), matching the existing table format and style alongside the
other env vars such as `EVAL_MODEL`, `EVAL_FILTER`, and `EVAL_OUTPUT`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/eval/hooks.js`:
- Around line 13-30: The hook currently logs failures and continues, which lets
tests run against an unpatched repo; change it to fail fast by throwing after
cleanup so the test run stops: inside the try/catch around git worktree/apply
(symbols: execFileSync, worktreeCreated, worktreeDir,
context.test.vars.worktreePath) rethrow the caught error (or throw a new Error
with the original error message) after attempting the worktree removal and
logging, and also handle the case where the patch file is missing (the
fs.existsSync(fullPath) branch) by throwing an error instead of silently
continuing. Ensure any created worktree is removed before rethrowing.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6b9277bb-2374-426f-9837-b3bf7146d8f8

📥 Commits

Reviewing files that changed from the base of the PR and between 96e350a and ecdf221.

⛔ Files ignored due to path filters (1)

test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**

📒 Files selected for processing (5)

Makefile
test/eval/README.md
test/eval/hooks.js
test/eval/promptfooconfig.yaml
test/eval/run-agent.sh

🚧 Files skipped from review as they are similar to previous changes (2)

test/eval/run-agent.sh
test/eval/promptfooconfig.yaml

coderabbitai · 2026-05-05T12:21:40Z

+      if (fs.existsSync(fullPath)) {
+        const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
+        const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
+        let worktreeCreated = false;
+        try {
+          execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
+          worktreeCreated = true;
+          execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
+          console.log(`Created worktree and applied patch: ${worktreeDir}`);
+          context.test.vars.worktreePath = worktreeDir;
+        } catch (e) {
+          if (worktreeCreated) {
+            try {
+              execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
+            } catch (_) {}
+          }
+          console.error(`Failed to create worktree or apply patch: ${e.message}`);
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when patch setup fails instead of silently continuing.

On Line 13 and Lines 23-30, missing patch/setup only logs and allows execution to continue, which can run evals against the unpatched repo and produce false positives.

Suggested fix

     if (patchPath) {
       const fullPath = path.resolve(__dirname, patchPath);
-      if (fs.existsSync(fullPath)) {
-        const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
-        const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
-        let worktreeCreated = false;
-        try {
-          execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
-          worktreeCreated = true;
-          execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
-          console.log(`Created worktree and applied patch: ${worktreeDir}`);
-          context.test.vars.worktreePath = worktreeDir;
-        } catch (e) {
-          if (worktreeCreated) {
-            try {
-              execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
-            } catch (_) {}
-          }
-          console.error(`Failed to create worktree or apply patch: ${e.message}`);
-        }
-      }
+      if (!fs.existsSync(fullPath)) {
+        throw new Error(`Patch file not found: ${fullPath}`);
+      }
+      const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
+      const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
+      let worktreeCreated = false;
+      try {
+        execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
+        worktreeCreated = true;
+        execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
+        console.log(`Created worktree and applied patch: ${worktreeDir}`);
+        context.test.vars.worktreePath = worktreeDir;
+      } catch (e) {
+        if (worktreeCreated) {
+          try {
+            execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
+          } catch (_) {}
+        }
+        delete context.test.vars.worktreePath;
+        throw e;
+      }
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/eval/hooks.js` around lines 13 - 30, The hook currently logs failures and continues, which lets tests run against an unpatched repo; change it to fail fast by throwing after cleanup so the test run stops: inside the try/catch around git worktree/apply (symbols: execFileSync, worktreeCreated, worktreeDir, context.test.vars.worktreePath) rethrow the caught error (or throw a new Error with the original error message) after attempting the worktree removal and logging, and also handle the case where the patch file is missing (the fs.existsSync(fullPath) branch) by throwing an error instead of silently continuing. Ensure any created worktree is removed before rethrowing.
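
The fail-fast pattern requested here (clean up, then rethrow) can be shown without git. In the sketch below a temporary directory stands in for the worktree, and `prepareWorkspace` is a hypothetical helper rather than code from hooks.js.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Runs `setup(dir)` inside a fresh temp directory. If setup throws, the
// directory is removed and the original error is rethrown, so the caller
// stops instead of continuing against an unprepared workspace.
function prepareWorkspace(setup) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'eval-sketch-'));
  try {
    setup(dir);
    return dir;
  } catch (err) {
    fs.rmSync(dir, { recursive: true, force: true });
    throw err;
  }
}

let failedDir;
try {
  prepareWorkspace((dir) => {
    failedDir = dir;
    throw new Error('patch does not apply'); // simulated `git apply` failure
  });
} catch (err) {
  console.log(err.message);              // patch does not apply
  console.log(fs.existsSync(failedDir)); // false: nothing is left behind
}
```

Rethrowing the original error, rather than only logging it, is what prevents an eval from running against an unpatched tree and reporting a misleading pass.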

theobarberbany · 2026-05-05T14:43:17Z

+      - type: llm-rubric
+        value: "The output states that version-dependent behavior should be decided in the CPO based on the hosted cluster release version, not in the HO"
+      - type: llm-rubric
+        value: "The output explains that HO and CPO can run different versions and the HO must not assume which CPO version is running"


One thing I've been struggling with on the openshift/api evals, and that you may run into here too, is false positives.

Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?

I can pretty consistently get it to catch the issues I wanted it to, but I can't get it to stop inventing more that don't exist.

I think for SME experts it may matter less than for, e.g., an API review command, but if we have agents suffering from false positive issues I don't think people will trust them.

Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?

It has built-in support for this through its assertion library: factuality, llm-rubric, weights and thresholds, g-eval, custom assertion functions, composite derived metrics, and cost/latency checks, plus red team plugins for hallucination.

So we can tune expectations over time as we learn what the agents reliably catch vs what's flaky.

Some refs:
https://www.promptfoo.dev/docs/configuration/expected-outputs/#assertion-types
https://www.promptfoo.dev/docs/configuration/expected-outputs/#model-assisted-eval-metrics
https://www.promptfoo.dev/docs/configuration/expected-outputs/#custom-assertion-scoring
https://www.promptfoo.dev/docs/configuration/expected-outputs/#creating-derived-metrics
https://www.promptfoo.dev/docs/guides/llm-as-a-judge/#evaluation-approaches
https://github.com/promptfoo/promptfoo/blob/main/examples/eval-rag/promptfooconfig.yaml
https://www.promptfoo.dev/docs/red-team/troubleshooting/false-positives/
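
As one possible shape for the "custom assertion functions" option, the sketch below scores an output on both missed and invented findings. The `{ pass, score, reason }` return value follows the grading-result shape described in the promptfoo docs linked above; everything else is a hypothetical convention for illustration: the agent is assumed to emit one finding per line as `FINDING: <id>`, and the test is assumed to list the genuinely present ids in `vars.expectedFindings`.

```javascript
// Hypothetical grading function: penalizes both missed and invented findings.
function gradeFindings(output, context) {
  const expected = new Set(context.vars.expectedFindings || []);
  const reported = new Set(
    String(output)
      .split('\n')
      .map((line) => line.match(/^FINDING:\s*(\S+)/))
      .filter(Boolean)
      .map((match) => match[1])
  );

  const missed = [...expected].filter((id) => !reported.has(id));
  const invented = [...reported].filter((id) => !expected.has(id));

  // Precision drops when the agent invents findings; recall drops when it misses real ones.
  const precision = reported.size === 0 ? 1 : (reported.size - invented.length) / reported.size;
  const recall = expected.size === 0 ? 1 : (expected.size - missed.length) / expected.size;

  return {
    pass: missed.length === 0 && invented.length === 0,
    score: Math.min(precision, recall),
    reason: `missed: [${missed.join(', ')}], invented: [${invented.join(', ')}]`,
  };
}

const sample = 'FINDING: missing-godoc\nFINDING: wrong-json-tag\nLooks fine otherwise.';
const graded = gradeFindings(sample, { vars: { expectedFindings: ['missing-godoc'] } });
console.log(graded.pass, graded.score, graded.reason);
```

A structured check like this only works if the agent's output format can be constrained; for free-form answers, a model-graded assertion such as llm-rubric or factuality would be the closer fit.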

Ah awesome! I'll have a dig :)

What I've been manually building is pretty much the same as llm-rubric, so I'd be curious to see if this yields better results.

Although the main issue I've been hitting is that a one-shot Claude Code command might not be expressive enough for, e.g., API review, which sucks.

openshift-ci · 2026-05-11T16:15:02Z

@enxebre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/security	`ecdf221`	link	true	`/test security`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci · 2026-05-11T16:15:13Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-06-11T01:30:13Z

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

hypershift-jira-solve-ci · 2026-06-11T03:30:32Z

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-security
Build ID: 2053866789401530368
Target: security
PR: #8419 — CNTRLPLANE-3339: Add promptfoo eval framework for SME agents
Author: enxebre
Branch: evals-promptfoo

Test Failure Analysis

Error

Auto-merging api/AGENTS.md
CONFLICT (content): Merge conflict in api/AGENTS.md
Automatic merge failed; fix conflicts and then commit the result.
# Error: exit status 1

Summary

Both job failures (ci/prow/security with state failure and tide with state error) are caused by a git merge conflict in the file api/AGENTS.md. The Prow CI clone step attempted to merge the PR branch (ecdf22170a) onto the base branch main (8f279f9ce) and failed because both branches modified overlapping regions of api/AGENTS.md. The security job never reached the actual test step — it failed during the initial source checkout. The tide error is a direct consequence: tide cannot merge a PR that has conflicts with the target branch.

Root Cause

The root cause is a content conflict in api/AGENTS.md between PR #8419 and PR #8478, which was merged to main after PR #8419 was last pushed.

Timeline of events:

2026-05-05T12:16:04Z: PR #8419 (evals-promptfoo) was last pushed. It modifies api/AGENTS.md by reorganizing sections: removing the "Key make targets" block from its original location, restructuring the "API Dependencies" and "Serialization" sections, and adding new "Best Practices and Patterns" and "Field Grouping" sections.
2026-05-11T17:31:47Z: PR #8478 (update-agents-on-webhook, "NO-JIRA: Document CEL over webhooks policy for AI agents") was merged to main by openshift-merge-bot. This PR also modified api/AGENTS.md in overlapping regions to document the CEL over webhooks policy for AI agents.
2026-05-11T15:56:16Z: The security job for PR #8419 was triggered. During the clone step, Prow attempted a git merge --no-ff of the PR commit onto main base SHA 8f279f9ce. The merge failed with a content conflict in api/AGENTS.md because both PRs modified the same regions of the file.

The Prow CI job exited at the clone/merge step with exit status 1 — no CI container (ci-operator) was ever started, and no test code was executed. The tide error is simply the merge bot recognizing that the PR's mergeable_state is dirty (GitHub reports "mergeable": false), so automatic merge is impossible.

This is not a product bug or test issue — it is a branch hygiene problem requiring a rebase.

Recommendations

Rebase PR #8419 onto current main: the PR author (enxebre) needs to rebase the evals-promptfoo branch onto the latest main and resolve the conflict in api/AGENTS.md. The conflict is between this PR's reorganization of the document and PR #8478's addition of the CEL-over-webhooks policy.
Force-push the rebased branch: after resolving the conflict locally, push the updated branch. This will automatically retrigger Prow CI jobs, including the security check.
No action needed on tide: the tide error will self-resolve once the merge conflict is cleared and required checks pass. Tide will automatically attempt to merge once the PR is in a mergeable state.
Consider coordinating documentation changes: both this PR and PR #8478 modify api/AGENTS.md, a shared AI agent conventions file. Future changes to this high-contention file should be coordinated or rebased promptly to avoid stale branches.
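The rebase and force-push recommendations above can be sketched as an ordered list of git commands. This is only a dry-run planner under stated assumptions: the function `rebase_plan` is hypothetical, and the remote names `upstream` (for openshift/hypershift) and `origin` (for the author's fork) are guesses about the local setup, not something the PR states.

```python
def rebase_plan(
    branch: str,
    base: str = "main",
    base_remote: str = "upstream",   # assumed name of the remote tracking the base repo
    fork_remote: str = "origin",     # assumed name of the remote hosting the PR branch
    conflicted: tuple[str, ...] = (),
) -> list[str]:
    """Return the git commands for rebasing `branch` onto the latest base branch."""
    steps = [
        f"git fetch {base_remote} {base}",
        f"git checkout {branch}",
        f"git rebase {base_remote}/{base}",
    ]
    # Each conflicted file is edited by hand first, then staged.
    for path in conflicted:
        steps.append(f"git add {path}")
    if conflicted:
        steps.append("git rebase --continue")
    # --force-with-lease refuses to overwrite remote commits that were not fetched.
    steps.append(f"git push --force-with-lease {fork_remote} {branch}")
    return steps


for command in rebase_plan("evals-promptfoo", conflicted=("api/AGENTS.md",)):
    print(command)
```

For a branch with several commits the rebase may stop more than once, so the stage-and-continue steps may need repeating until the rebase completes.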

Evidence

Evidence	Detail
Conflicting file	`api/AGENTS.md`
PR #8419 last push	2026-05-05T12:16:04Z (6 days before the CI run)
Conflicting PR #8478	NO-JIRA: Document CEL over webhooks policy for AI agents, merged 2026-05-11T17:31:47Z
Build log error	`CONFLICT (content): Merge conflict in api/AGENTS.md` / `Automatic merge failed; fix conflicts and then commit the result.` / `# Error: exit status 1`
GitHub mergeable state	`"mergeable": false`, `"mergeable_state": "dirty"`, `"rebaseable": false`
Failure stage	Clone/checkout phase — ci-operator never started
Tide error cause	PR cannot be merged due to conflict — tide reports `state: ERROR`
Base SHA at CI time	`8f279f9ce6f4c20a4de05d706fb9322262989dca`
PR commit SHA	`ecdf22170a65685545a3d778dceeb5a3ccce0bd9`
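The GitHub mergeable state row above can be checked programmatically once the pull request JSON is in hand. The sketch below is a minimal example assuming the payload has already been fetched; the helper name `needs_rebase` is hypothetical, and only the three fields quoted in the table are used.

```python
import json


def needs_rebase(pr: dict) -> bool:
    """Return True when the PR payload reports a conflict with its base branch."""
    # "mergeable" may be null while GitHub is still computing it,
    # so only an explicit False is treated as a conflict here.
    return pr.get("mergeable") is False and pr.get("mergeable_state") == "dirty"


# The field values quoted in the Evidence table.
payload = json.loads(
    '{"mergeable": false, "mergeable_state": "dirty", "rebaseable": false}'
)

print(needs_rebase(payload))  # prints True
```

Treating a missing or null `mergeable` value as "unknown" rather than "conflicting" avoids flagging PRs whose state has not been computed yet.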

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on May 5, 2026

openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 5, 2026

openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 5, 2026

enxebre force-pushed the evals-promptfoo branch from a30598b to f104388 on May 5, 2026 11:29

coderabbitai Bot reviewed May 5, 2026

Comment thread Makefile Outdated

Comment thread test/eval/hooks.js Outdated

Comment thread test/eval/hooks.js

Comment thread test/eval/hooks.js Outdated

Comment thread test/eval/README.md

Comment thread test/eval/README.md Outdated

Comment thread test/eval/run-agent.sh Outdated

coderabbitai Bot reviewed May 5, 2026

Comment thread test/eval/hooks.js Outdated

enxebre force-pushed the evals-promptfoo branch from f104388 to 96e350a on May 5, 2026 11:47

enxebre changed the title ~~NO-JIRA: Add promptfoo eval framework for SME agents~~ CNTRLPLANE-3339: Add promptfoo eval framework for SME agents May 5, 2026

enxebre mentioned this pull request May 5, 2026

CNTRLPLANE-3339: add eval-agents job for openshift/hypershift openshift/release#78630

Open

2 tasks

enxebre force-pushed the evals-promptfoo branch from 96e350a to ecdf221 on May 5, 2026 12:16

coderabbitai Bot reviewed May 5, 2026

enxebre mentioned this pull request May 5, 2026

CNTRLPLANE-3339: add agent and convention eval framework #8382

Closed

4 tasks

enxebre removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on May 5, 2026

theobarberbany reviewed May 5, 2026

enxebre mentioned this pull request May 11, 2026

NO-JIRA: docs: update api-sme agent and api/AGENTS.md conventions #8477

Merged

1 task

openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 11, 2026

openshift-ci Bot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Jun 11, 2026
