CNTRLPLANE-3339: Add promptfoo eval framework for SME agents #8419

Draft
enxebre wants to merge 2 commits into openshift:main from enxebre:evals-promptfoo

Conversation

@enxebre enxebre commented May 5, 2026
Member

Summary

Supersedes #8382

Adds a promptfoo-based eval framework for testing SME agent definitions and AGENTS.md conventions. promptfoo was chosen over a custom Go harness for now because of its wider feature coverage, including skills evaluation, red team security scanning, a built-in web UI, and JUnit XML output for CI.

  • 6 test scenarios: api-sme (patch-based with linter), cloud-provider-sme, control-plane-sme, data-plane-sme, hcp-architect-sme, and a conventions test for Go test style
  • Git worktree isolation for patch-based tests so evals don't modify the working copy
  • make eval-agents target with EVAL_FILTER, EVAL_OUTPUT, EVAL_REPEAT, and EVAL_PASS_RATE_THRESHOLD support
  • Update api-sme agent with mandatory linter instruction
  • Add field grouping rule and best practices references to api/AGENTS.md

Test plan

  • make eval-agents EVAL_FILTER=api-sme passes
  • make eval-agents runs all 6 scenarios
  • make eval-agents EVAL_OUTPUT=results.xml produces JUnit XML
  • Patch-based tests use worktree isolation and don't modify working copy
  • make eval-agents EVAL_REPEAT=3 EVAL_PASS_RATE_THRESHOLD=80 runs 3 trials with 80% threshold
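
As a rough illustration of what a repeat count combined with a pass-rate threshold means, the sketch below computes the pass rate over repeated trials and compares it with a percentage threshold. This is only an assumption about the semantics: the function name `meetsPassRate` and the trial data are hypothetical, and the actual Makefile logic in this PR may differ.

```javascript
// Hypothetical helper: not taken from the PR's Makefile or from promptfoo.
// trials: array of booleans, one per repeated run of a scenario.
// thresholdPercent: minimum pass rate (0-100) required for success.
function meetsPassRate(trials, thresholdPercent) {
  if (trials.length === 0) {
    throw new Error('at least one trial is required');
  }
  const passed = trials.filter(Boolean).length;
  const ratePercent = (passed / trials.length) * 100;
  return { ratePercent, ok: ratePercent >= thresholdPercent };
}

// Three trials with one failure: 2 of 3 passed, which is below an 80% threshold.
const example = meetsPassRate([true, true, false], 80);
console.log(example.ok);
```

With three trials, an 80% threshold effectively requires all three to pass, since two out of three is roughly 67%.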

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Introduced agent evaluation framework with new make eval-agents command for testing agent behavior against predefined scenarios with configurable filtering and output options.
  • Documentation

    • Updated agent development guidelines with API type change best practices and mandatory pre-review steps.
    • Added comprehensive evaluation workflow documentation including setup, execution, and result viewing instructions.

- Add mandatory linter instruction to api-sme agent
- Add field grouping rule to api/AGENTS.md
- Add best practices section pointing to etcdbackup and karpenter types

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
@openshift-ci

openshift-ci Bot commented May 5, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot


@enxebre: This pull request explicitly references no jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels May 5, 2026
@coderabbitai

coderabbitai Bot commented May 5, 2026

Contributor

Walkthrough

Adds a promptfoo-based agent evaluation framework and a Makefile eval-agents target. New files: test/eval/promptfooconfig.yaml, test/eval/README.md, test/eval/hooks.js, and test/eval/run-agent.sh to run agents via the Claude CLI, create per-test git worktrees from patches, and evaluate outputs with llm-rubric. Makefile variables EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION and the .PHONY eval-agents target were added. Documentation updates: api-sme now mandates including make api-lint-fix output and following ../api/AGENTS.md; api/AGENTS.md adds API type-change guidelines and requires listed checks to pass before PRs.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Make as "make eval-agents"
    participant PromptFoo as "promptfoo (npx)"
    participant Hooks as "hooks.js"
    participant Git as "git"
    participant AgentScript as "run-agent.sh"
    participant Claude as "Claude CLI"
    participant Rubric as "llm-rubric"

    User->>Make: run eval-agents (EVAL_* envs)
    Make->>PromptFoo: npx promptfoo eval (test/eval)
    PromptFoo->>Hooks: beforeEach(context)
    Hooks->>Git: create worktree & apply patch
    Git-->>Hooks: worktreePath
    Hooks-->>PromptFoo: context with worktreePath
    PromptFoo->>AgentScript: exec with prompt + context
    AgentScript->>Claude: exec claude --agent --model --allowed-tools
    Claude-->>AgentScript: LLM response
    AgentScript-->>PromptFoo: agent output
    PromptFoo->>Rubric: evaluate assertions (llm-rubric)
    Rubric-->>PromptFoo: results
    PromptFoo->>Hooks: afterEach(context)
    Hooks->>Git: remove worktree
    Git-->>Hooks: removed
    PromptFoo-->>User: report results / optional output file
🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the primary change: adding a promptfoo evaluation framework for SME agents, which aligns with the main objective of implementing a new evaluation system for agent testing. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR adds promptfoo eval framework, not Ginkgo tests. No Ginkgo tests are added or modified. Promptfoo test descriptions are static and deterministic. |
| Test Structure And Quality | ✅ Passed | Custom check requires reviewing Ginkgo test code. This PR adds a promptfoo evaluation framework with JavaScript, YAML, Bash, and Markdown files; no Ginkgo tests are present. Check is not applicable. |
| Microshift Test Compatibility | ✅ Passed | This PR does not add Ginkgo e2e tests. All files are documentation, configuration, or helper scripts. The MicroShift compatibility check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR adds promptfoo evaluation framework (JavaScript, Bash, YAML) and documentation. SNO check only applies to Ginkgo tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds only test/evaluation infrastructure. No deployment manifests, operators, controllers, or scheduling constraints are introduced. Check not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | Check not applicable: PR modifies no OTE binaries or Go test suite code. Changes are documentation, Makefile, and a promptfoo evaluation framework (JavaScript/Bash/YAML). |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds an evaluation framework for AI agents using promptfoo, not Ginkgo e2e tests. No Go test files with Ginkgo patterns are present. |


@openshift-ci

openshift-ci Bot commented May 5, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2026
@enxebre enxebre force-pushed the evals-promptfoo branch from a30598b to f104388 Compare May 5, 2026 11:29

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 624-627: The Makefile invocation uses npx promptfoo@latest which
makes test/eval non-deterministic; replace the `@latest` usage by pinning a
known-good version (e.g., promptfoo@<version>) in the Makefile invocation or
alternatively add a package.json in test/eval that lists the pinned promptfoo
version and run npx without `@latest`; update the Makefile line that calls "npx
promptfoo@latest eval" (and any related variables like EVAL_FILTER/EVAL_OUTPUT)
to reference the fixed version or rely on the local package.json so runs are
reproducible.

In `@test/eval/hooks.js`:
- Around line 15-22: The catch block can leave an orphaned worktree if
execSync(`git worktree add ...`) succeeds but execSync(`git apply ...`) fails;
modify the try block to set a flag (e.g., worktreeAdded = true) immediately
after calling git worktree add and before git apply, and in the catch block when
worktreeAdded is true run a cleanup call (execSync(`git worktree remove --force
"${worktreeDir}"`, { cwd: repoRoot })) to remove the created worktree, then log
the error and ensure context.test.vars.worktreePath is not set when cleanup
occurs; update references to worktreeDir, repoRoot,
context.test.vars.worktreePath, and the try/catch around execSync calls
accordingly.
- Around line 13-14: Replace the millisecond-based worktree name generation that
uses Date.now() in test/eval/hooks.js (the worktreeName variable used to build
worktreeDir) with a collision-resistant suffix (e.g., crypto.randomUUID() or a
short random hex from crypto.randomBytes()) so parallel eval workers won’t
produce identical paths; update the construction of worktreeName to include the
UUID/random suffix and ensure any cleanup or lookup logic that uses
worktreeName/worktreeDir continues to use the new value.
- Around line 16-17: Replace the shell-interpolated execSync calls with
execFileSync using argv arrays to avoid shell parsing: change the calls that use
execSync(`git worktree add "${worktreeDir}" HEAD`, { cwd: repoRoot, stdio:
'pipe' }) and execSync(`git apply "${fullPath}"`, { cwd: worktreeDir, stdio:
'pipe' }) to execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], {
cwd: repoRoot, stdio: 'pipe' }) and execFileSync('git', ['apply', fullPath], {
cwd: worktreeDir, stdio: 'pipe' }) respectively (and apply the same pattern to
the other occurrence around line 32); keep the cwd and stdio options intact.

In `@test/eval/README.md`:
- Around line 30-31: Update the README line describing "Patch-based tests" to
accurately reflect that patches are applied to an isolated temporary worktree
rather than the main working copy: change the wording around the
`beforeEach`/`afterEach` hooks to say they create a temporary worktree, apply
the patch there, run the test, and then remove/clean up the temp worktree (or
revert the temp worktree) instead of claiming the working copy is patched and
reverted.
- Around line 8-10: The README's prerequisites list is missing python3 which is
required because run-agent.sh invokes python3 to parse context; update
test/eval/README.md prerequisites to include "python3" (or "python3 (required
for run-agent.sh)") so users install/verify Python 3 before running the tests or
run-agent.sh.

In `@test/eval/run-agent.sh`:
- Line 33: Replace the use of echo when piping the PROMPT to the Claude process
to avoid backslash-escape interpretation and shell portability issues: use
printf '%s' to emit the exact bytes of the PROMPT and pipe that into the exec
claude "${ARGS[@]}" invocation (keep the PROMPT variable and ARGS array usage
unchanged).
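
Two of the hooks.js findings above (collision-resistant worktree names and argv-style process invocation) can be sketched together. This is a minimal sketch, not the PR's code: `makeWorktreeDir` and `echoArg` are hypothetical helpers, and the child process is Node itself rather than git so that the snippet runs without a repository.

```javascript
const crypto = require('node:crypto');
const os = require('node:os');
const path = require('node:path');
const { execFileSync } = require('node:child_process');

// Hypothetical helper: builds a unique directory path under the OS temp dir.
// crypto.randomUUID() avoids the collisions that Date.now() alone can produce
// when parallel workers start within the same millisecond.
function makeWorktreeDir(baseName = 'hypershift-eval') {
  return path.join(os.tmpdir(), baseName, `eval-${crypto.randomUUID()}`);
}

// Hypothetical helper: passes `arg` to a child process as a single argv entry.
// No shell is involved, so quotes, spaces, and $(...) are not interpreted.
function echoArg(arg) {
  return execFileSync(
    process.execPath,
    ['-e', 'process.stdout.write(process.argv[1])', arg],
    { encoding: 'utf8', stdio: 'pipe' }
  );
}

console.log(makeWorktreeDir() !== makeWorktreeDir());
console.log(echoArg('dir with "quotes" and $(echo injected)'));
```

The second call echoes its argument back unchanged, which is the property that makes `execFileSync` with an argument array safer than interpolating paths into a shell command string.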
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 38c6bf0f-8388-4e9a-8662-4a8de94adbe7

📥 Commits

Reviewing files that changed from the base of the PR and between 6b39d47 and a30598b.

⛔ Files ignored due to path filters (1)
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
📒 Files selected for processing (7)
  • .claude/agents/api-sme.md
  • Makefile
  • api/AGENTS.md
  • test/eval/README.md
  • test/eval/hooks.js
  • test/eval/promptfooconfig.yaml
  • test/eval/run-agent.sh

Comment thread Makefile Outdated
Comment thread test/eval/hooks.js Outdated
Comment thread test/eval/hooks.js
Comment thread test/eval/hooks.js Outdated
Comment thread test/eval/README.md
Comment thread test/eval/README.md Outdated
Comment thread test/eval/run-agent.sh Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/eval/hooks.js`:
- Around line 29-30: afterEach is reading the wrong scope for the test-scoped
worktree; change the lookup to use context.test.vars.worktreePath (matching
beforeEach) so cleanup runs correctly: in the afterEach handler replace any use
of context.vars with context.test.vars and ensure worktreeDir is assigned from
context.test.vars.worktreePath before attempting removal (refer to the afterEach
function and the worktreeDir variable).
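
The scoping issue can be illustrated with a plain object standing in for the hook context. This sketch assumes, as the comment above states, that `beforeEach` stores the path under `context.test.vars.worktreePath`; the `fakeContext` object and the `getWorktreePath` helper are hypothetical and do not reproduce promptfoo's actual context shape.

```javascript
// Hypothetical stand-in for the context object passed to the hooks.
const fakeContext = {
  vars: {}, // top-level scope: beforeEach stored nothing here
  test: { vars: { worktreePath: '/tmp/hypershift-eval/eval-123' } },
};

// Reads the worktree path from the test-scoped vars, where beforeEach wrote it.
// Optional chaining keeps the lookup safe when a test has no patch and no path.
function getWorktreePath(context) {
  return context?.test?.vars?.worktreePath;
}

console.log(fakeContext.vars.worktreePath); // undefined: wrong scope, cleanup would be skipped
console.log(getWorktreePath(fakeContext));  // the path stored by beforeEach
```

Reading from the wrong scope does not throw; it silently yields `undefined`, which is why the leaked worktrees would be easy to miss.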
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: bbfe7311-87d8-44df-a6df-74c3358c9bf6

📥 Commits

Reviewing files that changed from the base of the PR and between a30598b and f104388.

⛔ Files ignored due to path filters (1)
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • Makefile
  • test/eval/README.md
  • test/eval/hooks.js
  • test/eval/promptfooconfig.yaml
  • test/eval/run-agent.sh
✅ Files skipped from review due to trivial changes (2)
  • test/eval/README.md
  • test/eval/run-agent.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/eval/promptfooconfig.yaml

Comment thread test/eval/hooks.js Outdated
@codecov

codecov Bot commented May 5, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 36.57%. Comparing base (5eaee74) to head (ecdf221).
⚠️ Report is 69 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8419      +/-   ##
==========================================
+ Coverage   36.42%   36.57%   +0.15%     
==========================================
  Files         765      770       +5     
  Lines       93302    93483     +181     
==========================================
+ Hits        33981    34195     +214     
- Misses      56606    56647      +41     
+ Partials     2715     2641      -74     

see 54 files with indirect coverage changes

| Flag | Coverage Δ | |
|------|------------|---|
| cmd-support | 30.41% <ø> (+0.03%) | ⬆️ |
| cpo-hostedcontrolplane | 36.50% <ø> (-0.59%) | ⬇️ |
| cpo-other | 37.73% <ø> (+2.03%) | ⬆️ |
| hypershift-operator | 47.85% <ø> (-0.03%) | ⬇️ |
| other | 27.77% <ø> (+0.01%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


@enxebre enxebre force-pushed the evals-promptfoo branch from f104388 to 96e350a Compare May 5, 2026 11:47
@enxebre enxebre changed the title NO-JIRA: Add promptfoo eval framework for SME agents CNTRLPLANE-3339: Add promptfoo eval framework for SME agents May 5, 2026
@openshift-ci-robot

openshift-ci-robot commented May 5, 2026


@enxebre: This pull request references CNTRLPLANE-3339 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


- Add promptfoo config with test scenarios for 5 SME agents and conventions
- Add git worktree isolation for patch-based tests
- Add eval-agents Makefile target
- Add testdata with patch for api-sme scenario

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@enxebre enxebre force-pushed the evals-promptfoo branch from 96e350a to ecdf221 Compare May 5, 2026 12:16

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/eval/README.md (1)

41-47: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document all supported eval controls in the env-var table.

The table is missing EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION, which are now part of the Makefile interface.

Suggested doc update
 | Env Var | Default | Description |
 |---------|---------|-------------|
 | `EVAL_MODEL` | `claude-opus-4-6` | Model for agent invocation |
 | `EVAL_FILTER` | (all) | Filter tests by description pattern |
 | `EVAL_OUTPUT` | (none) | Output file (.json, .xml, .html) |
+| `EVAL_REPEAT` | `1` | Number of repeated eval runs |
+| `EVAL_PASS_RATE_THRESHOLD` | `100` | Minimum required pass rate (%) |
+| `PROMPTFOO_VERSION` | `0.121.9` | promptfoo version used by `make eval-agents` |
 | `ANTHROPIC_VERTEX_PROJECT_ID` | - | GCP project for Vertex AI auth |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/eval/README.md` around lines 41 - 47, The env-var table in
test/eval/README.md is missing three supported controls; add entries for
EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION with short defaults
and descriptions so the Makefile interface is fully documented. Specifically,
add rows for `EVAL_REPEAT` (default like `1`, description: number of times each
test runs), `EVAL_PASS_RATE_THRESHOLD` (default like `0.8` or `80%`,
description: minimum pass rate to consider suite successful), and
`PROMPTFOO_VERSION` (default like `latest`, description: version of the prompt
tooling to use), matching the existing table format and style alongside the
other env vars such as `EVAL_MODEL`, `EVAL_FILTER`, and `EVAL_OUTPUT`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/eval/hooks.js`:
- Around line 13-30: The hook currently logs failures and continues, which lets
tests run against an unpatched repo; change it to fail fast by throwing after
cleanup so the test run stops: inside the try/catch around git worktree/apply
(symbols: execFileSync, worktreeCreated, worktreeDir,
context.test.vars.worktreePath) rethrow the caught error (or throw a new Error
with the original error message) after attempting the worktree removal and
logging, and also handle the case where the patch file is missing (the
fs.existsSync(fullPath) branch) by throwing an error instead of silently
continuing. Ensure any created worktree is removed before rethrowing.

---

Outside diff comments:
In `@test/eval/README.md`:
- Around line 41-47: The env-var table in test/eval/README.md is missing three
supported controls; add entries for EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and
PROMPTFOO_VERSION with short defaults and descriptions so the Makefile interface
is fully documented. Specifically, add rows for `EVAL_REPEAT` (default like `1`,
description: number of times each test runs), `EVAL_PASS_RATE_THRESHOLD`
(default like `0.8` or `80%`, description: minimum pass rate to consider suite
successful), and `PROMPTFOO_VERSION` (default like `latest`, description:
version of the prompt tooling to use), matching the existing table format and
style alongside the other env vars such as `EVAL_MODEL`, `EVAL_FILTER`, and
`EVAL_OUTPUT`.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6b9277bb-2374-426f-9837-b3bf7146d8f8

📥 Commits

Reviewing files that changed from the base of the PR and between 96e350a and ecdf221.

⛔ Files ignored due to path filters (1)
  • test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diff is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • Makefile
  • test/eval/README.md
  • test/eval/hooks.js
  • test/eval/promptfooconfig.yaml
  • test/eval/run-agent.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/eval/run-agent.sh
  • test/eval/promptfooconfig.yaml

Comment thread test/eval/hooks.js
Comment on lines +13 to +30
if (fs.existsSync(fullPath)) {
  const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
  const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
  let worktreeCreated = false;
  try {
    execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
    worktreeCreated = true;
    execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
    console.log(`Created worktree and applied patch: ${worktreeDir}`);
    context.test.vars.worktreePath = worktreeDir;
  } catch (e) {
    if (worktreeCreated) {
      try {
        execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
      } catch (_) {}
    }
    console.error(`Failed to create worktree or apply patch: ${e.message}`);
  }

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when patch setup fails instead of silently continuing.

On Line 13 and Lines 23-30, missing patch/setup only logs and allows execution to continue, which can run evals against the unpatched repo and produce false positives.

Suggested fix
     if (patchPath) {
       const fullPath = path.resolve(__dirname, patchPath);
-      if (fs.existsSync(fullPath)) {
-        const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
-        const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
-        let worktreeCreated = false;
-        try {
-          execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
-          worktreeCreated = true;
-          execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
-          console.log(`Created worktree and applied patch: ${worktreeDir}`);
-          context.test.vars.worktreePath = worktreeDir;
-        } catch (e) {
-          if (worktreeCreated) {
-            try {
-              execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
-            } catch (_) {}
-          }
-          console.error(`Failed to create worktree or apply patch: ${e.message}`);
-        }
-      }
+      if (!fs.existsSync(fullPath)) {
+        throw new Error(`Patch file not found: ${fullPath}`);
+      }
+      const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
+      const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
+      let worktreeCreated = false;
+      try {
+        execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
+        worktreeCreated = true;
+        execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
+        console.log(`Created worktree and applied patch: ${worktreeDir}`);
+        context.test.vars.worktreePath = worktreeDir;
+      } catch (e) {
+        if (worktreeCreated) {
+          try {
+            execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
+          } catch (_) {}
+        }
+        delete context.test.vars.worktreePath;
+        throw e;
+      }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/eval/hooks.js` around lines 13 - 30, The hook currently logs failures
and continues, which lets tests run against an unpatched repo; change it to fail
fast by throwing after cleanup so the test run stops: inside the try/catch
around git worktree/apply (symbols: execFileSync, worktreeCreated, worktreeDir,
context.test.vars.worktreePath) rethrow the caught error (or throw a new Error
with the original error message) after attempting the worktree removal and
logging, and also handle the case where the patch file is missing (the
fs.existsSync(fullPath) branch) by throwing an error instead of silently
continuing. Ensure any created worktree is removed before rethrowing.

@enxebre enxebre removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2026
- type: llm-rubric
  value: "The output states that version-dependent behavior should be decided in the CPO based on the hosted cluster release version, not in the HO"
- type: llm-rubric
  value: "The output explains that HO and CPO can run different versions and the HO must not assume which CPO version is running"


One thing I've been struggling with on the openshift/api evals, and that you may run into here too, is false positives.

Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?

I can pretty consistently get it to catch the issues I wanted it to catch, but I can't stop it from inventing more that don't exist.

I think for SME agents it may matter less than for, e.g., an API review command, but if we have agents suffering from false positives I don't think people will trust them.

Member Author

Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?

promptfoo has built-in support for this through its assertion library: factuality, llm-rubric, weights and thresholds, g-eval, custom assertion functions, composite derived metrics, and cost/latency checks, plus red team plugins for hallucination.

So we can tune expectations over time as we learn what the agents reliably catch vs what's flaky.

Some refs:
https://www.promptfoo.dev/docs/configuration/expected-outputs/#assertion-types
https://www.promptfoo.dev/docs/configuration/expected-outputs/#model-assisted-eval-metrics
https://www.promptfoo.dev/docs/configuration/expected-outputs/#custom-assertion-scoring
https://www.promptfoo.dev/docs/configuration/expected-outputs/#creating-derived-metrics
https://www.promptfoo.dev/docs/guides/llm-as-a-judge/#evaluation-approaches
https://github.com/promptfoo/promptfoo/blob/main/examples/eval-rag/promptfooconfig.yaml
https://www.promptfoo.dev/docs/red-team/troubleshooting/false-positives/
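
As a rough illustration of the custom assertion route for false positives, here is a minimal sketch of a function that could back a `type: javascript` assertion. It assumes promptfoo's `(output, context)` signature and a `{ pass, score, reason }` return value. The `expectedFindings` and `knownNonIssues` vars are hypothetical names invented for this example, not part of this PR, and substring matching is a deliberately crude stand-in for a model-graded check.

```javascript
// Hypothetical custom assertion. Assumes the test case defines two
// illustrative vars (not from this PR):
//   expectedFindings: phrases the agent should mention
//   knownNonIssues:   phrases that would indicate an invented finding
function checkFindings(output, context) {
  const vars = (context && context.vars) || {};
  const text = String(output).toLowerCase();
  const mentions = (phrase) => text.includes(String(phrase).toLowerCase());

  const expected = vars.expectedFindings || [];
  const nonIssues = vars.knownNonIssues || [];

  const found = expected.filter(mentions);
  const invented = nonIssues.filter(mentions);

  // Score is recall over the expected findings, forced to 0 if the
  // output mentions anything listed as a known non-issue.
  const recall = expected.length === 0 ? 1 : found.length / expected.length;

  return {
    pass: invented.length === 0 && found.length === expected.length,
    score: invented.length > 0 ? 0 : recall,
    reason: `found ${found.length}/${expected.length} expected; invented: ${invented.join(', ') || 'none'}`,
  };
}

// Export for use from a config file when loaded as a CommonJS module.
if (typeof module !== 'undefined') {
  module.exports = checkFindings;
}
```

Listing known non-issues per scenario only catches the false positives someone anticipated, so this would complement an llm-rubric or factuality assertion rather than replace it.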

@theobarberbany theobarberbany May 8, 2026

Ah awesome! I'll have a dig :)

What I've been building manually is pretty much the same as llm-rubric, so I'm curious to see whether this yields better results.

That said, the main issue I've been hitting is that a one-shot Claude Code command might not be expressive enough for, e.g., API review, which sucks.

@openshift-ci

openshift-ci Bot commented May 11, 2026

Contributor

@enxebre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/security | ecdf221 | link | true | /test security |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026
@openshift-ci

openshift-ci Bot commented May 11, 2026

Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Contributor

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2026
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 11, 2026


I have all the evidence needed. Here is the complete analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-security
  • Build ID: 2053866789401530368
  • Target: security
  • PR: #8419 CNTRLPLANE-3339: Add promptfoo eval framework for SME agents
  • Author: enxebre
  • Branch: evals-promptfoo

Test Failure Analysis

Error

Auto-merging api/AGENTS.md
CONFLICT (content): Merge conflict in api/AGENTS.md
Automatic merge failed; fix conflicts and then commit the result.
# Error: exit status 1

Summary

Both job failures (ci/prow/security with state failure and tide with state error) are caused by a git merge conflict in the file api/AGENTS.md. The Prow CI clone step attempted to merge the PR branch (ecdf22170a) onto the base branch main (8f279f9ce) and failed because both branches modified overlapping regions of api/AGENTS.md. The security job never reached the actual test step — it failed during the initial source checkout. The tide error is a direct consequence: tide cannot merge a PR that has conflicts with the target branch.

Root Cause

The root cause is a content conflict in api/AGENTS.md between PR #8419 and PR #8478, which was merged to main after PR #8419 was last pushed.

Timeline of events:

  1. 2026-05-05T12:16:04Z: PR #8419 (evals-promptfoo) was last pushed. It modifies api/AGENTS.md by reorganizing sections: removing the "Key make targets" block from its original location, restructuring the "API Dependencies" and "Serialization" sections, and adding new "Best Practices and Patterns" and "Field Grouping" sections.

  2. 2026-05-11T17:31:47Z: PR #8478 (update-agents-on-webhook, "NO-JIRA: Document CEL over webhooks policy for AI agents") was merged to main by openshift-merge-bot. This PR also modified api/AGENTS.md in overlapping regions to document the CEL over webhooks policy for AI agents.

  3. 2026-05-11T15:56:16Z: The security job for PR #8419 was triggered. During the clone step, Prow attempted a git merge --no-ff of the PR commit onto main base SHA 8f279f9ce. The merge failed with a content conflict in api/AGENTS.md because both PRs modified the same regions of the file.

The Prow CI job exited at the clone/merge step with exit status 1 — no CI container (ci-operator) was ever started, and no test code was executed. The tide error is simply the merge bot recognizing that the PR's mergeable_state is dirty (GitHub reports "mergeable": false), so automatic merge is impossible.

This is not a product bug or test issue — it is a branch hygiene problem requiring a rebase.

Recommendations
  1. Rebase PR #8419 onto current main: The PR author (enxebre) needs to rebase the evals-promptfoo branch onto the latest main and resolve the conflict in api/AGENTS.md. The conflict is between this PR's reorganization of the document and PR #8478's addition of the CEL-over-webhooks policy.

  2. Force-push the rebased branch: After resolving the conflict locally, push the updated branch. This will automatically retrigger Prow CI jobs including the security check.

  3. No action needed on tide: The tide error will self-resolve once the merge conflict is cleared and required checks pass. Tide will automatically attempt to merge once the PR is in a mergeable state.

  4. Consider coordinating documentation changes: Both this PR and PR #8478 modify api/AGENTS.md, a shared AI agent conventions file. Future changes to this high-contention file should be coordinated or rebased promptly to avoid stale branches.

Evidence
| Evidence | Detail |
| --- | --- |
| Conflicting file | api/AGENTS.md |
| PR #8419 last push | 2026-05-05T12:16:04Z (6 days before the CI run) |
| Conflicting PR | #8478 NO-JIRA: Document CEL over webhooks policy for AI agents, merged 2026-05-11T17:31:47Z |
| Build log error | CONFLICT (content): Merge conflict in api/AGENTS.md / Automatic merge failed; fix conflicts and then commit the result. / # Error: exit status 1 |
| GitHub mergeable state | "mergeable": false, "mergeable_state": "dirty", "rebaseable": false |
| Failure stage | Clone/checkout phase; ci-operator never started |
| Tide error cause | PR cannot be merged due to conflict; tide reports state: ERROR |
| Base SHA at CI time | 8f279f9ce6f4c20a4de05d706fb9322262989dca |
| PR commit SHA | ecdf22170a65685545a3d778dceeb5a3ccce0bd9 |

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/ai: Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc.
area/api: Indicates the PR includes changes for the API
area/testing: Indicates the PR includes changes for e2e testing
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

3 participants