CNTRLPLANE-3339: Add promptfoo eval framework for SME agents#8419
CNTRLPLANE-3339: Add promptfoo eval framework for SME agents#8419enxebre wants to merge 2 commits into
Conversation
- Add mandatory linter instruction to api-sme agent - Add field grouping rule to api/AGENTS.md - Add best practices section pointing to etcdbackup and karpenter types Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@enxebre: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughAdds a promptfoo-based agent evaluation framework and a Makefile Sequence Diagram(s)sequenceDiagram
participant User
participant Make as "make eval-agents"
participant PromptFoo as "promptfoo (npx)"
participant Hooks as "hooks.js"
participant Git as "git"
participant AgentScript as "run-agent.sh"
participant Claude as "Claude CLI"
participant Rubric as "llm-rubric"
User->>Make: run eval-agents (EVAL_* envs)
Make->>PromptFoo: npx promptfoo eval (test/eval)
PromptFoo->>Hooks: beforeEach(context)
Hooks->>Git: create worktree & apply patch
Git-->>Hooks: worktreePath
Hooks-->>PromptFoo: context with worktreePath
PromptFoo->>AgentScript: exec with prompt + context
AgentScript->>Claude: exec claude --agent --model --allowed-tools
Claude-->>AgentScript: LLM response
AgentScript-->>PromptFoo: agent output
PromptFoo->>Rubric: evaluate assertions (llm-rubric)
Rubric-->>PromptFoo: results
PromptFoo->>Hooks: afterEach(context)
Hooks->>Git: remove worktree
Git-->>Hooks: removed
PromptFoo-->>User: report results / optional output file
🚥 Pre-merge checks | ✅ 12✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Makefile`:
- Around line 624-627: The Makefile invocation uses npx promptfoo@latest which
makes test/eval non-deterministic; replace the `@latest` usage by pinning a
known-good version (e.g., [email protected]) in the Makefile invocation or
alternatively add a package.json in test/eval that lists the pinned promptfoo
version and run npx without `@latest`; update the Makefile line that calls "npx
promptfoo@latest eval" (and any related variables like EVAL_FILTER/EVAL_OUTPUT)
to reference the fixed version or rely on the local package.json so runs are
reproducible.
In `@test/eval/hooks.js`:
- Around line 15-22: The catch block can leave an orphaned worktree if
execSync(`git worktree add ...`) succeeds but execSync(`git apply ...`) fails;
modify the try block to set a flag (e.g., worktreeAdded = true) immediately
after calling git worktree add and before git apply, and in the catch block when
worktreeAdded is true run a cleanup call (execSync(`git worktree remove --force
"${worktreeDir}"`, { cwd: repoRoot })) to remove the created worktree, then log
the error and ensure context.test.vars.worktreePath is not set when cleanup
occurs; update references to worktreeDir, repoRoot,
context.test.vars.worktreePath, and the try/catch around execSync calls
accordingly.
- Around line 13-14: Replace the millisecond-based worktree name generation that
uses Date.now() in test/eval/hooks.js (the worktreeName variable used to build
worktreeDir) with a collision-resistant suffix (e.g., crypto.randomUUID() or a
short random hex from crypto.randomBytes()) so parallel eval workers won’t
produce identical paths; update the construction of worktreeName to include the
UUID/random suffix and ensure any cleanup or lookup logic that uses
worktreeName/worktreeDir continues to use the new value.
- Around line 16-17: Replace the shell-interpolated execSync calls with
execFileSync using argv arrays to avoid shell parsing: change the calls that use
execSync(`git worktree add "${worktreeDir}" HEAD`, { cwd: repoRoot, stdio:
'pipe' }) and execSync(`git apply "${fullPath}"`, { cwd: worktreeDir, stdio:
'pipe' }) to execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], {
cwd: repoRoot, stdio: 'pipe' }) and execFileSync('git', ['apply', fullPath], {
cwd: worktreeDir, stdio: 'pipe' }) respectively (and apply the same pattern to
the other occurrence around line 32); keep the cwd and stdio options intact.
In `@test/eval/README.md`:
- Around line 30-31: Update the README line describing "Patch-based tests" to
accurately reflect that patches are applied to an isolated temporary worktree
rather than the main working copy: change the wording around the
`beforeEach`/`afterEach` hooks to say they create a temporary worktree, apply
the patch there, run the test, and then remove/clean up the temp worktree (or
revert the temp worktree) instead of claiming the working copy is patched and
reverted.
- Around line 8-10: The README's prerequisites list is missing python3 which is
required because run-agent.sh invokes python3 to parse context; update
test/eval/README.md prerequisites to include "python3" (or "python3 (required
for run-agent.sh)") so users install/verify Python 3 before running the tests or
run-agent.sh.
In `@test/eval/run-agent.sh`:
- Line 33: Replace the use of echo when piping the PROMPT to the Claude process
to avoid backslash-escape interpretation and shell portability issues: use
printf '%s' to emit the exact bytes of the PROMPT and pipe that into the exec
claude "${ARGS[@]}" invocation (keep the PROMPT variable and ARGS array usage
unchanged).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 38c6bf0f-8388-4e9a-8662-4a8de94adbe7
⛔ Files ignored due to path filters (1)
test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**
📒 Files selected for processing (7)
.claude/agents/api-sme.mdMakefileapi/AGENTS.mdtest/eval/README.mdtest/eval/hooks.jstest/eval/promptfooconfig.yamltest/eval/run-agent.sh
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/eval/hooks.js`:
- Around line 29-30: afterEach is reading the wrong scope for the test-scoped
worktree; change the lookup to use context.test.vars.worktreePath (matching
beforeEach) so cleanup runs correctly: in the afterEach handler replace any use
of context.vars with context.test.vars and ensure worktreeDir is assigned from
context.test.vars.worktreePath before attempting removal (refer to the afterEach
function and the worktreeDir variable).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: bbfe7311-87d8-44df-a6df-74c3358c9bf6
⛔ Files ignored due to path filters (1)
test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**
📒 Files selected for processing (5)
Makefiletest/eval/README.mdtest/eval/hooks.jstest/eval/promptfooconfig.yamltest/eval/run-agent.sh
✅ Files skipped from review due to trivial changes (2)
- test/eval/README.md
- test/eval/run-agent.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/eval/promptfooconfig.yaml
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8419 +/- ##
==========================================
+ Coverage 36.42% 36.57% +0.15%
==========================================
Files 765 770 +5
Lines 93302 93483 +181
==========================================
+ Hits 33981 34195 +214
- Misses 56606 56647 +41
+ Partials 2715 2641 -74 see 54 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@enxebre: This pull request references CNTRLPLANE-3339 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
- Add promptfoo config with test scenarios for 5 SME agents and conventions - Add git worktree isolation for patch-based tests - Add eval-agents Makefile target - Add testdata with patch for api-sme scenario Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/eval/README.md (1)
41-47:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDocument all supported eval controls in the env-var table.
The table is missing
EVAL_REPEAT,EVAL_PASS_RATE_THRESHOLD, andPROMPTFOO_VERSION, which are now part of the Makefile interface.Suggested doc update
| Env Var | Default | Description | |---------|---------|-------------| | `EVAL_MODEL` | `claude-opus-4-6` | Model for agent invocation | | `EVAL_FILTER` | (all) | Filter tests by description pattern | | `EVAL_OUTPUT` | (none) | Output file (.json, .xml, .html) | +| `EVAL_REPEAT` | `1` | Number of repeated eval runs | +| `EVAL_PASS_RATE_THRESHOLD` | `100` | Minimum required pass rate (%) | +| `PROMPTFOO_VERSION` | `0.121.9` | promptfoo version used by `make eval-agents` | | `ANTHROPIC_VERTEX_PROJECT_ID` | - | GCP project for Vertex AI auth |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/eval/README.md` around lines 41 - 47, The env-var table in test/eval/README.md is missing three supported controls; add entries for EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and PROMPTFOO_VERSION with short defaults and descriptions so the Makefile interface is fully documented. Specifically, add rows for `EVAL_REPEAT` (default like `1`, description: number of times each test runs), `EVAL_PASS_RATE_THRESHOLD` (default like `0.8` or `80%`, description: minimum pass rate to consider suite successful), and `PROMPTFOO_VERSION` (default like `latest`, description: version of the prompt tooling to use), matching the existing table format and style alongside the other env vars such as `EVAL_MODEL`, `EVAL_FILTER`, and `EVAL_OUTPUT`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/eval/hooks.js`:
- Around line 13-30: The hook currently logs failures and continues, which lets
tests run against an unpatched repo; change it to fail fast by throwing after
cleanup so the test run stops: inside the try/catch around git worktree/apply
(symbols: execFileSync, worktreeCreated, worktreeDir,
context.test.vars.worktreePath) rethrow the caught error (or throw a new Error
with the original error message) after attempting the worktree removal and
logging, and also handle the case where the patch file is missing (the
fs.existsSync(fullPath) branch) by throwing an error instead of silently
continuing. Ensure any created worktree is removed before rethrowing.
---
Outside diff comments:
In `@test/eval/README.md`:
- Around line 41-47: The env-var table in test/eval/README.md is missing three
supported controls; add entries for EVAL_REPEAT, EVAL_PASS_RATE_THRESHOLD, and
PROMPTFOO_VERSION with short defaults and descriptions so the Makefile interface
is fully documented. Specifically, add rows for `EVAL_REPEAT` (default like `1`,
description: number of times each test runs), `EVAL_PASS_RATE_THRESHOLD`
(default like `0.8` or `80%`, description: minimum pass rate to consider suite
successful), and `PROMPTFOO_VERSION` (default like `latest`, description:
version of the prompt tooling to use), matching the existing table format and
style alongside the other env vars such as `EVAL_MODEL`, `EVAL_FILTER`, and
`EVAL_OUTPUT`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6b9277bb-2374-426f-9837-b3bf7146d8f8
⛔ Files ignored due to path filters (1)
test/eval/testdata/sme-agents/api-sme/01-api-design-review/patch.diffis excluded by!**/testdata/**
📒 Files selected for processing (5)
Makefiletest/eval/README.mdtest/eval/hooks.jstest/eval/promptfooconfig.yamltest/eval/run-agent.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- test/eval/run-agent.sh
- test/eval/promptfooconfig.yaml
| if (fs.existsSync(fullPath)) { | ||
| const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`; | ||
| const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName); | ||
| let worktreeCreated = false; | ||
| try { | ||
| execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' }); | ||
| worktreeCreated = true; | ||
| execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' }); | ||
| console.log(`Created worktree and applied patch: ${worktreeDir}`); | ||
| context.test.vars.worktreePath = worktreeDir; | ||
| } catch (e) { | ||
| if (worktreeCreated) { | ||
| try { | ||
| execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' }); | ||
| } catch (_) {} | ||
| } | ||
| console.error(`Failed to create worktree or apply patch: ${e.message}`); | ||
| } |
There was a problem hiding this comment.
Fail fast when patch setup fails instead of silently continuing.
On Line 13 and Lines 23-30, missing patch/setup only logs and allows execution to continue, which can run evals against the unpatched repo and produce false positives.
Suggested fix
if (patchPath) {
const fullPath = path.resolve(__dirname, patchPath);
- if (fs.existsSync(fullPath)) {
- const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
- const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
- let worktreeCreated = false;
- try {
- execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
- worktreeCreated = true;
- execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
- console.log(`Created worktree and applied patch: ${worktreeDir}`);
- context.test.vars.worktreePath = worktreeDir;
- } catch (e) {
- if (worktreeCreated) {
- try {
- execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
- } catch (_) {}
- }
- console.error(`Failed to create worktree or apply patch: ${e.message}`);
- }
- }
+ if (!fs.existsSync(fullPath)) {
+ throw new Error(`Patch file not found: ${fullPath}`);
+ }
+ const worktreeName = `eval-${Date.now()}-${crypto.randomUUID()}`;
+ const worktreeDir = path.join(require('os').tmpdir(), 'hypershift-eval', worktreeName);
+ let worktreeCreated = false;
+ try {
+ execFileSync('git', ['worktree', 'add', worktreeDir, 'HEAD'], { cwd: repoRoot, stdio: 'pipe' });
+ worktreeCreated = true;
+ execFileSync('git', ['apply', fullPath], { cwd: worktreeDir, stdio: 'pipe' });
+ console.log(`Created worktree and applied patch: ${worktreeDir}`);
+ context.test.vars.worktreePath = worktreeDir;
+ } catch (e) {
+ if (worktreeCreated) {
+ try {
+ execFileSync('git', ['worktree', 'remove', worktreeDir, '--force'], { cwd: repoRoot, stdio: 'pipe' });
+ } catch (_) {}
+ }
+ delete context.test.vars.worktreePath;
+ throw e;
+ }
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/eval/hooks.js` around lines 13 - 30, The hook currently logs failures
and continues, which lets tests run against an unpatched repo; change it to fail
fast by throwing after cleanup so the test run stops: inside the try/catch
around git worktree/apply (symbols: execFileSync, worktreeCreated, worktreeDir,
context.test.vars.worktreePath) rethrow the caught error (or throw a new Error
with the original error message) after attempting the worktree removal and
logging, and also handle the case where the patch file is missing (the
fs.existsSync(fullPath) branch) by throwing an error instead of silently
continuing. Ensure any created worktree is removed before rethrowing.
| - type: llm-rubric | ||
| value: "The output states that version-dependent behavior should be decided in the CPO based on the hosted cluster release version, not in the HO" | ||
| - type: llm-rubric | ||
| value: "The output explains that HO and CPO can run different versions and the HO must not assume which CPO version is running" |
There was a problem hiding this comment.
One thing I've been struggling with on the openshift/api evals - that you may suffer with here is false positives.
Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?
I can pretty consistently get it to catch the issues i wanted it to, but not to invent more that don't exist.
I think for SME experts it may matter less than eg an api review command, but if we have agents suffering from false positive issues I don't think people will trust them.
There was a problem hiding this comment.
Do we cover if the SME/agent returns something sounding roughly plausible, but not true? How do we assert this in this framework?
It has builtin support for this through its assertion library: factuality, llm-rubric, weights and thresholds, g-eval, custom assertion functions, composite derived metrics, and cost/latency, Red team plugins for hallucination...
So we can tune expectations over time as we learn what the agents reliably catch vs what's flaky.
Some refs:
https://www.promptfoo.dev/docs/configuration/expected-outputs/#assertion-types
https://www.promptfoo.dev/docs/configuration/expected-outputs/#model-assisted-eval-metrics
https://www.promptfoo.dev/docs/configuration/expected-outputs/#custom-assertion-scoring
https://www.promptfoo.dev/docs/configuration/expected-outputs/#creating-derived-metrics
https://www.promptfoo.dev/docs/guides/llm-as-a-judge/#evaluation-approaches
https://github.com/promptfoo/promptfoo/blob/main/examples/eval-rag/promptfooconfig.yaml
https://www.promptfoo.dev/docs/red-team/troubleshooting/false-positives/
There was a problem hiding this comment.
Ah awesome! I'll have a dig :)
What i've been manually building is pretty much the same as llm-rubric, I'd be curious to try and see if this yields better results.
Although, the main issue I've been hitting is I think a one shot claude code command might not be expressive enough for e.g api review, which sucks.
|
@enxebre: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Stale PRs are closed after 21d of inactivity. If this PR is still relevant, comment to refresh it or remove the stale label. If this PR is safe to close now please do so with /lifecycle stale |
|
I have all the evidence needed. Here is the complete analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth job failures ( Root CauseThe root cause is a content conflict in Timeline of events:
The Prow CI job exited at the clone/merge step with This is not a product bug or test issue — it is a branch hygiene problem requiring a rebase. Recommendations
Evidence
|
Summary
Supersedes #8382
Adds a promptfoo-based eval framework for testing SME agent definitions and AGENTS.md conventions. Preferred promptfoo over a custom Go harness initially for its wider feature coverage including skills evaluation, red team security scanning, built-in web UI, and JUnit XML output for CI.
make eval-agentstarget withEVAL_FILTER,EVAL_OUTPUT,EVAL_REPEAT, andEVAL_PASS_RATE_THRESHOLDsupportTest plan
make eval-agents EVAL_FILTER=api-smepassesmake eval-agentsruns all 6 scenariosmake eval-agents EVAL_OUTPUT=results.xmlproduces JUnit XMLmake eval-agents EVAL_REPEAT=3 EVAL_PASS_RATE_THRESHOLD=80runs 3 trials with 80% threshold🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
make eval-agentscommand for testing agent behavior against predefined scenarios with configurable filtering and output options.Documentation