CNTRLPLANE-3633: docs: add presubmit e2e triage guide #8741

Open
bryan-cox wants to merge 1 commit into openshift:main from bryan-cox:CNTRLPLANE-3633

@bryan-cox bryan-cox commented Jun 16, 2026

Member

What this PR does / why we need it:

Adds a step-by-step triage runbook for PR presubmit e2e failures. Developers see a red X on their PR and often don't know where to start — this guide walks them through it from top to bottom.

Creates a new Triage subsection under the CI docs (docs/content/how-to/ci/triage/) with:

  • Presubmit Failures — A 6-step runbook: find the failing job → find the failed step → find the specific error → check Prow job history → determine if it's related to your change → take action
  • Daily CI Health — Moved from checking-ci.md into the triage section (content unchanged)
  • Mentions Claude Code skills (/e2e-analyze, ci:analyze-prow-job-test-failure, ci:analyze-prow-job-install-failure) for automated analysis
  • Cross-links to existing v2-testing/debugging.md for artifact deep-dives

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-3633

Special notes for your reviewer:

  • The checking-ci.md file is moved (not copied) to triage/daily-health.md — content is unchanged, only the path changed
  • The mkdocs nav is updated to reflect the new Triage subsection
  • Docs build verified locally with mkdocs build — no errors

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Documentation
    • Added a new “CI Triage” landing page with links to presubmit e2e failure troubleshooting and daily CI health monitoring.
    • Published a detailed presubmit triage guide with an interactive decision flow and step-by-step diagnostics for GitHub Actions and Prow (including create/run/destroy failure types), plus clear guidance on when to fix-and-push, rerun, or escalate.
    • Updated the CI how-to navigation to include a dedicated “Triage” section and streamline related CI troubleshooting entries.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 16, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 16, 2026

@bryan-cox: This pull request references CNTRLPLANE-3633 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

A new CI triage documentation section is added for HyperShift. The landing page (triage/index.md) provides entry points to presubmit failure troubleshooting and daily CI health monitoring. A comprehensive guide (triage/presubmit-failures.md) contains an interactive mermaid flowchart routing users by CI system (GitHub Actions, Prow non-e2e, Konflux, Prow e2e), automated analysis commands for Claude Code, per-system diagnostics with detailed procedures for Prow e2e pipeline stages (create-guests, run-tests, destroy-guests), Konflux failure patterns, job history analysis to distinguish widespread versus PR-specific failures, code-change tracing within the test codebase, escalation procedures with required Slack details, and references to deeper debugging resources. The docs/mkdocs.yml file is updated to create a new Triage subsection under How-to CI guides with the landing page and presubmit failures link, the mermaid2 plugin is configured with loose security level for diagram rendering, and the navigation tree is reorganized accordingly.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a presubmit e2e triage guide to the documentation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains only documentation files (markdown and mkdocs.yml), not test code. No Ginkgo tests are present, so the test naming check does not apply. |
| Test Structure And Quality | ✅ Passed | PR contains only documentation changes (Markdown and YAML files); no Ginkgo test code to review, making this check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR adds only documentation (markdown guides and mkdocs config). The custom check applies to "deployment manifests, operator code, or controllers," none of which are present in this documentati... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds only documentation files (markdown and config YAML), not Ginkgo e2e tests. The custom check for IPv6/disconnected network compatibility only applies to new test code, which this PR doe... |
| No-Weak-Crypto | ✅ Passed | PR contains only documentation (Markdown and YAML config), no cryptographic code or weak crypto algorithms are present. |
| Container-Privileges | ✅ Passed | This PR contains only documentation files (markdown) and MkDocs site configuration. No container/K8s manifests or security-critical configurations are present; the check is not applicable. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR adds CI triage documentation with no logging statements containing passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data. All content is instructional guidance... |

@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Jun 16, 2026
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8741 June 16, 2026 16:42 Inactive

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/mkdocs.yml`:
- Around line 88-91: The 'Triage' subsection within the CI section of the
mkdocs.yml navigation is not in alphabetical order. Move the entire 'Triage'
entry (including its nested 'Presubmit Failures' and 'Daily CI Health'
subsections) to its correct alphabetical position within the CI section, which
should be after 'Sync Community Fork' and before 'V2 E2E Testing'. This will
ensure the navigation structure passes the verify-docs-nav-order.py validation
check.
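
The ordering rule in this comment can be sketched as a small check. This is a minimal Python illustration, not the repository's verify-docs-nav-order.py: the function name and the case-insensitive comparison are assumptions, and the sample titles are the three named in the comment.

```python
from typing import List, Optional, Tuple


def first_out_of_order(titles: List[str]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that breaks case-insensitive
    alphabetical order, or None when the list is already sorted."""
    for previous, current in zip(titles, titles[1:]):
        if previous.casefold() > current.casefold():
            return previous, current
    return None


# Hypothetical slice of the CI nav, using the titles named in the comment.
misplaced = ["Triage", "Sync Community Fork", "V2 E2E Testing"]
fixed = ["Sync Community Fork", "Triage", "V2 E2E Testing"]

print(first_out_of_order(misplaced))  # the offending pair
print(first_out_of_order(fixed))      # None once 'Triage' is moved
```

The real validator may apply different comparison rules (for example, around nested sections), so treat this only as a picture of why 'Triage' belongs between the other two entries.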

📥 Commits

Reviewing files that changed from the base of the PR and between 392fd5a and d6bb8df.

📒 Files selected for processing (4)
  • docs/content/how-to/ci/triage/daily-health.md
  • docs/content/how-to/ci/triage/index.md
  • docs/content/how-to/ci/triage/presubmit-failures.md
  • docs/mkdocs.yml

Comment thread docs/mkdocs.yml Outdated
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8741 June 16, 2026 17:39 Inactive
@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 16, 2026

I now have the complete root cause. Here is my analysis:

Test Failure Analysis Complete

Job Information

  • Prow Job: verify / Verify (GitHub Actions reusable workflow verify-reusable.yaml)
  • Build ID: 27636363003 / Job 81724544641
  • PR: #8741 CNTRLPLANE-3633: docs: add presubmit e2e triage guide
  • Branch: CNTRLPLANE-3633
  • Failed Step: Step 7 — git update-index --refresh (dirty-tree check)

Test Failure Analysis

Error

docs/content/reference/aggregated-docs.md: needs update
Process completed with exit code 1.

Summary

The Verify job runs make generate update and then checks that the working tree is clean (no uncommitted changes). The PR added and renamed markdown files under docs/content/ but did not regenerate docs/content/reference/aggregated-docs.md. The make update target includes a docs-aggregate step that runs hack/tools/docs-aggregator/main.go, which walks all .md files under docs/content/, concatenates them, and writes the result to aggregated-docs.md. Because the PR's new/renamed doc files changed the set of inputs to the aggregator, the CI-regenerated aggregated-docs.md differs from what was committed, causing the dirty-tree check to fail.

Root Cause

The PR changes three documentation files under docs/content/ and the MkDocs configuration:

  1. Renamed: docs/content/how-to/ci/checking-ci.md → docs/content/how-to/ci/triage/daily-health.md
  2. Added: docs/content/how-to/ci/triage/index.md (new CI Triage index page)
  3. Added: docs/content/how-to/ci/triage/presubmit-failures.md (new presubmit failures guide)
  4. Modified: docs/mkdocs.yml (navigation update)

The repository has an auto-generated file docs/content/reference/aggregated-docs.md that is produced by the docs-aggregate Makefile target. This target runs hack/tools/docs-aggregator/main.go, which walks all .md files under docs/content/, sorts them, strips markdown links/images, and concatenates them into a single blob (intended for AI tools like NotebookLM).
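
To make the behaviour described above concrete, here is a rough Python sketch of an aggregator of this kind. It is not the actual hack/tools/docs-aggregator/main.go (which is written in Go). The per-file header, the choice to drop images entirely while keeping link text, and skipping the output file are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Assumed stripping rules: images vanish, links keep only their text.
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def strip_links(text: str) -> str:
    """Remove markdown images, then reduce links to their visible text."""
    return LINK.sub(r"\1", IMAGE.sub("", text))


def aggregate(root: Path, output_name: str = "aggregated-docs.md") -> str:
    """Concatenate every .md file under root in sorted path order."""
    files = sorted(root.rglob("*.md"), key=lambda p: p.relative_to(root).as_posix())
    parts = []
    for path in files:
        if path.name == output_name:  # assumption: never aggregate the output itself
            continue
        rel = path.relative_to(root).as_posix()
        body = strip_links(path.read_text(encoding="utf-8"))
        parts.append(f"# Source: {rel}\n\n{body}")
    return "\n\n".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "how-to").mkdir()
    (root / "a.md").write_text("![logo](logo.png)Intro\n", encoding="utf-8")
    (root / "how-to" / "b.md").write_text("See [the guide](guide.md).\n", encoding="utf-8")
    before = aggregate(root)

    # Adding a page changes the aggregator's input set, and so its output.
    (root / "how-to" / "c.md").write_text("New page\n", encoding="utf-8")
    after = aggregate(root)

print(before != after)
```

Because the output depends on the full set of input files, adding or renaming any page changes the aggregated result even when no existing page is edited.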

The make update target (called in CI) includes docs-aggregate as its final step:

update: api-deps workspace-sync deps api api-docs clients docs-aggregate

When CI ran make generate update, the aggregator picked up the new/renamed markdown files and regenerated aggregated-docs.md with different content than what was committed. The subsequent git update-index --refresh detected this uncommitted change and failed with exit code 1.

The fix is straightforward: the PR author needs to run make update (or make docs-aggregate) locally and commit the updated aggregated-docs.md.
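
The idea behind the failing check can be illustrated without git: regenerate the file and compare it with what is committed. The sketch below is a stand-in under that assumption. It does not invoke git update-index, and needs_update and the sample contents are hypothetical names and data.

```python
import tempfile
from pathlib import Path
from typing import Callable


def needs_update(committed: Path, regenerate: Callable[[], str]) -> bool:
    """Return True when the committed file no longer matches fresh output."""
    current = committed.read_text(encoding="utf-8") if committed.exists() else ""
    return current != regenerate()


with tempfile.TemporaryDirectory() as tmp:
    generated = Path(tmp) / "aggregated-docs.md"
    generated.write_text("page-a\n", encoding="utf-8")

    # A new page was added, but the generated file was not regenerated.
    stale = needs_update(generated, lambda: "page-a\npage-b\n")

    # After regenerating and saving the result, the check passes.
    generated.write_text("page-a\npage-b\n", encoding="utf-8")
    fresh = needs_update(generated, lambda: "page-a\npage-b\n")

print(stale, fresh)
```

In CI the comparison is done by git against the index rather than in memory, but the failure condition is the same: generated output that differs from the committed copy.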

Recommendations
  1. Run make docs-aggregate locally and commit the regenerated docs/content/reference/aggregated-docs.md:

    make docs-aggregate
    git add docs/content/reference/aggregated-docs.md
    git commit -m "regenerate aggregated-docs.md"

    Alternatively, make update runs the full update pipeline including docs-aggregate.

  2. Re-push to the PR branch to trigger a new CI run that should pass the dirty-tree check.

  3. For future docs PRs: Any PR that adds, removes, renames, or modifies .md files under docs/content/ must regenerate aggregated-docs.md by running make update or make docs-aggregate before pushing.

Evidence
| Evidence | Detail |
| --- | --- |
| Failed step | Step 7: git update-index --refresh (dirty working tree check) |
| Error message | docs/content/reference/aggregated-docs.md: needs update |
| Makefile target | update: api-deps workspace-sync deps api api-docs clients docs-aggregate |
| Aggregator tool | hack/tools/docs-aggregator/main.go: walks all `docs/content/**/*.md` and concatenates into aggregated-docs.md |
| PR changed files | Renamed checking-ci.md → triage/daily-health.md; added triage/index.md and triage/presubmit-failures.md |
| Preceding steps passed | make generate update, make staticcheck, make fmt, make vet all succeeded; only the post-generation dirty-tree check failed |
| Workflow file | .github/workflows/verify-reusable.yaml: runs make generate update, then checks for uncommitted changes |

@bryan-cox bryan-cox marked this pull request as ready for review June 16, 2026 19:24
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci openshift-ci Bot requested review from clebs and enxebre June 16, 2026 19:24
@bryan-cox bryan-cox force-pushed the CNTRLPLANE-3633 branch 4 times, most recently from 3124b35 to ea9bc65 June 16, 2026 19:40
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8741 June 16, 2026 19:43 Inactive

@mgencur mgencur left a comment

Contributor

Looks great! This is very useful. Thanks for putting this together.
I left a couple of comments.


If you use Claude Code, these skills can automate most of the investigation below:

- `/e2e-analyze <prow-job-url> <artifacts-dir>` — Downloads build logs and artifacts, analyzes the failure, and outputs a structured error/summary/evidence report. This is a repo-local skill available to anyone with the hypershift repo.

Contributor

The e2e-analyze command was the original one; we later created a more advanced variant, ci:analyze-prow-job-test-failure. At first the two were the same, but ci:analyze-prow-job-test-failure evolved further and gained support for HyperShift hosted clusters. I would probably use only that one and not e2e-analyze.
That said, we could probably remove e2e-analyze from the HyperShift repo and, in this new guide, mention the openshift-eng/ai-helpers repository, where the recommended commands live.

Member Author

Good call — removed the e2e-analyze mention and pointed to openshift-eng/ai-helpers as the recommended source for the CI plugin skills.


prow_e2e -->|"Cluster creation<br/>failed"| create["Check Artifacts for JUnit XML<br/>or search log for the error"]
prow_e2e -->|"Test failed"| tests["Find failed test name in<br/>JUnit XML or Ginkgo output"]
prow_e2e -->|"Teardown failed"| destroy["/retest — rarely your code"]

Contributor

Do you think this box should have "Same failure again" / "Passes" edges leading to the "Escalate" / "Done - was a flake" boxes (similar to the edges going from "/retest once")?

Member Author

Good idea — added "Same failure again" → Escalate and "Passes" → Done edges from the teardown/retest box.

@clebs

clebs commented Jun 17, 2026

Member

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@openshift-merge-bot

Contributor

Pipeline controller notification

No second-stage tests were triggered for this PR.

This can happen when:

  • The changed files don't match any pipeline_run_if_changed patterns
  • All files match pipeline_skip_if_only_changed patterns
  • No pipeline-controlled jobs are defined for the main branch

Use /test ? to see all available tests.

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8741 June 17, 2026 10:40 Inactive
Create a new Triage subsection under CI docs with a step-by-step
runbook for diagnosing PR presubmit e2e failures. Move the daily
CI health check page into the triage section.
@mgencur

mgencur commented Jun 17, 2026

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@clebs

clebs commented Jun 17, 2026

Member

/lgtm

@bryan-cox

Member Author

/verified by @bryan-cox

@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by @bryan-cox.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 17, 2026
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation Indicates the PR includes changes for documentation jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

4 participants