Skip to content

[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971

Open
hari-kuriakose wants to merge 11 commits into
mainfrom
feat/test-rig-foundation
Open

[MISC] Add tox-driven test rig foundation (unit + integration + e2e)#1971
hari-kuriakose wants to merge 11 commits into
mainfrom
feat/test-rig-foundation

Conversation

@hari-kuriakose
Copy link
Copy Markdown
Contributor

Summary

  • Adds a single tox-driven entry point that brings every existing test suite under one rig, with consistent reporting.
  • Introduces an e2e tier on top of the existing docker-compose stack; testcontainers + local subprocesses available for dev.
  • Surfaces critical-path gaps and regressions explicitly in CI reports — the team target is zero regressions.

This is the foundation only — no new tests beyond a single e2e smoke. Existing co-located unit tests stay where they are; the rig just discovers and runs them.

What's new

  • tests/groups.yaml — single source of truth for 15 groups (9 unit, 1 integration placeholder, 5 e2e) with a depends_on graph.
  • tests/critical_paths.yaml — 10 critical flows registered, each pointing at the group(s) that cover it.
  • tests/rig/ — the dispatcher (python -m tests.rig run|list-groups|expand|validate|platform|report).
  • tests/e2e/platform session fixture + one smoke test exercising the e2e plumbing end-to-end.
  • tox.ini — rewritten with tier-shaped envs (unit, integration, e2e, groups, rig); legacy runner|sdk1|prompt-service envs preserved as thin aliases.
  • CIci-test.yaml rewritten to call the rig; new ci-test-e2e.yaml slow lane (main + run-e2e-labeled PRs + nightly).
  • Pre-commit hookunstract-tests fires only when developers opt in by creating .test-selection.

Usage

tox -e unit                                    # all unit groups, parallel, with coverage
tox -e e2e -- e2e-smoke                        # smoke against docker compose
tox -e groups -- unit-backend e2e-smoke        # explicit groups (deps auto-pulled)
tox -e groups -- --from-file .test-selection   # newline-delimited group list
tox -e rig -- list-groups                      # discovery
python -m tests.rig validate                   # CI-friendly manifest check

Full docs: tests/README.md.

Reports

  • Per-group: reports/<group>/{junit.xml, report.md, exit.txt}
  • Aggregate: reports/{summary.md, summary.json, combined-test-report.md} — same combined-test-report.md path keeps the existing sticky-PR-comment workflow working.
  • Coverage (when --coverage): reports/{coverage.xml, htmlcov/}coverage.xml is consumable by the existing SonarCloud integration.

Critical paths

CI on main runs with --fail-on-critical-gap --update-baseline, so:

  • ⚠️ Gaps — currently 6 (pipeline-etl-execute, usage-token-tracking, workflow-execution-fan-out, callback-result-delivery, plus the 4 covered by e2e groups that haven't shipped tests yet). Visible in PR comments; not blocking until covered.
  • Regressions — paths that were covered on the cached main baseline but aren't now. Always blocking.

Test plan

  • python -m tests.rig validate succeeds (OK — 15 groups, 10 critical paths).
  • python -m tests.rig list-groups and expand <group> print correct dep-expanded ordering.
  • validate correctly fails on an injected unknown-group reference in critical_paths.yaml.
  • python -m tests.rig run --no-coverage --no-parallel unit-core end-to-end: creates per-group venv, installs rig pytest plugins, runs pytest, writes junit.xml + report.md + exit.txt, and aggregates into summary.md/summary.json with critical-path gaps section.
  • bash .github/scripts/combine-test-reports.sh shim works (delegates to python -m tests.rig report combine and preserves the legacy combined-test-report.md path).
  • Reviewer: trigger ci-test.yaml on this PR and confirm the sticky comment renders.
  • Reviewer: label this PR run-e2e to smoke-test ci-test-e2e.yaml against docker compose.

Notes for reviewers

  • The unit-core run surfaced two pre-existing import errors in unstract/core/tests/ (LogHelper not exported; account_services submodule missing). Those are not rig issues — the rig correctly captured them as errors. Fixing them is out of scope here.
  • unstract/connectors/uv.lock had unrelated churn (litellm source switch); intentionally left out of this PR.
  • No existing tests were moved. Per-service unit tests stay co-located with their service code; only e2e and cross-service integration tests live under tests/.

🤖 Generated with Claude Code

hari-kuriakose and others added 2 commits May 19, 2026 14:14
Introduces a single tox-driven entry point that brings every existing test
suite under one rig, adds an e2e tier on top of the existing docker-compose
stack, and surfaces critical-path gaps + regressions in CI reports.

- tests/groups.yaml declares 15 groups across unit/integration/e2e with a
  depends_on graph; tests/critical_paths.yaml registers 10 critical flows.
- tests/rig is the dispatcher: python -m tests.rig run [groups|all|--from-file]
  [--tier] [--changed-only] [--coverage/--no-coverage] [--parallel].
- E2E supports docker compose (CI default) or testcontainers + local
  subprocesses (dev default), selectable via UNSTRACT_E2E_RUNTIME.
- Reports: per-group junit.xml + md, plus reports/summary.md with regressions
  and critical-path gaps; htmlcov + coverage.xml when --coverage.
- New ci-test-e2e.yaml lane runs on main, run-e2e-labeled PRs, and nightly;
  ci-test.yaml rewritten to call the rig and update the main baseline cache.
- Pre-commit hook fires only when developers create .test-selection.
- Existing per-service tests stay co-located; legacy tox envs (runner, sdk1,
  prompt-service) preserved as thin aliases.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a Python "tests.rig" test rig: group and critical-path manifests and loaders, selection resolution, E2E runtime drivers, a CLI orchestrator to run groups, reporting and coverage combining, CI/workflow integration, pre-commit hook, helper combine script, extensive docs, and self-tests.

Changes

Test Rig Implementation

Layer / File(s) Summary
CI, tox and pre-commit wiring
.gitignore, .pre-commit-config.yaml, tox.ini, .github/workflows/ci-test.yaml, .github/workflows/ci-test-e2e.yaml, pyproject.toml, .github/scripts/combine-test-reports.sh
Adds CI workflows and e2e job, refactors tox to rig-driven tiers, adds rig dependency groups and coverage settings, updates gitignore and pre-commit hook for rig-run gating, and replaces the combine script with a thin wrapper invoking tests.rig report combine.
Manifests & docs
tests/groups.yaml, tests/critical_paths.yaml, tests/README.md, tests/compose/docker-compose.test.yaml, tests/conftest.py
Adds group and critical-path manifests, compose overlay for e2e, top-level pytest marker registration, and extensive README documenting rig usage and commands.
Group manifest loader & datamodel
tests/rig/groups.py
Typed dataclasses and loaders: GroupDefinition/GroupManifest with validation, path checks, dependency-cycle detection, platform-gate enforcement, and dependency-first expansion.
Selection & CLI orchestration
tests/rig/selection.py, tests/rig/cli.py, tests/rig/__main__.py, tests/rig/__init__.py
resolve() merges positional/from-file/tier/changed-only selections and the CLI implements run/report/validate/list/expand/platform; run orchestrates platform bring-up, per-group env prep, pytest/hurl execution, artifact capture, critical-path evaluation, baseline handling, report writing, and optional baseline updates.
E2E runtime drivers & endpoints
tests/rig/runtime.py, tests/compose/docker-compose.test.yaml, tests/e2e/conftest.py, tests/e2e/smoke/test_smoke.py
Provides Local/Compose/Testcontainers runtimes, readiness probing, PlatformEndpoints dataclass, compose overlay usage, e2e fixture rework, and smoke tests that assert rig session sentinel and backend health.
Reporting & coverage
tests/rig/reporting.py, tests/rig/coverage.py
Parses per-group junit/exit to GroupResult, writes summary.json/summary.md/combined-test-report.md, isolates per-group COVERAGE_FILE, combines coverage into xml/html, and exposes combine API used by the script and CI.
Tests, fixtures & self-tests
tests/rig/tests/*, tests/rig/tests/conftest.py
Adds comprehensive unit/self-tests covering groups manifest validation, critical-path evaluation/robustness, selection resolution, reporting parsing, runtime invariants, and CLI behaviors; includes docstring guidance for rig self-tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.74% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a tox-driven test rig foundation that consolidates unit, integration, and e2e testing under a single entry point.
Description check ✅ Passed The pull request description is comprehensive and well-structured, covering what was added, why it matters, usage examples, testing verification, and notes for reviewers. It addresses the repository's template sections adequately despite using a custom format.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/test-rig-foundation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

🧹 Nitpick comments (1)
tests/rig/groups.py (1)

91-94: ⚡ Quick win

Validate manifest schema version during load.

The manifest is versioned (version: 1) but load_groups() does not enforce it. Failing fast on unknown versions avoids silent misparsing when schema evolves.

Suggested fix
 def load_groups(path: Path | None = None) -> GroupManifest:
@@
     raw = yaml.safe_load(manifest_path.read_text())
     if not isinstance(raw, dict) or "groups" not in raw:
         raise ValueError(f"{manifest_path}: expected top-level `groups:` mapping")
+    if raw.get("version") != 1:
+        raise ValueError(
+            f"{manifest_path}: unsupported manifest version {raw.get('version')!r}; expected 1"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/groups.py` around lines 91 - 94, load_groups() currently accepts
any manifest without checking its version; add an explicit validation after the
top-level raw check to read raw.get("version") and if it's missing or not equal
to 1 raise a ValueError referencing manifest_path (e.g. "{manifest_path}:
unsupported or missing manifest version, expected version: 1"). Make the check
inside the same function (load_groups) right before reading
defaults/raw["groups"] so unknown schema versions fail fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci-test-e2e.yaml:
- Around line 39-45: The cache step "Restore main-branch test baseline" uses a
fixed key ("unstract-test-baseline-main-e2e") and a wrong restore-key
("unstract-test-baseline-main"); change the key to include a dynamic identifier
(e.g., github.ref or github.sha) so updates create new cache entries (for
example interpolate "${{ github.ref }}-unstract-test-baseline-main-e2e" or "${{
github.sha }}-unstract-test-baseline-main-e2e"), and make the restore-keys use
the correct e2e namespace (e.g., "unstract-test-baseline-main-e2e" as the
prefix) to avoid cross-lane contamination; apply the same change to the other
equivalent step referenced (lines 51-53).
- Around line 57-63: The "Capture docker compose logs on failure" step redirects
output to reports/docker-compose-logs.txt but doesn't ensure the reports/
directory exists; update the failing-step's run sequence to create the reports
directory (e.g., run mkdir -p reports) before executing the docker compose ... >
reports/docker-compose-logs.txt command so log redirection never fails and logs
are preserved.

In @.github/workflows/ci-test.yaml:
- Around line 44-50: The cache key "unstract-test-baseline-main" in the "Restore
main-branch test baseline (for regression detection)" step is constant and
prevents new baseline uploads; change the key to include a run-unique or
branch-specific variable (for example use unstract-test-baseline-main-${{
github.run_id }} or unstract-test-baseline-${{ github.ref_name }}-${{
github.run_id }}) so updates (--update-baseline) will write a new cache, and
keep the existing restore-keys prefix "unstract-test-baseline-" so prior
baselines can still be discovered; apply the same change to the equivalent step
referenced around lines 60-62.

In `@tests/e2e/smoke/test_smoke.py`:
- Around line 17-18: The current smoke test uses a lax assertion
(response.status_code < 500) which allows non-success codes like 404; update the
assertion after the requests.get call that fetches
f"{platform.backend_url.rstrip('/')}/health/" to require a successful response
(e.g., assert response.status_code == 200 or assert response.ok) so the health
endpoint must return an explicit success status to pass the test.

In `@tests/README.md`:
- Around line 13-37: The three fenced code blocks showing the test tree and
CLI/help snippets are missing language identifiers; update the three opening
backtick fences that precede the blocks starting with "tests/" (the directory
tree), "positional GROUPS  ∪  --from-file ..." (the CLI/usage snippet), and
"reports/" (the reports directory tree) to use a language tag (use "text") so
the fences read ```text instead of just ``` to satisfy markdownlint MD040.
- Around line 174-180: Update the opening sentence that currently reads
"reports/summary.md has three sections:" to match the enumerated list by
changing "three" to "four" (or alternatively remove the numbered count), so the
header and the listed items (Per-group results, ❌ Regressions, ⚠️ Critical paths
not yet covered, ✅ Covered critical paths) are consistent; locate the phrase
"reports/summary.md has three sections:" in the README and adjust it
accordingly.

In `@tests/rig/cli.py`:
- Around line 265-269: When the JUnit parsing returns None the current logic
(using result, group_results, and overall_exit) silently ignores the failure;
update the block handling result so that if result is None you still treat the
group as failed: append a placeholder/failure entry to group_results (or
otherwise record the failed parse) and set overall_exit = overall_exit or 1 (or
another non-zero exit) so the failure is propagated; keep existing behavior for
valid result objects (preserve the check against result.exit_code).
- Around line 207-297: cmd_run is too large and should be split into three clear
phases to reduce cognitive complexity: extract a plan_phase that performs
manifest/registry validation (using load_groups, cp.load_critical_paths,
cp.validate_registry_against_manifest), resolves groups (resolve,
_is_missing_placeholder) and prepares runnable/skipped lists and runtime
selection (pick_runtime) but does not start the runtime; extract an
execute_phase that receives runnable, manifest, endpoints, reports_dir and runs
each group via _execute_group collecting GroupResult and overall_exit
(preserving the dry_run behavior and parallel/timeout/workers args); extract a
finalize_phase that handles runtime teardown (runtime.down) in a single place,
coverage combine_and_report, baseline loading via cp.load_baseline, evaluation
via cp.evaluate, write_summary, reporting regressions/gaps and optionally
cp.emit_baseline, and returns overall_exit; ensure try/finally still guarantees
runtime.down only if runtime was started and not dry_run, thread through needed
values (runtime, endpoints, group_results, reports_dir, args) and keep existing
exit-code logic and logging messages unchanged.
- Line 60: Create a module-level constant (e.g., PREVIOUS_SUMMARY_FILENAME =
"previous-summary.json") in tests/rig/cli.py and replace the repeated literal
occurrences (referenced in the add_argument call where
pr.add_argument("--baseline", type=Path, default=REPO_ROOT / "reports" /
"previous-summary.json") and the other two occurrences at the same file) with
Path(REPO_ROOT / "reports" / PREVIOUS_SUMMARY_FILENAME) or Path(REPO_ROOT,
"reports", PREVIOUS_SUMMARY_FILENAME) so all uses reference the single constant
and avoid duplication/drift.
- Around line 252-253: The dry-run path currently does a per-plan "continue"
(args.dry_run) but still falls through to the post-plan evaluation that computes
gaps/regressions (the block that inspects the runs/gaps and returns non-zero),
causing false failures; change the control flow so that after planning is
complete (i.e., after the code that builds the runs/plan set) if args.dry_run is
true the function immediately short-circuits to success (return 0 or
sys.exit(0)) and does not execute the gaps/regressions evaluation block (the
code that inspects runs and computes regressions), and remove/replace the
per-iteration continue so dry-run does not leave an empty runs set for later
checks.

In `@tests/rig/coverage.py`:
- Around line 30-36: The subprocess calls that run "coverage combine/xml/html"
should be made path-safe and fail-fast: build the combine arguments using
absolute paths from reports_dir (e.g., list(map(str, [p.resolve() for p in
reports_dir.glob(".coverage.*")]))) and only call subprocess.run for combine
when that list is non-empty; remove cwd=reports_dir and instead pass absolute
output paths like str((reports_dir / "coverage.xml").resolve()) and
str((reports_dir / "htmlcov").resolve()); also set check=True on all
subprocess.run calls so failures raise instead of being silently ignored. Ensure
these changes are applied to the subprocess.run invocations shown (the calls to
subprocess.run for "coverage combine", "coverage xml", and "coverage html").

In `@tests/rig/critical_paths.py`:
- Around line 91-93: The baseline handling for regression evaluation is fragile:
update the previously_covered computation to defensively validate baseline and
its "covered_paths" before using it — check that baseline is a dict, read
covered = baseline.get("covered_paths", []) only if isinstance(baseline, dict),
ensure covered is an iterable/list (otherwise treat as empty), then build
previously_covered = set(str(p) for p in covered if isinstance(p, (str, int)))
so non-list shapes or non-string entries won't raise or silently miscompute
regressions; refer to the previously_covered variable and the
baseline["covered_paths"] access in tests/rig/critical_paths.py when making the
change.
- Around line 50-55: The code currently converts p.get("covered_by") directly to
tuple when building the CriticalPath, which turns a scalar string into a tuple
of characters; before converting, check the type of p.get("covered_by") (call it
covered = p.get("covered_by")) and if covered is not None and not an iterable
sequence like list/tuple, raise a clear ValueError (or TypeError) stating that
"covered_by must be a sequence", otherwise set covered_by=tuple(covered) (or
tuple() when covered is falsy); update the CriticalPath construction site to use
this validated/converted value so downstream validation receives a proper
sequence.

In `@tests/rig/groups.py`:
- Around line 159-165: The validation currently only checks existence for wd
(from g.absolute_workdir()) and each p in g.absolute_paths(), but doesn't ensure
they are inside the repository root; update the checks to resolve symlinks and
enforce repository boundaries: resolve wd and each p via
Path.resolve(strict=False) (to handle non-existing candidates if needed),
resolve the repository root once (repo_root.resolve()), then verify that
resolved_wd and each resolved_p are children of repo_root (using
Path.is_relative_to or comparing parents in a loop); if any resolved path is
outside repo_root, raise the same ValueError(f"group {name!r}: test path does
not exist: {p}") or a clearer message; keep the existing existence check
(p.exists()) but perform the containment check after resolving symlinks.

In `@tests/rig/reporting.py`:
- Around line 48-50: The current check returns None when junit_path doesn't
exist, which hides groups that have exit.txt; change the branch so that if
junit_path.exists() is False but exit_path.exists() is True you return a
zero-count GroupResult (preserving the group's name/status) instead of None.
Locate the code that references junit_path and exit_path and instantiate/return
a GroupResult (using the same constructor/signature used elsewhere in this file)
with an empty test list or zero counts so the group remains visible in
summaries.
- Line 18: Replace the stdlib XML parser import "import xml.etree.ElementTree as
ET" with the hardened parser "from defusedxml import ElementTree as ET" and
ensure all subsequent uses of ET (e.g., ET.parse, ET.fromstring) in the module
use this defusedxml.ElementTree API so JUnit artifacts are parsed with
XXE/XML-bomb protections; install defusedxml as a dependency if not present.

In `@tests/rig/runtime.py`:
- Around line 208-216: The helper _responds currently returns True when the
requests package is missing, which falsely marks services healthy; update the
ImportError branch in the _responds function so that if importing requests fails
it returns False (or raises a clear error) instead of True, keeping the existing
behavior for requests.get and requests.RequestException handling intact;
reference the _responds function to locate and change the ImportError handling.
- Around line 214-216: The health check currently treats any response with
resp.status_code < 500 as healthy, which wrongly accepts 4xx errors; update the
check so only successful responses are considered healthy (e.g., use resp.ok or
test 200 <= resp.status_code < 400) after the requests.get call and before
returning; keep the existing exception handling for requests.RequestException
unchanged but return False on exceptions so failures are not reported as
healthy.

In `@tests/rig/selection.py`:
- Around line 68-69: Replace the silent swallow in the except block that
currently returns an empty set when git diff fails (the except
(subprocess.CalledProcessError, FileNotFoundError): return set() in
tests/rig/selection.py) with raising a RuntimeError that includes the command
context and original exception message so callers can distinguish "no changes"
from a git error; ensure the raised error mentions the --changed-only scenario
and the underlying exception (e.g., include subprocess.CalledProcessError or
FileNotFoundError info) so callers like cmd_expand and cmd_run in cli.py will
surface a clear terminal error.

---

Nitpick comments:
In `@tests/rig/groups.py`:
- Around line 91-94: load_groups() currently accepts any manifest without
checking its version; add an explicit validation after the top-level raw check
to read raw.get("version") and if it's missing or not equal to 1 raise a
ValueError referencing manifest_path (e.g. "{manifest_path}: unsupported or
missing manifest version, expected version: 1"). Make the check inside the same
function (load_groups) right before reading defaults/raw["groups"] so unknown
schema versions fail fast.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 562d1a0f-86eb-454d-a64d-6eb6c1276fef

📥 Commits

Reviewing files that changed from the base of the PR and between 3cd2aa8 and 1495649.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (28)
  • .github/scripts/combine-test-reports.sh
  • .github/workflows/ci-test-e2e.yaml
  • .github/workflows/ci-test.yaml
  • .gitignore
  • .pre-commit-config.yaml
  • pyproject.toml
  • tests/README.md
  • tests/__init__.py
  • tests/compose/docker-compose.test.yaml
  • tests/conftest.py
  • tests/critical_paths.yaml
  • tests/e2e/__init__.py
  • tests/e2e/conftest.py
  • tests/e2e/smoke/__init__.py
  • tests/e2e/smoke/test_smoke.py
  • tests/fixtures/.gitkeep
  • tests/groups.yaml
  • tests/integration/__init__.py
  • tests/rig/__init__.py
  • tests/rig/__main__.py
  • tests/rig/cli.py
  • tests/rig/coverage.py
  • tests/rig/critical_paths.py
  • tests/rig/groups.py
  • tests/rig/reporting.py
  • tests/rig/runtime.py
  • tests/rig/selection.py
  • tox.ini

Comment thread .github/workflows/ci-test-e2e.yaml Outdated
Comment thread .github/workflows/ci-test-e2e.yaml
Comment thread .github/workflows/ci-test.yaml Outdated
Comment thread tests/e2e/smoke/test_smoke.py
Comment thread tests/README.md
Comment thread tests/rig/reporting.py
Comment thread tests/rig/reporting.py Outdated
Comment thread tests/rig/runtime.py Outdated
Comment thread tests/rig/runtime.py Outdated
Comment thread tests/rig/selection.py Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR establishes a unified tox-driven test rig that brings 15 group definitions, a critical-path registry, two CI workflows, and an e2e platform-runtime layer under a single entry point. The foundation is solid — manifest validation, topological dep-expansion, and per-group reporting are well-designed — but several logic gaps in the regression-detection pipeline need attention before the rig reliably gates main.

  • scope_groups is computed before optional-placeholder filtering (cli.py line 345): skipped optional groups remain in scope, so any group that was previously covered and later loses its test paths triggers a false regression classification that permanently blocks --update-baseline.
  • resolve() union semantics with --tier + named positionals (selection.py): passing a specific group name alongside --tier expands to all tier groups rather than filtering to just the named group; the PR description example tox -e e2e -- e2e-smoke is misleading (works today only because the other e2e groups are optional/missing).

Confidence Score: 3/5

The rig's regression-detection loop has multiple compounding defects on the main-branch path: the baseline is never written when critical-path gaps exist (6 gaps are known), and scope_groups is inflated with skipped optional groups which can cause false regressions that then block the baseline write permanently.

The --update-baseline guard (overall_exit == 0) is permanently blocked by the known 6 critical-path gaps on every main-branch unit run — the baseline cache stays empty, so regression detection never fires. On top of that, scope_groups includes optional placeholder groups that were skipped, meaning any future group that had tests, was in the baseline, and then lost its paths would mis-classify as a regression and extend the stuck state.

tests/rig/cli.py (scope_groups computation and update-baseline guard) and tests/rig/selection.py (tier-filter union vs intersection semantics) deserve the closest review before merging.

Important Files Changed

Filename Overview
tests/rig/cli.py Core rig dispatcher; scope_groups computation before optional-placeholder filtering can cause false regression classification and permanently block --update-baseline.
tests/rig/selection.py Group resolution logic; union semantics when --tier is combined with named positionals can produce surprising selection set, though the practical effect is masked by optional-group skipping.
tests/groups.yaml Group manifest; e2e-smoke and integration-workflow-execution both carry cross-tier depends_on [unit-sdk1, unit-workers], pulling unit groups into e2e and integration tier runs.
tests/rig/runtime.py Platform runtime drivers; x2text_url is now correctly included in both the subprocess env and the readiness probe; minio uses unpinned latest tag (flagged in prior threads).
.github/workflows/ci-test.yaml CI unit+integration workflow; --update-baseline only fires on main but is blocked whenever --fail-on-critical-gap causes overall_exit=1 (known 6-gap situation), so the baseline cache stays empty.
.github/workflows/ci-test-e2e.yaml E2e CI workflow; no longer passes the all positional (prior thread fixed); correctly relies on --tier e2e to scope the run.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[tox -e unit/integration/e2e] --> B[tests.rig run --tier X]
    B --> C[resolve: select groups]
    C --> D[manifest.expand: add transitive deps]
    D --> E[scope_groups = set of ordered groups]
    E --> F[filter runnable vs skipped optional]
    F --> G{needs_platform?}
    G -- yes --> H[pick_runtime / ComposeRuntime.up]
    G -- no --> I[execute groups sequentially]
    H --> I
    I --> J[_execute_group: uv run pytest / hurl]
    J --> K[writes junit.xml, report.md, exit.txt]
    K --> L[parse_junit → GroupResult]
    L --> M[_green_group_names: pass or empty]
    M --> N[evaluate: covered / gap / regression]
    N --> O[write_summary: summary.md + json]
    O --> P{update_baseline AND overall_exit==0?}
    P -- yes --> Q[merge_into_baseline]
    P -- no --> R[baseline unchanged]
    Q --> S[end]
    R --> S
Loading

Fix All in Claude Code

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
tests/rig/cli.py:345-347
`scope_groups` is set from `ordered` **before** optional-placeholder groups are filtered out into `skipped`. If a group was previously run green (and thus appears in the baseline), is now marked `optional: true`, and its paths have since been removed, it will remain in `scope_groups` but never enter `green`. On the next CI run that includes this group via dep-expansion, `evaluate()` sees: `path.id in previously_covered AND in_scope AND not covered``state = "regression"`. Because regressions set `overall_exit = 1`, `--update-baseline` is permanently blocked (its guard is `overall_exit == 0`), creating a stuck state that can only be resolved by manually deleting the cache. Filtering `scope_groups` to exclude the groups the rig actually intends to run fixes this.

```suggestion
    # scope_groups must exclude optional placeholders that were skipped:
    # if a skipped group was previously in the baseline, keeping it in scope
    # would classify its covered paths as regressions even though the group
    # never ran, blocking --update-baseline and creating a stuck state.
    runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))}
    scope_groups = runnable_pre

    reports_dir: Path = args.reports_dir
```

### Issue 2 of 2
tests/rig/selection.py:52-64
**`--tier` acts as expansion, not filter, when named positionals are provided**

When `--tier` is set and the positionals are specific group names (not `"all"`), the code always calls `selected.update(tier_set)`, which unions all tier groups into the selection rather than intersecting. Concretely, `tox -e e2e -- e2e-smoke` resolves to every e2e-tier group, not just `e2e-smoke`. The PR description lists this exact invocation as "smoke against docker compose", so users following it will silently run more than expected. Today the extra groups are optional/missing and get skipped, so the output happens to be correct — but any future non-optional e2e group would be unexpectedly included.

Reviews (9): Last reviewed commit: "[MISC] Forward + probe x2text_url (Grept..." | Re-trigger Greptile

Comment thread .github/workflows/ci-test-e2e.yaml
Comment thread tests/rig/runtime.py Outdated
Comment thread .github/scripts/combine-test-reports.sh Outdated
Comment thread tests/rig/runtime.py Outdated
Addresses the 5-agent review of PR #1971. Highlights:

Critical fixes:
- Wire --cov per group via new `coverage_source` field; pytest-cov now actually
  produces htmlcov/ and coverage.xml.
- Switch CI to run unit and integration tiers as separate steps and merge
  (rather than overwrite) baselines so the second tier doesn't erase the
  first's covered paths.
- Include github.run_id in baseline cache key so actions/cache writes a fresh
  entry every main build (it only saves on miss).
- Hurl exit 5 (no tests collected) now synthesises failures=0 instead of 1.
- Move runtime.up() inside try/finally so partial container stacks get torn
  down on failure.
- Always fold subprocess exit codes into overall_exit, even when junit.xml is
  missing (segfault / OOM / missing binary).

Important fixes:
- Shim script uses python3 (CI runners don't ship python).
- Drop `all` from ci-test-e2e.yaml so the env name matches what runs;
  resolve() now intersects `all` with --tier rather than unioning.
- Disable --md-report under xdist (incompatible).
- Add [tool.coverage.paths] mapping so combined XML/HTML resolves correctly
  across per-workdir runs.
- README: testcontainers stub status is now stated honestly; section count
  fixed.
- Honest covered_by: [] on adapter-register-llm.
- Add `tox -e rig -- validate` step to both workflows.
- Smoke test asserts UNSTRACT_BACKEND_URL matches the fixture URL.
- Surface git stderr on --changed-only failures.
- Guard parse_junit against malformed/truncated XML and missing counters.
- Log testcontainers down() failures instead of swallowing.
- ruff format + check clean on tests/.

Type / design:
- Literal types for Tier, Runner, CriticalPathState, ResultStatus.
- ClassVar[str] on PlatformRuntime; freeze GroupResult.
- Drop PlatformEndpoints.extras (never read) for typed InfraEndpoints.
- Merge PlatformAccess into PlatformEndpoints (single source of truth).
- by_id cache in CriticalPathRegistry.__post_init__.
- Tighten unit-backend.paths from `[.]` to explicit per-app tests/ dirs.

Self-tests:
- New unit-rig group covering tests/rig/* logic (25 tests, ~0.1s):
  cycle detection, unknown dep, topo expansion, tier/runner validation,
  platform-smoke-dep invariant, real-manifest sanity, evaluate state
  machine, baseline merge (not overwrite), junit edge cases (testsuites
  wrapper, exit 5, missing counters, malformed XML), selection precedence
  (all+tier intersect, empty selection, from-file merging).

Verified locally:
- `python -m tests.rig validate` → OK (16 groups, 10 critical paths).
- `python -m tests.rig run --no-parallel unit-rig` → 25 passed; emits
  htmlcov/, coverage.xml (42.5%), summary.{md,json}.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Comment thread tests/rig/cli.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (7)
tests/rig/runtime.py (2)

265-267: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tighten health criteria to avoid false-positive readiness.

Line 266 accepts 4xx responses as healthy; a misrouted or unauthorized /health endpoint can incorrectly pass readiness.

💡 Suggested fix
 def _responds(url: str, requests_mod) -> bool:
     try:
         resp = requests_mod.get(url, timeout=2)
-        return resp.status_code < 500
+        return 200 <= resp.status_code < 300
     except requests_mod.RequestException:
         return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/runtime.py` around lines 265 - 267, The health check currently
treats any status_code < 500 as healthy which allows 4xx responses; in the try
block where you call requests_mod.get(url, timeout=2) and assign resp, tighten
the success criterion to only accept 2xx responses (e.g., check resp.status_code
>= 200 and resp.status_code < 300) and return False for other statuses; keep the
existing except requests_mod.RequestException handler unchanged so network
errors still return False.

243-247: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t skip readiness checks when requests is unavailable.

Line 246 currently turns a broken rig environment into a soft pass for readiness, which can mask startup failures until much later.

💡 Suggested fix
     try:
         import requests
-    except ImportError:
-        log.warning("`requests` not installed; skipping platform readiness probe")
-        return
+    except ImportError as exc:
+        raise RuntimeError(
+            "requests is required for platform readiness probes in ComposeRuntime"
+        ) from exc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/runtime.py` around lines 243 - 247, The readiness probe swallows a
missing `requests` dependency by logging a warning and returning, which masks
startup failures; in the except ImportError block around the `import requests`
in tests/rig/runtime.py (the try/except that currently calls
`log.warning("`requests` not installed; skipping platform readiness probe")`),
change behavior to fail fast: replace the soft return with an explicit failure
(e.g., log an error and raise ImportError or RuntimeError with a clear message
instructing to install `requests`) so the readiness check cannot be skipped
silently.
tests/rig/coverage.py (1)

61-63: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make coverage combine path handling robust for relative reports_dir.

When reports_dir is relative, the current cwd=reports_dir + prefixed file args can point to the wrong location (e.g., reports/reports/.coverage.*), so combine may fail and coverage artifacts are dropped.

💡 Suggested fix
     result = subprocess.run(
-        [*base, "combine", *[str(p) for p in files]],
-        cwd=reports_dir,
+        [*base, "combine", *[str(p.resolve()) for p in files]],
+        cwd=REPO_ROOT,
         capture_output=True,
         text=True,
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/coverage.py` around lines 61 - 63, The coverage combine invocation
can end up with duplicated/incorrect paths when reports_dir is relative; update
the subprocess.run call around result, base, files, and reports_dir so the file
arguments are either basenames when using cwd=reports_dir or are absolute paths
and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path
and pass file args as [p.name for p in files] when calling
subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and
pass absolute paths for each file (e.g., map each file through
Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no
longer produces wrong combined paths. Ensure changes reference subprocess.run,
result, base, files, and reports_dir.
tests/rig/reporting.py (2)

18-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a hardened XML parser for JUnit artifact parsing.

xml.etree.ElementTree is not the right default for untrusted CI artifacts. Switch to defusedxml.ElementTree here.

💡 Proposed fix
-import xml.etree.ElementTree as ET
+from defusedxml import ElementTree as ET
#!/bin/bash
# Verify XML parser usage and whether defusedxml is already declared.
set -euo pipefail
rg -n "xml\.etree\.ElementTree|defusedxml" tests/rig/reporting.py pyproject.toml requirements*.txt tox.ini 2>/dev/null || true

Also applies to: 73-74

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/reporting.py` at line 18, Replace the insecure import "import
xml.etree.ElementTree as ET" with the hardened parser "import
defusedxml.ElementTree as ET" (and update any other occurrences around the other
import at lines ~73-74) so JUnit artifact parsing uses defusedxml; also ensure
defusedxml is added to project dependencies (pyproject.toml/requirements) so the
import resolves at runtime.

67-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep groups visible when exit.txt exists but junit.xml is missing.

Returning None here drops the group from summaries, which can hide failed/aborted runs.

💡 Proposed fix
     junit_path = reports_dir / group_name / "junit.xml"
     exit_path = reports_dir / group_name / "exit.txt"
     if not junit_path.exists():
-        return None
+        if not exit_path.exists():
+            return None
+        return GroupResult(
+            name=group_name,
+            tier=tier,
+            exit_code=_read_exit_code(exit_path),
+            passed=0,
+            failed=0,
+            errors=0,
+            skipped=0,
+            duration_seconds=0.0,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/reporting.py` around lines 67 - 68, The current check returns None
when junit_path doesn't exist, which drops the group from summaries; instead,
update the code that reads JUnit to preserve the group when exit.txt exists by
not returning None for a missing junit file—use a placeholder/empty JUnit result
or a minimal object indicating “no junit” so downstream summary logic still
includes the group; modify the branch around junit_path.exists() (the junit_path
check and the function that returns its result) to return an empty/placeholder
test-report structure or a sentinel that keeps the group visible rather than
None.
tests/README.md (1)

13-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks.

These fences should use an explicit language (e.g., text) to satisfy markdownlint MD040.

💡 Proposed fix
-```
+```text
 tests/
 ...
-```
+```

-```
+```text
 positional GROUPS  ∪  --from-file lines  ∪  --tier filter  ∪  --changed-only diff
-```
+```

-```
+```text
 reports/
 ...
-```
+```

Also applies to: 131-131, 162-162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/README.md` at line 13, Fenced code blocks in tests/README.md are
missing language identifiers (markdownlint MD040); add the explicit language
specifier `text` to each triple-backtick fence that wraps the examples (the
blocks containing "tests/", the one with "positional GROUPS  ∪  --from-file
lines  ∪  --tier filter  ∪  --changed-only diff", and the block with
"reports/"), and apply the same change to the other occurrences noted (around
the blocks at the other two locations referenced in the comment).
tests/rig/cli.py (1)

319-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit --dry-run before critical-path evaluation.

Plan-only mode currently can still return failure from regression/gap evaluation despite executing nothing.

💡 Proposed fix
     finally:
         if runtime is not None and not args.dry_run:
             print(f"[rig] tearing platform down (runtime={runtime.name})")
             runtime.down()

+    if args.dry_run:
+        return 0
+
     if args.coverage and not args.dry_run:
         combine_and_report(reports_dir)

Also applies to: 347-379

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/cli.py` around lines 319 - 320, The loop currently checks
args.dry_run only after starting critical-path checks, which allows
regression/gap evaluation to set failure state; move the args.dry_run
short-circuit to the top of the loop (or before any calls that perform
regression/gap evaluation) so the iteration immediately continues/returns
without invoking those evaluators, and ensure any status-aggregating variables
(exit codes, failure flags) are not modified when args.dry_run is true; apply
the same change for the other similar block covering the regression/gap
evaluation region (the section corresponding to lines 347-379).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/README.md`:
- Line 13: Fenced code blocks in tests/README.md are missing language
identifiers (markdownlint MD040); add the explicit language specifier `text` to
each triple-backtick fence that wraps the examples (the blocks containing
"tests/", the one with "positional GROUPS  ∪  --from-file lines  ∪  --tier
filter  ∪  --changed-only diff", and the block with "reports/"), and apply the
same change to the other occurrences noted (around the blocks at the other two
locations referenced in the comment).

In `@tests/rig/cli.py`:
- Around line 319-320: The loop currently checks args.dry_run only after
starting critical-path checks, which allows regression/gap evaluation to set
failure state; move the args.dry_run short-circuit to the top of the loop (or
before any calls that perform regression/gap evaluation) so the iteration
immediately continues/returns without invoking those evaluators, and ensure any
status-aggregating variables (exit codes, failure flags) are not modified when
args.dry_run is true; apply the same change for the other similar block covering
the regression/gap evaluation region (the section corresponding to lines
347-379).

In `@tests/rig/coverage.py`:
- Around line 61-63: The coverage combine invocation can end up with
duplicated/incorrect paths when reports_dir is relative; update the
subprocess.run call around result, base, files, and reports_dir so the file
arguments are either basenames when using cwd=reports_dir or are absolute paths
and cwd is omitted. Concretely: either resolve reports_dir to an absolute Path
and pass file args as [p.name for p in files] when calling
subprocess.run([*base, "combine", *...], cwd=reports_dir), or keep cwd None and
pass absolute paths for each file (e.g., map each file through
Path(reports_dir).joinpath(p.name).resolve()) so subprocess.run(..., cwd=...) no
longer produces wrong combined paths. Ensure changes reference subprocess.run,
result, base, files, and reports_dir.

In `@tests/rig/reporting.py`:
- Line 18: Replace the insecure import "import xml.etree.ElementTree as ET" with
the hardened parser "import defusedxml.ElementTree as ET" (and update any other
occurrences around the other import at lines ~73-74) so JUnit artifact parsing
uses defusedxml; also ensure defusedxml is added to project dependencies
(pyproject.toml/requirements) so the import resolves at runtime.
- Around line 67-68: The current check returns None when junit_path doesn't
exist, which drops the group from summaries; instead, update the code that reads
JUnit to preserve the group when exit.txt exists by not returning None for a
missing junit file—use a placeholder/empty JUnit result or a minimal object
indicating “no junit” so downstream summary logic still includes the group;
modify the branch around junit_path.exists() (the junit_path check and the
function that returns its result) to return an empty/placeholder test-report
structure or a sentinel that keeps the group visible rather than None.

In `@tests/rig/runtime.py`:
- Around line 265-267: The health check currently treats any status_code < 500
as healthy which allows 4xx responses; in the try block where you call
requests_mod.get(url, timeout=2) and assign resp, tighten the success criterion
to only accept 2xx responses (e.g., check resp.status_code >= 200 and
resp.status_code < 300) and return False for other statuses; keep the existing
except requests_mod.RequestException handler unchanged so network errors still
return False.
- Around line 243-247: The readiness probe swallows a missing `requests`
dependency by logging a warning and returning, which masks startup failures; in
the except ImportError block around the `import requests` in
tests/rig/runtime.py (the try/except that currently calls
`log.warning("`requests` not installed; skipping platform readiness probe")`),
change behavior to fail fast: replace the soft return with an explicit failure
(e.g., log an error and raise ImportError or RuntimeError with a clear message
instructing to install `requests`) so the readiness check cannot be skipped
silently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 74802a43-b372-4ed8-a7bd-f3c2cd741b61

📥 Commits

Reviewing files that changed from the base of the PR and between 1495649 and 8bb49a5.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • .github/scripts/combine-test-reports.sh
  • .github/workflows/ci-test-e2e.yaml
  • .github/workflows/ci-test.yaml
  • pyproject.toml
  • tests/README.md
  • tests/compose/docker-compose.test.yaml
  • tests/conftest.py
  • tests/critical_paths.yaml
  • tests/e2e/conftest.py
  • tests/e2e/smoke/test_smoke.py
  • tests/groups.yaml
  • tests/rig/__init__.py
  • tests/rig/__main__.py
  • tests/rig/cli.py
  • tests/rig/coverage.py
  • tests/rig/critical_paths.py
  • tests/rig/groups.py
  • tests/rig/reporting.py
  • tests/rig/runtime.py
  • tests/rig/selection.py
  • tests/rig/tests/__init__.py
  • tests/rig/tests/conftest.py
  • tests/rig/tests/test_critical_paths.py
  • tests/rig/tests/test_groups.py
  • tests/rig/tests/test_reporting.py
  • tests/rig/tests/test_selection.py
✅ Files skipped from review due to trivial changes (2)
  • tests/rig/tests/conftest.py
  • tests/conftest.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • tests/rig/main.py
  • tests/e2e/smoke/test_smoke.py
  • tests/compose/docker-compose.test.yaml
  • .github/workflows/ci-test.yaml
  • .github/workflows/ci-test-e2e.yaml
  • tests/critical_paths.yaml
  • pyproject.toml
  • tests/rig/groups.py
  • tests/rig/critical_paths.py

…egrity

Addresses the 5 new findings from the second review pass:

N1+N2 (critical) — cross-tier regression false positives:
- Namespace baseline caches per workflow (`unstract-test-baseline-ut-*` for
  ci-test.yaml; `unstract-test-baseline-e2e-*` for ci-test-e2e.yaml).
  Previously, ci-test.yaml's restore-keys prefix matched the e2e cache,
  pulling e2e-tier coverage into unit/integration where those groups never
  ran — every e2e-covered path lit up as a regression.
- Add `scope_groups` to evaluate(); paths whose covered_by is fully outside
  the current invocation's scope are classified as `gap` (out of scope), not
  `regression`. cmd_run passes ordered groups as the scope.

N3+N4 (critical) — silent baseline corruption:
- merge_into_baseline and load_baseline now raise BaselineCorruptError on
  parse failure instead of treating the file as empty. The old behavior
  would erase the other tier's coverage on the next merge and demote real
  regressions to gaps on the build that most needed detection.

N5 — smoke test tautology:
- Old assertion compared platform.backend_url against the env var it was
  built from — guaranteed to pass. Replaced with a session-sentinel check:
  the rig stamps UNSTRACT_RIG_SESSION_ID (uuid4, set unconditionally so
  setdefault can't let a stale value win) into every group's env, and the
  smoke test asserts its presence.

N6 — unit-rig measuring its own tests as covered:
- Add [tool.coverage.run].omit for `**/test_*.py`, `**/*_tests.py`,
  `**/tests.py`. Keeps the source modules under tests/rig/ measurable while
  excluding the test files themselves.

N7 — gate validator silently disabled if smoke renamed:
- `_validate_platform_groups_depend_on_gate` now raises when platform groups
  exist but the gate is undefined. Gate name overridable via
  `defaults.platform_gate_group` for forks.

N8 — private cross-package import:
- Promote `_endpoints_from_env` to `PlatformEndpoints.from_env()` classmethod;
  conftest.py imports the public name.

N9 — teardown exceptions masking the original:
- Wrap `runtime.down()` in cmd_run's finally with its own try/except so a
  teardown failure no longer hides the upstream platform-startup error.

N10 — InfraEndpoints invariant:
- __post_init__ asserts redis_host/redis_port (and rabbitmq_*) are set
  together; partial specs raise.

N11 — comment polish:
- Lift "smoke gates every platform group" rationale to a section banner
  above the e2e tier in groups.yaml.
- Remove dead `emit_baseline` alias.
- Rewrite unit-backend paths comment as a positive rule.

Self-tests: 25 → 31 passing in 0.05s. New tests cover baseline corruption
detection (both load and merge), scope-aware regression demotion, gate
validation error paths, and custom gate names via defaults.

Verified locally:
- `python -m tests.rig validate` → OK.
- `python -m tests.rig run --no-parallel unit-rig` → 31 pass; coverage XML +
  HTML emitted at ~32.5% on the rig itself (down from 42% because broader
  source files are now in scope).
- `uvx ruff check --line-length 90 tests/` → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (6)
tests/e2e/smoke/test_smoke.py (1)

33-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require a success response for the smoke health check.

This currently passes on 4xx responses, which can mask a broken deployment path in e2e smoke.

Suggested fix
 def test_backend_health(platform) -> None:
     response = requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10)
-    assert response.status_code < 500, f"backend /health returned {response.status_code}"
+    assert response.status_code == 200, f"backend /health returned {response.status_code}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/smoke/test_smoke.py` around lines 33 - 35, The health-check
assertion in test_backend_health currently allows 4xx responses; change it to
require a successful 2xx response by asserting response.status_code is in the
200–299 range (e.g., response.status_code >= 200 and response.status_code < 300)
when calling requests.get(f"{platform.backend_url.rstrip('/')}/health/",
timeout=10) inside the test_backend_health function so failures surface
correctly.
tests/rig/groups.py (1)

185-190: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce repository-boundary checks for workdir and test paths.

Existence-only validation still allows resolved paths outside REPO_ROOT.

Suggested fix
 def _validate_paths(groups: dict[str, GroupDefinition]) -> None:
+    repo_root = REPO_ROOT.resolve()
     for name, g in groups.items():
@@
-        wd = g.absolute_workdir()
+        wd = g.absolute_workdir().resolve()
+        if not wd.is_relative_to(repo_root):
+            raise ValueError(f"group {name!r}: workdir escapes repo root: {wd}")
         if not wd.exists():
             raise ValueError(f"group {name!r}: workdir does not exist: {wd}")
-        for p in g.absolute_paths():
+        for p in g.absolute_paths():
+            p = p.resolve()
+            if not p.is_relative_to(repo_root):
+                raise ValueError(f"group {name!r}: test path escapes repo root: {p}")
             if not p.exists():
                 raise ValueError(f"group {name!r}: test path does not exist: {p}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/groups.py` around lines 185 - 190, The current validation in the
block using g.absolute_workdir() and g.absolute_paths() only checks existence
and allows paths outside REPO_ROOT; update the checks to also ensure the
resolved workdir and each resolved test path is inside the repository root
(e.g., use Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare
pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath),
and if a path is outside REPO_ROOT raise a ValueError with a clear message
referencing the offending path and group name; apply this logic for wd (from
g.absolute_workdir()) and every p from g.absolute_paths() to enforce
repository-boundary constraints.
tests/rig/cli.py (1)

345-346: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit --dry-run before baseline evaluation and gating.

Plan-only runs currently fall through into regression/gap checks and can return non-zero without executing any group.

Suggested fix
     finally:
         if runtime is not None and not args.dry_run:
@@
             except Exception as exc:
                 print(
                     f"[rig] teardown failed (runtime={runtime.name}): {exc}",
                     file=sys.stderr,
                 )
+
+    if args.dry_run:
+        return 0
 
     if args.coverage and not args.dry_run:
         combine_and_report(reports_dir)

Also applies to: 382-421

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/cli.py` around lines 345 - 346, Short-circuit handling of the
--dry-run flag: in the main CLI loop (where args.dry_run is checked) move/extend
the check so that if args.dry_run is true you skip baseline evaluation and
gating logic (the calls to evaluate_baseline and gate_groups) and exit
successfully (return/exit 0) instead of falling through; ensure you reference
and guard both places mentioned (around the current args.dry_run check and the
block that calls evaluate_baseline and gate_groups) so plan-only runs never
perform regression/gap checks or return non-zero.
tests/rig/critical_paths.py (2)

69-75: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate covered_by type before tuple conversion.

If YAML provides a scalar (e.g., "unit-api"), this becomes a tuple of characters and corrupts coverage semantics.

Suggested fix
 from typing import Any, Literal
@@
 def load_critical_paths(path: Path | None = None) -> CriticalPathRegistry:
@@
             CriticalPath(
                 id=str(p["id"]),
                 description=str(p.get("description", "")),
                 entry=str(p.get("entry", "")),
-                covered_by=tuple(p.get("covered_by") or ()),
+                covered_by=_parse_covered_by(p.get("covered_by"), path_id=str(p.get("id", "<unknown>"))),
             )
             for p in raw["paths"]
         )
     )
+
+
+def _parse_covered_by(value: Any, *, path_id: str) -> tuple[str, ...]:
+    if value is None:
+        return ()
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"critical path {path_id!r}: `covered_by` must be a list")
+    return tuple(str(v) for v in value)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/critical_paths.py` around lines 69 - 75, The generator building
CriticalPath instances incorrectly converts a scalar covered_by (e.g.,
"unit-api") into a tuple of characters; update the covered_by expression in the
CriticalPath construction to validate the type: if p.get("covered_by") is
already a list/tuple use tuple(value), if it's a non-empty scalar (like str)
wrap it as (str(value),), and if None/empty produce an empty tuple. Locate the
generator over raw["paths"] and change the covered_by handling accordingly so
only iterable lists/tuples are expanded and scalars become single-element
tuples.

121-123: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Defensively validate baseline shape before consuming covered_paths.

Current logic assumes a dict with list-like covered_paths; malformed but parseable JSON can miscompute regressions or raise unexpected exceptions during merge.

Suggested fix
 def evaluate(
@@
-    previously_covered: set[str] = set(
-        (baseline or {}).get("covered_paths", []) if baseline else []
-    )
+    covered_raw: list[Any] = []
+    if isinstance(baseline, dict):
+        candidate = baseline.get("covered_paths", [])
+        if isinstance(candidate, list):
+            covered_raw = candidate
+    previously_covered: set[str] = {p for p in covered_raw if isinstance(p, str)}
@@
 def merge_into_baseline(statuses: list[CriticalPathStatus], destination: Path) -> None:
@@
-            parsed = json.loads(destination.read_text())
-            existing = set(parsed.get("covered_paths") or [])
+            parsed = json.loads(destination.read_text())
+            if not isinstance(parsed, dict):
+                raise BaselineCorruptError(
+                    f"refusing to merge into corrupt baseline {destination}: "
+                    "expected JSON object with `covered_paths` list. Delete cache and re-run."
+                )
+            raw = parsed.get("covered_paths") or []
+            if not isinstance(raw, list):
+                raise BaselineCorruptError(
+                    f"refusing to merge into corrupt baseline {destination}: "
+                    "`covered_paths` must be a list. Delete cache and re-run."
+                )
+            existing = {p for p in raw if isinstance(p, str)}

Also applies to: 169-171

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/critical_paths.py` around lines 121 - 123, The code assumes
`baseline` is a dict with a list-like `covered_paths` which can raise or corrupt
results; update the logic around the `previously_covered` assignment to
defensively validate the shape: check `isinstance(baseline, dict)` and
`isinstance(baseline.get("covered_paths"), list)` before using it, coerce into a
set filtering only string entries (e.g., {p for p in baseline["covered_paths"]
if isinstance(p, str)}) and otherwise use an empty set; apply the same defensive
pattern where `covered_paths` is consumed again (the same block referenced at
lines 169-171) to avoid exceptions or bad data propagation.
tests/rig/runtime.py (1)

264-268: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make readiness probing fail-closed instead of pass-when-broken.

Skipping probes when requests is unavailable and accepting 4xx as healthy both create false-positive “ready” states.

Suggested fix
 def _wait_ready(endpoints: PlatformEndpoints, *, timeout_seconds: int = 300) -> None:
@@
     try:
         import requests
     except ImportError:
-        log.warning("`requests` not installed; skipping platform readiness probe")
-        return
+        raise RuntimeError(
+            "`requests` is required for platform readiness checks in e2e runtime"
+        )
@@
 def _responds(url: str, requests_mod) -> bool:
     try:
         resp = requests_mod.get(url, timeout=2)
-        return resp.status_code < 500
+        return 200 <= resp.status_code < 300
     except requests_mod.RequestException:
         return False

Also applies to: 287-287

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/runtime.py` around lines 264 - 268, The readiness probe currently
skips when the requests package is unavailable and treats 4xx responses as
healthy; change it to fail-closed by making the ImportError path error out
(log.error and return/raise a non-ready result) instead of returning early, and
update the HTTP-status handling so only 2xx responses are considered healthy
(treat 3xx/4xx/5xx as unhealthy). Specifically, modify the try/except that
imports requests to report failure and propagate non-ready status, and adjust
the probe logic that currently accepts 4xx responses so it only accepts status
codes in the 200-299 range.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@tests/e2e/smoke/test_smoke.py`:
- Around line 33-35: The health-check assertion in test_backend_health currently
allows 4xx responses; change it to require a successful 2xx response by
asserting response.status_code is in the 200–299 range (e.g.,
response.status_code >= 200 and response.status_code < 300) when calling
requests.get(f"{platform.backend_url.rstrip('/')}/health/", timeout=10) inside
the test_backend_health function so failures surface correctly.

In `@tests/rig/cli.py`:
- Around line 345-346: Short-circuit handling of the --dry-run flag: in the main
CLI loop (where args.dry_run is checked) move/extend the check so that if
args.dry_run is true you skip baseline evaluation and gating logic (the calls to
evaluate_baseline and gate_groups) and exit successfully (return/exit 0) instead
of falling through; ensure you reference and guard both places mentioned (around
the current args.dry_run check and the block that calls evaluate_baseline and
gate_groups) so plan-only runs never perform regression/gap checks or return
non-zero.

In `@tests/rig/critical_paths.py`:
- Around line 69-75: The generator building CriticalPath instances incorrectly
converts a scalar covered_by (e.g., "unit-api") into a tuple of characters;
update the covered_by expression in the CriticalPath construction to validate
the type: if p.get("covered_by") is already a list/tuple use tuple(value), if
it's a non-empty scalar (like str) wrap it as (str(value),), and if None/empty
produce an empty tuple. Locate the generator over raw["paths"] and change the
covered_by handling accordingly so only iterable lists/tuples are expanded and
scalars become single-element tuples.
- Around line 121-123: The code assumes `baseline` is a dict with a list-like
`covered_paths` which can raise or corrupt results; update the logic around the
`previously_covered` assignment to defensively validate the shape: check
`isinstance(baseline, dict)` and `isinstance(baseline.get("covered_paths"),
list)` before using it, coerce into a set filtering only string entries (e.g.,
{p for p in baseline["covered_paths"] if isinstance(p, str)}) and otherwise use
an empty set; apply the same defensive pattern where `covered_paths` is consumed
again (the same block referenced at lines 169-171) to avoid exceptions or bad
data propagation.

In `@tests/rig/groups.py`:
- Around line 185-190: The current validation in the block using
g.absolute_workdir() and g.absolute_paths() only checks existence and allows
paths outside REPO_ROOT; update the checks to also ensure the resolved workdir
and each resolved test path is inside the repository root (e.g., use
Path.resolve() and Path.is_relative_to(REPO_ROOT) or compare
pathlib.Path(REPO_ROOT).resolve() with p.resolve().absolute() via commonpath),
and if a path is outside REPO_ROOT raise a ValueError with a clear message
referencing the offending path and group name; apply this logic for wd (from
g.absolute_workdir()) and every p from g.absolute_paths() to enforce
repository-boundary constraints.

In `@tests/rig/runtime.py`:
- Around line 264-268: The readiness probe currently skips when the requests
package is unavailable and treats 4xx responses as healthy; change it to
fail-closed by making the ImportError path error out (log.error and return/raise
a non-ready result) instead of returning early, and update the HTTP-status
handling so only 2xx responses are considered healthy (treat 3xx/4xx/5xx as
unhealthy). Specifically, modify the try/except that imports requests to report
failure and propagate non-ready status, and adjust the probe logic that
currently accepts 4xx responses so it only accepts status codes in the 200-299
range.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cacbb976-fbc3-43a1-8c48-556550c3014f

📥 Commits

Reviewing files that changed from the base of the PR and between 8bb49a5 and c492099.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/ci-test-e2e.yaml
  • .github/workflows/ci-test.yaml
  • pyproject.toml
  • tests/e2e/conftest.py
  • tests/e2e/smoke/test_smoke.py
  • tests/groups.yaml
  • tests/rig/cli.py
  • tests/rig/critical_paths.py
  • tests/rig/groups.py
  • tests/rig/runtime.py
  • tests/rig/tests/test_critical_paths.py
  • tests/rig/tests/test_groups.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/ci-test-e2e.yaml
  • pyproject.toml
  • tests/groups.yaml
  • .github/workflows/ci-test.yaml

Comment thread tests/rig/cli.py
Comment thread tests/groups.yaml
hari-kuriakose and others added 2 commits May 19, 2026 15:53
R1 (critical) — cmd_run returned early on BaselineCorruptError BEFORE
write_summary, so a green build with one corrupt baseline lost all per-group
visibility. Restructured so write_summary always runs; baseline-corrupt now
just flips overall_exit and falls through.

R2 (important) — TestcontainersRuntime.up() built InfraEndpoints AFTER the
cleanup try/except, so a __post_init__ failure leaked the four started
containers. Moved construction inside the try (and the return with it) so
the class invariant is self-cleaning at its construction site.

R3 (important) — added tests/rig/tests/test_cli.py covering three previously-
untested behaviors of cmd_run:
  - scope_groups is the dep-expanded selection (guards N2 wiring)
  - down() failure doesn't mask up() exception (guards N9 try/except)
  - write_summary runs even when load_baseline raises (guards R1)

R4 (important) — evaluate's groups_run_green and scope_groups annotations
loosened from set[str] to Collection[str]. Function only reads them
(`g in ...`), so accepting frozenset/list/tuple from callers is fine. Hot
path still pays the O(1) tax via internal set() conversion.

R5 (suggestion, deferred) — InfraEndpoints shape asymmetry (redis/rabbitmq
host+port pairs vs. postgres/minio collapsed URL/endpoint strings). Picking a
single shape would be cleaner but neither option has a clear win without a
real second consumer of the infra block; revisit when one appears.

R7 (suggestion) — fixed misleading comment: scope_groups includes dep-expanded
groups, not just "what the user asked for".

R8 (suggestion) — InfraEndpoints docstring now honestly describes which fields
the __post_init__ invariant applies to.

R9 (suggestion) — strengthened scope test with a "straddling" path
(covered_by=[unit, e2e]) so a weaker implementation checking against
groups_run_green instead of covered_by would be caught.

R10 (suggestion) — InfraEndpoints redis_port/rabbitmq_port: str → int. Cast at
the testcontainers boundary instead of pre-stringifying.

Verified:
- 34/34 self-tests pass (up from 31).
- `python -m tests.rig validate` → OK.
- `python -m tests.rig run --no-parallel unit-rig` → 34 pass; coverage emitted.
- `uvx ruff check --line-length 90 tests/` → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/rig/runtime.py (1)

288-293: 💤 Low value

Health check accepts 4xx responses as healthy.

resp.status_code < 500 treats 4xx (e.g., 404, 401) as ready. Consider tightening to resp.ok or 200 <= resp.status_code < 300 to catch endpoint misconfiguration early.

 def _responds(url: str, requests_mod) -> bool:
     try:
         resp = requests_mod.get(url, timeout=2)
-        return resp.status_code < 500
+        return resp.ok
     except requests_mod.RequestException:
         return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/rig/runtime.py` around lines 288 - 293, The health check in _responds
currently treats any status_code < 500 as healthy, which accepts 4xx responses;
update the function (inside _responds where requests_mod.get is called and
exceptions are caught) to return only true for successful 2xx responses—use
resp.ok or explicitly check 200 <= resp.status_code < 300—instead of
resp.status_code < 500, preserving the timeout and existing RequestException
handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/rig/runtime.py`:
- Around line 288-293: The health check in _responds currently treats any
status_code < 500 as healthy, which accepts 4xx responses; update the function
(inside _responds where requests_mod.get is called and exceptions are caught) to
return only true for successful 2xx responses—use resp.ok or explicitly check
200 <= resp.status_code < 300—instead of resp.status_code < 500, preserving the
timeout and existing RequestException handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9868dd1c-57b8-4c78-8ef6-d886af0692e6

📥 Commits

Reviewing files that changed from the base of the PR and between c492099 and b926307.

📒 Files selected for processing (5)
  • tests/rig/cli.py
  • tests/rig/critical_paths.py
  • tests/rig/runtime.py
  • tests/rig/tests/test_cli.py
  • tests/rig/tests/test_critical_paths.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/rig/critical_paths.py
  • tests/rig/tests/test_critical_paths.py

…rage

R4-A (important) — summary.md misrepresented regressions on corrupt baseline.
write_summary now accepts a baseline_corrupt flag; when set, summary.md (and
the sticky PR comment) emit a "Baseline cache was corrupt; regression
detection disabled this run" banner so reviewers don't read "gap" entries as
first-time gaps when they may actually be regressions hidden by the cache
failure. summary.json also exposes the flag.

R4-B (important) — added test_cmd_run_does_not_update_baseline_on_red_build:
proves the `--update-baseline` guard refuses to bake red-build state into
the cache. Without the guard, the next build wouldn't surface regressions.

R4-C (suggestion) — baseline corruption now flips overall_exit regardless of
whether the build was otherwise green. Previously red→corrupt was swallowed.

R4-D (suggestion) — added tests/rig/tests/test_runtime.py covering
InfraEndpoints partial-pair rejection (4 cases) and PlatformEndpoints.from_env
defaults. Closes the zero-coverage gap left by R5 deferral.

R4-E (suggestion) — monkeypatch rationale documented at top of test_cli.py
so the pattern's safety assumptions are explicit for future readers.

R4-F (suggestion) — simplified dead-code ternary in evaluate's
previously_covered initialization.

R4-G (suggestion) — scope_groups docstring "set" → "collection" to match
the R4 signature loosening.

R4-H (suggestion) — replaced internal review-tag references in test
docstrings ("guards N2/N9/R1") with invariant descriptions that stand on
their own post-merge.

R4-I (suggestion) — strengthened the corrupt-baseline test to assert the
summary.md content includes the banner, not just that the file exists; a
zero-byte file would no longer pass.

Verified:
- 40/40 self-tests pass (up from 34).
- `python -m tests.rig validate` → OK.
- End-to-end run with corrupt baseline emits the banner correctly.
- `uvx ruff check/format --line-length 90 tests/` → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/rig/cli.py`:
- Around line 236-242: The current except block can hide the original exception
from runtime.up() if runtime.down() also raises; modify the handler around
runtime.up()/runtime.down() so you capture the original exception (e.g., store
it as exc_up), then call runtime.down() inside its own try/except, and always
re-raise the original exc_up regardless of whether runtime.down() fails
(optionally log the down() error). Specifically update the try/except that
invokes runtime.up() and runtime.down() to preserve and re-raise the original
up() exception while isolating any errors from runtime.down().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31752086-90af-4796-aeab-f10b9e9a6e59

📥 Commits

Reviewing files that changed from the base of the PR and between b926307 and 4030c00.

📒 Files selected for processing (5)
  • tests/rig/cli.py
  • tests/rig/critical_paths.py
  • tests/rig/reporting.py
  • tests/rig/tests/test_cli.py
  • tests/rig/tests/test_runtime.py

Comment thread tests/rig/cli.py
Copy link
Copy Markdown
Contributor

@jaseemjaskp jaseemjaskp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review summary

Ran six review passes (Code Reviewer, Code Simplifier, Comment Analyzer, PR Test Analyzer, Silent Failure Hunter, Type Design Analyzer) against the rig. Existing coderabbitai/greptile threads cover most of the headline bugs; the inline comments below are net-new findings.

Top themes uncovered:

  1. Silent prep failures_prepare_group_env and synthetic-junit writes both fail-soft in ways that let pytest produce a misleading red/green that does not reflect the actual state of the build.
  2. Subprocess hygiene_spawn timeout does not kill the process group, and coverage invocations have no timeout. Both can hang or leak processes between groups.
  3. Type invariantsGroupDefinition.env is mutable inside a frozen dataclass; CriticalPathRegistry silently last-wins on duplicate ids; CriticalPathStatus allows contradictory state="covered" with empty covering_groups_run.
  4. Pre-commit hook regression — uses bare python despite combine-test-reports.sh being explicitly fixed for python3-only environments.
  5. Docstring drift — three docstrings claim guarantees the surrounding code does not actually provide (URL sentinel, x2text wiring, scope-aware re-aggregation in cmd_report).

Findings overlapping with existing bot threads (e.g. _responds() laxity, smoke < 500, previous-summary.json duplication, cmd_run complexity, JUnit defusedxml hardening) were deliberately dropped to avoid noise.

Low-impact items not posted inline:

  • runtime.py:67-73 x2text_url is declared on PlatformEndpoints but never exported in cli.py:486-490. Either delete the field or wire env.setdefault("UNSTRACT_X2TEXT_URL", ...).
  • tests/README.md:144-149 testcontainers row should warn that down() is best-effort and leaked containers can block the next run on port conflict.
  • tests/critical_paths.yaml:22-24 perpetual gap TODO has no tracking ticket; consider linking an issue.
  • tox.ini passenv = UNSTRACT_* glob is wide enough to leak adapter API keys into CI logs via compose _run (no capture_output=True).

No files were modified.

Comment thread tests/rig/cli.py
Comment thread tests/rig/cli.py Outdated
Comment thread tests/rig/cli.py Outdated
Comment thread tests/rig/cli.py
Comment thread tests/rig/cli.py
Comment thread tests/rig/critical_paths.py
Comment thread .pre-commit-config.yaml Outdated
Comment thread tests/rig/selection.py Outdated
Comment thread tests/rig/cli.py
Comment thread tests/rig/cli.py
hari-kuriakose and others added 2 commits May 21, 2026 02:40
When the rig runs under tox, tox exports VIRTUAL_ENV=.tox/<env>. The rig's
per-group `uv run` then warns "VIRTUAL_ENV=.tox/unit does not match the
project environment path .venv and will be ignored" before ignoring it
anyway — noise in CI logs, and a latent ambiguity since each group is meant
to resolve its own workdir's project venv, never tox's.

Strip VIRTUAL_ENV (and UV_PROJECT_ENVIRONMENT) from the environment passed to
every uv/pytest/coverage subprocess via a shared `_subprocess_env()` in cli.py
and `_clean_env()` in coverage.py. uv then does clean per-group project
discovery with no warning.

Verified: with VIRTUAL_ENV=.tox/unit set, a raw `uv run` emits the warning
while `python -m tests.rig run unit-rig` emits zero. 40/40 self-tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Critical:
- B1: XML-escape group name in synthetic JUnit. A group key with "/&/< would
  produce malformed XML that parse_junit reads as a phantom error on a green
  hurl run.
- B2: pre-commit hook uses `$(command -v python3 || command -v python)` so the
  opt-in test hook works on stock Ubuntu/containers without a `python` symlink
  (matches the combine-script fix).

Major:
- B3: fold junit-reported errors/failures into overall_exit. A truncated junit
  (errors=1) with pytest exit 0 rendered ❌ but exited green.
- B4: coverage subprocesses (combine/xml/html) now run with timeout=300. They
  run after the per-group loop so per-group timeouts don't apply; a slow
  `uv run --with` resolve could otherwise hang to the CI ceiling.
- B5: --changed-only detects HEAD == base (push-to-main) and falls back to
  HEAD~1...HEAD with a clear message instead of selecting nothing and exiting 2.
- dry-run no longer fails on critical-path gaps/regressions: no groups ran, so
  every covered path looks uncovered; a dry run is plan-only.
- Q1: duplicate critical-path ids now raise at load instead of silent last-wins.
- Q2: CriticalPathStatus enforces covered⇒non-empty / gap|regression⇒empty
  covering_groups_run, making the contradictory state unrepresentable.
- Q3: GroupDefinition.env coerced to MappingProxyType in __post_init__ so the
  frozen record can't be mutated through its dict.
- Q5: synthetic-junit write wrapped in try/except OSError, matching exit.txt.

Minor:
- Q4: reworded _rig_session_id docstring — the sentinel proves the rig ran;
  URL ownership is intentionally cooperative via setdefault.

Tests: 40 → 46. New coverage: synthetic-junit escaping + exit-5, duplicate-id
rejection, status-invariant enforcement, frozen env, and cmd_report
re-aggregation (previously zero coverage).

Deferred to follow-up (replied on threads): _spawn process-group kill on
timeout (heavy Popen refactor), cmd_run cognitive-complexity split.

Verified: 46/46 self-tests pass; validate OK; end-to-end emits coverage.xml +
htmlcov; ruff check/format clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Comment thread tests/rig/cli.py
Comment thread tests/rig/runtime.py
PlatformEndpoints collects x2text_url from UNSTRACT_X2TEXT_URL but the rig
never forwarded it to group subprocesses and never probed it for readiness:

- _execute_group now setdefaults UNSTRACT_X2TEXT_URL into the group env, so
  e2e tests hitting x2text use the rig's URL instead of a stale/default value.
- _wait_ready now includes x2text /health, so ComposeRuntime.up() waits for it
  rather than returning early and letting the first x2text call hit
  ConnectionRefusedError.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

@github-actions
Copy link
Copy Markdown
Contributor

Unstract test results

Per-group results

Status Group Tier Passed Failed Errors Skipped Duration (s)
unit-connectors unit 64 12 0 3 17.0
unit-core unit 0 0 2 0 1.2
unit-platform-service unit 9 0 1 0 1.3
unit-prompt-service unit 15 0 0 0 17.9
unit-rig unit 46 0 0 0 1.5
unit-runner unit 11 0 0 0 3.5
unit-sdk1 unit 334 0 0 0 18.2
unit-tool-registry unit 0 0 1 0 1.3
unit-workers unit 0 0 0 0 17.6
TOTAL 479 12 4 3 79.5

Critical paths

⚠️ Critical paths not yet covered

  • auth-login — User can log in and obtain a session cookie. (entry: POST /api/v1/auth/login; declared coverage: e2e-smoke)
  • adapter-register-llm — Register and validate an LLM adapter. (entry: POST /api/v1/adapter/; declared coverage: no groups declared)
  • workflow-create-execute — Create a workflow, configure source+destination, execute, poll, fetch result. (entry: POST /api/v1/workflow/{id}/execute/; declared coverage: e2e-workflow)
  • api-deployment-run — Deploy a workflow as an API, POST a document, receive structured JSON. (entry: POST /deployment/api/{org}/{name}/; declared coverage: e2e-api-deployment)
  • prompt-studio-fetch-response — Prompt Studio: create project, add prompt, run single-pass, get response. (entry: POST /api/v1/prompt-studio/prompt-studio-tool/{id}/fetch_response/; declared coverage: e2e-prompt-studio)
  • pipeline-etl-execute — Run an ETL pipeline from source connector to destination. (entry: POST /api/v1/pipeline/{id}/execute/; declared coverage: no groups declared)
  • usage-token-tracking — Per-execution token usage is recorded and retrievable. (entry: GET /api/v1/usage/get_token_usage/; declared coverage: no groups declared)
  • workflow-execution-fan-out — Multi-file workflow execution fans out to file-processing workers and rejoins. (entry: internal: backend → rabbitmq → workers/file_processing; declared coverage: no groups declared)
  • callback-result-delivery — Async results are posted back via the callback worker. (entry: internal: workers/callback → backend /internal endpoints; declared coverage: no groups declared)
✅ Covered critical paths
  • tool-sandbox-exec — covered by unit-runner

Comment thread tests/rig/cli.py
Comment on lines +345 to +347
scope_groups = set(ordered)

reports_dir: Path = args.reports_dir
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 scope_groups is set from ordered before optional-placeholder groups are filtered out into skipped. If a group was previously run green (and thus appears in the baseline), is now marked optional: true, and its paths have since been removed, it will remain in scope_groups but never enter green. On the next CI run that includes this group via dep-expansion, evaluate() sees: path.id in previously_covered AND in_scope AND not coveredstate = "regression". Because regressions set overall_exit = 1, --update-baseline is permanently blocked (its guard is overall_exit == 0), creating a stuck state that can only be resolved by manually deleting the cache. Filtering scope_groups to exclude the groups the rig actually intends to run fixes this.

Suggested change
scope_groups = set(ordered)
reports_dir: Path = args.reports_dir
# scope_groups must exclude optional placeholders that were skipped:
# if a skipped group was previously in the baseline, keeping it in scope
# would classify its covered paths as regressions even though the group
# never ran, blocking --update-baseline and creating a stuck state.
runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))}
scope_groups = runnable_pre
reports_dir: Path = args.reports_dir
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/rig/cli.py
Line: 345-347

Comment:
`scope_groups` is set from `ordered` **before** optional-placeholder groups are filtered out into `skipped`. If a group was previously run green (and thus appears in the baseline), is now marked `optional: true`, and its paths have since been removed, it will remain in `scope_groups` but never enter `green`. On the next CI run that includes this group via dep-expansion, `evaluate()` sees: `path.id in previously_covered AND in_scope AND not covered``state = "regression"`. Because regressions set `overall_exit = 1`, `--update-baseline` is permanently blocked (its guard is `overall_exit == 0`), creating a stuck state that can only be resolved by manually deleting the cache. Filtering `scope_groups` to exclude the groups the rig actually intends to run fixes this.

```suggestion
    # scope_groups must exclude optional placeholders that were skipped:
    # if a skipped group was previously in the baseline, keeping it in scope
    # would classify its covered paths as regressions even though the group
    # never ran, blocking --update-baseline and creating a stuck state.
    runnable_pre = {n for n in ordered if not _is_missing_placeholder(manifest.get(n))}
    scope_groups = runnable_pre

    reports_dir: Path = args.reports_dir
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants