Skip to content

ayyadam/agentic-testing-framework

Repository files navigation

🤖 Agentic Testing Framework

Pytest-Allure Playwright Self-Healing

The Agentic Testing Framework replaces deterministic, fragile test scripts with natural-language goals.

Instead of writing "click button X, type into field Y," you provide a high-level intent like "Navigate to the dashboard and ensure the latest invoice is paid." The agent autonomously reasons over your Page Object Model (POM), explores the UI, and automatically repairs its own source code if a selector breaks.


✨ Key Features

  • Goal-Driven Execution: Give the agent a target; it figures out the path.
  • Guarded Self-Healing: When a locator drifts, the agent completes the test via visual fallbacks, then proposes a replacement selector — which is validated against the page DOM and only written to disk if it resolves to exactly one element. The run that triggered the heal fails by design (two-pass model); the patched POM is reviewed and committed before the next run turns green. Every attempt — applied or rejected — is recorded in a maintenance ledger.
  • Native Vision Fallback: No reliance on heavy external libraries. The agent uses a custom-built DOM skeletonizer and visual analysis engine to "see" the page when POM selectors fail.
  • Auto-Scaffolding (Architect): Point architect_cli.py at a requirements document — local file or ALM URL (Azure DevOps / Jira / Confluence REST API) — and it generates skeleton Page Objects, a per-app conftest, and one Pytest function per acceptance criterion. Bearer / Basic auth supported via .env for protected ALM endpoints.
  • Database Assertions: A built-in DatabaseToolkit lets the agent verify backend state directly via SQL — no UI scraping needed. Supports SQLite, PostgreSQL, and MSSQL via a single connection string.
  • Forensic Observability: Every step is logged in structured JSON, every failure captures a Playwright trace, and every logic jump is tracked in Allure report steps.
  • Model Agnostic: Run on Gemini, Claude, OpenAI, or locally via Ollama.

🚀 Quick Start

1. Prerequisites

  • Python 3.12+
  • An LLM API Key (Gemini is currently set as the primary model)

2. Installation

# Clone the repository
git clone https://github.com/ayyadam/agentic-testing-framework.git
cd agentic-testing-framework

# Setup environment
python -m venv .venv
source .venv/bin/activate  # Or .venv\Scripts\activate on Windows
pip install -r requirements.txt

# Install Playwright browser
playwright install chromium

3. Configuration

  • Create a .env file in the root directory
  • Add your LLM API key to the .env file
  • Set the PRIMARY_MODEL environment variable to your desired model
  • Optionally set DATABASE_URL to point at a Postgres or MSSQL instance (defaults to local SQLite)

🛠️ Usage

Run via CLI

Ideal for quick logic checks or ad-hoc tasks:

python main.py "Log in as a member and verify the member dashboard loads"

Run via Pytest

The framework is designed to fit into standard Pytest pipelines:

# Run all tests
pytest

# Run specific tests
pytest -m agentic

# Run with Allure reporting
pytest --alluredir=results/allure-results
allure serve results/allure-results

🗄️ Database Assertions

The agent can verify backend state directly using two built-in SQL tools:

Tool When to use
db.verify_record(table, criteria) Semantic check — "did this record get created?"
db.query(sql) Read-only SELECT / WITH (write statements are refused)

Identifiers are whitelist-validated and write-statements are rejected at the tool boundary, so the toolkit is safe to expose to an LLM-driven agent.

Quick start

  1. Pass a DatabaseToolkit to build_agent_graph:
from tools.database_tools import DatabaseToolkit
from config import get_settings

db = DatabaseToolkit(get_settings().database_url)
async with await build_agent_graph(tools, settings, page=page, db_toolkit=db) as graph:
    ...
  1. Drive hybrid goals:
python main.py "Register a new user via the UI, then verify the record exists in the users table."

For a different database, set DATABASE_URL in .env:

DATABASE_URL=postgresql+asyncpg://user:pass@localhost/mydb
DATABASE_URL=mssql+aioodbc://user:pass@host/db?driver=ODBC+Driver+18+for+SQL+Server

Unit tests

tests/test_database_tools.py exercises the toolkit end-to-end against an in-memory SQLite database — no physical DB file or external server required.


🔌 API Testing

Same goal-driven loop, HTTP transport. Service Objects parallel Page Objects, an ApiToolkit parallels DatabaseToolkit, and pluggable auth strategies cover Basic, Bearer, ApiKey, OAuth2 client_credentials, and a custom ClientWamId encryption-handshake flow. The heal node is intentionally bypassed in API mode — selector-repair logic is UI-specific and has no analogue for HTTP failures.

Quick start

# CLI run against an HTTP API (no browser launched)
python main.py --mode api "Create an item via POST /items with name=widget, qty=3, then verify with GET /items"

In a Pytest test:

import httpx
from agents.executor_agent import build_agent_graph
from services.sample_api.items_service import ItemsService
from tools.api_tools import ApiToolkit
from tools.tool_registry import ToolRegistry
from config import get_settings
from tests._helpers import assert_api_clean_pass

async def test_create_item():
    settings = get_settings()
    async with httpx.AsyncClient(timeout=settings.api_request_timeout_s) as client:
        registry = ToolRegistry()
        registry.register_service(ItemsService(client, settings))
        tools = registry.get_all_tools()
        api_toolkit = ApiToolkit.from_settings(client, settings)
        async with await build_agent_graph(
            tools, settings, api_toolkit=api_toolkit, healing_enabled=False,
        ) as graph:
            ...
        await api_toolkit.close()

Auth methods

Set api_auth_method in .env to one of:

Method Required settings
none (default — unauthenticated)
basic api_auth_basic_username, api_auth_basic_password
bearer api_auth_bearer_token
apikey api_auth_apikey_value (header name defaults to X-API-Key, override via api_auth_apikey_header)
oauth2_client_credentials api_auth_oauth2_token_url, api_auth_oauth2_client_id, api_auth_oauth2_client_secret, optional api_auth_oauth2_scope
clientwamid api_auth_clientwamid_endpoint, api_auth_clientwamid_public_key_url, api_auth_clientwamid_guid

OAuth2 and ClientWamId both refresh credentials on a 401 response automatically. Cached tokens / encrypted values live in memory only; auth headers are redacted from toolkit logs.

Architect — --mode api

python architect_cli.py --app demo_api --mode api --source <ALM URL> --dry-run

Generates Service Objects under services/<app>/, a per-app conftest, and one Pytest function per acceptance criterion. The pre-fetch, weak-model nudge, AC coverage, and audit log are all reused unchanged from UI mode.

Sample app

apps/sample_api/ ships a self-contained aiohttp server (POST /items, GET /items/{id}, GET /items, plus a bearer-protected /protected endpoint) so the API stack has something to drive end-to-end without depending on a real target application. Treat it as a temporary demo — remove apps/sample_api/, services/sample_api/, and tests/sample_api/ once a real target API is integrated.


🏛️ Architect — Auto-Scaffold from Requirements

The Architect reads a requirements document (user story, acceptance criteria) and generates skeleton Page Objects + Pytest tests + a per-app conftest for a new application. It runs in its own graph with its own model chain, so scaffolding work never competes with the executor's quota or state.

Quick start

# Local file
python architect_cli.py --app hr_portal --source requirements.md --dry-run

# Azure DevOps work item (REST API — NOT the browser URL)
python architect_cli.py --app hr_portal \
  --source "https://dev.azure.com/<org>/<project>/_apis/wit/workitems/42?fields=System.Title,System.Description,Microsoft.VSTS.Common.AcceptanceCriteria&api-version=7.0" \
  --dry-run

Use the REST API URL, not the browser URL. ALM web interfaces (Azure DevOps, Jira, Confluence) are single-page apps that load content dynamically via JavaScript — a plain HTTP fetch of the browser URL returns an HTML shell with no requirements text. The Architect logs a warning (read_source_looks_like_html) when it detects this common mistake.

CLI options

Flag Purpose
--app <name> App namespace (lowercase snake_case). Files land under pages/<app>/ and tests/<app>/.
--source <url-or-path> URL or local file path to the requirements document.
--force Overwrite existing files (refuses by default — safe on existing apps).
--dry-run Print intended writes to stdout without touching disk.
--recursion-limit N Max tool-call loop iterations (default 12).

Configuration (.env)

Variable Purpose Default
ARCHITECT_MODEL Primary model. Dedicated chain separate from executor. ollama:qwen2.5:32b-instruct-q4_K_M
ARCHITECT_FALLBACK_MODEL Falls back if the primary times out / rate-limits. google_genai:gemini-2.5-flash-lite
ARCHITECT_HTTP_AUTH_BEARER Bearer token for protected sources (e.g. Jira PAT). (unset)
ARCHITECT_HTTP_AUTH_BASIC Basic auth user:pass. For Azure DevOps: :<your_PAT> (empty user, PAT as password). (unset)
OLLAMA_NUM_CTX Context window for any Ollama-hosted model. Raises Ollama's 2048-token default so long prompts don't silently truncate. 8192
LLM_CALL_TIMEOUT_S Per-call timeout (s). Kind to large local models generating long tool-call arguments. 180
ARCHITECT_LOG_FILE Audit log destination — truncated at the start of each run. Lives outside results/logs/ so pytest's sweep never wipes it. results/architect/architect.log

What gets generated

The Architect runs as a multi-agent split topology — a POM/Service-Object agent writes the scaffolding artifacts, the orchestrator writes the conftest deterministically, then a per-AC test-writer agent is invoked once per acceptance criterion in a Python loop. Each LLM session has a narrow, focused job — addresses the test-iteration collapse local 27–32B models exhibit on 5+ AC sources when one agent has to hold the whole job in working memory.

For a work item with N distinct pages/endpoints and M acceptance criteria, the Architect produces:

  • N Page Object files (--mode ui) under pages/<app>/<page>_page.py — inherit from BasePage, implement the three required abstract members (url, navigate, is_loaded), plus interactive @tool_method stubs using Playwright-only API. Real selectors are left as "#TODO" placeholders for a human to fill in.
  • N Service Object files (--mode api) under services/<app>/<endpoint>_service.py — inherit from BaseService, implement base_url from settings.api_base_url, and expose one @tool_method async def per endpoint. Negative-auth cases are parameterised via a with_auth: bool = True argument so a single method covers happy-path and 401 scenarios.
  • One conftest under tests/<app>/conftest.py containing an <app>_initial_state(task) factory that builds the full AgentState dict. Written deterministically by the orchestrator (not the LLM) — fully templated, only the function name varies by app. Preserves an existing user-customised conftest unless --force is passed.
  • M test files under tests/<app>/test_*.py — one per acceptance criterion (happy path and each error/negative path get separate tests), each importing the factory and using the canonical @pytest.mark.agentic + @allure.feature + @allure.story shape. Each test is written by an independent test-writer agent invocation with a focused prompt: just one AC text plus the AST-extracted briefing of the POM/Service classes the previous phase wrote.
  • Auto-created __init__.py files for any new intermediate directories, so the generated code is importable on first pytest run.

Guardrails

The write_code_file tool enforces:

  • Path traversal — all writes must resolve inside pages/, services/, or tests/. pages/../../etc/passwd-style attempts are refused.
  • No overwrite — existing files are refused unless --force. Safe to run on established apps without fear of clobbering hand-written code.
  • Idempotent re-writes — re-emitting an identical (path, content) pair within a single run returns a NO-OP result with a binary instruction to either advance or summarise. Catches the duplicate-write loops that local models occasionally fall into on retry, without consuming an extra tool call slot.
  • Syntax validationast.parse runs before the file is written; SyntaxError is rejected with a targeted hint covering the common LLM failure modes (\n escape sequences, markdown fences).
  • Style normalisation@ decorator (with stray space) is silently rewritten to @decorator at write time, working around a persistent LLM quirk.

Cross-model robustness

The CLI pre-fetches the source URL itself before invoking the agent graph and seeds the conversation with a synthesised read_source tool call + result pair. The LLM never has to copy the URL into a tool argument, so model-specific tokenisation quirks (e.g. gemini-2.5-flash-lite encoding spaces and CamelCase identifiers during JSON serialisation) cannot mangle ALM URLs. A bad source aborts the run with architect_run_aborted_source_unreadable before any token is spent on the graph.

For weaker tool-calling models that terminate after a single round-trip (currently any model id containing flash-lite), the CLI also appends a follow-up HumanMessage that re-issues the write instructions. Stronger models (qwen2.5:32b, gemini-2.5-flash, claude, etc.) skip the nudge so the ~150-token cost is only paid where it's needed.

Operator diagnostics

Every Architect run writes its full structured trace to results/architect/architect.log (location overridable via ARCHITECT_LOG_FILE). Each run truncates the file at start, so it always reflects the latest scaffold. The log lives outside results/logs/ so pytest's session-start sweep never wipes it.

The CLI also emits two best-effort coverage warnings — they never block the run, but they make it obvious where the model dropped work:

Event Triggered when
architect_split_pom_phase_start / _done Phase 1 (POM/Service-Object writing) start and completion. _done records the tool-message count and the list of write results.
architect_split_conftest_auto_written The orchestrator-written conftest factory landed (or was preserved as a PRESERVED result if a customised file already exists).
architect_split_briefing_built The AST-extracted class/method briefing handed to each per-AC test-writer invocation — logs the discovered class names and total @tool_method count.
architect_split_test_phase_start / _done One pair per acceptance criterion, scoping the per-AC test writer's tool messages and write results.
architect_coverage_summary / architect_coverage_gap Source contains more numbered acceptance criteria than the architect produced test files for. The summary lists the detected ACs so reviewers can identify which were dropped.
architect_module_summary / architect_module_gap A generated test imports a POM or Service Object module the architect never wrote (those tests would fail at pytest collection time). The event carries a kind field (pom or service) so UI and API runs are distinguishable; the warning lists the missing modules and the offending imports.

Quality gates

Every Architect run finishes with an advisory static-analysis pass over the files it wrote (and any files it preserved on disk because they already existed). Findings print to stdout under the file tree, and the structured events flow into architect.log alongside the rest of the run.

The reviewer never fails the run — it's informational. To re-run it independently against an existing app:

python -m tools.architect_reviewer --app adams_golf_club
# Optional: source-coupled gates (e.g. every-AC-has-a-test) need a source
python -m tools.architect_reviewer --app adams_golf_club --source <ALM URL>
# JSON output for tooling
python -m tools.architect_reviewer --app adams_golf_club --format json

Exit codes follow the standard 0/1/2 convention (0 = clean, 1 = at least one ERROR, 2 = usage error). Logs land at results/architect/reviewer.log — same directory as architect.log.

A small set of gates can be globally silenced via settings when a codebase legitimately diverges from the framework convention:

# .env
ARCHITECT_REVIEWER_DISABLED_GATES=["pom_has_class_docstring","pom_url_returns_full_url"]

Gate IDs are stable: once published, an ID always refers to the same check. Deleted gates leave a reserved slot rather than being reused, so disable-list entries don't silently target a different check after an upgrade.

Running the Architect on an existing app

The tool is designed as a greenfield scaffold. On an app that already has pages/tests, write_code_file refuses to overwrite — you'll see already exists errors in the log. The Architect currently has no file-listing tool, so it can't reason about what's already there and will still try to generate a full set of pages/tests, which may duplicate ACs already covered elsewhere. Treat the output as a starting point and prune duplicates manually.


🏗️ Adding Your Application

Adding a new application requires zero changes to the framework core. Each app gets its own subpackage under pages/ and tests/:

pages/hr_portal/dashboard_page.py
tests/hr_portal/test_search.py

The app namespace (hr_portal) is auto-derived from the module path and used to prefix tool names (hr_portal.DashboardPage.find_employee), so apps never collide.

💡 Skip the boilerplate: For a greenfield app with requirements already written down, the Architect generates the subpackage, Page Objects, conftest, and tests automatically. The walkthrough below covers the underlying structure so you can extend Architect-scaffolded code or hand-write from scratch.

  1. Create an app subpackage under pages/ with an __init__.py.
  2. Create a Page Object: Extend BasePage and decorate your interactions with @tool_method.
  3. Define a Test in a matching tests/<app>/ subpackage: register your page, build the graph, and invoke it with a goal.
# pages/hr_portal/dashboard_page.py
from pages.base_page import BasePage, tool_method

class DashboardPage(BasePage):
    @tool_method
    async def find_employee(self, name: str) -> str:
        """Search for an employee in the main directory."""
        await self._page.fill("#search", name)
        await self._page.click("#submit")
        return f"Search triggered for {name}"

📖 Detailed Guide: For a step-by-step technical walkthrough on onboarding, see the New App Recipe.


🧠 Architecture & Deep Dive

This framework is built on LangGraph, a state-machine based approach to AI orchestration.

Layer Component
Tools Your Page Objects, native Vision Fallbacks, and the Database Toolkit
Logic Plan, Verify, and Diagnose nodes in `agents/nodes.py`
Healing Validation-gated selector repair in `agents/heal_node.py` — DOM-verified patches, dated source annotations, and an append-only `results/heal_log.jsonl` ledger
Assertions `tests/_helpers.py::assert_clean_pass` — the two-pass gate that fails any run which fired the heal node

For a full technical breakdown of the 4 core nodes, state management, and the self-healing plumbing, please refer to the: 👉 Architecture & User Guide


📊 Reporting & Debugging

  • Execution Log: `results/logs/execution.log` (JSONL forensic audit)
  • Architect Log: `results/architect/architect.log` (per-run JSONL audit of the most recent scaffold — truncated at each run, isolated from pytest's `results/logs/` sweep)
  • Heal Ledger: `results/heal_log.jsonl` (append-only maintenance record — one line per heal attempt with a stable status enum: `applied`, `rejected_validation`, `rejected_patch`, `rejected_llm`, or `skipped_blank_page`)
  • Allure Dashboard: `results/allure-results` (View via `allure serve results/allure-results`)
  • Playwright Traces: `results/traces/` (View via `playwright show-trace`)
  • Checkpoints: `results/checkpoints/agent.db` (State snapshots)

🔬 Sample Application

One sample app ships with the framework to exercise the full UI + DB + self-healing surface end-to-end:

App Location What it demonstrates
`adams_golf_club` pages/adams_golf_club/, tests/adams_golf_club/ Local Flask app with a real backend DB — exercises UI + `db.verify_record` hybrid goals, member/admin role credentials, and the full self-healing path against controlled selector drift.

About

Goal-driven test automation that replaces brittle scripts with natural-language goals. LangGraph + LLM + Playwright, with self-healing selectors and auto-scaffolding from ALM requirement documents.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages