🤖 Agentic Testing Framework

The Agentic Testing Framework replaces deterministic, fragile test scripts with natural-language goals.

Instead of writing "click button X, type into field Y," you provide a high-level intent like "Navigate to the dashboard and ensure the latest invoice is paid." The agent autonomously reasons over your Page Object Model (POM), explores the UI, and automatically repairs its own source code if a selector breaks.

✨ Key Features

Goal-Driven Execution: Give the agent a target; it figures out the path.
Guarded Self-Healing: When a locator drifts, the agent completes the test via visual fallbacks, then proposes a replacement selector — which is validated against the page DOM and only written to disk if it resolves to exactly one element. The run that triggered the heal fails by design (two-pass model); the patched POM is reviewed and committed before the next run turns green. Every attempt — applied or rejected — is recorded in a maintenance ledger.
Native Vision Fallback: No reliance on heavy external libraries. The agent uses a custom-built DOM skeletonizer and visual analysis engine to "see" the page when POM selectors fail.
Auto-Scaffolding (Architect): Point architect_cli.py at a requirements document — local file or ALM URL (Azure DevOps / Jira / Confluence REST API) — and it generates skeleton Page Objects, a per-app conftest, and one Pytest function per acceptance criterion. Bearer / Basic auth supported via .env for protected ALM endpoints.
Database Assertions: A built-in DatabaseToolkit lets the agent verify backend state directly via SQL — no UI scraping needed. Supports SQLite, PostgreSQL, and MSSQL via a single connection string.
Forensic Observability: Every step is logged in structured JSON, every failure captures a Playwright trace, and every logic jump is tracked in Allure report steps.
Model Agnostic: Run on Gemini, Claude, OpenAI, or locally via Ollama.

🚀 Quick Start

1. Prerequisites

Python 3.12+
An LLM API Key (Gemini is currently set as the primary model)

2. Installation

# Clone the repository
git clone https://github.com/ayyadam/agentic-testing-framework.git
cd agentic-testing-framework

# Setup environment
python -m venv .venv
source .venv/bin/activate  # Or .venv\Scripts\activate on Windows
pip install -r requirements.txt

# Install Playwright browser
playwright install chromium

3. Configuration

Create a .env file in the root directory
Add your LLM API key to the .env file
Set the PRIMARY_MODEL environment variable to your desired model
Optionally set DATABASE_URL to point at a Postgres or MSSQL instance (defaults to local SQLite)

🛠️ Usage

Run via CLI

Ideal for quick logic checks or ad-hoc tasks:

python main.py "Log in as a member and verify the member dashboard loads"

Run via Pytest

The framework is designed to fit into standard Pytest pipelines:

# Run all tests
pytest

# Run specific tests
pytest -m agentic

# Run with Allure reporting
pytest --alluredir=results/allure-results
allure serve results/allure-results

🗄️ Database Assertions

The agent can verify backend state directly using two built-in SQL tools:

Tool	When to use
`db.verify_record(table, criteria)`	Semantic check — "did this record get created?"
`db.query(sql)`	Read-only SELECT / WITH (write statements are refused)

Identifiers are whitelist-validated and write-statements are rejected at the tool boundary, so the toolkit is safe to expose to an LLM-driven agent.

Quick start

Pass a DatabaseToolkit to build_agent_graph:

from tools.database_tools import DatabaseToolkit
from config import get_settings

db = DatabaseToolkit(get_settings().database_url)
async with await build_agent_graph(tools, settings, page=page, db_toolkit=db) as graph:
    ...

Drive hybrid goals:

python main.py "Register a new user via the UI, then verify the record exists in the users table."

For a different database, set DATABASE_URL in .env:

DATABASE_URL=postgresql+asyncpg://user:pass@localhost/mydb
DATABASE_URL=mssql+aioodbc://user:pass@host/db?driver=ODBC+Driver+18+for+SQL+Server

Unit tests

tests/test_database_tools.py exercises the toolkit end-to-end against an in-memory SQLite database — no physical DB file or external server required.

🔌 API Testing

Same goal-driven loop, HTTP transport. Service Objects parallel Page Objects, an ApiToolkit parallels DatabaseToolkit, and pluggable auth strategies cover Basic, Bearer, ApiKey, OAuth2 client_credentials, and a custom ClientWamId encryption-handshake flow. The heal node is intentionally bypassed in API mode — selector-repair logic is UI-specific and has no analogue for HTTP failures.

Quick start

# CLI run against an HTTP API (no browser launched)
python main.py --mode api "Create an item via POST /items with name=widget, qty=3, then verify with GET /items"

In a Pytest test:

import httpx
from agents.executor_agent import build_agent_graph
from services.sample_api.items_service import ItemsService
from tools.api_tools import ApiToolkit
from tools.tool_registry import ToolRegistry
from config import get_settings
from tests._helpers import assert_api_clean_pass

async def test_create_item():
    settings = get_settings()
    async with httpx.AsyncClient(timeout=settings.api_request_timeout_s) as client:
        registry = ToolRegistry()
        registry.register_service(ItemsService(client, settings))
        tools = registry.get_all_tools()
        api_toolkit = ApiToolkit.from_settings(client, settings)
        async with await build_agent_graph(
            tools, settings, api_toolkit=api_toolkit, healing_enabled=False,
        ) as graph:
            ...
        await api_toolkit.close()

Auth methods

Set api_auth_method in .env to one of:

Method	Required settings
`none`	(default — unauthenticated)
`basic`	`api_auth_basic_username`, `api_auth_basic_password`
`bearer`	`api_auth_bearer_token`
`apikey`	`api_auth_apikey_value` (header name defaults to `X-API-Key`, override via `api_auth_apikey_header`)
`oauth2_client_credentials`	`api_auth_oauth2_token_url`, `api_auth_oauth2_client_id`, `api_auth_oauth2_client_secret`, optional `api_auth_oauth2_scope`
`clientwamid`	`api_auth_clientwamid_endpoint`, `api_auth_clientwamid_public_key_url`, `api_auth_clientwamid_guid`

OAuth2 and ClientWamId both refresh credentials on a 401 response automatically. Cached tokens / encrypted values live in memory only; auth headers are redacted from toolkit logs.

Architect — `--mode api`

python architect_cli.py --app demo_api --mode api --source <ALM URL> --dry-run

Generates Service Objects under services/<app>/, a per-app conftest, and one Pytest function per acceptance criterion. The pre-fetch, weak-model nudge, AC coverage, and audit log are all reused unchanged from UI mode.

Sample app

apps/sample_api/ ships a self-contained aiohttp server (POST /items, GET /items/{id}, GET /items, plus a bearer-protected /protected endpoint) so the API stack has something to drive end-to-end without depending on a real target application. Treat it as a temporary demo — remove apps/sample_api/, services/sample_api/, and tests/sample_api/ once a real target API is integrated.

🏛️ Architect — Auto-Scaffold from Requirements

The Architect reads a requirements document (user story, acceptance criteria) and generates skeleton Page Objects + Pytest tests + a per-app conftest for a new application. It runs in its own graph with its own model chain, so scaffolding work never competes with the executor's quota or state.

Quick start

# Local file
python architect_cli.py --app hr_portal --source requirements.md --dry-run

# Azure DevOps work item (REST API — NOT the browser URL)
python architect_cli.py --app hr_portal \
  --source "https://dev.azure.com/<org>/<project>/_apis/wit/workitems/42?fields=System.Title,System.Description,Microsoft.VSTS.Common.AcceptanceCriteria&api-version=7.0" \
  --dry-run

Use the REST API URL, not the browser URL. ALM web interfaces (Azure DevOps, Jira, Confluence) are single-page apps that load content dynamically via JavaScript — a plain HTTP fetch of the browser URL returns an HTML shell with no requirements text. The Architect logs a warning (read_source_looks_like_html) when it detects this common mistake.

CLI options

Flag	Purpose
`--app <name>`	App namespace (lowercase snake_case). Files land under `pages/<app>/` and `tests/<app>/`.
`--source <url-or-path>`	URL or local file path to the requirements document.
`--force`	Overwrite existing files (refuses by default — safe on existing apps).
`--dry-run`	Print intended writes to stdout without touching disk.
`--recursion-limit N`	Max tool-call loop iterations (default 12).

Configuration (`.env`)

Variable	Purpose	Default
`ARCHITECT_MODEL`	Primary model. Dedicated chain separate from executor.	`ollama:qwen2.5:32b-instruct-q4_K_M`
`ARCHITECT_FALLBACK_MODEL`	Falls back if the primary times out / rate-limits.	`google_genai:gemini-2.5-flash-lite`
`ARCHITECT_HTTP_AUTH_BEARER`	Bearer token for protected sources (e.g. Jira PAT).	(unset)
`ARCHITECT_HTTP_AUTH_BASIC`	Basic auth `user:pass`. For Azure DevOps: `:<your_PAT>` (empty user, PAT as password).	(unset)
`OLLAMA_NUM_CTX`	Context window for any Ollama-hosted model. Raises Ollama's 2048-token default so long prompts don't silently truncate.	`8192`
`LLM_CALL_TIMEOUT_S`	Per-call timeout (s). Kind to large local models generating long tool-call arguments.	`180`
`ARCHITECT_LOG_FILE`	Audit log destination — truncated at the start of each run. Lives outside `results/logs/` so pytest's sweep never wipes it.	`results/architect/architect.log`

What gets generated

The Architect runs as a multi-agent split topology — a POM/Service-Object agent writes the scaffolding artifacts, the orchestrator writes the conftest deterministically, then a per-AC test-writer agent is invoked once per acceptance criterion in a Python loop. Each LLM session has a narrow, focused job — addresses the test-iteration collapse local 27–32B models exhibit on 5+ AC sources when one agent has to hold the whole job in working memory.

For a work item with N distinct pages/endpoints and M acceptance criteria, the Architect produces:

N Page Object files (--mode ui) under pages/<app>/<page>_page.py — inherit from BasePage, implement the three required abstract members (url, navigate, is_loaded), plus interactive @tool_method stubs using Playwright-only API. Real selectors are left as "#TODO" placeholders for a human to fill in.
N Service Object files (--mode api) under services/<app>/<endpoint>_service.py — inherit from BaseService, implement base_url from settings.api_base_url, and expose one @tool_method async def per endpoint. Negative-auth cases are parameterised via a with_auth: bool = True argument so a single method covers happy-path and 401 scenarios.
One conftest under tests/<app>/conftest.py containing an <app>_initial_state(task) factory that builds the full AgentState dict. Written deterministically by the orchestrator (not the LLM) — fully templated, only the function name varies by app. Preserves an existing user-customised conftest unless --force is passed.
M test files under tests/<app>/test_*.py — one per acceptance criterion (happy path and each error/negative path get separate tests), each importing the factory and using the canonical @pytest.mark.agentic + @allure.feature + @allure.story shape. Each test is written by an independent test-writer agent invocation with a focused prompt: just one AC text plus the AST-extracted briefing of the POM/Service classes the previous phase wrote.
Auto-created __init__.py files for any new intermediate directories, so the generated code is importable on first pytest run.

Guardrails

The write_code_file tool enforces:

Path traversal — all writes must resolve inside pages/, services/, or tests/. pages/../../etc/passwd-style attempts are refused.
No overwrite — existing files are refused unless --force. Safe to run on established apps without fear of clobbering hand-written code.
Idempotent re-writes — re-emitting an identical (path, content) pair within a single run returns a NO-OP result with a binary instruction to either advance or summarise. Catches the duplicate-write loops that local models occasionally fall into on retry, without consuming an extra tool call slot.
Syntax validation — ast.parse runs before the file is written; SyntaxError is rejected with a targeted hint covering the common LLM failure modes (\n escape sequences, markdown fences).
Style normalisation — @ decorator (with stray space) is silently rewritten to @decorator at write time, working around a persistent LLM quirk.

Cross-model robustness

The CLI pre-fetches the source URL itself before invoking the agent graph and seeds the conversation with a synthesised read_source tool call + result pair. The LLM never has to copy the URL into a tool argument, so model-specific tokenisation quirks (e.g. gemini-2.5-flash-lite encoding spaces and CamelCase identifiers during JSON serialisation) cannot mangle ALM URLs. A bad source aborts the run with architect_run_aborted_source_unreadable before any token is spent on the graph.

For weaker tool-calling models that terminate after a single round-trip (currently any model id containing flash-lite), the CLI also appends a follow-up HumanMessage that re-issues the write instructions. Stronger models (qwen2.5:32b, gemini-2.5-flash, claude, etc.) skip the nudge so the ~150-token cost is only paid where it's needed.

Operator diagnostics

Every Architect run writes its full structured trace to results/architect/architect.log (location overridable via ARCHITECT_LOG_FILE). Each run truncates the file at start, so it always reflects the latest scaffold. The log lives outside results/logs/ so pytest's session-start sweep never wipes it.

The CLI also emits two best-effort coverage warnings — they never block the run, but they make it obvious where the model dropped work:

Event	Triggered when
`architect_split_pom_phase_start` / `_done`	Phase 1 (POM/Service-Object writing) start and completion. `_done` records the tool-message count and the list of write results.
`architect_split_conftest_auto_written`	The orchestrator-written conftest factory landed (or was preserved as a `PRESERVED` result if a customised file already exists).
`architect_split_briefing_built`	The AST-extracted class/method briefing handed to each per-AC test-writer invocation — logs the discovered class names and total `@tool_method` count.
`architect_split_test_phase_start` / `_done`	One pair per acceptance criterion, scoping the per-AC test writer's tool messages and write results.
`architect_coverage_summary` / `architect_coverage_gap`	Source contains more numbered acceptance criteria than the architect produced test files for. The summary lists the detected ACs so reviewers can identify which were dropped.
`architect_module_summary` / `architect_module_gap`	A generated test imports a POM or Service Object module the architect never wrote (those tests would fail at pytest collection time). The event carries a `kind` field (`pom` or `service`) so UI and API runs are distinguishable; the warning lists the missing modules and the offending imports.

Quality gates

Every Architect run finishes with an advisory static-analysis pass over the files it wrote (and any files it preserved on disk because they already existed). Findings print to stdout under the file tree, and the structured events flow into architect.log alongside the rest of the run.

The reviewer never fails the run — it's informational. To re-run it independently against an existing app:

python -m tools.architect_reviewer --app adams_golf_club
# Optional: source-coupled gates (e.g. every-AC-has-a-test) need a source
python -m tools.architect_reviewer --app adams_golf_club --source <ALM URL>
# JSON output for tooling
python -m tools.architect_reviewer --app adams_golf_club --format json

Exit codes follow the standard 0/1/2 convention (0 = clean, 1 = at least one ERROR, 2 = usage error). Logs land at results/architect/reviewer.log — same directory as architect.log.

A small set of gates can be globally silenced via settings when a codebase legitimately diverges from the framework convention:

# .env
ARCHITECT_REVIEWER_DISABLED_GATES=["pom_has_class_docstring","pom_url_returns_full_url"]

Gate IDs are stable: once published, an ID always refers to the same check. Deleted gates leave a reserved slot rather than being reused, so disable-list entries don't silently target a different check after an upgrade.

Running the Architect on an existing app

The tool is designed as a greenfield scaffold. On an app that already has pages/tests, write_code_file refuses to overwrite — you'll see already exists errors in the log. The Architect currently has no file-listing tool, so it can't reason about what's already there and will still try to generate a full set of pages/tests, which may duplicate ACs already covered elsewhere. Treat the output as a starting point and prune duplicates manually.

🏗️ Adding Your Application

Adding a new application requires zero changes to the framework core. Each app gets its own subpackage under pages/ and tests/:

pages/hr_portal/dashboard_page.py
tests/hr_portal/test_search.py

The app namespace (hr_portal) is auto-derived from the module path and used to prefix tool names (hr_portal.DashboardPage.find_employee), so apps never collide.

💡 Skip the boilerplate: For a greenfield app with requirements already written down, the Architect generates the subpackage, Page Objects, conftest, and tests automatically. The walkthrough below covers the underlying structure so you can extend Architect-scaffolded code or hand-write from scratch.

Create an app subpackage under pages/ with an __init__.py.
Create a Page Object: Extend BasePage and decorate your interactions with @tool_method.
Define a Test in a matching tests/<app>/ subpackage: register your page, build the graph, and invoke it with a goal.

# pages/hr_portal/dashboard_page.py
from pages.base_page import BasePage, tool_method

class DashboardPage(BasePage):
    @tool_method
    async def find_employee(self, name: str) -> str:
        """Search for an employee in the main directory."""
        await self._page.fill("#search", name)
        await self._page.click("#submit")
        return f"Search triggered for {name}"

📖 Detailed Guide: For a step-by-step technical walkthrough on onboarding, see the New App Recipe.

🧠 Architecture & Deep Dive

This framework is built on LangGraph, a state-machine based approach to AI orchestration.

Layer	Component
Tools	Your Page Objects, native Vision Fallbacks, and the Database Toolkit
Logic	Plan, Verify, and Diagnose nodes in `agents/nodes.py`
Healing	Validation-gated selector repair in `agents/heal_node.py` — DOM-verified patches, dated source annotations, and an append-only `results/heal_log.jsonl` ledger
Assertions	`tests/_helpers.py::assert_clean_pass` — the two-pass gate that fails any run which fired the heal node

For a full technical breakdown of the 4 core nodes, state management, and the self-healing plumbing, please refer to the: 👉 Architecture & User Guide

📊 Reporting & Debugging

Execution Log: `results/logs/execution.log` (JSONL forensic audit)
Architect Log: `results/architect/architect.log` (per-run JSONL audit of the most recent scaffold — truncated at each run, isolated from pytest's `results/logs/` sweep)
Heal Ledger: `results/heal_log.jsonl` (append-only maintenance record — one line per heal attempt with a stable status enum: `applied`, `rejected_validation`, `rejected_patch`, `rejected_llm`, or `skipped_blank_page`)
Allure Dashboard: `results/allure-results` (View via `allure serve results/allure-results`)
Playwright Traces: `results/traces/` (View via `playwright show-trace`)
Checkpoints: `results/checkpoints/agent.db` (State snapshots)

🔬 Sample Application

One sample app ships with the framework to exercise the full UI + DB + self-healing surface end-to-end:

App	Location	What it demonstrates
`adams_golf_club`	pages/adams_golf_club/, tests/adams_golf_club/	Local Flask app with a real backend DB — exercises UI + `db.verify_record` hybrid goals, member/admin role credentials, and the full self-healing path against controlled selector drift.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.azure-pipelines		.azure-pipelines
agents		agents
apps		apps
pages		pages
services		services
tests		tests
tools		tools
utils		utils
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
README.md		README.md
architect_cli.py		architect_cli.py
config.py		config.py
main.py		main.py
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🤖 Agentic Testing Framework

✨ Key Features

🚀 Quick Start

1. Prerequisites

2. Installation

3. Configuration

🛠️ Usage

Run via CLI

Run via Pytest

🗄️ Database Assertions

Quick start

Unit tests

🔌 API Testing

Quick start

Auth methods

Architect — --mode api

Sample app

🏛️ Architect — Auto-Scaffold from Requirements

Quick start

CLI options

Configuration (.env)

What gets generated

Guardrails

Cross-model robustness

Operator diagnostics

Quality gates

Running the Architect on an existing app

🏗️ Adding Your Application

🧠 Architecture & Deep Dive

📊 Reporting & Debugging

🔬 Sample Application

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Architect — `--mode api`

Configuration (`.env`)

Packages