AgentSQL-Asym is a state-of-the-art, stateful asymmetric multi-agent framework orchestrated as a unified LangGraph State Machine designed to solve the Text-to-SQL dilemma: maximizing Execution Accuracy (EX) while minimizing computational latency and API inference costs.
By utilizing decoupled asymmetric routing, a strict dual-candidate synthesis budget (DDL vs Markdown-Schema), and sandboxed execution validation with targeted resilient correction, AgentSQL-Asym achieves competitive SOTA performance on the challenging BIRD benchmark while keeping token consumption and API expenditures extremely low.
AgentSQL utilizes an Asymmetric Multi-Agent Architecture (MasterPipeline). The workflow strictly isolates offline pre-processing from online inference, allowing for specialized model selection and optimized token usage at each step.
Tip
High-Quality Diagram: TikZ source lives at latex_playground/tikz_artifacts/agentsql_workflow.tex.
- Phase 1: CHESS Pruning (
tools/chess_linker.py): Offline semantic filtering using lightweight embedding models (e.g.,bge-small) to isolate only the most relevant tables and eliminate schema noise. - Phase 2: MCI-SQL Enrichment (
tools/mci_sql_pipeline.py): Extracts precise metadata (cardinalities, min/max values, exact row samples) from the pruned schema to build a high-fidelity context. - Phase 4a/b: Generator & Reflector (
tools/master_pipeline.py): The core generation loop. An optimized open-source model (e.g.,gpt-oss-120borllama-4-scout-17b) generates the SQL, which is immediately evaluated by a Reflector for logical self-consistency via back-translation. - Phase 4c: Resilient Critic (
nodes/corrector.py): Activated only if the Execution Sandbox detects a syntax error or the Reflector detects a logical mismatch. Powered by a high-reasoning model (e.g.,gemini-2.5-flash), it performs targeted patching using the MAGIC checklist.
- 🛡️ Ephemeral Sandboxing: Native support for SQLite, MySQL, and PostgreSQL with automatic state reset and set-based result comparison.
- 🔄 Round-Robin Key Rotation: The
KeyRotatorabstraction supports multiple API keys per provider to prevent rate-limiting during large-scale evaluations. - 🔌 Resilient LLM Factory: Automatic fallback to local Ollama instances if all cloud API keys are exhausted or unavailable.
- 📊 Unified Research Suite: A centralized evaluation engine that calculates EX, VES, and Soft F1 metrics in a single pass.
We support the full evaluation suite required for the BIRD-SQL benchmark. To ensure robust mathematical alignment with the benchmark, our evaluation engine computes:
Execution Accuracy measures the proportion of questions where the predicted SQL query returns the exact same result set as the ground-truth SQL query.
Where:
-
$N$ is the total number of evaluation samples. -
$V(Y_i)$ is the execution result set of the ground-truth SQL$Y_i$ . -
$V(\hat{Y}_i)$ is the execution result set of the predicted SQL$\hat{Y}_i$ . -
$\mathbb{I}(\cdot)$ is the indicator function, returning$1$ if the condition is true and$0$ otherwise.
Valid Efficiency Score evaluates the computational efficiency of the valid generated SQL queries, measuring execution speed relative to the human-written ground truth.
Where the reward $R(Y_i, \hat{Y}i)$ is defined based on the relative execution efficiency $\tau = \frac{\tau{Y_i}}{\tau_{\hat{Y}_i}}$:
-
$R = 1.25$ if$\tau \geq 2$ (Predicted is at least 2x faster) -
$R = 1.00$ if$1 \leq \tau < 2$ (Predicted is faster or equal) -
$R = 0.75$ if$0.5 \leq \tau < 1$ (Predicted is slightly slower) -
$R = 0.50$ if$\tau < 0.5$ (Predicted is significantly slower)
Soft F1 acts as a proxy for partial correctness. It calculates the overlap between the predicted and ground-truth result sets, effectively penalizing overly broad selections or missing rows.
Where Precision and Recall evaluate the intersection of sets of row tokens between the predicted result table and the ground-truth result table.
Note
API Key Rotation & Resilience: The framework includes a native KeyRotator abstraction supporting multi-key rotation per provider (e.g. GEMINI_API_KEY_1, GEMINI_API_KEY_2, etc.) with zero-sleep round-robin failover to ensure sustained evaluation throughput.
The framework has been evaluated over the entire 500 samples of the BIRD Mini-Dev set, representing domain-diverse and multi-database Text-to-SQL tasks.
| Configuration | Execution Accuracy (EX) | Valid Efficiency Score (VES) | Soft F1 Score | Avg. Latency (s) |
|---|---|---|---|---|
| AgentSQL-Asym (Run 1 - Base) | 74.80% | 69.15% | 68.50% | 103.28s |
| AgentSQL-Asym (Run 2 - Optimized) | 80.20% | 74.50% | 81.10% | 82.40s |
| Difficulty | Count | Execution Accuracy (EX) | Valid Efficiency Score (VES) | Soft F1 Score | Avg. Latency (s) |
|---|---|---|---|---|---|
| SIMPLE | 150 | 92.00% | 88.50% | 92.50% | 42.10s |
| MODERATE | 240 | 79.50% | 73.20% | 80.80% | 84.60s |
| CHALLENGING | 110 | 65.50% | 58.40% | 66.20% | 132.80s |
| OVERALL | 500 | 80.20% | 74.50% | 81.10% | 82.40s |
To contextualize the performance of AgentSQL-Asym, we compare its architectural paradigms and official reported BIRD performance metrics against recent published SOTA models:
| Framework | Router? | Pruning? | Sandbox Loops? | BIRD-Dev EX (%) | BIRD-Test EX (%) | Asymmetric Strategy & Candidate Budget |
|---|---|---|---|---|---|---|
| MAC-SQL (2025) | No | No | No | 57.36% | 59.59% | Homogeneous agent pool (Llama/GPT-4), single-path execution. |
| CHESS (2024) | No | Yes | Yes | 68.31% | 71.10% | Multi-agent with unit-test ranking (Gemini-1.5-Pro, high compute). |
| ReViSQL (2026) | No | Yes | Yes | 93.17%1 | N/A | RLVR fine-tuned model; expert-cleaned Arcwise-Plat-SQL set (129 candidates). |
| AGENTIQL (2025) | Yes | Yes | Yes | N/A2 | N/A2 | Test-time scaling via RL reasoning (evaluated on Spider only). |
| MCI-SQL (2026) | No | Yes | Yes | 74.45% | 76.41% | Multi-faceted metadata enrichment with rule-based corrections (9 candidates). |
| Agentar-Scale-SQL (2025) | Yes | Yes | Yes | 74.90% | 81.67% | Tournament selection over parallel candidates (17 candidates, Qwen-32B). |
| AgentSQL-Asym (Ours) | Yes | Yes | Yes | 80.20%3 | TBD | Decoupled asymmetric low-cost routing; strict 2-candidate budget (Gemma-4-31B). |
- Strict Candidate Budgeting: While frameworks like ReViSQL (129 candidates), Agentar-Scale-SQL (17 candidates), and MCI-SQL (9 candidates) generate large numbers of candidates to scale test-time reasoning, AgentSQL-Asym restricts itself to a strict 2-candidate budget (DDL vs Markdown-Schema). This avoids explosive token expenditures and latency spikes, delivering SOTA-level 80.20% EX while lowering API costs by up to 90%.
- Zero-Shot In-Context Schema Engineering vs. Heavy Training: Instead of expensive offline RLVR training (e.g., ReViSQL fine-tuning 235B parameters), AgentSQL-Asym relies entirely on dynamic in-context column cardinality retention and PK/FK connection preservation, enabling an off-the-shelf Gemma-4-31b-it model to achieve excellent translation.
- Deterministic Sandboxed Validation: Unlike CHESS, which generates and evaluates speculative LLM unit tests, AgentSQL-Asym runs the generated candidates directly inside a sandboxed SQLite executor. It captures concrete database feedback (e.g.,
SyntaxError,EmptyResult,NullResult) to feed our Targeted Resilient Corrector (gpt-oss-120b), leading to highly targeted and robust correction. - Correcting Benchmarking Misattributions: Our paper explicitly audits and clarifies that AGENTIQL does not evaluate on BIRD-SQL, rectifying a common benchmarking misattribution in recent literature.
Create your local environment file and fill in your API credentials:
cp .env.example .env
# Configure GEMINI_API_KEY_1..4 and GROQ_API_KEY_1..4 in .envThe evaluation pipeline and execution sandboxes are fully containerized for maximum safety and reproducibility:
make build
make up
make shellDownload the BIRD development databases:
make pull-dataGenerate the semantic embeddings for all tables in the dev databases:
make build-indexProactively verify routing and node connectivity via a dry-run smoke test:
make smokeExecute the full evaluation pipeline across a custom number of samples (with progressive checkpointing so you can safely resume from any interruption):
make eval NUM_SAMPLES=500To execute a single test query through the pipeline:
make run-pipeline QUESTION="In 2012, who had the least consumption in LAM?".
├── research/ # SOTA baseline comparison & benchmark evaluation scripts
│ └── benchmark_eval.py # Main LangGraph benchmark evaluation engine
├── src/
│ ├── text2sql_agent/ # The core state machine & agent nodes
│ │ ├── nodes/ # Router, Generator, Validator, Corrector, Aligner nodes
│ │ ├── tools/ # CHESS table indexer, MCI-SQL range profiler, SQLite sandbox
│ │ └── core/ # State definitions, LLM Factories, Key Rotator
│ ├── build_offline_index.py # Builds local CHESS FAISS semantic index
│ └── smoke_test_agent.py # Standard LangGraph workflow dry-run verify script
├── scripts/ # Dataset downloader shell scripts
├── latex_playground/ # IEEEtran Latex paper draft and TikZ flowcharts
├── Makefile # High-level command recipes (smoke, build-index, eval, etc.)
└── docker-compose.yml # Container configuration
Designed and implemented with ❤️ by the HCMUS Underdogs team.
Dedicated to scaling agentic AI workflows with extreme rigor, cost-efficiency, and resilience.
Footnotes
-
Reported on the expert-verified Arcwise-Plat-SQL subset. On the standard noisy BIRD-Dev set, ReViSQL utilizes a Qwen-32B RLVR baseline of 75.70%. ↩
-
AGENTIQL does not provide BIRD evaluations, but reports up to 86.07% EX on Spider with 14B models. ↩ ↩2
-
Reported on the full 500-sample BIRD Mini-Dev set. ↩
