Adaptive neuro-symbolic code vulnerability repair for code LLMs. Companion code for the ACL 2026 Findings paper "SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair."
SynthFix is a hybrid training and inference framework for automated
program repair. A small neural router inspects each training example's
difficulty features and decides, per sample, how strongly a symbolic
reward should shape the gradient: easy samples are learned with plain
supervised fine-tuning (SFT), while harder samples additionally receive a
variance-reduced REINFORCE (RLOO) update driven by a composite symbolic
reward. Crucially, the same symbolic signals are reused at inference
time to rerank a pool of K candidate patches. The two halves —
router-gated symbolic RFT at training time and symbolic-feature-guided
best-of-K selection at decoding time — are what deliver SynthFix's gains.
We evaluate on deployable repair metrics, that is, whether a patch actually
works, across five modern code LLMs. Functional pass@1 is the fraction
of bugs whose patched program compiles and passes its held-out test
suite; security cleared is the fraction of SVEN vulnerabilities that
Semgrep reports as removed. Selection is leak-free (public tests / static
signals only) with a greedy floor, and best-of-K uses K=16. The RFT-only
baseline is budget-matched (2-epoch SFT warmup + 2 RL epochs).

Functional pass@1 (%):

| Model | pyrepair SFT | pyrepair RFT | pyrepair SynthFix | CodeFlaws SFT | CodeFlaws RFT | CodeFlaws SynthFix |
|---|---|---|---|---|---|---|
| DeepSeek-1.3B | 68.7 | 69.6 | 87.0 | 12.3 | 10.5 | 15.2 |
| Llama-3.2-3B | 73.9 | 73.9 | 85.2 | 14.4 | 13.6 | 22.1 |
| Qwen3-4B-Base | 80.9 | 80.9 | 91.3 | 18.8 | 15.9 | 22.6 |
| CodeLlama-7B | 72.2 | 75.7 | 81.7 | 12.9 | 12.6 | 18.8 |
| StarCoder2-7B | 79.1 | 76.5 | 93.0 | 15.7 | 15.7 | 22.6 |

SVEN security cleared (%):

| Model | SFT | RFT | SynthFix |
|---|---|---|---|
| DeepSeek-1.3B | 87.5 | 87.5 | 100 |
| Llama-3.2-3B | 100 | 87.5 | 100 |
| Qwen3-4B-Base | 87.5 | 87.5 | 93.8 |
| CodeLlama-7B | 93.8 | 93.8 | 93.8 |
| StarCoder2-7B | 87.5 | 81.2 | 100 |

SynthFix matches or beats SFT in every cell. It is strictly better on
pyrepair and CodeFlaws for all five models; on SVEN it improves three models
and ties the other two (Llama-3.2-3B at the 100 ceiling, CodeLlama-7B at
93.8). The budget-matched RFT-only baseline tracks SFT and falls below it in
7 of the 15 cells under greedy decoding, so training-time RL alone does not
reliably translate into deployable repairs. Greedy CodeBLEU is reported only
as a diagnostic (docs/RESULTS.md): the three methods cluster within ~1 CodeBLEU
point even though their functional and security quality differs sharply, which
is why we evaluate on execution and security. Full numbers, ablations, and the
reproduction recipe are in docs/RESULTS.md.
The training pipeline has three phases:
| Phase | What happens |
|---|---|
| 1. SFT warmup (2 epochs) | Standard cross-entropy fine-tuning on (buggy → fixed) pairs. Per-sample loss is recorded as a difficulty signal. |
| 2. Router pre-training | A 2-layer MLP router is supervised to predict "above-median loss" from code features: AST complexity, CFG depth, code length, and the current per-sample loss. |
| 3. Router-gated RFT (2 epochs) | Per batch: (a) compute the SFT anchor loss, (b) draw K on-policy continuations (RLOO baseline), (c) score each with the split symbolic reward, (d) gate the RL contribution by the router's probability that a sample is "hard", (e) combine as L = L_SFT + β · gate · A_LOO · CE_sampled, with a KL anchor to the SFT reference. |
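
The phase 3 combination can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in src/train_synthfix.py: the function names are hypothetical, losses and rewards are plain floats, the RL term is averaged over the K rollouts (an assumption), and the KL anchor and batch-level advantage normalization are left out for brevity.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: each reward minus the mean of the other rollouts."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per sample")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def gated_loss(sft_loss: float,
               sampled_ce: List[float],
               rewards: List[float],
               gate: float,
               beta: float = 0.12) -> float:
    """L = L_SFT + beta * gate * mean_k(A_LOO_k * CE_sampled_k); KL term omitted."""
    # Advantages are clamped to be non-negative, as the README describes.
    advantages = [max(a, 0.0) for a in rloo_advantages(rewards)]
    rl_term = sum(a * ce for a, ce in zip(advantages, sampled_ce)) / len(rewards)
    return sft_loss + beta * gate * rl_term
```

With gate set to 0 the function returns the SFT loss unchanged, which is the "easy sample" path; with gate set to 1 only rollouts that beat their leave-one-out baseline contribute.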
Every generated continuation is scored on multiple dimensions in [0, 1]
(see src/models/), then combined into a composite reward and also
exposed component-wise so the same signals can be reused as features for the
inference-time reranker:

- AST: syntactic validity / bracket balance (symbolic.py).
- CFG / DFG: control- and data-flow fidelity vs. the reference via tree-sitter parses (parse_dfg.py).
- Lint: language-specific static linting (lint_reward.py).
- Security: Semgrep scan for known vulnerability patterns (used directly for SVEN; symbolic.py).
- Execution: for execution-capable benchmarks, the held-out/public test pass rate of the patched program (exec_reward.py).
- Repair-effect: reference-free signal that the edit actually changes the buggy region (repair_effect.py).
- Surface similarity: character n-gram F-score (chrF) to the reference.

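
As a rough illustration of how per-component scores might be folded into one reward and also exposed as reranker features, consider the sketch below. The weighted-mean form, the weight values, and the component key names are all assumptions made for this example; the actual combination lives in src/models/reward.py and may differ.

```python
from typing import Dict, List, Sequence

# Hypothetical weights, chosen only for illustration.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "ast": 1.0, "cfg_dfg": 1.0, "lint": 0.5, "security": 1.0,
    "execution": 2.0, "repair_effect": 0.5, "chrf": 0.5,
}


def composite_reward(components: Dict[str, float],
                     weights: Dict[str, float] = EXAMPLE_WEIGHTS) -> float:
    """Weighted mean of the available component scores, each clipped to [0, 1]."""
    used = {name: w for name, w in weights.items() if name in components}
    if not used:
        return 0.0
    total = sum(w * min(max(components[name], 0.0), 1.0)
                for name, w in used.items())
    return total / sum(used.values())


def feature_vector(components: Dict[str, float],
                   order: Sequence[str] = tuple(EXAMPLE_WEIGHTS)) -> List[float]:
    """The same signals in a fixed order; components that are missing become 0.0."""
    return [components.get(name, 0.0) for name in order]
```

Averaging only over the components that are present keeps the reward in [0, 1] for benchmarks where, for example, no execution signal is available.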
[f_AST, f_CFG, f_len, f_loss] → 64 (ReLU) → 64 (ReLU) → 1 (σ) → P(hard)
Features are min-max normalized per batch. During phase 3 the RL term is
gated by gate = 1[P(hard) ≥ 0.5]: easy samples receive only the SFT
anchor, hard samples additionally receive the variance-reduced REINFORCE
signal from RLOO. Advantages are normalized across the batch and clamped to
the non-negative half-line, so updates only amplify rollouts that beat their
leave-one-out baseline. The KL anchor keeps greedy decoding from collapsing.
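
The router's forward pass and hard gate can be sketched without any deep-learning dependency. This is an illustrative pure-Python version with randomly initialized weights, not the implementation in src/models/router.py; the helper names and the handling of constant feature columns are assumptions.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def minmax_normalize(batch: List[List[float]]) -> List[List[float]]:
    """Scale each feature column of the batch to [0, 1]; constant columns map to 0."""
    cols = list(zip(*batch))
    lo = [min(c) for c in cols]
    span = [max(c) - min(c) for c in cols]
    return [[(x - l) / s if s > 0 else 0.0 for x, l, s in zip(row, lo, span)]
            for row in batch]


def init_mlp(sizes=(4, 64, 64, 1), seed: int = 0) -> List[Layer]:
    """Random weights for illustration only; a trained router would load its own."""
    rng = random.Random(seed)
    return [([[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]


def p_hard(features: List[float], layers: List[Layer]) -> float:
    """ReLU on the hidden layers, sigmoid on the single output unit."""
    h = features
    for i, (weights, bias) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            h = [max(v, 0.0) for v in h]
    return 1.0 / (1.0 + math.exp(-h[0]))


def hard_gate(prob: float, threshold: float = 0.5) -> int:
    """gate = 1[P(hard) >= threshold]."""
    return int(prob >= threshold)
```

A batch would be passed through `minmax_normalize` first, then each row through `p_hard` and `hard_gate` to decide whether the RL term is applied to that sample.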
At decoding time SynthFix samples K=16 diverse candidates (mixed
temperature), applies hard validity filters (parse / compile / public
tests), and ranks the survivors with a learned reranker over the same
symbolic features (src/models/inference.py). A greedy floor guarantees
best-of-K never scores below the greedy patch, so inference-time selection
can only help.
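
One way to read the filter, rank, and greedy-floor steps is the sketch below. It is a hedged illustration, not the logic in src/models/inference.py: `is_valid` stands in for the parse / compile / public-test filters and `score` for the learned reranker, and both are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, Sequence


def select_patch(greedy: str,
                 candidates: Sequence[str],
                 is_valid: Callable[[str], bool],
                 score: Callable[[str], float]) -> str:
    """Return the best-scoring valid candidate, falling back to the greedy patch."""
    survivors: List[str] = [c for c in candidates if is_valid(c)]
    if not survivors:
        return greedy  # nothing passed the hard filters
    best = max(survivors, key=score)
    # Greedy floor: keep the greedy patch unless a valid candidate scores
    # strictly higher under the same selection signal.
    if is_valid(greedy) and score(best) <= score(greedy):
        return greedy
    return best
```

In this sketch the floor is defined with respect to the selection signal (filters plus reranker score), which uses only information available at inference time.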
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Python 3.9+ and a CUDA-capable GPU are recommended (24 GB+ for 7B models).
Parameter-efficient fine-tuning uses LoRA (rank 16, alpha 32, dropout 0.05)
on the attention projections. Qwen3-4B-Base needs newer libraries
(torch ≥ 2.5, transformers ≥ 4.51); the orchestrators can route it
through a separate interpreter via the QWEN3_PY constant.
| Model | Size | Hugging Face ID |
|---|---|---|
| DeepSeek-Coder | 1.3B | deepseek-ai/deepseek-coder-1.3b-base |
| Llama-3.2 | 3B | meta-llama/Llama-3.2-3B |
| Qwen3-Base | 4B | Qwen/Qwen3-4B-Base |
| CodeLlama | 7B | codellama/CodeLlama-7b-hf |
| StarCoder2 | 7B | bigcode/starcoder2-7b |
Any decoder-only LM exposed through AutoModelForCausalLM can be added by
extending the MODEL_PATHS registry in src/train_synthfix.py.
Three execution- and security-grounded benchmarks are used in the paper:

- pyrepair (Python): execution repair built from MBPP + QuixBugs (scripts/data/build_pyexec_benchmark.py).
- CodeFlaws (C): execution repair with per-bug test suites (scripts/data/build_codeflaws_with_tests.py).
- SVEN (Python): security repair, scored with Semgrep.

src/data/process_benchmarks.py converts raw benchmarks (CodeFlaws, SVEN,
and FixJS) into a unified JSON schema where every record is
{"buggy": str, "fixed": str, "language": str}, split into
{train,val,test}.json. The loader in src/data/dataset.py consumes this
schema directly, so any dataset expressible in it works without touching
the training code. Benchmark data is not bundled with this repository;
obtain it from the upstream sources and point the scripts at it via
--data_dir (or the data/ defaults).
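
To bring a custom dataset into this schema, something like the following could be used. It is a sketch under assumptions: the helper names are hypothetical, and each split file is assumed to hold a single JSON array of records, which should be checked against src/data/dataset.py before relying on it.

```python
import json
from pathlib import Path
from typing import Dict, List

REQUIRED_KEYS = ("buggy", "fixed", "language")


def validate_records(records: List[Dict[str, str]]) -> None:
    """Raise ValueError on the first record that does not match the schema."""
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: field {key!r} must be a string")


def write_splits(out_dir: str, splits: Dict[str, List[Dict[str, str]]]) -> None:
    """Write train.json, val.json and test.json into out_dir."""
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    for name in ("train", "val", "test"):
        validate_records(splits[name])
        (root / f"{name}.json").write_text(json.dumps(splits[name], indent=2))
```

The resulting directory can then be passed to the training and evaluation scripts through --data_dir.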
CUDA_VISIBLE_DEVICES=0 python scripts/train/run_all_experiments.py --worker \
--method synthfix --model_name deepseek-1.3b --dataset_name codeflaws \
--data_dir <path/to/codeflaws> --gpu 0 \
--epochs 4 --sft_warmup_epochs 2 --batch_size 16 --lr 2e-4 \
--rl_beta 0.12 --kl_beta 0.12 --rloo_k 2 --use_rich_exec_reward \
--select_metric val_codebleu \
--save_ckpt_to checkpoints/synthfix_deepseek-1.3b_codeflaws \
--out results/synthfix_deepseek-1.3b_codeflaws.json

--method accepts sft, rft, or synthfix. The budget-matched RFT
baseline uses the same --sft_warmup_epochs 2.
GPU selection. Always pin the device with CUDA_VISIBLE_DEVICES=<id> and pass --gpu 0. Passing a non-zero CUDA index directly produces empty generations for some LLaMA-architecture models; the sweep runners handle this automatically.
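
If you write your own launcher, the pinning rule can be applied programmatically. The helper below is a hypothetical sketch, not the code used by the sweep runners: it only builds the argument list and environment for one worker, using the --worker and --gpu flags shown in the training command.

```python
import os
import sys
from typing import Dict, List, Tuple


def pinned_worker(physical_gpu: int,
                  extra_args: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Build (argv, env) for one worker pinned to a single physical GPU."""
    env = dict(os.environ)
    # The worker process will see exactly one device.
    env["CUDA_VISIBLE_DEVICES"] = str(physical_gpu)
    argv = [sys.executable, "scripts/train/run_all_experiments.py", "--worker",
            "--gpu", "0",  # always index 0 inside the pinned process
            *extra_args]
    return argv, env
```

The pair can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root to launch the worker.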
python scripts/experiments/queue_artifact.py # full (model × benchmark × {SFT,RFT,SynthFix}) matrix
python scripts/experiments/run_rft_matrix.py # budget-matched RFT-only baseline matrix
python scripts/eval/run_eval_sweep.py # deployable functional/security best-of-K eval sweep
python scripts/experiments/run_ablations.py # RQ2 (router) and RQ3 (inference selection) ablations

These encode the paper's experiment grid and assume the directory layout in
the Repository layout section (checkpoints under checkpoints/, results
under results/); adjust the dataset paths near the top of each file to your
own data locations.
# pyrepair (Python functional pass@1)
python scripts/eval/eval_functional_py.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
--data <pyrepair> --split test --model_name <model> --gpu 0 --K 16 \
--out results/functional_pyrepair_<model>.json
# CodeFlaws (C functional pass@1)
python scripts/eval/eval_functional.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
--data <codeflaws> --split test --model_name <model> --gpu 0 --K 16 \
--out results/functional_codeflaws_<model>.json
# SVEN (Semgrep security cleared)
python scripts/eval/eval_security_sven.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
--data <sven> --model_name <model> --gpu 0 --K 16 \
--out results/security_sven_<model>.json

Add --greedy_only to scripts/eval/eval_functional*.py for a fast
greedy-decoding pass@1 (used for the RFT-only baseline).
scripts/analysis/make_paper_figs.py renders the paper figures from the
result JSONs, and scripts/analysis/collect_rft.py /
scripts/analysis/aggregate_final.py roll per-run JSONs into a single report.
SynthFix/
├── README.md
├── requirements.txt  # Python dependencies
├── LICENSE · CITATION.cff
├── configs/  # DeepSpeed ZeRO configs
├── docs/
│   └── RESULTS.md  # Full results tables + reproduction recipe
├── scripts/  # All runnable entry points, grouped by purpose
│   ├── train/
│   │   ├── run_all_experiments.py  # Single train/eval worker (sft | rft | synthfix)
│   │   ├── orchestrate_final.py  # End-to-end SFT → SynthFix → aggregate pipeline
│   │   └── orchestrate_twostage.py  # Two-stage (SFT warm-start) driver
│   ├── eval/
│   │   ├── eval_functional.py  # CodeFlaws functional pass@1 eval
│   │   ├── eval_functional_py.py  # pyrepair functional pass@1 eval
│   │   ├── eval_security_sven.py  # SVEN Semgrep security eval
│   │   ├── eval_strategies.py  # Inference-strategy comparison
│   │   ├── codebleu_one.py  # Standalone greedy-CodeBLEU eval
│   │   └── run_eval_sweep.py  # Deployable functional/security best-of-K sweep
│   ├── data/
│   │   ├── build_pyexec_benchmark.py  # Build pyrepair from MBPP + QuixBugs
│   │   └── build_codeflaws_with_tests.py  # Attach per-bug tests to CodeFlaws
│   ├── experiments/
│   │   ├── queue_artifact.py  # Full (model × benchmark × method) matrix driver
│   │   ├── run_rft_matrix.py  # Budget-matched RFT-only baseline matrix
│   │   ├── run_ablations.py  # RQ2 (router) / RQ3 (inference) ablations
│   │   └── run_sensitivity.sh  # Reward-weight sensitivity sweep
│   └── analysis/
│       ├── make_paper_figs.py  # Render paper figures from result JSONs
│       ├── collect_rft.py  # Aggregate RFT/SynthFix metrics into tables
│       ├── aggregate_final.py  # Collect per-run JSONs into a report
│       └── diag_*.py  # Standalone diagnostics
└── src/  # Importable library (no path setup needed)
    ├── train_synthfix.py  # Router-gated REINFORCE training + evaluation
    ├── train_baseline.py  # SFT / RFT baselines (shared data path)
    ├── data/  # RepairDataset, dataloaders, benchmark processing
    └── models/
        ├── router.py  # MLP router + feature extraction
        ├── reward.py  # Composite symbolic reward
        ├── symbolic.py  # AST / security split features
        ├── parse_dfg.py  # tree-sitter parse, AST/CFG/DFG scores
        ├── lint_reward.py  # Language-specific linting reward
        ├── exec_reward.py  # Execution test-case reward
        ├── repair_effect.py  # Reference-free repair-effect reward
        └── inference.py  # Best-of-K candidate reranker

Every script resolves the repository root automatically, so it can be run
from anywhere (e.g. python scripts/eval/eval_functional.py ...) without
setting PYTHONPATH.
If you use this code in your research, please cite:
@inproceedings{synthfix2026,
title = {SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair},
author = {SynthFix Authors},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
year = {2026},
}

Released under the terms of the LICENSE file.