SynthFix

Adaptive neuro-symbolic code vulnerability repair for code LLMs. Companion code for the ACL 2026 Findings paper "SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair."

SynthFix is a hybrid training and inference framework for automated program repair. A small neural router inspects each training example's difficulty features and decides, per sample, how strongly a symbolic reward should shape the gradient: easy samples are learned with plain supervised fine-tuning (SFT), while harder samples additionally receive a variance-reduced REINFORCE (RLOO) update driven by a composite symbolic reward. Crucially, the same symbolic signals are reused at inference time to rerank a pool of K candidate patches. The two halves — router-gated symbolic RFT at training time and symbolic-feature-guided best-of-K selection at decoding time — are what deliver SynthFix's gains.

Headline results

We evaluate on deployable repair metrics, that is, whether a patch actually works, across five modern code LLMs. Functional pass@1 is the fraction of bugs whose patched program compiles and passes its held-out test suite; security cleared is the fraction of SVEN vulnerabilities that Semgrep reports as removed. Both are reported as percentages. Selection is leak-free (it sees only public tests and static signals) with a greedy floor, and best-of-K uses K=16. The RFT-only baseline is budget-matched (2-epoch SFT warmup + 2 RL epochs).
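As a small worked example of how both metrics reduce to a success fraction, the sketch below turns per-item outcomes into the percentage shown in the tables. The function name and the outcome list are hypothetical and are not taken from the evaluation scripts.

```python
from typing import Sequence


def rate_percent(outcomes: Sequence[bool]) -> float:
    """Percentage of benchmark items whose repair succeeded, to one decimal."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return round(100.0 * sum(outcomes) / len(outcomes), 1)


# Hypothetical per-item outcomes. True means the patched program compiled and
# passed its held-out tests (functional pass@1), or that Semgrep no longer
# flags the vulnerability (security cleared).
sven_outcomes = [True] * 14 + [False] * 2
print(rate_percent(sven_outcomes))  # 87.5
```

With n=16, each SVEN item is worth 6.25 percentage points, so adjacent values in the security table differ by a single vulnerability.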

Functional pass@1 — pyrepair (Python, n=115) / CodeFlaws (C, n=389)

Model	pyrepair SFT	pyrepair RFT	pyrepair SynthFix	CodeFlaws SFT	CodeFlaws RFT	CodeFlaws SynthFix
DeepSeek-1.3B	68.7	69.6	87.0	12.3	10.5	15.2
Llama-3.2-3B	73.9	73.9	85.2	14.4	13.6	22.1
Qwen3-4B-Base	80.9	80.9	91.3	18.8	15.9	22.6
CodeLlama-7B	72.2	75.7	81.7	12.9	12.6	18.8
StarCoder2-7B	79.1	76.5	93.0	15.7	15.7	22.6

Security cleared — SVEN (Semgrep, n=16)

Model	SFT	RFT	SynthFix
DeepSeek-1.3B	87.5	87.5	100
Llama-3.2-3B	100	87.5	100
Qwen3-4B-Base	87.5	87.5	93.8
CodeLlama-7B	93.8	93.8	93.8
StarCoder2-7B	87.5	81.2	100

SynthFix matches or improves on SFT for every benchmark and model: it is strictly better on every pyrepair and CodeFlaws cell, and on SVEN it ties SFT only for Llama-3.2-3B (both at the 100 ceiling) and CodeLlama-7B (both at 93.8). The budget-matched RFT-only baseline tracks SFT and in several cells even regresses under greedy decoding (for example, CodeFlaws with Qwen3-4B-Base drops from 18.8 to 15.9), so training-time RL alone does not reliably translate into deployable repairs. Greedy CodeBLEU is reported only as a diagnostic (docs/RESULTS.md): the three methods cluster within ~1 CodeBLEU point even though their functional and security quality differs sharply, which is exactly why we evaluate on execution and security. Full numbers, ablations, and the reproduction recipe are in docs/RESULTS.md.

Method

The training pipeline has three phases:

Phase	What happens
1. SFT warmup (2 epochs)	Standard cross-entropy fine-tuning on `(buggy → fixed)` pairs. Per-sample loss is recorded as a difficulty signal.
2. Router pre-training	An MLP router with two 64-unit hidden layers is supervised to predict "above-median loss" from code features: AST complexity, CFG depth, code length, and the current per-sample loss.
3. Router-gated RFT (2 epochs)	Per batch: (a) compute the SFT anchor loss, (b) draw `K` on-policy continuations (RLOO baseline), (c) score each with the split symbolic reward, (d) gate the RL contribution by the router's probability that a sample is "hard", (e) combine as `L = L_SFT + β · gate · A_LOO · CE_sampled`, with a KL anchor to the SFT reference.
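
The phase 3 objective can be illustrated with a minimal sketch of the leave-one-out baseline and the gated loss from step (e). This is not the code in src/train_synthfix.py: the function names are hypothetical, the losses are plain floats rather than tensors, and the KL anchor is left out for brevity. The default `beta` mirrors the `--rl_beta 0.12` value from the example training command.

```python
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantage: each reward minus the mean of the other rollouts."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def gated_loss(l_sft: float, gate: float, advantage: float,
               ce_sampled: float, beta: float = 0.12) -> float:
    """L = L_SFT + beta * gate * A_LOO * CE_sampled for one sampled rollout."""
    return l_sft + beta * gate * advantage * ce_sampled


# Two rollouts for one prompt, scored by the symbolic reward.
print(rloo_advantages([0.75, 0.25]))  # [0.5, -0.5]

# An easy sample (gate = 0) reduces to the plain SFT loss.
print(gated_loss(l_sft=1.0, gate=0.0, advantage=0.5, ce_sampled=2.0))  # 1.0
```

Because the baseline for each rollout is computed from the other rollouts only, it does not depend on the rollout being scored, which is what makes the estimator variance-reduced without biasing the gradient.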

Split symbolic reward

Every generated continuation is scored on multiple dimensions in [0, 1] (see src/models/), then combined into a composite reward and also exposed component-wise so the same signals can be reused as features for the inference-time reranker:

AST — syntactic validity / bracket balance (symbolic.py).
CFG / DFG — control- and data-flow fidelity vs. the reference via tree-sitter parses (parse_dfg.py).
Lint — language-specific static linting (lint_reward.py).
Security — Semgrep scan for known vulnerability patterns (used directly for SVEN; symbolic.py).
Execution — for execution-capable benchmarks, the held-out/public test pass rate of the patched program (exec_reward.py).
Repair-effect — reference-free signal that the edit actually changes the buggy region (repair_effect.py).
Surface similarity — character n-gram F-score (chrF) to the reference.
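
To make the "composite plus component-wise" idea concrete, here is a minimal sketch that combines whichever component scores are available into one reward and also returns the individual components. The weighted-mean form, the component keys, and the weights are assumptions for illustration; the actual combination lives in src/models/reward.py and may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical weights for illustration only; not the values used in the paper.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "ast": 1.0, "cfg_dfg": 1.0, "lint": 1.0, "security": 1.0,
    "execution": 2.0, "repair_effect": 1.0, "chrf": 1.0,
}


def composite_reward(
    components: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return (weighted mean of the scored components, copy of the components)."""
    weights = EXAMPLE_WEIGHTS if weights is None else weights
    for name, score in components.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    used = {n: s for n, s in components.items() if n in weights}
    total_weight = sum(weights[n] for n in used)
    if total_weight == 0:
        raise ValueError("no weighted components available")
    composite = sum(weights[n] * s for n, s in used.items()) / total_weight
    return composite, dict(components)


# A benchmark without a security scan simply omits that component.
reward, features = composite_reward({"ast": 1.0, "execution": 0.5})
```

Normalizing by the weights of the components that are present keeps the composite in [0, 1] even when a benchmark cannot supply every signal, and the returned dictionary is what a reranker could consume as features.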

Router & gating

[f_AST, f_CFG, f_len, f_loss]  →  64 (ReLU)  →  64 (ReLU)  →  1 (σ)  →  P(hard)

Features are min-max normalized per batch. During phase 3 the RL term is gated by gate = 1[P(hard) ≥ 0.5]: easy samples receive only the SFT anchor, hard samples additionally receive the variance-reduced REINFORCE signal from RLOO. Advantages are normalized across the batch and clamped to the non-negative half-line, so updates only amplify rollouts that beat their leave-one-out baseline. The KL anchor keeps greedy decoding from collapsing.
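
The gating rules in the paragraph above can be sketched in plain Python. The helper names are hypothetical, and one detail is an assumption: the text says advantages are "normalized across the batch" without giving the formula, so this sketch divides by the batch standard deviation without re-centering, which preserves the sign of each leave-one-out advantage.

```python
import math
from typing import List


def min_max_normalize(column: List[float]) -> List[float]:
    """Rescale one feature column to [0, 1] within the current batch."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature carries no signal
    return [(x - lo) / (hi - lo) for x in column]


def hard_gate(p_hard: float, threshold: float = 0.5) -> float:
    """gate = 1[P(hard) >= 0.5]: 1.0 enables the RL term, 0.0 disables it."""
    return 1.0 if p_hard >= threshold else 0.0


def clamp_advantages(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Scale by the batch std (assumed form), then clamp to the non-negative half-line."""
    mean = sum(advantages) / len(advantages)
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / len(advantages))
    return [max(0.0, a / (std + eps)) for a in advantages]


print(min_max_normalize([10.0, 20.0, 30.0]))  # [0.0, 0.5, 1.0]
print(hard_gate(0.5), hard_gate(0.49))        # 1.0 0.0
```

After clamping, a rollout that scores below its leave-one-out baseline contributes nothing to the RL term, so only the SFT anchor and the KL anchor act on it.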

Inference-time selection

At decoding time SynthFix samples K=16 diverse candidates (mixed temperature), applies hard validity filters (parse / compile / public tests), and ranks the survivors with a learned reranker over the same symbolic features (src/models/inference.py). A greedy floor guarantees best-of-K never scores below the greedy patch, so inference-time selection can only help.
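
The selection logic can be sketched as follows. This is an illustration under stated assumptions, not the code in src/models/inference.py: the `Candidate` fields and `select_patch` are hypothetical, the reranker score is taken as given, and the greedy floor is modeled as "keep the greedy patch unless a surviving candidate strictly outscores it".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    patch: str
    passes_filters: bool  # parse / compile / public tests all succeeded
    score: float          # reranker score over the symbolic features


def select_patch(greedy: Candidate, sampled: List[Candidate]) -> Candidate:
    """Pick the best surviving sample, falling back to the greedy patch."""
    survivors = [c for c in sampled if c.passes_filters]
    if not survivors:
        return greedy  # nothing passed the hard filters
    best = max(survivors, key=lambda c: c.score)
    # Greedy floor (assumed form): a valid greedy patch is only replaced by a
    # survivor that strictly outscores it.
    if greedy.passes_filters and greedy.score >= best.score:
        return greedy
    return best


greedy = Candidate("greedy fix", passes_filters=True, score=0.4)
pool = [
    Candidate("sample A", passes_filters=True, score=0.9),
    Candidate("sample B", passes_filters=False, score=1.0),  # filtered out
]
print(select_patch(greedy, pool).patch)  # sample A
```

Applying the hard filters before ranking means a high reranker score can never rescue a candidate that fails to parse, compile, or pass the public tests.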

Installation

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Python 3.9+ and a CUDA-capable GPU are recommended (24 GB+ for 7B models). Parameter-efficient fine-tuning uses LoRA (rank 16, alpha 32, dropout 0.05) on the attention projections. Qwen3-4B-Base needs newer libraries (torch ≥ 2.5, transformers ≥ 4.51); the orchestrators can route it through a separate interpreter via the QWEN3_PY constant.

Models

Model	Size	Hugging Face ID
DeepSeek-Coder	1.3B	`deepseek-ai/deepseek-coder-1.3b-base`
Llama-3.2	3B	`meta-llama/Llama-3.2-3B`
Qwen3-Base	4B	`Qwen/Qwen3-4B-Base`
CodeLlama	7B	`codellama/CodeLlama-7b-hf`
StarCoder2	7B	`bigcode/starcoder2-7b`

Any decoder-only LM exposed through AutoModelForCausalLM can be added by extending the MODEL_PATHS registry in src/train_synthfix.py.

Data

Three execution- and security-grounded benchmarks are used in the paper:

pyrepair (Python) — execution repair built from MBPP + QuixBugs (scripts/data/build_pyexec_benchmark.py).
CodeFlaws (C) — execution repair with per-bug test suites (scripts/data/build_codeflaws_with_tests.py).
SVEN (Python) — security repair, scored with Semgrep.

src/data/process_benchmarks.py converts raw benchmarks (CodeFlaws, SVEN, and FixJS) into a unified JSON schema where every record is {"buggy": str, "fixed": str, "language": str}, split into {train,val,test}.json. The loader in src/data/dataset.py consumes this schema directly, so any dataset expressible in it works without touching the training code. Benchmark data is not bundled with this repository; obtain it from the upstream sources and point the scripts at it via --data_dir (or the data/ defaults).
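
For anyone preparing a custom dataset, a quick check against the unified schema can catch problems before training. The sketch below is not part of the repository: `validate_records` and `load_split` are hypothetical helpers, and it assumes each split file holds a single JSON list of records.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("buggy", "fixed", "language")


def validate_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Check that every record has string-valued buggy / fixed / language fields."""
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: field {key!r} must be a string")
    return records


def load_split(path: str) -> List[Dict[str, str]]:
    """Load one split file (assumed to be a JSON list) and validate it."""
    with open(path, encoding="utf-8") as fh:
        return validate_records(json.load(fh))


example = [{
    "buggy": "def add(a, b):\n    return a - b\n",
    "fixed": "def add(a, b):\n    return a + b\n",
    "language": "python",
}]
validate_records(example)  # raises ValueError if a field is missing or not a string
```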

Usage

Single training run

CUDA_VISIBLE_DEVICES=0 python scripts/train/run_all_experiments.py --worker \
    --method synthfix --model_name deepseek-1.3b --dataset_name codeflaws \
    --data_dir <path/to/codeflaws> --gpu 0 \
    --epochs 4 --sft_warmup_epochs 2 --batch_size 16 --lr 2e-4 \
    --rl_beta 0.12 --kl_beta 0.12 --rloo_k 2 --use_rich_exec_reward \
    --select_metric val_codebleu \
    --save_ckpt_to checkpoints/synthfix_deepseek-1.3b_codeflaws \
    --out results/synthfix_deepseek-1.3b_codeflaws.json

--method accepts sft, rft, or synthfix. The budget-matched RFT baseline uses the same --sft_warmup_epochs 2.

GPU selection. Always pin the device with CUDA_VISIBLE_DEVICES=<id> and pass --gpu 0. Passing a non-zero CUDA index directly produces empty generations for some LLaMA-architecture models; the sweep runners handle this automatically.
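
When launching workers from your own script, the same pinning rule can be applied programmatically. This is a minimal sketch, not the logic of the sweep runners: `pinned_worker` is a hypothetical helper that sets CUDA_VISIBLE_DEVICES to the physical device and always passes `--gpu 0`.

```python
import os
from typing import Dict, List, Tuple


def pinned_worker(physical_gpu: int, extra_args: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and argv for one worker pinned to a physical GPU."""
    env = dict(os.environ)
    # The worker sees exactly one device, which it addresses as index 0.
    env["CUDA_VISIBLE_DEVICES"] = str(physical_gpu)
    argv = [
        "python", "scripts/train/run_all_experiments.py", "--worker",
        "--gpu", "0",
        *extra_args,
    ]
    return env, argv


env, argv = pinned_worker(3, ["--method", "synthfix", "--model_name", "deepseek-1.3b"])
```

The pair can then be passed to `subprocess.run(argv, env=env, check=True)` from the repository root.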

Orchestrators (full experiment matrix)

python scripts/experiments/queue_artifact.py   # full (model × benchmark × {SFT,RFT,SynthFix}) matrix
python scripts/experiments/run_rft_matrix.py    # budget-matched RFT-only baseline matrix
python scripts/eval/run_eval_sweep.py           # deployable functional/security best-of-K eval sweep
python scripts/experiments/run_ablations.py     # RQ2 (router) and RQ3 (inference selection) ablations

These encode the paper's experiment grid and assume the directory layout in the Repository layout section (checkpoints under checkpoints/, results under results/); adjust the dataset paths near the top of each file to your own data locations.

Per-benchmark deployable evaluation

# pyrepair (Python functional pass@1)
python scripts/eval/eval_functional_py.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <pyrepair> --split test --model_name <model> --gpu 0 --K 16 \
    --out results/functional_pyrepair_<model>.json

# CodeFlaws (C functional pass@1)
python scripts/eval/eval_functional.py    --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <codeflaws> --split test --model_name <model> --gpu 0 --K 16 \
    --out results/functional_codeflaws_<model>.json

# SVEN (Semgrep security cleared)
python scripts/eval/eval_security_sven.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <sven> --model_name <model> --gpu 0 --K 16 \
    --out results/security_sven_<model>.json

Add --greedy_only to scripts/eval/eval_functional*.py for a fast greedy-decoding pass@1 (used for the RFT-only baseline). scripts/analysis/make_paper_figs.py renders the paper figures from the result JSONs, and scripts/analysis/collect_rft.py / scripts/analysis/aggregate_final.py roll per-run JSONs into a single report.

Repository layout

SynthFix/
├── README.md
├── requirements.txt            # Python dependencies
├── LICENSE · CITATION.cff
├── configs/                    # DeepSpeed ZeRO configs
├── docs/
│   └── RESULTS.md              # Full results tables + reproduction recipe
├── scripts/                    # All runnable entry points, grouped by purpose
│   ├── train/
│   │   ├── run_all_experiments.py   # Single train/eval worker (sft | rft | synthfix)
│   │   ├── orchestrate_final.py     # End-to-end SFT → SynthFix → aggregate pipeline
│   │   └── orchestrate_twostage.py  # Two-stage (SFT warm-start) driver
│   ├── eval/
│   │   ├── eval_functional.py       # CodeFlaws functional pass@1 eval
│   │   ├── eval_functional_py.py    # pyrepair functional pass@1 eval
│   │   ├── eval_security_sven.py    # SVEN Semgrep security eval
│   │   ├── eval_strategies.py       # Inference-strategy comparison
│   │   ├── codebleu_one.py          # Standalone greedy-CodeBLEU eval
│   │   └── run_eval_sweep.py        # Deployable functional/security best-of-K sweep
│   ├── data/
│   │   ├── build_pyexec_benchmark.py    # Build pyrepair from MBPP + QuixBugs
│   │   └── build_codeflaws_with_tests.py # Attach per-bug tests to CodeFlaws
│   ├── experiments/
│   │   ├── queue_artifact.py        # Full (model × benchmark × method) matrix driver
│   │   ├── run_rft_matrix.py        # Budget-matched RFT-only baseline matrix
│   │   ├── run_ablations.py         # RQ2 (router) / RQ3 (inference) ablations
│   │   └── run_sensitivity.sh       # Reward-weight sensitivity sweep
│   └── analysis/
│       ├── make_paper_figs.py       # Render paper figures from result JSONs
│       ├── collect_rft.py           # Aggregate RFT/SynthFix metrics into tables
│       ├── aggregate_final.py       # Collect per-run JSONs into a report
│       └── diag_*.py                # Standalone diagnostics
└── src/                        # Importable library (no path setup needed)
    ├── train_synthfix.py       # Router-gated REINFORCE training + evaluation
    ├── train_baseline.py       # SFT / RFT baselines (shared data path)
    ├── data/                   # RepairDataset, dataloaders, benchmark processing
    └── models/
        ├── router.py           # MLP router + feature extraction
        ├── reward.py           # Composite symbolic reward
        ├── symbolic.py         # AST / security split features
        ├── parse_dfg.py        # tree-sitter parse, AST/CFG/DFG scores
        ├── lint_reward.py      # Language-specific linting reward
        ├── exec_reward.py      # Execution test-case reward
        ├── repair_effect.py    # Reference-free repair-effect reward
        └── inference.py        # Best-of-K candidate reranker

Every script resolves the repository root automatically, so it can be run from anywhere (e.g. python scripts/eval/eval_functional.py ...) without setting PYTHONPATH.

Citation

If you use this code in your research, please cite:

@inproceedings{synthfix2026,
  title     = {SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair},
  author    = {SynthFix Authors},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}

License

Released under the terms of the LICENSE file.
