CoderDoge1108/SynthFix

SynthFix

Adaptive neuro-symbolic code vulnerability repair for code LLMs. Companion code for the ACL 2026 Findings paper "SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair."

SynthFix is a hybrid training and inference framework for automated program repair. A small neural router inspects each training example's difficulty features and decides, per sample, how strongly a symbolic reward should shape the gradient: easy samples are learned with plain supervised fine-tuning (SFT), while harder samples additionally receive a variance-reduced REINFORCE (RLOO) update driven by a composite symbolic reward. Crucially, the same symbolic signals are reused at inference time to rerank a pool of K candidate patches. The two halves — router-gated symbolic RFT at training time and symbolic-feature-guided best-of-K selection at decoding time — are what deliver SynthFix's gains.


Headline results

We evaluate on deployable repair metrics, that is, whether a patch actually works, across five modern code LLMs. Functional pass@1 is the fraction of bugs whose patched program compiles and passes its held-out test suite. Security cleared is the fraction of SVEN vulnerabilities that Semgrep reports as removed. Candidate selection is leak-free: it sees only public tests and static signals, and it has a greedy floor. Best-of-K uses K=16. The RFT-only baseline is budget-matched (2-epoch SFT warmup + 2 RL epochs).

Functional pass@1 — pyrepair (Python, n=115) / CodeFlaws (C, n=389)

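| Model | pyrepair SFT | pyrepair RFT | pyrepair SynthFix | CodeFlaws SFT | CodeFlaws RFT | CodeFlaws SynthFix |
|---|---|---|---|---|---|---|
| DeepSeek-1.3B | 68.7 | 69.6 | 87.0 | 12.3 | 10.5 | 15.2 |
| Llama-3.2-3B | 73.9 | 73.9 | 85.2 | 14.4 | 13.6 | 22.1 |
| Qwen3-4B-Base | 80.9 | 80.9 | 91.3 | 18.8 | 15.9 | 22.6 |
| CodeLlama-7B | 72.2 | 75.7 | 81.7 | 12.9 | 12.6 | 18.8 |
| StarCoder2-7B | 79.1 | 76.5 | 93.0 | 15.7 | 15.7 | 22.6 |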

Security cleared — SVEN (Semgrep, n=16)

| Model | SFT | RFT | SynthFix |
|---|---|---|---|
| DeepSeek-1.3B | 87.5 | 87.5 | 100 |
| Llama-3.2-3B | 100 | 87.5 | 100 |
| Qwen3-4B-Base | 87.5 | 87.5 | 93.8 |
| CodeLlama-7B | 93.8 | 93.8 | 93.8 |
| StarCoder2-7B | 87.5 | 81.2 | 100 |

SynthFix matches or improves on SFT for every model and benchmark: it is strictly better in every pyrepair and CodeFlaws cell and ties only on SVEN (Llama-3.2-3B at the 100 ceiling, CodeLlama-7B at 93.8). The budget-matched RFT-only baseline tracks SFT and falls below it in several cells under greedy decoding, so training-time RL alone does not reliably translate into deployable repairs. Greedy CodeBLEU is reported only as a diagnostic (docs/RESULTS.md): the three methods cluster within ~1 CodeBLEU point even though their functional and security quality differs sharply, which is why we evaluate on execution and security. Full numbers, ablations, and the reproduction recipe are in docs/RESULTS.md.
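When reading the SVEN table it helps to remember that n=16 makes every example worth 6.25 points, so the only reachable scores are multiples of that step. The short sketch below reproduces the rounding; the helper name `percent` is ours and is not part of the repository.

```python
def percent(successes: int, total: int) -> float:
    """Share of successes as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * successes / total, 1)


SVEN_N = 16                      # SVEN test-set size quoted in the table title
step = 100.0 / SVEN_N            # one example moves the score by this much
reachable = {percent(k, SVEN_N) for k in range(SVEN_N + 1)}

# 13, 14, 15 and 16 cleared examples give the four values seen in the table.
table_values = [percent(k, SVEN_N) for k in (13, 14, 15, 16)]
```

A gap of 6.2 or 6.3 points between two SVEN cells therefore corresponds to a single example.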


Method

The training pipeline has three phases:

| Phase | What happens |
|---|---|
| 1. SFT warmup (2 epochs) | Standard cross-entropy fine-tuning on (buggy → fixed) pairs. Per-sample loss is recorded as a difficulty signal. |
| 2. Router pre-training | A 2-layer MLP router is supervised to predict "above-median loss" from code features: AST complexity, CFG depth, code length, and the current per-sample loss. |
| 3. Router-gated RFT (2 epochs) | Per batch: (a) compute the SFT anchor loss, (b) draw K on-policy continuations (RLOO baseline), (c) score each with the split symbolic reward, (d) gate the RL contribution by the router's probability that a sample is "hard", (e) combine as L = L_SFT + β · gate · A_LOO · CE_sampled, with a KL anchor to the SFT reference. |
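The arithmetic behind phases 2 and 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in src/train_synthfix.py: the function names are ours, the RL term is averaged over the K rollouts (the README does not say how it is reduced), the default β of 0.12 is borrowed from the --rl_beta value in the Usage example, and the KL anchor and the advantage clamping described under Router & gating are left out.

```python
from statistics import median


def hard_labels(sft_losses):
    """Router target: 1 if a sample's warmup loss is above the median, else 0."""
    m = median(sft_losses)
    return [1 if loss > m else 0 for loss in sft_losses]


def rloo_advantages(rewards):
    """Leave-one-out advantage: each reward minus the mean of the other K-1."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per sample")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def gated_loss(l_sft, gate, advantages, ce_sampled, beta=0.12):
    """L = L_SFT + beta * gate * mean_k(A_k * CE_k); the KL anchor is omitted."""
    rl = sum(a * ce for a, ce in zip(advantages, ce_sampled)) / len(advantages)
    return l_sft + beta * gate * rl


labels = hard_labels([0.2, 0.9, 0.5, 1.4])   # median is 0.7
adv = rloo_advantages([1.0, 0.0])            # K = 2 rollouts
easy = gated_loss(1.0, 0, adv, [2.0, 3.0])   # gate closed: SFT anchor only
hard = gated_loss(1.0, 1, adv, [2.0, 3.0])   # gate open: RL term added
```

With the gate closed the loss reduces to the SFT anchor, which is the behaviour the table describes for easy samples.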

Split symbolic reward

Every generated continuation is scored on multiple dimensions in [0, 1] (see src/models/), then combined into a composite reward and also exposed component-wise so the same signals can be reused as features for the inference-time reranker:

  • AST — syntactic validity / bracket balance (symbolic.py).
  • CFG / DFG — control- and data-flow fidelity vs. the reference via tree-sitter parses (parse_dfg.py).
  • Lint — language-specific static linting (lint_reward.py).
  • Security — Semgrep scan for known vulnerability patterns (used directly for SVEN; symbolic.py).
  • Execution — for execution-capable benchmarks, the held-out/public test pass rate of the patched program (exec_reward.py).
  • Repair-effect — reference-free signal that the edit actually changes the buggy region (repair_effect.py).
  • Surface similarity — character n-gram F-score (chrF) to the reference.
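To make the "component-wise plus composite" idea concrete, here is a small sketch with two of the components. It is an illustration under assumptions rather than the repository's reward.py: `ast_score` only checks that Python source parses, `chrf` is a simplified character n-gram F-score (n = 1..6, β = 2) without the refinements of reference implementations, and the weights passed to `composite_reward` are hypothetical.

```python
import ast
from collections import Counter


def ast_score(code: str) -> float:
    """1.0 if the snippet parses as Python source, else 0.0."""
    try:
        ast.parse(code)
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def _ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram precision and recall, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = _ngrams(hyp, n), _ngrams(ref, n)
        if not h or not r:
            continue  # one side is shorter than n characters
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    if p + rec == 0:
        return 0.0
    return (1 + beta ** 2) * p * rec / (beta ** 2 * p + rec)


def composite_reward(components: dict, weights: dict) -> float:
    """Weighted mean of per-dimension scores, each assumed to lie in [0, 1]."""
    total_w = sum(weights[k] for k in components)
    return sum(weights[k] * components[k] for k in components) / total_w


reference = "def inc(x):\n    return x + 1\n"
patch = "def inc(x):\n    return x + 1\n"
components = {"ast": ast_score(patch), "chrf": chrf(patch, reference)}
reward = composite_reward(components, {"ast": 1.0, "chrf": 1.0})
```

Keeping `components` as a dictionary is what lets the same values serve both as a scalar training reward and as reranker features.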

Router & gating

[f_AST, f_CFG, f_len, f_loss]  →  64 (ReLU)  →  64 (ReLU)  →  1 (σ)  →  P(hard)

Features are min-max normalized per batch. During phase 3 the RL term is gated by gate = 1[P(hard) ≥ 0.5]: easy samples receive only the SFT anchor, hard samples additionally receive the variance-reduced REINFORCE signal from RLOO. Advantages are normalized across the batch and clamped to the non-negative half-line, so updates only amplify rollouts that beat their leave-one-out baseline. The KL anchor keeps greedy decoding from collapsing.
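The router's forward pass and the hard gate are small enough to write out without a deep-learning library. The sketch below is illustrative only: weights are random rather than trained, all function names are ours, and the advantage normalization step is skipped because the README does not say which scheme is used; only the clamp is shown.

```python
import math
import random


def minmax_normalize(batch):
    """Scale each feature column to [0, 1] within the batch."""
    cols = list(zip(*batch))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in batch
    ]


def init_layer(n_in, n_out, rng):
    """Random weights for illustration; a real router is trained in phase 2."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(x, layer):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def p_hard(features, layers):
    """4 -> 64 (ReLU) -> 64 (ReLU) -> 1 (sigmoid)."""
    h = features
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in linear(h, layer)]
    (logit,) = linear(h, layers[-1])
    return 1.0 / (1.0 + math.exp(-logit))


def hard_gate(p, threshold=0.5):
    """gate = 1[P(hard) >= threshold]."""
    return 1 if p >= threshold else 0


def clamp_advantages(advantages):
    """Keep only rollouts that beat their leave-one-out baseline."""
    return [max(0.0, a) for a in advantages]


rng = random.Random(0)
layers = [init_layer(4, 64, rng), init_layer(64, 64, rng), init_layer(64, 1, rng)]
batch = [[12.0, 3.0, 240.0, 0.8], [4.0, 1.0, 60.0, 0.2]]  # [f_AST, f_CFG, f_len, f_loss]
normalized = minmax_normalize(batch)
probs = [p_hard(row, layers) for row in normalized]
gates = [hard_gate(p) for p in probs]
```

Because normalization is per batch, the same raw features can map to different router inputs depending on what else is in the batch.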

Inference-time selection

At decoding time SynthFix samples K=16 diverse candidates (mixed temperature), applies hard validity filters (parse / compile / public tests), and ranks the survivors with a learned reranker over the same symbolic features (src/models/inference.py). A greedy floor guarantees best-of-K never scores below the greedy patch, so inference-time selection can only help.
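The selection logic can be sketched as filter, rank, then fall back. This is one possible reading of the greedy floor, not the code in src/models/inference.py: here the greedy patch is returned when no candidate survives the filters or when no survivor outranks it. The `Candidate` type, the linear scorer standing in for the learned reranker, and the feature weights are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    patch: str
    parses: bool
    compiles: bool
    public_tests_pass: bool
    features: dict = field(default_factory=dict)  # symbolic scores in [0, 1]
    is_greedy: bool = False


def passes_filters(c: Candidate) -> bool:
    """Hard validity filters: parse, compile, public tests."""
    return c.parses and c.compiles and c.public_tests_pass


def score(c: Candidate, weights: dict) -> float:
    """Stand-in linear reranker over the symbolic features."""
    return sum(weights.get(k, 0.0) * v for k, v in c.features.items())


def select(candidates, weights):
    greedy = next(c for c in candidates if c.is_greedy)
    survivors = [c for c in candidates if passes_filters(c)]
    if not survivors:
        return greedy  # nothing valid: keep the greedy patch
    best = max(survivors, key=lambda c: score(c, weights))
    # Assumed greedy floor: switch only if the winner ranks strictly higher.
    if passes_filters(greedy) and score(best, weights) <= score(greedy, weights):
        return greedy
    return best


pool = [
    Candidate("greedy patch", True, False, False, is_greedy=True),
    Candidate("sample A", True, True, True, {"lint": 0.9, "security": 1.0}),
    Candidate("sample B", True, True, True, {"lint": 0.4, "security": 1.0}),
]
chosen = select(pool, {"lint": 1.0, "security": 1.0})
```

In this toy pool the greedy patch fails to compile, so the highest-ranked survivor is chosen instead.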


Installation

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Python 3.9+ and a CUDA-capable GPU are recommended (24 GB+ for 7B models). Parameter-efficient fine-tuning uses LoRA (rank 16, alpha 32, dropout 0.05) on the attention projections. Qwen3-4B-Base needs newer libraries (torch ≥ 2.5, transformers ≥ 4.51); the orchestrators can route it through a separate interpreter via the QWEN3_PY constant.
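For orientation, the LoRA hyperparameters above would look roughly like this if expressed as a Hugging Face PEFT `LoraConfig`. Whether the repository builds its adapter this way is an assumption, and so are the `target_modules` names, which are the usual attention-projection names in Llama-style checkpoints; check the model you load before reusing them.

```python
# Hyperparameters quoted in the README; module names are an assumption.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def build_lora_config():
    """Create a PEFT LoraConfig; imported lazily so the dict is usable alone."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


# LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
```

With rank 16 and alpha 32 the adapter update is scaled by a factor of 2.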


Models

| Model | Size | Hugging Face ID |
|---|---|---|
| DeepSeek-Coder | 1.3B | deepseek-ai/deepseek-coder-1.3b-base |
| Llama-3.2 | 3B | meta-llama/Llama-3.2-3B |
| Qwen3-Base | 4B | Qwen/Qwen3-4B-Base |
| CodeLlama | 7B | codellama/CodeLlama-7b-hf |
| StarCoder2 | 7B | bigcode/starcoder2-7b |

Any decoder-only LM exposed through AutoModelForCausalLM can be added by extending the MODEL_PATHS registry in src/train_synthfix.py.


Data

Three execution- and security-grounded benchmarks are used in the paper:

  • pyrepair (Python) — execution repair built from MBPP + QuixBugs (scripts/data/build_pyexec_benchmark.py).
  • CodeFlaws (C) — execution repair with per-bug test suites (scripts/data/build_codeflaws_with_tests.py).
  • SVEN (Python) — security repair, scored with Semgrep.

src/data/process_benchmarks.py converts raw benchmarks (CodeFlaws, SVEN, and FixJS) into a unified JSON schema where every record is {"buggy": str, "fixed": str, "language": str}, split into {train,val,test}.json. The loader in src/data/dataset.py consumes this schema directly, so any dataset expressible in it works without touching the training code. Benchmark data is not bundled with this repository; obtain it from the upstream sources and point the scripts at it via --data_dir (or the data/ defaults).
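A custom dataset only has to match the three-field schema. The sketch below validates records and writes the three split files; it assumes each split file holds a JSON list of records, and the helper names, the split fractions, and the toy records are ours rather than the behaviour of src/data/process_benchmarks.py.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = {"buggy": str, "fixed": str, "language": str}


def validate_record(rec: dict) -> dict:
    """Check that a record carries the three string fields of the schema."""
    for key, typ in REQUIRED.items():
        if not isinstance(rec.get(key), typ):
            raise ValueError(f"record field {key!r} must be a {typ.__name__}")
    return rec


def write_splits(records, out_dir, val_frac=0.1, test_frac=0.1):
    """Write train.json, val.json and test.json; return the split sizes."""
    records = [validate_record(r) for r in records]
    n_test = int(len(records) * test_frac)
    n_val = int(len(records) * val_frac)
    splits = {
        "test": records[:n_test],
        "val": records[n_test:n_test + n_val],
        "train": records[n_test + n_val:],
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        (out / f"{name}.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return {name: len(rows) for name, rows in splits.items()}


# Ten toy records, purely for illustration.
records = [
    {"buggy": f"x = {i} +", "fixed": f"x = {i} + 1", "language": "python"}
    for i in range(10)
]
with tempfile.TemporaryDirectory() as tmp:
    counts = write_splits(records, tmp)
    reloaded = json.loads((Path(tmp) / "train.json").read_text(encoding="utf-8"))
```

The resulting directory can then be passed to the scripts via --data_dir.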


Usage

Single training run

CUDA_VISIBLE_DEVICES=0 python scripts/train/run_all_experiments.py --worker \
    --method synthfix --model_name deepseek-1.3b --dataset_name codeflaws \
    --data_dir <path/to/codeflaws> --gpu 0 \
    --epochs 4 --sft_warmup_epochs 2 --batch_size 16 --lr 2e-4 \
    --rl_beta 0.12 --kl_beta 0.12 --rloo_k 2 --use_rich_exec_reward \
    --select_metric val_codebleu \
    --save_ckpt_to checkpoints/synthfix_deepseek-1.3b_codeflaws \
    --out results/synthfix_deepseek-1.3b_codeflaws.json

--method accepts sft, rft, or synthfix. The budget-matched RFT baseline uses the same --sft_warmup_epochs 2.

GPU selection. Always pin the device with CUDA_VISIBLE_DEVICES=<id> and pass --gpu 0. Passing a non-zero CUDA index directly produces empty generations for some LLaMA-architecture models; the sweep runners handle this automatically.
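If you launch workers from your own script, the same rule can be applied by setting the environment variable per process. This is a minimal sketch of that pattern, not the logic used by the sweep runners; the function name `pinned_launch` is hypothetical.

```python
import os
import sys


def pinned_launch(physical_gpu: int, worker_args: list):
    """Build a command and environment that expose one GPU as device 0."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(physical_gpu))
    cmd = [
        sys.executable, "scripts/train/run_all_experiments.py", "--worker",
        *worker_args,
        "--gpu", "0",  # always index 0 inside the pinned process
    ]
    return cmd, env


cmd, env = pinned_launch(3, ["--method", "sft", "--model_name", "deepseek-1.3b"])
# To actually start the worker: subprocess.run(cmd, env=env, check=True)
```

Inside the child process the chosen physical GPU is the only visible device, so index 0 always refers to it.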

Orchestrators (full experiment matrix)

python scripts/experiments/queue_artifact.py   # full (model × benchmark × {SFT,RFT,SynthFix}) matrix
python scripts/experiments/run_rft_matrix.py    # budget-matched RFT-only baseline matrix
python scripts/eval/run_eval_sweep.py           # deployable functional/security best-of-K eval sweep
python scripts/experiments/run_ablations.py     # RQ2 (router) and RQ3 (inference selection) ablations

These encode the paper's experiment grid and assume the directory layout in the Repository layout section (checkpoints under checkpoints/, results under results/); adjust the dataset paths near the top of each file to your own data locations.

Per-benchmark deployable evaluation

# pyrepair (Python functional pass@1)
python scripts/eval/eval_functional_py.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <pyrepair> --split test --model_name <model> --gpu 0 --K 16 \
    --out results/functional_pyrepair_<model>.json

# CodeFlaws (C functional pass@1)
python scripts/eval/eval_functional.py    --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <codeflaws> --split test --model_name <model> --gpu 0 --K 16 \
    --out results/functional_codeflaws_<model>.json

# SVEN (Semgrep security cleared)
python scripts/eval/eval_security_sven.py --sft_ckpt <ckpt> --synthfix_ckpt <ckpt> \
    --data <sven> --model_name <model> --gpu 0 --K 16 \
    --out results/security_sven_<model>.json

Add --greedy_only to scripts/eval/eval_functional*.py for a fast greedy-decoding pass@1 (used for the RFT-only baseline). scripts/analysis/make_paper_figs.py renders the paper figures from the result JSONs, and scripts/analysis/collect_rft.py / scripts/analysis/aggregate_final.py roll per-run JSONs into a single report.


Repository layout

SynthFix/
├── README.md
├── requirements.txt            # Python dependencies
├── LICENSE · CITATION.cff
├── configs/                    # DeepSpeed ZeRO configs
├── docs/
│   └── RESULTS.md              # Full results tables + reproduction recipe
├── scripts/                    # All runnable entry points, grouped by purpose
│   ├── train/
│   │   ├── run_all_experiments.py   # Single train/eval worker (sft | rft | synthfix)
│   │   ├── orchestrate_final.py     # End-to-end SFT → SynthFix → aggregate pipeline
│   │   └── orchestrate_twostage.py  # Two-stage (SFT warm-start) driver
│   ├── eval/
│   │   ├── eval_functional.py       # CodeFlaws functional pass@1 eval
│   │   ├── eval_functional_py.py    # pyrepair functional pass@1 eval
│   │   ├── eval_security_sven.py    # SVEN Semgrep security eval
│   │   ├── eval_strategies.py       # Inference-strategy comparison
│   │   ├── codebleu_one.py          # Standalone greedy-CodeBLEU eval
│   │   └── run_eval_sweep.py        # Deployable functional/security best-of-K sweep
│   ├── data/
│   │   ├── build_pyexec_benchmark.py    # Build pyrepair from MBPP + QuixBugs
│   │   └── build_codeflaws_with_tests.py # Attach per-bug tests to CodeFlaws
│   ├── experiments/
│   │   ├── queue_artifact.py        # Full (model × benchmark × method) matrix driver
│   │   ├── run_rft_matrix.py        # Budget-matched RFT-only baseline matrix
│   │   ├── run_ablations.py         # RQ2 (router) / RQ3 (inference) ablations
│   │   └── run_sensitivity.sh       # Reward-weight sensitivity sweep
│   └── analysis/
│       ├── make_paper_figs.py       # Render paper figures from result JSONs
│       ├── collect_rft.py           # Aggregate RFT/SynthFix metrics into tables
│       ├── aggregate_final.py       # Collect per-run JSONs into a report
│       └── diag_*.py                # Standalone diagnostics
└── src/                        # Importable library (no path setup needed)
    ├── train_synthfix.py       # Router-gated REINFORCE training + evaluation
    ├── train_baseline.py       # SFT / RFT baselines (shared data path)
    ├── data/                   # RepairDataset, dataloaders, benchmark processing
    └── models/
        ├── router.py           # MLP router + feature extraction
        ├── reward.py           # Composite symbolic reward
        ├── symbolic.py         # AST / security split features
        ├── parse_dfg.py        # tree-sitter parse, AST/CFG/DFG scores
        ├── lint_reward.py      # Language-specific linting reward
        ├── exec_reward.py      # Execution test-case reward
        ├── repair_effect.py    # Reference-free repair-effect reward
        └── inference.py        # Best-of-K candidate reranker

Every script resolves the repository root automatically, so it can be run from anywhere (e.g. python scripts/eval/eval_functional.py ...) without setting PYTHONPATH.


Citation

If you use this code in your research, please cite:

@inproceedings{synthfix2026,
  title     = {SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair},
  author    = {SynthFix Authors},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}

License

Released under the terms of the LICENSE file.
