Skip to content

nec-research/SteeringTokens

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Compositional Steering of Large Language Models with Steering Tokens

arXiv ACL 2026

Steering Tokens overview

🧩 Overview

Steering Tokens steers large language models toward verifiable behaviors — response length, output language, formatting, structure — without fine-tuning the base model. Each behavior, expressed as a natural-language instruction, is compressed into a single learned steering token (a vector in the model's input space) via self-distillation. A dedicated <and> composition token then captures the general concept of composition, enabling zero-shot fusion of multiple behaviors (e.g. "respond in German and in 25–30 words") without training on every combination.

Because steering tokens live in the input space rather than interacting with model internals, they compose better than activation-steering or gist-token approaches, and a learned composition operator generalizes to unseen compositions, unseen behaviors, and an unseen number of composed behaviors.

📦 Installation

Requires Python ≥ 3.11. We recommend uv:

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .            # core: training + vLLM inference
uv pip install -e ".[tools]"   # + analysis & visualization tools

Or with plain pip:

pip install -e .
pip install -e ".[tools]"

📂 Dataset Setup

Training and evaluation datasets are not bundled with the repository. Place the JSON files under data/ following the paths used in the examples below (e.g. data/words/words_10_50_train.json).

🚀 Quickstart

Train a steering token

Distill a behavior into a steering token from a dataset of prompts and target responses:

torchrun --standalone --nnodes=1 --nproc-per-node=1 scripts/train.py \
    --model_name "meta-llama/Llama-3.2-3B-Instruct" \
    --json_paths "data/words/words_10_50_train.json" \
    --subset 0.1 \
    --run_validation \
    --val_data_path "data/test.json" \
    --gradient_checkpointing

Training uses torchrun; set --nproc-per-node to the number of GPUs. Trained tokens are written under models/<MODEL>/steering_tokens/<token_name>/. The orthogonality regularizer (--lambda_orth) is important for zero-shot composition; the <and> token is trained the same way, on a dataset of behavior pairs.

Run inference

Apply one or more steering tokens at generation time (vLLM-based):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 \
    --control-method steering \
    --silent

--control-method selects how behavior is controlled: steering (token only), text (instruction only), or hybrid (token + instruction).

Compose multiple behaviors

Pass several tokens to --steering to steer for multiple behaviors at once (e.g. respond in German and in 10–50 words):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 german \
    --control-method steering \
    --silent

For stronger zero-shot composition, insert the trained <and> composition token between the behaviors (quote it — < and > are shell redirections):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 "<and>" german \
    --control-method steering \
    --silent

🎛️ Behaviors

Type Examples
Length words_10_50, words_70_90
Language german, spanish, french, italian
Format title_case
Structure sentences_3
Composition <and>

🔀 Ordering experiments

scripts/run_ordering_experiments.py evaluates compositional steering across token combinations and orderings using a single persistent vLLM instance. --token_combinations selects how many behaviors to compose (1–4), and --mode controls how they are joined: composition (<and> between tokens), concatenation (no operator), or native_and (literal <native_and> baseline). Set --tensor_parallel_size to the number of GPUs.

Compose with the <and> token (steering), text instructions (text), or both (hybrid):

# Steering tokens, <and> composition
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method steering \
    --mode composition \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --notes "Steering tokens with <and> composition"

# Text instructions
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method text \
    --mode concatenation \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --sample_random_instruction \
    --notes "Text instructions"

# Hybrid (steering tokens + text)
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method hybrid \
    --mode composition \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --sample_random_instruction \
    --notes "Hybrid steering with <and> token"

Results are written as JSON and can be inspected with the analysis tools below.

🔍 Analysis tools

The tools/ scripts (install with .[tools]) help inspect results and tokens:

  • results_comparison.py / plot_results_comparison.py — compare experiment results
  • visualize_steering_tokens.py — PCA projection of learned tokens, colored by category
  • nearest_vocab_tokens.py — nearest vocabulary tokens to a steering embedding

⚖️ Quality evaluation

scripts/inference.py --evaluate_response_quality judges responses via an OpenAI-compatible endpoint. Configure it through environment variables:

export LLM_JUDGE_BASE_URL=http://localhost:8000/v1   # OpenAI-compatible server
export LLM_JUDGE_MODEL=Qwen3-30B-A3B-Instruct-2507-FP8
export LLM_JUDGE_API_KEY=EMPTY

📄 Citation

If you use this work, please cite:

@article{radevski2026compositional,
  title={Compositional Steering of Large Language Models with Steering Tokens},
  author={Radevski, Gorjan and Gashteovski, Kiril and Hong, Giwon and Lawrence, Carolin and Glava{\v{s}}, Goran},
  journal={arXiv preprint arXiv:2601.05062},
  year={2026}
}

📫 Contact

For questions, reach out to:

About

Repository for the ACL 2026 paper "Compositional Steering of Large Language Models with Steering Tokens": https://arxiv.org/pdf/2601.05062

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages