Compositional Steering of Large Language Models with Steering Tokens

🧩 Overview

Steering Tokens steers large language models toward verifiable behaviors — response length, output language, formatting, structure — without fine-tuning the base model. Each behavior, expressed as a natural-language instruction, is compressed into a single learned steering token (a vector in the model's input space) via self-distillation. A dedicated <and> composition token then captures the general concept of composition, enabling zero-shot fusion of multiple behaviors (e.g. "respond in German and in 25–30 words") without training on every combination.

Because steering tokens live in the input space rather than interacting with model internals, they compose better than activation-steering or gist-token approaches, and a learned composition operator generalizes to unseen compositions, unseen behaviors, and an unseen number of composed behaviors.

📦 Installation

Requires Python ≥ 3.11. We recommend uv:

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .            # core: training + vLLM inference
uv pip install -e ".[tools]"   # + analysis & visualization tools

Or with plain pip:

pip install -e .
pip install -e ".[tools]"

📂 Dataset Setup

Training and evaluation datasets are not bundled with the repository. Place the JSON files under data/ following the paths used in the examples below (e.g. data/words/words_10_50_train.json).

🚀 Quickstart

Train a steering token

Distill a behavior into a steering token from a dataset of prompts and target responses:

torchrun --standalone --nnodes=1 --nproc-per-node=1 scripts/train.py \
    --model_name "meta-llama/Llama-3.2-3B-Instruct" \
    --json_paths "data/words/words_10_50_train.json" \
    --subset 0.1 \
    --run_validation \
    --val_data_path "data/test.json" \
    --gradient_checkpointing

Training uses torchrun; set --nproc-per-node to the number of GPUs. Trained tokens are written under models/<MODEL>/steering_tokens/<token_name>/. The orthogonality regularizer (--lambda_orth) is important for zero-shot composition; the <and> token is trained the same way, on a dataset of behavior pairs.

Run inference

Apply one or more steering tokens at generation time (vLLM-based):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 \
    --control-method steering \
    --silent

--control-method selects how behavior is controlled: steering (token only), text (instruction only), or hybrid (token + instruction).

Compose multiple behaviors

Pass several tokens to --steering to steer for multiple behaviors at once (e.g. respond in German and in 10–50 words):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 german \
    --control-method steering \
    --silent

For stronger zero-shot composition, insert the trained <and> composition token between the behaviors (quote it — < and > are shell redirections):

python scripts/inference.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --steering words_10_50 "<and>" german \
    --control-method steering \
    --silent

🎛️ Behaviors

Type	Examples
Length	`words_10_50`, `words_70_90`
Language	`german`, `spanish`, `french`, `italian`
Format	`title_case`
Structure	`sentences_3`
Composition	`<and>`

🔀 Ordering experiments

scripts/run_ordering_experiments.py evaluates compositional steering across token combinations and orderings using a single persistent vLLM instance. --token_combinations selects how many behaviors to compose (1–4), and --mode controls how they are joined: composition (<and> between tokens), concatenation (no operator), or native_and (literal <native_and> baseline). Set --tensor_parallel_size to the number of GPUs.

Compose with the <and> token (steering), text instructions (text), or both (hybrid):

# Steering tokens, <and> composition
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method steering \
    --mode composition \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --notes "Steering tokens with <and> composition"

# Text instructions
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method text \
    --mode concatenation \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --sample_random_instruction \
    --notes "Text instructions"

# Hybrid (steering tokens + text)
python scripts/run_ordering_experiments.py \
    --data_path data/test.json \
    --model_path models/Llama-3.2-3B-Instruct/ \
    --control_method hybrid \
    --mode composition \
    --token_combinations 1 2 3 4 \
    --num_samples 300 \
    --tensor_parallel_size 1 \
    --sample_random_instruction \
    --notes "Hybrid steering with <and> token"

Results are written as JSON and can be inspected with the analysis tools below.

🔍 Analysis tools

The tools/ scripts (install with .[tools]) help inspect results and tokens:

results_comparison.py / plot_results_comparison.py — compare experiment results
visualize_steering_tokens.py — PCA projection of learned tokens, colored by category
nearest_vocab_tokens.py — nearest vocabulary tokens to a steering embedding

⚖️ Quality evaluation

scripts/inference.py --evaluate_response_quality judges responses via an OpenAI-compatible endpoint. Configure it through environment variables:

export LLM_JUDGE_BASE_URL=http://localhost:8000/v1   # OpenAI-compatible server
export LLM_JUDGE_MODEL=Qwen3-30B-A3B-Instruct-2507-FP8
export LLM_JUDGE_API_KEY=EMPTY

📄 Citation

If you use this work, please cite:

@article{radevski2026compositional,
  title={Compositional Steering of Large Language Models with Steering Tokens},
  author={Radevski, Gorjan and Gashteovski, Kiril and Hong, Giwon and Lawrence, Carolin and Glava{\v{s}}, Goran},
  journal={arXiv preprint arXiv:2601.05062},
  year={2026}
}

📫 Contact

For questions, reach out to:

Gorjan Radevski — [email protected]
Kiril Gashteovski — [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
assets		assets
data		data
models		models
scripts		scripts
src/llm_compass		src/llm_compass
tools		tools
.flake8		.flake8
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Compositional Steering of Large Language Models with Steering Tokens

🧩 Overview

📦 Installation

📂 Dataset Setup

🚀 Quickstart

Train a steering token

Run inference

Compose multiple behaviors

🎛️ Behaviors

🔀 Ordering experiments

🔍 Analysis tools

⚖️ Quality evaluation

📄 Citation

📫 Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Compositional Steering of Large Language Models with Steering Tokens

🧩 Overview

📦 Installation

📂 Dataset Setup

🚀 Quickstart

Train a steering token

Run inference

Compose multiple behaviors

🎛️ Behaviors

🔀 Ordering experiments

🔍 Analysis tools

⚖️ Quality evaluation

📄 Citation

📫 Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages