Steering Tokens steers large language models toward verifiable behaviors —
response length, output language, formatting, structure — without fine-tuning
the base model. Each behavior, expressed as a natural-language instruction, is compressed
into a single learned steering token (a vector in the model's input space)
via self-distillation. A dedicated <and> composition token then captures
the general concept of composition, enabling zero-shot fusion of multiple
behaviors (e.g. "respond in German and in 25–30 words") without training on
every combination.
Because steering tokens live in the input space rather than interacting with model internals, they compose better than activation-steering or gist-token approaches. The learned composition operator also generalizes to unseen compositions, unseen behaviors, and unseen numbers of composed behaviors.
Requires Python ≥ 3.11. We recommend uv:
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e . # core: training + vLLM inference
uv pip install -e ".[tools]" # + analysis & visualization tools

Or with plain pip:
pip install -e .
pip install -e ".[tools]"

Training and evaluation datasets are not bundled with the repository. Place the
JSON files under data/ following the paths used in the examples below
(e.g. data/words/words_10_50_train.json).
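
Since the datasets have to be placed by hand, it can help to check them before starting a long run. The following is a minimal sketch, not part of the repository: the helper name `missing_or_invalid` is hypothetical, and the two paths are simply the ones that appear in the example commands in this README. It only checks that each file exists and parses as JSON; it makes no assumption about the schema.

```python
import json
from pathlib import Path

# Paths copied from the example commands in this README.
EXPECTED_FILES = [
    "data/words/words_10_50_train.json",
    "data/test.json",
]


def missing_or_invalid(root: str, relative_paths: list[str]) -> list[str]:
    """Return the relative paths that are absent or do not parse as JSON."""
    problems = []
    for rel in relative_paths:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(rel)
            continue
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except json.JSONDecodeError:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    bad = missing_or_invalid(".", EXPECTED_FILES)
    print("dataset files look usable" if not bad else f"fix these first: {bad}")
```

Run it from the repository root; an empty result means every listed file was found and parsed.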
Distill a behavior into a steering token from a dataset of prompts and target responses:
torchrun --standalone --nnodes=1 --nproc-per-node=1 scripts/train.py \
--model_name "meta-llama/Llama-3.2-3B-Instruct" \
--json_paths "data/words/words_10_50_train.json" \
--subset 0.1 \
--run_validation \
--val_data_path "data/test.json" \
--gradient_checkpointing

Training uses torchrun; set --nproc-per-node to the number of GPUs. Trained
tokens are written under models/<MODEL>/steering_tokens/<token_name>/. The
orthogonality regularizer (--lambda_orth) is important for zero-shot
composition; the <and> token is trained the same way, on a dataset of
behavior pairs.
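
This README does not spell out the exact form of the orthogonality regularizer, so the sketch below is an assumption for intuition only: it penalizes the mean squared cosine similarity between distinct steering vectors and adds it to a task loss with weight `lambda_orth`. The function names are hypothetical and plain Python lists stand in for tensors; the term implemented in scripts/train.py may differ.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def orthogonality_penalty(tokens: list[list[float]]) -> float:
    """Mean squared cosine similarity over all distinct pairs of vectors."""
    pairs = [(i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))]
    if not pairs:
        return 0.0
    return sum(cosine(tokens[i], tokens[j]) ** 2 for i, j in pairs) / len(pairs)


def regularized_loss(task_loss: float, tokens: list[list[float]], lambda_orth: float) -> float:
    # Assumed combination: task (distillation) loss plus a weighted penalty.
    return task_loss + lambda_orth * orthogonality_penalty(tokens)
```

Under this formulation the penalty is 0 when all steering vectors are mutually orthogonal and 1 when they are all parallel, which is one way to keep behaviors from collapsing onto the same direction.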
Apply one or more steering tokens at generation time (vLLM-based):
python scripts/inference.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--steering words_10_50 \
--control-method steering \
--silent

--control-method selects how behavior is controlled: steering (token only),
text (instruction only), or hybrid (token + instruction).
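
When launching inference from Python, for example in a sweep over control methods, the command can be assembled as an argument list. This is a sketch under the assumption that only the flags shown in the command above are needed; `build_inference_argv` is a hypothetical helper, not something the repository provides.

```python
CONTROL_METHODS = ("steering", "text", "hybrid")


def build_inference_argv(
    data_path: str,
    model_path: str,
    steering: list[str],
    control_method: str = "steering",
    silent: bool = True,
) -> list[str]:
    """Build the argv for scripts/inference.py using the flags shown in this README."""
    if control_method not in CONTROL_METHODS:
        raise ValueError(f"control_method must be one of {CONTROL_METHODS}")
    argv = [
        "python", "scripts/inference.py",
        "--data_path", data_path,
        "--model_path", model_path,
        "--steering", *steering,
        "--control-method", control_method,
    ]
    if silent:
        argv.append("--silent")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Because no shell is involved, token names need no quoting.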
Pass several tokens to --steering to steer for multiple behaviors at once
(e.g. respond in German and in 10–50 words):
python scripts/inference.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--steering words_10_50 german \
--control-method steering \
--silent

For stronger zero-shot composition, insert the trained <and> composition
token between the behaviors (quote it — < and > are shell redirections):
python scripts/inference.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--steering words_10_50 "<and>" german \
--control-method steering \
--silent

| Type | Examples |
|---|---|
| Length | words_10_50, words_70_90 |
| Language | german, spanish, french, italian |
| Format | title_case |
| Structure | sentences_3 |
| Composition | <and> |
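
The interleaving used in the <and> command above, together with the shell quoting it needs, can be generated rather than typed by hand. A small sketch using only the standard library; the helper names `interleave_and` and `steering_flag` are hypothetical, and the token names come from the table.

```python
import shlex


def interleave_and(behaviors: list[str], operator: str = "<and>") -> list[str]:
    """Place the composition token between consecutive behavior tokens."""
    out: list[str] = []
    for index, name in enumerate(behaviors):
        if index > 0:
            out.append(operator)
        out.append(name)
    return out


def steering_flag(behaviors: list[str]) -> str:
    """Render a shell-safe --steering argument string."""
    # shlex.join quotes any item containing shell metacharacters such as < and >.
    return "--steering " + shlex.join(interleave_and(behaviors))


print(steering_flag(["words_10_50", "german"]))
# prints: --steering words_10_50 '<and>' german
```

Single quotes are what `shlex` emits; they are equivalent to the double quotes used in the command above.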
scripts/run_ordering_experiments.py evaluates compositional steering across
token combinations and orderings using a single persistent vLLM instance.
--token_combinations selects how many behaviors to compose (1–4), and
--mode controls how they are joined: composition (<and> between tokens),
concatenation (no operator), or native_and (literal <native_and> baseline).
Set --tensor_parallel_size to the number of GPUs.
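
To get a feel for how many token sequences such an experiment covers, the sketch below enumerates every ordering of every subset of a given size and joins each one according to the mode. It is an illustration under assumptions, not the script's actual logic: the function name `orderings` is hypothetical, it assumes all permutations are evaluated, and it assumes native_and inserts the literal string <native_and> between tokens.

```python
from itertools import combinations, permutations
from typing import Iterator

# Assumed mapping from --mode to the item placed between behavior tokens.
JOINERS = {
    "composition": "<and>",
    "concatenation": None,
    "native_and": "<native_and>",
}


def orderings(tokens: list[str], sizes: list[int], mode: str) -> Iterator[list[str]]:
    """Yield every ordered token sequence for the requested subset sizes."""
    joiner = JOINERS[mode]
    for size in sizes:
        for subset in combinations(tokens, size):
            for order in permutations(subset):
                sequence: list[str] = []
                for index, name in enumerate(order):
                    if index > 0 and joiner is not None:
                        sequence.append(joiner)
                    sequence.append(name)
                yield sequence


tokens = ["words_10_50", "german", "title_case", "sentences_3"]
print(sum(1 for _ in orderings(tokens, [1, 2, 3, 4], "composition")))
# prints: 64
```

With four behaviors and sizes 1 through 4 this gives 4 + 12 + 24 + 24 = 64 sequences, which shows why a single persistent vLLM instance is useful.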
Compose with the <and> token (steering), text instructions (text), or both
(hybrid):
# Steering tokens, <and> composition
python scripts/run_ordering_experiments.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--control_method steering \
--mode composition \
--token_combinations 1 2 3 4 \
--num_samples 300 \
--tensor_parallel_size 1 \
--notes "Steering tokens with <and> composition"
# Text instructions
python scripts/run_ordering_experiments.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--control_method text \
--mode concatenation \
--token_combinations 1 2 3 4 \
--num_samples 300 \
--tensor_parallel_size 1 \
--sample_random_instruction \
--notes "Text instructions"
# Hybrid (steering tokens + text)
python scripts/run_ordering_experiments.py \
--data_path data/test.json \
--model_path models/Llama-3.2-3B-Instruct/ \
--control_method hybrid \
--mode composition \
--token_combinations 1 2 3 4 \
--num_samples 300 \
--tensor_parallel_size 1 \
--sample_random_instruction \
--notes "Hybrid steering with <and> token"

Results are written as JSON and can be inspected with the analysis tools below.
The tools/ scripts (install with .[tools]) help inspect results and tokens:
- results_comparison.py / plot_results_comparison.py: compare experiment results
- visualize_steering_tokens.py: PCA projection of learned tokens, colored by category
- nearest_vocab_tokens.py: nearest vocabulary tokens to a steering embedding
scripts/inference.py --evaluate_response_quality judges responses via an
OpenAI-compatible endpoint. Configure it through environment variables:
export LLM_JUDGE_BASE_URL=http://localhost:8000/v1 # OpenAI-compatible server
export LLM_JUDGE_MODEL=Qwen3-30B-A3B-Instruct-2507-FP8
export LLM_JUDGE_API_KEY=EMPTY

If you use this work, please cite:
@article{radevski2026compositional,
title={Compositional Steering of Large Language Models with Steering Tokens},
author={Radevski, Gorjan and Gashteovski, Kiril and Hong, Giwon and Lawrence, Carolin and Glava{\v{s}}, Goran},
journal={arXiv preprint arXiv:2601.05062},
year={2026}
}

For questions, reach out to:
- Gorjan Radevski — [email protected]
- Kiril Gashteovski — [email protected]
