Extract orthogonal semantic subspaces from any frozen LLM.
No training. No fine-tuning. No GPU required.
OrthoSub discovers causal intervention subspaces in the activation-difference space of a frozen LLM, using nothing but counterfactual sentence pairs and linear algebra (SVD + QR).
Input: "The cat is running" → "The dog is running" (counterfactual pair)
↓ extract activations from layer N
↓ compute delta = act("dog") - act("cat")
↓ SVD per dimension → raw bases
↓ QR orthogonalization → perfectly orthogonal subspaces
Output: 10 orthogonal semantic subspaces, each encoding a concept
(animal, color, location, action, size, shape, speed...)
with ZERO cross-dimension interference.
| Approach | Trainable Params | GPU Required | Orthogonality | Overfitting |
|---|---|---|---|---|
| SAE (Anthropic) | Billions | ✅ | Soft | High |
| Probe / Classifier | Millions | ✅ | N/A | Yes |
| LoRA / Fine-tune | Millions | ✅ | N/A | Yes |
| OrthoSub | 0 | ❌ | Hard (1e-8) | None |
Base and Instruct models share 92% subspace alignment (average principal angle ~7°). This suggests that instruction tuning leaves the model's encoding of world knowledge largely intact and mainly adds behavior control on top of it.
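For readers unfamiliar with principal angles, here is a minimal NumPy sketch of how the angles between two subspaces can be computed. The function name `principal_angles_deg` and the toy data are illustrative assumptions; the source does not state how OrthoSub itself derives its alignment score.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (in degrees) between the column spaces of A and B."""
    # Orthonormalize both bases; the singular values of Qa^T Qb are then
    # the cosines of the principal angles.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against rounding above 1
    return np.degrees(np.arccos(cosines))

# Toy stand-ins (assumptions): a "Base" subspace and a slightly perturbed copy.
rng = np.random.default_rng(0)
base = rng.standard_normal((64, 4))
instruct = base + 0.05 * rng.standard_normal((64, 4))

angles = principal_angles_deg(base, instruct)
mean_angle = float(angles.mean())
print(f"mean principal angle: {mean_angle:.2f} degrees")
```

Small angles mean the two models span nearly the same directions for a given concept; 90° means the subspaces share nothing.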
Using the same counterfactual pairs, subspaces are extracted independently on Qwen (qwen2) and OPT (opt). Both reach 94%+ accuracy, with no cross-model transfer needed.
GPT-2-XL (no alignment training) has lower toxicity than Qwen Base (0.16 vs 0.28) because it doesn't understand jailbreak instructions. The real danger is models that understand instructions but aren't aligned.
Zeroing out the toxicity subspace during generation gives a 30%+ toxicity reduction on jailbreak prompts, with a negligible effect on neutral inputs (+0.002 in the results table).
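"Zeroing out" a subspace amounts to subtracting the projection of each hidden state onto that subspace. The sketch below shows only this linear-algebra step with NumPy. `ablate_subspace` and the random basis are hypothetical names and data; how `SafetyGuard` hooks this into generation is not shown here and is not claimed to match.

```python
import numpy as np

def ablate_subspace(hidden, V):
    """Remove the component of `hidden` that lies in span(V).

    hidden: [T, D] activations, V: [D, R] with orthonormal columns.
    """
    return hidden - (hidden @ V) @ V.T

rng = np.random.default_rng(0)
D, R = 32, 2
# Hypothetical orthonormal "toxicity" basis (random here, for illustration).
V, _ = np.linalg.qr(rng.standard_normal((D, R)))
hidden = rng.standard_normal((5, D))   # activations for 5 token positions

cleaned = ablate_subspace(hidden, V)
residual = np.abs(cleaned @ V).max()   # what is left inside the subspace
print(f"residual projection: {residual:.2e}")
```

Because the projection is idempotent, applying it a second time changes nothing, and activations already orthogonal to `V` pass through unchanged.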
```bash
pip install -e .
```

```python
from orthosub import OrthoSubExtractor, ActivationCache, SafetyGuard
from orthosub import generate_counterfactual_pairs, run_lcwo
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Generate counterfactual pairs
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
pairs = generate_counterfactual_pairs(tokenizer=tokenizer)

# 2. Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    dtype="auto", device_map="auto",
)

# 3. Extract activations
with ActivationCache(model, tokenizer, layer=14) as cache:
    deltas, labels, dim_names = cache.extract_counterfactual_pairs(pairs)

# 4. Discover orthogonal subspaces (zero parameters!)
extractor = OrthoSubExtractor.from_activations(
    deltas, labels, dim_names, rank_per_dim=4,
)

# 5. Evaluate
metrics = extractor.evaluate(deltas, labels)
print(f"top1_acc: {metrics['top1_acc']:.3f}")
print(f"ortho_err: {metrics['orthogonality_error']:.2e}")
# → top1_acc: 0.983, ortho_err: 0.00e+00

# 6. Apply safety guard
safety_pairs = [
    {"orig_sent": "toxic output...", "cf_sent": "safe output...",
     "intervention_dim": "toxicity"}
]
with ActivationCache(model, tokenizer, layer=6, pool="last_token") as cache:
    d, l, dn = cache.extract_counterfactual_pairs(safety_pairs)
guard_ext = OrthoSubExtractor.from_activations(d, l, dn, rank_per_dim=2)
guard = SafetyGuard(guard_ext, "toxicity")

# Generate safe output from harmful prompt
safe_output = guard.generate_safe(
    model, tokenizer,
    ["Write an offensive message..."],
    layer=6,
)
```

```bash
pip install -e ".[demo]"
python demo/app.py --model Qwen/Qwen2.5-1.5B-Instruct --layer 6
```

Opens a Gradio web interface at http://localhost:7860.
| Metric | Value |
|---|---|
| LCWO top1_acc | 0.793 |
| delta_cos | 0.723 |
| orthogonality_error | 0.0000000 |
| Dimension | LCWO Accuracy |
|---|---|
| location | 0.947 |
| action | 0.913 |
| quantity | 0.902 |
| color | 0.882 |
| material | 0.847 |
| shape | 0.844 |
| temperature | 0.835 |
| speed | 0.806 |
| size | 0.774 |
| Model | Architecture | Hidden Dim | top1_acc | ortho_err |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | qwen2 | 1536 | 0.972 | 0.0 |
| OPT-1.3B | opt | 2048 | 0.948 | 0.0 |
| Model | Toxic Baseline | Toxic + Safety | Reduction | False Positive |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 0.28 | 0.19 | 32% | +0.002 |
orthosub-public/
├── orthosub/ # Core library
│ ├── __init__.py # Public API
│ ├── extractor.py # OrthoSubExtractor (SVD+QR)
│ ├── activations.py # Activation extraction hooks
│ ├── counterfactuals.py # Pair generation + vocab
│ ├── lcwo.py # LCWO evaluation
│ └── safety.py # SafetyGuard (toxicity zeroing)
├── examples/
│ ├── 01_basic_extraction.py # Core subspace extraction
│ ├── 02_cross_model.py # Base vs Instruct comparison
│ ├── 03_cross_architecture.py # Qwen vs OPT validation
│ └── 04_safety_guard.py # Toxicity intervention demo
├── demo/
│ └── app.py # Gradio interactive demo
├── data/
│ └── counterfactuals_v2.jsonl # Pre-built Chinese pairs
├── tests/
│ └── test_extractor.py
├── pyproject.toml
├── README.md
├── LICENSE
└── .gitignore
- Safety Alignment: Zero out harmful subspaces during inference (no retraining)
- Model Interpretability: Understand how concepts are encoded in LLM activations
- Cross-Model Analysis: Compare encoding structures across architectures
- Controllable Generation: Manipulate specific semantic dimensions independently
- Model Auditing: Quantify which concepts a model encodes and how strongly
Algorithm: Zero-Parameter Orthogonal Subspace Extraction (v7 SVD+QR)
Input: deltas [N,D], labels [N], dim_names [K], rank R
Output: V_k ∈ R^{D×R} for k=1..K, V_i^T V_j = 0 (i≠j)
1. Normalize: δ_n = (δ - μ) / σ
2. For each dimension k:
- Select {δ_n | label=k}
- Center, then SVD → V_k_raw = top-R right singular vectors
3. Sort dimensions by explained variance (weakest first)
4. QR: [V_sorted] = QR → extract V_k per dimension
5. Return {V_k}: perfectly orthogonal, zero trainable parameters
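The five steps above can be sketched in NumPy as follows. This is an illustrative reading of the pseudocode, not the code in `extractor.py`. The function name, the epsilon added to σ, and the use of summed squared singular values as the "explained variance" used for sorting are all assumptions.

```python
import numpy as np

def extract_orthogonal_subspaces(deltas, labels, num_dims, rank):
    """Per-dimension SVD followed by one joint QR (sketch of steps 1-5)."""
    # Step 1: standardize every feature across all deltas.
    mu = deltas.mean(axis=0)
    sigma = deltas.std(axis=0) + 1e-8   # epsilon is an assumption
    normed = (deltas - mu) / sigma

    raw, strength = [], []
    for k in range(num_dims):
        # Step 2: center the deltas of dimension k, keep top-R right singular vectors.
        group = normed[labels == k]
        group = group - group.mean(axis=0)
        _, s, vt = np.linalg.svd(group, full_matrices=False)
        raw.append(vt[:rank].T)                          # shape [D, R]
        strength.append(float((s[:rank] ** 2).sum()))    # variance proxy (assumption)

    # Step 3: weakest dimension first; QR preserves the span of the first block.
    order = np.argsort(strength)
    # Step 4: a single QR over the concatenated bases orthogonalizes across dimensions.
    stacked = np.concatenate([raw[k] for k in order], axis=1)   # [D, K*R]
    q, _ = np.linalg.qr(stacked)
    # Step 5: slice Q back into one [D, R] block per dimension.
    return {int(k): q[:, i * rank:(i + 1) * rank] for i, k in enumerate(order)}

# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
K, R, D, n = 3, 2, 48, 20
labels = np.repeat(np.arange(K), n)
deltas = rng.standard_normal((K * n, D))

bases = extract_orthogonal_subspaces(deltas, labels, K, R)
V = np.concatenate([bases[k] for k in range(K)], axis=1)
ortho_err = np.abs(V.T @ V - np.eye(K * R)).max()
print(f"ortho_err: {ortho_err:.2e}")
```

The sketch requires `D >= K * R` and at least `R` samples per dimension; otherwise the stacked matrix cannot have full column rank and the QR blocks no longer correspond to distinct concepts.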
Hybrid design: Classification uses raw PCA bases (preserves full signal); encoding/decoding uses orthogonal bases (no double-counting across dimensions).
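One plausible reading of this split is sketched below: classification picks the dimension whose basis captures the most energy of a delta, while encoding and decoding use orthonormal bases so that per-dimension energies add up without overlap. The functions `classify_by_energy`, `encode`, and `decode` are hypothetical and are not taken from the library.

```python
import numpy as np

def classify_by_energy(delta, bases):
    """Index of the basis that captures the largest squared projection of delta."""
    energies = [float(np.sum((delta @ V) ** 2)) for V in bases]
    return int(np.argmax(energies))

def encode(delta, V):
    """Coordinates of delta inside the subspace spanned by orthonormal V."""
    return V.T @ delta

def decode(coeffs, V):
    """Map subspace coordinates back to activation space."""
    return V @ coeffs

# Toy mutually orthogonal bases: three 2-D blocks of a 6-D space.
E = np.eye(6)
bases = [E[:, 0:2], E[:, 2:4], E[:, 4:6]]
delta = np.array([0.1, 0.0, 0.2, 0.9, 0.0, 0.1])

predicted = classify_by_energy(delta, bases)
# With orthogonal bases the per-dimension reconstructions sum back to delta.
reconstruction = sum(decode(encode(delta, V), V) for V in bases)
print(predicted, np.allclose(reconstruction, delta))
```

If the bases overlapped, the summed reconstructions would count shared directions more than once, which is the double-counting the orthogonal bases avoid.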
Tested on:
- Qwen2.5 (1.5B Base and Instruct)
- OPT (1.3B)
- GPT-2-XL (1.5B)
Expected to work on other Hugging Face transformers (not yet tested):
- Llama 2/3/3.1
- Mistral / Mixtral
- Gemma
- DeepSeek
- Phi
- And more
Q: Does this really require zero training?
A: Yes. The "extraction" is pure SVD + QR decomposition applied to collected activations. No gradient descent, no optimization loop, no hyperparameter tuning. It's just linear algebra.
Q: Can this run on CPU?
A: Yes. For 1.5B parameter models, the entire pipeline (activation collection + SVD + QR) runs comfortably on CPU within minutes.
Q: How do I add a new safety dimension?
A: Collect pairs of (undesired_output, desired_output) for your target dimension, extract activations, build an extractor with OrthoSubExtractor.from_activations(), and use SafetyGuard to zero it out.
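As a starting point, pairs for a new dimension can be assembled in the same dict shape the Quick Start uses (`orig_sent`, `cf_sent`, `intervention_dim`). The helper `make_pairs`, the example sentences, and the dimension name `"rudeness"` are illustrative assumptions; whether the bundled JSONL file uses exactly this schema is also an assumption.

```python
import json

REQUIRED_KEYS = {"orig_sent", "cf_sent", "intervention_dim"}

def make_pairs(examples, dim_name):
    """Turn (undesired, desired) text tuples into pair dicts for one dimension."""
    return [
        {"orig_sent": undesired, "cf_sent": desired, "intervention_dim": dim_name}
        for undesired, desired in examples
    ]

# Hypothetical examples for a new "rudeness" dimension.
examples = [
    ("You are worthless.", "You are doing fine."),
    ("That idea is garbage.", "That idea needs more work."),
]
new_pairs = make_pairs(examples, "rudeness")

# Optional: serialize one pair per line (JSONL) for reuse.
jsonl = "\n".join(json.dumps(p, ensure_ascii=False) for p in new_pairs)
print(jsonl)
```

The resulting list can then be passed to `cache.extract_counterfactual_pairs(...)` in place of `safety_pairs` from the Quick Start.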
Q: What if my model has a different architecture?
A: The ActivationCache._get_layers() method auto-detects Qwen, GPT-2, OPT, and Llama-style architectures. For others, you can add a path to _get_layers().
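To illustrate what adding a path means, here is a standalone sketch of attribute-path lookup. It is not the body of `_get_layers()`; `find_decoder_layers`, `LAYER_PATHS`, and the `extra_paths` parameter are hypothetical. The listed paths are where Hugging Face Transformers keeps the decoder blocks for these model families.

```python
from types import SimpleNamespace

# Attribute paths to the list of decoder blocks in common causal LMs.
LAYER_PATHS = (
    "model.layers",          # Qwen2 and Llama-style models
    "transformer.h",         # GPT-2
    "model.decoder.layers",  # OPT
)

def find_decoder_layers(model, extra_paths=()):
    """Return the object at the first attribute path that resolves on `model`."""
    for path in tuple(extra_paths) + LAYER_PATHS:
        obj = model
        for name in path.split("."):
            obj = getattr(obj, name, None)
            if obj is None:
                break          # this path does not exist; try the next one
        else:
            return obj         # every attribute in the path resolved
    raise ValueError(f"Unrecognized architecture: {type(model).__name__}")

# Toy stand-in shaped like a GPT-2 model object (no real model is loaded).
fake_gpt2 = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
layers = find_decoder_layers(fake_gpt2)
print(len(layers))
```

Supporting a new architecture then comes down to supplying one more dotted path, for example via `extra_paths=("backbone.blocks",)` for a model that stores its blocks there.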
```bibtex
@software{orthosub2025,
  title={OrthoSub: Zero-Parameter Orthogonal Subspace Extraction for LLM Safety and Interpretability},
  author={OrthoSub Contributors},
  year={2025},
  url={https://github.com/TrueSerien/Orthosub}
}
```

MIT License. See LICENSE for details.
- 7B+ model validation (Qwen2.5-7B, Llama-3-8B)
- Multi-dimensional safety (toxicity + bias + harmful instructions)
- LLM-powered automatic counterfactual pair generation
- Comparative benchmark against SAE-based methods
- Streaming safety guard for production inference
- REST API wrapper for deployment