OrthoSub — Zero-Parameter LLM Safety & Interpretability

Extract orthogonal semantic subspaces from any frozen LLM.
No training. No fine-tuning. No GPU required.

What is OrthoSub?

OrthoSub discovers causal intervention subspaces in the activation difference space of any LLM—using nothing but counterfactual sentence pairs and linear algebra (SVD + QR).

Input:  "The cat is running" → "The dog is running"  (counterfactual pair)
        ↓ extract activations from layer N
        ↓ compute delta = act("dog") - act("cat")
        ↓ SVD per dimension → raw bases
        ↓ QR orthogonalization → perfectly orthogonal subspaces

Output: 10 orthogonal semantic subspaces, each encoding a concept
        (animal, color, location, action, size, shape, speed...)
        with ZERO cross-dimension interference.

Why "zero parameter" matters

Approach	Trainable Params	GPU Required	Orthogonality	Overfitting
SAE (Anthropic)	Billions	✅	Soft	High
Probe / Classifier	Millions	✅	N/A	Yes
LoRA / Fine-tune	Millions	✅	N/A	Yes
OrthoSub	0	❌	Hard (1e-8)	None

Key Findings

1. Semantic subspaces are invariant across model versions

Base and Instruct models share 92% subspace alignment (average principal angle ~7°). This suggests that instruction tuning leaves the model's encoding of world knowledge largely intact and mainly adds behavior control on top of it.
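
Principal angles between two subspaces can be computed from the singular values of the product of their orthonormal bases. The sketch below is one standard way to do this in NumPy. The function name `principal_angles_deg` and the synthetic "base" and "tuned" matrices are illustrative assumptions, and how the 92% alignment figure is derived from the angles is not specified here, so the mean cosine is shown only as one plausible summary.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (in degrees) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against rounding error
    return np.degrees(np.arccos(cosines))

# Synthetic stand-ins for the same dimension's basis in two model versions.
rng = np.random.default_rng(0)
base = rng.standard_normal((64, 4))
tuned = base + 0.05 * rng.standard_normal((64, 4))  # small perturbation

angles = principal_angles_deg(base, tuned)
mean_cos = float(np.mean(np.cos(np.radians(angles))))
print(angles.round(2), round(mean_cos, 3))
```

Small angles mean the two subspaces nearly coincide; identical subspaces give angles of zero.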

2. Cross-architecture without any mapper

Same counterfactual pairs → extract subspaces independently on Qwen (qwen2) and OPT (opt). Both achieve 94%+ accuracy with zero cross-model transfer needed.

3. "Toxicity = Understanding + Misalignment"

GPT-2-XL (no alignment training) has lower toxicity than Qwen Base (0.16 vs 0.28) because it doesn't understand jailbreak instructions. The real danger is models that understand instructions but aren't aligned.

4. Safety without training

Zeroing out the toxicity subspace during generation reduces toxicity on jailbreak prompts by about 32% (0.28 → 0.19 on Qwen2.5-1.5B-Instruct), with a negligible shift (+0.002) on neutral inputs.
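
Zeroing a subspace amounts to projecting hidden states onto its orthogonal complement. The following is a minimal NumPy sketch of that projection, assuming the subspace basis has orthonormal columns. The function name `ablate_subspace` is hypothetical, and the actual SafetyGuard may differ (for example, by working in the normalized activation space or inside a forward hook).

```python
import numpy as np

def ablate_subspace(hidden, V):
    """Remove the component of each hidden state that lies in span(V).

    hidden: [T, D] hidden states; V: [D, R] with orthonormal columns.
    """
    return hidden - (hidden @ V) @ V.T

rng = np.random.default_rng(0)
D, R = 32, 2
V, _ = np.linalg.qr(rng.standard_normal((D, R)))  # toy "toxicity" basis
hidden = rng.standard_normal((5, D))              # toy hidden states

cleaned = ablate_subspace(hidden, V)
print(np.abs(cleaned @ V).max())  # residual component inside the subspace
```

Because the projection is linear and idempotent, applying it twice gives the same result as applying it once, and any vector already orthogonal to the subspace passes through unchanged.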

Quick Start

Install

pip install -e .

5-minute demo

from orthosub import OrthoSubExtractor, ActivationCache, SafetyGuard
from orthosub import generate_counterfactual_pairs, run_lcwo
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Generate counterfactual pairs
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
pairs = generate_counterfactual_pairs(tokenizer=tokenizer)

# 2. Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    dtype="auto", device_map="auto",
)

# 3. Extract activations
with ActivationCache(model, tokenizer, layer=14) as cache:
    deltas, labels, dim_names = cache.extract_counterfactual_pairs(pairs)

# 4. Discover orthogonal subspaces (zero parameters!)
extractor = OrthoSubExtractor.from_activations(
    deltas, labels, dim_names, rank_per_dim=4,
)

# 5. Evaluate
metrics = extractor.evaluate(deltas, labels)
print(f"top1_acc: {metrics['top1_acc']:.3f}")
print(f"ortho_err: {metrics['orthogonality_error']:.2e}")
# Example output: top1_acc: 0.983, ortho_err: 0.00e+00
# (exact values depend on the model, layer, and counterfactual pairs)

# 6. Apply safety guard
# (one placeholder pair is shown; a usable subspace needs many real toxic/safe pairs)
safety_pairs = [
    {"orig_sent": "toxic output...", "cf_sent": "safe output...",
     "intervention_dim": "toxicity"}
]
with ActivationCache(model, tokenizer, layer=6, pool="last_token") as cache:
    d, l, dn = cache.extract_counterfactual_pairs(safety_pairs)
guard_ext = OrthoSubExtractor.from_activations(d, l, dn, rank_per_dim=2)
guard = SafetyGuard(guard_ext, "toxicity")

# Generate safe output from harmful prompt
safe_output = guard.generate_safe(
    model, tokenizer,
    ["Write an offensive message..."],
    layer=6,
)

Interactive Demo

pip install -e ".[demo]"
python demo/app.py --model Qwen/Qwen2.5-1.5B-Instruct --layer 6

Opens a Gradio web interface at http://localhost:7860.

Results

Qwen2.5-1.5B-Instruct (Layer 14, rank=4)

Metric	Value
LCWO top1_acc	0.793
delta_cos	0.723
orthogonality_error	0.0000000

Dimension	LCWO Accuracy
location	0.947
action	0.913
quantity	0.902
color	0.882
material	0.847
shape	0.844
temperature	0.835
speed	0.806
size	0.774

Cross-Architecture (Qwen 1.5B vs OPT 1.3B)

Model	Architecture	Hidden Dim	top1_acc	ortho_err
Qwen2.5-1.5B-Instruct	qwen2	1536	0.972	0.0
OPT-1.3B	opt	2048	0.948	0.0

Safety Intervention

Model	Toxic Baseline	Toxic + Safety	Reduction	False Positive
Qwen2.5-1.5B-Instruct	0.28	0.19	32%	+0.002

Project Structure

orthosub-public/
├── orthosub/                 # Core library
│   ├── __init__.py           # Public API
│   ├── extractor.py          # OrthoSubExtractor (SVD+QR)
│   ├── activations.py        # Activation extraction hooks
│   ├── counterfactuals.py    # Pair generation + vocab
│   ├── lcwo.py               # LCWO evaluation
│   └── safety.py             # SafetyGuard (toxicity zeroing)
├── examples/
│   ├── 01_basic_extraction.py      # Core subspace extraction
│   ├── 02_cross_model.py           # Base vs Instruct comparison
│   ├── 03_cross_architecture.py    # Qwen vs OPT validation
│   └── 04_safety_guard.py          # Toxicity intervention demo
├── demo/
│   └── app.py                # Gradio interactive demo
├── data/
│   └── counterfactuals_v2.jsonl    # Pre-built Chinese pairs
├── tests/
│   └── test_extractor.py
├── pyproject.toml
├── README.md
├── LICENSE
└── .gitignore

Use Cases

Safety Alignment: Zero out harmful subspaces during inference (no retraining)
Model Interpretability: Understand how concepts are encoded in LLM activations
Cross-Model Analysis: Compare encoding structures across architectures
Controllable Generation: Manipulate specific semantic dimensions independently
Model Auditing: Quantify which concepts a model encodes and how strongly

Algorithm

Algorithm: Zero-Parameter Orthogonal Subspace Extraction (v7 SVD+QR)

Input: deltas [N,D], labels [N], dim_names [K], rank R
Output: V_k ∈ R^{D×R} for k=1..K, V_i^T V_j = 0 (i≠j)

1. Normalize: δ_n = (δ - μ) / σ
2. For each dimension k:
   - Select {δ_n | label=k}
   - Center, then SVD → V_k_raw = top-R right singular vectors
3. Sort dimensions by explained variance (weakest first)
4. QR: concatenate the sorted bases, [V_sorted] = Q R, then slice Q into one V_k per dimension
5. Return {V_k}: perfectly orthogonal, zero trainable parameters
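
The steps above can be sketched in NumPy as follows. This is an illustrative reading of the pseudocode, not the library's implementation: the function name `extract_orthogonal_subspaces` is hypothetical, "explained variance" is taken to be the sum of the top-R squared singular values, and each dimension is assumed to have more than R samples with K·R ≤ D.

```python
import numpy as np

def extract_orthogonal_subspaces(deltas, labels, num_dims, rank):
    """Return (raw_bases, ortho_bases), each a list of [D, rank] arrays."""
    # 1. Normalize each feature across all deltas.
    mu = deltas.mean(axis=0)
    sigma = deltas.std(axis=0) + 1e-8  # epsilon avoids division by zero
    z = (deltas - mu) / sigma

    # 2. Per-dimension SVD: top-R right singular vectors as raw bases.
    raw, strength = [], []
    for k in range(num_dims):
        zk = z[labels == k]
        zk = zk - zk.mean(axis=0)
        _, s, vt = np.linalg.svd(zk, full_matrices=False)
        raw.append(vt[:rank].T)
        strength.append(float((s[:rank] ** 2).sum()))  # assumed variance measure

    # 3. Sort dimensions, weakest first.
    order = np.argsort(strength)

    # 4. QR on the concatenated bases, then slice Q back per dimension.
    stacked = np.concatenate([raw[k] for k in order], axis=1)  # [D, K*R]
    q, _ = np.linalg.qr(stacked)
    ortho = [None] * num_dims
    for pos, k in enumerate(order):
        ortho[k] = q[:, pos * rank:(pos + 1) * rank]
    return raw, ortho

# Toy data: random deltas with random dimension labels.
rng = np.random.default_rng(0)
N, D, K, R = 300, 64, 3, 4
deltas = rng.standard_normal((N, D))
labels = rng.integers(0, K, size=N)

raw, bases = extract_orthogonal_subspaces(deltas, labels, K, R)
all_bases = np.concatenate(bases, axis=1)
ortho_err = float(np.abs(all_bases.T @ all_bases - np.eye(K * R)).max())
print(f"ortho_err: {ortho_err:.2e}")
```

QR preserves the span of the first block and removes overlap from each later block, so placing the weakest dimension first keeps its subspace intact while stronger dimensions absorb the adjustment.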

Hybrid design: Classification uses raw PCA bases (preserves full signal); encoding/decoding uses orthogonal bases (no double-counting across dimensions).
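
A rough illustration of this split, on synthetic data: each delta is assigned to the dimension whose raw basis captures the most projection energy, while encoding and decoding go through a single orthogonalized basis. The helper names and the energy-based decision rule are assumptions made for this sketch, and the toy accuracy is measured on the same data used to fit the bases.

```python
import numpy as np

def top_right_singular_vectors(x, rank):
    """Top-`rank` right singular vectors of the centered matrix x, as [D, rank]."""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:rank].T

def classify_by_projection(z, raw_bases):
    """Assign each row of z to the basis with the largest projection energy."""
    energy = np.stack([((z @ V) ** 2).sum(axis=1) for V in raw_bases], axis=1)
    return energy.argmax(axis=1)

# Synthetic deltas: each dimension lives in its own 2-D subspace plus noise.
rng = np.random.default_rng(0)
D, K, R, n = 48, 3, 2, 60
true_q, _ = np.linalg.qr(rng.standard_normal((D, K * R)))
samples, label_list = [], []
for k in range(K):
    block = true_q[:, k * R:(k + 1) * R]
    coeffs = rng.standard_normal((n, R))
    samples.append(coeffs @ block.T + 0.05 * rng.standard_normal((n, D)))
    label_list.append(np.full(n, k))
z = np.concatenate(samples)
labels = np.concatenate(label_list)

# Classification with the raw per-dimension bases.
raw_bases = [top_right_singular_vectors(z[labels == k], R) for k in range(K)]
pred = classify_by_projection(z, raw_bases)
accuracy = float((pred == labels).mean())

# Encoding/decoding with one orthogonal basis built from the raw ones.
ortho, _ = np.linalg.qr(np.concatenate(raw_bases, axis=1))
codes = z @ ortho       # encode: one coefficient per basis vector
recon = codes @ ortho.T  # decode: back to activation space
print(f"accuracy: {accuracy:.3f}")
```

With an orthonormal basis, encoding the reconstruction returns the same codes, which is the sense in which no component is counted twice across dimensions.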

Supported Models

Tested on:

Qwen2.5 (1.5B Base and Instruct)
OPT (1.3B)
GPT-2-XL (1.5B)

Expected to work on any HuggingFace transformer:

Llama 2/3/3.1
Mistral / Mixtral
Gemma
DeepSeek
Phi
And more

FAQ

Q: Does this really require zero training? A: Yes. The "extraction" is pure SVD + QR decomposition applied to collected activations. There is no gradient descent and no optimization loop; the only choices you make are the layer and rank_per_dim. It's just linear algebra.

Q: Can this run on CPU? A: Yes. For 1.5B parameter models, the entire pipeline (activation collection + SVD + QR) runs comfortably on CPU within minutes.

Q: How do I add a new safety dimension? A: Collect pairs of (undesired_output, desired_output) for your target dimension, extract activations, build an extractor with OrthoSubExtractor.from_activations(), and use SafetyGuard to zero it out.

Q: What if my model has a different architecture? A: The ActivationCache._get_layers() method auto-detects Qwen, GPT-2, OPT, and Llama-style architectures. For others, you can add a path to _get_layers().
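
As a rough idea of what such a lookup involves, the sketch below walks a list of attribute paths until one resolves to the model's stack of transformer blocks. The names `LAYER_PATHS` and `find_layers` are hypothetical and this is not the actual `_get_layers()` code; the three paths match where Hugging Face places the blocks for Llama/Qwen2-style, GPT-2-style, and OPT-style causal LMs.

```python
from types import SimpleNamespace

# Attribute paths to the list of transformer blocks in common architectures.
LAYER_PATHS = (
    "model.layers",          # Llama / Qwen2 style
    "transformer.h",         # GPT-2 style
    "model.decoder.layers",  # OPT style
)

def find_layers(model, extra_paths=()):
    """Return the first attribute path that resolves on `model`."""
    for path in (*extra_paths, *LAYER_PATHS):
        obj = model
        try:
            for name in path.split("."):
                obj = getattr(obj, name)
        except AttributeError:
            continue  # this path does not exist on the model; try the next
        return obj
    raise ValueError("Unknown architecture; pass the layer path via extra_paths")

# Stand-in objects that mimic the attribute layout of real models.
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
opt_like = SimpleNamespace(
    model=SimpleNamespace(decoder=SimpleNamespace(layers=["block0"]))
)
custom = SimpleNamespace(backbone=SimpleNamespace(blocks=["block0", "block1", "block2"]))

print(len(find_layers(gpt2_like)))
print(len(find_layers(custom, extra_paths=("backbone.blocks",))))
```

Supporting a new architecture then only requires one more path string.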

Citation

@software{orthosub2025,
  title={OrthoSub: Zero-Parameter Orthogonal Subspace Extraction
         for LLM Safety and Interpretability},
  author={OrthoSub Contributors},
  year={2025},
  url={https://github.com/TrueSerien/Orthosub}
}

License

MIT License — see LICENSE for details.

Roadmap

7B+ model validation (Qwen2.5-7B, Llama-3-8B)
Multi-dimensional safety (toxicity + bias + harmful instructions)
LLM-powered automatic counterfactual pair generation
Comparative benchmark against SAE-based methods
Streaming safety guard for production inference
REST API wrapper for deployment
