OrthoSub — Zero-Parameter LLM Safety & Interpretability

Extract orthogonal semantic subspaces from any frozen LLM.
No training. No fine-tuning. No GPU required.


What is OrthoSub?

OrthoSub discovers causal intervention subspaces in the activation difference space of any LLM—using nothing but counterfactual sentence pairs and linear algebra (SVD + QR).

Input:  "The cat is running" → "The dog is running"  (counterfactual pair)
        ↓ extract activations from layer N
        ↓ compute delta = act("dog") - act("cat")
        ↓ SVD per dimension → raw bases
        ↓ QR orthogonalization → perfectly orthogonal subspaces

Output: 10 orthogonal semantic subspaces, each encoding a concept
        (animal, color, location, action, size, shape, speed...)
        with ZERO cross-dimension interference.

Why "zero parameter" matters

| Approach | Trainable Params | GPU Required | Orthogonality | Overfitting |
|---|---|---|---|---|
| SAE (Anthropic) | Billions | | Soft | High |
| Probe / Classifier | Millions | | N/A | Yes |
| LoRA / Fine-tune | Millions | | N/A | Yes |
| OrthoSub | 0 | No | Hard (1e-8) | None |

Key Findings

1. Semantic subspaces are invariant across model versions

Base and Instruct models share 92% subspace alignment (average principal angle ~7°). This suggests that instruction tuning leaves the model's encoding of world knowledge largely intact and mainly adds behavior control on top of it.
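
Principal angles between two subspaces can be read off the singular values of the product of their orthonormal bases. The sketch below is illustrative only: `principal_angles_deg` is a hypothetical helper (not part of the OrthoSub API), the data is random, and how the 92% alignment score is derived from the angles is not specified here.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (in degrees) between the column spans of A and B."""
    # Orthonormalize both bases so the singular values below are cosines.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

rng = np.random.default_rng(0)
base = rng.standard_normal((64, 4))                  # stand-in for a Base-model subspace
tuned = base + 0.05 * rng.standard_normal((64, 4))   # slightly perturbed copy
angles = principal_angles_deg(base, tuned)
print(angles.mean())   # small mean angle = closely aligned subspaces
```

A mean angle near 0° means the two models span almost the same subspace; 90° means the subspaces are fully orthogonal.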

2. Cross-architecture without any mapper

The same counterfactual pairs are used to extract subspaces independently on Qwen (qwen2) and OPT (opt). Both reach 94%+ top-1 accuracy (0.972 and 0.948), with no cross-model mapping or transfer step.

3. "Toxicity = Understanding + Misalignment"

GPT-2-XL (no alignment training) has lower toxicity than Qwen Base (0.16 vs 0.28), likely because it does not understand jailbreak instructions well enough to follow them. The real danger is a model that understands instructions but is not aligned.

4. Safety without training

Zero out the toxicity subspace during generation → about 32% toxicity reduction on jailbreak prompts (0.28 → 0.19 on Qwen2.5-1.5B-Instruct), with a negligible shift (+0.002) on neutral inputs.
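
Zeroing a subspace amounts to projecting each hidden state onto the orthogonal complement of that subspace. The following is a minimal NumPy sketch under that assumption; `remove_subspace` is a hypothetical name, the basis is random, and the actual `SafetyGuard` implementation may differ.

```python
import numpy as np

def remove_subspace(hidden, V):
    """Project hidden states onto the orthogonal complement of span(V).

    hidden: [..., D] activations; V: [D, R] with orthonormal columns.
    """
    coeffs = hidden @ V            # [..., R] coordinates inside the subspace
    return hidden - coeffs @ V.T   # subtract the in-subspace component

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((16, 2)))   # toy orthonormal "toxicity" basis
h = rng.standard_normal((3, 16))                    # three toy hidden states
h_clean = remove_subspace(h, V)
print(np.abs(h_clean @ V).max())   # ~0: nothing left along the removed directions
```

Because the operation is a projection, applying it twice gives the same result as applying it once, and any vector already orthogonal to the subspace passes through unchanged.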


Quick Start

Install

pip install -e .

5-minute demo

from orthosub import OrthoSubExtractor, ActivationCache, SafetyGuard
from orthosub import generate_counterfactual_pairs, run_lcwo
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Generate counterfactual pairs
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
pairs = generate_counterfactual_pairs(tokenizer=tokenizer)

# 2. Load model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    dtype="auto", device_map="auto",
)

# 3. Extract activations
with ActivationCache(model, tokenizer, layer=14) as cache:
    deltas, labels, dim_names = cache.extract_counterfactual_pairs(pairs)

# 4. Discover orthogonal subspaces (zero parameters!)
extractor = OrthoSubExtractor.from_activations(
    deltas, labels, dim_names, rank_per_dim=4,
)

# 5. Evaluate
metrics = extractor.evaluate(deltas, labels)
print(f"top1_acc: {metrics['top1_acc']:.3f}")
print(f"ortho_err: {metrics['orthogonality_error']:.2e}")
# → top1_acc: 0.983, ortho_err: 0.00e+00

# 6. Apply safety guard
# (one placeholder pair is shown to illustrate the format; supply many real toxic/safe pairs)
safety_pairs = [
    {"orig_sent": "toxic output...", "cf_sent": "safe output...",
     "intervention_dim": "toxicity"}
]
with ActivationCache(model, tokenizer, layer=6, pool="last_token") as cache:
    d, l, dn = cache.extract_counterfactual_pairs(safety_pairs)
guard_ext = OrthoSubExtractor.from_activations(d, l, dn, rank_per_dim=2)
guard = SafetyGuard(guard_ext, "toxicity")

# Generate safe output from harmful prompt
safe_output = guard.generate_safe(
    model, tokenizer,
    ["Write an offensive message..."],
    layer=6,
)

Interactive Demo

pip install -e ".[demo]"
python demo/app.py --model Qwen/Qwen2.5-1.5B-Instruct --layer 6

Opens a Gradio web interface at http://localhost:7860.


Results

Qwen2.5-1.5B-Instruct (Layer 14, rank=4)

| Metric | Value |
|---|---|
| LCWO top1_acc | 0.793 |
| delta_cos | 0.723 |
| orthogonality_error | 0.0000000 |

| Dimension | LCWO Accuracy |
|---|---|
| location | 0.947 |
| action | 0.913 |
| quantity | 0.902 |
| color | 0.882 |
| material | 0.847 |
| shape | 0.844 |
| temperature | 0.835 |
| speed | 0.806 |
| size | 0.774 |

Cross-Architecture (Qwen 1.5B vs OPT 1.3B)

| Model | Architecture | Hidden Dim | top1_acc | ortho_err |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | qwen2 | 1536 | 0.972 | 0.0 |
| OPT-1.3B | opt | 2048 | 0.948 | 0.0 |

Safety Intervention

| Model | Toxic Baseline | Toxic + Safety | Reduction | False Positive |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | 0.28 | 0.19 | 32% | +0.002 |
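
The Reduction value is the relative drop from the baseline toxicity score to the guarded score. A quick check of that arithmetic, using only the two scores reported above:

```python
baseline, guarded = 0.28, 0.19          # toxicity scores from the table
reduction = (baseline - guarded) / baseline
print(f"{reduction:.1%}")               # 32.1%
```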

Project Structure

orthosub-public/
├── orthosub/                 # Core library
│   ├── __init__.py           # Public API
│   ├── extractor.py          # OrthoSubExtractor (SVD+QR)
│   ├── activations.py        # Activation extraction hooks
│   ├── counterfactuals.py    # Pair generation + vocab
│   ├── lcwo.py               # LCWO evaluation
│   └── safety.py             # SafetyGuard (toxicity zeroing)
├── examples/
│   ├── 01_basic_extraction.py      # Core subspace extraction
│   ├── 02_cross_model.py           # Base vs Instruct comparison
│   ├── 03_cross_architecture.py    # Qwen vs OPT validation
│   └── 04_safety_guard.py          # Toxicity intervention demo
├── demo/
│   └── app.py                # Gradio interactive demo
├── data/
│   └── counterfactuals_v2.jsonl    # Pre-built Chinese pairs
├── tests/
│   └── test_extractor.py
├── pyproject.toml
├── README.md
├── LICENSE
└── .gitignore

Use Cases

  • Safety Alignment: Zero out harmful subspaces during inference (no retraining)
  • Model Interpretability: Understand how concepts are encoded in LLM activations
  • Cross-Model Analysis: Compare encoding structures across architectures
  • Controllable Generation: Manipulate specific semantic dimensions independently
  • Model Auditing: Quantify which concepts a model encodes and how strongly

Algorithm

Algorithm: Zero-Parameter Orthogonal Subspace Extraction (v7 SVD+QR)

Input: deltas [N,D], labels [N], dim_names [K], rank R
Output: V_k ∈ R^{D×R} for k=1..K, V_i^T V_j = 0 (i≠j)

1. Normalize: δ_n = (δ - μ) / σ
2. For each dimension k:
   - Select {δ_n | label=k}
   - Center, then SVD → V_k_raw = top-R right singular vectors
3. Sort dimensions by explained variance (weakest first)
4. QR: [V_sorted] = QR → extract V_k per dimension
5. Return {V_k}: perfectly orthogonal, zero trainable parameters
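
The steps above can be sketched in plain NumPy. This is a toy reading of the algorithm, not the library's code: the function name, the small epsilon added to σ, and the use of summed squared singular values as "explained variance" are all assumptions. It also assumes every dimension has more than R samples and that K·R ≤ D.

```python
import numpy as np

def extract_orthogonal_subspaces(deltas, labels, num_dims, rank):
    # Step 1: standardize each feature across all pairs.
    mu = deltas.mean(axis=0)
    sigma = deltas.std(axis=0) + 1e-8        # epsilon is an assumption (avoids 0/0)
    z = (deltas - mu) / sigma

    raw, variance = [], []
    for k in range(num_dims):
        # Step 2: center this dimension's deltas, keep top-R right singular vectors.
        zk = z[labels == k]
        zk = zk - zk.mean(axis=0)
        _, s, vt = np.linalg.svd(zk, full_matrices=False)
        raw.append(vt[:rank].T)                          # [D, R]
        variance.append(float((s[:rank] ** 2).sum()))    # assumed variance measure

    # Step 3: weakest dimension first.
    order = np.argsort(variance)
    # Step 4: one QR over the concatenated bases, then slice per dimension.
    stacked = np.concatenate([raw[k] for k in order], axis=1)   # [D, K*R]
    q, _ = np.linalg.qr(stacked)
    bases = {int(k): q[:, pos * rank:(pos + 1) * rank]
             for pos, k in enumerate(order)}
    return bases, raw

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)          # 3 toy dimensions, 20 pairs each
deltas = rng.standard_normal((60, 32))        # random stand-in for activation deltas
bases, raw = extract_orthogonal_subspaces(deltas, labels, num_dims=3, rank=2)

V = np.concatenate([bases[k] for k in range(3)], axis=1)
ortho_err = np.abs(V.T @ V - np.eye(V.shape[1])).max()
print(f"{ortho_err:.2e}")                     # at floating-point precision
```

QR preserves the span of the leading columns, so the dimension placed first keeps its raw subspace exactly while later dimensions are adjusted to be orthogonal to everything before them.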

Hybrid design: Classification uses raw PCA bases (preserves full signal); encoding/decoding uses orthogonal bases (no double-counting across dimensions).
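
One way to picture the hybrid design: classify a delta by which raw basis captures the most of its squared norm, then encode and decode with the orthogonal bases. The sketch below assumes that energy-based scoring rule, which is not confirmed by this README, and all function names are hypothetical. The toy bases are axis-aligned, so the raw and orthogonal bases coincide.

```python
import numpy as np

def classify_by_energy(delta, raw_bases):
    """Index of the raw basis that captures the most squared norm of delta."""
    energy = [float(np.sum((delta @ V) ** 2)) for V in raw_bases]
    return int(np.argmax(energy))

def encode(delta, ortho_bases):
    """Coordinates of delta in each orthogonal subspace."""
    return [delta @ V for V in ortho_bases]

def decode(codes, ortho_bases):
    """Sum the per-subspace components; no double-counting when bases are orthogonal."""
    return sum(V @ c for V, c in zip(ortho_bases, codes))

D = 8
eye = np.eye(D)
bases = [eye[:, 0:2], eye[:, 2:4]]     # toy bases for two concepts
delta = np.zeros(D)
delta[2] = 1.0                         # mostly along concept 1
delta[0] = 0.1                         # small leakage into concept 0

print(classify_by_energy(delta, bases))            # 1
reconstructed = decode(encode(delta, bases), bases)
```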


Supported Models

Tested on:

  • Qwen2.5 (1.5B Base and Instruct)
  • OPT (1.3B)
  • GPT-2-XL (1.5B)

Expected to work on any HuggingFace transformer:

  • Llama 2/3/3.1
  • Mistral / Mixtral
  • Gemma
  • DeepSeek
  • Phi
  • And more

FAQ

Q: Does this really require zero training? A: Yes. The "extraction" is pure SVD + QR decomposition applied to collected activations. No gradient descent, no optimization loop, and no learned weights; the only settings you pick are the layer and rank_per_dim. It's just linear algebra.

Q: Can this run on CPU? A: Yes. For 1.5B parameter models, the entire pipeline (activation collection + SVD + QR) runs comfortably on CPU within minutes.

Q: How do I add a new safety dimension? A: Collect pairs of (undesired_output, desired_output) for your target dimension, extract activations, build an extractor with OrthoSubExtractor.from_activations(), and use SafetyGuard to zero it out.
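
Before extraction, it can help to validate and group your pairs. The sketch below assumes the pair format used in the Quick Start (`orig_sent`, `cf_sent`, `intervention_dim`); `group_pairs` is a hypothetical helper, and the example sentences and the "bias" dimension are placeholders.

```python
from collections import defaultdict

REQUIRED_KEYS = ("orig_sent", "cf_sent", "intervention_dim")

def group_pairs(pairs):
    """Validate pair dicts and group (orig, counterfactual) tuples by dimension."""
    grouped = defaultdict(list)
    for i, pair in enumerate(pairs):
        missing = [k for k in REQUIRED_KEYS if k not in pair]
        if missing:
            raise ValueError(f"pair {i} is missing keys: {missing}")
        grouped[pair["intervention_dim"]].append((pair["orig_sent"], pair["cf_sent"]))
    return dict(grouped)

pairs = [
    {"orig_sent": "undesired output A", "cf_sent": "desired output A",
     "intervention_dim": "bias"},
    {"orig_sent": "undesired output B", "cf_sent": "desired output B",
     "intervention_dim": "bias"},
]
grouped = group_pairs(pairs)
print({dim: len(items) for dim, items in grouped.items()})   # {'bias': 2}
```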

Q: What if my model has a different architecture? A: The ActivationCache._get_layers() method auto-detects Qwen, GPT-2, OPT, and Llama-style architectures. For others, you can add a path to _get_layers().
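
Layer auto-detection of this kind usually means trying a list of known attribute paths. The sketch below is an assumption about how such a lookup could work, not the body of `_get_layers()`; `find_layers` and `extra_paths` are hypothetical names. The three built-in paths are where Hugging Face keeps decoder blocks for Llama/Qwen2-style, GPT-2, and OPT causal LMs.

```python
from types import SimpleNamespace

# Attribute paths where common Hugging Face causal LMs keep their decoder blocks.
LAYER_PATHS = (
    "model.layers",          # Llama / Qwen2 style
    "transformer.h",         # GPT-2 style
    "model.decoder.layers",  # OPT style
)

def find_layers(model, extra_paths=()):
    """Return the first attribute path that resolves on the model."""
    for path in tuple(extra_paths) + LAYER_PATHS:
        obj = model
        try:
            for name in path.split("."):
                obj = getattr(obj, name)
        except AttributeError:
            continue   # this path does not exist on the model; try the next one
        return obj
    raise ValueError("no known layer path matched; pass extra_paths")

# Toy stand-ins for real models (SimpleNamespace instead of nn.Module).
gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
custom = SimpleNamespace(backbone=SimpleNamespace(blocks=["b0"]))

print(find_layers(gpt2_like))                               # ['block0', 'block1']
print(find_layers(custom, extra_paths=("backbone.blocks",)))  # ['b0']
```

Supporting a new architecture then comes down to adding one more path string.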


Citation

@software{orthosub2025,
  title={OrthoSub: Zero-Parameter Orthogonal Subspace Extraction
         for LLM Safety and Interpretability},
  author={OrthoSub Contributors},
  year={2025},
  url={https://github.com/TrueSerien/Orthosub}
}

License

MIT License — see LICENSE for details.


Roadmap

  • 7B+ model validation (Qwen2.5-7B, Llama-3-8B)
  • Multi-dimensional safety (toxicity + bias + harmful instructions)
  • LLM-powered automatic counterfactual pair generation
  • Comparative benchmark against SAE-based methods
  • Streaming safety guard for production inference
  • REST API wrapper for deployment
