EECS E6895 final project measuring reward-gaming behavior in Gemma 2B with shell-game evals, LoRA SFT, and leakage-aware probes.
lora gemma ai-safety interpretability columbia-university sft reward-hacking linear-probes specification-gaming
Updated May 12, 2026 - Python