- **Superposition as Lossy Compression – Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability**
  An information-theoretic framework quantifying superposition as lossy compression in neural networks. By measuring effective degrees of freedom through sparse autoencoders, we reveal that adversarial training's effect on feature organization depends on task complexity relative to network capacity. (A toy sketch of this measurement follows the list.)
- **Mechanistic Interpretability for Adversarial Robustness — A Proposal**
  A research proposal exploring the synergies between mechanistic interpretability and adversarial robustness to develop safer AI systems.
- **Mechanistic Interpretability for AI Safety — A Review**
  A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable algorithms and concepts, focusing on its relevance to AI safety.
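To make the first entry concrete, here is a minimal sketch of measuring "effective degrees of freedom" with a sparse autoencoder. This is not the paper's implementation: the ReLU autoencoder with an L1 penalty, the hyperparameters, and the `effective_degrees_of_freedom` metric (perplexity of the mean feature-activation distribution) are all illustrative assumptions standing in for whatever the actual framework uses.

```python
# A minimal sketch, assuming a ReLU sparse autoencoder trained on hidden
# activations with an L1 sparsity penalty. "Effective degrees of freedom"
# is approximated here as exp(H(p)), the perplexity of the normalized mean
# activation per feature: it equals d_hidden when all features fire equally
# and approaches 1 when a single feature dominates. The paper's exact
# metric may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)      # reconstruction of the input activations
        return x_hat, f


def effective_degrees_of_freedom(features: torch.Tensor) -> float:
    """Perplexity of the feature-usage distribution (an assumed proxy)."""
    p = features.mean(dim=0)
    p = p / p.sum().clamp_min(1e-12)          # normalize mean activations
    entropy = -(p * (p + 1e-12).log()).sum()  # Shannon entropy H(p)
    return entropy.exp().item()               # perplexity exp(H(p))


# Toy usage on random stand-in activations (a real run would use
# activations collected from a trained network).
torch.manual_seed(0)
acts = torch.randn(4096, 64)
sae = SparseAutoencoder(d_model=64, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    x_hat, f = sae(acts)
    loss = F.mse_loss(x_hat, acts) + 1e-3 * f.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, f = sae(acts)
    print(f"effective degrees of freedom: {effective_degrees_of_freedom(f):.1f}")
```

Perplexity of feature usage is only one plausible proxy; alternatives such as the participation ratio of the activation covariance would slot into the same pipeline.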