- Mechanistic Interpretability for Adversarial Robustness — A Proposal
  A research proposal exploring the synergies between mechanistic interpretability and adversarial robustness to develop safer AI systems.
- Mechanistic Interpretability for AI Safety — A Review
  A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable algorithms and concepts, focusing on its relevance to AI safety.