- **Superposition as Lossy Compression – Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability**
  An information-theoretic framework quantifying superposition as lossy compression in neural networks. By measuring effective degrees of freedom through sparse autoencoders, we reveal that adversarial training's effect on feature organization depends on task complexity relative to network capacity. (A toy sketch of this measurement follows the list.)
- **Mechanistic Interpretability for Adversarial Robustness — A Proposal**
  A research proposal exploring the synergies between mechanistic interpretability and adversarial robustness to develop safer AI systems.
- **Mechanistic Interpretability for AI Safety — A Review**
  A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable algorithms and concepts, focusing on its relevance to AI safety.
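To make the first entry concrete, here is a minimal sketch of measuring "effective degrees of freedom" with a sparse autoencoder. This is not the paper's implementation: the ReLU autoencoder with an L1 penalty, the hyperparameters, and the `effective_degrees_of_freedom` metric (perplexity of the mean feature-activation distribution) are all illustrative assumptions standing in for whatever the actual framework uses.

```python
# A minimal sketch, assuming a ReLU sparse autoencoder trained on hidden
# activations with an L1 sparsity penalty. "Effective degrees of freedom"
# is approximated here as exp(H(p)), the perplexity of the normalized mean
# activation per feature: it equals d_hidden when all features fire equally
# and approaches 1 when a single feature dominates. The paper's exact
# metric may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)      # reconstruction of the input activations
        return x_hat, f


def effective_degrees_of_freedom(features: torch.Tensor) -> float:
    """Perplexity of the feature-usage distribution (an assumed proxy)."""
    p = features.mean(dim=0)
    p = p / p.sum().clamp_min(1e-12)          # normalize mean activations
    entropy = -(p * (p + 1e-12).log()).sum()  # Shannon entropy H(p)
    return entropy.exp().item()               # perplexity exp(H(p))


# Toy usage on random stand-in activations (a real run would use
# activations collected from a trained network).
torch.manual_seed(0)
acts = torch.randn(4096, 64)
sae = SparseAutoencoder(d_model=64, d_hidden=256)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    x_hat, f = sae(acts)
    loss = F.mse_loss(x_hat, acts) + 1e-3 * f.abs().mean()  # recon + L1
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, f = sae(acts)
    print(f"effective degrees of freedom: {effective_degrees_of_freedom(f):.1f}")
```

Perplexity of feature usage is only one plausible proxy; alternatives such as the participation ratio of the activation covariance would slot into the same pipeline.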