Leonard F. Bereska

4.125, LAB42

Science Park 900

1098XH Amsterdam

I’m Leonard, a PhD Candidate at the University of Amsterdam, dedicated to enhancing AI safety through mechanistic interpretability. My research aims to make transformer models more transparent and understandable, contributing to the broader goal of AI alignment.

research focus

My work revolves around reverse-engineering neural networks into human-interpretable algorithms. I’m particularly interested in:

Engineering monosemanticity and implementing sparse distillation techniques in transformer models.
Investigating the relationship between mechanistic interpretability and adversarial robustness.
Analyzing truth representations and simulacra in large language models.
Applying singular learning theory to examine phase transitions in algorithmic tasks.
Mechanistically interpreting prior-fitted tabular transformers.
Creating sparse boolean circuits (inspired by computation in superposition) as testbeds and benchmarks for interpretability methods.

If you find any of these topics interesting, please reach out.

As part of the AI Safety Initiative Amsterdam, I’m actively involved in promoting AI safety research and awareness. We organize events, facilitate reading groups, and foster discussions on crucial AI safety topics.

I’m also passionate about nurturing the next generation of AI safety researchers. I’ve taught courses and supervised numerous Master’s students on projects ranging from detecting bias and eliciting truth in LLMs to interpretability in medical AI applications.

beyond research

When I’m not diving into the intricacies of neural networks, you might find me:

Practicing yoga or meditation to maintain balance.
Reading science-fiction novels (recently discovered Vernor Vinge’s work, with its plausible treatment of the AI singularity).
Playing with my two chihuahuas, Cicchetti and Pancetta.
Exploring Amsterdam’s culinary scene (always on the lookout for the best vegan spots!)
Brushing up on my Mandarin or picking up Dutch phrases.
Engaging in discussions about the future of AI and its implications for society.

latest posts

Aug 19, 2024	Mechanistic Interpretability for Adversarial Robustness — A Proposal
Jul 10, 2024	Mechanistic Interpretability for AI Safety — A Review

selected publications

TMLR
Mechanistic Interpretability for AI Safety — A Review

Leonard F. Bereska and Efstratios Gavves

TMLR, Apr 2024

Understanding AI systems’ inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse-engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.

@article{bereska_mechanistic_2024,
  title = {Mechanistic Interpretability for AI Safety — A Review},
  author = {Bereska, Leonard F. and Gavves, Efstratios},
  year = {2024},
  month = apr,
  journal = {TMLR},
  eprint = {2404.14082},
}

AAAI-SS
Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models

Leonard F. Bereska and Efstratios Gavves

AAAI-SS, Oct 2023

As AI systems continue to advance in power and prevalence, ensuring alignment between humans and AI is crucial to prevent catastrophic outcomes. The greater the capabilities and generality of an AI system, combined with its development of goals and agency, the higher the risks associated with misalignment. While the concept of superhuman artificial general intelligence is still speculative, language models show indications of generality that could extend to generally capable systems. Regarding agency, this paper emphasizes the understanding of prediction-trained models as simulators rather than agents. Nonetheless, agents may emerge accidentally from internal processes, so-called simulacra, or deliberately through fine-tuning with reinforcement learning. As a result, the focus of alignment research shifts towards aligning simulacra, comprehending and mitigating mesa-optimization, and aligning agents derived from prediction-trained models. The paper outlines the challenges of aligning simulators and presents research directions based on this understanding. Additionally, it envisions a future where aligned simulators are critical in fostering successful human-AI collaboration. This vision encompasses exploring emulation approaches and the integration of simulators into cyborg systems to enhance human cognitive abilities. By acknowledging the risks associated with misaligned AI, delving into the concept of simulacra, and presenting strategies for aligning agents and simulacra, this paper contributes to the ongoing efforts to safeguard human values in developing and deploying AI systems.

@article{bereska_taming_2023,
  title = {Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models},
  shorttitle = {Taming Simulators},
  author = {Bereska, Leonard F. and Gavves, Efstratios},
  year = {2023},
  month = oct,
  journal = {AAAI-SS},
  volume = {1},
  number = {1},
  pages = {68--72},
  issn = {2994-4317},
  doi = {10.1609/aaaiss.v1i1.27478},
}