Mechanistic Interpretability Workshop 2024

ICML 2024 In-Person Workshop, Vienna

July 27, 2024

This is a one-day ICML workshop on mechanistic interpretability, held on July 27th in room Lehar 1 at the ICML venue, the Messe Wien Exhibition Congress Center, Vienna, Austria.

Top Paper Prizes

These are our five prize-winning papers. You can see all 93 accepted papers, showcasing the latest mechanistic interpretability research, here!
  1. First prize ($1000): The Geometry of Categorical and Hierarchical Concepts in Large Language Models
  2. Second prize ($500): InversionView: A General-Purpose Method for Reading Information from Neural Activations
  3. Third prize ($250): Hypothesis Testing the Circuit Hypothesis in LLMs
  4. Honorable mention: Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
  5. Honorable mention: Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Schedule

Time Event
09:00 - 09:30 Welcome + Talk 1: David Bau
09:30 - 10:30 Oral Presentations
10:30 - 11:00 Spotlights 1
11:00 - 12:00 Poster Session 1
12:00 - 13:00 Panel Discussion
13:00 - 14:00 Lunch
14:00 - 14:30 Spotlights 2
14:30 - 15:30 Poster Session 2
15:30 - 16:00 Coffee Break
16:00 - 16:30 Talk 2: Asma Ghandeharioun
16:30 - 17:00 Talk 3: Chris Olah (remote)
18:30 - late Invite-only evening social (apply here)

Introduction

Ever larger and more capable machine learning models are being deployed in real-world settings, yet we still know concerningly little about how they implement their many impressive capabilities. This in turn can make it difficult to rely on these models in high-stakes situations, or to reason about and address cases where they exhibit undesirable behavior.

One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits [Cammarata et al., 2020; Elhage et al., 2021] that implement particular behaviors.
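As a toy illustration of what "examining activations" can look like in practice, the following PyTorch sketch registers a forward hook on one layer of a small network and records what that layer computes on a given input. This is a generic sketch for intuition only, not the method of any paper cited above.

```python
import torch
import torch.nn as nn

# A toy model: two linear layers with a ReLU in between.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
activations = {}

def save_activation(module, inputs, output):
    # Record the layer's output so it can be examined after the forward pass.
    activations["relu_out"] = output.detach()

# Attach the hook to the ReLU layer, then run a forward pass.
model[1].register_forward_hook(save_activation)
model(torch.randn(1, 8))

print(activations["relu_out"].shape)  # torch.Size([1, 16]): the hidden activations to inspect
```

Mechanistic interpretability work typically applies this kind of instrumentation at scale, inspecting and intervening on the internal activations of large models rather than toy networks.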

Though this is an ambitious goal, mechanistic interpretability has seen rapid progress in the past two years. For example, researchers have used newly developed mechanistic interpretability techniques to recover how large language models implement particular behaviors [for example, Geiger et al., 2021, Wang et al., 2022, Olsson et al., 2022, Geva et al., 2023, Hanna et al., 2023, Quirke and Barez, 2024], illuminated puzzles such as double descent [Henighan et al., 2023], scaling laws [Michaud et al., 2023], and grokking [Nanda et al., 2023], and explored phenomena such as superposition [Elhage et al., 2022, Gurnee et al., 2023, Bricken et al., 2023] that may be fundamental principles of how models work. Despite this progress, much mechanistic interpretability work still occurs in relatively disparate circles: industry and academia have developed largely separate threads of work, each with its own (slightly different) notation and terminology.

This workshop aims to bring together researchers from both industry and academia to discuss recent progress, address the challenges faced by this field, and clarify future goals, use cases, and agendas. We believe that this workshop can help foster a rich dialogue between researchers with a wide variety of backgrounds and ideas, which in turn will help researchers develop a deeper understanding of how machine learning systems work in practice.

Attending

We welcome attendees from all backgrounds, regardless of your prior research experience or whether you have work published at this workshop. Note that while you do not need to be registered for the ICML main conference to attend this workshop, you do need to be registered for the ICML workshop track. No further registration (e.g. with this specific workshop) is needed; just turn up on the day!

Speakers

Chris Olah, Anthropic

David Bau, Northeastern University

Asma Ghandeharioun, Google DeepMind

Panelists

Naomi Saphra, Harvard University

Atticus Geiger, Pr(Ai)2R Group

Stella Biderman, EleutherAI

Arthur Conmy, Google DeepMind

Call for Papers

We are inviting submissions of short (4 page) and long (8 page) papers outlining new research, with a deadline of May 29th, 2024. We welcome papers on any of the topics listed in the Topics for Discussion section (which includes more details and example papers), or on anything else that the authors convincingly argue moves the field of mechanistic interpretability forward.

We also welcome work that furthers the field of mechanistic interpretability in less standard ways, such as: rigorous negative results; open-source software (e.g. TransformerLens, pyvene, nnsight, or Penzai), models, or datasets that may be of value to the community (e.g. Pythia, MultiBERTs, or open-source sparse autoencoders); coding tutorials (e.g. the ARENA materials); distillations of key but poorly explained concepts (e.g. Elhage et al., 2021); and position pieces that discuss future use cases of mechanistic interpretability or bring clarity to complex topics such as “what is a feature?”.

Reviewing and Submission Policy

All submissions must be made via OpenReview. Please use the ICML 2024 LaTeX Template for all submissions.

Submissions are non-archival. We are happy to receive submissions that are also undergoing peer review elsewhere at the time of submission, but we will not accept submissions that have already been published or accepted for publication at peer-reviewed conferences or journals. Submission is permitted for papers that have been presented, or will be presented, at other non-archival venues (e.g. other workshops).

Reviewing for our workshop is double-blind: reviewers will not know the authors’ identities (and vice versa). Both short (max 4 page) and long (max 8 page) papers allow unlimited pages for references and appendices, but reviewers are not expected to read these. Submissions will be evaluated on their originality and novelty, technical strength, and relevance to the workshop topics. Notifications of acceptance will be sent to authors by email.

Prizes

We awarded cash prizes to the strongest submissions: $1000 for first prize, $500 for second prize, and $250 for third prize, plus two honorable mentions (see the prize-winning papers listed above).

Important Dates

All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).

Submission deadline: May 29th, 2024.

Note: You will need an OpenReview account to submit. If you do not have an institutional email address (e.g. a .edu address), OpenReview moderation can take up to two weeks, so please create an account by May 14th at the latest if this applies to you.

Topics for Discussion

Besides panel discussions, invited talks, and a poster session, we also plan on running a hands-on tutorial exploring newer results in the field using the TransformerLens package [Nanda and Bloom, 2022].
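For attendees who have not used it before, the snippet below is a minimal sketch of the kind of workflow TransformerLens enables: loading a pretrained model, caching all of its internal activations on a prompt, and reading one of them out. The prompt and the choice of activation are illustrative, not taken from the tutorial materials.

```python
from transformer_lens import HookedTransformer

# Load a pretrained model whose internal activations are all exposed via hooks.
model = HookedTransformer.from_pretrained("gpt2")

# Run the model and cache every intermediate activation in a single forward pass.
prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Read out one internal activation, e.g. layer 0's attention pattern;
# its shape is [batch, n_heads, query_pos, key_pos].
attn_pattern = cache["blocks.0.attn.hook_pattern"]
print(attn_pattern.shape)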

Organizing Committee

Fazl Barez, Research Fellow, University of Oxford

Mor Geva, Assistant Professor, Tel Aviv University; Visiting Researcher, Google Research

Lawrence Chan, PhD student, UC Berkeley

Atticus Geiger, Pr(Ai)2R Group

Kayo Yin, PhD student, UC Berkeley

Neel Nanda, Research Engineer, Google DeepMind

Max Tegmark, Professor, MIT

Contact

Email: icml2024mi@gmail.com