ICML 2024 Mechanistic Interpretability Workshop

This is a 1 day workshop at ICML on mechanistic interpretability, held on July 27th in room Lehar 1 at ICML venue of Messe Wien Exhibition Congress Center, Vienna, Austria.

Schedule

Time	Event
09:00 - 09:30	Welcome + Talk 1: David Bau
09:30 - 10:30	Oral Presentation
10:30 - 11:00	Spotlights 1
11:00 - 12:00	Poster Session 1
12:00 - 13:00	Panel Discussion
13:00 - 14:00	Lunch
14:00 - 14:30	Spotlights 2
14:30 - 15:30	Poster Session 2
15:30 - 16:00	Coffee Break
16:00 - 16:30	Talk 2: Asma Ghandeharioun
16:30 - 17:00	Talk 3: Chris Olah (remote)
18:30 - late	Invite-only evening social (apply here)

Introduction

Even though ever larger and more capable machine learning models are being deployed in real-world settings, we still know concerningly little about how they implement their many impressive capabilities. This in turn can make it difficult to rely on these models in high-stakes situations, or to reason about or address cases where said models exhibit undesirable behavior.

One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits[Cammarata et al., 2020, Elhage et al., 2021] that implement particular behaviors.

Though this is an ambitious goal, in the past two years, mechanistic interpretability has seen rapid progress. For example, researchers have used newly developed mechanistic interpretability techniques to recover how large language models implement particular behaviors [for example, Geiger et al., 2021, Wang et al., 2022, Olsson et al., 2022, Geva et al., 2023, Hanna et al., 2023, Quirke and Barez, 2024], illuminated various puzzles such as double descent [Henighan et al., 2023], scaling laws [Michaud et al., 2023], and grokking [Nanda et al., 2023], and explored phenomena such as superposition [Elhage et al., 2022, Gurnee et al., 2023, Bricken et al., 2023] that may be fundamental principles of how models work. Despite this progress, significant amounts of mechanistic interpretability work still occur in relatively disparate circles – there seem to be relatively separate threads of work in industry and academia that each use their own (slightly different) notation and terminology.

This workshop aims to bring together researchers from both industry and academia to discuss recent progress, address the challenges faced by this field, and clarify future goals, use cases, and agendas. We believe that this workshop can help foster a rich dialogue between researchers with a wide variety of backgrounds and ideas, which in turn will help researchers develop a deeper understanding of how machine learning systems work in practice.

Attending

We welcome attendees from all backgrounds, regardless of your prior research experience or if you have work published at this workshop. Note that while you do not need to be registered for the ICML main conference to attend this workshop, you do need to be registered for the ICML workshop track. No further registration (eg with this specific workshop) is needed, just turn up on the day!

Speakers

Chris Olah

Anthropic

David Bau

Northeastern University

Asma Ghandeharioun

Google DeepMind

Panelists

Naomi Saphra

Harvard University

Atticus Geiger

Pr(Ai)²R Group

Stella Biderman

EleutherAI

Arthur Conmy

Google DeepMind

Call for Papers

We are inviting submissions of short (4 pages) and long (8 pages) papers outlining new research, with a deadline of May 29th 2024. We welcome papers on any of the following topics (see the Topics for Discussion section for more details and example papers), or anything else where the authors convincingly argue that it moves forward the field of mechanistic interpretability.

Techniques: Work inventing new mechanistic interpretability techniques, evaluating the quality of existing techniques, or proposing benchmarks and tools for future evaluations.
Exploratory analysis:Qualitative, biologically-inspired analysis of components, circuits or phenomena inside neural networks.
Decoding superposition: Work that deepens our understanding of the hypothesis that models activations are represented in superposition, and explores techniques to decode superposed activations, such as sparse autoencoders.
Applications of interpretability: Can we study jailbreaks/hallucinations/other interesting real-world phenomena of LLMs? Where are places where mech interp provides value, in a fair comparison with e.g. linear probing or finetuning baselines?
Scaling and automation: How can we reduce the dependence of mechanistic interpretability on slow, subjective and expensive human labor? How much do our current techniques scale?
Basic science: There are many fundamental mysteries of model internals, and we welcome work that can shed any light on them: Are activations sparse linear combinations of features? Are features universal? Are circuits and features even the right way to think about models?

We also welcome work that furthers the field of mechanistic interpretability in less standard ways, such as by providing rigorous negative results, or open source software (e.g. TransformerLens, pyvene, nnsight or Penzai), models or datasets that may be of value to the community (e.g. Pythia, MultiBERTs or open source sparse autoencoders), coding tutorials (e.g. the ARENA materials), distillations of key and poorly explained concepts (e.g. Elhage et al., 2021), or position pieces discussing future use cases of mechanistic interpretability or that bring clarification to complex topics such as “what is a feature?”.

Reviewing and Submission Policy

All submissions must be made via OpenReview. Please use the ICML 2024 LaTeX Template for all submissions.

Submissions are non-archival. We are happy to receive submissions that are also undergoing peer review elsewhere at the time of submission, but we will not accept submissions that have already been previously published or accepted for publication at peer-reviewed conferences or journals. Submission is permitted for papers presented or to be presented at other non-archival venues (e.g. other workshops)

Reviewing for our workshop is double blind: reviewers will not know the authors’ identity (and vice versa). Both short (max 4 page) and long (max 8 page) papers allow unlimited pages for references and appendices, but reviewers are not expected to read these. Evaluation of submissions will be based on the originality and novelty, the technical strength, and relevance to the workshop topics. Notifications of acceptance will be sent to applicants by email.

Prizes

Best paper prize: $1000
Second place: $500
Third place: $250
Honorable mentions: Up to 5, no cash prize

Important Dates

Submission open on OpenReview: May 12, 2024
Submission Deadline: May 29, 2024
Notification of Acceptance: June 23, 2024
Camera-ready Deadline: July 14th, 2024
Workshop Date: July 27, 2024

All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”).

Note: You will require an OpenReview account to submit. If you do not have an institutional email (e.g. a .edu address), OpenReview moderation can take up to 2 weeks. Please make an account by May 14th at the latest if this applies to you.

Potential topics of discussion include:

Many recent papers have suggested different metrics and techniques for validating mechanistic interpretations [Cammarata et al., 2020, Geiger et al., 2021, Wang et al., 2022, Chan et al., 2022]. What are the advantages and disadvantages of these metrics, and which metrics should the field use going forward? How do we avoid spurious explanations or “interpretability illusions" [Bolukbasi et al., 2021]? Are there unknown illusions for currently popular techniques?
Neural networks seem to represent more features in superposition [Elhage et al., 2022, Gurnee et al., 2023] than they have dimensions, which poses a significant challenge for identifying what features particular subcomponents are representing. How much of a challenge does superposition pose for various approaches to mechanistic interpretability? What are approaches that allow us to address or circumvent this challenge? We are particularly excited to see work building on recent successes using dictionary learning to address superposition, such as Sparse Autoencoders [Bricken et al., 2023], including studying these dictionaries, using them for circuit analysis [Marks et al., 2024], understanding reward models [Marks et al., 2024], and developing better training methods.
Techniques from mechanistic interpretability have been used to identify, edit, and control behavior inside of neural networks [Meng et al., 2022, Turner et al., 2023]. However, other recent work has suggested that these model editing and pruning techniques often have unintended side effects, especially on larger models [Hoelscher-Obermaier et al., 2023, Cohen et al. 2023, Huang et al. 2024, Lo et al., 2024]. How can we refine localization and editing and pruning behavior in more specific and scalable methods?
To understand what model activations and components do, it is crucial to have principled techniques, which ideally involve causally intervening on the model, or otherwise being faithful to the model's internal mechanisms. For example, a great deal of work has been done around activation patching, such as (distributed) interchange interventions [Vig et al. 2020, Geiger et al., 2021, Geiger et al. 2024], causal tracing [Meng et al., 2022], path patching [Wang et al., 2022, Goldowsky-Dill et al., 2023], patchscopes [Ghandeharioun et al., 2024] and causal scrubbing [Chan et al., 2022].. What are the strengths and weaknesses of current techniques, when should or shouldn't they be applied, and how can they be refined? And can we find new techniques, capable of giving new insights?
Many approaches for generating mechanistic explanations are very labor intensive, leading to interest in automated and scalable mechanistic interpretability [Foote et al., 2023, Bills et al., 2023, Conmy et al., 2023, Kramar et al., 2024, Wu et al.2024]. How can we develop more scalable, efficient techniques for interpreting ever larger and more complicated models? How do interpretability properties change with model scale, and what will it take for the field to be able to keep up with frontier foundation models?
Models are complex, high-dimensional objects, and significant insights can be gained from more qualitative, biological-style analysis, such as studying individual neurons [Goh et al., 2021, Gurnee et al., 2024], Sparse Autoencoder features [Cunningham et al. 2023, Bricken et al., 2023], attention heads [McDougall et al., 2023, Gould et al., 2023], or specific circuits [Olsson et al., 2022, Wang et al., 2022, Lieberum et al., 2023]. What more can we learn from such analyses? How can we ensure they’re kept to a high standard of rigor, and what mistakes have been made in past work?
Mechanistic interpretability is sometimes criticized for a focus on cherry-picked, toy tasks. Can we validate that our understanding is correct by doing something useful with interpretability on a real world task, such as reducing sycophancy [Rimsky 2023] or preventing jailbreaks [Zheng et al., 2024]? In particular, can we find cases where mechanistic interpretability wins in a “fair fight”, and beats strong non-mechanistic baselines such as representation engineering [Zou et al., 2023] or fine-tuning?
There are many mysteries in the basic science of model internals: how and whether they use superposition [Elhage et al., 2023], whether the linear representation hypothesis [Park et al., 2023] is true, if features are universal [Olah et al., 2020], what fine-tuning does to a model [Prakash et al., 2024], and many more. What are the biggest remaining open problems, and how can we make progress on them?
Much current mechanistic interpretability work focuses on LLMs. How well does this generalize to other areas and modalities, such as vision [Cammarata et al., 2021], audio, video, protein folding, or reinforcement learning [Hilton et al., 2020]? What can mechanistic interpretability learn from related fields, such as neuroscience and the study of biology circuits, and does mechanistic interpretability have any insights to be shared there?
A significant contributor to the rapid growth of the field is the availability of introductory materials [Nanda 2022], beginner-friendly coding tutorials on key techniques [McDougall 2023], open-sourced code and easy-to-use software packages (for example, Nanda and Bloom [2022] or Fiotto-Kaufman [2023]), which makes it easier for new researchers to begin to contribute to the field. How can the field continue to foster this beginner-friendly environment going forward?
Mechanistic interpretability is sometimes analogized to the neuroscience of machine learning models. Multimodal neurons were found in biological networks [Quiroga et al., 2005] and then artificial ones [Goh et al., 2021], and high-low frequency detectors were found in artificial networks [Schubert et al., 2021] then biological ones [Ding et al., 2023]. How tight is this analogy, and what can the two fields learn from each other?

Besides panel discussions, invited talks, and a poster session, we also plan on running a hands-on tutorial exploring newer results in the field using Nanda and Bloom [2022]'s TransformerLens package.