What is interpretability?

AI interpretability is crucial because neural networks are "grown" through training rather than explicitly designed, making their internal workings and emergent behaviors difficult to understand.
Researchers employ "mechanistic interpretability" to dissect these complex systems, akin to studying an organism's biology, by building understanding from small units to larger internal mechanisms.
Understanding the inside of AI models enables "model medicine" to diagnose issues, ensure safety and reliability, and proactively steer models towards desired, predictable behaviors.

Interpretability — The scientific field focused on understanding the internal workings, decisions, and behaviors of AI models. Mechanistic Interpretability — A specific interpretability approach that aims to understand AI models by dissecting them from their smallest computational units to larger, functional mechanisms. Neural Network — An AI model structured like interconnected neurons, which learns patterns from data through training to perform tasks. Circuits — The intricate, learned pathways or structures that form within a neural network during training, responsible for implementing the model's observed behavior. Scaffold — The initial, untrained, structural framework of a neural network upon which complex internal "circuits" develop during the learning process. Model Biology — An analogy describing the process of applying scientific methodologies, similar to biological research, to study and understand the internal workings of AI models. Model Medicine — An analogy referring to the practice of using insights gained from AI interpretability to diagnose, troubleshoot, and improve the reliability and safety of AI models.

I work at Anthropic on the Interpreability Team. Interpreability is the science of understanding these AI models from the inside out. Researchers like me are trying to figure out what the networks learned and how they do what they do. It's almost like doing biology of a new kind of organism. We're from this one approach called mechanistic interpability. We try to build from understanding very small units into understanding larger and larger mechanisms. It's often surprising to people that we need to go and do interpretability at all, that we don't understand these systems that we've created. In some important way, we don't build neural networks, we grow them, we learn them. It's a lot like evolution. It's a lot like the way that we started with little molecules bouncing against each other and then you got very basic proteins and then maybe you got cells. And in the end, you have us, right? But no one designed us to make sense. Just every generation gives us grand progression of refinement and change over time. The models are the same way. We start with a kind of blank neural network. It's like an empty scaffold that things can grow on. And then as we train the neural network, circuits grow through it. They implement the model's behavior. And so when the situation where we understand the, you know, we understand that initial scaffold and we gave it. And we understand the process that incentivizes those circuits to form. But we don't know what those circuits are or what they do or how they work. Turns out that's challenging because the circuits get packed very densely. And if you want to understand them, you sort of need to pull apart those overlapped pieces. And so if we want to understand neural networks, we're then left with this challenge of going and studying this thing that we grew, rather than something that we designed from scratch. A child can pass a test at school because they actually learned like the material or they can pass the tests because they cheated. As the model developers, both of those look like the same outcome. And we can't, you know, without interpretability letting us see inside the model, we can't actually tell those to apart. We want these models to be safe and reliable. By studying how they work inside, by doing this kind of model biology, we can do some kind of model medicine that can diagnose and cure what else it and help it do what it's trying to do. The power of interpretability is that it gives us a different lens to go in and ask that question, to go and see potential problems. You could imagine developing techniques to steer models towards the correct behaviors. But if we actually understood all the nuts and bolts, then it seems like we ought to be able to intervene in ways that change what they do. AI is a really interesting moment in its development where we've figured out some things that work. But we don't know the limits of that. And we're just beginning to even find the right words to talk about what's happening. The early 1900s, this like golden age of physics, where sort of quantum mechanics was discovered and special relativity and general relativity and we like finally could understand things about solid state physics and things like all of a sudden we're starting to make sense. And it feels like we're sort of speedrunning that right now in interpretability. The exciting part is it's just, it feels like we're in a position to really understand the core of like, like what is thinking? How does thinking work? Having these hard problems and these deep, really difficult questions and also having just a little bit of traction on them, that's sort of the most I feel like that a scientist can ask for. If you want to really discover deep things and really exciting things. And so I think there's a way in which we're very fortunate to have such interesting and difficult questions to go and grapple with.

TL;DR

AI interpretability is crucial because neural networks are "grown" through training rather than explicitly designed, making their internal workings and emergent behaviors difficult to understand.
Researchers employ "mechanistic interpretability" to dissect these complex systems, akin to studying an organism's biology, by building understanding from small units to larger internal mechanisms.
Understanding the inside of AI models enables "model medicine" to diagnose issues, ensure safety and reliability, and proactively steer models towards desired, predictable behaviors.

Takeaways

AI models are grown, not designed: Neural networks develop complex, often opaque, internal "circuits" during training, meaning their emergent behaviors are not explicitly engineered but rather evolved.
Mechanistic interpretability is essential for understanding: This approach aims to build knowledge of AI models from their fundamental units up to their larger mechanisms, treating them like biological systems.
Interpretability reveals true learning: It allows developers to determine if a model genuinely learned the material or if it found unintended "cheating" shortcuts to pass tests, which look identical from the outside.
Enables model safety and reliability: By seeing inside the model, interpretability helps diagnose problems, implement "model medicine," and develop techniques to steer models towards safe and correct behaviors.
Addressing densely packed circuits: A significant challenge is the dense and overlapped nature of learned circuits within neural networks, which requires specialized methods to disentangle and understand them.
Potential for deep intervention: A comprehensive understanding of a model's internal "nuts and bolts" could lead to more precise and effective interventions to directly change or improve its behavior.
AI interpretability is a groundbreaking field: It's likened to a "golden age of physics" for AI, offering the potential to uncover fundamental insights into intelligence and how thinking works.

Vocabulary

Transcript

Feedback / ReportSpotted an issue or have an improvement idea?