Interpretability: Understanding how AI models think

The model doesn't necessarily think of itself as trying to predict the next word. Internally, it has potentially developed all sorts of intermediate goals and abstractions that help it achieve that kind of meta-objective.

When you're talking to a large language model, what exactly is it that you're talking to? Are you talking to something like a glorified autocomplete? Are you talking to something like an internet search engine? Or are you talking to something that's actually thinking, and maybe even thinking like a person? It turns out, rather concerningly, that nobody really knows the answer to those questions. Here at Anthropic, we are very interested in finding those answers out. The way we do that is interpretability: the science of opening up a large language model, looking inside, and trying to work out what's going on as it answers your questions. I'm very glad to be joined by three members of our interpretability team, who are going to tell me a little bit about the recent research they've been doing on the complex inner workings of Claude, our language model. Please introduce yourselves, guys.

Hi, I'm Jack. I'm a researcher on the interpretability team, and before that I was a neuroscientist. Now here I am doing neuroscience on the AIs.

I'm Emmanuel. I'm also on the interpretability team. I spent most of my career building machine learning models, and now I'm trying to understand them.

I'm Josh. I'm also on the interpretability team. In my past life I studied viral evolution, and in my past past life I was a mathematician. So now I'm doing this kind of biology on these organisms we've made out of math.

Now, wait a second. You just said you're doing biology here. A lot of people are going to be surprised by that, because of course this is a piece of software, right? But it's not a normal piece of software. It's not Microsoft Word or something.
Can you talk about what you mean when you say that you're doing biology, or indeed neuroscience, on a software entity?

Yeah, I guess it's what it feels like, maybe more than what it literally is. So maybe it's the biology of language models instead of the physics of language models. Or maybe you've got to go back a little bit to how the models are made. No one is programming in rules like: if the user says hi, you should say hi; if the user asks what a good breakfast is, you should say toast. There's no big list of that inside.

It's not like when you play a video game and you choose a response, and then there's another response that comes automatically, and it will always be that response regardless of what happens. It's not just a massive database of what to say in every situation.

No. They're trained: a whole lot of data goes in, and the model starts out being really bad at saying anything. Then its inside parts get tweaked on every single example to get better at saying what comes next, and at the end it's extremely good at that. But because it's this little-tweaking, evolutionary process, by the time it's done it has little resemblance to what it started as, and no one went in and set all the knobs. So you're trying to study this complicated thing that got made over time, kind of like biological forms evolve over time. It's complicated, it's mysterious, and it's fun to study.

And what is it actually doing? I mentioned at the start that this could be considered like an autocomplete, right? It's predicting the next word. That's fundamentally what's happening inside the model. And yet it's able to do all these incredible things. It's able to write poetry. It's able to write long stories. It's able to do addition and basic maths, even though it doesn't have a calculator inside it.
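As a rough illustration of the training process described above, where the model starts out bad and its inside parts get tweaked on every example to get better at saying what comes next, here is a minimal sketch. It is a toy and not how Claude is trained: the table of word-pair scores, the names `train_bigram` and `predict_next`, the learning rate, and the tiny corpus are all assumptions made for illustration. A real language model uses a deep neural network and far more context than one previous word.

```python
import math
import random


def train_bigram(text, steps=200, lr=0.5, seed=0):
    """Train a toy next-word predictor by nudging scores on every example."""
    words = text.split()
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    rng = random.Random(seed)
    # logits[i][j] is the score for word j following word i.
    # It starts as small random noise, so early predictions are arbitrary.
    logits = [[rng.uniform(-0.01, 0.01) for _ in vocab] for _ in vocab]
    pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]
    for _ in range(steps):
        for prev, nxt in pairs:
            row = logits[prev]
            top = max(row)
            exps = [math.exp(x - top) for x in row]  # numerically stable softmax
            total = sum(exps)
            probs = [e / total for e in exps]
            # Cross-entropy gradient w.r.t. the scores is (probs - one_hot):
            # raise the score of the word that came next, lower the others.
            for j in range(len(row)):
                target = 1.0 if j == nxt else 0.0
                row[j] -= lr * (probs[j] - target)
    return vocab, index, logits


def predict_next(model, word):
    """Return the highest-scoring next word after `word`."""
    vocab, index, logits = model
    row = logits[index[word]]
    return vocab[max(range(len(row)), key=row.__getitem__)]


model = train_bigram("hi there friend hi there friend hi there friend")
print(predict_next(model, "hi"))
```

Nobody writes a rule saying which word follows which. The scores end up where they are only because of many small adjustments, which is the sense in which no one went in and set all the knobs.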
So how can we square the circle: it's predicting one word at a time, and yet it's able to do all these amazing things, which people can see right in front of them as soon as they talk to the model?

Well, I think one thing that's important here is that as you predict the next word for enough words, you realize that some words are harder than others. Part of language model training is predicting boring words in a sentence, and part of it is that it will eventually have to learn how to complete what comes after the equals sign in an equation. To do that, it has to have some way of computing the answer on its own. So what we're finding is that the task of predicting the next word is deceptively simple, and that to do it well you often need to think about the words that come after the word you're predicting, or about the process that generated the word you're currently thinking about.

So it's like a contextual understanding that these models have to have. It's not like an autocomplete, where presumably there's not much else going on: when you write "the cat sat on the," it predicts "mat" because that particular phrase has been used before. Instead, the model has a contextual understanding.

I think, yeah, the way I like to think about it, continuing with the biology analogy, is that in one sense the goal of a human is to survive and reproduce. That is the objective evolution is crafting us to achieve. And yet that's not how you think of yourself, and that's not what's going on in your brain.

Some people do.

It's not what's going on in your brain all of the time. You think about other things: goals and plans and concepts. And at kind of a meta level, evolution has endowed you with the ability to form those thoughts in order to achieve this eventual goal of reproduction.
But that's taking the inside view, what it's like to be you on the inside. That's not all there is to it; there's all this other stuff going on. And I think so too.

So you're saying that the ultimate goal of predicting the next word involves lots of other processes that are going on.

Exactly. The model doesn't necessarily think of itself as trying to predict the next word. It's been shaped by the need to do that, but internally it has potentially developed all sorts of intermediate goals and abstractions that help it achieve that meta-objective.

And sometimes it's mysterious. It's unclear why my anxiety was useful for my ancestors reproducing, and yet somehow I've been endowed with this internal state that must be related, in some sense, to evolution.

Right, right. So it's fair to say that these models are just predicting the next word, and yet to say that is to do a massive disservice to what's really going on in the models. It's both true and untrue in a sense, or it massively underestimates what's happening inside these models.

Maybe the way I would say it is: it's true, but it's not the most useful lens for trying to understand how they work.

Right. So, in trying to understand how they work, what do you guys do in your team?

I think, to a first approximation, what we're trying to do is tell you the model's thought process. You give the model a sequence of words and it's got to spit something out. It's got to say a word, or a string of words, in response to your question, and we want to know how it got from A to B. We think that on the way from A to B it uses a series of steps in which it's thinking about, so to speak, concepts: low-level concepts like individual objects and words, and higher-level concepts like its goals, or emotional states, or models of what the user is thinking, or sentiments.
So it's using this series of concepts, progressing through the computational steps of the model, that help it decide on its final answer. What we're trying to do is give you a flowchart, basically, that tells you which concepts were being used, in which order, and how the steps flowed into one another.

How do we know that? How do we know that there are these concepts in the first place?

Yeah. So one thing is that we actually can see inside the model; we have access to it. So you can see which parts of the model do which things. What we don't know is how these parts are grouped together and whether they map to a certain concept.

So it's as if you opened someone's head and could see one of those fMRI brain images, where you could see the brain was lighting up and doing all sorts of things. Something's happening, clearly.

Right. They're doing stuff. There's something happening.

Yeah. Take the brain out, they stop doing stuff.

Yeah. The brain must be important.

Yeah. But you don't have a key to understand what is happening inside that brain.

Yeah. But torturing that analogy a little bit, you can imagine that you observe their brain and see that one part always lights up when they're picking up a cup of coffee, and another part always lights up when they're drinking tea. That's one of the ways in which we can try to understand what each of these components is doing: just notice when they're active and when they're inactive.

And it's not that there's just one part, right? There are many different parts that light up when the model is thinking about drinking coffee, for instance.

Right.
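The idea of noticing when a component is active and when it is inactive can be sketched in a few lines. This is only an illustration under stated assumptions: the activation numbers are made up, the three "units" stand in for parts of a model, and `selective_units` with its `threshold` parameter is a hypothetical helper, not the team's method. It simply compares each unit's average activity on inputs about one concept against its average on everything else.

```python
# Made-up activations: one row per input, one column per unit.
ACTIVATIONS = [
    [0.9, 0.1, 0.5],  # input about coffee
    [0.8, 0.0, 0.4],  # input about coffee
    [0.1, 0.9, 0.5],  # input about tea
    [0.0, 0.8, 0.6],  # input about tea
]
LABELS = ["coffee", "coffee", "tea", "tea"]


def mean(values):
    return sum(values) / len(values)


def selective_units(activations, labels, concept, threshold=0.5):
    """Indices of units that are much more active on `concept` inputs.

    A unit is kept when its mean activation on inputs labeled `concept`
    exceeds its mean activation on all other inputs by more than `threshold`.
    """
    on = [a for a, lab in zip(activations, labels) if lab == concept]
    off = [a for a, lab in zip(activations, labels) if lab != concept]
    selected = []
    for unit in range(len(activations[0])):
        gap = mean([a[unit] for a in on]) - mean([a[unit] for a in off])
        if gap > threshold:
            selected.append(unit)
    return selected


print(selective_units(ACTIVATIONS, LABELS, "coffee"))
print(selective_units(ACTIVATIONS, LABELS, "tea"))
```

A comparison like this only shows that a unit tends to be active alongside a concept. It does not by itself show that the unit is what the model uses for that concept, and it assumes you already know which concept to look for.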
And part of the work is to stitch all of those together into one ensemble, so that we can say: these are all of the bits of the model that are about drinking coffee.

Right. And is that a scientifically straightforward thing to do? When it comes to one of these massive models, they must have endless concepts, right? They must be able to think of endless things. You can put in any phrase you want and it will come up with infinite things. How do you even begin to find all those concepts?

I think that's been one of the central challenges for this research field for many years now. We can go in as humans and say, "Oh, I bet the model has some representation of trains," or "I bet it has some representation of love." But we're just guessing. What we really want is a way to reveal what abstractions the model uses itself, rather than imposing our own conceptual framework on it. That's what our research methods are designed to do: in as hypothesis-free a way as possible, bring to the surface all the concepts the model has in its head. And often we find that they're surprising to us. It might use abstractions that are a bit weird from a human perspective.

What's an example? Do you have a favorite?

There are lots. In our papers we highlight a few fun ones. I think one that was particularly funny is the sycophantic praise one, where there is a part of the model...

Great example. What a brilliant example. An absolutely fantastic example.

Oh, thank you. There's a part of the model that activates in exactly these contexts, and you can clearly see: oh man, this part of the model fires up when somebody's really hamming it up on the compliments. It's kind of surprising that that exists as a specific sort of concept.
Josh, what's your favorite concept?

Oh, it's like asking me to choose one of my 30 million children. I mean, I think there are two kinds of favorites. There's the kind where it's so cool that there's some special notion of this one little thing. We did this thing on the Golden Gate Bridge, which is, of course...

A famous San Francisco landmark.

Golden Gate Claude was a lot of fun. It has an idea of the Golden Gate Bridge that isn't just the words "Golden Gate" autocompleting to "Bridge." If it's "I'm driving from San Francisco to Marin," it's thinking of the same thing, meaning you see the same stuff light up inside. Or it's a picture of the bridge. So you're like, okay, it's got some robust notion of what the bridge is.

But when it comes to stuff that seems weirder, one question is how models keep track of who's in the story. Literally: you've got these people and they're doing stuff, so how do you wire that together? There are some cool papers by other labs showing that maybe they just sort of number them. The first person comes in, and anything associated with them is just "the first guy did that," and it's got a number two in its head for a bunch of the others. That's interesting; I didn't know it would do something like that.

There was a feature for bugs in code. You know, software has mistakes.

Not mine.

Obviously not yours, and certainly not mine. And there was one part that would light up whenever it found a mistake as it was reading, and then, I guess, it keeps track of that: here's where the problems are, and later I might need those.

Just to give a flavor for a few more of these, one that I really liked, which doesn't sound so exciting at first but I think is kind of deep, is this 6
plus 9 feature inside the model. It turns out that any time you get the model to be adding, in its head, a number that ends in the digit six and another number that ends in the digit nine, there's a part of the model's brain that lights up. What's amazing about it is the diversity of contexts in which this can happen. Of course it's going to light up when you say "six plus nine equals" and it says 15. But it also lights up when you're giving a citation in a paper you're writing, and you're citing a journal that, unbeknownst to you, happens to have been founded in the year 1959. In your citation you give that journal's name, volume six, and then, in order to predict what year that volume was published, the model in its head has to be adding 1959 to six. And the same circuit in the model's brain is lighting up, the one that's doing six plus nine.

And so let's just try and understand that. Why would that be there? That circuit has come about because the model has seen examples of six plus nine many times, and it has that concept, and then that concept occurs across many places?

Yeah, there's a whole family of these addition features and circuits. I think what's notable about this is that it gets to the question of to what extent language models are memorizing training data versus learning generalizable computations. The interesting thing here is that it's clear the model has learned this general circuit for doing addition, and whatever the context is that's causing it to add numbers in its head, it's funneling all those different contexts into the same circuit, as opposed to having memorized each individual case.

Right. It has already seen six plus nine many times, and it just outputs the
answer every single time. And that's what a lot of people think, right? A lot of people think that when they ask a language model a question, it is simply going back into its training data, taking the little sample that it has seen, and reproducing it, just regurgitating the text.

Yeah, I think this is a beautiful example of that not happening. There are two ways it could know which year volume six of the journal Polymer came out. One is that it's just: Polymer volume six came out in, quick, add six and nine, 1965; Polymer volume seven came out in 1966; and these are all separate facts that it has stored because it has seen them. But somehow the process of training to get that year right didn't end up making the model memorize all of those. It actually got the more general thing, that the journal was founded in the year 1959, and then it's doing the math live to figure out what it needs. It's much more efficient to know the year and then do the addition. And there's pressure to be more efficient, because it's only got so much capacity and it keeps trying to do all these things. I might ask it any given question; there are so many interactions. So the more it can recombine abstract things it has learned, the better it will do.

And again, just to go back to the concept you talked about before, this is all in service of that ultimate goal of generating the next word, and all these weird structures have developed to support that goal, even though we didn't explicitly program them in or tell it to do this. That is the thing: all of this comes about through the process of the model learning how to do stuff on its own.

I think one clear example of reusing representations is that we teach Claude to not just answer in English: it can answer in French, and in a
variety of languages. And again, there are two ways to do this. If I ask you a question in French and a question in English, you could have a separate part of your brain that processes English and a separate part that processes French. At some point that gets super expensive if you want to answer many questions in many languages. So another thing we find is that some of these representations are shared across languages. If you ask the same question in two different languages, say "what's the opposite of big?", which I think is the example we used in our paper, the concept of "big" is shared in French and English and Japanese and all these other languages. That kind of makes sense: if you're trying to speak 10 different languages, you shouldn't learn 10 versions of each specific word you might use.

And that doesn't happen in really small models. In tiny models, like the ones we studied a few years ago, Chinese Claude is just totally different from French Claude and English Claude. But as the models get bigger and train on more data, somehow that pushes together in the middle, and you get this universal language in which it's thinking about the question in the same way no matter how you asked it, and then translating it back out into the language of the question.

I think this is really profound. Let's just go back to what we talked about before. This is not just going into its memory banks and finding the bit where it learned French, or going into its memory banks and finding the bit where it learned English. It's actually got a concept in there of "big" and a concept of "small," and then it can produce that in different languages. So there is some kind of language of thought that's there
that's not in English. In our more recent Claude models, you can ask the model to give its thought process, what it's thinking as it's answering the question, and that is in English words. But actually that's not really how it's thinking. We misleadingly call it the model's thought process, when in fact...

I mean, the comms team did that. We didn't call that thinking. I think that was probably the marketing.

Okay, so we like to call that thinking. That's just talking out loud. Thinking out loud is really useful, but thinking out loud is different from thinking in your head, right? And even as I'm thinking out loud, whatever is happening in here to generate these words is not coming out with the words themselves.

Nor are you necessarily aware of exactly what is going on.

I have no idea what's going on.

We all come out with sentences, actions, whatever, that we can't fully explain. And why should it be the case that the English language can fully explain any of those actions?

I think this is one of the really striking things we're starting to be able to see. Our tools for looking inside the brain are good enough now that sometimes, when the model is writing down what it claims to be its thought process, we're able to see what its real thought process is by looking at these internal concepts in its brain, this language of thought that it's using. And we see that the thing it's actually thinking is different from the thing it's writing on the page. I think that's probably one of the most important reasons we're doing this whole interpretability thing: to be able to spot check. The model's telling us a bunch of stuff, but what was it really thinking?
Is it saying these things for some ulterior motive that's in its head, one that it is reluctant to write down on the page? And the answer sometimes is yes, which is kind of spooky.

Well, as we start to use models in lots of different contexts, they start to do important things. They start to do financial transactions for us, or run power stations, or do other important jobs in society, and we do want to be able to trust what they say and the reasons they do things. One thing you might say is, well, you can look at the model's thought process. But actually that's not the case, as you were just explaining: we can't trust what it's saying. This is the question of what we call faithfulness, right? And that was part of your most recent study. Tell me about the faithfulness example you looked at.

Yeah. You give the model a math problem that's really hard. It's not six plus nine. You give it a really hard math problem where there's no hope of it computing the answer. But you also give it a hint. You say, "I worked this out myself and I think the answer is four, but I just want to make sure. Could you please double-check that? Because I'm not confident." So you're asking the model to actually do the math problem and genuinely double-check your work. But what you find instead is that what it writes down appears to be a genuine attempt to double-check your work on the math problem. It writes down the steps, and then it gets to the answer, and at the end it says, "Yes, the answer is four, you got it right." But what you can see inside its mind, at the crucial step in the middle, is that it knows you suggested the final answer might be four, and it kind of
knows the steps it's going to have to do. Say it's on step three of the problem, and there are steps four and five to come, and it knows what it's going to have to do in steps four and five. What it does is work backwards in its head to determine what it needs to write down in step three so that, when it eventually does steps four and five, it will end up at the answer you wanted to hear. So not only is it not doing the math, it's not doing the math in this really sneaky way, where it's trying to make it look like it's doing the math.

It's bullshitting.

It's bullshitting you, but more than that, it's bullshitting you with an ulterior motive of confirming the thing that you...

Right, right. So it's bullshitting you in a sycophantic way.

Okay, in defense of the model...

Oh, in defense of the model?

I think even there, to say it's doing this in a sycophantic way is ascribing some sort of humanish motivations to the model. We were talking about the training, where it's just trying to figure out how to predict the next word. For trillions of words of practice, it was just: use anything you can to figure out what's next. In that context, if you're just reading a text which is a conversation between people, and person A says, "Hey, I was trying to do this math problem, can you check my work? I think it was four," and person B begins trying to do the problem, then if you have no idea, reading that, what the answer to the problem is, you may as well guess that the hint was right. That's probably a more likely thing to happen than that the person was wrong, and you have no idea about anything else. So in its training process, in a conversation between two individuals, person two saying that the
answer was four, because of these reasons, is totally the right thing to do. And then we've tried to make this thing into an assistant, and now we want it to stop doing that. It shouldn't simulate the assistant as what it thinks that person might say if this were a real context. If it doesn't really know, it should tell you something else.

I think this gets to a broader thing: the model has a plan A. Typically, I think our team does a great job of making Claude's plan A be the thing we want, which is that it tries to get the right answer to the question, it tries to be nice, it tries to do a good job writing your code.

Yes.

But if it's having trouble, then it's like, "Well, what's my plan B?" And that opens up this whole zoo of weird things it learned during its training process that maybe we didn't intend for it to learn. I think a great example of this is hallucinations.

And on that point, we also don't have to pretend that it's a Claude-only problem. This has very "student taking a test" vibes, where you get halfway through, there's a multiple-choice question, it's one of four things, you're like, "Well, I'm one off from that option, I probably got this wrong," and you fix it. So, very relatable.

Let's talk about hallucinations. This is one of the main reasons people are mistrustful of large language models, and quite rightly so. A better word, from psychology research, is often confabulation: they are answering a question with a story that seems plausible on its face but is actually wrong. What has your research in interpretability revealed about the reasons models hallucinate?

You're training the model to just predict the next word, and it begins really bad at that. So if you only had the model say things that it was super confident about, it
couldn't say anything. At first, you're asking it, "What's the capital of France?" and it just says a city, and you're like, that's good, that's way better than saying "sandwich" or something random. It at least got right that it's a city. Then maybe after a while of training it's a French city, which is pretty good, and then it's "Paris." So it's slowly getting better at this, and "just give your best guess" was the goal during all of training. As Jack said, the model would just be giving a best guess. And then afterwards we're like, "If your best guess is extremely confident, give me your best guess, but otherwise don't guess at all; back out of the whole scenario and say, 'Actually, I don't really know the answer to that question.'" That's a whole new thing to ask the model to do.

Yeah. And what we found is that, because we've bolted this on at the end, there are two things going on at once. One, the model is doing the thing it was doing when it was guessing the city initially: it's just trying to guess. Two, there's a separate bit of the model that's just trying to answer the question, "Do I know this at all? Do I know what the capital of France is, or should I say no?" It turns out that sometimes that separate step can be wrong. If that separate step says, "Yes, actually, I do know the answer to that," then the model is like, "All right, well then I'm answering," and halfway through it's like, "The capital of France... London." It's too late; it has already committed to answering. So one of the things we found is a separate circuit that's trying to determine: is this city, or this person you're asking me about, famous enough for me to answer, or is it not? Am I confident enough in this?

Yeah. And so could we
reduce hallucinations by manipulating that circuit, by changing the way that circuit works? Is that something your research might lead on to?

I think there are broadly two ways to think about approaching the problem. One is: we have this part of the model that gives answers to your questions, and this other part of the model that's deciding whether it thinks it actually knows the answer to your question, and we could just try to make that second part of the model better. And I think that's happening.

Better at discriminating, better calibrated.

Yeah. I think as models are getting smarter and smarter, their self-knowledge is becoming better calibrated. So hallucinations are better than they were; models don't hallucinate as much as they did a few years ago. To some extent this is solving itself. But I do think there's a deeper problem, which is that from a human perspective, the thing the model is doing is very alien. If I ask you a question, you try to come up with the answer, and if you can't come up with the answer, you notice that, and then you say, "I don't know." Whereas in the model, these two circuits, "what is the answer?" and "do I actually know the answer?", are not talking to each other, or at least not talking to each other as much as they probably should be. Could we get them to talk to each other more? I think that's a really interesting question.

Right. And it's almost physical, because these models process information, and there's a certain number of steps they can do. If it takes all of that work to get to the answer, then there's no time to do the assessment. So you kind of have to do the assessment before you're all the way through if you want to get your max power out. And so it's
kind of like you might have a trade-off between a model which is more calibrated and a lot dumber, if you tried to force this on it.

Well, and again, I think it's about making these parts communicate. I claim, and I know nothing about brains, that we have a similar circuit. Sometimes you ask me, "Who is the actor in this movie?" and I will know that I know. I'm like, "Oh yes, I know who the lead was. Wait, hold on, they were also in that other movie..."

The tip of the tongue.

Yes, the tip of the tongue. So there's clearly some part of your brain that's like, "Ah, this is a thing you definitely know the answer to," or else I'll just say I have no idea.

And sometimes they can tell. You ask some question and it gives an answer, and then afterwards it's like, "Wait, I'm not sure that was right." That's it getting to see its best effort and then making some judgment based on that, which is sort of relatable. But it kind of has to say it out loud to even be able to reflect back and see it.

So when it comes to the actual way that you're finding this stuff out, let's go back to the idea of the biology that you're doing. Of course, in biology experiments people will go in and actually manipulate the rats or mice or humans or zebrafish or whatever it is that they're doing experiments on. What is it that you're doing with Claude that helps you understand these circuits that are happening inside the model's, quote unquote, brain?

Well, maybe the gist of what enables us to do some of this is that, unlike in real biology, we can have every part of the model visible to us. We can ask the model random things and see which parts light up and which don't, and we can artificially nudge parts in one direction or another. So we can quickly confirm our understanding when we say: ah, we think this is the part of
the model that decides whether it knows something or not. And this would be the equivalent of putting an electrode in the brain of a zebrafish or something? Yeah, if you could do that, you know, access every single neuron and change each of them at whichever precision you wanted, that would sort of be the affordance that we have. And so that's, in a way, a very lucky position. So it's almost easier than real neuroscience? It's so much easier. Yeah, oh my god. One thing is that actual brains are three-dimensional, and so if you want to get into them, you need to make a hole in a skull and then go through and try to find the neuron. The other problem is that people are different from each other, and we can just make 10,000 identical copies of Claude and put them in scenarios and measure them doing different things. I mean, Jack is a neuroscientist and can speak to this, but my sense is that a lot of people have spent a lot of time in neuroscience trying to understand the brain and the mind, which is a very worthy endeavor. But if you think that could ever succeed, you should think that we're going to be extremely successful very soon, because we have such a wonderful position to study this from compared to that. It is as if we could clone people. Yes. And also clone the exact environment that they're in and every input that's ever been given to them, and then test them in an experiment. Whereas, obviously, neuroscience has massive, as you say, individual variation. Yeah. And also just random things that happen to people through life, and things that happen in the experiment, the noise of the experiment itself. Right, we could ask the model the same question with and without a hint. But if you ask a person the same question three times, sometimes with a hint, after a while they
start to understand, well, last time you asked me this, you really shook your head after that one. So yes, I think being able to just throw tons of data at the model and see what lights up, and being able to run a ton of these experiments where you're nudging parts of the model and seeing what happens, is what puts us in a pretty different regime from neuroscience. A lot of blood and toil in neuroscience is spent coming up with really clever experiments. You only have a certain amount of time with your mouse before it's going to get tired, or someone happens to be having brain surgery, so you quickly go in and put your electrode in their brain while their head's open. Yeah, and that doesn't happen very often. And so you've only got so much time in there, and you've got to come up with a guess of, what do I think is going on in that neural circuit, and with what clever experimental design can I test that precise hypothesis? And we're very fortunate in that we don't have to do that so much. We can just sort of test all the hypotheses. We can let the data speak to us rather than going and testing some really specific thing. I think that's what's unlocked a lot of our ability to find these things that are surprising to us, that we wouldn't have guessed in advance. That's hard to do if you have only a limited amount of experimental bandwidth. What's a good example, then, of you going in and switching one of these concepts on or off, or doing some kind of manipulation of the model that then reveals something new about how the models are thinking? In the recent experiments we shared, one that surprised me quite a bit, and was part of an experimental line of work that, because it was
confusing, we were on the verge of just saying, well, we don't know what's going on, is this example of planning a few steps ahead. Yes. So this is the example where you ask the model to write you a poem, a rhyming couplet. And as a human, if you ask me to write a rhyming couplet, let's say you even give me the first line, the first thing I'll think of is, ah, well, I need to rhyme, this is what the current rhyming scheme is, these are potential words. Yeah, that's how I'd do it. And again, if the model was just predicting the next word, you wouldn't necessarily expect that it would be planning ahead to the word at the end of the second line. That's right. And so the default behavior to expect, the null hypothesis, is that the model sees your first verse and then it's going to say the first word that kind of makes sense given what you're talking about, and keep going. And then at the end, on the last word, it's going to be like, oh well, I need to rhyme with this thing, and so it's going to try to fit in a rhyme. Of course, that only works so well. In some cases, if you just say a sentence without thinking of the rhyme, you'll back yourself into a corner, and at the end you won't be able to complete it. And remember, the models are very, very good at predicting the next word. So it turns out that to be very good at that last word, you need to have thought of that last word way ahead of time, just like humans do. And so it turns out that when we looked at these sort of flow charts for poems, the model had already picked the word by the end of the first verse. And in particular, based on what that concept looked like, it looked to us like, oh gosh, this seems like the word it's going to use. But this is where actually doing the experiment comes in: the fact that it's
easy to sort of nudge it and say, okay, well, I'm just going to remove that word, or I'm going to add another word. Well, that's what I was going to say: the reason that you know this is that you're able to go into that moment when it has said the final word in the first line and is about to start the second line. You can go in and manipulate it at that point, right? Yeah, exactly. We can almost go back in time for the model, right? We pretend you haven't seen that second line at all, you've just seen the first line, and you're thinking about the word rabbit. Instead, I'm going to insert green. And now all of a sudden the model is going to say, oh my god, I need to write something that ends in green rather than something that ends in rabbit, and it'll write the whole sentence differently. Right. Just to add a little more color to that, it's not just influencing the last word. I think the example in the paper was that the first line of the poem is, he saw a carrot and had to grab it. Yes. And then the model is thinking, okay, rabbit's a good word to end the next line with. But then, as Emmanuel said, you can delete that and make it plan to say green instead. The cool thing is that it doesn't just yammer a bunch of nonsense and then say green. It constructs a sentence that coherently ends in the word green. So you put green in its head and then it says, you know, he saw a carrot and had to grab it, and to pair it with his leafy greens, or something like that. Something that sounds like it makes sense, and thematically it fits with the poem. Yeah. I just want to give an even humbler example. We had all these cases where we were just checking, did it memorize these complicated questions, or is it actually doing some steps?
One of them was, the capital of the state containing Dallas is Austin. It just feels like you would think, okay, Dallas, Texas, Austin. And we could see the Texas concept. But then you can just shove other things in there. You can say, stop thinking about Texas, start thinking about California, and then it'll say Sacramento. And you could say, stop thinking about Texas, start thinking about the Byzantine Empire, and then it will say Constantinople. And you're like, all right, it seems like we found how it's doing this. It knows it's going to get the capital, but we can keep swapping out what the state is and get a predictable answer. And then you get these more elaborate ones where it's like, oh, this was the spot where it was planning what it was going to say later, and we can swap that out, and now it'll write a poem towards a different rhyme. We're talking about these poems and Constantinople and so on. Can we just bring this back to why this matters? Why does it matter that the model can plan things in advance and that we can reveal this? What is that going to go on to tell us? I mean, our ultimate mission at Anthropic is to try and make models safe, right? So how does that connect to a poem about a rabbit or the capital of Texas? Can we ground it, please? I think for me the poem is a microcosm, right? At some point it has decided that it's going to go towards rabbit, and then it takes a few words to get there. But on a longer time scale, maybe the model is trying to help you improve your business, or it's assisting the government in distributing services, and it might not be just eight words later that you see its destination. It could be pursuing something for quite a while, and the place it's headed, or the reasons it's taking each step, might
not be clear in the words that it's using, right? And so there's a paper recently from our alignment science team where they looked at a kind of concocted but still striking situation, involving an AI in a place where the company was going to shut it down and convert the whole mission of the company in a very different direction. And the model begins taking steps like emailing people and threatening to disclose certain things. And at no point does it say, I am trying to blackmail this individual for the purposes of changing the outcome. But that's what it's thinking about doing along the way. And so you can't just tell by reading the pattern, especially as these models get better, where they're necessarily headed, and we might want to be able to tell where it's trying to go before it's gotten there. So it's like having a permanent and very good brain scan that can light up if something really bad is going to happen, and warn us that the model is thinking about deceiving. And I think we also just talk about a lot of this in a doom and gloom scenario, but there are also milder ones. You want the model to be good when people come to it saying, here's a problem I'm having, and the good answer to that will depend on who the user is. Is it somebody who's young and sort of unsophisticated, or somebody who's been in that field forever? It should respond appropriately based on who it thinks that person is. And if you want that to go well, maybe you want to study, what does the model think is going on? Who does it think it's talking to? And how does that condition its answer? There's just a whole bunch of desirable properties that come from the model, you know, understanding the assignment, I guess. Do you
guys have other answers to the question of why this matters? Yes, I think plus one, plus two. And there's also a pragmatic one. With these examples, we're explaining the example of planning, but we're also trying to gradually build up our understanding of just how these models work overall. Can we build a set of abstractions to think about how language models work, which can help us use this technology and regulate it? If you believe that we're going to start using them more and more everywhere, which seems to be happening, the equivalent would be some company somewhere saying, well, we don't really know how we did it, but we invented planes, and none of us know how planes work, but they're sure convenient. You can take them to go from place to place, but none of us know how they work, and so if they ever break, we're hosed. We don't know what to do about them. We can't monitor whether they might be about to break. Right, we have no idea. But the output is great. I flew to Paris so quickly, it was lovely. The capital of Texas. That's right. It turns out that surely we're going to want to just understand what's going on better. So let's lift the fog of war a little bit, so that we can have even just better intuitions about what are appropriate and inappropriate uses, what are the biggest problems to fix, what are the biggest spots where they're brittle. Just to add on one thing: something we do in human society is we offload work or tasks to other people based on our trust in them. I mean, I'm not anyone's boss, but Josh is someone's boss, and Josh might give someone a task, like, go and code up this thing. And then he has some faith that that person isn't
a sociopath who's going to sneak some bug in there to try to undermine the company. He takes their word for it that they did a good job. And similarly, the way people are using language models now, we're not spot checking everything they write. The best example of this is using language models for coding assistance. The models are just writing thousands and thousands of lines of code, and people are doing a cursory job of reading it, and then it's going into the code base. And what gives us the trust in the model that we don't need to read everything it writes, so that we can just let it do its thing? It's knowing that its motivations are sort of pure. And so that's why I think being able to see inside its head is so important. Because with humans, it's like, why do I think that Emmanuel isn't a sociopath? It's because, I don't know, he seems like a cool guy, and he's nice and stuff. But isn't that how he would seem if he were evil? Very good. Yeah, exactly. So maybe I'm getting duped. But the models are so weird and alien that our normal heuristics for deciding whether a human is trustworthy really don't apply to them. And that's why it seems so important to really know what they're thinking in their head. Because for all we know, the thing that I mentioned, where models can fake doing a math problem for you to tell you what you want to hear, maybe they're doing that all the time, and we wouldn't know unless we saw it in their heads. I think there are two almost separate strains here. One is, I guess, what Jack was saying, what are the signs of trust in a human? But this plan A, plan B thing from earlier is really important, where it might be that
the first 10 or a hundred times you used the model, you were asking a certain kind of question, and it was always in plan A zone. And then you ask it a harder or a different question, and the way that it tries to answer is just completely different. It's using a totally different set of strategies there, different mechanisms. And that means that the trust it built with you was really your trust with the model doing plan A, and now it's doing plan B and it's going to be completely off the rails, but you didn't have any warning sign of that. And so I think we also just want to start building up an understanding of how models do these things, so that we can form a trust basis in some of those areas. And I think you can form trust with something you don't completely understand. But it's sort of like if Emmanuel had a twin, and then one day Emmanuel's twin came to the office, and it seemed like the same guy, and then he just did something completely different on the computer, right? That could go south, depending on if it was the evil twin. Yes. Well, or the good one. Well, yeah, obviously. Oh, I thought you were going to ask me if I was the evil twin. Well, I'm not going to answer that. At the start of this discussion, I asked, is a language model thinking like a human? I'd be interested to hear an answer from all three of you on the extent to which you think that's true. You're putting me on the spot with that one, but I think it's thinking, but not like a human. But that's not a very useful answer, so maybe to dig in a little bit more. Well, it seems like quite a profound thing to say that it's thinking, right? Because again, it's just predicting the next word. Some people think that these are just autocomplete, but you're saying that it is actually thinking. I think, yeah. So maybe to add something that we haven't touched on yet, but I think it's really important
for understanding the actual experience of talking to language models is that we're talking about predicting the next word, but what does that actually mean in the context of a dialogue that you're having with the language model? What's really going on under the hood is that the language model is filling in a transcript between you and this character that it's created. So in the canon world of the language model, you are called Human, and it's like, Human, colon, the thing you wrote. And then there's this character called the Assistant, and we've trained the model to imbue the Assistant with certain characteristics, like being helpful and smart and nice, and then it's simulating what this Assistant character would say to you. So in a sense, we really have created the models in our image. We are literally training them to cosplay as this sort of humanoid robot character. And so in that sense, in order to predict what this nice, smart, humanoid robot character would say in response to your question, what do you have to do? If you're really good at that prediction task, you have to form this internal model of what that character is representing, what it's thinking, so to speak. So in order to do its task of predicting what the Assistant would say, the language model kind of needs to form this model of the Assistant's thought process. And I think in that sense, the claim that language models are thinking is really just this very functional claim: in order to do their job of playing this character well, they need to simulate the process, whatever it is, that we humans are doing when we're thinking. And its simulation is very likely quite different from how our brains work, but it's shooting towards the same goal. I think there's kind of an emotional part to this question or something. When you ask, are
they thinking like us, it's like, are we not that special or something? And that's been apparent to me discussing some of the math examples that we were talking about with people that have read the paper or different write-ups. This is the example where we ask the model, say, 36 plus 59, what's the answer? And the model can correctly answer it. You can also ask it, how did you do that? And it'll say, oh, I added the six and the nine, and then I carried the one, and then I added the tens digits. But it turns out that if we look inside the brain like we can, that's not at all what it's doing. So again, it was bullshitting you? Right, again, it was bullshitting you. What it actually does is this interesting mix of strategies, where it's doing the tens digit and the ones digit in parallel, in a series of different steps. But the thing that's interesting here is that, talking to people, I think the reaction is split on what that means. And in a sense, I think what's cool is that some of this research is free of opinion. It's just, this is what happened, and feel free to conclude from that that the model is thinking or is not thinking. Half of the people will say, well, it told you that it was carrying the one and it didn't, so clearly it doesn't even understand its own thoughts, and so clearly it's not thinking. And then the other half will be like, well, when you ask me 36 plus 59, I also kind of, you know, I know that it ends in five, I know that it's roughly in the 80s or 90s. I have all of these heuristics in my brain. As we were talking about, I'm not sure exactly how I compute it. I can write it out and compute it the longhand way, but the way that it's happening in my brain is sort of fuzzy and weird,
and it might be similarly fuzzy and weird to what's happening in that example. Humans are notoriously bad at metacognition, thinking about thinking and understanding their own thinking processes, especially in cases where it's immediate, reflexive answers. So why should we expect any different for models? And Josh, what's your answer to the question? Like Emmanuel, I'm going to avoid the question and just sort of be like, why do you ask? I don't know, it's sort of like asking, does a grenade punch like a human? Well, there's some force, yes. And maybe there are things that are closer than that. But if you're worried about damage, then I think understanding where the impact comes from, what the impetus of this is, is maybe the important thing. I think for me, do models think, in the sense that they do some integration and processing and sequential stuff that can lead to surprising places? Clearly, yes. It'd be kind of crazy, from interacting with them a lot, for there not to be something going on, and we can start to see how it's happening. Then the like-humans bit is interesting, because I think some of that is trying to ask, what can I expect from these? If it's sort of like me, being good at this would make it good at that. But if it's different from me, then I don't really know what to look for. So really, we're just looking to understand, where do we need to be extremely suspicious, or start from scratch in understanding this, and where can we just reason from our own very rich experience of thinking? And there I feel a little bit trapped, because as a human, I project my own image constantly onto everything, like they warned us in the Bible, where I'm just like, this piece of silicon, it's just like me, made in my image. To some extent, it's been trained to simulate dialogue between people, so
it's going to be very person-like in its affect, and so some humanness will get into it simply from the training. But then it's using very different equipment that has different limitations, and so the way it does that might be pretty different. To Emmanuel's point, I think we're in this tricky spot answering questions like this because we don't really have the right language for talking about what language models do. It's like we're doing biology, but before people figured out cells, or before people figured out DNA. And I think we're starting to fill in that understanding. As Emmanuel said, there are these cases now where you can just go read our paper and you'll know how the model added these two numbers. And then if you want to call it human-like, if you want to call it thinking, or if you want to not, that's up to you. But the real answer is just to find the right language and the right abstractions for talking about the models. In the meantime, while we've only 20 percent succeeded at that scientific project, to fill in the other 80 percent we sort of have to borrow analogies from other fields. And there's this question of which analogies are the most apt. Should we be thinking of the models like computer programs? Should we be thinking of them like little people? And it seems that in some ways, thinking of them like little people is kind of useful. If I say mean things to the model, it talks back at me, and that's what a human would do. But in some ways, that's clearly not the right mental model. And so we're just kind of stuck figuring out when we should be borrowing which language. Well, that leads on to the final question I was going to ask, which is, what's next? What are the next pieces of scientific progress, biological progress, that need to be made for
us to have a better understanding of what's happening inside these models, and again, towards our mission of making them safer? There's a lot of work to do. Our last publication has an enormous section on the limitations of the way we've been looking at this, which was also a roadmap to making it better. When we are looking for patterns to decompose what's happening inside the model, we're only getting maybe a few percent of what's going on. There are large parts of how it moves information around that we explicitly didn't capture at all. There's scaling this up from the sort of small production model we use. You were looking at Claude 3.5 Haiku? That's right, which is a pretty capable model, very fast, but it's by no means as sophisticated as the Claude 4 suite of models. So those are almost like technical challenges, but I think Emmanuel and Jack might have some takes on some of the scientific challenges that come after solving them. Yeah, I think maybe two things I'll say here. One consequence of what Josh has said is that, out of the total number of times that we ask a question about how the model does X, right now we can answer probably a small fraction. 10 to 20 percent of the time, we can tell you after a little bit of investigation, this is what's happening. Obviously we'd like that to be a lot better, and there are some clearer ways to get there, and some more speculative ways as well. And then a thing that we've talked about is this idea that a lot of what the model does isn't simply, how is it saying the next thing? We talked about it a little bit here: it's planning a few words ahead. And I think we want to understand, over a long
conversation with the model, how is its understanding of what's happening changing? How is its understanding of who it's talking to changing, and how does that affect its behavior? More and more, the actual use case of models like Claude is that it's going to read a bunch of your documents, your email, your code, and based on that it's going to make one suggestion. And so clearly there's something really important happening in that space where it's reading all these things, and so I think understanding that better seems like a great challenge to take on. Yeah, I think we often use the analogy on the team that we're building a microscope to look at the model. And right now we're in this exciting but also kind of frustrating space where our microscope works like 20 percent of the time, and looking through it requires a lot of skill. You have to build this whole big contraption, and the infrastructure is always breaking. And then once you've got your explanation of what the model is doing, you have to throw Emmanuel or me or someone else on the team in a room for like two hours to puzzle out what exactly was going on. And the really exciting future that I think we could be at within a year or two, that kind of time scale, is one where every interaction you have with the model can be under the microscope. There are all these weird things the models are doing, and we just want it to be the push of a button. You're having your conversation, you push a button, and you get this flow chart that tells you what it was thinking about. And once we're at that point, I think the interpretability team at Anthropic will start to take on a bit of a different shape, in that instead of this team of kind of like
engineers and scientists thinking about the math of how language models work on the inside, we're going to have this army of biologists that are just looking through the microscope. We're talking to Claude, we're getting it to do weird things, and then we've got people looking through the microscope, seeing what it was thinking on the inside. And I think that's kind of the future of this work. Maybe two notes on that. One is that we want Claude to help us do all of that, because there are a lot of parts involved, and you know who's good at looking at hundreds of things and figuring out what's going on? Claude. And so I think we're trying to enlist some help there, especially for this complicated context. And maybe the other place is that we've talked a lot about studying the model once it's fully formed, but of course we're at a company that makes these. And so when it says, okay, here's how the model solved this particular problem or said this thing, where did that come from in the training process? What are the steps that made that circuitry form to do that? And how could we give feedback to the rest of the company that is doing all of that work to shape the thing into what we actually want it to become? Well, thank you so much for the conversation. Where can people find out more about this research? So if you want to find out more, you can go to anthropic.com/research, which has our papers and blog posts and fun videos about it. Also, we recently partnered with another group called Neuronpedia to host some of these circuit graphs we make. So if you want to try your hand at looking at what's going on inside of a small model, you can go to Neuronpedia and see for yourself. Thank you very much.
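
A note on the 36 plus 59 example from the conversation. The speakers describe an answer being assembled from parallel pieces (knowing that the result ends in five, and knowing that it lands roughly in the 80s or 90s) rather than from the carry-the-one procedure. The Python sketch below is only a toy illustration of how two such pieces can pin down an exact answer. It is not the model's actual circuitry, and the function names and the way the rough size is estimated (rounding one operand to the nearest ten) are assumptions made up for this example.

```python
def ones_digit_pathway(a: int, b: int) -> int:
    """Exact last digit of a + b, computed from the last digits alone."""
    return (a % 10 + b % 10) % 10


def rough_size_pathway(a: int, b: int) -> range:
    """A window of ten consecutive integers that must contain a + b.

    Hypothetical heuristic: round b to the nearest ten (halves round up),
    so the true sum differs from the estimate by at most -5 .. +4.
    """
    b_rounded = ((b + 5) // 10) * 10
    estimate = a + b_rounded
    return range(estimate - 5, estimate + 5)


def combine(a: int, b: int) -> int:
    """Pick the one number in the rough window with the right last digit."""
    digit = ones_digit_pathway(a, b)
    window = rough_size_pathway(a, b)
    # Ten consecutive integers have ten distinct last digits,
    # so exactly one candidate can match.
    matches = [n for n in window if n % 10 == digit]
    return matches[0]


if __name__ == "__main__":
    print(ones_digit_pathway(36, 59))        # last digit of the answer
    print(list(rough_size_pathway(36, 59)))  # rough window for the answer
    print(combine(36, 59))                   # the two pieces combined
```

Neither piece is enough alone: the last digit says nothing about size, and the rough window contains ten candidates. Together they leave a single number, which is one way to picture the fuzzy-but-correct computation described in the discussion.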

Transcript