Scaling interpretability

I'm Josh Batson, and I'm here with other members of the Interpretability Team at Anthropic to talk about some of the engineering work that went into our big recent release about interpreting the insides of Claude 3 Sonnet. So why don't we start with some introductions. Jonathan, who are you?
I'm Jonathan Marcus. I've worked on the Interpretability Team for an amazingly long eight months. Prior to this, I worked at Jump Trading doing quantitative finance for like 13 years.
Great. Adly?
Yeah, my name is Adly. I'm also on the Interpretability Team. I've been here doing dictionary learning stuff and sparse autoencoder stuff for about the last 14 months. Before this, I was working on efficient large language model inference at another startup.
TC?
Yeah, and I'm Tom, or TC. I've been on the Interp team about the last year, working on dictionary learning. Before that, I worked at the same company Jonathan did, so Jump, doing high-frequency trading. And before that, I was at Facebook for five years doing kind of back-end infra work there.
So the reason we're here now is because there was a big interpretability release recently. What were you trying to do there and why?
Yeah, I think the best way to describe this is that last year we published a paper called Towards Monosemanticity, which really demonstrated that this technique could work to extract interpretable features on a very, very small language model. And then in the months since then, we've just been scaling this up until we reached the size of getting really good features from one of the models that is deployed into production by Anthropic.
Help me understand: what's the difference between a small language model and the one you were tackling for this work?
Yeah, the small one would be so different from any language model you've actually used. If you tried to ask it any sort of question that you might think a language model would be very good at, it's going to totally fail everything.
It's just kind of a very, very poor model. So it was helpful for the early work we were doing because we think it has a lot of the same structure as a large model, but it's much smaller, so it's much easier to work with. But it's kind of not useful for any actual task. Even if you asked it a fairly basic question like "what do cats say?", I'm not confident it would actually get that right.
It wouldn't meow?
No, I don't think so. But maybe we didn't actually try.
So I think a good analogy is: eight months ago, it's like we said, hey, I think the earth is made of dirt. And so we had a hand drill and we went a couple inches down, like, hey, there's dirt there. And now we made this giant laser drill and went into the Earth's mantle, like, hey, there's lava there.
I guess I know that person was you.
Somebody. Somebody.
I think that's a really good one. You said that and it really stuck with me, because it's like, yes, it's technically the same thing. And yes, we expected there to be lava. But it's just been a huge engineering effort to actually drill down and figure out what's down there. And we've actually found a lot of what we expected. But it's really cool now that we're there.
I think the thing that I want to add there is just how rewarding it is to look at these large language models that can actually do all of these really powerful things. In the one-layer model, we were finding features, but the features corresponded to things like counting to 10 and generating the random string of letters and numbers that you see after URLs. When you scale this up to a much more powerful model, the same technique can find features that really just are interesting, not just from a scientific perspective, but that represent interesting, nuanced topics and that can really shine a light on how the system is able to perform really hard tasks.
And in particular, large language models can perform tasks that we don't know how to program computers to do. They really just have all of these capabilities that we don't understand. And so if you can find the features from them, then you can get this really fascinating insight into how they are able to do these things.
What are some of the features that you saw that were the most striking or moving to you?
I really liked the "functions that add numbers in code" feature, and that it was not very narrow, not just firing on functions that are obviously a plus b. If there's some function which calls some other function which is adding, then the feature also lights up for that. So it has some deeper understanding of what a function that adds is than the very basic one. I thought that was super cool and maybe surprising that it exists. But for a lot of these features, when you first see them, you're shocked. And then when you think about it more, you're like, oh yeah, that seems like a useful and very reasonable thing for the model to do. Maybe I shouldn't have been so surprised that it was there.
I remember first finding the veganism feature. That was really cool. I did not expect to see it. It was not even the biggest model. Not a vegan, by the way. Disclaimer. But it was really interesting to see that the same model could identify the concept of not eating meat, of being worried about factory farming and not wearing leather, and also in a lot of different languages. And the fact that it was able to tie all of these concepts together was a sh... I couldn't believe what I was seeing. The model is not just repeating arbitrary words in some random way. Very concretely, this connection was already built into the model, and we just discovered it in there. And that really blew my mind.
Yeah, I think one of the things which was kind of impressive for me was getting a sense of how the model's thinking about this stuff. I think when language models first started getting big, there was some notion that maybe they're just repeating things they've seen in the training data. And when it says something, it's because there was an extremely similar sentence out there, and it just kind of grabbed that and then gave you whatever somebody said in that context. But seeing these features that are multimodal, multilingual about something...
What do you mean, multimodal?
By multimodal, I mean that the image of the thing makes the same feature fire as the text of the thing.
One of my favorite ones there was a feature about backdoors in code that also fired on images of hacked USB thumb drives and various forms of subterfuge.
Yeah, I think there were like five or six different devices with hidden cameras in them, and it fired for various hidden cameras in everyday objects. Which again, in hindsight, makes total sense. But I definitely wouldn't have guessed that.
Right, the part of the model that literally recognized a line of code that had a problem would be the same thing that says there's a pen with a camera in it.
Yeah. And then you could even artificially activate that and say, hey, can you finish my function? And it would write it while introducing a subtle vulnerability that could be hacked later. That one really blew my mind.
I did appreciate that Claude was kind enough to label the function "backdoor". Any other favorites?
Can we talk about Golden Gate Claude?
Yeah, I mean, Golden Gate Claude was so much fun.
What is Golden Gate Claude exactly?
So, Golden Gate Claude: we had a feature in the paper.
It was kind of the headline and we really liked it, where the feature would activate on descriptions of the Golden Gate Bridge. You know, the iconic, majestic, towering bridge between Marin and San Francisco. So someone had the amazing idea of, hey, do you think we can talk to Golden Gate Claude? And this was one of my favorite parts about Anthropic. I thought this was going to be hard. But then someone from the engineering team just went into our code base, figured it out, and implemented Golden Gate Claude as an experiment, like, hey, do you think I can actually take the results of your dictionary learning paper and just use it? And then we tried it and it worked. And then everybody started playing with it. That was such a cool experience, to actually have the results of our paper brought to life like that.
Yeah, it was incredible. We put out the paper on Tuesday morning, and then we were all out to dinner on Tuesday night. And people inside the company were excited by that figure where, when we turned on the Golden Gate feature and asked Claude, "what is your physical form?", Claude said that it was the majestic Golden Gate Bridge itself. And that was just a little static thing. And Oliver was like, let's do it. Let's make Golden Gate Claude. And while we were eating and celebrating, they started working, and 36 hours later...
We should have.
...there was a polished product that could be shipped, and the world got to talk to a model that had this feature we discovered only weeks before amplified, and got a feeling for what it means to drive the model in one direction or another.
So I know the model was really big. Claude's really big compared to the ones we were working on before. What did you have to do to take the dictionary learning technique for finding the features and scale it up to work on something like that?
Yeah, first of all, I just want to say this is a really long question to answer, because this was probably the bulk of our work between the publishing of Towards Monosemanticity and the publishing of this paper. There were just so many things, and we should get into them, but I want to emphasize that we are not going to be able to talk about more than a fraction of the things that had to be done here, because this was such a big effort.
I also want to frame it slightly differently. When we first got the results in Towards Monosemanticity and started thinking about what we were doing next, we didn't immediately go, oh, we're going to scale this to Sonnet, this will definitely work. We didn't know if this would work at larger scale, so we didn't want to spend eight months just scaling and then check if it actually worked. So it was much more of a back and forth between the engineering side and the research side: what experiments can we do while scaling this up to give us more confidence that scaling it up actually works? So it was not this monolithic thing that we could just plan. It was something where we were scaling it in pieces, confirming that this still looks good, and then scaling it more as we got more confidence.
So what does scaling actually look like?
I think one example here is very illustrative. This is something that came up pretty early on in the process of scaling this up. When we were working on Towards Monosemanticity, all of our models fit on a single GPU. Every sparse autoencoder that we trained fit on a single GPU. And what we realized very quickly is that if you're going to keep scaling this up, this is no longer going to work. You're going to have to chain a bunch of GPUs together and implement something that we call sharding, where you take the parameters of the sparse autoencoder and distribute them among a large number of GPUs.
Okay, what is a sparse autoencoder?
Maybe not everyone knows what they are.
So an autoencoder is something that takes in some data, some vector, and transforms it into another representation from which you can read back the original. The "auto" is that you get back the original; the "encoder" is that you change the representation. A sparse autoencoder is one where this new representation you get is very sparse: there's only a few elements in there that are non-zero. And this can be really nice, because if you're trying to understand what this data means and there are exactly three non-zero elements in the encoding, you can just go look at each of those, whereas if the original vector had a thousand components or something, it might not make any sense. And the basic bet we made here, which is shocking to me that it worked at all, was that we took some of the latent states, the vectors inside Claude, and trained a sparse autoencoder to see if we could represent those as a sum of just a few pieces each time. And the answer was yes. And then when we looked at each of those pieces, they were shockingly interpretable.
From a math perspective, I think a sparse autoencoder is really simple. There's just two matrices involved. From an engineering perspective, I think it proved to be a lot less simple.
And so you could write down the math from our paper in October, and we could copy and paste basically the same math into our paper now. And that is not how it worked on the silicon.
Oh my god.
Something that feels really interesting here to me is that when we first started this project last year, we were experimenting with a bunch of different techniques for this, including a bunch of more complicated techniques. There's a lot of fancy math out there that addresses this problem. It's possible that that math might still work better, but we really just saw a lot of success with sparse autoencoders because you could really scale them up.
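To make the "two matrices" description above concrete, here is a minimal sketch of a sparse autoencoder's forward pass and loss in plain Python. It is an illustration under stated assumptions, not the team's code: the ReLU nonlinearity, the encoder bias, the L1 sparsity penalty, the `l1_coeff` value, and every function and variable name are assumptions made for this example. The conversation only says that the encoding is sparse, that the original vector is read back from it, and that two matrices are involved.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def encode(x, W_enc, b_enc):
    """Map an activation vector to feature activations.

    W_enc has one row per feature. The ReLU (an assumption here) zeroes out
    negative pre-activations, which is what lets most features be exactly 0.
    """
    return [max(0.0, h + b) for h, b in zip(matvec(W_enc, x), b_enc)]


def decode(f, W_dec):
    """Reconstruct the input as a sum of decoder rows weighted by f.

    W_dec has one row (a direction in activation space) per feature.
    """
    d_model = len(W_dec[0])
    return [sum(f[i] * W_dec[i][j] for i in range(len(f))) for j in range(d_model)]


def sae_loss(x, W_enc, b_enc, W_dec, l1_coeff=0.1):
    """Reconstruction error plus an (assumed) L1 penalty that encourages sparsity."""
    f = encode(x, W_enc, b_enc)
    x_hat = decode(f, W_dec)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in f)
    return mse + l1_coeff * sparsity


# Toy example: a 2-dimensional "activation" and 3 candidate features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [2.0, -1.0]

features = encode(x, W_enc, b_enc)
active = [i for i, v in enumerate(features) if v > 0]
print("active features:", active, "reconstruction:", decode(features, W_dec))
```

In the toy example only one of the three features is non-zero, so "understanding" the vector means looking at that one feature. The engineering difficulty discussed here comes from doing the same arithmetic with far larger matrices and far more data, which this sketch does not attempt.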
We tried running all of these other techniques, but you could only run them on a small amount of data. And one of the things we realized is that in order to see really good results, you have to run this on a lot of data, much more data than anybody in the classic mathematical literature ever does. And so sparse autoencoders are in some way just mind-bogglingly simple in a pretty beautiful way. And it's just this philosophy that if you take a simple algorithm, and it's scalable, and you can really turn up the numbers, you can get really beautiful stuff out of it.
But turning the numbers up, that's the hard part. I do almost no math. You do math. You do math. But so much goes into, hey, let's make this thing 10, 100, 1000x, and so on, bigger. And that just breaks all our abstractions, breaks all our code in so many different ways. It just becomes too big in all these dimensions that you were totally unprepared for. It causes weird bugs.
I think one of you told me one of those dimensions was around shuffling the data. Can someone talk through what the shuffle problem was and what you had to do?
Yeah, so there's this perennially hard problem in machine learning where you have your input data, and you've got a whole bunch of A's and a whole bunch of B's and a whole bunch of C's and a whole bunch of D's, and you feed them through your model. If you just feed them in in order, it's going to learn, hey, I should only learn A's. Hey, I should only learn B's. But if it's all mixed up, then it has to learn the whole distribution at every step. And this shuffle process is very easy when your data is small. You just load it into memory, do a random shuffle, and then you write it back out. That's not that hard. Now what do you do when you have petabytes of data?
Oh, so I guess it might be like, if you have to shuffle a deck of cards, you can just do it with your hands.
Yeah.
And if somebody gave you seven consecutive miles of cards stacked end to end, it's not clear how you would shuffle that deck.
Yeah, actually, that's a really good analogy. So right, it's like I have a hundred warehouses full of cards.
Yeah.
And so we talked about it a lot and figured out, hey, is there a way to do this in parallel? And it's like, well, if you have to shuffle a hundred warehouses of cards, first you're going to shuffle one warehouse of cards, so you break it up into a hundred subproblems. And then how do I shuffle one warehouse of cards? I'm going to break it up and do it by section. And then you're going to mix the different sections in some provable way that makes sure that every section gets mixed with every other section. And, I don't know, that sounds really simple. And in a sense it kind of was, once we had understood the problem: oh, we just need to make this a multi-stage parallel shuffle where we break it out. Probably, not anyone, but a lot of people could implement that as part of a coding interview. It's not that hard of an algorithmic problem. But getting to the point where we even realized that, just framing the problem, was 90% of the work. Once we did that and could conceptualize it as, oh, we need to do a multi-stage parallel shuffle, and once you get your recursion properly defined, then it's a pretty easy task to scale it to something. Oh, you want to do 100 terabytes, you want to do 10 petabytes? Cool. Just add another layer.
I think some more context there is interesting, because we're focusing on the part after we decided we're going to speed up shuffle. But I think the part before that shows a lot of what this job is about. What was happening is we were scaling things up and then running experiments as we scaled.
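The "break it up, shuffle the pieces, add another layer when a piece is still too big" idea described above can be sketched as a scatter-then-shuffle recursion. This is one common way to structure such a shuffle, offered as an assumption-laden illustration rather than the team's actual algorithm: the function name `multi_stage_shuffle`, the parameters `num_buckets` and `max_in_memory`, and the use of in-memory lists in place of files on distributed storage are all hypothetical.

```python
import random


def multi_stage_shuffle(shards, num_buckets, max_in_memory, rng):
    """Shuffle records spread over `shards` without one global in-memory shuffle.

    Stage 1 (scatter): every record is sent to a uniformly random bucket.
    Each input shard could be scattered by a separate worker in parallel.
    Stage 2 (local shuffle): a bucket small enough to "fit in memory" is
    shuffled directly; a bucket that is still too big gets another layer
    of the same procedure.
    Returns a list of shuffled output shards, each at most `max_in_memory` long.
    """
    if num_buckets < 2 or max_in_memory < 1:
        raise ValueError("need num_buckets >= 2 and max_in_memory >= 1")

    # Stage 1: scatter records into random buckets.
    buckets = [[] for _ in range(num_buckets)]
    for shard in shards:
        for record in shard:
            buckets[rng.randrange(num_buckets)].append(record)

    # Stage 2: shuffle each bucket, recursing when it is still too large.
    output = []
    for bucket in buckets:
        if len(bucket) > max_in_memory:
            output.extend(multi_stage_shuffle([bucket], num_buckets, max_in_memory, rng))
        elif bucket:
            rng.shuffle(bucket)
            output.append(bucket)
    return output


# Three "warehouses" of sorted records; pretend only 16 records fit in memory.
shards = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
shuffled_shards = multi_stage_shuffle(shards, num_buckets=4, max_in_memory=16,
                                      rng=random.Random(0))
flat = [record for shard in shuffled_shards for record in shard]
print(len(shuffled_shards), "output shards;", len(flat), "records")
```

Reading the output shards back in order gives the shuffled stream. Handling more data only changes how many layers of recursion run, which matches the "just add another layer" remark; a production version would also need to deal with storage, failures, and worker scheduling, none of which is shown here.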
And the shuffle step, before we made it better, was taking longer and longer. So we knew that this step was not scaling well and that it was making it slower to get research results. We also knew there was something better there, but it wasn't something that I could do in a couple of hours. It was like, oh, this is maybe a few days, a few weeks. So we were kind of putting it off, because we could still get experimental results, until eventually this thing was taking 24 hours, something like that. And we were like, okay, we finally need to fix that. And with the fix that we did, you could do something that is maybe more perfect or totally nails shuffle. But we're not focused on what is the optimal, the platonic ideal of parallel shuffle. What we care about is that our job's taking 24 hours. How can it not take 24 hours? So I think a lot of this job is: we want to get experimental results. That's our focus. And then given that goal, how do you get those? So it's generally not "how do you make any step perfect?" It's "how do you make any step as good as we need it to be to get the results that we need right now?" And then as those results come back, you gain more or less confidence in the approach you have. And as you have more confidence that this code base and this approach is something we're going to be using six months later, a year later, you're willing to invest more time into making things better. And I think the heart of the job is how you make that trade-off of how much time to invest into any one piece of this whole pipeline.
I want to draw that out more. It sounds like this kind of engineering, where there's an experimental result at the end, feels like a different process than maybe producing a product. Could you say more about what it's like to do engineering for research?
Yeah, I think it's interesting to compare it to my first job at Facebook, where I was building a service, a back-end service which powered the Facebook website. And I would say the big difference is the requirements of the code. At that job at Facebook, they never really changed. We always knew that this was going to run at scale. It couldn't crash. We cared about the cost of the servers we were running on. I was there a few years and it was kind of always the same goals. Whereas in a research engineering job now, you don't know which bits of the code you're going to be throwing out in two weeks and which bits of the code you're going to be using years later. A lot of the original dictionary learning code was methods which we've deleted. They're gone. We're never touching it. And spending time making that code perfect would have been totally wasted, because it's deleted. But also, over this year-long process, we've honed in on: what we're doing is working, and this core thing is good. We need to make this better. We're going to be using this longer. And if the code quality is crap, it's going to be slowing us down for years. So we need to really go and polish this more. So I think you constantly have to keep those trade-offs in the back of your head, and they're kind of changing under you as you work.
There's another dimension to this that I'd like to talk more about, which is that there's a whole bunch of ideas that we want to try. And when you're looking at implementing these ideas, you're thinking about how to design the infrastructure. And like with any software design, certain infrastructure designs are going to make certain things easy and certain things hard. So there's this really tricky thing, and I think in some ways it's an impossible problem and you can only try to do this very poorly.
But it's trying to anticipate which directions you want to go in in the future, trying to anticipate what general categories of ideas you might want to try, and trying to anticipate how we make these general categories easier. And what are we closing off? What are we making harder to do? Trying to make those trade-offs is a really, really difficult challenge that we try our best at, but it's something that is just impossible to be perfect at.
Did you make any big mistakes?
None.
I think a lot of the errors here feel more like we maybe should have cleaned something up a month sooner. So it's kind of like, oh, maybe we should have done this sooner. But because your trade-offs are changing under you, if right now you're like, I'm not really sure, should we do this, should we not do this, in a month it's frequently blindingly obvious. Like, oh yeah, we should definitely do that. So you do lose that month; it would have been better if you got there sooner. But I generally think you get shoved in the right direction eventually.
But I also think there's an important point, which is that I am not a professional scientist where I'm just looking to publish papers. I'm also not a professional engineer where I'm just looking to build the most perfect, beautiful, harmonious, well-abstracted system. We have a specific target, which is being able to do enough science to figure out interpretability, so that we know how these machines work, to achieve a specific safety goal. We have to do enough science to get there. I have to build enough engineering stuff to support the science. But ultimately it's quite possible that at the end of the day we will throw away every single thing we've built except for that one end result. And so I don't want to spend any additional time researching stuff that's not going to help. I don't want to spend any additional time building stuff that's not going to help.
And getting that trade-off right is super hard. I'm not always so good at it, but you guys are something.
It's also always easier in hindsight.
So what was the most confounding bug in this process?
Yeah, so one of the really dangerous parts of machine learning, and especially when you're doing machine learning on this weird undiscovered topic, is that it's really hard to know if you've written your code right. I remember my machine learning professor told me this in college, and I was like, that doesn't seem so hard. This can't possibly be such a big problem. And then you realize that this is just the thing that is going to eat up more of your time than any other problem. So we had cases where we just lost weeks of effort because we had some evaluation metrics, and the evaluation metrics were bugged in a way that made them too good to be believed and really exciting, and we spent a lot of time chasing that down before we realized that there was just some really subtle bug in our metrics. And it's very hard to test for that. You basically end up needing to spend a lot of engineering time trying to make sure that these things work and that you can trust your evaluations.
Bugs in metrics are scary, because if you're trying to make the number go down and the number is going down, you're like, this is great, everything's great. And then it turns out you were just chasing a complete illusion for weeks. So how do you deal with that? What's testing like for research code?
I think correctness bugs like that are very difficult to test, because it's not clear what the correct answer is. So your classic unit test doesn't really cover this well. I think the thing that helped here was to log as many metrics as you possibly can while this process is training.
Can you think of every possible number you can log, and then graph those? And then for your runs, you can stare at these graphs and ask, what should this look like? Does this make sense? I think there's no easy answer here. It's just time. I think the other piece of this is really going through the code carefully and asking, I know what the math for the ML says this should be doing, but what is it actually doing? We've had a number of times where that didn't match, and I think tracking those down is a very important thing. And I would also say that there are latent bugs in master that you're worried about. I think there's another way that this comes up: every time you have a new idea for how the ML will change, you code that up and you run it, and sometimes the results are worse than your baseline, and you're not really sure, was the idea bad, or when I coded it up, was it bugged? And you don't know. I think that's a difficult trade-off of what to do next, because you can go and stare at the code, you can go and stare at graphs and try to understand whether this thing was bugged in some way. But at some point you have to decide this idea doesn't work, and I'm giving up and moving on to something else.
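As a small illustration of "log every number you can think of and ask whether the graphs make sense", the sketch below computes a handful of diagnostics for one batch of a sparse autoencoder. The specific metrics and all names here (`sae_diagnostics`, `frac_variance_unexplained`, `mean_l0`, `dead_feature_frac`) are assumptions chosen for this example; the conversation does not say which metrics the team actually logs.

```python
def sae_diagnostics(inputs, reconstructions, feature_acts):
    """Compute several per-batch numbers worth logging and graphing.

    inputs, reconstructions: lists of equal-length vectors (one per example).
    feature_acts: list of feature-activation vectors (one per example).
    Each value answers a different sanity question, so a bug that flatters
    one metric has a chance of looking wrong in another.
    """
    n = len(inputs)
    d = len(inputs[0])
    n_features = len(feature_acts[0])

    # How far off are the reconstructions?
    sq_err = sum((x - y) ** 2
                 for xs, ys in zip(inputs, reconstructions)
                 for x, y in zip(xs, ys))

    # Scale the error by how much the inputs vary around their mean.
    means = [sum(xs[j] for xs in inputs) / n for j in range(d)]
    total_var = sum((xs[j] - means[j]) ** 2 for xs in inputs for j in range(d))

    # How many features are active per example, and which never fire at all?
    active_counts = [sum(1 for a in acts if a > 0) for acts in feature_acts]
    ever_active = [any(acts[i] > 0 for acts in feature_acts)
                   for i in range(n_features)]

    return {
        "mse": sq_err / (n * d),
        "frac_variance_unexplained": sq_err / total_var if total_var > 0 else float("nan"),
        "mean_l0": sum(active_counts) / n,
        "dead_feature_frac": 1 - sum(ever_active) / n_features,
        "max_activation": max(max(acts) for acts in feature_acts),
    }


# Toy batch of two examples with three features.
metrics = sae_diagnostics(
    inputs=[[1.0, 0.0], [3.0, 0.0]],
    reconstructions=[[1.0, 0.0], [2.0, 0.0]],
    feature_acts=[[1.0, 0.0, 0.0], [2.0, 0.5, 0.0]],
)
for name, value in sorted(metrics.items()):
    print(f"{name}: {value:.3f}")
```

In a real training loop these values would be sent to whatever plotting or logging system is in use at every step. The point of logging several related numbers is the one made above: a suspiciously good value in one graph is easier to catch when the neighbouring graphs do not agree with it.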
One of the striking things for me, as someone with more of a science background working on a team of really skilled engineers, has been realizing the power of pulling some of the engineering work forward to improve your iteration time. I think that the more your ideas matter, the more you want to spend a lot of time thinking. But if you have no idea what's going to work or not, then making it so you can test a lot of ideas quickly really pays off. And it's this kind of relentless looking at: how would I run this experiment? Okay, could I run that experiment in a day instead of in a week? Could I run it in an hour instead of in a day? Could I kick it off in a minute? Your ideas might be better, but no one has ideas that are 200 times better, such that you would rather take that long to run an experiment.
Speak for yourself.
I think this comes back to the short-term versus long-term trade-off, which I think is really one of the fundamental tensions in doing this sort of research engineering, where you have to decide how much effort to invest into making things better long-term versus how much you want to just try something in the hackiest possible way and get results quickly. And I think that, unlike in a lot of traditional engineering, you don't just want to lean all the way towards the long-term thing. It depends on a lot of factors. It depends on how confident you are that something in this general area will work, on how reusable you think this infrastructure is going to be in the future, and on how easy this is going to be to code up and get working really well.
But it's also informed by the science of, do we think that dictionary learning is a process that we should be going so all in on? It's just having the faith, guided by our scientific intuition, that if we keep pushing here... We're pushing blindly. We don't actually know if we're going towards anywhere until we drill down far enough and, oh, there's lava. It's just a lot of dirt, and then all of a sudden you pull back up and you realize, oh my gosh, we've actually gone so far and we've actually found something. But for a while you're just fumbling in the dark, and nothing works, nothing looks good, nothing makes sense. You just have to believe that if we keep researching this direction, maybe there are signs of life, and eventually we're going to see something useful.
Personal question: why do you like doing this work?
So for me, in my previous role at the company, I used to work on the inference team. On the inference team there is much less of the research aspect. We kind of know exactly the operations that need to be done, the math, and we just need to make them go really, really fast. And it leads you to these really interesting systems problems, low-level GPU optimization problems. But to me it's like, I can kind of plan out what the next six months will look like. You can figure out we're going to design it exactly like this and we need to do A, B, C, and D, and you have this exact plan, and then you go and do it. And I found the work of executing that exact plan a little tedious or boring. There are plenty of people at the company who love that. I just don't, personally. Whereas on this team, we can't plan six months out, right? We don't know what to actually build, and you're following where the research results lead you, and everything's constantly changing. So I really love that piece of this job.
Adly, what do you like about this work?
Yeah, I think there's two questions there, which is why I love the research part and why I love the engineering part, because really I love both of them. I love the research part just because, honestly, there's no better way to describe it than that this is just a really beautiful problem, and it's really fascinating to try to understand this, and it feels amazing when you can shine a tiny little bit of light into the black box of models.
One of the things that I like about this is, the engineering is a lot of fun, sure, but it's also the problem itself. And this goes back to, how does this compare to my previous job doing quant finance? Just studying markets was actually very fascinating. They're always changing; there's a lot of interesting modeling to be done. But here we're essentially doing computational neuroscience on an artificial mind, and no one's ever done that before in history, because these things have never existed. We're among the first people ever to have access to artificial minds this big, with the amount of computational infrastructure that it takes to analyze them. We are literally trying to figure out how these things think. We're studying cognition in a very quantitative way. And it's so mind-blowing to me that almost the same skill set that I was previously using to predict the next price now becomes decoding thought. I loved finance for many, many years, but this just feels so much more meaningful to me.
And I think the really exciting part about trying to tackle these problems with engineering is that it makes them solvable. If you ask yourself, how do you do neuroscience on an artificial mind, that's not the type of problem that you're really going to solve, or maybe you could solve it, but you don't have high confidence in anything.
There is something about building the infrastructure to do this, and building the infrastructure to do a lot of experiments, that makes it feel possible to say we are actually going to do this. Engineering is just a way of making this successful and making this possible. So for the people listening to us who think this sounds kind of cool, do you have any advice about getting into interpretability research or AI research from the engineering side? The first thing I'd say is, I think a lot of people think the work of the interpretability team needs much more of the research skill set than it actually does. The research skill set is important, but the engineering skill set really, really matters too. So we are not just looking for people who have only done math and ML. We need people who are very strong at coding too, and currently we're bottlenecked on hiring very, very strong engineers. So more people like that asking us for jobs would be the first thing. What you can do if you're interested in this and you're a great engineer: ask us for a job, because we are hiring people like you. That's silly, but I think it's very easy to underestimate the contributions that you're able to make, especially if you think of yourself more as an engineer coming into this, and really the advice is just to apply. The other thing I would note on the engineering skill set, what we're looking for and what people might learn, is that I think we need a lot of breadth. We need to make GPUs go fast for the work that we do, but we're not pushing things to the bleeding edge, right? So we need people who can do a bunch of different things and come in and notice, oh, I can make a quick change which gives us a big win.
We aren't really looking for the skill set of, I can spend two months to use the graphics card 10% more efficiently here. We're not going to spend two months on that. We're going to move on to parallelizing other jobs, or figuring out why some Python code's really slow. So you kind of need this breadth to be able to figure out which point in this complicated pipeline is the bottleneck right now, and let's go make that a bit better in a few days. That is kind of a big skill that we'd really, really love to see more of. Yeah, it seems like the team has a lot of full-stack engineering, where the stack goes down to, you know, you could do a few GPU kernels, and all the way up to building front-end interfaces for looking at how images make Claude talk differently, and you never know where in that entire chain might be the thing you need to do. You know, I remember a front-end bug the other day that actually turned out to be a sharding bug. So you thought, okay, the server is rejecting your request, and then it just turned out that no, actually we had shuffled around these tensors in the transposed way, and that was what needed to be fixed. And so it actually means there are a ton of ways to contribute, and also this kind of breadth and fluency can really pay off. So Josh, you're a scientist much more than us. I'd say we all shade pretty far to the engineering side. What's your biggest frustration with people like me? I mean, people like you are so charming. No, I don't think there's a frustration. I think it makes for very good collaborations, because we're in such early days that there's often a lot of room for improvement, and sometimes it turns out that we should just be plotting the correct metric, or changing the initialization scheme for a matrix, and that could also speed up the training process by 5x. And it could be
that you need to speed up the training process by 5x by parallelization. And so I think that there are just these opportunities. This is what I mean by the full stack: it actually continues all the way into the mathematics and all of these pieces of it. So I think that it's really helpful to have a very interdisciplinary approach to this stuff, because sometimes you can sharpen the question: did you really need to run your correlations over the entire data set, or are you trying to estimate a scalar, at which point statistics tells you you need a thousand samples and then you're pretty much good, and you can save a lot of time. I've also really enjoyed, even though we're on separate sub-teams and I don't get to work with you nearly enough, the few times that we did get to collaborate, because I think we have such complementary skill sets. I've said it before: I'm not that great at the math. I still don't know it, I'm sorry, guys, I'll leave. But I really like the culture of collaboration, where you and I will just sit together and pair program on a problem, and we have very complementary interests and skills, where when we work together we are just very, very powerful. And I think that that's a lesson. The reason I bring this up is for people considering, hey, do you think I could come into interp and be useful? It's like, yes. If you are good at some of these things but not all, there's so much value when you pair with other people who have different skill sets, and we really benefit from that collaboration. I think that one of the really fun things about this is you start to learn from those collaborations the shape of a problem that could be solved, which is well in advance of having any idea of how to solve it. But I'm like, hmm, I bet Jonathan could help with this; this part of the thing feels stuck and I don't know enough yet to be able to do that. But then we can sit together and it's like, oh yeah, that's the kind
of thing that I could bang out right now. Or on the visualization side, I feel like I'm clicking around between 17 windows right now, and we've gotten the parallelization down, it's super fast to run these jobs, and now it's taking me 30 minutes to look at the results. And then we bring in Pierce, who is like, oh yeah, we can totally make that part better. And then when you put that all together you get this really incredible scientific system where all of the parts actually work. And, you know, what comes out the other side is some of the more beautiful papers I think I've ever been involved in. What actually got me, in part, to join the team was that you just see these jewel-like figures that come from people obsessed with working in Figma to get all the details in, which is not something that I thought, you know, maybe working in Figma isn't a part of the standard engineering toolkit, but it turns out that that also is a force multiplier. Yeah, one explicit thing that I want to mention, that was kind of baked into those answers, is how the team is structured here. I think a lot of people think that there's a separate research team and a separate engineering team, and throughout the conversation here we've been talking about the interplay between those. Separating those just doesn't work. We don't do that. There aren't these separate researchers who are telling the engineers, build this. These problems are fundamentally entwined, and you have to work on them together. So the way the whole company works, not just the interpretability team, is that the research and the engineering always go together, and that's just absolutely crucial for this job. Adley, if a friend came to you and asked, what was the most fun or weird or quirky thing you got to do?
Yeah, I think that there is a surprising collection of problems that comes in after you have trained 34 million features and now, as silly as it sounds, you want to see what these features do. This is a tricky problem at scale, because these features only activate on very specific sequences of text; that's what the "sparse" in sparse autoencoder means. So if you want to really visualize all of them, you have to run a lot of features through a lot of text, and then do things like, we also want to visualize what this feature does on the nearby text, and what the distribution of this feature looks like, and solve a bunch of problems like that. I believe at this point it's something like a 10- or 12-step, very distributed pipeline, just because this is one of the things that breaks really quickly once you scale up the problem. There are just so many steps that something is always breaking and something different is always becoming the bottleneck, and so it's this process of just looking at this, finding the bottleneck, and trying to distribute that further. Yeah, sometimes even matrix multiplication doesn't work anymore. You realize that you want to understand interactions, this is on my team, between 34 million features here and 34 million features there, and genuinely you could just multiply the matrices, but then you couldn't store the result anywhere or put the result anywhere. And so you're starting to do some fancy looping, indexing, and compression to compute a product. Big numbers times big numbers are very big numbers.
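To make the storage problem concrete: a dense product between two sets of 34 million features has roughly 1.2 × 10^15 entries, which is several petabytes at four bytes each. The speakers do not describe their actual looping and compression scheme, so the following is only a minimal sketch of one common workaround, under the assumption that you only need the few strongest interactions per feature. It computes the product tile by tile and keeps a running top-k per row. The function name `blockwise_topk`, the tile size, and the choice of top-k as the "compression" are all illustrative assumptions, not the team's method.

```python
import numpy as np

def blockwise_topk(a, b, k=3, block=256):
    """For each row of `a`, return the k largest entries of a @ b.T and
    their column indices, without ever holding the full product in memory.

    Hypothetical helper for illustration; assumes k <= b.shape[0].
    """
    n = a.shape[0]
    best_vals = np.full((n, k), -np.inf)
    best_idx = np.zeros((n, k), dtype=np.int64)
    for i in range(0, n, block):
        rows = a[i:i + block]
        vals = np.full((rows.shape[0], k), -np.inf)
        idx = np.zeros((rows.shape[0], k), dtype=np.int64)
        for j in range(0, b.shape[0], block):
            # Only one tile of the product exists at any time.
            tile = rows @ b[j:j + block].T
            tile_idx = np.arange(j, j + tile.shape[1])
            # Merge the running top-k with this tile, then keep the k largest.
            cand_vals = np.concatenate([vals, tile], axis=1)
            cand_idx = np.concatenate(
                [idx, np.broadcast_to(tile_idx, tile.shape)], axis=1
            )
            keep = np.argsort(-cand_vals, axis=1)[:, :k]
            vals = np.take_along_axis(cand_vals, keep, axis=1)
            idx = np.take_along_axis(cand_idx, keep, axis=1)
        best_vals[i:i + block] = vals
        best_idx[i:i + block] = idx
    return best_vals, best_idx
```

With this layout, peak memory scales with one tile plus the n × k result rather than with the full n × n product. At real scale the two loops would presumably be spread across many machines, which is where the communication concerns discussed in this conversation come in.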
One of the things which we hit is that the default PyTorch matmul implementation for certain shapes of matrix multiply is just much slower. So we're profiling jobs, and we look at it, and most of our time is in matrix multiplies, so we think this is great, we're running really fast. But we calculate efficiency numbers, and efficiency is not great. So we then go to someone else at the company who's more of an expert in this narrow area, and he tells us, oh yeah, try this other matrix multiply implementation, it'll be much faster. And we're generally doing that: when we get to the really thorny problems like that, we just ask someone else at the company, because we're not experts at that, but it does matter and we do need to make these things faster. So we were using, unbeknownst to us, a slow version of "multiply these matrices"? Well, it is the default. It is a version that is normally fast, but for the specific shapes of the tensors we were running matrix multiply on, it was not fast, and there are different implementations for that. Under the hood, for matrix multiply, what generally happens is that, based on the shapes of the matrices, there are different ways the GPU kernels actually work. So some implementations pick the wrong approach and are just randomly slower. So we run into problems like that: it's randomly slower, how do we fix this? And you don't have the time to go be an expert in this area; you just need to quickly find something that'll speed it up. I think this is such a fun example, because you would think that matrix multiplication is just heavily optimized, but in a very physical sense our problem was just a weird shape. It was a weirdly shaped matrix. And so we just run into all of these problems, because interpretability research is just doing really weird things like this, and so you run into all of these weird
things that happen. Yeah, thinking distributed is sort of funny for this too. We were doing some attribution calculations where you're just multiplying a vector by a bunch of other vectors, and you have to think carefully about where they are living and which direction you send information, because if you send this over here, you get to send some scalars back, but if you send this over here, it's a matrix that's going back, and all of a sudden you've spent just enormous amounts of time shuttling data back and forth. Whereas, again, I was trained as a mathematician: you write the equation and all of the letters are on the same line, right? There's no communication bottleneck between the A and the v that it's next to. Yeah, I was looking at an open-source implementation of sparse autoencoder training that only runs on a single graphics card, and I was just shocked: this is so simple, this is so easy, why do we have so much code? And then you go through all the various points where we had to scale this up a thousand times bigger, and that is where all this code comes from. There are so many little battles there, of this random thing doesn't scale, or is 2x slower, that we've put in, which we didn't have to do back when we were doing very, very small jobs which just fit on a single graphics card. I think that also speaks to some of the complementarity of the work that can happen in academia or more open-source environments and what you can do at a company with the scaled models, where you can try out a lot of ideas at small scale, and it isn't that hard from an engineering perspective, and then to get that to actually work on models that are many orders of magnitude larger, you're just entering new realms of physical difficulty to get anything off the ground. Sometimes it feels like there's the gift
though, which is in the bitter lesson that Richard Sutton talks about, which is that sometimes the scalable thing is better, because you can always put in more scale if you do the engineering, and you hit the upper limit of being clever. And so even though some of these methods are quite conceptually simple, it's turned out that, on the rich data distributions that actually make up these networks, they show really amazing things. It's really fun. I think that the bitter lesson applies not just to training a model but also to interpretability. I think people often think of interpretability as trying to get this very principled understanding, and there is some of that, but there's a lot of it that really has the same properties as the bitter lesson, where you just take something simple and do it at scale, and you pick the scalable thing. It is really beautiful to me that that works not just for making good models but also for understanding models. The other point I'd make about scaling and the bitter lesson is that the company has given us access to the compute which we need to actually scale this, and it's been really fun that the thing blocking us from scaling further is whether the ML actually works at that scale, or the infrastructure works at that scale. It hasn't been, can we actually get the graphics cards to run on, which would be a much more frustrating reason to not be able to scale. Where do you see interpretability in a year? I think that where I see interpretability in a year, if everything goes well, and this is a super bullish case, is this: we did one slice through the middle layer of Sonnet, and I would want to analyze the entirety of every layer, every piece of all of our production models, and not just analyze them.
Right now we've only found features. We don't know how they fit together, we don't know how they work in a variety of different contexts, and I really want us to do the circuits work to figure out what these features mean on their own and what they mean together, working in concert. Yeah, one thing that I'm just surprisingly excited about is actually continuing to scale this up. There is a lot about what we need to do that is going to need to be different, and there are definitely going to be lots of opportunities to change the way we do things, but at the same time these things seem to work better as you keep scaling them up, and so I'm really excited about just trying to eke out the last few orders of magnitude and see what happens. And if you would like to help us with that, we are hiring; we would love to work with you. Can I just say I love the phrase "the last few orders of magnitude"? There's so much in those few words. So why are we doing interpretability? I think one of the things I want to emphasize here is that I have a lot of uncertainty about the types of safety challenges that are going to arise with large language models, and I'm very uncertain about the direction things will go in the future, but interpretability feels very robust, and I'm very excited to work on this because I think it can help with a really wide range of problems in a really wide range of scenarios. Understanding models just seems good, and if you can do that better, that's probably helpful.
Yeah, understanding models seems good, and if you can do that, it seems like it'll help you with any of the behaviors you might... And maybe that's something I really like about interpretability, or rather the approaches we're taking, which are sort of completionist, right? It's trying to map the full diversity of the model, because if you can do that, you can zoom in to the parts that you need later, whereas if you're just focused on one particular behavior of interest, it might not generalize, or it might be missing the important part of the story. And so you can do interpretability focused on one behavior at a time, but if you want the whole picture, you need to scale, and that's why you need people like the ones at the table, who can make the scaling happen. You're here, you're here. All right, hands in. One, two, three, Claude! One, two, three.
