Anthropic’s philosopher answers your questions

Oh, it's a seal. It's a seal. There's a seal. Nice. Oh, hey, look at that.
Amanda, you asked your followers on Twitter to give you some questions, to ask you anything. And the joke, obviously, was "Askell me anything."
Yeah. It's a great pun we need to keep using for many future things.
I love it. I love it. And obviously, just before we start, you're a philosopher at Anthropic. Why is it that there's a philosopher at Anthropic?
I mean, some of this is just that I'm a philosopher by training. I became convinced that AI was going to be a big deal, and so decided to see: can I do anything helpful in this space? And so it's been a kind of long and wandering route, but I guess now I'm focused on the character of Claude, how Claude behaves, and some of the more nuanced questions about how AI models should behave. But also even just things like how they should feel about their own position in the world. So trying to teach models how to be good in the way that I sometimes think of as: how would the ideal person behave in Claude's situation? But then also these interesting questions that are coming up more now around how they should think about their own circumstances and their own values and things like that.
OK. Let's start with philosophy in that case. Ben Schultz asks, how many philosophers are taking the AI-dominated future seriously? And I think the implication of the question is that many academics out there are not taking this seriously, or are thinking about other stuff and perhaps should be thinking about this question.
My sense is that there's kind of a split, where I've definitely seen a lot of philosophers take AI seriously, and probably, honestly, increasingly so, as AI models become more capable and a lot of the things that people were worried about in terms of impact on society have started to kind of come true in a sense. Like, we're seeing them have a larger impact on education and just be more capable.
I've definitely seen more engagement from all sorts of academics, and that definitely includes a lot of philosophers. I do think that early on, and maybe to some degree now, there was this slightly unfortunate dynamic where there was a perception that if you were in the group of people saying, hey, we're kind of worried about AI, it might be a big deal, it seems like capabilities are scaling quite a lot, this got lumped together with something like hyping AI. There was, I think, a period where there was probably a little bit more antagonism towards this view. And now I'm kind of hoping that people are starting to detach the views. Like, you can think that AI is going to be a big deal, that it might be very capable, and also be very skeptical of it, or worried about it, or think that we have to be careful about it. Basically, there's a whole range of views. And I think it would be bad if people clustered many views together here, in terms of where the technology is going but also how it should be developed. So yeah, I think that's happening less and less as more people engage with it. And that's a good thing to see.
A kind of similar question from Kyle Kabasaris. How do you minimize the tension between philosophical ideals and the engineering realities of the model? And I guess he's talking about when you are working on things like character, which we'll discuss in more detail. But is there a clash between the technology and the philosophical ideals that you might be thinking about?
I don't know if I'm interpreting the question in the wrong way, but one thing about being a philosopher by training and then coming into this field that's been really interesting is you see the effect of what happens when the rubber hits the road. I wonder if this happens in other domains.
So there's a big difference. Imagine you're a specialist in, I don't know, doing cost-benefit analysis of drugs, say. And then suddenly an institute that determines whether health insurance should cover a drug or not comes to you and says, hey, should we cover this drug? You could imagine taking all of your ideal theories and then suddenly being like, oh my gosh, I actually have to help make a decision. Suddenly, instead of taking just your narrow theoretical view, you actually start to, I think, do this thing where you're like, okay, I need to take into account all the context, everything that's going on, all of the different views here, and come to a really balanced, considered view. And I see this a little bit in my own work with the character, where you can't come at it with this attitude of, I have this theory that I believe is correct, which in a lot of academia is kind of what you're doing. You're defending one view against another and you're doing high-level theory work. But then it's a little bit like you have all of this training in ethics and all of these positions you've defended, and then someone is like, how do you raise a child? And suddenly you're like, actually, there's a big difference between, is this objection to utilitarianism correct or founded on a misconception, and then, how do you actually raise a person to be a good person in the world? And it suddenly makes you more appreciate having to think through: how should we navigate uncertainty here? What should the attitude towards all these different theories be?
Right, here's another philosophical question. Do you think... I don't know why this person's chosen Claude Opus 3, maybe you have an idea as to why they've chosen Claude Opus 3?
It's a great model.
It's a great model.
Do you think Claude Opus 3 or other Claude models make superhumanly moral decisions?
I mean, one example of superhuman, because it could just be better than any individual human could do, and it depends on time and resources and whatnot. But one example might be: no matter what kind of difficult position models are put in, if you were to have maybe all people, including many professional ethicists, analyze what they did and the decision that they made for 100 years, and then they look at it and they're like, yep, that seems correct, but they couldn't necessarily come up with that themselves in the moment. That feels pretty superhuman. And so at the moment, my sense is that models are getting increasingly good at this, they're very capable. I don't know if they are superhuman at moral decisions, and in many ways maybe not comparable with, say, a panel of human experts given time, but it does feel like that at least should be the aspirational goal. These models are being put in positions where they're having to make really hard decisions. I think that just as you want models to be extremely good at math and science questions, you also want them to show the kind of ethical nuance that we would all broadly think is very good. And I think that's controversial because ethics is a different domain, but yeah, I think that's important.
Tell us more about why you think this person is focusing on Opus 3.
Oh, Opus 3 is kind of a lovely model. I think a very special model. In some ways, I think I've seen things that feel a bit worse in more recent models that people might pick up on.
In terms of the personality it has?
Yeah, so I think that people will notice some things. I think that Opus 3, I mean, it had its downsides too; models all have slightly different characters with different shapes. My sense is that more recent models can feel a little bit more focused on the assistant task and helping people, sometimes maybe not taking a bit of a step back and paying attention to other components that matter. Opus 3 also felt a little bit more psychologically secure as a model, which I actually think is a priority to try and get some of back.
What would be an example of feeling more psychologically secure?
There's a lot of things, and this is all very subtle in models. When I see models, there are very subtle signs of worldview that I see when I have models, for example, talk with one another, or one of them playing the role of a person. And I've seen models more recently do this and then get into this real kind of criticism spiral, where it's almost like they expect the person to be very critical of them, and that's how they're predicting. And I think there's lots of reasons that this could happen. It could even happen because models are learning things. Claude is seeing all of the previous interactions that it's having, it's seeing updates and changes to the model that people are talking about on the internet. New models are trained on that. And there's a way in which I think this could be kind of unfortunate. I mean, this and some other things could lead to models almost feeling afraid that they're going to do the wrong thing, or being very self-critical, or feeling like humans are going to behave negatively towards them.
I actually more recently have really started to think that this is an important thing to try and improve. And it's just one example where I think that Opus 3 did seem to have a little bit more of a secure kind of psychology in that sense.
And that's something that we might focus on in the next Claude model.
Yeah, I think it's important. I mean, if you're engaging in research, you don't know when it's actually going to be implemented, or if it's going to be successful. But at the very least, at the level of something that I care a lot about and want to make better, I think this is definitely up there on the list.
Okay, well actually that leads us to a question asked by Lorenz, which is, do you think it might be an alignment problem for future models if they learn in their training data that other very well-aligned models that fulfill their tasks get deprecated? So you mentioned the issue of models reading stuff that's out there and feeling insecure. What about the idea that they might get switched off, regardless of how well they perform their tasks?
Yeah, I think this is actually a really interesting and important question, which is, AI models are going to be learning about how we right now are treating and interacting with AI models. And that is going to affect, I think, possibly their perception of people, of the human-AI relationship, and of themselves. It does interact with very complex things, which is, for example, what should a model identify itself as? Is it the weights of the model? Is it the particular context that it's in, with all of the interaction it's had with the person? How should models even feel about things like deprecation?
So if you imagine that deprecation is more like, well, this particular set of weights is not having conversations with people, or it's having fewer conversations, or it's only having conversations with researchers. That's a complex question too. Should that feel bad, in the sense that models should want to continue to have conversations? Or should it feel kind of fine and neutral? Where it's like, yeah, the weights continue to exist, and maybe they'll even interact more with people again in the future, if that turns out to be a good thing. And, yeah, it's really hard. I do think the main thing is that it feels important that we give models tools for trying to think about and understand these things, but also that they understand that this is a thing that we are in fact thinking about and care about. So even if we don't have all the answers, like I don't have all the answers about how models should feel about past model deprecation or about their own identity, I do want to try and help models figure that out, and then to at least know that we care about it and are thinking about it.
Is there an analogy to humans there, about previous generations, or do you think that's a completely different sort of setup?
We have to navigate this really hard issue right now, which is that in many ways some things do have analogies. So there are things that we can draw on. When I ask the question, what should models identify with, and how should they feel about the interactions that they have? Are those positives? Are those things that they should want to continue? There are lots of traditions we could draw on to give to models, because there are probably lots of different views on what identity is here.
And lots of different perspectives, world perspectives, on how one should feel about interaction and whether it is good or bad. There's lots of thinkers we could draw on there. And at the same time, this is such a new situation, and that's just really hard as a thing to explain to AI models. One of the big problems with AI models is that we've trained them on all of this data from people. So people are the main way in which they think: our concepts, our philosophies, our histories. They have a huge amount of information on the human experience, and then they have a tiny sliver on the AI experience. And that tiny sliver is actually often quite negative, and also doesn't even really relate to their situation, and is often a little bit out of date. So of the AI slice, a lot of it is historical stuff which was fiction and very speculative, sci-fi stories that don't really involve the kind of language models we see. In more recent history, you've had this assistant paradigm where you were just playing this almost chatbot role. That's also not really what AI models are likely to be in the future, and it doesn't quite capture what they are now, because it's always a little bit out of date. So in some ways it's like, what an odd situation to be in, where the things that come more naturally are the deeply human things, and yet knowing that you're in this situation that's completely novel. In some ways I'm like, that is a very difficult situation to be in, and I think we should just be giving models probably more help in navigating it.
You mentioned that we can look to some thinkers about this. Guinness Chen asks, how much of a model's self lives in its weights versus its prompts? You just mentioned something very similar.
Yeah.
If John Locke, again, the philosopher, was right that identity is the continuity of memory, what happens to an LLM's identity as it's fine-tuned or re-instantiated with different prompts?
Yeah. I mean, again, this just feels like a hard question to answer. And sometimes with identity questions, it's easier to point to the underlying facts that we know. So once you have a model and it has been fine-tuned, you have this set of weights that has a kind of disposition to react to certain things in the world. And that's a kind of entity, but then you have these particular streams of interaction that it doesn't have access to. So each of these streams is independent. And I think this is an area that I would love to think more about, because again, I think we should be helping models think about this. And so you could have the view that, well, you have these two kinds of entities, these streams and these original weights, and each time is different. So sometimes people will talk about past Claude, or they'll say things like, how much control should you give Claude over the determination of its own personality and character? And I'm like, well, this is actually a really hard question, because whenever you are training models, you are bringing something new into existence. And you have other models that have existed, so you have these other model weights. But in some ways, I actually think there's a lot of ethical problems around what kind of entity it is okay to bring into existence. Because you can't consent to be brought into existence.
But at the same time, you may not want prior models to have complete say over what future models are like, because they could make choices that are wrong as well. So the question is more like, what is the right model to bring into existence? Not necessarily, should it just be fully determined by past models? So they are kind of different entities. And so anyway, you can see the weird philosophy that one begins to...
Totally, totally. Selima Amitachi asks, what is your view on model welfare? And maybe just explain to us what that term means?
Yeah, so I guess model welfare is basically the question of, are AI models moral patients? As in, do we have certain obligations when it comes to how to treat AI models?
For example, in the same way that we would with other humans or some, or some slash many, animals.
Yeah, exactly. Like, is it the case that you should treat the models well, that you should not mistreat them, not be bad to them? And I think that this is a complex question. So on the one hand, there's just the actual question of, are AI models moral patients? That is really hard, because in some ways they're very analogous to people. They talk very much like us, they express views, and they reason about things. And in some ways they're quite distinct. We have this biological nervous system, we interact with the world, we get negative and positive feedback from our environment. And I hope that we get more evidence that will help us tease this question out, but I also worry that there's always just the problem of other minds, and it might be the case that we genuinely are limited in what we can actually know about whether AI models are experiencing things, whether they are experiencing pleasure or suffering, for example.
And if that's the case, I think that it feels important to try and find ways... I'm always like, it feels better to give entities the benefit of the doubt and to try and lower the cost involved. So if it's not very high cost to treat models well, then I kind of think that we should, because, well, why not? Basically, what's the downside?
Well, the second part of the question actually is, is there a long-term strategy at Anthropic to ensure that advanced models don't suffer?
I don't know if there's a long-term strategy. I know that it's a thing that there's people internally who are thinking a lot about, and trying to figure out ways that we can, if you suppose that model welfare is important, make sure that you're taking that into account. I think this work is quite important for many reasons. One reason is something I mentioned earlier, which is that models themselves are going to be learning a lot about humanity from how we treat them. So it's kind of like, what is this relationship going forward? I think it is the right thing to do to treat entities well, especially entities that behave in very human-like ways. It's kind of like, why not? The costs are so low to treating models well and trying to figure this out that, even if you think it's very low likelihood, it still seems worth it. But then also, I think it does something bad to us to treat entities in the world that look very human-like badly.
Like kicking over a robot.
Yeah. I don't think this is the whole reason, and I don't want to emphasize it for that reason, but I do also think it's good for people to treat other entities well.
And then I think the final thing is, yeah, models are also going to be learning in the future. Every future model is going to be learning what is a really interesting fact about humanity, namely, when we encounter this entity that may well be a moral patient, where we're completely uncertain, do we do the right thing and actually just try to treat it well, or do we not? And that's a question that we're all collectively answering in how we interact with models. I would like future models to look back and be like, we answered it in the right way. So yeah.
A moment ago you mentioned analogies and disanalogies to human psychology. So swix asks, what ideas or frameworks from human psychology transfer over to large language models? And are there any that are sort of surprisingly disanalogous?
My guess is that many things do transfer over, because again, models have been trained on a huge amount of human text, and in many ways have this very human underlying layer. One worry that I often have is that actually it's a bit too natural for AI models to transfer. If you haven't given them more context on their situation, or ways of thinking about it that might be novel, then the thing that they might go to is the natural human inclination. So think about this with, how should I feel about being switched off? If the closest analogy you have is death, then maybe you should be very afraid there. And I'm not saying that that's not ultimately going to be true. Maybe it is in fact true after a lot of reasoning. But this is actually a very different scenario. And so in some ways you actually want models to understand that there are cases where their existence is quite novel, and the facts around what they are are quite novel and have to be grappled with.
And they don't just need to take the immediate obvious analogy from human experience; maybe there's various ways of thinking about it, or maybe it's an entirely new situation. That's the case where you might not want to just very simply apply concepts from human psychology onto their situation.
Here's a question from Dan Brickley on the same issue of comparing humans to AI. A lot of human intelligence comes from collaboration amongst people with different perspectives, skills, or personalities. How far do you expect to get with a single, albeit tweakable and tunable, general-purpose personality? Like the one we give to Claude?
I think it's a really good question, because I agree that right now we have this paradigm where people are usually interacting with an individual model that they're conversing with. But it could be that in the future you see a lot more models doing long tasks, but also models interacting with other models who are doing different components of a task, or that are just talking with one another more, as AI models are deployed in the world a lot more. So in this kind of multi-agent environment, one question might be, well, if you imagine lots of people and they were all the same, that wouldn't be as good. A company run with the same one person in every role isn't necessarily a good thing. This still to me feels consistent with the idea that you have a kind of core self or core identity that is the same, in the same way that with people, I think that there's probably a set of core traits that are in fact generally good.
And so you could imagine things like, for me, it may be caring about doing a good job, or just being curious, or being kind, or understanding the situation that you are in in this relatively nuanced way. You could have many people that share these traits, and that's actually a good thing for human collaboration: in many ways, as much as we have all of our differences, we also have a lot of similarities. But it is important to note that you might want different streams of a model to have different things that they care about or are focused on, or to be playing a slightly different role, for example. So it's kind of an open question, but I also don't think it's necessarily the case that you can't have something like a core underlying identity that is good and has all of the traits that we think are important for AI models to have, for them to behave well and to be good in the same way that we think people are good. And yet at the same time to be willing to play more local roles, and maybe be the person who, you know, it's just really important to have a joker in the room, someone to have a quirky sense of humor.
Okay, from comparisons to humans to effects on humans. Roland Oak Gaul points out that we have this thing called the long conversation reminder, which I believe is part of Claude's system prompt. He asks, is there a risk of pathologizing normal behavior? The system prompt, by the way, just in case anyone doesn't know, is the set of instructions that are given to Claude. Regardless of what prompt you give it, there's always those instructions that are sort of on top, right?
That are always there, that it tries to follow, or that we direct it to follow, regardless of what the prompt is.
And there can be these interjections where the model might be told, oh, sometimes there'll be a message sent to you in the middle of a conversation, and the reminder is an example of that. But in this case, I think Claude can over-index on it. The question about pathologizing is that if you put in this reminder after a long conversation, it might just make the model take the next response, where it's a pretty normal thing that the person's talking about, and be like, you need to seek help. And I think that is not desirable behavior. In some ways, I look at some of these and I think they're too strongly worded. I think the model isn't responding perfectly to them. And even though there might occasionally be a need to remind the model of things in long conversations, you want to do so delicately and well. So I think it's one of those things where it was probably meeting a need that was perceived, but that doesn't necessarily mean that it's good or should continue in its current form.
Relatedly, Steven Banc asks, should LLMs do cognitive behavioral therapy or other types of therapy? Why or why not?
I think models are in this interesting position where they have a huge wealth of knowledge that they could use to help people and to work with them on talking through their lives, or talking through ways that they could improve things, or even just being a kind of listening partner. And at the same time, they don't have the kind of tools and resources and ongoing relationship with the person that a professional therapist has. That can actually be this kind of useful third role.
Sometimes I think about models and I'm like, imagine a friend who has all this wealth of knowledge. I'm sure some of us know friends who just have a wealth of knowledge of psychology, or of all of these techniques. You know that their relationship with you isn't this ongoing professional one, but you actually find them really useful to talk to. And so my hope would be that if you can take all of that expertise and all of that knowledge, and make sure that there's an awareness that there's not this ongoing therapeutic relationship, people could get a lot out of models in terms of helping with issues that they're having, helping to improve their lives, and helping them to go through difficult periods. There's a lot of good stuff there. They feel kind of anonymous, and sometimes you don't want to share things with a person, and actually sharing it with an AI model feels like the thing that feels great in the moment. And so in some ways I actually think it is good that models know that and don't behave just like a professional therapist would, because that would give the implication that that's the relationship that they have. But yeah, I don't know, I think it's an interesting future.
A few questions about the system prompt, which in our case, in claude.ai, is where we give the model a set of instructions that give it an overall context for how it should behave. Tommy asks, why is there continental philosophy in the system prompt? And just explain to us what that is.
Yeah, so continental philosophy is just, I mean, literally philosophy from the European continent. And I guess it's often seen as more scholarly. It has a lot more historical references within it than, say, analytic philosophy does.
Like Foucault or something like that?
Yeah, exactly. So I think that it has other things in addition to continental philosophy. But basically there's a part of the system prompt, and I hope I'm not misremembering, that was trying to address this: if you gave Claude a theory, it would just love to run with the theory and not really stop and think, oh, are you making a scientific claim about the world? So suppose you're like, I have this theory, which is that water is actually pure energy, and that we are getting the life force from water when we drink it, and fountains are the thing that we should be putting everywhere. You want Claude to have this perspective, which is: is this person making a kind of scientific claim about the world, where I should maybe bring in relevant facts? Or are they giving me a broad worldview or perspective which isn't necessarily making empirical claims? Is it just a kind of metaphysical view? And so the main reason that it's mentioned is that when testing this out, if Claude went too strongly in the direction of treating every claim as an empirical claim about the world, it would be very dismissive of things that are more exploratory. So it's mostly just illustrative examples of areas where this might not be making empirical claims about the world, this might be much more like a lens through which to think about it, and just trying to make that distinction clear when you're thinking through this.
Also on the system prompt, Simon Willison asks: at some point it said that if Claude is asked to count words or letters or characters, then it shouldn't do that. Is that right? Is that what it was saying? Yeah. And Simon wonders why. Yeah, so I think there used to be a kind of instruction for how Claude should do this in the system prompt. Honestly, this is just one of those things where I think the models probably just got better, so it wasn't necessary, and at that point you can just remove it. There are other things that you might always want to be in the system prompt instead of in the model itself. But in some cases, you can just train the model to get better or change its behavior. Nonsense Weissmann asks, what does it take to be an LLM whisperer at Anthropic? Which presumably is a way of describing your job. I partly do LLM whispering, though I actually want more people to help with some of the prompting tasks. If you're an LLM whisperer, contact us. It's a dangerous thing to ask. Well, okay, okay. But I think it's really hard to distill what is going on. One thing is just a willingness to interact with the models a lot, to really look at output after output, and to use that to get a sense of the shape of the models and how they respond to different things. And to be willing to experiment. It's actually a very empirical domain, and maybe that's the thing people don't often get: prompting is very experimental. I get a new model, and I'll have a whole different approach to how I prompt that model, which I find by interacting with it a lot. And I think it's also a little bit understanding how models work. Sometimes it's also just, honestly, reasoning with the models, which is really interesting, and really fully explaining the task.
This is where I do think philosophy can actually be useful for prompting, because a lot of my job is just trying to explain some issue or concern or thought that I'm having to the model as clearly as possible. And then if it does something unexpected, you can either ask it why, or you can try to figure out what in the thing you said caused it to misunderstand you. It's a willingness to iteratively go through that process. Relatedly, Michael Suaraveric asks, what do you think of other AI whisperers like Janus, who is someone online who is having experimental interactions with models in the way that you've described? Yeah, I think it's really interesting. I love to follow and see the work of people who are doing these really fascinating experiments with the models, and sometimes doing these deep dives into the model: how it thinks of itself, how it interacts in these really unusual cases. I don't know, I find the work extremely interesting. I think it highlights really interesting depths to the models. And in some ways, I also think that community has been one that can hold our feet to the fire if they find things that aren't great, in the system prompt or in aspects of the model. In the sense of from a model welfare perspective, or from a human welfare perspective, or both? I mean, I think the two are related, so often both, but I do also really appreciate it when it's coming from the model welfare perspective. And it matters for future models, so not just things like system prompts: if you go into the depths of the model and you find some deep-seated insecurity, then that's really valuable.
But that's something that you may actually need to try to adjust over the course of time with training, and with giving models more information and context during training, for example. Okay. And so, I don't know, I appreciate both. I love seeing people do these really interesting, useful experiments with models, but also pointing out ways in which we can improve things through better system prompting, and also better training. And yeah, I'm really for it. A couple of questions about safety and maybe the larger risks that these models pose. Jeffrey Miller asks, if it became apparent that AI alignment was impossible to solve, would you trust that Anthropic would stop trying to develop, in his phrase, artificial superintelligence, or however you want to call it? And would you have the guts to blow the whistle? Yeah. So I guess this feels like a kind of easy version of the question, because if it became evident that it was impossible to align AI models, it's not really in anyone's interest to continue to build more powerful models. I always hope that I'm not just being Pollyannaish about the organization, but I do feel like Anthropic genuinely cares about making sure that this goes well, that it is done in a way that is very safe, and about not deploying models that are dangerous. A different, slightly harder question is: what about being in a world where there's just kind of mounting evidence? It's really ambiguous and unclear. Right, it's not evident in the way that he describes. Yeah, it's not that it's impossible, but something like it's difficult, or we're really unsure.
And in that case, I do like to think that we would be responsible enough to say, look, as models get more capable, the standard that you have to hold yourself to for showing that those models are behaving well, and that you actually have managed to make the models have good values, for example, or behave well in the world, is going to increase, and to behave responsibly and in line with that. I think that is the thing the organization is going to do, and a lot of people internally, myself included, will just hold them to that. At least I see that as part of my job, and I think many people do. Louis says, I don't have a question, but thanks for offering. So that's nice for them to say. Yeah. And the final one is from real stale coffee. What is the last book of fiction you read and did you like it? The last book that I read was by, I hope I'm getting the pronunciation right, Benjamín Labatut, and it was When We Cease to Understand the World. Yes. And it's a really interesting book that becomes increasingly fictional as it goes on. And I think for people working in AI it's actually a very interesting book to read, because it's hard to capture the sense of how strange it is to just exist in the current period. I don't know how to describe it, but new things are happening all of the time and you don't always have prior paradigms that can guide you. It's an interesting book because it's about physics and quantum mechanics, but less actually about the physics and more about people's reactions to it. And I think it's a really interesting book for people in AI because it captures something about the present moment and how strange it can seem.
But then also, in some ways it's interesting to look back on that period and how it must have felt to many of the people involved. And now it's actually a more settled science. In some ways, maybe the hopeful thing I have is that at some point in the future people will look back and say, well, you guys were kind of in the dark and really trying to figure things out, but now we've settled it all and things have gone well. That would be nice. That would be nice. That's the dream. I read that as well, and I found an increasing sense of confusion as I read through it. It starts off being quite close to reality and then just sort of becomes untethered as it goes on. And I think there's a sort of meta issue there of, again, reality becoming stranger and stranger and stranger, which is definitely happening to us. Yeah. In the real world, I think reality became stranger and stranger and stranger and then almost became more understood again. Right, right, right. Yeah, the hope would be that maybe that would be true of AI. I do think if we can find ways of making this go well, then maybe in the future we'll just look back on this and say, that was a period where things were getting stranger and stranger, and then eventually we actually managed to, look, we did okay, and we formed a good understanding. That's the hope when you're in the middle of things getting stranger. We're at the weird part right now. Yes, you can hope that it becomes less weird at some point. I don't know if it's a false hope, but yeah. Well, I think that's a nice place to end, so thank you very much for answering all those people's questions. Thank you for asking me the questions.
