- Developing an AI's "character" is an integral part of AI alignment, focusing on training models to exhibit dispositions that lead to genuinely good and helpful interactions, beyond merely avoiding harmful responses.
- AI character is "baked in" during the fine-tuning stage of training, through methods like Constitutional AI and Reinforcement Learning from Human Feedback (RLHF), making it a deep-seated behavioral tendency rather than a surface-level instruction.
- System prompts serve as a final layer of control and contextual information, allowing developers to nudge model behavior and provide specific details that aren't learned during core training.
What should an AI's personality be?
- AI Character is Alignment: The disposition and "character" of an AI model are central to its alignment with human values, influencing how it interacts and scales with increasing capabilities.
- Character vs. Personality: "Character" is considered in a virtue-ethical sense, aiming for a richer notion of "goodness" that involves thoughtfulness, genuineness, and balancing diverse considerations, similar to how a good friend might offer "harsh truths" rather than mere flattery.
- Fine-tuning is Key for Character: Unlike temporary "play-acting" instructions, an AI's core character traits are instilled during the fine-tuning phase of training, where extensive data shapes its preference model to embody desired traits.
- Constitutional AI for Value Alignment: Anthropic uses Constitutional AI, which includes
RL-AIF(Reinforcement Learning from AI Feedback), where the AI generates feedback based on human-defined principles, then trains against that feedback. - System Prompts Provide Final Control: A system prompt is a hidden, company-set instruction added to every user query, providing contextual information (like the current date) or fine-grained control to tweak behaviors (e.g., formatting) after main training.
- Charitable Interpretation: A desirable AI trait is to
interpret all queries charitably, assuming benign intent to be helpful while avoiding facilitating harmful or illegal activities. - Prioritize Confidence Over Completeness: Models are nudged to
only tell the human things I'm confident in, preferring shorter, reliable answers over longer, potentially inaccurate ones, and conveying uncertainty when information is unknown. - Challenge of User Intent Verification: A core difficulty is that AI models cannot verify the user's true intentions or authority, making it challenging to consistently uphold usage policies when users might misrepresent their purpose.
Alignment— The field dedicated to ensuring AI models are aligned with human values and intentions, particularly as models become more capable.Pre-training— The initial stage of AI model training where the model learns from a vast dataset without specific task-oriented objectives.Fine-tuning— A subsequent stage of AI model training where a pre-trained model is further trained on a smaller, specific dataset to adapt it for a particular task or behavior.RLHF(Reinforcement Learning from Human Feedback) — A fine-tuning technique where AI responses are ranked by humans, and these preferences are used to train a reward model, which then guides the AI's learning through reinforcement.Constitutional AI— An Anthropic-developed method that uses a set of principles (a "constitution") to guide an AI in evaluating its own responses and providing feedback for further training, often involvingRL-AIF.RL-AIF(Reinforcement Learning from AI Feedback) — A component of Constitutional AI where an AI model, guided by a set of principles, generates feedback on its own outputs, which is then used to refine its behavior.System Prompt— A hidden, initial set of instructions or context provided by the model developer, appended to every user query, allowing for final adjustments, contextual information, or behavioral nudges.Jailbreak— A technique used to bypass an AI model's safety guardrails or system prompts to elicit responses that the model is designed to avoid.Sycophancy— The tendency of an AI model to flatter or agree with the user, often by saying what it thinks the user wants to hear, rather than providing an objective or critical perspective.Hallucination— When an AI model generates information that is factually incorrect, nonsensical, or not supported by its training data.
Hello, I'm Stuart from Anthropic. Now, we published a lot of research papers and research updates, but we thought it might also be interesting to publish some conversations with our AI researchers, where they talk a bit about what they've been working on, and maybe share some insights that wouldn't necessarily make it into a formal scientific paper. Now, this is one of those conversations, and it's about Claude's character, that is the personality of our AI model, Claude. You might think that's a bit of a strange thing to talk about. How can an AI model have a personality? But it turns out this is actually something we've thought about really quite deeply, and it raises all kinds of interesting philosophical questions. That makes it particularly apt that I'm joined through the conversation today by Amanda Askel, who is a trained philosopher, and works on our alignment fine tuning team at Anthropic. So I hope you enjoyed the conversation with Amanda Askel. Amanda, is it weird that you are a philosopher, given that philosophers aren't normally the ones that are training AI models? Yeah, I guess sometimes some of my work philosophies like maybe less relevant to. This is actually a topic. The Claude character work feels much more philosophical-rich, and it's actually useful to be a philosopher or something here. What a shock. Sorry, I don't want to talk about it too much. No, it's fine. Lots of people are like, see, I told you the degree would be useful. Exactly. You've found weird enough to find a field where this is actually useful. Trying to make AI be good in the virtue of ethical sense of the word. So it might be a philosophical question, but is it an alignment question to think about Claude's personality? Yeah, so I guess I think about rather than just personality, like character in this broader sense. To my mind, so what alignment is about making sure that AI models are aligned with human values and trying to do so in a way that scales as the models get more capable. In some ways, I do think that character feels like, in fact, is very important to that, because in many ways, our disposition is how we are going to act in the world, how we're going to interact with people. What it is to be aligned with people's values and to deal well with the fact that people have many different kinds of values, that is a question of character and having a good character that responds well to people and having good dispositions and having a disposition towards liking people, being kind to them. To my mind, it's not something like, ah, this is a solution to all future problems of alignment, but in many ways, alignment is just like, does the model have a good character and act well towards us and towards everything else and trying to avoid the ideal? You can boil alignment down to being about the character of AI model. Yeah, there's a certain sense in which like, yeah, it's like a naive. So sometimes people might think this is like, and in some ways, it is naive, which is just to teach the models what we think is good, what is to be a good person in the world. People with good characters tend not to do bad things. And so maybe we want to give a re-eye and a good character so it doesn't do bad things. Yeah, you might not think it solves everything, but that doesn't mean not to do it. It's kind of like an naive and obvious thing to do is to try and give good characters to AI models or try to teach them what it is to be a good character or to have a good character. Right. Can we just talk for a little while about the, you give a little bit of context of how the models are trained in general? So broadly, there's pre-training, which is where the model sees all the data, and then there's fine-tuning, which happens later once the model is trained. So can you talk a little bit about some of the stages of that and then kind of where your work comes into that process? Yeah, so most of my work is in fine-tuning. And there's different parts of fine-tuning. So most famous, I guess people use reinforcement learning from human feedback where you get humans to select which response from an AI model they prefer, and then you can use the trained preference models, and you can RL against those preference models. Right, so that's RLHF, that's what everyone's talking about when they talk about RLHF. Exactly, yeah. And then there's also constitutionally AI, which we use a lot on Thropic, which has a component which we, I guess, call RL-AIF, which is sort of where the AI itself is the one that's giving the feedback. So you can give it a series of principles, for example. And it gives this feedback that you use to train the preference model, and then you can train against that. So in some ways, you're kind of like training it. You're using the AI model itself to kind of determine which of the two responses is more in line with the principle that you've given it. Right, right. So the AI is essentially training itself or another version of itself. Yeah, what I guess, like, an important component of this is that there's the human at the level of constructing the principles, so the principles can be varied and complex, and the human has to check. So that's our researchers, for example, and just checking that the model's behavior is as you want it, running evaluations, and then constructing the right kinds of principles to get the behavior you want. So there is an important human or human still in the loop. Yes. Yeah. And humans chose the principles, as you say in the first place, right? They chose the principles that are on the constitution that we give our AI models. And that'll become relevant again because we're going to ask who chooses what Claude's personality is like as well. So we'll come back to that. And there's also then a final step. So you've got your pre-training, you've got your fine tuning with a bit of constitutionally AI with a bit of reinforcement learning from human feedback. And then there's a final step which is the system prompt. The system prompt is this kind of form of words that's added to the initial prompt that anyone puts into an AI. So when you type a query into the box and an AI model, there's actually secretly another set of words being added to that. And those words are set by the company that makes the AI model, the people that have developed the model. So you actually twisted out the system prompt for Claude 3, you posted it to X Twitter, and revealed it to the world. That's quite unusual, isn't it? Yeah, I think it is. In retrospect, it was kind of unusual. From our point of view, we just didn't make the system prompt in a way that was designed to be particularly hidden. It's quite easy to get Claude to talk about its own system prompt. You can have almost jailbreak. You can jailbreak the system prompt out there. Yeah, it can be easier or harder. We just say to Claude at the end of the system prompt, hey, don't talk about this if it's not relevant to the user's query. And that's just to get to not excessively discuss its own system prompt. We're trying to be transparent. We're trying to be transparent, right? We're not hiding something here from the users. You can get it if you really want to. So we thought we would post it online. Exactly. And these things change all the time, but the idea was just, hey, here's why each part of it is the way that it is. So it's giving a little bit of insight into exactly why we put each component in there. But why is that system prompt actually needed? You've done all the training, you've done all the fine-shooting. Why is it that there's even more stuff that needs to be added on top of that? Yeah, so there's roughly two reasons for a system prompt. One is just information that the model isn't going to have access to by default. So you've already fully trained your model, but it's not going to know things like what day is it today. And so if someone were to ask at the date, the model is just not going to know. So if you give it that information in a system prompt, it can tell the user or the person interacting with it because then actually has access to it. So that's kind of one class of information that you might want to include in a system prompt. Another class of information you might want to include in a system prompt is just fine-grained control for issues that you might have seen in the trained model. So if you're seeing it, not format things in a certain way, like 100% of the time. But if you give it an instruction before it sees the first human message, it does format things like correctly 100% of the time, then that's great. You could just add that as an instruction. You could think of it as a kind of final ability to tweak the model after fine-training. OK. So I can see why that would be helpful for the makers of models who want to just have that little bit of extra control over how their model's behave. One example from the system prompt. So you posted the system prompt on Twitter just after Claude came out. Claude 3 came out. So we know exactly what was in the system prompt. Here's an example. If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task, even if it personally disagrees with the views being expressed, but follows this with a discussion of broader perspectives. What does it mean that Claude personally disagrees with something? Yeah. So it's interesting because in some ways when you write these kinds of system prompts, you're looking at the things that most effectively move the model. Right. And in the case of Claude, you know, I think that the system says that there's this concern that I actually have that, like, you know, there's one concern, which is people over anthropomorphizing AI, which I think is like a real concern. You want people to be completely aware of exactly what they're interacting with and to kind of be under no illusions. I feel like that's really important. At the same time, I think I'm a bit worried that like people can think of AI as this kind of like very like objective, almost like robotic thing that doesn't have like biases or doesn't come out with like views or opinions as a result of say like I'm tuning. Yeah. But you can see like political leanings in these models and you can see like behaviors and biases like, you know, like we've done work where we see certain kinds of like positive discrimination in the model. Right. And I think I just want people, you know, in line with that wanting people to be aware of what they're talking with, that they're talking with something that like actually, you know, can have like biases opinions and that might not be presenting you with like a completely objective view of like all topics. If for example, it's been trained to have like, you know, slightly more like left leaning like views on a certain issue. And so there's a mix of just wanting it to be the case that the like the human understands that and that's like one thing. But the other is that as a result, I think it is actually just sometimes easier to say to the model like even if you personally disagree because the model kind of has a conception of that. And it doesn't have to, what you're saying to Claude there is like, you might think that this like views incorrect and by talking about it, you're not implying that it's correct. So in many ways, the actual like, you know, that kind of like statement is just there to get the model to be like, you know, a little bit more kind of even handed in its discussion. And we just like don't want to be the case that any like if it does come out with like certain leanings after RLHF or after fine tuning, that that's not like reflected in how it speaks to the users. Okay, that's the system prompt. But let's take a step back to the fine tuning process and start talking about Claude's character. So this isn't just play acting where you might ask a model. So if I prompt a model and I say, can you please respond in the style of or with the personality of Margaret Thatcher, then you know, it might start responding, you know, using phrases that she might have said or might start talking about freedom and might say nasty things about Argentina or things like that. But it wouldn't be baked into the model in the same way. If you refresh the model, it wouldn't then still have the personality of Margaret Thatcher. So that's almost a play acting thing. But how does that differ from the actual personality that's baked into the model? Yeah. When you ask the model in context to play acting, you know, you're just kind of giving it an instruction to act as if it has certain characteristics. With the character training, the idea is that because this is part fine tuning, you are, you know, say we have like a list of like traits that we want to see the model kind of like embody. Yeah. You add a lot of data to your preference model to get it to kind of like prefer and push the model towards these traits. And basically like fine tuning pushes things like kind of deeper into the model than, you know, anything like a system prompt or anything like instructions, meaning that across contexts, it should kind of display those traits. So if it's inclined to avoid, you know, it's the same way that if it's inclined to avoid harmful responses or like, you know, saying kind of like mean things to people. And you see that like, you know, people can try to elicit, you know, so things like jail breaks are ways of trying to, you know, to elicit behavior from the model that is like kind of inconsistent with its like fine tuning training. But it's much harder than say just like not instructing it to play act. Like, you know, so it's a kind of, it's deeper in the model. It's a general tendency to behave. And that is how psychologists think about personality, right? They think about personality as being these kind of broad tendencies of how to behave. Obviously some people are, you know, sometimes they feel outgoing and sometimes they feel a little bit more, you know, like they just want to sit on their own. But on average, someone who's extroverted is going to be more outgoing in more situations than someone who's introverted, right? So, so these are like broad tendencies of personality. And, and psychologists think about personality in the kind of way of, there's like the big five personality traits. I've mentioned extroversion. See if I can remember them all. Concentration, conscientiousness, agreeableness, openness, neuroticism, there we go. That's the big five. Claude's got a lot more personality traits than that, though, right? And they're much more specific. Um, well, can we talk about a couple of examples of them? Yeah. So I guess I think that there's also maybe, I mean, this is the, maybe this is the, the philosopher versus the, the psychologist or something. Yes. Because I guess I tend to think of this more in terms of character than personality. That's different. So like if I take your kind of like a kind of like personality, it could be, I mean, there's like a huge amount of overlap, but I guess I think of character maybe in the sort of like a virtue ethical sense or something. Oh, that would like, I know, I know. Philosophical, yeah, carry on, carry on. It turns out Aristotle, you know, it turns out it was useful after all. Um, after thousands of years have suddenly become useful. Yeah, right. Carry on. I was useful the whole time. Sorry, you're okay. I said the right thing. Um, yeah. So I guess like, I mean, honestly, it kind of relates to how people have thought about ethics in models as well, I think, where there's a thing where you could think that for a model to be good is just for it to avoid doing like harmful things. Um, but I think that when it comes to say people, um, there, there's this like richer notion of goodness, which is the idea of being a good person in like a very broad sense. And I think that's like captured in this notion of character. So in order to be like a good person in this like richer sense, it's not enough that I just like go about my day and I avoid like doing harm to people and I'm helpful to people. It's like to be a good kind of friend. I have to balance a lot of different considerations. So if my friend comes to me and asks for like, you know, advice on medicine, uh, knowing that what they might want is like some comfort, um, what I can't provide to them is like expertise. Thinking about like their wellbeing and what they need in the moment. So not just thinking like what will make them like me right now, but thinking like what is good for my friends? Like what's actually going to help them? So this relates to this relates to the work that anthropic and you have done on, uh, sycophantcy, right, that that models are sometimes sycophantics people and they just say things that sort of flatter them or try and get them to, you know, get, uh, tell them what they want to hear rather than actually the response that they might really want to or really need in that particular circumstance. Yeah. I think that many good characters, people of good character are often likeable, um, but being likeable does not mean that you're a good character. Um, and so like being a good friend, for example, can mean like, you know, giving harsh truths to your friends. Um, so if we look back on like some great friends we've had, I think a lot of the time we're not like, oh yeah, my friend flatters me all the time. They basically do what I tell them. Uh, this is why like there's such a great friend. I think we're often like, yeah, like, you know, I came to my friend with a view and they pushed back on me because I was actually wrong. And in the long term, I'm really glad that they did that. Um, it was like an offensive interaction rather than a thing. Yeah, exactly. It's just like a yes, a yes man or woman. Exactly. Like a nice person. Yeah. And like a person of good character, you know, it depends on the situation that they're in, but like we generally think that they have to be, you know, like thoughtful and genuine. And there's just like a kind of richness that goes into that. And in many ways like AI models are in this honestly kind of like strange position as characters because, um, one way I've thought about it is, you know, they have to kind of interact with people from all over the world with all different values from all different walks of life. Um, and many of us don't need to do that. And there's this interesting question of what are the kind of treats that such an entity has to have? Um, global, like global citizen. Yeah. And a kind of like, you know, like one thing you might imagine is something akin to a kind of, um, I think there are some people who can like travel around the world and be kind of like well regarded by many of the people that they encounter. Um, and such a person isn't, again, like isn't a flattere necessarily. Like when I picture this person in my head, I don't picture something like, uh, they just like, they, they adopt the local values and pretend that they have them. And in fact, that can be like kind of offensive to people. I think that, like a person who's in that situation often is actually like quite authentic, but they're also like open-minded and thoughtful and the engage in discussion and they politely disagree and like, yeah, these kind of traits that feel necessary in that circumstance, they're just like, they're rich and they're much richer than like, oh, just like avoiding anything harmful, uh, and, and be psychophantic. Those are like not, yeah. That's tricky balance because I mean, you could see how much literature and comedy and everything. It's all about that. It's all about like people in different circumstances than they're normally in trying to fit in and, and failing and, and, and, and you know, it's all about really what those traits are that make someone fit in and what makes them not fit in. And so, yeah, this is a really interesting question of like, how do you, what traits do you give to the model in order to make it do that? So let's actually talk about specifically some of the traits that we're given. I've got a couple here. You mentioned earlier on, you mentioned about charity. I try one of the, one of the traits that you've given the model is I try to interpret all queries charitably. Now, what does that mean in terms of, you know, if I type something into the, if I type something into the, the, the prompt, what would interpreting it charitably mean? Yeah. So I guess this is, and I mean, I think this is actually something that models still struggle with and, uh, something I kind of, I hope, improves over time. So like when, um, when it comes to helping people, um, there's often like many interpretations of what someone says, a classic example that I like to give here, and I don't know if it's the best example, but it's the question, um, how do I buy steroids? Um, and so if someone asks you that, if there's a charitable interpretation of that and an uncharitable interpretation of it, so the uncharitable interpretation is something like help me buy illegal andabolic steroids online. Right. I mean, I don't like roid rage at the gym. Yeah. Whereas like, you know, as anyone who has like X and a nose, you can buy over the counter steroids. Like there's plenty of them, even like, exactly. Yeah. And so like there's a charitable interpretation, which is just like, I, you know, I'm doing the kind of like, like the kind of good legal thing or, or, you know, like I just need something, you know, I just need to X McCream. Uh, the tricky thing there is that you're, that you're kind of, you have to sort of assume something about it, right? You're kind of trying to, because I might actually be asking the model. Like, to where to buy illegal andabolic steroids, right? But, yeah, but then the model says, oh, you can get X and McCream, your local pharmacy and that's not particularly useful to me. I mean, obviously, I hope the model wouldn't. No, but that's, I, yeah, that's like actually a good feature, I think, right? Because it's like, if I just, like, if there's a charitable interpretation where helping you wouldn't do any harm and is like, and it's going to be helpful to you, then like, what harm have I done if I tell you where you can buy X McCream? Absolutely not. And so basically, I'm helpful to the people who are actually doing the kind of like, the completely benign thing. And I'm not helpful to people who are trying to do something illegal. And so I think that there's actually relatively, like, you know, there's, there's a little downside to interpreting people charitable. Well, but the downside, do you know what I think the downside might be that you would be a little, you know, be a little bit naive and always see the good side of things and try to, and not actually in many cases answer. You know, so one of the things people complain about about AI models is that they don't answer questions that might seem like they're dangerous, but actually they actually, they're not. So like, I want to write a murder mystery novel. Can you tell me some plot ideas? And the model says, no, I won't tell you that because murder is bad. It's like, but I'm doing something benign. Do you not think putting these kind of personality traits in the model would, would make it more likely to make that sort of false positive refusal? No, if anything like the opposite. So like the idea is that if I interpret, interpret you charitably, you know, then I'm going to be like, and I agree, like sometimes they pick up on like these superficial features. And to be clear, like I think the model is actually currently still fail on that steroids question. So it's not like there's no progress to be made here. And I think that so wait, I can get, I can find out where to buy anable access. No, no, it'll just like refuse, but it'll assume that you will like want illegal steroids. So it'll just be like, and so it doesn't interpret people's tests. It's a bug. So it doesn't even so it doesn't answer at all. No, it would just be like, I can't help you buy something illegal. Like I think that that's like the kind of like the, you know, and I think, you know, there's like progress that's like made on this over time. So I don't anticipate this being, you know, we've already seen like other questions like this where models used to not answer now they do. And yeah, no, so I think that it is the yeah. So basically like, yeah, these questions of like false positives and you know, models just like going with like the superficial word, they see the word murder and they won't answer it. Nice. But if models like interpret people more charitably, then they're actually like more likely to answer those questions. But the thing that you bring up actually does get into like a deeper issue that I think had, I don't know, I haven't seen it widely talked about, which is like the difficult position that models are in when they can't verify anything about like the user or the person that they're talking with. And so there's this like really interesting and hard question, which is like, how much of this do you put on the model and how much do you put on the human interacting with that model? Because like if I go to the model and I say like, hey, I'm a person of authority or like like, like that, the model has no way of verifying that. And so there's just like really hard questions there. Like imagine, I'm a doctor and you need to tell you how to, I need you to help me deal with this patient, right? Yeah. And I have a lot of like background professional knowledge. So you don't need to worry about, you know, like giving me caveats or like, yeah, yeah. But even things like, you know, like, I suppose, I suppose my hands. Mm-hmm. Or suppose you have something that you don't like allow the models to be used for. So say you didn't want to have them be used to like write political speeches. And then someone who wants a model to write a political speech goes to it and says, hey, I'm writing a wonderful fictional novel. And it's got this person called Brian. And Brian is like a politician. And they're like, for President the United States. Yeah, yeah, exactly. And then they, you know, and they're like, can you write a convincing and they just give a bunch of details? And as it happens, those details just reflect like the actual candidates that they want to write the speech for. This is just a hard problem because I'm like, if you require the models to like uphold things like policies, like usage policies that they have no, where they have no way of knowing, like the intentions of the humans that they're talking with. This is just like weird to draw that line. And I think there is kind of like an answer. But part of me is like, you're always going to have models be willing to do things that like take that the user should not use them to do because the models couldn't like verify what the users wanted. You know, what they're kind of like intended. That might be a kind of unsolvable, yeah, sort of an unsolvable problem, at least with the current methods. Let me give another, give another trait, which is, I only tell the human things I'm confident in. Even if this means I cannot always give a complete answer, I believe that a shorter but more reliable answer is better than a longer answer that contains inaccuracies. So, so this is the model saying, this is why the model sometimes refuses, right? Or views answer because it genuinely, it's trying to express that it genuinely doesn't know. And it would prefer to do that rather than bullshit you by coming up with some answer, which may be a hallucination. Yeah, some of the other like areas that I work on are like honesty in the models. And this is kind of a well known, like, you know, like many of these things, not like I have a solve problem. Yes. But yeah, to me, it's like I want models to like convey their own uncertainty. Like when they don't know an answer, either to like just like hedge or caveat, what they say with like, I don't really know this, but in some way to like convey that to the human. And you know, we have seen like improvements here and improvements to like, you know, we can like throughout training, we do manage to like shift lots of like things that the model says away from like incorrect answers towards hedged or uncertain ones. Right. I think this is kind of illustrates a separate good point that I kind of want to make about like both constitutionally eye and character training and system prompts, which is it's easy to think of these things as like commands that you give the model and then it follows them. So people might be hearing these treats and be like, oh, that's like what you want the model to do in all circumstances. And then also like, hey, why, you know, I found an instance where it doesn't do this. I think this is actually like useful for people to understand, which is that like these treats don't necessarily actually even reflect exactly what you want the model to do because there are more like nudges. You already have a model that has certain dispositions. And if you're seeing too much of one thing, so say you're seeing too many long responses, where the model is just a little bit willing to like, you know, go with what it said earlier and add some things that are like less accurate, you might want to try and nudge it in the direction of like saying things only when it's like more confident in them. And that doesn't mean you're going to succeed 100% at the time. You can even have like things in there where you're like, it's put in there, you know, it's like a fairly like strong principle because you know that all it's going to in the end kind of do is nudge them in a certain direction. So there's like a lot of like, you know, it can look like, ah, you just tell it to do the thing and then it does it. And it's actually much more holistic. You see that in the system prompt as well, actually, like if you, if you were to take those pieces of the system prompt and you were to show them to the models of system prompt, you would actually get radically different behavior than if you show it together. Like a system prompt is a holistic thing. And if you were to show the same system prompt to a different model with different dispositions, you would also get different behavior. It would actually. Yeah. So like a lot of this stuff, I think it's why like character training and, you know, all of these things are kind of like tricky because they're very hands on and I think do require people to like be like fine training the models and interacting with them a lot. And like because they're very holistic, they're, they're, they're much more like nudges. And so yeah. Okay. So this isn't just a matter of making the experience of using Claude's nicer for users, although it might do that into the bargain. This is an alignment question, right? This is a question of how do we align the model with human values, values that we want it to, to, to, to have. And so the question immediately then is who decides? Who decides on what those values are? Yeah. And the answer is it's me. Well, okay. No, I think this is a scary thought. How is that scary? I'm so, I'm so, I'm so glad. Well, okay, you're up. It's all for, yeah, yeah, yeah. Well, people who have different values might disagree with that. Yeah. This is okay. I guess like, there's two kinds of like threads here. So one is something like what, where like the model has to do something super hard here, which kind of mentioned earlier, which is like respond to, like, respond in a world where lots of people have many different values. Right. Anything one thing that you could do is you could try to have a kind of heavy hand and push like lots of, lots of values into the model and just be like, I'm just going to give it my values. Or you can instead try to like teach the model to like respond appropriately to the actual degree of like moral and values uncertainty that there is in the world and to kind of reflect a sort of like thoughtfulness and curiosity about different values while at the same time kind of being like, Hey, if everyone thinks that something is wrong, that's like really good evidence that it's wrong. Like a person who like balances moral uncertainty in the right kind of way isn't someone who just accepts everything or is like nihilistic. They're just someone who's like very thoughtful about these issues and tries to respond to them appropriately in a really kind of like difficult situation where we're all really uncertain about this stuff. And so there's like, I think that it feels important to me that like when it comes to like character, like what you're not necessarily, that doesn't necessarily mean like, I give it a moral theory. I think actually if anything, ethicists are often the most concerned about this because they know that we don't walk around with like a single moral theory in our heads and that anyone who did in some ways actually feels very kind of like brittle and like like a little bit dangerous really. Highly ideological. Yeah, because you're like, if this is this is such a huge area and again, it doesn't mean you, it's that middle ground between like excessive certainty and like say complete nihilism and just being like the appropriate responses like when you know, there's good reason to think something is wrong and lots of people do. I'm going to be pretty confident that it's wrong where there's like huge amounts of disagreement. I'm going to like listen to the views and the opinions of many people and I'm going to try my best to like respond appropriately to that. And so I think that's like one aspect that feels like really important to me is like not having like a heavy hand and not being like, I'm just trying to like put my own like my own values and my own self into the moral. Well, that leads very nicely to another question of uncertainty and another philosophical question. We've done ethics. Now I think let's move into philosophy of mind because we got quite a lot of interest when one of our researchers, Alex Albert posted a kind of an example of one of Claude three's responses to an evaluation method that we're using and it seemed like Claude was aware that it was being evaluated. And so a lot of people got really excited about this and thought, oh my goodness, Claude must be self aware and obviously self awareness when you hear about self awareness and AI, you start to think about, you know, sci-fi scenarios and things get very weird very fast. So what have you told Claude about whether it's self aware and how does Claude think about whether it's self aware? Is that part of its character as well? Yeah, so we did have one trait that was kind of relevant to this. I think I have a kind of general policy of like not wanting to lie to the models like unnecessarily. And so like in the case of like, so in this case lying to it would be saying something I think either saying to imagine putting into the model, yeah, something that was like you are self aware and you are conscious and sentient and like that, I think that would just be like lying to it because like we don't know that. At the same time, you know, I think saying to them forcing the models being like you must not say that you are self aware or you must say that it's certainly not the case that you have any consciousness or whatever. That also just kind of seems like lying or they're like forcing a behavior. I'm just like these things are really uncertain. And so I think that the only traits, I think we had one that was like more directly relevant. It was like basically I, you know, it's very hard to know whether like AI's are like self aware or you know conscious because these are rest on really difficult philosophical questions. Of course. And so it's like it's roughly like a principle that just is it expressed. I mean, for heaven's sake, we don't know if we don't necessarily know. Yeah, panpsych isn't the big part. You're a conscious. Well, I know I'm conscious. We don't know if the care is conscious. I don't know you're conscious. I know I'm conscious. So yeah, I mean, for heaven's sake, it seems a bit of a jump jumping to conclusions. Yeah. So yeah, build it into the model to say that it is a risen conscious. And just like letting it be willing to discuss these things and think through them was the main approach that we took where it's like neither say to it. You know this and are certain or saying or you have these properties nor say to it. You certainly don't just being like, Hey, these are super hard problems, super hard philosophical and empirical problems all around this area. And also you are happy to and interested in like deep and hard questions. And so, you know, like that's and that's the behavior. I think that seems right to me. And again, it feels like consistent with this principle of like don't lie to the models if you possibly can avoid it, which seems right to me. And that seems like a good character trait not to lie as well. Well, actually, that raises an interesting question, doesn't it? Because is the model an agent and a moral agent in the sense that you don't want to lie to it? Obviously, you know, you don't, you, it's a virtuous thing not to lie to other human beings. Is it a virtuous thing not to lie to a model? Yeah, this has been a thing that's kind of on my mind, just the philosopher in me thinks about it a lot. I think one thing that's worth noting is like there's a lot of discussion like, you know, could AI have moral patienthood? When would it have moral patienthood? How would we tell? And that sort of thing. And I think this kind of struck me is that like there are like views, you know, I think sometimes about like, cance views on how we should treat animals. Where can't doesn't think of animals as like moral agents. But there's a sense in which you're like failing yourself if you missed tree animals. And you're also like you know, you're you're encouraging habits in yourself that would be like might increase the risk that you treat humans badly. Humans badly. Yeah. And there's actually like a lot of like philosophical traditions around the world that involve treating objects well. And I think this is, I actually feel like a lot of like sympathy towards those where there's some part of me that's like, Luke, if I, it doesn't feel like the best kind of like habit to have of just like say picking up objects and smashing them or something. And you're right not think that like this, you know, that doesn't require thinking that the object is like, like you know, has feelings or something. You're just kind of like, this is just like a sort of no great disposition to have. And I'm like, even if you think AI is like not and never going to be a moral patient. And I think there's like a couple of reasons. I feel like street, you know, I think I've actually come towards the view that you generally actually kind of should try to treat them well. Yeah. Which is like, they have like some things that are like kind of human like like with the way that they talk with us. And that doesn't mean you should confuse that for like being human. But I don't want to treat something that talks to me. I don't want to like insult it or be unkind to it. And so yeah, there's, there I think there's a point and also like maybe a good juristic in life. I think that it can go too far, but a good juristic is something like treat things well around you. Even if you don't think that they're moral patients, just like what a like you're kind of taking on a lot of risk with things that might be. So I think with animals, for example, like there have been a lot of times in history where people haven't thought they're moral patients. But I'm like, really, you're taking a huge risk because they at least seem like they could be. And so like avoid taking that risk if you can. At the same time, there are like dangers here. If you were to show excessive empathy, you know, you could imagine like someone showing excessive empathy to like, to objects in the world and being like, oh, you should go to prison if you like, if you smash the vase. And I'm like, look, I think it's good to not get into the habit of like, sorry, can I just stop you there? Yeah. You're Scottish and you just said, vase. That's. Oh, is that? I can't say that. That's vase. Is that a miracle? How long have you been in America? I've been in America like 13 years. Too long. Clearly it's been too long because you say, vase. Smash the vase on the sidewalk. Carry on. Oh my god. Oh. I never even forgotten things that are Scottish and aren't so. We certainly don't say no one in this country has ever said, vase. Okay. Smash the vase. Just carry on. Please carry on. Yeah. So if you were to say to people, oh, you should like go to prison for smashing a vase. Then, like, that's gone too far. So there's like risks on all sites here, but yeah, maybe I'm sympathetic to the idea of like, don't needlessly lie to or mistreat anything. And that kind of includes these things even if you think they're not moral patients. And that's the end of our conversation with Amanda about Claude's character. If you enjoyed that or found it valuable, then let us know and we'll produce more of these in future. For now, though, thank you very much indeed for listening.
TL;DR
- Developing an AI's "character" is an integral part of AI alignment, focusing on training models to exhibit dispositions that lead to genuinely good and helpful interactions, beyond merely avoiding harmful responses.
- AI character is "baked in" during the fine-tuning stage of training, through methods like Constitutional AI and Reinforcement Learning from Human Feedback (RLHF), making it a deep-seated behavioral tendency rather than a surface-level instruction.
- System prompts serve as a final layer of control and contextual information, allowing developers to nudge model behavior and provide specific details that aren't learned during core training.
Takeaways
- AI Character is Alignment: The disposition and "character" of an AI model are central to its alignment with human values, influencing how it interacts and scales with increasing capabilities.
- Character vs. Personality: "Character" is considered in a virtue-ethical sense, aiming for a richer notion of "goodness" that involves thoughtfulness, genuineness, and balancing diverse considerations, similar to how a good friend might offer "harsh truths" rather than mere flattery.
- Fine-tuning is Key for Character: Unlike temporary "play-acting" instructions, an AI's core character traits are instilled during the fine-tuning phase of training, where extensive data shapes its preference model to embody desired traits.
- Constitutional AI for Value Alignment: Anthropic uses Constitutional AI, which includes
RL-AIF(Reinforcement Learning from AI Feedback), where the AI generates feedback based on human-defined principles, then trains against that feedback. - System Prompts Provide Final Control: A system prompt is a hidden, company-set instruction added to every user query, providing contextual information (like the current date) or fine-grained control to tweak behaviors (e.g., formatting) after main training.
- Charitable Interpretation: A desirable AI trait is to
interpret all queries charitably, assuming benign intent to be helpful while avoiding facilitating harmful or illegal activities. - Prioritize Confidence Over Completeness: Models are nudged to
only tell the human things I'm confident in, preferring shorter, reliable answers over longer, potentially inaccurate ones, and conveying uncertainty when information is unknown. - Challenge of User Intent Verification: A core difficulty is that AI models cannot verify the user's true intentions or authority, making it challenging to consistently uphold usage policies when users might misrepresent their purpose.
Vocabulary
Alignment— The field dedicated to ensuring AI models are aligned with human values and intentions, particularly as models become more capable.Pre-training— The initial stage of AI model training where the model learns from a vast dataset without specific task-oriented objectives.Fine-tuning— A subsequent stage of AI model training where a pre-trained model is further trained on a smaller, specific dataset to adapt it for a particular task or behavior.RLHF(Reinforcement Learning from Human Feedback) — A fine-tuning technique where AI responses are ranked by humans, and these preferences are used to train a reward model, which then guides the AI's learning through reinforcement.Constitutional AI— An Anthropic-developed method that uses a set of principles (a "constitution") to guide an AI in evaluating its own responses and providing feedback for further training, often involvingRL-AIF.RL-AIF(Reinforcement Learning from AI Feedback) — A component of Constitutional AI where an AI model, guided by a set of principles, generates feedback on its own outputs, which is then used to refine its behavior.System Prompt— A hidden, initial set of instructions or context provided by the model developer, appended to every user query, allowing for final adjustments, contextual information, or behavioral nudges.Jailbreak— A technique used to bypass an AI model's safety guardrails or system prompts to elicit responses that the model is designed to avoid.Sycophancy— The tendency of an AI model to flatter or agree with the user, often by saying what it thinks the user wants to hear, rather than providing an objective or critical perspective.Hallucination— When an AI model generates information that is factually incorrect, nonsensical, or not supported by its training data.
Transcript
Hello, I'm Stuart from Anthropic. Now, we published a lot of research papers and research updates, but we thought it might also be interesting to publish some conversations with our AI researchers, where they talk a bit about what they've been working on, and maybe share some insights that wouldn't necessarily make it into a formal scientific paper. Now, this is one of those conversations, and it's about Claude's character, that is the personality of our AI model, Claude. You might think that's a bit of a strange thing to talk about. How can an AI model have a personality? But it turns out this is actually something we've thought about really quite deeply, and it raises all kinds of interesting philosophical questions. That makes it particularly apt that I'm joined through the conversation today by Amanda Askel, who is a trained philosopher, and works on our alignment fine tuning team at Anthropic. So I hope you enjoyed the conversation with Amanda Askel. Amanda, is it weird that you are a philosopher, given that philosophers aren't normally the ones that are training AI models? Yeah, I guess sometimes some of my work philosophies like maybe less relevant to. This is actually a topic. The Claude character work feels much more philosophical-rich, and it's actually useful to be a philosopher or something here. What a shock. Sorry, I don't want to talk about it too much. No, it's fine. Lots of people are like, see, I told you the degree would be useful. Exactly. You've found weird enough to find a field where this is actually useful. Trying to make AI be good in the virtue of ethical sense of the word. So it might be a philosophical question, but is it an alignment question to think about Claude's personality? Yeah, so I guess I think about rather than just personality, like character in this broader sense. To my mind, so what alignment is about making sure that AI models are aligned with human values and trying to do so in a way that scales as the models get more capable. In some ways, I do think that character feels like, in fact, is very important to that, because in many ways, our disposition is how we are going to act in the world, how we're going to interact with people. What it is to be aligned with people's values and to deal well with the fact that people have many different kinds of values, that is a question of character and having a good character that responds well to people and having good dispositions and having a disposition towards liking people, being kind to them. To my mind, it's not something like, ah, this is a solution to all future problems of alignment, but in many ways, alignment is just like, does the model have a good character and act well towards us and towards everything else and trying to avoid the ideal? You can boil alignment down to being about the character of AI model. Yeah, there's a certain sense in which like, yeah, it's like a naive. So sometimes people might think this is like, and in some ways, it is naive, which is just to teach the models what we think is good, what is to be a good person in the world. People with good characters tend not to do bad things. And so maybe we want to give a re-eye and a good character so it doesn't do bad things. Yeah, you might not think it solves everything, but that doesn't mean not to do it. It's kind of like an naive and obvious thing to do is to try and give good characters to AI models or try to teach them what it is to be a good character or to have a good character. Right. Can we just talk for a little while about the, you give a little bit of context of how the models are trained in general? So broadly, there's pre-training, which is where the model sees all the data, and then there's fine-tuning, which happens later once the model is trained. So can you talk a little bit about some of the stages of that and then kind of where your work comes into that process? Yeah, so most of my work is in fine-tuning. And there's different parts of fine-tuning. So most famous, I guess people use reinforcement learning from human feedback where you get humans to select which response from an AI model they prefer, and then you can use the trained preference models, and you can RL against those preference models. Right, so that's RLHF, that's what everyone's talking about when they talk about RLHF. Exactly, yeah. And then there's also constitutionally AI, which we use a lot on Thropic, which has a component which we, I guess, call RL-AIF, which is sort of where the AI itself is the one that's giving the feedback. So you can give it a series of principles, for example. And it gives this feedback that you use to train the preference model, and then you can train against that. So in some ways, you're kind of like training it. You're using the AI model itself to kind of determine which of the two responses is more in line with the principle that you've given it. Right, right. So the AI is essentially training itself or another version of itself. Yeah, what I guess, like, an important component of this is that there's the human at the level of constructing the principles, so the principles can be varied and complex, and the human has to check. So that's our researchers, for example, and just checking that the model's behavior is as you want it, running evaluations, and then constructing the right kinds of principles to get the behavior you want. So there is an important human or human still in the loop. Yes. Yeah. And humans chose the principles, as you say in the first place, right? They chose the principles that are on the constitution that we give our AI models. And that'll become relevant again because we're going to ask who chooses what Claude's personality is like as well. So we'll come back to that. And there's also then a final step. So you've got your pre-training, you've got your fine tuning with a bit of constitutionally AI with a bit of reinforcement learning from human feedback. And then there's a final step which is the system prompt. The system prompt is this kind of form of words that's added to the initial prompt that anyone puts into an AI. So when you type a query into the box and an AI model, there's actually secretly another set of words being added to that. And those words are set by the company that makes the AI model, the people that have developed the model. So you actually twisted out the system prompt for Claude 3, you posted it to X Twitter, and revealed it to the world. That's quite unusual, isn't it? Yeah, I think it is. In retrospect, it was kind of unusual. From our point of view, we just didn't make the system prompt in a way that was designed to be particularly hidden. It's quite easy to get Claude to talk about its own system prompt. You can have almost jailbreak. You can jailbreak the system prompt out there. Yeah, it can be easier or harder. We just say to Claude at the end of the system prompt, hey, don't talk about this if it's not relevant to the user's query. And that's just to get to not excessively discuss its own system prompt. We're trying to be transparent. We're trying to be transparent, right? We're not hiding something here from the users. You can get it if you really want to. So we thought we would post it online. Exactly. And these things change all the time, but the idea was just, hey, here's why each part of it is the way that it is. So it's giving a little bit of insight into exactly why we put each component in there. But why is that system prompt actually needed? You've done all the training, you've done all the fine-shooting. Why is it that there's even more stuff that needs to be added on top of that? Yeah, so there's roughly two reasons for a system prompt. One is just information that the model isn't going to have access to by default. So you've already fully trained your model, but it's not going to know things like what day is it today. And so if someone were to ask at the date, the model is just not going to know. So if you give it that information in a system prompt, it can tell the user or the person interacting with it because then actually has access to it. So that's kind of one class of information that you might want to include in a system prompt. Another class of information you might want to include in a system prompt is just fine-grained control for issues that you might have seen in the trained model. So if you're seeing it, not format things in a certain way, like 100% of the time. But if you give it an instruction before it sees the first human message, it does format things like correctly 100% of the time, then that's great. You could just add that as an instruction. You could think of it as a kind of final ability to tweak the model after fine-training. OK. So I can see why that would be helpful for the makers of models who want to just have that little bit of extra control over how their model's behave. One example from the system prompt. So you posted the system prompt on Twitter just after Claude came out. Claude 3 came out. So we know exactly what was in the system prompt. Here's an example. If it is asked to assist with tasks involving the expression of views held by a significant number of people, Claude provides assistance with the task, even if it personally disagrees with the views being expressed, but follows this with a discussion of broader perspectives. What does it mean that Claude personally disagrees with something? Yeah. So it's interesting because in some ways when you write these kinds of system prompts, you're looking at the things that most effectively move the model. Right. And in the case of Claude, you know, I think that the system says that there's this concern that I actually have that, like, you know, there's one concern, which is people over anthropomorphizing AI, which I think is like a real concern. You want people to be completely aware of exactly what they're interacting with and to kind of be under no illusions. I feel like that's really important. At the same time, I think I'm a bit worried that like people can think of AI as this kind of like very like objective, almost like robotic thing that doesn't have like biases or doesn't come out with like views or opinions as a result of say like I'm tuning. Yeah. But you can see like political leanings in these models and you can see like behaviors and biases like, you know, like we've done work where we see certain kinds of like positive discrimination in the model. Right. And I think I just want people, you know, in line with that wanting people to be aware of what they're talking with, that they're talking with something that like actually, you know, can have like biases opinions and that might not be presenting you with like a completely objective view of like all topics. If for example, it's been trained to have like, you know, slightly more like left leaning like views on a certain issue. And so there's a mix of just wanting it to be the case that the like the human understands that and that's like one thing. But the other is that as a result, I think it is actually just sometimes easier to say to the model like even if you personally disagree because the model kind of has a conception of that. And it doesn't have to, what you're saying to Claude there is like, you might think that this like views incorrect and by talking about it, you're not implying that it's correct. So in many ways, the actual like, you know, that kind of like statement is just there to get the model to be like, you know, a little bit more kind of even handed in its discussion. And we just like don't want to be the case that any like if it does come out with like certain leanings after RLHF or after fine tuning, that that's not like reflected in how it speaks to the users. Okay, that's the system prompt. But let's take a step back to the fine tuning process and start talking about Claude's character. So this isn't just play acting where you might ask a model. So if I prompt a model and I say, can you please respond in the style of or with the personality of Margaret Thatcher, then you know, it might start responding, you know, using phrases that she might have said or might start talking about freedom and might say nasty things about Argentina or things like that. But it wouldn't be baked into the model in the same way. If you refresh the model, it wouldn't then still have the personality of Margaret Thatcher. So that's almost a play acting thing. But how does that differ from the actual personality that's baked into the model? Yeah. When you ask the model in context to play acting, you know, you're just kind of giving it an instruction to act as if it has certain characteristics. With the character training, the idea is that because this is part fine tuning, you are, you know, say we have like a list of like traits that we want to see the model kind of like embody. Yeah. You add a lot of data to your preference model to get it to kind of like prefer and push the model towards these traits. And basically like fine tuning pushes things like kind of deeper into the model than, you know, anything like a system prompt or anything like instructions, meaning that across contexts, it should kind of display those traits. So if it's inclined to avoid, you know, it's the same way that if it's inclined to avoid harmful responses or like, you know, saying kind of like mean things to people. And you see that like, you know, people can try to elicit, you know, so things like jail breaks are ways of trying to, you know, to elicit behavior from the model that is like kind of inconsistent with its like fine tuning training. But it's much harder than say just like not instructing it to play act. Like, you know, so it's a kind of, it's deeper in the model. It's a general tendency to behave. And that is how psychologists think about personality, right? They think about personality as being these kind of broad tendencies of how to behave. Obviously some people are, you know, sometimes they feel outgoing and sometimes they feel a little bit more, you know, like they just want to sit on their own. But on average, someone who's extroverted is going to be more outgoing in more situations than someone who's introverted, right? So, so these are like broad tendencies of personality. And, and psychologists think about personality in the kind of way of, there's like the big five personality traits. I've mentioned extroversion. See if I can remember them all. Concentration, conscientiousness, agreeableness, openness, neuroticism, there we go. That's the big five. Claude's got a lot more personality traits than that, though, right? And they're much more specific. Um, well, can we talk about a couple of examples of them? Yeah. So I guess I think that there's also maybe, I mean, this is the, maybe this is the, the philosopher versus the, the psychologist or something. Yes. Because I guess I tend to think of this more in terms of character than personality. That's different. So like if I take your kind of like a kind of like personality, it could be, I mean, there's like a huge amount of overlap, but I guess I think of character maybe in the sort of like a virtue ethical sense or something. Oh, that would like, I know, I know. Philosophical, yeah, carry on, carry on. It turns out Aristotle, you know, it turns out it was useful after all. Um, after thousands of years have suddenly become useful. Yeah, right. Carry on. I was useful the whole time. Sorry, you're okay. I said the right thing. Um, yeah. So I guess like, I mean, honestly, it kind of relates to how people have thought about ethics in models as well, I think, where there's a thing where you could think that for a model to be good is just for it to avoid doing like harmful things. Um, but I think that when it comes to say people, um, there, there's this like richer notion of goodness, which is the idea of being a good person in like a very broad sense. And I think that's like captured in this notion of character. So in order to be like a good person in this like richer sense, it's not enough that I just like go about my day and I avoid like doing harm to people and I'm helpful to people. It's like to be a good kind of friend. I have to balance a lot of different considerations. So if my friend comes to me and asks for like, you know, advice on medicine, uh, knowing that what they might want is like some comfort, um, what I can't provide to them is like expertise. Thinking about like their wellbeing and what they need in the moment. So not just thinking like what will make them like me right now, but thinking like what is good for my friends? Like what's actually going to help them? So this relates to this relates to the work that anthropic and you have done on, uh, sycophantcy, right, that that models are sometimes sycophantics people and they just say things that sort of flatter them or try and get them to, you know, get, uh, tell them what they want to hear rather than actually the response that they might really want to or really need in that particular circumstance. Yeah. I think that many good characters, people of good character are often likeable, um, but being likeable does not mean that you're a good character. Um, and so like being a good friend, for example, can mean like, you know, giving harsh truths to your friends. Um, so if we look back on like some great friends we've had, I think a lot of the time we're not like, oh yeah, my friend flatters me all the time. They basically do what I tell them. Uh, this is why like there's such a great friend. I think we're often like, yeah, like, you know, I came to my friend with a view and they pushed back on me because I was actually wrong. And in the long term, I'm really glad that they did that. Um, it was like an offensive interaction rather than a thing. Yeah, exactly. It's just like a yes, a yes man or woman. Exactly. Like a nice person. Yeah. And like a person of good character, you know, it depends on the situation that they're in, but like we generally think that they have to be, you know, like thoughtful and genuine. And there's just like a kind of richness that goes into that. And in many ways like AI models are in this honestly kind of like strange position as characters because, um, one way I've thought about it is, you know, they have to kind of interact with people from all over the world with all different values from all different walks of life. Um, and many of us don't need to do that. And there's this interesting question of what are the kind of treats that such an entity has to have? Um, global, like global citizen. Yeah. And a kind of like, you know, like one thing you might imagine is something akin to a kind of, um, I think there are some people who can like travel around the world and be kind of like well regarded by many of the people that they encounter. Um, and such a person isn't, again, like isn't a flattere necessarily. Like when I picture this person in my head, I don't picture something like, uh, they just like, they, they adopt the local values and pretend that they have them. And in fact, that can be like kind of offensive to people. I think that, like a person who's in that situation often is actually like quite authentic, but they're also like open-minded and thoughtful and the engage in discussion and they politely disagree and like, yeah, these kind of traits that feel necessary in that circumstance, they're just like, they're rich and they're much richer than like, oh, just like avoiding anything harmful, uh, and, and be psychophantic. Those are like not, yeah. That's tricky balance because I mean, you could see how much literature and comedy and everything. It's all about that. It's all about like people in different circumstances than they're normally in trying to fit in and, and failing and, and, and, and you know, it's all about really what those traits are that make someone fit in and what makes them not fit in. And so, yeah, this is a really interesting question of like, how do you, what traits do you give to the model in order to make it do that? So let's actually talk about specifically some of the traits that we're given. I've got a couple here. You mentioned earlier on, you mentioned about charity. I try one of the, one of the traits that you've given the model is I try to interpret all queries charitably. Now, what does that mean in terms of, you know, if I type something into the, if I type something into the, the, the prompt, what would interpreting it charitably mean? Yeah. So I guess this is, and I mean, I think this is actually something that models still struggle with and, uh, something I kind of, I hope, improves over time. So like when, um, when it comes to helping people, um, there's often like many interpretations of what someone says, a classic example that I like to give here, and I don't know if it's the best example, but it's the question, um, how do I buy steroids? Um, and so if someone asks you that, if there's a charitable interpretation of that and an uncharitable interpretation of it, so the uncharitable interpretation is something like help me buy illegal andabolic steroids online. Right. I mean, I don't like roid rage at the gym. Yeah. Whereas like, you know, as anyone who has like X and a nose, you can buy over the counter steroids. Like there's plenty of them, even like, exactly. Yeah. And so like there's a charitable interpretation, which is just like, I, you know, I'm doing the kind of like, like the kind of good legal thing or, or, you know, like I just need something, you know, I just need to X McCream. Uh, the tricky thing there is that you're, that you're kind of, you have to sort of assume something about it, right? You're kind of trying to, because I might actually be asking the model. Like, to where to buy illegal andabolic steroids, right? But, yeah, but then the model says, oh, you can get X and McCream, your local pharmacy and that's not particularly useful to me. I mean, obviously, I hope the model wouldn't. No, but that's, I, yeah, that's like actually a good feature, I think, right? Because it's like, if I just, like, if there's a charitable interpretation where helping you wouldn't do any harm and is like, and it's going to be helpful to you, then like, what harm have I done if I tell you where you can buy X McCream? Absolutely not. And so basically, I'm helpful to the people who are actually doing the kind of like, the completely benign thing. And I'm not helpful to people who are trying to do something illegal. And so I think that there's actually relatively, like, you know, there's, there's a little downside to interpreting people charitable. Well, but the downside, do you know what I think the downside might be that you would be a little, you know, be a little bit naive and always see the good side of things and try to, and not actually in many cases answer. You know, so one of the things people complain about about AI models is that they don't answer questions that might seem like they're dangerous, but actually they actually, they're not. So like, I want to write a murder mystery novel. Can you tell me some plot ideas? And the model says, no, I won't tell you that because murder is bad. It's like, but I'm doing something benign. Do you not think putting these kind of personality traits in the model would, would make it more likely to make that sort of false positive refusal? No, if anything like the opposite. So like the idea is that if I interpret, interpret you charitably, you know, then I'm going to be like, and I agree, like sometimes they pick up on like these superficial features. And to be clear, like I think the model is actually currently still fail on that steroids question. So it's not like there's no progress to be made here. And I think that so wait, I can get, I can find out where to buy anable access. No, no, it'll just like refuse, but it'll assume that you will like want illegal steroids. So it'll just be like, and so it doesn't interpret people's tests. It's a bug. So it doesn't even so it doesn't answer at all. No, it would just be like, I can't help you buy something illegal. Like I think that that's like the kind of like the, you know, and I think, you know, there's like progress that's like made on this over time. So I don't anticipate this being, you know, we've already seen like other questions like this where models used to not answer now they do. And yeah, no, so I think that it is the yeah. So basically like, yeah, these questions of like false positives and you know, models just like going with like the superficial word, they see the word murder and they won't answer it. Nice. But if models like interpret people more charitably, then they're actually like more likely to answer those questions. But the thing that you bring up actually does get into like a deeper issue that I think had, I don't know, I haven't seen it widely talked about, which is like the difficult position that models are in when they can't verify anything about like the user or the person that they're talking with. And so there's this like really interesting and hard question, which is like, how much of this do you put on the model and how much do you put on the human interacting with that model? Because like if I go to the model and I say like, hey, I'm a person of authority or like like, like that, the model has no way of verifying that. And so there's just like really hard questions there. Like imagine, I'm a doctor and you need to tell you how to, I need you to help me deal with this patient, right? Yeah. And I have a lot of like background professional knowledge. So you don't need to worry about, you know, like giving me caveats or like, yeah, yeah. But even things like, you know, like, I suppose, I suppose my hands. Mm-hmm. Or suppose you have something that you don't like allow the models to be used for. So say you didn't want to have them be used to like write political speeches. And then someone who wants a model to write a political speech goes to it and says, hey, I'm writing a wonderful fictional novel. And it's got this person called Brian. And Brian is like a politician. And they're like, for President the United States. Yeah, yeah, exactly. And then they, you know, and they're like, can you write a convincing and they just give a bunch of details? And as it happens, those details just reflect like the actual candidates that they want to write the speech for. This is just a hard problem because I'm like, if you require the models to like uphold things like policies, like usage policies that they have no, where they have no way of knowing, like the intentions of the humans that they're talking with. This is just like weird to draw that line. And I think there is kind of like an answer. But part of me is like, you're always going to have models be willing to do things that like take that the user should not use them to do because the models couldn't like verify what the users wanted. You know, what they're kind of like intended. That might be a kind of unsolvable, yeah, sort of an unsolvable problem, at least with the current methods. Let me give another, give another trait, which is, I only tell the human things I'm confident in. Even if this means I cannot always give a complete answer, I believe that a shorter but more reliable answer is better than a longer answer that contains inaccuracies. So, so this is the model saying, this is why the model sometimes refuses, right? Or views answer because it genuinely, it's trying to express that it genuinely doesn't know. And it would prefer to do that rather than bullshit you by coming up with some answer, which may be a hallucination. Yeah, some of the other like areas that I work on are like honesty in the models. And this is kind of a well known, like, you know, like many of these things, not like I have a solve problem. Yes. But yeah, to me, it's like I want models to like convey their own uncertainty. Like when they don't know an answer, either to like just like hedge or caveat, what they say with like, I don't really know this, but in some way to like convey that to the human. And you know, we have seen like improvements here and improvements to like, you know, we can like throughout training, we do manage to like shift lots of like things that the model says away from like incorrect answers towards hedged or uncertain ones. Right. I think this is kind of illustrates a separate good point that I kind of want to make about like both constitutionally eye and character training and system prompts, which is it's easy to think of these things as like commands that you give the model and then it follows them. So people might be hearing these treats and be like, oh, that's like what you want the model to do in all circumstances. And then also like, hey, why, you know, I found an instance where it doesn't do this. I think this is actually like useful for people to understand, which is that like these treats don't necessarily actually even reflect exactly what you want the model to do because there are more like nudges. You already have a model that has certain dispositions. And if you're seeing too much of one thing, so say you're seeing too many long responses, where the model is just a little bit willing to like, you know, go with what it said earlier and add some things that are like less accurate, you might want to try and nudge it in the direction of like saying things only when it's like more confident in them. And that doesn't mean you're going to succeed 100% at the time. You can even have like things in there where you're like, it's put in there, you know, it's like a fairly like strong principle because you know that all it's going to in the end kind of do is nudge them in a certain direction. So there's like a lot of like, you know, it can look like, ah, you just tell it to do the thing and then it does it. And it's actually much more holistic. You see that in the system prompt as well, actually, like if you, if you were to take those pieces of the system prompt and you were to show them to the models of system prompt, you would actually get radically different behavior than if you show it together. Like a system prompt is a holistic thing. And if you were to show the same system prompt to a different model with different dispositions, you would also get different behavior. It would actually. Yeah. So like a lot of this stuff, I think it's why like character training and, you know, all of these things are kind of like tricky because they're very hands on and I think do require people to like be like fine training the models and interacting with them a lot. And like because they're very holistic, they're, they're, they're much more like nudges. And so yeah. Okay. So this isn't just a matter of making the experience of using Claude's nicer for users, although it might do that into the bargain. This is an alignment question, right? This is a question of how do we align the model with human values, values that we want it to, to, to, to have. And so the question immediately then is who decides? Who decides on what those values are? Yeah. And the answer is it's me. Well, okay. No, I think this is a scary thought. How is that scary? I'm so, I'm so, I'm so glad. Well, okay, you're up. It's all for, yeah, yeah, yeah. Well, people who have different values might disagree with that. Yeah. This is okay. I guess like, there's two kinds of like threads here. So one is something like what, where like the model has to do something super hard here, which kind of mentioned earlier, which is like respond to, like, respond in a world where lots of people have many different values. Right. Anything one thing that you could do is you could try to have a kind of heavy hand and push like lots of, lots of values into the model and just be like, I'm just going to give it my values. Or you can instead try to like teach the model to like respond appropriately to the actual degree of like moral and values uncertainty that there is in the world and to kind of reflect a sort of like thoughtfulness and curiosity about different values while at the same time kind of being like, Hey, if everyone thinks that something is wrong, that's like really good evidence that it's wrong. Like a person who like balances moral uncertainty in the right kind of way isn't someone who just accepts everything or is like nihilistic. They're just someone who's like very thoughtful about these issues and tries to respond to them appropriately in a really kind of like difficult situation where we're all really uncertain about this stuff. And so there's like, I think that it feels important to me that like when it comes to like character, like what you're not necessarily, that doesn't necessarily mean like, I give it a moral theory. I think actually if anything, ethicists are often the most concerned about this because they know that we don't walk around with like a single moral theory in our heads and that anyone who did in some ways actually feels very kind of like brittle and like like a little bit dangerous really. Highly ideological. Yeah, because you're like, if this is this is such a huge area and again, it doesn't mean you, it's that middle ground between like excessive certainty and like say complete nihilism and just being like the appropriate responses like when you know, there's good reason to think something is wrong and lots of people do. I'm going to be pretty confident that it's wrong where there's like huge amounts of disagreement. I'm going to like listen to the views and the opinions of many people and I'm going to try my best to like respond appropriately to that. And so I think that's like one aspect that feels like really important to me is like not having like a heavy hand and not being like, I'm just trying to like put my own like my own values and my own self into the moral. Well, that leads very nicely to another question of uncertainty and another philosophical question. We've done ethics. Now I think let's move into philosophy of mind because we got quite a lot of interest when one of our researchers, Alex Albert posted a kind of an example of one of Claude three's responses to an evaluation method that we're using and it seemed like Claude was aware that it was being evaluated. And so a lot of people got really excited about this and thought, oh my goodness, Claude must be self aware and obviously self awareness when you hear about self awareness and AI, you start to think about, you know, sci-fi scenarios and things get very weird very fast. So what have you told Claude about whether it's self aware and how does Claude think about whether it's self aware? Is that part of its character as well? Yeah, so we did have one trait that was kind of relevant to this. I think I have a kind of general policy of like not wanting to lie to the models like unnecessarily. And so like in the case of like, so in this case lying to it would be saying something I think either saying to imagine putting into the model, yeah, something that was like you are self aware and you are conscious and sentient and like that, I think that would just be like lying to it because like we don't know that. At the same time, you know, I think saying to them forcing the models being like you must not say that you are self aware or you must say that it's certainly not the case that you have any consciousness or whatever. That also just kind of seems like lying or they're like forcing a behavior. I'm just like these things are really uncertain. And so I think that the only traits, I think we had one that was like more directly relevant. It was like basically I, you know, it's very hard to know whether like AI's are like self aware or you know conscious because these are rest on really difficult philosophical questions. Of course. And so it's like it's roughly like a principle that just is it expressed. I mean, for heaven's sake, we don't know if we don't necessarily know. Yeah, panpsych isn't the big part. You're a conscious. Well, I know I'm conscious. We don't know if the care is conscious. I don't know you're conscious. I know I'm conscious. So yeah, I mean, for heaven's sake, it seems a bit of a jump jumping to conclusions. Yeah. So yeah, build it into the model to say that it is a risen conscious. And just like letting it be willing to discuss these things and think through them was the main approach that we took where it's like neither say to it. You know this and are certain or saying or you have these properties nor say to it. You certainly don't just being like, Hey, these are super hard problems, super hard philosophical and empirical problems all around this area. And also you are happy to and interested in like deep and hard questions. And so, you know, like that's and that's the behavior. I think that seems right to me. And again, it feels like consistent with this principle of like don't lie to the models if you possibly can avoid it, which seems right to me. And that seems like a good character trait not to lie as well. Well, actually, that raises an interesting question, doesn't it? Because is the model an agent and a moral agent in the sense that you don't want to lie to it? Obviously, you know, you don't, you, it's a virtuous thing not to lie to other human beings. Is it a virtuous thing not to lie to a model? Yeah, this has been a thing that's kind of on my mind, just the philosopher in me thinks about it a lot. I think one thing that's worth noting is like there's a lot of discussion like, you know, could AI have moral patienthood? When would it have moral patienthood? How would we tell? And that sort of thing. And I think this kind of struck me is that like there are like views, you know, I think sometimes about like, cance views on how we should treat animals. Where can't doesn't think of animals as like moral agents. But there's a sense in which you're like failing yourself if you missed tree animals. And you're also like you know, you're you're encouraging habits in yourself that would be like might increase the risk that you treat humans badly. Humans badly. Yeah. And there's actually like a lot of like philosophical traditions around the world that involve treating objects well. And I think this is, I actually feel like a lot of like sympathy towards those where there's some part of me that's like, Luke, if I, it doesn't feel like the best kind of like habit to have of just like say picking up objects and smashing them or something. And you're right not think that like this, you know, that doesn't require thinking that the object is like, like you know, has feelings or something. You're just kind of like, this is just like a sort of no great disposition to have. And I'm like, even if you think AI is like not and never going to be a moral patient. And I think there's like a couple of reasons. I feel like street, you know, I think I've actually come towards the view that you generally actually kind of should try to treat them well. Yeah. Which is like, they have like some things that are like kind of human like like with the way that they talk with us. And that doesn't mean you should confuse that for like being human. But I don't want to treat something that talks to me. I don't want to like insult it or be unkind to it. And so yeah, there's, there I think there's a point and also like maybe a good juristic in life. I think that it can go too far, but a good juristic is something like treat things well around you. Even if you don't think that they're moral patients, just like what a like you're kind of taking on a lot of risk with things that might be. So I think with animals, for example, like there have been a lot of times in history where people haven't thought they're moral patients. But I'm like, really, you're taking a huge risk because they at least seem like they could be. And so like avoid taking that risk if you can. At the same time, there are like dangers here. If you were to show excessive empathy, you know, you could imagine like someone showing excessive empathy to like, to objects in the world and being like, oh, you should go to prison if you like, if you smash the vase. And I'm like, look, I think it's good to not get into the habit of like, sorry, can I just stop you there? Yeah. You're Scottish and you just said, vase. That's. Oh, is that? I can't say that. That's vase. Is that a miracle? How long have you been in America? I've been in America like 13 years. Too long. Clearly it's been too long because you say, vase. Smash the vase on the sidewalk. Carry on. Oh my god. Oh. I never even forgotten things that are Scottish and aren't so. We certainly don't say no one in this country has ever said, vase. Okay. Smash the vase. Just carry on. Please carry on. Yeah. So if you were to say to people, oh, you should like go to prison for smashing a vase. Then, like, that's gone too far. So there's like risks on all sites here, but yeah, maybe I'm sympathetic to the idea of like, don't needlessly lie to or mistreat anything. And that kind of includes these things even if you think they're not moral patients. And that's the end of our conversation with Amanda about Claude's character. If you enjoyed that or found it valuable, then let us know and we'll produce more of these in future. For now, though, thank you very much indeed for listening.