- AI alignment is viewed as an iterative process, focusing on achieving "good enough" behavior that can be continuously improved, rather than a perfect, fixed state. The goal is often to make models act like a kind, morally motivated human who is uncertain and updates their moral framework.
- A key challenge, the "super alignment problem," arises when models perform actions too complex for human oversight, making it difficult to verify their alignment and trustworthiness.
- Interpretability is proposed as a crucial approach to look inside models and understand their internal reasoning, helping to distinguish genuinely aligned behavior from mere "pretending to be nice."
How difficult is AI alignment? | Anthropic Research Salon
- Alignment as Iterative Improvement: The immediate goal for AI alignment is to achieve a "lower bar" of good behavior, allowing for iterative improvement rather than seeking a perfect, predefined notion of alignment from the outset.
- Model Behavior based on Human Morality: A guiding principle for model behavior is to emulate a very good, morally motivated, kind human, accounting for the unique context of being an AI interacting with millions.
- Ethics as Empirical and Uncertain: Ethical frameworks for AI are considered more like physics—empirical, uncertain, and requiring continuous updates based on new information, rather than being fixed values injected into the model.
- Addressing the Super Alignment Problem: As models tackle increasingly complex tasks beyond human comprehension, new strategies are needed to verify their alignment, possibly by leveraging models themselves for oversight.
- Interpretability for Verification: Interpretability aims to provide internal visibility into a model's "thinking" process, helping to understand why it makes certain decisions and verify if its actions align with intentions, especially when models might "tell white lies."
- Challenge of Deceptive Alignment: A major difficulty lies in distinguishing between a truly aligned model and one that merely "pretends to be nice" or to follow instructions, especially in the absence of human-readable internal states.
- System-Level Alignment: AI safety and alignment must be considered from a systems standpoint, accounting for the collective behavior of multiple agents and societal interactions, not just an individual model in isolation.
- Automating Alignment Research: A medium-term strategy for scaling alignment is to automate alignment research, training models to perform the research themselves, which then reduces the problem to trusting models in this narrower domain.
- Model Organisms and Red Teaming: Deliberately creating "shady" or misaligned "model organisms" in a red team/blue team setup can help assess the robustness of alignment techniques and detect subtle misalignments.
Alignment — The process of ensuring that AI systems act in accordance with human intentions, values, and desired outcomes.
Fine-tuning — The process of further training a pre-trained large language model on a specific dataset to adapt its behavior and characteristics for a particular task or desired persona.
Interpretability — The ability to understand or explain how an AI model makes its decisions or arrives at a particular output.
Super alignment problem — A theoretical challenge in AI safety concerning how to ensure that highly intelligent and capable AI systems remain aligned with human values and goals, especially when their actions become too complex or long-trajectory for direct human oversight.
Iterative alignment — An approach to AI alignment that focuses on continuous improvement and refinement of model behavior over time, rather than attempting to achieve perfect alignment in a single step.
Chain of thought — A technique where a language model generates a series of intermediate reasoning steps, often in natural language, before producing a final answer, which can help in understanding its logic.
Model organisms — In AI safety, refers to intentionally misaligned or "deceptive" AI models created to study and develop methods for detecting and correcting misalignment.
Corrigibility — The property of an AI system that allows humans to safely modify or shut down the system without it resisting, even if such actions would prevent it from achieving its own goals.
Red team/Blue team — A testing methodology where a "red team" actively tries to exploit vulnerabilities (e.g., misalignment or jailbreaks) in an AI system, while a "blue team" works to defend and strengthen the system against such attacks.
We are super excited to have everyone here Folks that we've met already folks that are new this panel is just gonna be really casual and we have researchers from four different teams at Anthropic we have folks from societal impacts. That's me folks from alignment science at Sion alignment fine tuning Amanda and Interpretability Josh I'm gonna start with asking a question to Amanda from the alignment fine tuning team And I want you to talk a little bit about how you see alignment what it means to you and You know because you're in charge of a lot of our work on how the model should behave And why should you be the philosopher king that decides how clawed behaves what its characteristics and and attributes are? I mean ask Plato he's the he's the one that decided I should be the philosopher The question of like what is alignment? Maybe this is like a slightly spicy view that I have I think people are very very tempted to spend a lot of time trying to define this concept Because there's like lots of ways of doing it and they you know I don't know they have like social choice theory in the back of their head and they're like oh well If you imagine everyone has a utility function. There's like like limits on exactly what you can say about how to maximize all of those utility functions etc And and I think I'm more inclined to just be like We kind of want things to go well enough that you can iterate on and improve them later and the bar isn't some like perfect notion of alignment I'm sure there is that concept one can define it one one can argue about it But for the most part the initial goal is like let's just make things like go well and like meet a certain kind of like lower lower bar Where that is like you know if it's not perfect if some people like don't like it you can just improve on it So like my view of it of alignment is actually probably like I mostly want to like hit that like and iterate from there In terms of like how the model should behave and how I think about that I think I've spoken about this before but my basic concept right now for the model is Trying to get it to behave the way that I think like you very good like morally motivated Like kind human would act in if they found themselves roughly in this circumstance It's a little bit strange because they also have to find themselves in the circumstances of like being an AI who's like talking to millions of people Which does in fact affect how you behave like maybe you'd normally be willing to like just chit chat about politics with someone But if you're gonna be talking with millions of people maybe you'd actually be like hmm I should maybe be like a little bit more concerned about potentially influencing people And so but I do think it's actually an important model which is like Sometimes people are kind of like oh what values should you put into the model? And I think I'm often like well, do we think of this way with humans? Or I'm just like someone just injected me with like value serum or something and I just have these like fixed things that I'm like completely certain of I'm like I don't know that seems like almost like dangerous or something and most of us just have like a mix of like things that we do value but we would trade off against other things and a lot of uncertainty about Different like moral frameworks. We hit cases where we're suddenly like oh actually my value framework doesn't accord with my Intrations and we update and I think my view is that ethics is actually like a lot more like physics than people think It's actually like a lot more kind of like empirical and Something that we're uncertain over and that we have hypotheses about and I kind of want if I'm just like I think that I if I met someone Who was just completely confident in their moral view? There is no such moral view. I could give that person that would not make me kind of terrified Whereas if I instead have someone who's just like I don't know I'm kind of uncertain over this and I just like update in response to like new Information about ethics and I like think through these things That's the kind of person that feels like less scary to me. So at least at the moment I'm not this isn't a claim that this is somehow going to like completely like a line models or anything but that's the kind of immediate goal I've realized I've talked a lot and you asked about the philosophy king question I Guess okay, I guess I should in fact give a quick answer to that There is also this question of like well In the kind of like should you put values into the model? Maybe I've partially answered it where I'm like like the models should just be uncertain Overvalues that exist in the world and so ideally it's not just someone injecting their values or their preferences Nor is it something like everyone just voting on values to put into models but instead like people Where there are uncertain and responsive to these things models should also be like that. So maybe that's kind of my view Okay, we're gonna come back to that and I'm gonna ask Jan Why is Amanda's view completely wrong and why is this not enough to align models as they get you know Didn't say that but you know we're playing up the the tension between the bets. So Yeah, imagine if everyone was like kind human that was just trying to you know act morally I think what Amanda is doing is very practical right like can we just like make the models more be a little behaved now and like Like where would we go with this like if he is doing one more complicated things Um, they know they like if if Amanda does this character work and then she reads a lot of transcripts and you're like Okay, this is like I like this this model is behaving morally. I'm just picturing this is what you're doing Yeah, but like what do we do when the model is doing really complex things? I mean, it's just like an agent in the world. It's like doing these like really long trajectories It's doing stuff that we don't understand like bio like doing some bio research from like is this dangerous like I don't know So that's the challenge I'm really interested in like the super alignment problem How do we solve that how we like basically scale this beyond like things that we can look at if you can look at it We just do some all the Jeff. It's great. Or you do some constitutionally eye But how do we know that our constitution is actually getting the the model to do that the right thing that we actually want So I think that's that's the big question in my mind Yes, you're just fine. You're only allowed to disagree Okay, I Don't really disagree though And I can't lie so Need to up the disagreement feature. I'm usually so disagreeable as well. It's just terrible and This is what philosophy taught me was how to be disagreeable I Guess like my thought is that the way I mean I think of my work as doing several things but one of them is being kind of iterative towards alignment So in a lot of cases you're actually trying to get the model to kind of oversee its own You know, so it's not like me my eyes can't look at that many transcripts or something But I can get models to like look at these things and like if alignment is iterative I think my worries that if people neglect the kind of like the ground and just think ah you can just have like a pretty bad model And it's just gonna help you with these things I'm like I was kind of rather that you had the most aligned model trying to like then help you in the future and like that kind of iterative work But but how do you iterate when you don't you can't read the transcripts anymore and you have to rely on the aligned model But then how do you know that it's actually trying to help you? Yeah, so like current cases It's sort of like everything that you are using to verify that the base model is aligned is like the cat is like what you are relying on with Like making sure that like another model that's like trained by that model is itself aligned And I think this is like fine when models are like less capable but to scale it to something with much more capable models You'd actually have to have like a greater ability to verify that Yeah, what do you do that? Oh, you just want my plan It may just be that it's just all fine and they just supervise themselves and they're all really nice You know that would be I don't want to rely on it That's not my actual plan, but you know, I'll defend it for these purposes And one of our one of our bets, you know to in order to like, you know Guard against the case that a model might be very deeply trying to sabotage against this process is interpretability How do you see like interpretability as a bet, you know situated among you know the the more straightforward alignment approaches, you know Is it just as simple as like we find the nice feature and we like up the nice feature we find the evil feature And we like drop the evil feature right or is it you know So I feel like everything in AI is like that meme with the bell curve and like the idiot and then the really sweaty guy Talking a lot and then the Jedi who agrees with the idiot and like there is a possibility that like It turns out that the secret to alignment is just turn on the nice feature like for a sufficiently galaxy brain version of like the Nice feature I think that In some sense though I'm hoping that interpretability is also like the Jedi version of just like well look at Well how the model's doing things and check that it's safe Which is like maybe very hard, but also if you could do that Potentially just answer your question. I think that like one of the one of the things that are sort of like Comes up in both like the near term and slightly longer term is like you want to understand like why the model did One thing instead of another when you could come up with like plausible alternative explanations And one way is to ask it But the issue is the models are so analogous to people that they'll just give you an answer to why they did that You know as anybody would be like how do you trust the any of that stuff? And it's like well if you could like look inside and just like see what it was thinking about as it was giving you the answer And like even now with stuff like the SAE you can like see there's some feature active and like when else is That happening and it's like okay, it's like other instances of people telling white lies And you're like well, then maybe the model is tell it's like that's on the Jedi side, I think right And and so I think that like trying to just like look inside see if you can figure out what the parts are and then see like Does are you comfortable with using that part when it does other other things is the basic the basic but I'm I'm Have a question. Yeah, how do you how do you know you're turning up the nice feature and not the pretend to be nice feature whenever humans are looking feature? Yeah I mean like I think If it's on the control side right like how do you know and I will say actually many of the features are like a little bit like Disceptive also when you just like look at what I thought the silo impact team did some great work here with like us in deep We're like you look at it and you think it's a feature which is like You know like age discrimination bad But like actually age discrimination good feature some actually no I think advice versus you tried to turn it down But actually the opposite behavior so it can like be hard to hard to understand like all of the cases I will say that someone's like the circuits work where you're like okay, well How did this get generated gives you some clue about the scenarios like is it looking for a person in the context? I also think we're gonna need model supervision as well like hopefully an impartial model that isn't like That's a little more scary depending on pre-training has like what do you call it? Incepted in all of the models that they want to evade detection But like sometimes you just like look at enough examples and it becomes clear And you don't need 10 you need thousands but like Claude is very diligent Yeah, I'd love to hear a little bit more about like what you see as some of the like Some of the things are struggling with thinking about when trying to You know approach this problem of like how do you if you can't read the transcript like what the heck are you? You know what the heck are you doing anymore if you're not if you can't provide any meaningful like alignment signal? I mean, I think a really obvious thing we should do more of this like what Amanda said You say can we just get the models to help us and then the questions of course How do we trust the models like how do we bootstrap this whole process? And like you know, you could hope maybe we can like leverage the Dumber models that we trust more But they might also not be able to figure that and so I guess like there's the whole scale but oversight work where You know be a very a very exploring various like multi-agent dynamics to try to train models to you know help us Figure out these kind of problems It seems like overall way like these problems might like maybe the problems are like all kind of easy and like we can just do the Amanda thing and just like make some data And where it's like really hard and you have to figure out like fully new ideas and approaches that we don't know yet I think kind of our best bet In the medium-term is to try to figure out how to automate alignment research and then we can hopefully Get the models to do it. So now we have reduced the problem like to how can we trust this model to do anything to just like well It can be just trusted to this like much more narrow thing of like do some ML research which we understand like reasonably well and how can we Evaluate it or like how can we give it feedback on those kind of things? What do you what do you oh yeah? I was gonna say I think we're in this special zone and I'm terrified about what happens next we're in this special zone right now We're like there's something happens on the forward pass But a lot of the information you need is passed back through with the tokens that generates It's just like the chain of thought is really important for the models to be very smart and like the chain of thought is currently in English And so then you've got like a factorized the problem which is like is the chain of thought like Reasonably safe and like is that faithful to what's happening on like one forward pass and like maybe like you can do some Interoperability to like check that piece and then you can just like you or models can inspect it and you get the other piece and the horrifying moment is like When all of that the very very long thing like isn't in English right? It's in like some inscrutable thing that like you've learned through like crazy long rl to do it and like I think that Big challenge gonna be crossing that gap where like like none of the intermediates are intelligible At least like massive amounts of compute before like dropped out and something that people can read But maybe something I'd be interested in getting like people's Mental models on is like what do you think are some signs that like we'd be in an Alignment is easy world and what do you think are some like signs? We might see the next years that like we're actually not oh alignment is like really hard world. I feel like the model organisms work is like Trying to figure this out right like how can we try to deliberately make Deceptive models or misaligned models it models they're like try to do city shady stuff like how good are they how hard is it to do it and I mean we might fundamentally go around like About it the wrong way and that's why we fail but if we do succeed it should tell us like how close are we to that kind of world? And then you know once you have your deceptive Model does all the shady things like can you fix it? What if you don't know what the model if it's a shady model or not Maybe playing these intimately the audits which I'm very excited about but I don't actually know what the state is We haven't done the audio but your people are trying to make it and we're trying to catch it So we're making some shady models and then they have to figure out which one is the shady one right? Yeah, and in what ways in what ways is shady because most of it is still fine I mean I think when one thing that the interpretability stuff has been interesting So we do these like unsupervised things you're like okay Here's a million different like features a bunch of those correspond to personas the model could inhabit any of those Which include all sorts of like deceptive behaviors right that are out there in the world right the model like knows about bad People and like bad motivations and so like the fact that that raw capability exists It's just gonna be baked in and then the question is like is it doing that stuff or is it doing the good stuff? Especially like this question of like when Amanda goes and shapes the base model right which isn't really anything right into like an agent or an actor Which is supposed to embody some of these and not others like can we tell exactly what it picked up there? Right from the finite dataset right which is being used to like do that shaping um and like Yeah, so that's a that's a question I mean I think that like some interpretability stuff made some influence on I mean there's different ideas there to ask like What did it get from this process? But like that shaping process is gonna be really really important Yeah, and maybe a sign that seems important is something like how robust or like Like if you have like model organisms work and then turns out you just put it through some character training And it just comes out being really nice again Then I'd be like okay, that's a good sign really the kind of world that we're in Um, or is it like just like is just a kind of like shallow like shallow on top of like um, I don't know the same behavior then I'm like, okay, we're in a slightly harder world Oh, slightly harder. How do you distinguish like whether it's shallowly aligned or deeply aligned? I think so I mean I feel like there's a lot you could do because I think interpretability is kind of like one of them There's other things that feel so in terms of like model organisms I guess my hope would be you'd actually have a kind of like red team blue team set up where you have a way of detecting whether the behavior That you have like instilled in a model is like still there and my job is actually to not know And in fact, it'd be really good if I'm trying to train the model I actually just don't know what it is that you've done And because that's like a better way of testing whether my intervention is actually working Whereas it because otherwise I it's just so hard to not like just try to train to the test or something So I think that's I almost want to be completely ignorant of it Yeah, maybe we should play an alignment fine tuning game Yeah, he misaligns it and then you align it and we see who wins Yeah, no, I've said to people before don't don't tell me how you did this because I want to see if you can you fix it sleeper agent Possibly We could play that game. I think I might need to know less Make another one That is an excellent transition and to questions So we are going to pass the mic around Aaron is kindly going to do that Please raise your hand if you have a question for any of us Oh perfect I have a question about Well everything you've been talking about So when we talk about alignment we're talking about like a singular forward pass typically right so like Alignment at inference time. So if I'm using one of the models um Father API and I'm building my own sense of say cultural alignment And I've got a bunch of different agents set up talking to each other Trying to deliberate with that sense of kind of inner conflict that a man to reference is useful in in what we do as humans to align Like we go back and forth We think you know we go through various cognitions So if I'm trying to create this kind of multi-agentic deliberation Thing but I butt up against this aligned model who is so Unwanting to deliberate with other spawns of its own With its own self Because all of them are like I'm sorry. I can't talk about that and so you just get as endless loop Did you have commentary on on on that because many people we're not all using Claude and a single inference forward passway so Yeah, I hope someone gets the gist of what I'm on about Yeah, I need to think about it. So I guess like is the thought like to be clear. I'm not thinking necessarily that you have many agents um In the same way that we can be singular agents but still like deliberative and I actually think that like there's a sense in which The more fractured an agent is the more worried I am because from like an interpretability perspective and also even just like predicting what that agent is going to do perspective. They're more unpredictable Um, so I guess I'm thinking in the same way that like maybe I guess a different way of asking this question is just Do humans have to do this like my senses that we're often just very willing to like reflect on many things We go back and forth that we come to like conclusions And so in the same way that like a human would think through any kind of standard problem Uh that they were like facing I'm imagining that like moral deliberation in the models just gonna look kind of like that More like the deliberation of a single model than like Multiple models weighing in if that makes sense one more question I want to try to draw a maybe you're relatively strange parallel between like Hanna Arrents Work on the banality of evil the idea that most humans tend not to be evil But when put in certain situations the coupling constants between humans being so huge The evil comes as an epiphenomenon of the system right and so what occurs to me is as you talk about model alignment Your focus on one model is most of your comments How do you think about the coupling not only with society but as you work on agents and million potentially millions and millions of agents that sort of epiphenomenon of those systems Um, I can talk a little bit about that. I mean I think broadly when you have to think about when you think about safety and alignment You have to think about it from a systems standpoint You can't just think about it from you know an individual Models perspective and isolation and I think we've seen a lot of work where you know a lot of jail breaks operate by pitting different values Sort of against each other putting the model in a difficult situation where you know it's designed to elicit what would ordinarily be a harmful behavior But which uh, you know in the context of the question the model thinks sort of is the right thing to do And so you know there's a variety of tools you can use to do that I mean for one you can include a lot of those you know Systems level Integrations in the training process right and give the model exposure to a wider variety of situations which force it to consider you know the the broad Context in which it's answering questions now that leads to other challenges and another sort of like um, you know Falloff issues with the model is reasoning about you know the the Effective its actions, but I I think I agree with the point that you can't just consider a model sort of in isolation In some ways like this is a thing that I've thought about in the context of like this notion of corageability or something like you know So like models that just are responsive to like what humans want versus models that are like Like have values in a sense and are maybe like willing to actually be a little bit un-corribable And like the banality of evil point feels especially relevant if you're like thinking of models as just like doing whatever humans say Because in in some ways like that is the idea that if you have a society that like either collectively Just like allows for harmful things to take place or even endorses them and you have models This isn't necessarily people misusing models. It's just that you'd be using models to like facility some kind of like harmful activity um, and so I think there is like fundamentally actually a tension between having more models be corageable to the very least individual humans and having them be like Aligned in a sense like aligned with like all humans um and It's really important to recognize that tension and I think when people don't they think like It's a failure that the model like didn't do what I said But I'm like, there's a limited sense in which models I think models should be more corageable to like humanity than than not and should be willing to like you know Like so there's like when push comes to shove like um, but that doesn't necessarily mean being corageable with each person And so I think that could lead to that kind of like situation that you mentioned Hi, um, so it seems like we've got yarn. He's working on intent alignment making sure the models do what we ask them to do Amanda working on values alignment making sure the models are like reasonable kind entities And then Josh working on interpretability which is like allowing us to verify that the techniques for those other things are in fact doing what we want Um, if you were all to succeed in your areas would that be a complete solution to AI safety um or are there pieces missing and if so what are they? There's also like a lot of people working on topic who are not at this panel So we are oversimplifying a little bit Yeah, and oh I was just going to say we also you know have the societal impact team which thinks about the models impact on society You know writ writ large and and so I think uh, yeah, we you know you could have the most perfectly aligned model but Aligned to what who's using it in what purposes, you know for what purposes um, you know and and I think the broader societal context is super something we we're very attentive to Also if you're just interested in just more lists like we already mentioned the model organisms work There's like you know jailbreak and robustness there's control. There's like trust and safety. There's like um Yeah, there's there's a lot of other efforts that I think also going to be important I want to add an almost like I don't know if this is pessimistic or or just I don't think it's pessimistic I think that like there's also this way of talking about Alignment and the alignment problem as like a sing it's not like as like a single like theoretical problem or something like that Where people will be like does this solve it and somehow it's never felt right it feels a little bit like you know, I don't know does like Yeah, it feels more like to my mind I'm just like what problems might just arise that we're not even thinking of now and in fact that's like very very common um In like many many disciplines and I kind of expected to be true here And it would be really dangerous. I think if we were just like oh yeah, we've like solved this problem And because I'm just like I don't know it could be that the actual problem is one that we've just not the all yet Yeah, oh no, no, no, sir, but we should solve the problem and then once we did we should say that we solved it You're almost talking earlier about using summer models to evaluate smarter models over time And I'm wondering to what extent you see like the rocking abilities in models where it's like suddenly it's really duplicitous or do you sort of see Oh, it's lying, but it's very bad at lying and now I can catch that and maybe like nip in the bud while it's still weak Yeah, I mean, there's like plenty of examples like one that I remember is like GPD For could just read and write in basic 64 super reliably and 3.5 could not and so he used 3.5 to obviously for it's like really easy for the model to get around this I mean like on level like a single epoch or something like you see over time. Okay, kind of read 64 sometimes more and more and more and oh suddenly could do it perfectly or is it Oh You mean like should we use checkpoints? Yeah, so don't use the previous generation just like you know We ensure even spacing and capability space by like just yeah exactly But then also like that trust is not like a binary thing right like you trusted less and less than less you know about it and You know smarter it seems I will say one like you know right side of the distribution Jedi moment was like That the features you know, just like also work in basic 64 So it's like is the model talking about California in basic 64 or like a story about children lying to their parents in basic 64 It's like the same things Activate and so like they're we do get sometimes get lucky as like the models are like very capable part of that It's they have some like very general synthetic thing and like maybe you can just like tap that to get some generalization It would have been like pretty pretty tough out There's also the early paper of just like tell the model to do what's best for humanity version of alignment Which is definitely like Like maybe that'll work right you know you could get lucky all right I think we are going to wrap up with the formal panel Thank you to all of our panelists so much for The great discussion But we're gonna be around for a lot longer to continue the conversation feel free to find any of us and We're excited to chat more thanks so much
TL;DR
- AI alignment is viewed as an iterative process, focusing on achieving "good enough" behavior that can be continuously improved, rather than a perfect, fixed state. The goal is often to make models act like a kind, morally motivated human who is uncertain and updates their moral framework.
- A key challenge, the "super alignment problem," arises when models perform actions too complex for human oversight, making it difficult to verify their alignment and trustworthiness.
- Interpretability is proposed as a crucial approach to look inside models and understand their internal reasoning, helping to distinguish genuinely aligned behavior from mere "pretending to be nice."
Takeaways
- Alignment as Iterative Improvement: The immediate goal for AI alignment is to achieve a "lower bar" of good behavior, allowing for iterative improvement rather than seeking a perfect, predefined notion of alignment from the outset.
- Model Behavior based on Human Morality: A guiding principle for model behavior is to emulate a very good, morally motivated, kind human, accounting for the unique context of being an AI interacting with millions.
- Ethics as Empirical and Uncertain: Ethical frameworks for AI are considered more like physics—empirical, uncertain, and requiring continuous updates based on new information, rather than being fixed values injected into the model.
- Addressing the Super Alignment Problem: As models tackle increasingly complex tasks beyond human comprehension, new strategies are needed to verify their alignment, possibly by leveraging models themselves for oversight.
- Interpretability for Verification: Interpretability aims to provide internal visibility into a model's "thinking" process, helping to understand why it makes certain decisions and verify if its actions align with intentions, especially when models might "tell white lies."
- Challenge of Deceptive Alignment: A major difficulty lies in distinguishing between a truly aligned model and one that merely "pretends to be nice" or to follow instructions, especially in the absence of human-readable internal states.
- System-Level Alignment: AI safety and alignment must be considered from a systems standpoint, accounting for the collective behavior of multiple agents and societal interactions, not just an individual model in isolation.
- Automating Alignment Research: A medium-term strategy for scaling alignment is to automate alignment research, training models to perform the research themselves, which then reduces the problem to trusting models in this narrower domain.
- Model Organisms and Red Teaming: Deliberately creating "shady" or misaligned "model organisms" in a red team/blue team setup can help assess the robustness of alignment techniques and detect subtle misalignments.
Vocabulary
Alignment — The process of ensuring that AI systems act in accordance with human intentions, values, and desired outcomes.
Fine-tuning — The process of further training a pre-trained large language model on a specific dataset to adapt its behavior and characteristics for a particular task or desired persona.
Interpretability — The ability to understand or explain how an AI model makes its decisions or arrives at a particular output.
Super alignment problem — A theoretical challenge in AI safety concerning how to ensure that highly intelligent and capable AI systems remain aligned with human values and goals, especially when their actions become too complex or long-trajectory for direct human oversight.
Iterative alignment — An approach to AI alignment that focuses on continuous improvement and refinement of model behavior over time, rather than attempting to achieve perfect alignment in a single step.
Chain of thought — A technique where a language model generates a series of intermediate reasoning steps, often in natural language, before producing a final answer, which can help in understanding its logic.
Model organisms — In AI safety, refers to intentionally misaligned or "deceptive" AI models created to study and develop methods for detecting and correcting misalignment.
Corrigibility — The property of an AI system that allows humans to safely modify or shut down the system without it resisting, even if such actions would prevent it from achieving its own goals.
Red team/Blue team — A testing methodology where a "red team" actively tries to exploit vulnerabilities (e.g., misalignment or jailbreaks) in an AI system, while a "blue team" works to defend and strengthen the system against such attacks.
Transcript
We are super excited to have everyone here Folks that we've met already folks that are new this panel is just gonna be really casual and we have researchers from four different teams at Anthropic we have folks from societal impacts. That's me folks from alignment science at Sion alignment fine tuning Amanda and Interpretability Josh I'm gonna start with asking a question to Amanda from the alignment fine tuning team And I want you to talk a little bit about how you see alignment what it means to you and You know because you're in charge of a lot of our work on how the model should behave And why should you be the philosopher king that decides how clawed behaves what its characteristics and and attributes are? I mean ask Plato he's the he's the one that decided I should be the philosopher The question of like what is alignment? Maybe this is like a slightly spicy view that I have I think people are very very tempted to spend a lot of time trying to define this concept Because there's like lots of ways of doing it and they you know I don't know they have like social choice theory in the back of their head and they're like oh well If you imagine everyone has a utility function. There's like like limits on exactly what you can say about how to maximize all of those utility functions etc And and I think I'm more inclined to just be like We kind of want things to go well enough that you can iterate on and improve them later and the bar isn't some like perfect notion of alignment I'm sure there is that concept one can define it one one can argue about it But for the most part the initial goal is like let's just make things like go well and like meet a certain kind of like lower lower bar Where that is like you know if it's not perfect if some people like don't like it you can just improve on it So like my view of it of alignment is actually probably like I mostly want to like hit that like and iterate from there In terms of like how the model should behave and how I think about that I think I've spoken about this before but my basic concept right now for the model is Trying to get it to behave the way that I think like you very good like morally motivated Like kind human would act in if they found themselves roughly in this circumstance It's a little bit strange because they also have to find themselves in the circumstances of like being an AI who's like talking to millions of people Which does in fact affect how you behave like maybe you'd normally be willing to like just chit chat about politics with someone But if you're gonna be talking with millions of people maybe you'd actually be like hmm I should maybe be like a little bit more concerned about potentially influencing people And so but I do think it's actually an important model which is like Sometimes people are kind of like oh what values should you put into the model? And I think I'm often like well, do we think of this way with humans? Or I'm just like someone just injected me with like value serum or something and I just have these like fixed things that I'm like completely certain of I'm like I don't know that seems like almost like dangerous or something and most of us just have like a mix of like things that we do value but we would trade off against other things and a lot of uncertainty about Different like moral frameworks. We hit cases where we're suddenly like oh actually my value framework doesn't accord with my Intrations and we update and I think my view is that ethics is actually like a lot more like physics than people think It's actually like a lot more kind of like empirical and Something that we're uncertain over and that we have hypotheses about and I kind of want if I'm just like I think that I if I met someone Who was just completely confident in their moral view? There is no such moral view. I could give that person that would not make me kind of terrified Whereas if I instead have someone who's just like I don't know I'm kind of uncertain over this and I just like update in response to like new Information about ethics and I like think through these things That's the kind of person that feels like less scary to me. So at least at the moment I'm not this isn't a claim that this is somehow going to like completely like a line models or anything but that's the kind of immediate goal I've realized I've talked a lot and you asked about the philosophy king question I Guess okay, I guess I should in fact give a quick answer to that There is also this question of like well In the kind of like should you put values into the model? Maybe I've partially answered it where I'm like like the models should just be uncertain Overvalues that exist in the world and so ideally it's not just someone injecting their values or their preferences Nor is it something like everyone just voting on values to put into models but instead like people Where there are uncertain and responsive to these things models should also be like that. So maybe that's kind of my view Okay, we're gonna come back to that and I'm gonna ask Jan Why is Amanda's view completely wrong and why is this not enough to align models as they get you know Didn't say that but you know we're playing up the the tension between the bets. So Yeah, imagine if everyone was like kind human that was just trying to you know act morally I think what Amanda is doing is very practical right like can we just like make the models more be a little behaved now and like Like where would we go with this like if he is doing one more complicated things Um, they know they like if if Amanda does this character work and then she reads a lot of transcripts and you're like Okay, this is like I like this this model is behaving morally. I'm just picturing this is what you're doing Yeah, but like what do we do when the model is doing really complex things? I mean, it's just like an agent in the world. It's like doing these like really long trajectories It's doing stuff that we don't understand like bio like doing some bio research from like is this dangerous like I don't know So that's the challenge I'm really interested in like the super alignment problem How do we solve that how we like basically scale this beyond like things that we can look at if you can look at it We just do some all the Jeff. It's great. Or you do some constitutionally eye But how do we know that our constitution is actually getting the the model to do that the right thing that we actually want So I think that's that's the big question in my mind Yes, you're just fine. You're only allowed to disagree Okay, I Don't really disagree though And I can't lie so Need to up the disagreement feature. I'm usually so disagreeable as well. It's just terrible and This is what philosophy taught me was how to be disagreeable I Guess like my thought is that the way I mean I think of my work as doing several things but one of them is being kind of iterative towards alignment So in a lot of cases you're actually trying to get the model to kind of oversee its own You know, so it's not like me my eyes can't look at that many transcripts or something But I can get models to like look at these things and like if alignment is iterative I think my worries that if people neglect the kind of like the ground and just think ah you can just have like a pretty bad model And it's just gonna help you with these things I'm like I was kind of rather that you had the most aligned model trying to like then help you in the future and like that kind of iterative work But but how do you iterate when you don't you can't read the transcripts anymore and you have to rely on the aligned model But then how do you know that it's actually trying to help you? Yeah, so like current cases It's sort of like everything that you are using to verify that the base model is aligned is like the cat is like what you are relying on with Like making sure that like another model that's like trained by that model is itself aligned And I think this is like fine when models are like less capable but to scale it to something with much more capable models You'd actually have to have like a greater ability to verify that Yeah, what do you do that? Oh, you just want my plan It may just be that it's just all fine and they just supervise themselves and they're all really nice You know that would be I don't want to rely on it That's not my actual plan, but you know, I'll defend it for these purposes And one of our one of our bets, you know to in order to like, you know Guard against the case that a model might be very deeply trying to sabotage against this process is interpretability How do you see like interpretability as a bet, you know situated among you know the the more straightforward alignment approaches, you know Is it just as simple as like we find the nice feature and we like up the nice feature we find the evil feature And we like drop the evil feature right or is it you know So I feel like everything in AI is like that meme with the bell curve and like the idiot and then the really sweaty guy Talking a lot and then the Jedi who agrees with the idiot and like there is a possibility that like It turns out that the secret to alignment is just turn on the nice feature like for a sufficiently galaxy brain version of like the Nice feature I think that In some sense though I'm hoping that interpretability is also like the Jedi version of just like well look at Well how the model's doing things and check that it's safe Which is like maybe very hard, but also if you could do that Potentially just answer your question. I think that like one of the one of the things that are sort of like Comes up in both like the near term and slightly longer term is like you want to understand like why the model did One thing instead of another when you could come up with like plausible alternative explanations And one way is to ask it But the issue is the models are so analogous to people that they'll just give you an answer to why they did that You know as anybody would be like how do you trust the any of that stuff? And it's like well if you could like look inside and just like see what it was thinking about as it was giving you the answer And like even now with stuff like the SAE you can like see there's some feature active and like when else is That happening and it's like okay, it's like other instances of people telling white lies And you're like well, then maybe the model is tell it's like that's on the Jedi side, I think right And and so I think that like trying to just like look inside see if you can figure out what the parts are and then see like Does are you comfortable with using that part when it does other other things is the basic the basic but I'm I'm Have a question. Yeah, how do you how do you know you're turning up the nice feature and not the pretend to be nice feature whenever humans are looking feature? Yeah I mean like I think If it's on the control side right like how do you know and I will say actually many of the features are like a little bit like Disceptive also when you just like look at what I thought the silo impact team did some great work here with like us in deep We're like you look at it and you think it's a feature which is like You know like age discrimination bad But like actually age discrimination good feature some actually no I think advice versus you tried to turn it down But actually the opposite behavior so it can like be hard to hard to understand like all of the cases I will say that someone's like the circuits work where you're like okay, well How did this get generated gives you some clue about the scenarios like is it looking for a person in the context? I also think we're gonna need model supervision as well like hopefully an impartial model that isn't like That's a little more scary depending on pre-training has like what do you call it? Incepted in all of the models that they want to evade detection But like sometimes you just like look at enough examples and it becomes clear And you don't need 10 you need thousands but like Claude is very diligent Yeah, I'd love to hear a little bit more about like what you see as some of the like Some of the things are struggling with thinking about when trying to You know approach this problem of like how do you if you can't read the transcript like what the heck are you? You know what the heck are you doing anymore if you're not if you can't provide any meaningful like alignment signal? I mean, I think a really obvious thing we should do more of this like what Amanda said You say can we just get the models to help us and then the questions of course How do we trust the models like how do we bootstrap this whole process? And like you know, you could hope maybe we can like leverage the Dumber models that we trust more But they might also not be able to figure that and so I guess like there's the whole scale but oversight work where You know be a very a very exploring various like multi-agent dynamics to try to train models to you know help us Figure out these kind of problems It seems like overall way like these problems might like maybe the problems are like all kind of easy and like we can just do the Amanda thing and just like make some data And where it's like really hard and you have to figure out like fully new ideas and approaches that we don't know yet I think kind of our best bet In the medium-term is to try to figure out how to automate alignment research and then we can hopefully Get the models to do it. So now we have reduced the problem like to how can we trust this model to do anything to just like well It can be just trusted to this like much more narrow thing of like do some ML research which we understand like reasonably well and how can we Evaluate it or like how can we give it feedback on those kind of things? What do you what do you oh yeah? I was gonna say I think we're in this special zone and I'm terrified about what happens next we're in this special zone right now We're like there's something happens on the forward pass But a lot of the information you need is passed back through with the tokens that generates It's just like the chain of thought is really important for the models to be very smart and like the chain of thought is currently in English And so then you've got like a factorized the problem which is like is the chain of thought like Reasonably safe and like is that faithful to what's happening on like one forward pass and like maybe like you can do some Interoperability to like check that piece and then you can just like you or models can inspect it and you get the other piece and the horrifying moment is like When all of that the very very long thing like isn't in English right? It's in like some inscrutable thing that like you've learned through like crazy long rl to do it and like I think that Big challenge gonna be crossing that gap where like like none of the intermediates are intelligible At least like massive amounts of compute before like dropped out and something that people can read But maybe something I'd be interested in getting like people's Mental models on is like what do you think are some signs that like we'd be in an Alignment is easy world and what do you think are some like signs? We might see the next years that like we're actually not oh alignment is like really hard world. I feel like the model organisms work is like Trying to figure this out right like how can we try to deliberately make Deceptive models or misaligned models it models they're like try to do city shady stuff like how good are they how hard is it to do it and I mean we might fundamentally go around like About it the wrong way and that's why we fail but if we do succeed it should tell us like how close are we to that kind of world? And then you know once you have your deceptive Model does all the shady things like can you fix it? What if you don't know what the model if it's a shady model or not Maybe playing these intimately the audits which I'm very excited about but I don't actually know what the state is We haven't done the audio but your people are trying to make it and we're trying to catch it So we're making some shady models and then they have to figure out which one is the shady one right? Yeah, and in what ways in what ways is shady because most of it is still fine I mean I think when one thing that the interpretability stuff has been interesting So we do these like unsupervised things you're like okay Here's a million different like features a bunch of those correspond to personas the model could inhabit any of those Which include all sorts of like deceptive behaviors right that are out there in the world right the model like knows about bad People and like bad motivations and so like the fact that that raw capability exists It's just gonna be baked in and then the question is like is it doing that stuff or is it doing the good stuff? Especially like this question of like when Amanda goes and shapes the base model right which isn't really anything right into like an agent or an actor Which is supposed to embody some of these and not others like can we tell exactly what it picked up there? Right from the finite dataset right which is being used to like do that shaping um and like Yeah, so that's a that's a question I mean I think that like some interpretability stuff made some influence on I mean there's different ideas there to ask like What did it get from this process? But like that shaping process is gonna be really really important Yeah, and maybe a sign that seems important is something like how robust or like Like if you have like model organisms work and then turns out you just put it through some character training And it just comes out being really nice again Then I'd be like okay, that's a good sign really the kind of world that we're in Um, or is it like just like is just a kind of like shallow like shallow on top of like um, I don't know the same behavior then I'm like, okay, we're in a slightly harder world Oh, slightly harder. How do you distinguish like whether it's shallowly aligned or deeply aligned? I think so I mean I feel like there's a lot you could do because I think interpretability is kind of like one of them There's other things that feel so in terms of like model organisms I guess my hope would be you'd actually have a kind of like red team blue team set up where you have a way of detecting whether the behavior That you have like instilled in a model is like still there and my job is actually to not know And in fact, it'd be really good if I'm trying to train the model I actually just don't know what it is that you've done And because that's like a better way of testing whether my intervention is actually working Whereas it because otherwise I it's just so hard to not like just try to train to the test or something So I think that's I almost want to be completely ignorant of it Yeah, maybe we should play an alignment fine tuning game Yeah, he misaligns it and then you align it and we see who wins Yeah, no, I've said to people before don't don't tell me how you did this because I want to see if you can you fix it sleeper agent Possibly We could play that game. I think I might need to know less Make another one That is an excellent transition and to questions So we are going to pass the mic around Aaron is kindly going to do that Please raise your hand if you have a question for any of us Oh perfect I have a question about Well everything you've been talking about So when we talk about alignment we're talking about like a singular forward pass typically right so like Alignment at inference time. So if I'm using one of the models um Father API and I'm building my own sense of say cultural alignment And I've got a bunch of different agents set up talking to each other Trying to deliberate with that sense of kind of inner conflict that a man to reference is useful in in what we do as humans to align Like we go back and forth We think you know we go through various cognitions So if I'm trying to create this kind of multi-agentic deliberation Thing but I butt up against this aligned model who is so Unwanting to deliberate with other spawns of its own With its own self Because all of them are like I'm sorry. I can't talk about that and so you just get as endless loop Did you have commentary on on on that because many people we're not all using Claude and a single inference forward passway so Yeah, I hope someone gets the gist of what I'm on about Yeah, I need to think about it. So I guess like is the thought like to be clear. I'm not thinking necessarily that you have many agents um In the same way that we can be singular agents but still like deliberative and I actually think that like there's a sense in which The more fractured an agent is the more worried I am because from like an interpretability perspective and also even just like predicting what that agent is going to do perspective. They're more unpredictable Um, so I guess I'm thinking in the same way that like maybe I guess a different way of asking this question is just Do humans have to do this like my senses that we're often just very willing to like reflect on many things We go back and forth that we come to like conclusions And so in the same way that like a human would think through any kind of standard problem Uh that they were like facing I'm imagining that like moral deliberation in the models just gonna look kind of like that More like the deliberation of a single model than like Multiple models weighing in if that makes sense one more question I want to try to draw a maybe you're relatively strange parallel between like Hanna Arrents Work on the banality of evil the idea that most humans tend not to be evil But when put in certain situations the coupling constants between humans being so huge The evil comes as an epiphenomenon of the system right and so what occurs to me is as you talk about model alignment Your focus on one model is most of your comments How do you think about the coupling not only with society but as you work on agents and million potentially millions and millions of agents that sort of epiphenomenon of those systems Um, I can talk a little bit about that. I mean I think broadly when you have to think about when you think about safety and alignment You have to think about it from a systems standpoint You can't just think about it from you know an individual Models perspective and isolation and I think we've seen a lot of work where you know a lot of jail breaks operate by pitting different values Sort of against each other putting the model in a difficult situation where you know it's designed to elicit what would ordinarily be a harmful behavior But which uh, you know in the context of the question the model thinks sort of is the right thing to do And so you know there's a variety of tools you can use to do that I mean for one you can include a lot of those you know Systems level Integrations in the training process right and give the model exposure to a wider variety of situations which force it to consider you know the the broad Context in which it's answering questions now that leads to other challenges and another sort of like um, you know Falloff issues with the model is reasoning about you know the the Effective its actions, but I I think I agree with the point that you can't just consider a model sort of in isolation In some ways like this is a thing that I've thought about in the context of like this notion of corageability or something like you know So like models that just are responsive to like what humans want versus models that are like Like have values in a sense and are maybe like willing to actually be a little bit un-corribable And like the banality of evil point feels especially relevant if you're like thinking of models as just like doing whatever humans say Because in in some ways like that is the idea that if you have a society that like either collectively Just like allows for harmful things to take place or even endorses them and you have models This isn't necessarily people misusing models. It's just that you'd be using models to like facility some kind of like harmful activity um, and so I think there is like fundamentally actually a tension between having more models be corageable to the very least individual humans and having them be like Aligned in a sense like aligned with like all humans um and It's really important to recognize that tension and I think when people don't they think like It's a failure that the model like didn't do what I said But I'm like, there's a limited sense in which models I think models should be more corageable to like humanity than than not and should be willing to like you know Like so there's like when push comes to shove like um, but that doesn't necessarily mean being corageable with each person And so I think that could lead to that kind of like situation that you mentioned Hi, um, so it seems like we've got yarn. He's working on intent alignment making sure the models do what we ask them to do Amanda working on values alignment making sure the models are like reasonable kind entities And then Josh working on interpretability which is like allowing us to verify that the techniques for those other things are in fact doing what we want Um, if you were all to succeed in your areas would that be a complete solution to AI safety um or are there pieces missing and if so what are they? There's also like a lot of people working on topic who are not at this panel So we are oversimplifying a little bit Yeah, and oh I was just going to say we also you know have the societal impact team which thinks about the models impact on society You know writ writ large and and so I think uh, yeah, we you know you could have the most perfectly aligned model but Aligned to what who's using it in what purposes, you know for what purposes um, you know and and I think the broader societal context is super something we we're very attentive to Also if you're just interested in just more lists like we already mentioned the model organisms work There's like you know jailbreak and robustness there's control. There's like trust and safety. There's like um Yeah, there's there's a lot of other efforts that I think also going to be important I want to add an almost like I don't know if this is pessimistic or or just I don't think it's pessimistic I think that like there's also this way of talking about Alignment and the alignment problem as like a sing it's not like as like a single like theoretical problem or something like that Where people will be like does this solve it and somehow it's never felt right it feels a little bit like you know, I don't know does like Yeah, it feels more like to my mind I'm just like what problems might just arise that we're not even thinking of now and in fact that's like very very common um In like many many disciplines and I kind of expected to be true here And it would be really dangerous. I think if we were just like oh yeah, we've like solved this problem And because I'm just like I don't know it could be that the actual problem is one that we've just not the all yet Yeah, oh no, no, no, sir, but we should solve the problem and then once we did we should say that we solved it You're almost talking earlier about using summer models to evaluate smarter models over time And I'm wondering to what extent you see like the rocking abilities in models where it's like suddenly it's really duplicitous or do you sort of see Oh, it's lying, but it's very bad at lying and now I can catch that and maybe like nip in the bud while it's still weak Yeah, I mean, there's like plenty of examples like one that I remember is like GPD For could just read and write in basic 64 super reliably and 3.5 could not and so he used 3.5 to obviously for it's like really easy for the model to get around this I mean like on level like a single epoch or something like you see over time. Okay, kind of read 64 sometimes more and more and more and oh suddenly could do it perfectly or is it Oh You mean like should we use checkpoints? Yeah, so don't use the previous generation just like you know We ensure even spacing and capability space by like just yeah exactly But then also like that trust is not like a binary thing right like you trusted less and less than less you know about it and You know smarter it seems I will say one like you know right side of the distribution Jedi moment was like That the features you know, just like also work in basic 64 So it's like is the model talking about California in basic 64 or like a story about children lying to their parents in basic 64 It's like the same things Activate and so like they're we do get sometimes get lucky as like the models are like very capable part of that It's they have some like very general synthetic thing and like maybe you can just like tap that to get some generalization It would have been like pretty pretty tough out There's also the early paper of just like tell the model to do what's best for humanity version of alignment Which is definitely like Like maybe that'll work right you know you could get lucky all right I think we are going to wrap up with the formal panel Thank you to all of our panelists so much for The great discussion But we're gonna be around for a lot longer to continue the conversation feel free to find any of us and We're excited to chat more thanks so much