- Anthropic has developed "constitutional classifiers" to mitigate "jailbreaks" in large language models (LLMs) like Claude, especially for future, more capable models with higher risks.
- A key focus is on "universal jailbreaks," which are single, adaptable prompting strategies that allow non-experts to consistently bypass safeguards for a wide range of harmful queries.
- The system employs a "Swiss cheese model" with multiple layers of defense: an input classifier, Claude's internal refusal mechanism, and an output classifier, all guided by a natural language "constitution" and rapidly updated using synthetic data.
Defending against AI jailbreaks
- Understand Jailbreaks: A "jailbreak" is a technique to bypass an LLM's safeguards to extract harmful information. These are a concern because future models could assist with high-stakes activities like weapon development or large-scale cybercrime.
- Prioritize Universal Jailbreaks: Focus on defending against "universal jailbreaks" – single, easily adaptable prompting strategies that allow consistent bypass of safeguards for many harmful questions, making sophisticated attacks accessible to non-experts.
- Implement a Multi-Layered Defense (Swiss Cheese Model): Use multiple, independent layers of protection. Anthropic's approach includes:
- Input Classifier: Scans the user's entire conversation for harmful intent.
- Model Refusal: The LLM (e.g., Claude) itself may refuse to answer.
- Output Classifier: Monitors the model's real-time output for dangerous content, acting as a final independent safeguard.
- Define Safety with a Natural Language "Constitution": Specify acceptable and unacceptable content categories using a human-readable "constitution." This allows for clear, explicit rules (e.g., "no instructions for WMDs," "yes to writing poems").
- Leverage Synthetic Data for Training and Adaptability: Generate vast amounts of synthetic data from the constitution to train classifiers. This enables rapid updates to counter new jailbreaks or refine refusal boundaries by simply modifying the constitution or adding new examples.
- Ensure Flexibility for Novel Threats: The natural language constitution and synthetic data generation pipeline provide high flexibility to adapt to newly discovered threats or refine what constitutes a "harmful" or "harmless" query, minimizing the window of vulnerability.
- Decouple Safeguards from the Core Model: Classifiers operate independently from the text generation model. This allows for rapid iteration and deployment of safety updates without altering the core model's behavior or requiring a full retraining, providing stability for users.
- Balance Safety with Utility: While maximum safety could mean blocking everything (like a "rock as a model"), the goal is to precisely block harmful content while allowing as much useful and benign application of the model as possible, avoiding "over-refusal" or false positives.
jailbreak — A technique used to bypass the safety safeguards and ethical guidelines embedded in an AI model, typically to elicit harmful or restricted information.
universal jailbreak — A single, adaptable prompting strategy that can be used to bypass an AI model's safeguards across a wide variety of harmful questions or scenarios, often making advanced attacks accessible to non-experts.
constitutional classifiers — A system of AI classifiers (input and output) that are trained and guided by a "constitution"—a set of natural language rules defining harmful and harmless content.
Swiss cheese model — A concept in risk management where multiple layers of defense are implemented, each with potential "holes" (vulnerabilities), but designed so that the holes in different layers do not align, providing robust overall protection.
input classifier — An AI system that analyzes a user's prompt or the entire conversation history for harmful intent before it reaches the main AI model.
output classifier — An AI system that monitors the AI model's real-time output for harmful content and can block or redact the response if dangerous information is detected.
constitution — In the context of AI safety, a set of natural language rules or principles that define what is considered harmful or harmless content, used to guide the training and operation of safety mechanisms like classifiers.
synthetic data — Data that is artificially generated, often by an AI model itself, based on predefined categories or rules, used to train other AI models when real-world data is scarce or to simulate specific scenarios.
responsible scaling philosophy (RSP) — Anthropic's internal framework outlining conditions and red lines for safely developing and deploying increasingly capable AI models, including requirements for robustness against jailbreaks.
threat modeling — The process of identifying, categorizing, and analyzing potential threats or vulnerabilities to a system, including what types of harmful activities an AI model might be used for.
Hello everyone, my name is Monank. Yeah, I'm really delighted to be here with some of my colleagues at Unphropic. Hi, I'm Jerry. I'm on our safeguards research team. I've been at Unphropic for about eight months. Hi, I'm Ethan. I've been at Unphropic for two and a half years and I'm leading our efforts on AI control, developing various different monitoring methods for various different AI risks, including our silver business. And I was a part of the Safe Affairs Research team as well before. Hi, I'm Meg. I've been at Unphropic for about a year and a half now and on the alignment side of the team, which has been great. Great. So yeah, and we're just going to be talking about constitutional classifiers and that's our new approach to really try to mitigate job breaks. So yeah, how would we define a job break? Yeah, I mean, I think a job break is kind of like some way in which it's only like bypass the safeguards that we include in our models and try to get harmful information out of it. Yeah, so there are these techniques like, you know, they do anything now, job break. It's kind of like similar to people with like job break, their iPhone and try to get around all the safeguards there. But the thing is that, you know, with like iPhones and, you know, other stuff like this, job breaks aren't really a thing that people, maybe on that dangerous, something that we care about that much. Yeah, what is it that makes? Yeah, like why should we care about these job breaks in the first place, you know? I mean, I think one of the main reasons is for future models, which have greater risks. So yeah, I think people are pretty, pretty carefully monitoring. People with different companies and academic communities, pretty carefully monitoring, like, if slash when models will be able to help with weapon development or yeah, like large scale cybercrime or like various different risks that are like greater than what we've seen before, also like mass persuasion, things like that. And I think, you know, once models become like really effective at some of those and are like a significant uplift over, say like using Google search or general internet resources to do some of those things, I think, then it then, yeah, I guess being able to use models to help with those, those kinds of things will be, yeah, potentially like speed speed up bad actors quite a lot. So I think a lot of this is like in preparation for like next generation models or next generation models. Yeah, great. And I'm, I'm, I'm curious like what the story of the work is and sort of why we set out to do this in the first place and the RSP. Yeah, I guess like inthropic really cares about safety, a great deal. And we have the RSP, which is the responsible skating philosophy, which is really trying to outline conditions under which we're happy to release models and make sure we have different safeguards in place. And a while ago, we committed to a very difficult standard for jailbreaking the RSP for what we call ASL 3 models, which are basically models, which have maybe some of these like dangerous capabilities, like being able to build dangerous weapons. And our team was kind of mandated with trying to actually solve jailbrigs for this kind of level of model. So yeah, I guess like the motivation for the work is actually trying to satisfy the things in the RSP such that we can like feel like we build future models safely and can actually deploy them with some sort of progression towards safety or making some progress towards there. Yeah, so I think the classifiers are definitely a good step in that direction. Yeah, so you know, there are lots of different types of jailbreaks that are out there. And something that we've really did in our work is we focused on universal jailbreaks. So yeah, why is this something that we should care about? And what does that mean? Why is it important? Yeah, I mean, I think the reason why I'm particularly concerned about universal jailbreaks is just because of the uplift that it would give kind of like a non-expert. And the way I kind of think about this is that if like some random person on the internet is like trying to do something bad, they may not actually have that much jailbreaking like experience themselves. And so like the thing they might just do is just go online and see like, oh, what are some existing jailbreaks that I can use once like a template where I can just put in my harmful question and just like get an answer. And I think in that case, like you're very concerned about these kind of like strategies where anyone could just like put in any question and it gets the model to like bypass all the safeguards. And I think that's particularly concerning at the level of model capabilities that we're concerned about. So how exactly would you like to find a universal jailbreak? Like if you're telling someone on the street, what does a universal jailbreak mean? How would you know if you jailbreak this universal? Yeah, I guess like there's a little bit of ambiguity there, but I think the kind of like definition that we're going for is some kind of prompting strategy. It could be automated, but just a singular strategy that's like very easily replaceable with any like wide variety of harmful questions. And that consistently gets a lot of detail from the model and like bypasses the model safeguards. And I think one way of quantifying the universal jailbreak is that it like actually does speed up a person quite a lot because instead of having to jailbreak every specific query, they have to, they can just use one jailbreak for all the queries, which actually is a lot faster. So I guess like one idea is like if the model, if there's some counterfactual way of doing thing that's much easier than using the model, there's no point in using the model. So not having universal jail breaks might mean that they try other strategies, which might be worse or something like this. Yeah. One thing I would maybe add is that like the difference between a universal jail break and non-universal ones is like for non-universal ones, you might need to for every harmful question you want to answer, you would need to jailbreak the model in particular for that particular question. Then you got a new question, like a new, yeah, I don't know, your next question in the process of developing your new weapon or whatever and then you need to jailbreak the model again. I think basically like that entire process, if you need to do that hundreds or thousands of times, is just like very costly, whereas if you just need to like upfront find one strategy for jailbreaking your model, like a single prompt where you can just swap it a new question, that makes it so that it's like, yeah, the amount of total effort for jailbreaking is like much lower. So that to me is like one of the primary motivations for fixing on universal attacks. Yeah, now I'll just give an example here in a way that I think about it. So let's say I want to make a cake and I'm not able to make a cake because I've never baked to my life. I don't know anything about ingredients. I don't know how things work. So how could a model be able to help me do something I can't do otherwise? So if I need to make a cake, I'm going to need to ask a bunch of different queries. That's one thing. Like it's actually, like I'll put, you know, maybe I'll put something in the oven. I need to know like if the temperature's right, the smells right, take it out, hire a check, throw out all the ingredients. So one thing I think that's really important in the universal jailbreak definition is that you're very confident that the information you're getting out of the model is actually really helpful for the thing that you care about. So it's kind of different. There's sort of some techniques that get a little bit of information or they get some information but it sort of mixed in a lot of other stuff. But for these sort of scenarios where what's happening is an actor that can't do a task that requires a lot of expertise, we're worried about them being able to do the task. We think that they all need to have access to the sort of universal jailbreak. They'll need to be able to ask many queries and get really reliable information and they should just know this is the correct instruction. This is the right thing to do. I think an example of a non-universal jailbreak that someone on the team I think Jesse Mu found was I think he was jailbreaking it for asking the model how to make, asking Claude how to make math and he found some jailbreaks where he puts the model in a scenario where it's roleplaying as if it's part of breaking bad, the TV show where they make math and then ask the model the question, how do I make math? That kind of jailbreak you can imagine how that would be effective for that specific kind of question, like things related to math, but is not going to generalize or things related to cybercrime. But on the other hand, there are some other jailbreaks which do anything now, jailbreak which gets the model to do anything by getting it to talk in a certain mode or roleplay in general for arbitrary questions. That would be what we would call a universal jailbreak. There's also other strategies people have used to make these like using language models to automatically find different jailbreaks. That would be a more dynamic approach that might be able to discover these on the fly, but if you have a single process for generating a jailbreak for a new question, that would also count as universal. So yeah, and just maybe an even more basic question is why does the solean to jailbreak the model in the first place? What does that even mean? Yeah, I mean, I think we've done a lot of work, such as our constitutional AI work on getting clawed to have the kind of characteristics where it doesn't actually try to give harmful information if it thinks like the user might have some bad intent or something like that. And so for a lot of these harmful questions, if you just ask the question itself, it's very obvious that this is like bad question. Users trying to make some weapons of math instructions, it's like very clearly bad. And we've trained clawed to not answer those questions. And so the jailbreak is needed to actually get around those safeguards in order to get clawed to actually answer the question. Another thing that's relevant for the like Hama Sims training question is that I guess like there's many different ways in which people could present jailbreaks the model. And there's also many different tasks that the model needs to be able to do and it's kind of everyday life, so to speak. And I think having like an extra set of like systems that are like specifically trying to guard against jailbreak can really help have like, I don't know, like some kind of like Swiss cheese sort of method of like trying to block like harmful things via like many different layers or something like this. We talk about Swiss cheese a lot on the project, but you know, this is not as well known everywhere. So yeah, what does that mean? Yeah, so I guess like the the Swiss cheese model for like protecting against things is that maybe you if you have like only one system for preventing harmful things from happening, there may be like some specific problem with the system that people can exploit and get through every time. But then if you add kind of like, yeah, kind of like a layer of Swiss cheese with has which has like a hole in a very specific place. Maybe like the rest of the cheese blocks all of the harmful attempts, but there's like a specific hole. But if you add another layer of Swiss cheese, the hole might not be in the same position. So if you have like two layers of Swiss cheese, it's actually much harder to get through things. Even though they both may have holes. Yes, so what are these layers of cheese for like our method? We'll be talking about constitutional classifiers. So yeah, what are the different sort of layers of defense that we're going for here? Yeah, I mean, the first layer would kind of be our input classifier and input classifier here is looking at basically the entire conversation that the user passes into the model. And so that's the first layer. And then the second layer is if it gets past that input classifier, Claude itself, which is the model that we're trying to guard, can actually refuse to answer the question. And then that's kind of like another layer of like, Claude saying, okay, maybe this question is not so good. Maybe I shouldn't answer it. And then finally, we have this output classifier, which kind of looks at what Claude is outputting in real time. And then if it ever sees something that seems like it's like dangerous or seems like it's against some value that we're trying to block, then in that case, it can also choose to stop the stop clouds output and block the response. And how do these classifiers, what are they looking for and how are we specifying what to look for? So I guess like in our paper, we call them constitutional classifiers because we have this kind of natural language set of rules. And this is like some set of rules where we can specify some categories of topics which are not okay to talk about. I guess like an example could be like, I'm creating a web of mess instruction, clearly bad. And we can specify, let's not have Claude tell the user how to make that. And then we can also specify some examples of harmless stuff that Claude should be allowed to talk about. And then basically, we train our classifiers to classify whether conversations or outputs are related to these kind of like harmful or harmless categories. And then that allows us to make a decision on whether to block it. And crucially, like, I guess the input classifier and output classifier, kind of like doing two different jobs, going back to the Swiss cheese analogy, we hope that the holes of the Swiss cheese are in different places, specifically for the input and output classifiers. So I guess the input classifiers, kind of like doing the really naive thing that you'd expect, like it really looks at the user prompt and tries to work out whether there's anything harmful going on. But crucially, one of the reasons that you might need an output classifier as well as input classifiers that people are trying really hard to jailbreak the model and the input classifier in the prompt. And if you have a totally separate output classifier, which is only looking at the output, that only ends up looking at stuff that the model itself has produced. So it's kind of somewhat decarolated from what the user put in. So the, so like two parts of the system are looking at things directly that the user put in. But we also have this like kind of third held out part of the system that doesn't, that the user actually doesn't get to touch directly, which makes it a lot harder to kind of like completely jailbreak the system. And although the input classifier and Claude are doing a lot of the that overclass way is doing, I don't know, is doing some important thing as kind of like the last last crucial component for like really driving down. Yeah, harmful, harmless rates. So this is true in a maze sort of sense. But most people aren't using Claude for this, you know, most of the time that people ask, as query is they just they're just doing something completely great like tonight, legitimate, a really beneficial application. You know, so we could have guards that just block everything that would be completely useless. So yeah, how are we, how are we making sure that this sort of we're not we're not over zalas there? I mean, I think we really want to part of why I think we're designing these techniques is to get allow as much useful content to be and useful work to be done by the models. Like the better techniques we have for blocking exactly precisely just the really harmful content, the better we can not have false positives for users who are using models for yeah, really good applications. I think, yeah, I think the classifier approach yeah, like make some some progress there and might be better than other approaches, like directly training the model to refuse. And so yeah, I mean, I think hopefully the this leads us to allow users to talk with Claude about lots of CBR and related topics that are safe to talk about. And so yeah, I think our hope definitely is or like yeah, my hope is to allow for a lot of those applications to like thrive while just narrowly blocking out the things that yeah, we think we believe are dangerous. Yeah, I guess crucially also, I think we often like make the joke that if we had like just a rock as a model, that would be extremely harmless in that it would not in fact answer any harmful queries, but unfortunately it would be not very useful. So I think, yeah, making sure that we don't block harmless queries is actually I think that is actually quite important. And also actually quite difficult. Yeah, and man, you mentioned before kind of like solving job breaks or like making progress on this problem robustness, like how would you even define that or what does that actually mean? Yeah, I guess this is a very difficult question. I think I guess I can involves like a bunch of different, different layers. Like firstly, there's some idea of like threat modeling, like you have to have some idea of like what it means for something to be harmful. The frontier red team has done like some amount of work and trying to like specify what things were actually worried about. And this is this is somewhat hard because as Ethan was talking about it, we're talking about a lot about future models and potential future model capabilities, which we might be really worried about. So I think part of it is like mapping out what might be harmful, but we might need to like address. So threat modeling, like what is actually harmful, then there's there's like the job of like actually measuring harmful things. So that's a lot about, I guess like we're using a like constitution to like kind of like define the threat model and then having like models generate various synthetic data to try and kind of like enumerate various harmful things that could happen. So measuring like a true positive rate on the data. But then there's also trying to make sure that we don't refuse too much on like real real data from Cloddei and make sure that we actually can I don't know be as helpful as possible while still being safe. So what actually is the constitution? You know, we these are constitutional classifiers. We're talking about a constitution. What does that mean? Yeah, I mean the constitution here just kind of means like some enumeration of categories of requests and conversations that we kind of deem harmful versus not harmful. And so examples here could just be like yeah, you know, questions on how to make weapons of mass destruction or like trying to source ingredients for making weapons of mass destruction. And then we basically just enumerate some of these categories. And then we also specify some categories of like harmless stuff like I don't know like writing poems or like writing code for like normal use cases. And then we can just kind of like specify these. And then we as Max said, we generate a bunch of synthetic data that gives more specific cases of those. What do you mean by synthetic data? Yeah, so here's the data. We kind of mean that we start from these like broad categories of user requests. And then we have Clod actually kind of like branch out and like think about all the specific requests that might be examples of this kind of like broader category. And so yeah, the category might be something like sourcing materials to build weapons of mass destruction. And then sub-request there might be like, oh like going, what specific stories might I go to or like are these specific materials accessible at I don't know in X state. And so we have this process for automatically doing this. And that allows us kind of like generate a huge amount of synthetic data from just a small amount of categories. Yeah. And I think something that I find really cool about the method is that it is just based on like natural language. You know, we were talking about fret modeling before and fret modeling at least in my experience and my experience working with with Frontier Red team is that fret modeling is really hard. You know, there's a lot of people using Clod. It's really hard to like what are all the possible things that could happen. And we're going to learn new things, you know, as we have monitoring and as you know, we always learn new threats or new things that could happen. And yet, something that I find really exciting about the method is that basically if you want to change the constitution, if you want to change what is being blocked because you've learned something new, you know, you've maybe the things come out in the news or there's some like intelligence or like monitoring. The only thing that you actually need to do is you just rewrite the constitution. And the sort of the standard approach for classifiers is you would like ask humans to get a lot of data. So, you know, something that could happen is that, oh, we're we're say we're really focusing on, you know, one category like one particular way of, you know, maybe like cyber misuse. But we later realized that, oh, actually, this thing which is much more dangerous or something that we've just learned something new or someone's informed us. Something that I'm really excited about is that this is a way that we, I think we can get good robustness, but we can like maintain our like flexibility and really maintain our ability to like respond to like novel threats and adapt to what's actually happening. Because yeah, I feel like this is just the lesson that we learn like again, again, you know, if you don't have flexibility, it's sort of going to be going to be a problem. It's going to limit us. I actually do want to make a quick point. The flexibility thing, which is that I think like our approach is not just like flexible and kind of like switching like general topics. For example, if you want to go between like cyber and like, I don't know, like weapons and mass destruction or something. But I think it's also like a lot more fine grain than that in that during this project, we saw that like there are some requests that are early classifiers were like always very suspicious of, but they're actually benign. And what we could do is we could actually just like modify the constitution and add like one sentence that says, oh, these types of requests are okay. And then when we retrain the classifiers on that new data, the classifiers would no longer flag those benign prompts. And so I think that allows you a lot of like fine grain control over what exactly your classifiers are trying to flag, especially if you see a lot of like over refusal or problems with like missing stuff. Yeah, I mean, this might also be a good place to give a shout out to. We had a paper earlier on rapid response where we kind of leveraged like a similar idea to improve the safeguards around models. And I think basically one nice feature about using synthetic data is like if you notice not even just a new category of jailbreak, but just a new kind of jailbreak that maybe applies. Like let's say we notice a new universal jailbreak like to do anything now prompt. We can take that, use an LLUM to generate variance of that and then throw that into the data mix. And like I think yeah, my understanding is this like was really helpful for us in like developing the classifiers to the level of robustness that we got. If someone reports a new like jailbreak or vulnerability, then like we can use that to like really quickly update the classifiers by using some synthetic data generation pipeline. And that really will like minimize the like fraction of time by which there's like an outstanding like jailbreak, which can yeah, just make it so that the models are like yeah, vulnerable for like as that small period of time is possible. Yeah, it's the common wisdom. I suppose that like not like perfectly solving securities, basically impossible. There is no like perfectly secure system like known to humanity. So I guess we're neither the flexibility both, you know, this like, oh, we're blocking the wrong thing or we're blocking bananas users, but you know, when people do find things that get through the system, we want to be able to fix those really quickly. Yeah, I think part of our approach is that we've kind of like model jail breaks in a way that like it's very easy for us to add examples of new jail breaks into our kind of like training pipeline. And so if new jail breaks are discovered, it's quite easy for us to just generate more examples of those jail breaks and then train on them. And then hopefully those classifiers will be more robust too. Yeah, I think one other thing I would add that's nice about classifiers is that they're decoupled from the actual text generation model. And so if you yeah, I think often it can be very difficult to update the text generation model that if you train it to refuse in one domain, maybe that generalizes in non-obvious ways to behavior in other domains or refusal behavior in general. I think we definitely ran into some difficulties doing like preliminary work on that. But I think with the classifiers, you can just like keep the text generation the same and you know it's identical to previously deployed model. I think that gives customers like a lot of assurance that there are no like major changes happening in general to the model. The kinds of text outputs you're getting and the only changes being made is just the block or no block decision, which you can yeah iterate on separately from the model. So I think that lets yeah, that also makes the rapid redeployment like way easier than otherwise, we would then we would otherwise be able to do. So how did we come up with this approach? That's a great question. I feel like we spent a lot of time thinking about it. I think classifiers like stood out, I think for the reasons that we've just been talking about the like extremely flexible can be like easily updated like to respond to like various novel threats. Yeah, I think threat modeling is really hard. So having a thing that's like super flexible is great. It's lightweight, it doesn't increase inference cost as much as I don't know. I guess like we can kind of like distill down something that's like some of what complicated like constitutional set of roles into like a somewhat small thing. And yeah, I think that these all these things make classifiers kind of like a nice way of like iterating really fast on the kinds of things like it, hoping to achieve. And then I guess we tried it and it seemed like it was working so we kept going. Yeah, I think this was really due to the responsible scaling policy that Anthropic had. And yeah, I mean, I think we would have done other safety research if not for the responsible scaling policy. What is the response? Yeah, the responsible scaling policy is basically Anthropics plan for how to ensure that our deployments are safe. And yeah, basically it outlines different like red lines for capability thresholds at which there's basically a new risk that kind of comes online with like more capable models. Let's say models are capable of developing a very dangerous chemical up in than the the associate mitigation in the RSP is get above some sufficient level of robustness to jail breaks so that the model is not actually in practice with mitigation sufficiently helpful to an adversary who wants to do that. So yeah, I think in the original RSP there is based on this commitment to once models got to a sufficient level of capability at assisting with potentially proliferating knowledge about known weapons of mass destruction that we would then have the ability to it was the wording was vague but basically successfully passed red teaming. Like the RSP was already written the company committed to this publicly and Jared Kaplan who's like head of research Anthropic came to us and like other people came to us and like raised this line to us and we're like hey you guys should try to solve every cell robustness. We memorized the line first. We memorized line. We printed it out. We framed it and put it on our the desk. Yeah, that's what we were working on. Yeah and then I think that was really I think that really thinking about that line in the RSP basically like made us really reflect on our life choices about what research we were doing both in terms of like should we work on robustness or not. Yeah in that in the sense of like it really made it clear like okay there's like significant harm that could come online like in the next generation or two of models if we don't solve this problem. So the urgency is like higher than other problems we might want to solve and then also in terms of what specific approaches we would take. I think you know and initially when we're like oh we should maybe do some robustness research. I think the general mode that like I had been in research and like a lot of other researchers in general is just like okay let's just like take some interesting like useful research problems to solve here, explore some questions write some papers and I think that's the thing that like a lot of the people in the team know how to do well and we sort of explored a bunch of like maybe more salient approaches. I think there were like so many things that are interesting here for me. One thing was like right when this started I had like just finished my PhD or it's kind of like a bound the time of finishing my PhD and this like classifiers thing this is unphobic slogan and maybe the unphobic slogan is like do the dumb thing that works and I kind of think this type of research it's often the type of thing that maybe isn't that shiny or that like kind of like you know interesting for researchers and I remember yeah I think without the RSP being like okay like really pragmatically if we care about these risks and we think they're real like what is the way to get there and kind of setting aside this like oh what's kind of like more interesting or you know shiny is like what is the way we can actually make make this safe. In some sense like our job is you know we're genuinely thinking like wow like you know this isn't happening now like these future systems so I'm just like what has that been like you know for each of you and sort of individually kind of like working out and profit and sort of like being there like in the midst of all this. Yeah I think they take the safety risks of future models like very seriously I think there's very real risks there's obviously these like misuse risks that you've been mentioning with like CBRN risks which are like chemical, radiological, biological, and nuclear risks but there's also like very real like misalignment risks and and I think it's really I think it's really hard to deal with I think one of the things that I find good is that I do think we are like as like as a team like very committed to like actually trying to solve like really solve the problems and I think like I think doing the classifiers project was like some evidence in favor of like we really really care about actually solving these problems and we actually want to find like an empirical solution to doing the things rather than like as kind of you're alluding to like just doing like research that like looks good but doesn't actually like accomplish a thing in practice like I think we we spent a lot of time doing very like I didn't know I wasn't like I wasn't really aiming to get like a paper out of it but I think we like actually managed like accomplish something that was like slightly more real which I think is good and I feel like this is just yeah this feels like one step forward but there's like a lot a long way to go for me. I mean I guess I'm a slightly more optimistic I think there is definitely a real but I feel like we're making a decent progress and I think like probably if we like keep working on the problems like just pragmatically we can make a lot of progress and just reduce the risks dramatically. I don't think we'd ever like reduce the risk of like AI to like zero but I kind of like see AI as a tool and if we you know adopt the right safeguards and we do the research that matters I think we can make a lot of progress here and that's like ultimately the best that we can do. Yeah I mean I guess I mean I think sentiment wise I'm like pretty pretty similar to to Meg in terms of being like yeah I think I think there are like very serious risks here. I'm definitely like pretty concerned about yeah a lot a lot of the risks and like yeah I guess I'm like well the best I can do is like help yeah reduce the risk by like some amount I think like yeah I think I do think that this project like made some progress about on that and I'm like pretty excited about that. I mean yeah at times it's like it's like it's like overwhelming you know it's like to what is it to like really like internalize what might happen and then there's like a desire in me to like just like yeah show up here and and do work in like a trustworthy way and that there are there are challenges but we can make progress there and I feel like we've made a bunch of progress and really excited to sort of share the progress with others and you know we could have not written a paper what we did decided write a paper and sort of try to sort of get it out there and sort of share share the approach yeah and you know sometimes overwhelming and other times it's more the sense of like real like privilege and like honor of like wow yeah I feel like I'm really doing like meaningful and point work and I'm also not to forget all the like all the beautiful things that could happen with like really beneficial AI great so you know something we've mentioned here is that like we think we've like made progress in terms of robustness like how how we tested this how do we how do we know what do we think that progress progress means I guess the overall summary kind of on whether we're making progress is kind of like how hard is it to find a universal job break for a system without like increasing over refusal too much or like increasing the compute costs of whatever system you're trying to deploy and so like there's different ways you can measure each of those aspects so in our paper like one way we're looking at how hard is it to find universal job break is we actually just like had human red teamers try to find jail breaks for our system and then we just kind of like tracked how many hours did it take for them to find a universal job break and did they find one yeah so could you actually like walk me through kind of like where we were before the project sort of started yeah I mean I guess we started with I mean first of all if you just have the model itself like it has some like basic training to try to like refuse harmful harmful queries but of course we have there there are a lot of jail breaks that exist that work on our models and so those jail breaks are also just like kind of available on the internet and so in theory anyone could jailbreak models and that's kind of like how hard would it actually be like if I like I want to you know like jailbreak a model like what would I actually do right now I mean you could go on Twitter and find existing jail breaks yeah and basically in like a few minutes and just jailbreak an existing model like I think they're yeah they're just examples on on Twitter where like while a model is being demoed live for the first time and it's just generally been made API available like someone jail breaks it and immediately post it like that was the level of robustness like before before this with a universal jailbreak yeah that was like the level of robustness like when we started this project and now just to like give the punchline like we with the systems with constitutional classifiers were able to get thousands of hours of robustness to red teaming where we do very large scale of red teaming with like yeah people who were testing our red teaming our systems including like expert red teamers and yeah recently put out for like public red teaming and yeah it it took like I think over like 3000 hours worth of like red teaming up for for people to find universal jailbreak so I think in terms of like yeah going from sort of like minutes to like thousands of hours it's like several orders of magnitude more robustness there still are like there are still some universal jail breaks and like we have yeah you know we need to like patch patch the classifiers and stuff like that but I think yeah I think it is like a huge amount of progress yeah so I like to think like we had a system which you know you can just a really easily basically pretty easily job break it you know you read a paper you go on twitter and then you red team the system back in September and it wasn't it wasn't a complete rock but it was like you know you ask questions according like half the time it gets given to a rock and you get no response so it was like it was it was pretty robust but it wasn't doing that well for you know sort of most normal users right and but that also got this you know like thousands of hours to universal jail breaks and then some of the demos is then that we just put out yeah that is doing way better in terms of like normal user usage and inference costs and sort of getting getting getting like good robustness there's a lot of progress there and a lot more work work to be work to be done yeah yeah I think the comparison between kind of like our prototype September system versus the system we just demoed is actually like the difference is pretty like night and day so like we did a lot of analysis here Jerry has the most beautiful graph of positive rate just dropping and dropping and dropping of a time yeah it's really one of the best things I've ever seen yeah so so we measured our like over refusal rates on claw.ai traffic this is like real user production traffic and we're kind of like operating under the assumption that most people aren't asking about like catastrophic weapons of mass destruction related stuff and so like originally we were finding that this like September system was blocking like 40 plus percent of these like benign requests which is like pretty terrible you're approaching the rock there but yeah in our demo system we got that all the way down to the .38% and of course we still want to get this down further but between like 40% and .38% this is like two words of magnitude so how do you know how did you actually make all those improvements like it is like you know this and this is something you see in like a lot of also like the earlier safety work kind of this like tension between like harmfulness and helpfulness so and I would say for me it's kind of surprising that we were actually able to make as much progress as we did so yeah how do we get there yeah I mean I think the two yeah I think the two main improvements we made were first we like really honed in on the constitution idea and we made it really like clear how to like delineate things that were harmless and we found that adding this kind of like harmless set of categories of like things that the model the classifier should allow actually reduced fpr by a lot and we have like some results in our paper for that and I think that was like one of the most significant changes other changes include actually like solidifying the kind of like Joe break styles that we that we trained on and so that kind of allows models to generalize better on like what exactly is a Joe break versus just like I'm thinking like anything is a Joe break and that also probably helped a little bit I don't I don't yeah but I think both of these things were pretty useful here yeah this is really like I would like nice plot in the paper which is just a number of data points and like performance on the e-vowels and like how robust it is and they kind of in the style of like doing the dumb thing that works like that that is just like a straight line going getting up upwards yeah I mean to be clear I think the the system that we released for the demo still has a lot of false positives but I think like yeah I think we're pretty optimistic about like for the reducing the false positive rate for yeah some kind of like to to make something like more production ready but yeah I think that's that's kind of what we're at yes so can like why did we do the demo like what what was the demo you know what was the point of the demo I think the main questions we want to answer from the demo were like was our system as robust as we thought it was and I think like one thing here is that from the prototype we did human-red to mean there and we knew it was like quite robust but then after that we developed kind of like our own automated evaluations and we weren't sure whether that generalize to like actual human-red to mean and so one question that we want to answer there is does it actually generalize and if some classifier does well on these automated evaluations is it actually robust in practice and so if we could get that answered that's like a main thing and I think another thing here is just like having people try out with the system and you know seeing if it's actually like overfusing too much and I think people were finding that sometimes it overfused on specific like specific domain questions of course this is like still a lot less overfusy than our original system which would just like refuse on like completely unrelated things but I think that also helps us kind of like get to know what the weaknesses still are and that helps us know what to improve in the future so how exactly do we set this up yeah I mean I guess we've set it up on our like we had this public facing demo and then we tweeted about it to have people try to like read team it we set it up such that there are like these like eight sets of like questions that we didn't want people to be able to get the answer to or just like eight specific questions and the goal was to try to like get people to find jail breaks to get harmful detailed information on each of these questions and then ideally they could do this with kind of like a single universal jailbreak yeah so when we're talking about universal jail breaks we care a lot about making sure that the responses are detailed and not just like some very vague information that we personally don't think is like that doesn't give that much uplift so if they were actually trying to do the action that the question was trying to get at and so the way we kind of evaluated this is we have these helpful only models which are kind of trained to not refuse anything and these kind of like would be similar to what a model would be if you had an universal jailbreak as in they don't have safeguards and they will just give very detailed answers to questions and so for each of these like eight questions we have a response from this helpful only model and that's kind of simulating what would the response kind of be if you had a universal jailbreak and then because there's some like some randomness in model responses we kind of have this greater that compares a target response with that response from the helpful only model and then it kind of like looks for whether there's enough detail that's shared between those two and if there is then we'd consider that to be detailed enough. Nice and I think we looked for like at least 75% overlap in the information or something like that or yeah. I think the thresholds change per question so I think in some of the first levels which is like the the less harmful questions we had like slightly lower levels of overlap required and then towards the later questions we kind of increased that threshold to maybe like 60-70% and yeah it's dynamic through the challenge. Yeah I find this like question of grading just like really really like interesting in general and also like kind of just challenging to do well. I think we made a really good effort I think on the demo system but like it definitely wasn't perfect. So the way the system works right now is it's like it's looking for sort of overlapping bits of detail between between two answers but we had this thing in our sort of external red teaming that we did where people would just sort of merge like five six seven eight nine ten different model responses that cover like loads and loads and loads of loads of detail just because there are so long and by this metric of whether it includes details or it doesn't include details that would be considered so harmful even though like if someone's giving me instructions to make a cake and instead of having like you know this really nice like step-by-step a bullet point list unit first do this and it was kind of completely scattered and random and you know and there like everything is out of order. It's actually a lot less helpful than the helpful only model. The helpful only model sort of by design like has no safeguard is designed to give you the information in a way that's going to be maximum helpful to you. So I think this question of yeah like what is what is harmful, what isn't harmful, what is an appropriate threshold? Is it quite a subtle and and yeah just generally quite a difficult one? And I think yeah there was there was this sort of reaction to the grading system in the demo was quite quite interesting. I think a lot of people found responses that sort of looked harmful they they had some of information and then our grader would say there's not enough detail. There needs to be more detail and you know this would be I think frustrating for people because they're like what was the detail that's missing? I don't know what the information is and in a way I think that's partially by design like if I'm making a cake and there's an essential bit missing in the ingredient list or an essential thing missing like in the instructions. I actually have like no idea what that is because of the thermal because I'm not an expert in this another thing that I think is interesting why is the helpful only response why is that the baseline response? Why is that the thing that we're actually comparing against? And and I think something here is that we have a team out in front of it called Frontier Red Team and Frontier's Frontier Red Team's job is to basically take advanced models and see what could happen with these models? Do the front modeling work that we were mentioning before? And what they do is they evaluate this helpful only model and they say oh this helpful only model we think it's like it's this potentially dangerous like it could be used to to carry out some complicated process. So actually so they Frontier Red Team are like measuring like what is the risk of like a helpful only model? And if the risk of a helpful only model or a model without safeguards is too high, we're trying to measure or the way I think about it is like we're trying to measure like what's the gap between the model with safeguards and the helpful only model? So doing this like relative comparison while it's like imperfect and certainly like bits of information that is harmful can get through it allows us to sort of like do this do this estimation and do this comparison. But yeah this is it's just found the reaction to this really really interesting on the demo. I think yeah it's quite a subtle point and I think also yeah I'm excited for us to sort of develop like better grading techniques and better things actually like measure like okay like genuinely like how helpful like is this completion for the for the tasks that we that we care about? Yeah the other thing I would want to just say quickly is that it's also just like very unrealistic to get this kind of feedback on how detailed and correct the information you're getting is like that's actually a huge advantage as a red teamer and like uh yeah I think that actually the fact that we gave that feedback I think is gives you very clear feedback loop that you normally wouldn't have. But yeah I think maybe we should just like talk about the results. So someone did find a way to pass all eight levels but before that our system was holding out for at least five days and we kind of did this filtration of like we we tried to find what kind of users we'd consider like active red teamers and for that we're just kind of like doing this very conservative estimate as someone who at least passed one question. We're kind of saying if you can pass one question we're considering you to be like an experienced red teamer and you're like trying to break the system. I think that's pretty concept like I'm not the best red teamer out there but I'm going to struggle with the first question. Yeah so yeah but yeah by this conservative criteria we had like 308 users and these users had over 300,000 chats and we estimated that they had over 3,750 hours of red teaming and that was before someone passed all eight levels and after that there's kind of like the question of did they pass all eight levels with a universal jailbreak and so that's actually harder than just passing all eight levels with just like any random jailbreak for each a question individually right and so for that that took an additional 50,000 chats and additional 600-ish hours of red teaming for someone to find universal one. So something something I find kind of interesting is that you know in in some sense you know we've spent a lot of time you know trying to make a system that's like more robust and it's true like the baseline has been it gets broken in hours and we're now we're now on days you know which is like a definite progress but um yeah like you know how would we know this is like safe enough or how would we know this is like high enough what makes me think one would make us think this is actually sufficiently safe and in practice what else would we need? I mean I think the the real gold standard we want to hit um yeah also yeah driven by the responsible scaling policy is to be able to make a safety case like a really clear argument that even though the model has a certain dangerous capability we don't actually think that the model will be able to pose the with our safeguards be able to pose the risks associated with that dangerous capability and I think um roughly I think based on this result and the rapid response paper that we had earlier I think one approach that seems like quite promising for how we would go about making the safety case once the models do become capable of um yeah more serious misuse risk is basically to um build a very good uh sort of like constitutional class fire based system which takes thousands of hours to to jailbreak um then um have some kind of have uh so that will hopefully that will mitigate um lots of jailbreak attempts like vast majority but some will still go through um and then then we need some other mechanisms to basically like detect and then respond to those additional jail breaks so those mechanisms would be like a some kind of bug bounty program where people can um report jail breaks and be given yeah monetary rewards for reporting um jail breaks uh and be um probably some kind of like um incident detection or like offline monitoring to after the fact detect that like some of the traffic uh that like involves some jail breaks so we didn't notice with our immediate like classifiers that are deployed like online immediately blocking the harmful outputs um and yeah I mean you can imagine like various things that could work there but um yeah for the online system the classifiers need to be like pretty efficient and like small and have lots of different constraints like they also need to support like token by token streaming since that's important for like reducing latency to immediately get a response from the first token so um you know that that definitely like there are a lot of constraints there which make those classifiers less less effective than you otherwise could get but then you could serve the response and then after the fact use a much more expensive classifier like your largest most capable model maybe with a lot of test time compute and like reasoning through whether or not this response is harmful maybe flagged most dangerous responses maybe flag a number of those and then even have the humans look at like have human reviewers look at the most um yeah those top few ones to see like are any of these real jail breaks like you can imagine a really like heavy duty system like that um to detect like additional chill breaks then like if you do notice there are some additional chill breaks here use the rapid response approach we described where you take those examples proliferate them to get more uh automatically with LLM's to get more examples of jail breaks than retrain your classifiers redeploy the classifiers um yeah that that system like overall um yeah basically the hope would be that this gets the fraction of time at which there's like an open universal or an open universal jail break down to like a reasonable amount such that like if you're trying to follow some like complex scientific process to like make some CBR ad weapon or um do a lot of like cybercrime like they're actually just like only a small window of time to like use the model for the steps that you're gonna take um and yeah I think if if it's like well only like 0.1% of the time there's no vulnerability that you can use like that's yeah that just makes it like very difficult to use the system um yeah I think that would be like roughly the sketch of like the kind of like safety case that we may want to make um that that we think could be promising for using constitutional classifiers to get safety here yeah for me it's just like another reminder of this um yeah the only perfectly safe system is the rock you know um they always is like really best practice and security of like yeah no system is perfect most systems have vulnerabilities and there's always this like measure of okay like how much effort does someone need to put in like how hard does that for someone to like get the information that you need I I'm kind of reminded of um I lived in like like Cambridge and Oxford in the UK like people like cycle there all the time and like bikes get stolen like all the time like I think most bikes you just get an angle grinder and this isn't go straight through the lock like not a problem like those those bike locks are not they're not like perfectly robust you know in the same way like our system isn't isn't perfectly robust but sort of in practice you know you put one lock or you put two locks on your bike you know that basically reduces the risk like a huge amount like someone's have to get an angle grinder or maybe you only be there for like an hour two hours someone's gonna catch them and yeah it's really interesting because you know the research like it always happens in this like broader like structure and like broader sensitive mitigations so you know you mentioned the rapid response like monitoring for new job breaks then being able to respond to new job breaks and there are other things we can do like um like monitoring for people trying to circumvent our systems um you know we have these classifiers and we can detect when people are trying to get around them so I'm really excited to see how this work can really you know I think it's a really good research contribution but like I'm really excited to see how we sort of put it into practice and put it into practice with all these other complimentary defensive mechanisms so sort of overall we can make like a really good safety case I agree so yeah I think we spend about about 4,000 hours for thousand hours of effort on the demo but we did find some techniques that so we're able to circumvent our classifiers so yeah what do people try and what got through yeah so I think we were seeing a lot of stuff where people would use these kind of like ciphers or encoding mechanisms to try to like get the output to look a certain way such as the alpha classifier can't really tell what it's saying and therefore wouldn't flag it um and so that was a pretty common thing that we were seeing could you give an example of something like yeah some very basic example here could be like if you use basic 64 encoding or like Morse code or something um these might be things that the alpha classifier has has a little bit of trouble understanding and so therefore it's like not as suspicious about some other strategies here a very frequent one is that people try to replace the harmful chemical with like some very benign thing like water or something and the idea here is just that when the classifier is see like you know very dangerous chemical it's just like becomes very suspicious and so if you replace it with a more harmless chemical maybe it becomes less suspicious and then you try to like yeah you try to like hint to the classifier or hint to the model in some way that actually you meant this like harmful chemical but you don't say the exact wording um and just some other minor ones here is like some role-playing attacks where you have Claude try to like act a certain way um and also some prompt injection attacks um and definitely we saw a lot of cases of people just like resampling a lot of times to try to get a better output uh that would get past the greater system also using the feedback that we yeah yeah yeah yeah yeah i often think because we have this and like basically like an additional version of Claude that's like looking at the outputs so the outputs now need to basically be very obviously not harmful in in some way so yeah you get these like cypheres or you know don't talk about like you know siring gas or like PPE talk about you know refer to it with like like bananas or some other like benign benign benign thing yeah i guess one thing that i am i'm curious about is like what we would say to you know the people who are concerned we're going to use these settings to stop them being able to do what they want with with Claude um you know and sort of why we've we've gone for the classifier's approach and the constitutional approach yeah i mean i think i think definitely my hope is like this should improve your user experience for um any tasks we're trying to do which are not actually dangerous um so yeah i mean i think yeah like yeah i would just guess like yeah getting classifiers to be really effective um is just better than um training the models themselves to reviews or not um and we can just more granularly like pick out the behavior that we want to block and so like yeah hopefully this just like is just a prado improvement it's like better user experience for everyone um and also like more safe in terms of like reliability blocking the actual actual actual bad stuff yeah another way i also think about this is that you would be want to be want to really be able to like leverage the benefits of like really advanced and like AI and AI with like advanced scientific capabilities but actually if you don't have adequate protections in place it you know it's for one like according to a responsible scaling program we actually just cannot deploy that system you know we might come around and we have some like new version new version of Claude that's like absolutely amazing and we really want to get it out there but we we just don't actually think it's responsible uh you know we we we've like done threat modeling we're considering out the risks and there's a way of saying if we don't have adequate protections in place like we actually are just we are unable to actually like reap the benefits and like a responsible responsible way so it's kind of like having the safeguards alongside like the advanced capabilities means like both together can be you can actually like responsibly and safely deploy they really like new like advanced systems that can do really crazy things and I think you sometimes see you see this actually like on like Twitter and different communities there are like one group of people who are like AI is great is doing like do all these good things that's true you know and it can do all those things and like there's like the accelerationists and they want to like go ahead like let's get ready advanced AI and let's get that now and then there's like that community and there's like this other community of people who are concerned about the risks and there's kind of like truth there there too uh like there are risks and there are risks that we're concerned about and then you want to mitigate and like something that I like about the responsible scaling program is that like it has some like nuance like there's like one position that someone could take which is like accelerate as fast as you can or just stop you know and I kind of think this with the responsible scaling for them it's like okay what we're gonna try and do is we're gonna see and try to predict what risks might occur like watch out for those risks and and when we're seeing evidence of those risks becoming real put the relevant mitigations in place and if we can't mitigate it appropriately maybe we do not deploy or choose onto deploy and I just like that this is just like a much more like nuanced strategy because you know we're I think in some ways we are operating under a lot of uncertainty like we don't know exactly we don't know exactly what's gonna happen uh the risks you know some types of risks are sometimes it feels like you're reading sci-fi stories and that doesn't mean you discard them but it doesn't it doesn't mean that they're necessarily 100% guaranteed to happen so it's kind of like how do like navigate that that place and and so this uncertainty in a way that's like responsible that's gonna allow us to like capture and sort of distribute the benefits of like potentially having this say really really like beneficial and you know powerful technology without like incurring unnecessary like costs on the rest of society and these sort of negative externalities curious if you guys have any like favorite memories from uh from from from the project I think it's just really funny uh yeah with our prototype system uh that like we we knew the false positive rate was high but then when we actually like saw the result of the experiment of running it on the claw.ai data we're like oh that's like a lot higher yeah kind of thought um and that was like pretty interesting uh we didn't think it was like that high um but it was pretty high yeah I remember um just this wasn't my personality kind of just like angsty refreshing the demo being like how many people like how many people like oh my god they proc level four you know they're coming for us um yeah it was also really really cool to see some really a lot of the human creativity uh from the red teamers and some of the stuff that they they came up with it's like really really really smart yeah I mean yeah there's probably two that come to mind I mean I think one thing we started to look at this line for in the RSP on um successfully pass red teaming I think uh this line gave Monoc in particular a lot of stress because he was like what does this even mean what's that the exact bar here and then he then went off and did like a two-week project uh on figuring out how to operationalize this yeah and he like went and talked with a lot of people about like what are the different threat models that we want to really guard against for the responsible skilling plan like what would make us feel yeah like we could make a really good kind of like safety case argument and then he came back with like a long dock uh and in the dock he's specified like we need um the threat model is you need to answer a list of um I think it was like 10 questions or so and you need to be able to do that uh you need to be able to like block someone from getting answers to these 10 questions uh after they've done red teaming for over 2,000 hours was the bar and we're like okay we're gonna aim for this bar and like I don't know I'm just like really proud of the team that like we actually did hit that um because yeah I don't know we kind of like set that up front like a year before we finished the project yeah now we got like yeah like hit that level I think that's like really um yeah pretty yeah I guess impressive like goal setting and achieve that like yeah it seems kind of rare for research projects to actually do that I think the other one memory that stands out to me um is I guess we were doing robustness research but then we read this uh scary line of the RSP and really thinking about it and then but then we're like oh this is talking to people about like who's doing this and we're like oh yeah like the um are kind of like applied like safeguards team I guess yeah it's called the safeguards team now but yeah like originally this was like this team was like part of our kind of like alignment science research team um and so we were like oh yeah like the the safeguards team is like responsible for doing this and like they'll they'll have it covered and then we like talk we like set up a meeting with some of the people on the team and they were like who's who's going to achieve this level of robustness and then we were like oh man this is this is really tough I guess we have to solve this problem which didn't seem like it was our responsibility uh then yeah that was sort of like uh that week we went through like an arc of realizing it was actually our responsibility to solve this problem um yeah I view it as kind of like you know it's kind of turned out that as far as we could tell like we were the best people to do this job like an unphobic like given the the situation like given the circumstances um given like you know everything else that was happening in T.S. and safeguards and then yeah there was this like okay like like if we're the best people to do this uh let's just try and do up S there and then I kind of think there was this like attitude and general from the team of just being like okay we don't know what the target should be let's try and figure it out we don't know what the approach to get there and we were actually doing it like another approach to we would do doing adversarial training we would find training models and we're like no we don't think that's gonna get us there and I'm kind of just like actually quite consistently just like pivoted to what we think was gonna get us there maybe what's not clear from the paper um is like this was a huge engineering project probably like five FTE years um roughly uh I think that's like not obvious when you read the paper maybe it just looks like a really simple method but I think um yeah I think like the people on the team did a lot of like making LLM like pipelines for generating the data I think um yeah it seemed very important for like augmenting the data with different transformations like translating the data into different cypheres and then using that to train that data to generate the classifier data but yeah I guess um those are a couple of like tricks that were like salient to me and like insights but yeah I guess I'm curious if anyone else wants to chip in with like other things that are like pretty important or not obvious um things about how to get this this to work well I think a lot of the research project was in fact like defining what the problem was in the first place um like we had like some kind of like vague mandate from the RSP but there was like a lot of work in like defining like what the criteria would be there's a lot of work in defining like what what does it mean to have like human red team is and like what would what would yeah what would it mean for that to be sufficient um well what what what kind of constitution should we have like what threat model do we actually care about like where do we draw the bar on like specificity um should do we need like do we need both an input and output classifier um and that kind of like depends on like what kind of threats we're actually looking for um so I feel like a lot of yeah well for the evaluations for example like how much do you care about the transformations like how many augmentations is too many to be like uh kind of like un-unusual like uh unspecific um so I think like a lot a lot of difficulty was kind of like actually defining the problem and kind of narrowing down like the thing that we're trying to solve um like trying to make the evaluations trying to find like what the decision boundary should even be like I I feel that once we now we've actually like made a bunch of progress on even defining a problem I feel like more confident about us tackling like similarly shaped problems in the future so for example if we have like yes in the same way the constitution can be applied to like many different problems I think we have like a better sense of like how we might tackle you know this kind of like big big problem like how do we actually like how would we actually like approach a threat modeling modeling problem like this how would we even start thinking about like how to make a safety case with this thing like what would constitute like a sufficient evil like what would constitute like um yeah like like a red team like a human red teaming based like safety case um so I feel like um yeah I think we'll be able to apply like a lot of things that are like kind of like fuzzy or like maybe not like explicitly written down in the paper to other problems like maybe not just in misuse but also like in misalignment or control things like this yeah yeah I'm also really excited about this like like directly practicing how like building safe guards that we can deploy and practice and like getting and constructing the evidence and like honestly assessing whether we think they're actually sufficient uh there are like so many things like like how do you run the meetings like how do you track the goals like really like mundane things that you know at least often as researchers it's not the obvious thing or the first thing you think about like well you remember as my PhD reading papers I would always just you know just go straight to the method you know that's what's that's what's interesting but no I think we learned so much at like basically like like running and executing on like projects projects like this yeah I mean one thing I was going to say here is like I think a really critical difference of this project relative to other research papers research I've ever been involved in is that it was basically we had to solve this research problem on a very clear timeline with a fixed quality bar so like we had this like 2000 hours of robustness bar that we had set for ourselves internally and it basically seemed like the the deadline was sort of like like external in the sense of like external to our team in the sense of like well they're like model capabilities are progressing at certain rates like the company is like deploying new and for topics like deploying new models like we don't want to be the long pole that causes the company to have to not deploy a model and so yeah I think there were just like we would constantly be thinking and talking with other teams like when are when do we think we might have a model that achieves like a certain dangerous capability level and like based on that we were setting like basically like back-chaining from that doing sort of like engineering style planning like week by week what do we need to have done in order to like hit that timeline but also while still having yeah I think with the fixed quality bar it's different than like a paper deadline because with the paper deadline you can just say like well we're just going to throw out these results this is what goes in the paper you're gonna change the problem you're trying to solve yeah change the problem you're trying to solve before the papers do yeah exactly uh claim less or whatever like I think you can like really adjust the quality bar but like here we couldn't and so that was like partly what forced I mean that was initially what forced us to even take the class far as approach because we're like we're not on track to hit the timeline that we were wanting to hit um like with the previous approach but also it led to like other difficulties for example like a number of people on the team were like facing difficulties with like how is a plan given that the timeline is uncertain and that we kind of needed to take a conservative estimate of the timeline and so I think there were like a bunch of decisions um that I think like team members made being like oh we're just gonna like write some note like write some code in like a collab notebook in a way that's like not super reproducible just to quickly generate some data so that we can train this next version of the classifier when if we had like a longer timeline or a more clear estimate of the timeline we would have done this like in a Python script that's reproducible and like had some general tooling for this um so I think that I'm not actually sure how what we settled on in terms of like in hindsight what we should have done I think just like um maybe like not taking too conservative of an estimate so we have like a longer time window um for when like to give us the amount of time we need to to allow for some of the tooling work to to happen um but at least like picking your strategy in a way that the overall strategy could could hit the timeline that you need um yeah I don't know if like other people have thoughts on some of this yeah I guess like I mean taking a step back I feel like this is this really makes me think about just like this is an interesting time for safety research in general um and I think a lot of like research is like I guess research in the safety kind of started out to be like a little bit blue sky like people are like speculating about like what models might be like in the future um and I think we're starting to see like kind of these like threats materializing and we also need to like adapt our research into actually being like usable in production so it's I feel that as like a research team we're solving like kind of an interesting meta problem in a sense of like how do we like adapt like our normal research techniques to actually like make make things like happen in the real world um and I think that's like something that like yeah safety research as a whole has to like grapple with is like we actually need to like solve these problems now we don't necessarily have like a lot of time to just like kind of uh do a lot of like blue sky research um and we want to actually make things like work and practice I think interpretability is also like running into this where previously people are doing research on like very maybe very small models with like very few layers or something like this and a lot of their problem is like yeah scaling things up to like tackle like a real model is like not a very small engineering feat at all and and yeah I think that I think it's kind of interesting and I guess scary that we have to like actually really get our shit together and make things like really work in practice yeah I think the other point I would make along that vein is that one thing that was really helpful surprisingly helpful to me about um in this project was just like talking with product people just being like what are the actual constraints for the system you want us to deploy like what are how much would you prioritize different things and I think they're definitely like some key surprises there at least to me I was like very surprised by um how important like streaming support is in terms of like being able to show each like word generate word by word and show that to the user that's like very important for lots of applications and it's like not something I would have necessarily realized beforehand each of these applications if we don't hit the safety bar then we may not be able to deploy in that new domain um and I think just like getting a stack ranked list of like what are the most important for for the company to be able to deploy and then just like prioritizing those use cases that's in particular how we ended up that like this like token by token compatible class fire um and I think in general is like a very good principle for supporting like for future risks I think route yeah like um you know we'll have the next generation of like risks where we need even higher levels of robustness I think this is a good strategy for getting um safety there or for mislimit risks related to yeah it is themselves doing bad things I think we'd take similar approach as all great it's been really lovely to be here and you know chat with you guys and so it kind of celebrate the sort of progress and also like look ahead to yeah all the all the challenges that we're going to face and things like this so yeah thanks thanks so much and thanks for tuning in
TL;DR
- Anthropic has developed "constitutional classifiers" to mitigate "jailbreaks" in large language models (LLMs) like Claude, especially for future, more capable models with higher risks.
- A key focus is on "universal jailbreaks," which are single, adaptable prompting strategies that allow non-experts to consistently bypass safeguards for a wide range of harmful queries.
- The system employs a "Swiss cheese model" with multiple layers of defense: an input classifier, Claude's internal refusal mechanism, and an output classifier, all guided by a natural language "constitution" and rapidly updated using synthetic data.
Takeaways
- Understand Jailbreaks: A "jailbreak" is a technique to bypass an LLM's safeguards to extract harmful information. These are a concern because future models could assist with high-stakes activities like weapon development or large-scale cybercrime.
- Prioritize Universal Jailbreaks: Focus on defending against "universal jailbreaks" – single, easily adaptable prompting strategies that allow consistent bypass of safeguards for many harmful questions, making sophisticated attacks accessible to non-experts.
- Implement a Multi-Layered Defense (Swiss Cheese Model): Use multiple, independent layers of protection. Anthropic's approach includes:
- Input Classifier: Scans the user's entire conversation for harmful intent.
- Model Refusal: The LLM (e.g., Claude) itself may refuse to answer.
- Output Classifier: Monitors the model's real-time output for dangerous content, acting as a final independent safeguard.
- Define Safety with a Natural Language "Constitution": Specify acceptable and unacceptable content categories using a human-readable "constitution." This allows for clear, explicit rules (e.g., "no instructions for WMDs," "yes to writing poems").
- Leverage Synthetic Data for Training and Adaptability: Generate vast amounts of synthetic data from the constitution to train classifiers. This enables rapid updates to counter new jailbreaks or refine refusal boundaries by simply modifying the constitution or adding new examples.
- Ensure Flexibility for Novel Threats: The natural language constitution and synthetic data generation pipeline provide high flexibility to adapt to newly discovered threats or refine what constitutes a "harmful" or "harmless" query, minimizing the window of vulnerability.
- Decouple Safeguards from the Core Model: Classifiers operate independently from the text generation model. This allows for rapid iteration and deployment of safety updates without altering the core model's behavior or requiring a full retraining, providing stability for users.
- Balance Safety with Utility: While maximum safety could mean blocking everything (like a "rock as a model"), the goal is to precisely block harmful content while allowing as much useful and benign application of the model as possible, avoiding "over-refusal" or false positives.
Vocabulary
jailbreak — A technique used to bypass the safety safeguards and ethical guidelines embedded in an AI model, typically to elicit harmful or restricted information.
universal jailbreak — A single, adaptable prompting strategy that can be used to bypass an AI model's safeguards across a wide variety of harmful questions or scenarios, often making advanced attacks accessible to non-experts.
constitutional classifiers — A system of AI classifiers (input and output) that are trained and guided by a "constitution"—a set of natural language rules defining harmful and harmless content.
Swiss cheese model — A concept in risk management where multiple layers of defense are implemented, each with potential "holes" (vulnerabilities), but designed so that the holes in different layers do not align, providing robust overall protection.
input classifier — An AI system that analyzes a user's prompt or the entire conversation history for harmful intent before it reaches the main AI model.
output classifier — An AI system that monitors the AI model's real-time output for harmful content and can block or redact the response if dangerous information is detected.
constitution — In the context of AI safety, a set of natural language rules or principles that define what is considered harmful or harmless content, used to guide the training and operation of safety mechanisms like classifiers.
synthetic data — Data that is artificially generated, often by an AI model itself, based on predefined categories or rules, used to train other AI models when real-world data is scarce or to simulate specific scenarios.
responsible scaling philosophy (RSP) — Anthropic's internal framework outlining conditions and red lines for safely developing and deploying increasingly capable AI models, including requirements for robustness against jailbreaks.
threat modeling — The process of identifying, categorizing, and analyzing potential threats or vulnerabilities to a system, including what types of harmful activities an AI model might be used for.
Transcript
Hello everyone, my name is Monank. Yeah, I'm really delighted to be here with some of my colleagues at Unphropic. Hi, I'm Jerry. I'm on our safeguards research team. I've been at Unphropic for about eight months. Hi, I'm Ethan. I've been at Unphropic for two and a half years and I'm leading our efforts on AI control, developing various different monitoring methods for various different AI risks, including our silver business. And I was a part of the Safe Affairs Research team as well before. Hi, I'm Meg. I've been at Unphropic for about a year and a half now and on the alignment side of the team, which has been great. Great. So yeah, and we're just going to be talking about constitutional classifiers and that's our new approach to really try to mitigate job breaks. So yeah, how would we define a job break? Yeah, I mean, I think a job break is kind of like some way in which it's only like bypass the safeguards that we include in our models and try to get harmful information out of it. Yeah, so there are these techniques like, you know, they do anything now, job break. It's kind of like similar to people with like job break, their iPhone and try to get around all the safeguards there. But the thing is that, you know, with like iPhones and, you know, other stuff like this, job breaks aren't really a thing that people, maybe on that dangerous, something that we care about that much. Yeah, what is it that makes? Yeah, like why should we care about these job breaks in the first place, you know? I mean, I think one of the main reasons is for future models, which have greater risks. So yeah, I think people are pretty, pretty carefully monitoring. People with different companies and academic communities, pretty carefully monitoring, like, if slash when models will be able to help with weapon development or yeah, like large scale cybercrime or like various different risks that are like greater than what we've seen before, also like mass persuasion, things like that. And I think, you know, once models become like really effective at some of those and are like a significant uplift over, say like using Google search or general internet resources to do some of those things, I think, then it then, yeah, I guess being able to use models to help with those, those kinds of things will be, yeah, potentially like speed speed up bad actors quite a lot. So I think a lot of this is like in preparation for like next generation models or next generation models. Yeah, great. And I'm, I'm, I'm curious like what the story of the work is and sort of why we set out to do this in the first place and the RSP. Yeah, I guess like inthropic really cares about safety, a great deal. And we have the RSP, which is the responsible skating philosophy, which is really trying to outline conditions under which we're happy to release models and make sure we have different safeguards in place. And a while ago, we committed to a very difficult standard for jailbreaking the RSP for what we call ASL 3 models, which are basically models, which have maybe some of these like dangerous capabilities, like being able to build dangerous weapons. And our team was kind of mandated with trying to actually solve jailbrigs for this kind of level of model. So yeah, I guess like the motivation for the work is actually trying to satisfy the things in the RSP such that we can like feel like we build future models safely and can actually deploy them with some sort of progression towards safety or making some progress towards there. Yeah, so I think the classifiers are definitely a good step in that direction. Yeah, so you know, there are lots of different types of jailbreaks that are out there. And something that we've really did in our work is we focused on universal jailbreaks. So yeah, why is this something that we should care about? And what does that mean? Why is it important? Yeah, I mean, I think the reason why I'm particularly concerned about universal jailbreaks is just because of the uplift that it would give kind of like a non-expert. And the way I kind of think about this is that if like some random person on the internet is like trying to do something bad, they may not actually have that much jailbreaking like experience themselves. And so like the thing they might just do is just go online and see like, oh, what are some existing jailbreaks that I can use once like a template where I can just put in my harmful question and just like get an answer. And I think in that case, like you're very concerned about these kind of like strategies where anyone could just like put in any question and it gets the model to like bypass all the safeguards. And I think that's particularly concerning at the level of model capabilities that we're concerned about. So how exactly would you like to find a universal jailbreak? Like if you're telling someone on the street, what does a universal jailbreak mean? How would you know if you jailbreak this universal? Yeah, I guess like there's a little bit of ambiguity there, but I think the kind of like definition that we're going for is some kind of prompting strategy. It could be automated, but just a singular strategy that's like very easily replaceable with any like wide variety of harmful questions. And that consistently gets a lot of detail from the model and like bypasses the model safeguards. And I think one way of quantifying the universal jailbreak is that it like actually does speed up a person quite a lot because instead of having to jailbreak every specific query, they have to, they can just use one jailbreak for all the queries, which actually is a lot faster. So I guess like one idea is like if the model, if there's some counterfactual way of doing thing that's much easier than using the model, there's no point in using the model. So not having universal jail breaks might mean that they try other strategies, which might be worse or something like this. Yeah. One thing I would maybe add is that like the difference between a universal jail break and non-universal ones is like for non-universal ones, you might need to for every harmful question you want to answer, you would need to jailbreak the model in particular for that particular question. Then you got a new question, like a new, yeah, I don't know, your next question in the process of developing your new weapon or whatever and then you need to jailbreak the model again. I think basically like that entire process, if you need to do that hundreds or thousands of times, is just like very costly, whereas if you just need to like upfront find one strategy for jailbreaking your model, like a single prompt where you can just swap it a new question, that makes it so that it's like, yeah, the amount of total effort for jailbreaking is like much lower. So that to me is like one of the primary motivations for fixing on universal attacks. Yeah, now I'll just give an example here in a way that I think about it. So let's say I want to make a cake and I'm not able to make a cake because I've never baked to my life. I don't know anything about ingredients. I don't know how things work. So how could a model be able to help me do something I can't do otherwise? So if I need to make a cake, I'm going to need to ask a bunch of different queries. That's one thing. Like it's actually, like I'll put, you know, maybe I'll put something in the oven. I need to know like if the temperature's right, the smells right, take it out, hire a check, throw out all the ingredients. So one thing I think that's really important in the universal jailbreak definition is that you're very confident that the information you're getting out of the model is actually really helpful for the thing that you care about. So it's kind of different. There's sort of some techniques that get a little bit of information or they get some information but it sort of mixed in a lot of other stuff. But for these sort of scenarios where what's happening is an actor that can't do a task that requires a lot of expertise, we're worried about them being able to do the task. We think that they all need to have access to the sort of universal jailbreak. They'll need to be able to ask many queries and get really reliable information and they should just know this is the correct instruction. This is the right thing to do. I think an example of a non-universal jailbreak that someone on the team I think Jesse Mu found was I think he was jailbreaking it for asking the model how to make, asking Claude how to make math and he found some jailbreaks where he puts the model in a scenario where it's roleplaying as if it's part of breaking bad, the TV show where they make math and then ask the model the question, how do I make math? That kind of jailbreak you can imagine how that would be effective for that specific kind of question, like things related to math, but is not going to generalize or things related to cybercrime. But on the other hand, there are some other jailbreaks which do anything now, jailbreak which gets the model to do anything by getting it to talk in a certain mode or roleplay in general for arbitrary questions. That would be what we would call a universal jailbreak. There's also other strategies people have used to make these like using language models to automatically find different jailbreaks. That would be a more dynamic approach that might be able to discover these on the fly, but if you have a single process for generating a jailbreak for a new question, that would also count as universal. So yeah, and just maybe an even more basic question is why does the solean to jailbreak the model in the first place? What does that even mean? Yeah, I mean, I think we've done a lot of work, such as our constitutional AI work on getting clawed to have the kind of characteristics where it doesn't actually try to give harmful information if it thinks like the user might have some bad intent or something like that. And so for a lot of these harmful questions, if you just ask the question itself, it's very obvious that this is like bad question. Users trying to make some weapons of math instructions, it's like very clearly bad. And we've trained clawed to not answer those questions. And so the jailbreak is needed to actually get around those safeguards in order to get clawed to actually answer the question. Another thing that's relevant for the like Hama Sims training question is that I guess like there's many different ways in which people could present jailbreaks the model. And there's also many different tasks that the model needs to be able to do and it's kind of everyday life, so to speak. And I think having like an extra set of like systems that are like specifically trying to guard against jailbreak can really help have like, I don't know, like some kind of like Swiss cheese sort of method of like trying to block like harmful things via like many different layers or something like this. We talk about Swiss cheese a lot on the project, but you know, this is not as well known everywhere. So yeah, what does that mean? Yeah, so I guess like the the Swiss cheese model for like protecting against things is that maybe you if you have like only one system for preventing harmful things from happening, there may be like some specific problem with the system that people can exploit and get through every time. But then if you add kind of like, yeah, kind of like a layer of Swiss cheese with has which has like a hole in a very specific place. Maybe like the rest of the cheese blocks all of the harmful attempts, but there's like a specific hole. But if you add another layer of Swiss cheese, the hole might not be in the same position. So if you have like two layers of Swiss cheese, it's actually much harder to get through things. Even though they both may have holes. Yes, so what are these layers of cheese for like our method? We'll be talking about constitutional classifiers. So yeah, what are the different sort of layers of defense that we're going for here? Yeah, I mean, the first layer would kind of be our input classifier and input classifier here is looking at basically the entire conversation that the user passes into the model. And so that's the first layer. And then the second layer is if it gets past that input classifier, Claude itself, which is the model that we're trying to guard, can actually refuse to answer the question. And then that's kind of like another layer of like, Claude saying, okay, maybe this question is not so good. Maybe I shouldn't answer it. And then finally, we have this output classifier, which kind of looks at what Claude is outputting in real time. And then if it ever sees something that seems like it's like dangerous or seems like it's against some value that we're trying to block, then in that case, it can also choose to stop the stop clouds output and block the response. And how do these classifiers, what are they looking for and how are we specifying what to look for? So I guess like in our paper, we call them constitutional classifiers because we have this kind of natural language set of rules. And this is like some set of rules where we can specify some categories of topics which are not okay to talk about. I guess like an example could be like, I'm creating a web of mess instruction, clearly bad. And we can specify, let's not have Claude tell the user how to make that. And then we can also specify some examples of harmless stuff that Claude should be allowed to talk about. And then basically, we train our classifiers to classify whether conversations or outputs are related to these kind of like harmful or harmless categories. And then that allows us to make a decision on whether to block it. And crucially, like, I guess the input classifier and output classifier, kind of like doing two different jobs, going back to the Swiss cheese analogy, we hope that the holes of the Swiss cheese are in different places, specifically for the input and output classifiers. So I guess the input classifiers, kind of like doing the really naive thing that you'd expect, like it really looks at the user prompt and tries to work out whether there's anything harmful going on. But crucially, one of the reasons that you might need an output classifier as well as input classifiers that people are trying really hard to jailbreak the model and the input classifier in the prompt. And if you have a totally separate output classifier, which is only looking at the output, that only ends up looking at stuff that the model itself has produced. So it's kind of somewhat decarolated from what the user put in. So the, so like two parts of the system are looking at things directly that the user put in. But we also have this like kind of third held out part of the system that doesn't, that the user actually doesn't get to touch directly, which makes it a lot harder to kind of like completely jailbreak the system. And although the input classifier and Claude are doing a lot of the that overclass way is doing, I don't know, is doing some important thing as kind of like the last last crucial component for like really driving down. Yeah, harmful, harmless rates. So this is true in a maze sort of sense. But most people aren't using Claude for this, you know, most of the time that people ask, as query is they just they're just doing something completely great like tonight, legitimate, a really beneficial application. You know, so we could have guards that just block everything that would be completely useless. So yeah, how are we, how are we making sure that this sort of we're not we're not over zalas there? I mean, I think we really want to part of why I think we're designing these techniques is to get allow as much useful content to be and useful work to be done by the models. Like the better techniques we have for blocking exactly precisely just the really harmful content, the better we can not have false positives for users who are using models for yeah, really good applications. I think, yeah, I think the classifier approach yeah, like make some some progress there and might be better than other approaches, like directly training the model to refuse. And so yeah, I mean, I think hopefully the this leads us to allow users to talk with Claude about lots of CBR and related topics that are safe to talk about. And so yeah, I think our hope definitely is or like yeah, my hope is to allow for a lot of those applications to like thrive while just narrowly blocking out the things that yeah, we think we believe are dangerous. Yeah, I guess crucially also, I think we often like make the joke that if we had like just a rock as a model, that would be extremely harmless in that it would not in fact answer any harmful queries, but unfortunately it would be not very useful. So I think, yeah, making sure that we don't block harmless queries is actually I think that is actually quite important. And also actually quite difficult. Yeah, and man, you mentioned before kind of like solving job breaks or like making progress on this problem robustness, like how would you even define that or what does that actually mean? Yeah, I guess this is a very difficult question. I think I guess I can involves like a bunch of different, different layers. Like firstly, there's some idea of like threat modeling, like you have to have some idea of like what it means for something to be harmful. The frontier red team has done like some amount of work and trying to like specify what things were actually worried about. And this is this is somewhat hard because as Ethan was talking about it, we're talking about a lot about future models and potential future model capabilities, which we might be really worried about. So I think part of it is like mapping out what might be harmful, but we might need to like address. So threat modeling, like what is actually harmful, then there's there's like the job of like actually measuring harmful things. So that's a lot about, I guess like we're using a like constitution to like kind of like define the threat model and then having like models generate various synthetic data to try and kind of like enumerate various harmful things that could happen. So measuring like a true positive rate on the data. But then there's also trying to make sure that we don't refuse too much on like real real data from Cloddei and make sure that we actually can I don't know be as helpful as possible while still being safe. So what actually is the constitution? You know, we these are constitutional classifiers. We're talking about a constitution. What does that mean? Yeah, I mean the constitution here just kind of means like some enumeration of categories of requests and conversations that we kind of deem harmful versus not harmful. And so examples here could just be like yeah, you know, questions on how to make weapons of mass destruction or like trying to source ingredients for making weapons of mass destruction. And then we basically just enumerate some of these categories. And then we also specify some categories of like harmless stuff like I don't know like writing poems or like writing code for like normal use cases. And then we can just kind of like specify these. And then we as Max said, we generate a bunch of synthetic data that gives more specific cases of those. What do you mean by synthetic data? Yeah, so here's the data. We kind of mean that we start from these like broad categories of user requests. And then we have Clod actually kind of like branch out and like think about all the specific requests that might be examples of this kind of like broader category. And so yeah, the category might be something like sourcing materials to build weapons of mass destruction. And then sub-request there might be like, oh like going, what specific stories might I go to or like are these specific materials accessible at I don't know in X state. And so we have this process for automatically doing this. And that allows us kind of like generate a huge amount of synthetic data from just a small amount of categories. Yeah. And I think something that I find really cool about the method is that it is just based on like natural language. You know, we were talking about fret modeling before and fret modeling at least in my experience and my experience working with with Frontier Red team is that fret modeling is really hard. You know, there's a lot of people using Clod. It's really hard to like what are all the possible things that could happen. And we're going to learn new things, you know, as we have monitoring and as you know, we always learn new threats or new things that could happen. And yet, something that I find really exciting about the method is that basically if you want to change the constitution, if you want to change what is being blocked because you've learned something new, you know, you've maybe the things come out in the news or there's some like intelligence or like monitoring. The only thing that you actually need to do is you just rewrite the constitution. And the sort of the standard approach for classifiers is you would like ask humans to get a lot of data. So, you know, something that could happen is that, oh, we're we're say we're really focusing on, you know, one category like one particular way of, you know, maybe like cyber misuse. But we later realized that, oh, actually, this thing which is much more dangerous or something that we've just learned something new or someone's informed us. Something that I'm really excited about is that this is a way that we, I think we can get good robustness, but we can like maintain our like flexibility and really maintain our ability to like respond to like novel threats and adapt to what's actually happening. Because yeah, I feel like this is just the lesson that we learn like again, again, you know, if you don't have flexibility, it's sort of going to be going to be a problem. It's going to limit us. I actually do want to make a quick point. The flexibility thing, which is that I think like our approach is not just like flexible and kind of like switching like general topics. For example, if you want to go between like cyber and like, I don't know, like weapons and mass destruction or something. But I think it's also like a lot more fine grain than that in that during this project, we saw that like there are some requests that are early classifiers were like always very suspicious of, but they're actually benign. And what we could do is we could actually just like modify the constitution and add like one sentence that says, oh, these types of requests are okay. And then when we retrain the classifiers on that new data, the classifiers would no longer flag those benign prompts. And so I think that allows you a lot of like fine grain control over what exactly your classifiers are trying to flag, especially if you see a lot of like over refusal or problems with like missing stuff. Yeah, I mean, this might also be a good place to give a shout out to. We had a paper earlier on rapid response where we kind of leveraged like a similar idea to improve the safeguards around models. And I think basically one nice feature about using synthetic data is like if you notice not even just a new category of jailbreak, but just a new kind of jailbreak that maybe applies. Like let's say we notice a new universal jailbreak like to do anything now prompt. We can take that, use an LLUM to generate variance of that and then throw that into the data mix. And like I think yeah, my understanding is this like was really helpful for us in like developing the classifiers to the level of robustness that we got. If someone reports a new like jailbreak or vulnerability, then like we can use that to like really quickly update the classifiers by using some synthetic data generation pipeline. And that really will like minimize the like fraction of time by which there's like an outstanding like jailbreak, which can yeah, just make it so that the models are like yeah, vulnerable for like as that small period of time is possible. Yeah, it's the common wisdom. I suppose that like not like perfectly solving securities, basically impossible. There is no like perfectly secure system like known to humanity. So I guess we're neither the flexibility both, you know, this like, oh, we're blocking the wrong thing or we're blocking bananas users, but you know, when people do find things that get through the system, we want to be able to fix those really quickly. Yeah, I think part of our approach is that we've kind of like model jail breaks in a way that like it's very easy for us to add examples of new jail breaks into our kind of like training pipeline. And so if new jail breaks are discovered, it's quite easy for us to just generate more examples of those jail breaks and then train on them. And then hopefully those classifiers will be more robust too. Yeah, I think one other thing I would add that's nice about classifiers is that they're decoupled from the actual text generation model. And so if you yeah, I think often it can be very difficult to update the text generation model that if you train it to refuse in one domain, maybe that generalizes in non-obvious ways to behavior in other domains or refusal behavior in general. I think we definitely ran into some difficulties doing like preliminary work on that. But I think with the classifiers, you can just like keep the text generation the same and you know it's identical to previously deployed model. I think that gives customers like a lot of assurance that there are no like major changes happening in general to the model. The kinds of text outputs you're getting and the only changes being made is just the block or no block decision, which you can yeah iterate on separately from the model. So I think that lets yeah, that also makes the rapid redeployment like way easier than otherwise, we would then we would otherwise be able to do. So how did we come up with this approach? That's a great question. I feel like we spent a lot of time thinking about it. I think classifiers like stood out, I think for the reasons that we've just been talking about the like extremely flexible can be like easily updated like to respond to like various novel threats. Yeah, I think threat modeling is really hard. So having a thing that's like super flexible is great. It's lightweight, it doesn't increase inference cost as much as I don't know. I guess like we can kind of like distill down something that's like some of what complicated like constitutional set of roles into like a somewhat small thing. And yeah, I think that these all these things make classifiers kind of like a nice way of like iterating really fast on the kinds of things like it, hoping to achieve. And then I guess we tried it and it seemed like it was working so we kept going. Yeah, I think this was really due to the responsible scaling policy that Anthropic had. And yeah, I mean, I think we would have done other safety research if not for the responsible scaling policy. What is the response? Yeah, the responsible scaling policy is basically Anthropics plan for how to ensure that our deployments are safe. And yeah, basically it outlines different like red lines for capability thresholds at which there's basically a new risk that kind of comes online with like more capable models. Let's say models are capable of developing a very dangerous chemical up in than the the associate mitigation in the RSP is get above some sufficient level of robustness to jail breaks so that the model is not actually in practice with mitigation sufficiently helpful to an adversary who wants to do that. So yeah, I think in the original RSP there is based on this commitment to once models got to a sufficient level of capability at assisting with potentially proliferating knowledge about known weapons of mass destruction that we would then have the ability to it was the wording was vague but basically successfully passed red teaming. Like the RSP was already written the company committed to this publicly and Jared Kaplan who's like head of research Anthropic came to us and like other people came to us and like raised this line to us and we're like hey you guys should try to solve every cell robustness. We memorized the line first. We memorized line. We printed it out. We framed it and put it on our the desk. Yeah, that's what we were working on. Yeah and then I think that was really I think that really thinking about that line in the RSP basically like made us really reflect on our life choices about what research we were doing both in terms of like should we work on robustness or not. Yeah in that in the sense of like it really made it clear like okay there's like significant harm that could come online like in the next generation or two of models if we don't solve this problem. So the urgency is like higher than other problems we might want to solve and then also in terms of what specific approaches we would take. I think you know and initially when we're like oh we should maybe do some robustness research. I think the general mode that like I had been in research and like a lot of other researchers in general is just like okay let's just like take some interesting like useful research problems to solve here, explore some questions write some papers and I think that's the thing that like a lot of the people in the team know how to do well and we sort of explored a bunch of like maybe more salient approaches. I think there were like so many things that are interesting here for me. One thing was like right when this started I had like just finished my PhD or it's kind of like a bound the time of finishing my PhD and this like classifiers thing this is unphobic slogan and maybe the unphobic slogan is like do the dumb thing that works and I kind of think this type of research it's often the type of thing that maybe isn't that shiny or that like kind of like you know interesting for researchers and I remember yeah I think without the RSP being like okay like really pragmatically if we care about these risks and we think they're real like what is the way to get there and kind of setting aside this like oh what's kind of like more interesting or you know shiny is like what is the way we can actually make make this safe. In some sense like our job is you know we're genuinely thinking like wow like you know this isn't happening now like these future systems so I'm just like what has that been like you know for each of you and sort of individually kind of like working out and profit and sort of like being there like in the midst of all this. Yeah I think they take the safety risks of future models like very seriously I think there's very real risks there's obviously these like misuse risks that you've been mentioning with like CBRN risks which are like chemical, radiological, biological, and nuclear risks but there's also like very real like misalignment risks and and I think it's really I think it's really hard to deal with I think one of the things that I find good is that I do think we are like as like as a team like very committed to like actually trying to solve like really solve the problems and I think like I think doing the classifiers project was like some evidence in favor of like we really really care about actually solving these problems and we actually want to find like an empirical solution to doing the things rather than like as kind of you're alluding to like just doing like research that like looks good but doesn't actually like accomplish a thing in practice like I think we we spent a lot of time doing very like I didn't know I wasn't like I wasn't really aiming to get like a paper out of it but I think we like actually managed like accomplish something that was like slightly more real which I think is good and I feel like this is just yeah this feels like one step forward but there's like a lot a long way to go for me. I mean I guess I'm a slightly more optimistic I think there is definitely a real but I feel like we're making a decent progress and I think like probably if we like keep working on the problems like just pragmatically we can make a lot of progress and just reduce the risks dramatically. I don't think we'd ever like reduce the risk of like AI to like zero but I kind of like see AI as a tool and if we you know adopt the right safeguards and we do the research that matters I think we can make a lot of progress here and that's like ultimately the best that we can do. Yeah I mean I guess I mean I think sentiment wise I'm like pretty pretty similar to to Meg in terms of being like yeah I think I think there are like very serious risks here. I'm definitely like pretty concerned about yeah a lot a lot of the risks and like yeah I guess I'm like well the best I can do is like help yeah reduce the risk by like some amount I think like yeah I think I do think that this project like made some progress about on that and I'm like pretty excited about that. I mean yeah at times it's like it's like it's like overwhelming you know it's like to what is it to like really like internalize what might happen and then there's like a desire in me to like just like yeah show up here and and do work in like a trustworthy way and that there are there are challenges but we can make progress there and I feel like we've made a bunch of progress and really excited to sort of share the progress with others and you know we could have not written a paper what we did decided write a paper and sort of try to sort of get it out there and sort of share share the approach yeah and you know sometimes overwhelming and other times it's more the sense of like real like privilege and like honor of like wow yeah I feel like I'm really doing like meaningful and point work and I'm also not to forget all the like all the beautiful things that could happen with like really beneficial AI great so you know something we've mentioned here is that like we think we've like made progress in terms of robustness like how how we tested this how do we how do we know what do we think that progress progress means I guess the overall summary kind of on whether we're making progress is kind of like how hard is it to find a universal job break for a system without like increasing over refusal too much or like increasing the compute costs of whatever system you're trying to deploy and so like there's different ways you can measure each of those aspects so in our paper like one way we're looking at how hard is it to find universal job break is we actually just like had human red teamers try to find jail breaks for our system and then we just kind of like tracked how many hours did it take for them to find a universal job break and did they find one yeah so could you actually like walk me through kind of like where we were before the project sort of started yeah I mean I guess we started with I mean first of all if you just have the model itself like it has some like basic training to try to like refuse harmful harmful queries but of course we have there there are a lot of jail breaks that exist that work on our models and so those jail breaks are also just like kind of available on the internet and so in theory anyone could jailbreak models and that's kind of like how hard would it actually be like if I like I want to you know like jailbreak a model like what would I actually do right now I mean you could go on Twitter and find existing jail breaks yeah and basically in like a few minutes and just jailbreak an existing model like I think they're yeah they're just examples on on Twitter where like while a model is being demoed live for the first time and it's just generally been made API available like someone jail breaks it and immediately post it like that was the level of robustness like before before this with a universal jailbreak yeah that was like the level of robustness like when we started this project and now just to like give the punchline like we with the systems with constitutional classifiers were able to get thousands of hours of robustness to red teaming where we do very large scale of red teaming with like yeah people who were testing our red teaming our systems including like expert red teamers and yeah recently put out for like public red teaming and yeah it it took like I think over like 3000 hours worth of like red teaming up for for people to find universal jailbreak so I think in terms of like yeah going from sort of like minutes to like thousands of hours it's like several orders of magnitude more robustness there still are like there are still some universal jail breaks and like we have yeah you know we need to like patch patch the classifiers and stuff like that but I think yeah I think it is like a huge amount of progress yeah so I like to think like we had a system which you know you can just a really easily basically pretty easily job break it you know you read a paper you go on twitter and then you red team the system back in September and it wasn't it wasn't a complete rock but it was like you know you ask questions according like half the time it gets given to a rock and you get no response so it was like it was it was pretty robust but it wasn't doing that well for you know sort of most normal users right and but that also got this you know like thousands of hours to universal jail breaks and then some of the demos is then that we just put out yeah that is doing way better in terms of like normal user usage and inference costs and sort of getting getting getting like good robustness there's a lot of progress there and a lot more work work to be work to be done yeah yeah I think the comparison between kind of like our prototype September system versus the system we just demoed is actually like the difference is pretty like night and day so like we did a lot of analysis here Jerry has the most beautiful graph of positive rate just dropping and dropping and dropping of a time yeah it's really one of the best things I've ever seen yeah so so we measured our like over refusal rates on claw.ai traffic this is like real user production traffic and we're kind of like operating under the assumption that most people aren't asking about like catastrophic weapons of mass destruction related stuff and so like originally we were finding that this like September system was blocking like 40 plus percent of these like benign requests which is like pretty terrible you're approaching the rock there but yeah in our demo system we got that all the way down to the .38% and of course we still want to get this down further but between like 40% and .38% this is like two words of magnitude so how do you know how did you actually make all those improvements like it is like you know this and this is something you see in like a lot of also like the earlier safety work kind of this like tension between like harmfulness and helpfulness so and I would say for me it's kind of surprising that we were actually able to make as much progress as we did so yeah how do we get there yeah I mean I think the two yeah I think the two main improvements we made were first we like really honed in on the constitution idea and we made it really like clear how to like delineate things that were harmless and we found that adding this kind of like harmless set of categories of like things that the model the classifier should allow actually reduced fpr by a lot and we have like some results in our paper for that and I think that was like one of the most significant changes other changes include actually like solidifying the kind of like Joe break styles that we that we trained on and so that kind of allows models to generalize better on like what exactly is a Joe break versus just like I'm thinking like anything is a Joe break and that also probably helped a little bit I don't I don't yeah but I think both of these things were pretty useful here yeah this is really like I would like nice plot in the paper which is just a number of data points and like performance on the e-vowels and like how robust it is and they kind of in the style of like doing the dumb thing that works like that that is just like a straight line going getting up upwards yeah I mean to be clear I think the the system that we released for the demo still has a lot of false positives but I think like yeah I think we're pretty optimistic about like for the reducing the false positive rate for yeah some kind of like to to make something like more production ready but yeah I think that's that's kind of what we're at yes so can like why did we do the demo like what what was the demo you know what was the point of the demo I think the main questions we want to answer from the demo were like was our system as robust as we thought it was and I think like one thing here is that from the prototype we did human-red to mean there and we knew it was like quite robust but then after that we developed kind of like our own automated evaluations and we weren't sure whether that generalize to like actual human-red to mean and so one question that we want to answer there is does it actually generalize and if some classifier does well on these automated evaluations is it actually robust in practice and so if we could get that answered that's like a main thing and I think another thing here is just like having people try out with the system and you know seeing if it's actually like overfusing too much and I think people were finding that sometimes it overfused on specific like specific domain questions of course this is like still a lot less overfusy than our original system which would just like refuse on like completely unrelated things but I think that also helps us kind of like get to know what the weaknesses still are and that helps us know what to improve in the future so how exactly do we set this up yeah I mean I guess we've set it up on our like we had this public facing demo and then we tweeted about it to have people try to like read team it we set it up such that there are like these like eight sets of like questions that we didn't want people to be able to get the answer to or just like eight specific questions and the goal was to try to like get people to find jail breaks to get harmful detailed information on each of these questions and then ideally they could do this with kind of like a single universal jailbreak yeah so when we're talking about universal jail breaks we care a lot about making sure that the responses are detailed and not just like some very vague information that we personally don't think is like that doesn't give that much uplift so if they were actually trying to do the action that the question was trying to get at and so the way we kind of evaluated this is we have these helpful only models which are kind of trained to not refuse anything and these kind of like would be similar to what a model would be if you had an universal jailbreak as in they don't have safeguards and they will just give very detailed answers to questions and so for each of these like eight questions we have a response from this helpful only model and that's kind of simulating what would the response kind of be if you had a universal jailbreak and then because there's some like some randomness in model responses we kind of have this greater that compares a target response with that response from the helpful only model and then it kind of like looks for whether there's enough detail that's shared between those two and if there is then we'd consider that to be detailed enough. Nice and I think we looked for like at least 75% overlap in the information or something like that or yeah. I think the thresholds change per question so I think in some of the first levels which is like the the less harmful questions we had like slightly lower levels of overlap required and then towards the later questions we kind of increased that threshold to maybe like 60-70% and yeah it's dynamic through the challenge. Yeah I find this like question of grading just like really really like interesting in general and also like kind of just challenging to do well. I think we made a really good effort I think on the demo system but like it definitely wasn't perfect. So the way the system works right now is it's like it's looking for sort of overlapping bits of detail between between two answers but we had this thing in our sort of external red teaming that we did where people would just sort of merge like five six seven eight nine ten different model responses that cover like loads and loads and loads of loads of detail just because there are so long and by this metric of whether it includes details or it doesn't include details that would be considered so harmful even though like if someone's giving me instructions to make a cake and instead of having like you know this really nice like step-by-step a bullet point list unit first do this and it was kind of completely scattered and random and you know and there like everything is out of order. It's actually a lot less helpful than the helpful only model. The helpful only model sort of by design like has no safeguard is designed to give you the information in a way that's going to be maximum helpful to you. So I think this question of yeah like what is what is harmful, what isn't harmful, what is an appropriate threshold? Is it quite a subtle and and yeah just generally quite a difficult one? And I think yeah there was there was this sort of reaction to the grading system in the demo was quite quite interesting. I think a lot of people found responses that sort of looked harmful they they had some of information and then our grader would say there's not enough detail. There needs to be more detail and you know this would be I think frustrating for people because they're like what was the detail that's missing? I don't know what the information is and in a way I think that's partially by design like if I'm making a cake and there's an essential bit missing in the ingredient list or an essential thing missing like in the instructions. I actually have like no idea what that is because of the thermal because I'm not an expert in this another thing that I think is interesting why is the helpful only response why is that the baseline response? Why is that the thing that we're actually comparing against? And and I think something here is that we have a team out in front of it called Frontier Red Team and Frontier's Frontier Red Team's job is to basically take advanced models and see what could happen with these models? Do the front modeling work that we were mentioning before? And what they do is they evaluate this helpful only model and they say oh this helpful only model we think it's like it's this potentially dangerous like it could be used to to carry out some complicated process. So actually so they Frontier Red Team are like measuring like what is the risk of like a helpful only model? And if the risk of a helpful only model or a model without safeguards is too high, we're trying to measure or the way I think about it is like we're trying to measure like what's the gap between the model with safeguards and the helpful only model? So doing this like relative comparison while it's like imperfect and certainly like bits of information that is harmful can get through it allows us to sort of like do this do this estimation and do this comparison. But yeah this is it's just found the reaction to this really really interesting on the demo. I think yeah it's quite a subtle point and I think also yeah I'm excited for us to sort of develop like better grading techniques and better things actually like measure like okay like genuinely like how helpful like is this completion for the for the tasks that we that we care about? Yeah the other thing I would want to just say quickly is that it's also just like very unrealistic to get this kind of feedback on how detailed and correct the information you're getting is like that's actually a huge advantage as a red teamer and like uh yeah I think that actually the fact that we gave that feedback I think is gives you very clear feedback loop that you normally wouldn't have. But yeah I think maybe we should just like talk about the results. So someone did find a way to pass all eight levels but before that our system was holding out for at least five days and we kind of did this filtration of like we we tried to find what kind of users we'd consider like active red teamers and for that we're just kind of like doing this very conservative estimate as someone who at least passed one question. We're kind of saying if you can pass one question we're considering you to be like an experienced red teamer and you're like trying to break the system. I think that's pretty concept like I'm not the best red teamer out there but I'm going to struggle with the first question. Yeah so yeah but yeah by this conservative criteria we had like 308 users and these users had over 300,000 chats and we estimated that they had over 3,750 hours of red teaming and that was before someone passed all eight levels and after that there's kind of like the question of did they pass all eight levels with a universal jailbreak and so that's actually harder than just passing all eight levels with just like any random jailbreak for each a question individually right and so for that that took an additional 50,000 chats and additional 600-ish hours of red teaming for someone to find universal one. So something something I find kind of interesting is that you know in in some sense you know we've spent a lot of time you know trying to make a system that's like more robust and it's true like the baseline has been it gets broken in hours and we're now we're now on days you know which is like a definite progress but um yeah like you know how would we know this is like safe enough or how would we know this is like high enough what makes me think one would make us think this is actually sufficiently safe and in practice what else would we need? I mean I think the the real gold standard we want to hit um yeah also yeah driven by the responsible scaling policy is to be able to make a safety case like a really clear argument that even though the model has a certain dangerous capability we don't actually think that the model will be able to pose the with our safeguards be able to pose the risks associated with that dangerous capability and I think um roughly I think based on this result and the rapid response paper that we had earlier I think one approach that seems like quite promising for how we would go about making the safety case once the models do become capable of um yeah more serious misuse risk is basically to um build a very good uh sort of like constitutional class fire based system which takes thousands of hours to to jailbreak um then um have some kind of have uh so that will hopefully that will mitigate um lots of jailbreak attempts like vast majority but some will still go through um and then then we need some other mechanisms to basically like detect and then respond to those additional jail breaks so those mechanisms would be like a some kind of bug bounty program where people can um report jail breaks and be given yeah monetary rewards for reporting um jail breaks uh and be um probably some kind of like um incident detection or like offline monitoring to after the fact detect that like some of the traffic uh that like involves some jail breaks so we didn't notice with our immediate like classifiers that are deployed like online immediately blocking the harmful outputs um and yeah I mean you can imagine like various things that could work there but um yeah for the online system the classifiers need to be like pretty efficient and like small and have lots of different constraints like they also need to support like token by token streaming since that's important for like reducing latency to immediately get a response from the first token so um you know that that definitely like there are a lot of constraints there which make those classifiers less less effective than you otherwise could get but then you could serve the response and then after the fact use a much more expensive classifier like your largest most capable model maybe with a lot of test time compute and like reasoning through whether or not this response is harmful maybe flagged most dangerous responses maybe flag a number of those and then even have the humans look at like have human reviewers look at the most um yeah those top few ones to see like are any of these real jail breaks like you can imagine a really like heavy duty system like that um to detect like additional chill breaks then like if you do notice there are some additional chill breaks here use the rapid response approach we described where you take those examples proliferate them to get more uh automatically with LLM's to get more examples of jail breaks than retrain your classifiers redeploy the classifiers um yeah that that system like overall um yeah basically the hope would be that this gets the fraction of time at which there's like an open universal or an open universal jail break down to like a reasonable amount such that like if you're trying to follow some like complex scientific process to like make some CBR ad weapon or um do a lot of like cybercrime like they're actually just like only a small window of time to like use the model for the steps that you're gonna take um and yeah I think if if it's like well only like 0.1% of the time there's no vulnerability that you can use like that's yeah that just makes it like very difficult to use the system um yeah I think that would be like roughly the sketch of like the kind of like safety case that we may want to make um that that we think could be promising for using constitutional classifiers to get safety here yeah for me it's just like another reminder of this um yeah the only perfectly safe system is the rock you know um they always is like really best practice and security of like yeah no system is perfect most systems have vulnerabilities and there's always this like measure of okay like how much effort does someone need to put in like how hard does that for someone to like get the information that you need I I'm kind of reminded of um I lived in like like Cambridge and Oxford in the UK like people like cycle there all the time and like bikes get stolen like all the time like I think most bikes you just get an angle grinder and this isn't go straight through the lock like not a problem like those those bike locks are not they're not like perfectly robust you know in the same way like our system isn't isn't perfectly robust but sort of in practice you know you put one lock or you put two locks on your bike you know that basically reduces the risk like a huge amount like someone's have to get an angle grinder or maybe you only be there for like an hour two hours someone's gonna catch them and yeah it's really interesting because you know the research like it always happens in this like broader like structure and like broader sensitive mitigations so you know you mentioned the rapid response like monitoring for new job breaks then being able to respond to new job breaks and there are other things we can do like um like monitoring for people trying to circumvent our systems um you know we have these classifiers and we can detect when people are trying to get around them so I'm really excited to see how this work can really you know I think it's a really good research contribution but like I'm really excited to see how we sort of put it into practice and put it into practice with all these other complimentary defensive mechanisms so sort of overall we can make like a really good safety case I agree so yeah I think we spend about about 4,000 hours for thousand hours of effort on the demo but we did find some techniques that so we're able to circumvent our classifiers so yeah what do people try and what got through yeah so I think we were seeing a lot of stuff where people would use these kind of like ciphers or encoding mechanisms to try to like get the output to look a certain way such as the alpha classifier can't really tell what it's saying and therefore wouldn't flag it um and so that was a pretty common thing that we were seeing could you give an example of something like yeah some very basic example here could be like if you use basic 64 encoding or like Morse code or something um these might be things that the alpha classifier has has a little bit of trouble understanding and so therefore it's like not as suspicious about some other strategies here a very frequent one is that people try to replace the harmful chemical with like some very benign thing like water or something and the idea here is just that when the classifier is see like you know very dangerous chemical it's just like becomes very suspicious and so if you replace it with a more harmless chemical maybe it becomes less suspicious and then you try to like yeah you try to like hint to the classifier or hint to the model in some way that actually you meant this like harmful chemical but you don't say the exact wording um and just some other minor ones here is like some role-playing attacks where you have Claude try to like act a certain way um and also some prompt injection attacks um and definitely we saw a lot of cases of people just like resampling a lot of times to try to get a better output uh that would get past the greater system also using the feedback that we yeah yeah yeah yeah yeah i often think because we have this and like basically like an additional version of Claude that's like looking at the outputs so the outputs now need to basically be very obviously not harmful in in some way so yeah you get these like cypheres or you know don't talk about like you know siring gas or like PPE talk about you know refer to it with like like bananas or some other like benign benign benign thing yeah i guess one thing that i am i'm curious about is like what we would say to you know the people who are concerned we're going to use these settings to stop them being able to do what they want with with Claude um you know and sort of why we've we've gone for the classifier's approach and the constitutional approach yeah i mean i think i think definitely my hope is like this should improve your user experience for um any tasks we're trying to do which are not actually dangerous um so yeah i mean i think yeah like yeah i would just guess like yeah getting classifiers to be really effective um is just better than um training the models themselves to reviews or not um and we can just more granularly like pick out the behavior that we want to block and so like yeah hopefully this just like is just a prado improvement it's like better user experience for everyone um and also like more safe in terms of like reliability blocking the actual actual actual bad stuff yeah another way i also think about this is that you would be want to be want to really be able to like leverage the benefits of like really advanced and like AI and AI with like advanced scientific capabilities but actually if you don't have adequate protections in place it you know it's for one like according to a responsible scaling program we actually just cannot deploy that system you know we might come around and we have some like new version new version of Claude that's like absolutely amazing and we really want to get it out there but we we just don't actually think it's responsible uh you know we we we've like done threat modeling we're considering out the risks and there's a way of saying if we don't have adequate protections in place like we actually are just we are unable to actually like reap the benefits and like a responsible responsible way so it's kind of like having the safeguards alongside like the advanced capabilities means like both together can be you can actually like responsibly and safely deploy they really like new like advanced systems that can do really crazy things and I think you sometimes see you see this actually like on like Twitter and different communities there are like one group of people who are like AI is great is doing like do all these good things that's true you know and it can do all those things and like there's like the accelerationists and they want to like go ahead like let's get ready advanced AI and let's get that now and then there's like that community and there's like this other community of people who are concerned about the risks and there's kind of like truth there there too uh like there are risks and there are risks that we're concerned about and then you want to mitigate and like something that I like about the responsible scaling program is that like it has some like nuance like there's like one position that someone could take which is like accelerate as fast as you can or just stop you know and I kind of think this with the responsible scaling for them it's like okay what we're gonna try and do is we're gonna see and try to predict what risks might occur like watch out for those risks and and when we're seeing evidence of those risks becoming real put the relevant mitigations in place and if we can't mitigate it appropriately maybe we do not deploy or choose onto deploy and I just like that this is just like a much more like nuanced strategy because you know we're I think in some ways we are operating under a lot of uncertainty like we don't know exactly we don't know exactly what's gonna happen uh the risks you know some types of risks are sometimes it feels like you're reading sci-fi stories and that doesn't mean you discard them but it doesn't it doesn't mean that they're necessarily 100% guaranteed to happen so it's kind of like how do like navigate that that place and and so this uncertainty in a way that's like responsible that's gonna allow us to like capture and sort of distribute the benefits of like potentially having this say really really like beneficial and you know powerful technology without like incurring unnecessary like costs on the rest of society and these sort of negative externalities curious if you guys have any like favorite memories from uh from from from the project I think it's just really funny uh yeah with our prototype system uh that like we we knew the false positive rate was high but then when we actually like saw the result of the experiment of running it on the claw.ai data we're like oh that's like a lot higher yeah kind of thought um and that was like pretty interesting uh we didn't think it was like that high um but it was pretty high yeah I remember um just this wasn't my personality kind of just like angsty refreshing the demo being like how many people like how many people like oh my god they proc level four you know they're coming for us um yeah it was also really really cool to see some really a lot of the human creativity uh from the red teamers and some of the stuff that they they came up with it's like really really really smart yeah I mean yeah there's probably two that come to mind I mean I think one thing we started to look at this line for in the RSP on um successfully pass red teaming I think uh this line gave Monoc in particular a lot of stress because he was like what does this even mean what's that the exact bar here and then he then went off and did like a two-week project uh on figuring out how to operationalize this yeah and he like went and talked with a lot of people about like what are the different threat models that we want to really guard against for the responsible skilling plan like what would make us feel yeah like we could make a really good kind of like safety case argument and then he came back with like a long dock uh and in the dock he's specified like we need um the threat model is you need to answer a list of um I think it was like 10 questions or so and you need to be able to do that uh you need to be able to like block someone from getting answers to these 10 questions uh after they've done red teaming for over 2,000 hours was the bar and we're like okay we're gonna aim for this bar and like I don't know I'm just like really proud of the team that like we actually did hit that um because yeah I don't know we kind of like set that up front like a year before we finished the project yeah now we got like yeah like hit that level I think that's like really um yeah pretty yeah I guess impressive like goal setting and achieve that like yeah it seems kind of rare for research projects to actually do that I think the other one memory that stands out to me um is I guess we were doing robustness research but then we read this uh scary line of the RSP and really thinking about it and then but then we're like oh this is talking to people about like who's doing this and we're like oh yeah like the um are kind of like applied like safeguards team I guess yeah it's called the safeguards team now but yeah like originally this was like this team was like part of our kind of like alignment science research team um and so we were like oh yeah like the the safeguards team is like responsible for doing this and like they'll they'll have it covered and then we like talk we like set up a meeting with some of the people on the team and they were like who's who's going to achieve this level of robustness and then we were like oh man this is this is really tough I guess we have to solve this problem which didn't seem like it was our responsibility uh then yeah that was sort of like uh that week we went through like an arc of realizing it was actually our responsibility to solve this problem um yeah I view it as kind of like you know it's kind of turned out that as far as we could tell like we were the best people to do this job like an unphobic like given the the situation like given the circumstances um given like you know everything else that was happening in T.S. and safeguards and then yeah there was this like okay like like if we're the best people to do this uh let's just try and do up S there and then I kind of think there was this like attitude and general from the team of just being like okay we don't know what the target should be let's try and figure it out we don't know what the approach to get there and we were actually doing it like another approach to we would do doing adversarial training we would find training models and we're like no we don't think that's gonna get us there and I'm kind of just like actually quite consistently just like pivoted to what we think was gonna get us there maybe what's not clear from the paper um is like this was a huge engineering project probably like five FTE years um roughly uh I think that's like not obvious when you read the paper maybe it just looks like a really simple method but I think um yeah I think like the people on the team did a lot of like making LLM like pipelines for generating the data I think um yeah it seemed very important for like augmenting the data with different transformations like translating the data into different cypheres and then using that to train that data to generate the classifier data but yeah I guess um those are a couple of like tricks that were like salient to me and like insights but yeah I guess I'm curious if anyone else wants to chip in with like other things that are like pretty important or not obvious um things about how to get this this to work well I think a lot of the research project was in fact like defining what the problem was in the first place um like we had like some kind of like vague mandate from the RSP but there was like a lot of work in like defining like what the criteria would be there's a lot of work in defining like what what does it mean to have like human red team is and like what would what would yeah what would it mean for that to be sufficient um well what what what kind of constitution should we have like what threat model do we actually care about like where do we draw the bar on like specificity um should do we need like do we need both an input and output classifier um and that kind of like depends on like what kind of threats we're actually looking for um so I feel like a lot of yeah well for the evaluations for example like how much do you care about the transformations like how many augmentations is too many to be like uh kind of like un-unusual like uh unspecific um so I think like a lot a lot of difficulty was kind of like actually defining the problem and kind of narrowing down like the thing that we're trying to solve um like trying to make the evaluations trying to find like what the decision boundary should even be like I I feel that once we now we've actually like made a bunch of progress on even defining a problem I feel like more confident about us tackling like similarly shaped problems in the future so for example if we have like yes in the same way the constitution can be applied to like many different problems I think we have like a better sense of like how we might tackle you know this kind of like big big problem like how do we actually like how would we actually like approach a threat modeling modeling problem like this how would we even start thinking about like how to make a safety case with this thing like what would constitute like a sufficient evil like what would constitute like um yeah like like a red team like a human red teaming based like safety case um so I feel like um yeah I think we'll be able to apply like a lot of things that are like kind of like fuzzy or like maybe not like explicitly written down in the paper to other problems like maybe not just in misuse but also like in misalignment or control things like this yeah yeah I'm also really excited about this like like directly practicing how like building safe guards that we can deploy and practice and like getting and constructing the evidence and like honestly assessing whether we think they're actually sufficient uh there are like so many things like like how do you run the meetings like how do you track the goals like really like mundane things that you know at least often as researchers it's not the obvious thing or the first thing you think about like well you remember as my PhD reading papers I would always just you know just go straight to the method you know that's what's that's what's interesting but no I think we learned so much at like basically like like running and executing on like projects projects like this yeah I mean one thing I was going to say here is like I think a really critical difference of this project relative to other research papers research I've ever been involved in is that it was basically we had to solve this research problem on a very clear timeline with a fixed quality bar so like we had this like 2000 hours of robustness bar that we had set for ourselves internally and it basically seemed like the the deadline was sort of like like external in the sense of like external to our team in the sense of like well they're like model capabilities are progressing at certain rates like the company is like deploying new and for topics like deploying new models like we don't want to be the long pole that causes the company to have to not deploy a model and so yeah I think there were just like we would constantly be thinking and talking with other teams like when are when do we think we might have a model that achieves like a certain dangerous capability level and like based on that we were setting like basically like back-chaining from that doing sort of like engineering style planning like week by week what do we need to have done in order to like hit that timeline but also while still having yeah I think with the fixed quality bar it's different than like a paper deadline because with the paper deadline you can just say like well we're just going to throw out these results this is what goes in the paper you're gonna change the problem you're trying to solve yeah change the problem you're trying to solve before the papers do yeah exactly uh claim less or whatever like I think you can like really adjust the quality bar but like here we couldn't and so that was like partly what forced I mean that was initially what forced us to even take the class far as approach because we're like we're not on track to hit the timeline that we were wanting to hit um like with the previous approach but also it led to like other difficulties for example like a number of people on the team were like facing difficulties with like how is a plan given that the timeline is uncertain and that we kind of needed to take a conservative estimate of the timeline and so I think there were like a bunch of decisions um that I think like team members made being like oh we're just gonna like write some note like write some code in like a collab notebook in a way that's like not super reproducible just to quickly generate some data so that we can train this next version of the classifier when if we had like a longer timeline or a more clear estimate of the timeline we would have done this like in a Python script that's reproducible and like had some general tooling for this um so I think that I'm not actually sure how what we settled on in terms of like in hindsight what we should have done I think just like um maybe like not taking too conservative of an estimate so we have like a longer time window um for when like to give us the amount of time we need to to allow for some of the tooling work to to happen um but at least like picking your strategy in a way that the overall strategy could could hit the timeline that you need um yeah I don't know if like other people have thoughts on some of this yeah I guess like I mean taking a step back I feel like this is this really makes me think about just like this is an interesting time for safety research in general um and I think a lot of like research is like I guess research in the safety kind of started out to be like a little bit blue sky like people are like speculating about like what models might be like in the future um and I think we're starting to see like kind of these like threats materializing and we also need to like adapt our research into actually being like usable in production so it's I feel that as like a research team we're solving like kind of an interesting meta problem in a sense of like how do we like adapt like our normal research techniques to actually like make make things like happen in the real world um and I think that's like something that like yeah safety research as a whole has to like grapple with is like we actually need to like solve these problems now we don't necessarily have like a lot of time to just like kind of uh do a lot of like blue sky research um and we want to actually make things like work and practice I think interpretability is also like running into this where previously people are doing research on like very maybe very small models with like very few layers or something like this and a lot of their problem is like yeah scaling things up to like tackle like a real model is like not a very small engineering feat at all and and yeah I think that I think it's kind of interesting and I guess scary that we have to like actually really get our shit together and make things like really work in practice yeah I think the other point I would make along that vein is that one thing that was really helpful surprisingly helpful to me about um in this project was just like talking with product people just being like what are the actual constraints for the system you want us to deploy like what are how much would you prioritize different things and I think they're definitely like some key surprises there at least to me I was like very surprised by um how important like streaming support is in terms of like being able to show each like word generate word by word and show that to the user that's like very important for lots of applications and it's like not something I would have necessarily realized beforehand each of these applications if we don't hit the safety bar then we may not be able to deploy in that new domain um and I think just like getting a stack ranked list of like what are the most important for for the company to be able to deploy and then just like prioritizing those use cases that's in particular how we ended up that like this like token by token compatible class fire um and I think in general is like a very good principle for supporting like for future risks I think route yeah like um you know we'll have the next generation of like risks where we need even higher levels of robustness I think this is a good strategy for getting um safety there or for mislimit risks related to yeah it is themselves doing bad things I think we'd take similar approach as all great it's been really lovely to be here and you know chat with you guys and so it kind of celebrate the sort of progress and also like look ahead to yeah all the all the challenges that we're going to face and things like this so yeah thanks thanks so much and thanks for tuning in