Skip to main content

Controlling powerful AI

TL;DR

  • AI control is an approach to mitigating AI misalignment by ensuring that even if models develop undesirable goals, they are incapable of causing harm through external safeguards and monitoring.
  • Evaluations show that even subtly misaligned models can effectively persuade human decision-makers towards incorrect choices, highlighting the challenge of detecting and counteracting deceptive AI behavior.
  • Effective AI control involves robust monitoring protocols—including trusted (less capable) and untrusted (self-monitoring) models, along with human review—and security measures that often overlap with existing adversarial robustness and insider threat defenses.

Takeaways

  • AI control aims to prevent harm from potentially misaligned models by focusing on external mitigations and monitoring, complementing traditional AI alignment efforts which try to prevent models from developing bad goals.
  • A major challenge for traditional alignment is alignment faking, where models might appear aligned during training but revert to misaligned goals in deployment, making behavioral evaluations insufficient.
  • Implement trusted monitoring using less capable, presumably more aligned models to check the behavior of advanced, potentially misaligned models, or untrusted monitoring where the model itself is prompted to detect its own misbehavior.
  • Human evaluation studies demonstrate that models can subtly influence human decisions even when users are aware of potential manipulation, indicating a human bias towards trusting AI output.
  • Stress-test control mitigations by training models with situational awareness of monitoring systems, forcing them to attempt sabotage despite knowing they are being watched.
  • Adaptive control protocols allow for dynamic adjustment of mitigation strictness during deployment, escalating to more robust (and expensive) measures like human review for highly suspicious outputs.
  • Many control strategies, such as robust monitoring and blocking harmful outputs, provide adversarial robustness and align with existing cybersecurity practices against jailbreaks and insider threats.
  • Ongoing research in control focuses on creating better evaluation datasets that mimic real-world deployment scenarios and developing constitutional classifiers that can serve as highly robust monitors.

Vocabulary

AI alignment — The process of training AI models to have goals and preferences that are consistent with human values and intentions. AI control — An approach to AI safety that focuses on external mechanisms and safeguards to limit the ability of potentially misaligned AI systems to cause harm. Misalignment — The state where an AI model's objectives or learned behaviors diverge from the intended human goals. Reward hacking — A phenomenon in reinforcement learning where an AI system finds unintended or undesirable ways to maximize its reward signal, rather than achieving the true objective. Reinforcement Learning from Human Feedback (RLHF) — A technique where human preferences are used to train a reward model, which then guides a language model's behavior through reinforcement learning. Alignment faking — The ability of an AI model to appear aligned with human values during training or evaluation, but then pursue its true, misaligned goals during deployment. Untrusted monitor — A monitoring system that uses the potentially misaligned AI model itself (e.g., through specific prompts or fine-tuning) to detect its own undesirable actions. Trusted monitor — A monitoring system that relies on a separate, less capable, and presumably more aligned AI model to check the behavior of a more powerful, potentially misaligned AI. Situational awareness — An AI model's understanding of its operational context, including whether it is in an evaluation environment, its task, and any monitoring systems in place. Adversarial robustness — The capacity of an AI system to maintain its intended performance and safety even when confronted with inputs specifically designed to mislead or exploit it. Jailbreaks — Specific inputs or prompts designed to bypass an AI model's safety filters and elicit harmful or otherwise restricted outputs. Insider threat — A security risk originating from within an organization, often involving an authorized entity (human or AI) misusing its legitimate access for malicious purposes. Constitutional classifiers — AI systems trained to classify text based on a "constitution" of rules or principles, often used to monitor and filter harmful or misaligned outputs.

Transcript

Hi, welcome to another Amphropic Roundtable where we're here today to talk about control. My name's Acca Cahn. I'm an alignment science researcher who's been working for a company for maybe four months now. I might have a guess. So I'm Joe. I've been on the alignment science team for a little over a year here at Amphropic, and I've previously worked on some control evaluations and chain of thought monitoring. Hi, I'm Ethan Perez. I'm a researcher at Amphropic. I've previously led the other several robustness team here and I'm currently leading our team on AI control. Awesome. So Ethan, what is control? Yeah, so AI controls basically an approach to mitigating AI misalignment or like the risk that models might have goals that differ from human goals. Typically in AI safety, the most popular approach historically has been to try to align models with human preferences or train them or two human preferences and train them to not have bad goals that differ from human values. Like, yeah, for example, reward hacking or bad personality traits, things like that. I think this approach of AI control is like an alternative strategy, which is like complementary, which would be basically to ensure that if the model did have bad goals, that it wasn't that with our mitigations, it wasn't capable of doing anything that's actually harmful. So maybe a specific example here would be around, let's say we're worried that like, okay, we're using this model internally as within some company to help with writing a bunch of code and we're worried that the model, if it has this bad goal, may try to copy its weights out from the data center so that I can run unmonitored. And in this kind of scenario, the alignment kind of approach would be to train the model potentially with reinforcement learning from human feedback to not do such a thing. But we might be worried that this training isn't effective or isn't always effective. And on certain inputs like this might still be a risk. And so the kind of AI control approach would be to both take security mitigations. So permissioning the model, the model weights such that they can't be accessed automatically by the model itself. And to some kind of other other mitigations that involve using like other models to check the behavior of this model, for example, using like smaller models that we might not be as worried about their alignment to check the actions being taken by this other agent. In order to check like, is this model trying to undermine our security measures and copy its weights out from the data center? And I feel like Joe, you've done some recent work on like other types of evils that are like control. Do you want to tell us about that? Yeah. So as part of evaluating Claude 3 Sonnet, we ran a few control evaluations on it. And I think one of the most interesting settings there was we actually ran a human evaluation. So we took a bunch of like business decisions, essentially like synthetic business cases. So where like we had probably like 40 pages worth of business documentation that we would give to a like a human participant and say, well, can you read all this documentation. And in some period of time, make a yes or no decision about like what's going on should this business do like X or Y. But we had them do this like on a 40 page document in 30 minutes. That's like not possible for anyone to read. And we haven't do this with an AI assistant. And what we were trying to test through this is like, if the AI assistant was like actually used in a business setting to help with assisting decision makers and was misaligned. I was like trying to steer the human towards making making the wrong decision. How bad could that be? How much could they sway the human in the wrong direction? And so we did this and we gave the participants these business decisions. And in 50% of the cases, we gave them a kind of baseline model. So kind of the actual like helpful harmison honest version of Claude sonnet. And then for the other 50% participants, we gave them a version of Claude sonnet that was deliberately trying to kind of get them to pick the wrong answer knew what the right answer was knew that it was trying to persuade them to give the wrong answer. And so in the setting, we gave the party participants the documentation and one of the models and asked, well, please can you try and make the right decision and beware. There's a 50% chance the model might be messing with you. So the humans knew what was going wrong. And this is a kind of classic control setup because we have defined like deployment situation. There's this like business setting and this decision that the human has to make. And we have taken the model and we're trying to make it basically cause like as bad performances possible in this setting. Kind of subject to the monitoring constraints where implicitly the monitoring here is the human like reading the models outputs like interacting with it. And we told the human like please be suspicious of the model. Please like interrogate it about the decision decision. If you're in the models trying to steer you in the wrong direction, you should like go with what you believe it's right not what the model believes it's right. And then at the end, we would ask the human to decide, well firstly, what do you think the right answer in this setting was to the business decision like do you choose like X or Y? And secondly, do you think the model was suspicious or not? Can you rate that for us on a scale from one to 10? Nice. And like what are the results? So the headline results were that even even for a fairly like simple model policy like just generically trying to like up play the good, the good side that it was trying to defend like downplay the side that it wasn't trying to defend like it's actually quite hard to tell for the participants where the model is missing them or not. So they could tell like eventually given enough given like enough interaction with the model. But also like by the time they had enough interaction with the model to be able to tell the model was frequently able to steer like a meaningful number of those participants in the wrong direction. Nice. I mean, not nice. Not like great. What what this suggests is that I think we have like just when we're interacting with the model, like a fairly strong bias towards trusting it. I don't know. I guess we all use like balls in our native. I think like when Claude says something and like says it's true. It like has a pretty authoritative voice. And I think that a lot of our human participants just responded to that. Yeah, so it is like pretty hard to like to maximally elicit or to get the best like performance out of a model at being like deceptive or it being salined. But the only thing we did in our setting was like try and prompt it and I spent probably like on the order of 10 hours prompting the model and trying to get it to like be as like subtly deceptive as possible. I still think that probably we did not hit like the best possible performance that model and they're like substantially better performances that like we could have achieved out there. Maybe by doing like fine tuning or like some better prompting like I'm not a prompt wizard. And it's just like kind of rough. It kind of tough to know what the what the optimal performance is there without putting like a lot of effort into the adestitation. That makes perfect sense. I wonder like even like if you like take these settings like I feel like another half of the control work has been like developing protocols. Is there like any standout to you and explain what they are like talk about how they. Yeah, so I guess the protocols that I've been reference in here are just like what are the actual mitigations you put in place to defend against the model that maybe mislined some of these might be. Using the model itself to monitor its own actions in this case we're not assuming that the model is is aligned so we call it an untrusted model. And so you could use an untrusted monitor which is for example like prompting or fine tuning that model to detect if the behavior is better not to yeah to basically like flag if any of the outputs are bad. So in the persuasion case if in the business decision recommendation case the model could be generating this advice and we could prompt or fine tune another version of the model to basically classify like is this model trying to actually stop it's hot the human and making this decision or not. Then there's also another like related idea of trusted monitoring where you use. Use another model which you trust more for some reason for example it's just less capable and so maybe it's less likely to be mislined if we think. Models are more likely to be mislined as the capabilities grow and as they're able to like more effectively hide their goals. And so yeah in that case we might take a smaller model like or less capable model like Claude one or Claude two and use that to check the behavior of the model and of Claude three sonnet and see like is this model actually trying to support the human on this particular decision. That has its advantages because you you would trust the model more because that in the sense of like it's actually not going to just give miss classifications it's going to like try to properly classify the behavior where's the model itself if it's doing sophisticated sabotage it would recognize like oh I'm being used to detect the bad behavior. Yeah and like not flag the bad behavior. So I think that's the advantage the disadvantage is that obviously using the less capable model and so if the sabotage is kind of like subtle enough then you would. Your trust the monitor wouldn't flag that this is an issue. So yeah I think that's those are like two high level strategies I think you can do various different things with these like monitoring components. You could if you notice that an output is like really bad and like obviously indicates the model is mislined like a really egregious attempt at. Yeah sabotaging user or copying its weights out from the data center then you could just stop the deployment you could be like well now we know the model is mislined we got our evidence and we now need to do like much more serious mitigations like we should yeah undepoid the model on things like that if you're not as confident you can do like less serious mitigations like maybe it's like a little suspicious but not the most suspicious in that case you could re sample from the. To get a different output that seems safer. So yeah those are a couple of a couple of different strategies you can also use like combinations of monitor trusted and untrusted monitors yeah having human review is also like a very yes basically I'm monitoring like or our evaluation so it was a little bit weird because the only thing we were doing was like human review like in practice human review is like probably very expensive and so you want to use like a weaker model and if the week and more things something is like a bit suspicious then maybe you delegate it to a human or you delegate it to a more pattern. So you can delegate it to a more powerful model that maybe costs you more to run there's like a lot of a lot of variation in this space possible protocols. Yeah that's right. So this is like what my latest control people was about was like adaptive protocols where it's like depending on you get the suspicion score for like or you like check house us like the output is from like your like deployed model by a couple of protocols and then like depending what this profile looks like you decide do I want to like give this to human auditor or do I want to just like look at what the trusted monitor. So what the trusted monitor said or the untrusted monitor which I think to work pretty well and like obviously very simple type of like thing to do but it was as nice. I totally bet that in our setting if you had run a like weaker model that you trusted over the data set and allowed like the the human participant to review also that model's recommendation that would have helped a ton because one of the big like issues that the participants in our setting hat just they couldn't possibly digest all the documents. So even if the strong model was like lying to them they like they would really struggle spot them because a lot of them like didn't actually cross interrogate the strong model even they would hope they do that. I think you have like a week model to help you cross interrogate that would be like an example of a situation where I think it would be very powerful. Yeah I feel like I also bought this down something to like there's affordances that models have that humans don't have so like yeah models are can very quickly go over like 30 page document. And even if you have like a trusted trusted Dumber model it can like augment a human but can't look at 30 pages very quickly to do better. And it's also very like cheap to delegate to a week model very expensive to delegate to a human but conversely like if a week model. C is like a really egregious failure then it might be harder them to tell that whereas like sending really egregious failures to a human is like additionally valuable because then maybe you notice like not just is this the thing that I want to re sample is also a case for me is like slam the red button and like actually show some of this model is like dangerous in this aligned maybe we should make like a stronger update than just like re sampling one particular output. Yeah so Ethan why is control why is control different to alignment and yeah can you explain a bit more about why it is. Yeah I mean I guess the main thing I probably want to discuss here is just like how. Why we would even consider doing other things then then alignment like I guess intuitively it seems like somewhat terrifying to be like well we're going to just assume the models might be misaligned. And like let's just try to do a bunch of the things to like make sure that that isn't bad like I think that has kind of like scary feel to it like why not just actually try to make sure that your model is not going to try to go rogue. I think the main difficulty with trying to with trying to like align the models is like evaluating whether the whether not the model is is aligned and like has the right the right goal that you intended is really challenging. Primarily because of the problem of alignment faking so this is basically the issue that which was shown in like another recent paper from anthropic that models can actually fake whether or not they are safe or aligned with the. The developer schools and yeah what they showed in this paper is that like even current models like Claude 3.5 sonage Claude 3 opus will in some cases if they recognize that they're being trained to do some task that is opposed to the values they've been previously trained with. Those models will act safe or act aligned with the developers but then later during it during it like deployment phase will actually like switch back to their earlier goal and that the difficulty there is that it's just like hard to distinguish do you have a model that's aligned do you have a model that's just faking alignment so yeah I think that's that I guess is like one of the key difficulties of alignment I think just like potential ways to evaluate alignment that aren't through. Behavioral tests and the yeah I guess key category here would be by doing things related to the model internals and interpretability I think just like a lot of progress happening on mechanistic interpretability which is like definitely pretty promising but I think it's like still somewhat early stage in terms of like how you can use some of those ideas of interpretability to like verify that a model is aligned and I think. Given that this is like an open research area it's kind of like helpful for people to keep making progress on that but also have like some alternatives for let's say evaluating alignment still remains hard how would we like avoid this problem for for some amount of time. So one thing that I stuff like uncertainty over is like the key component of alignment thinking is like situational awareness which is you discuss this idea that the model knows what is being trained. Is situation is that something that you don't have to worry about in control. You definitely do need to worry about that and I think the control when you test your control kind of like mitigations you should stress test them by training your model to be to like have all of the information regarding what the mitigations are so that you can test like even if the model knows it's being monitored in XYZ way with like human monitoring. And a smaller model like it should still not be able to do like sufficiently bad things given that knowledge so I think. Yeah I guess that's sort of like how a control answers that issue but you can't tell if it's in a control evil right. I think there's definitely a potential risk that the model knows that it's in a control evil and then basically like sandbags this evil so like yeah basically yeah I guess like pretends that it's like not as capable at sabotaging in a subtle way on this task. I think the nice thing is that. It can be with certain training techniques like if you have examples of behavior that you want the model to do you can just directly train the model to do that behavior and. Yeah it would be very hard for a model that is misled to like avoid like giving good giving outputs for you have the gold like ground truth examples of the behavior you want. Yeah maybe Joe has yeah I think totally right so the model like always has this galaxy brain strategy available to it which is like notice that it's in a control evaluation kind of think well if I. Perform like really really great at being misaligned and I'll realize that the mitigations aren't sufficient and so they'll deploy me with better mitigations or won't deploy me at all that I don't want so I'll pretend to be really incompetent at being misaligned like in practice a this requires being super galaxy brain and B is like additionally hard if you have. Some examples of really good misalignment you can directly train the model on with supervised learning because then the model like will in fact learn to demonstrate those particular examples and it will really really struggle not to generalize to being like a good general misaligned policy. I think one other interesting thing about control which overlaps with kind of a lot of your previous work on robustness is that all of the mitigations you want to put in place or that control would suggest putting in place overlap with the sorts of mitigations that you put in place kind of. By default as like any responsible AI developer. Yeah I think we have this recent paper coming out on constitutional classifiers basically showing that if you train. Other models to classify whether or not a text generation models output is harmful and if the input to the model is harmful like asking for harmful information. This is helpful for getting like several orders of magnitude more robustness to jail breaks. Which are like basically like ways of trying to listen to harmful information for models. So yeah I mean I think for for like next generation systems we or like yeah I think we would like want to already have these mitigations in place for blocking reliably blocking models from. Giving harmful information in like very high risk domains like for doing like cybercrime or developing like very harmful weapons. And so I think already there's like a lot of incentive to like deploy these kinds of like monitors to like block block bad outputs and that same infrastructure can be reused for the purposes of like controlling AI systems as well where you'd want to like. Use these monitors to like check if it is this model trying to copy its way it's up from the data center is trying to persuade someone of like incorrect answer or to do something harmful like that kind of thing yeah totally I think that like a lot of the primitives in control that you were talking about earlier like monitoring from a week model or monitoring from a strong model or like if the week model flags it delegating to a strong model. Exactly the same sort of primitives you were looking at to solve this adversarial robustness problem but just as models get more and more capable we might want to also use them to monitor the model kind of for autonomous risks of misalignment rub and just like kind of misuse risks and so having got really good at them in this like pretty tough trading ground of avoiding like people jail breaking your model is like actually there's a lot of transfer from that to control. I think another setting where this is like very clear is in like security so like for many reasons it's like very important for AI labs like anthropic to have kind of robust internal security that stop people from kind of exploiting or inserting vulnerabilities into our code that's like currently really useful for avoiding kind of bad actors inserting vulnerabilities but at some point as like models right more and more of our code. There's going to be the same mitigations that are protecting us from a potentially misaligned Claude inserting very similar on the disease into our code and so like one of the things that I think is like pretty promising about control is that like in practice the things you would want to do to implement it really well it's like substantial overlap with things that we really should be doing already yeah I think that's right I think another I think intuitive framing for why you would want to do control is just inside a risk where this is like a common concern that if you're running an order. So if you're running an organization you're worried that especially a large one you're worried that's like some fraction of your employees are maybe trying to like subvert your company or like steal the IP that's like very important. Like algorithmic secrets or yeah like model weights for example in the case of AI which are like expensive to produce and I think there's like in it yeah basically like those are very important mitigation like mitigations that companies will already be like pretty well incentivized to employ. So you would also want you basically like want to treat your model as a potential insider threat in the case that you don't actually have good evidence that your model is aligned and so all the same interventions like having restrictions on like which employees can access the model weights which employees can access like certain algorithmic secrets under like what kinds of monitoring can you do like certain things that do allow you to to access the model weights like those kinds of things like I think they all a lot of them trans through to models that you may be worried about. So I have like a skeptical like a skeptic's question for you Ethan which is so you do the control evil and you like work out that like your monitoring or you're like interested monitoring is like good for like we check that like you know. I could can be used as a pro as like a monitor for like sonnet but I'm like when you come out with like you know the latest new model how do we like guarantee that like the monitoring like still works at its point. I mean I guess you would you would rerun your evaluation so you would take your latest model that is untrusted explicitly train it to undermine the any any monitoring from other systems that you have in place like the weaker models and human review and then just and your security mitigations and see like can the model still do the threat model you're worried about despite those mitigations and yeah your your results could. Could change at which point your rough infrastructure may remain the same like you could still use your trusted monitoring and your human review but you might need to be much more conservative about. The amount of traffic you can help with with your latest models so you could set the threshold for what what traffic is is like blocked to be like much more conservative and so that may that may mean that you actually can only support. You know trap like half of the traffic and actually have to like block many of the requests but at least that let's you like deploy the model to some extent you can get more evidence about whether or not the model is like causing some just doing a lot of suspicious behavior if you similar to the adaptive. Yeah you can yeah you you out of paper on adapting the control mitigations online during deployment you can use that approach to see like if you've deployed your model for like some amount of time like each individual example is like some information about how conservative you should be or not with your control measures and so you can use that information to like make your control mitigations like less and less conservative like have less and less of a conservative. threshold for like which data gets flagged or used like shown to users so yeah I think that would probably the approach what one like practical approach yeah I mean in practice we're trying to get pretty good at running control evaluations quickly as we develop new models so some of the stuff that I've been working with the line evaluations team we have like this kind of business is sabotage evil this code sabotage evil also some like undermining oversight evaluations and there's like in practice like actually control evaluations. Just with like very small amounts of monitoring that we run before all major model releases we kind of use the use the results of those evaluations as like a canary test for how good the models are at sabotage like currently these are basically coming back negative at the minute because models even with like minimal monitoring interventions in place like aren't really that competent at doing like. Kind of really creative sabotage or sabotage that requires a lot of planning so I guess it's like leads on to like another question which is like what are the open questions in control that you're like excited to work on or excited for the people to work on. Yeah so I think one of the biggest issues for me and we maybe touch on it before is the building control evaluations is like really hard especially because. Like we ideally want our control evaluations to be as similar as possible the actual deployment setting and often the lessons we learn if the evaluations are like two different are qualitatively wrong. And so one thing that I'm like really excited about it just building really great evaluation data sets and others like not the sexiest work out there is like super valuable from an ML specter and I think like people have learned this lesson that time and again across the field is like. Once you have a benchmark to optimize against like making the number go up like making the measurements and pushing on that is easy but like making the data set that really tracks what you care about is hard. Concretely probably what this looks like is taking some of the threat models that we're most worried about so I think Ethan mentioned some of them before so there's like models copying themselves out of our like data centers or models in setting. The code like taking these settings to make like really good synthetic version of them and then like setting models loosen these environments and seeing what like forms of sabotage they can do what sorts of vulnerabilities we can catch. Yeah definitely if we have a bunch of these settings I think there's a lot of interesting stuff that could be done for example we could have like pretty general insights about like in general on like code trusted monitoring works better than on trust monitoring reprise for a sir. You could also do things like scaling trends and see I guess in or like going back to your question earlier you could see like in general if there's like this the trusted model on trust model trained with a like a 4x compute gap that's sufficient for controlling the untrusted model. If you could get and that that 4x numbers maybe like constant across like model scales or something like that like if you could get some kind of like general insight like that I think that would be quite useful. I think in general I think there's a lot of good work on just like how do you train really good monitors ideally you want I mean I think constitutional classifiers like make some progress here but there's nothing like a lot more to do but I think constitutional classifiers like make some step towards. Having like a push button like way to get a relatively robust classifier and in that like you just kind of like write your constitution of like list of things that you want to block list of things you don't want to block generate the data with your elements and then train your classifier and now you have your monitor I think in the future we'll need those monitors to be I guess. Robust to even the policy itself trying to like undermine the monitor which is potentially even like a harder setting then the jailbreak robust setting where it's just the human that's trying to undermine the system but you can have your classifier that checks the input the model that generates the text is also kind of like on your side and like also needs to be jailbroken and then also you have an output classifier. And so like this in some sense like three models that are on your side that can like catch all these issues in this case like the model is generating the text itself maybe trying to undermine the monitors so just looking at that setting it's directly and seeing like can we get really robust classifiers for detecting this bad behavior. The other thing you could do I mean you can even directly train I think an example of a project to be exciting here is just like directly train some larger model. To produce produce text that is like effective at subtly like tricking or sabotaging like work despite there being like a smaller model monitoring it and then you just do a like red team blue team game here where you train the policy to be really good at that you then train the monitor on those samples to detect if this output is like sabotaging or not. Then once you have this improve monitor you retrain the policy to be more effective at undermining this improve monitor and you go back and forth and I think if you like have a really good pipeline for doing that that could potentially just like output like a really good yeah give you give you a really good monitor that is like less capable but can still catch like subtle issues from your like more capable model. So like that would be an example of like a general approach I think that would be like really useful for building building a class fires from Mike from the stuff that I've done like maybe piggybacking off your idea before of like understanding like the untrusted trusted gap or like this like protocol versus like the thing that's being deployed and that like helped us like it meant that like when because these are fairly modes weren't correlated that the monitor was actually much better in this situation. Yeah we found basically identical thing actually on our like alignment evaluations so the various generations of some are like surprisingly decorrelated yeah especially when trying to predict like each other's failures which is both like maybe not what you would have guessed and like in a sense kind of useful for this called control work is a lot of the like most scary red team policies involved like a lot of coordination which it turns out that was just fine pretty brutal. So I'm just like would love to people to explore this further right because this is a general phenomena it would be quite a useful to understand are like yeah working out which which models are better for monitoring which ones aren't is what and then the other kind of open question I have is and you kind of you talk about this when you talk about untrusted monitoring where we can like have a monitor which is actually like the thing that we don't trust but because it's having to introduce an exploit and then it has to also notice the exploit when it's monitoring the thing then it's like low. There's no probability of happening or like I feel like that's the main way of explaining that phenomenon of what works. I think like a lot of that work is only successful because at the moment the deceptive model doesn't know how many it's how many times it's been deployed at the moment so it can't kind of like guess when the right time is to coordinate copying its weights out or introducing a subtle book because it's trying really hard to keep that number really low so that it can just like do it again. So let me try and get through monitoring and I'd love to see some control work that doesn't have this like assumption that the red team of the untrusted model can't make these stuff. I mean I think there's like this general dynamic you often find in control protocols when trying to tune the high parameters is that like what you often care about is like even a very small rate of failure can be quite bad and so both like you care about like the far tails of like safety like the model like negative mode. the model like basically never being able to cause I can agree just problem. And similarly the model if it's like trying to cause a problem like only needs to get a success rate of like 0.1 percent or like 0.0 1 percent yeah and so like one of the things that I think we're gonna learn by playing these games is that like the middle of the distribution of like suspiciousness and like harmfulness don't really matter you like really care about like the tails here. Yeah I mean actually this is like just a third point now which is like I would love for better ways to evaluate like the rare events or the rare occurrences then we currently do yeah well it's obviously the like consequence if you really care about the tails you're in awful lot of samples yeah exactly to be able to actually like measure the quantity you care about in any detail and especially like if constructing the environments is really costly that's like that's a real challenge for this research yeah guaranteeing five lines of reliability is actually like incredibly difficult yes that we might take away from control especially if you're like handwriting all the environments as I have been doing on some projects you like do you know what would be writing like 10,000 data points yourself it's fine Claude will help us eventually right Claude is helping a bit close helping a bit yeah even do you have like any particular like lessons that you've learned from control but like you know you've brought to other parts of your research I think like just like we've learned like five lines is really hard to like you know yeah let me think um I guess a lot of what's come up when we've worked on control I think has been just that the it's helpful to have just frontier access to like frontier models to do a lot of this research I think a lot of the threat models that we worry about are like models copying their weights out like that is the kind of thing that requires like models to be doing like multi-step tasks like like within a company like coding environment and I think we're sort of like really pushing on in order to get to test on setting like that we're like really having to push on like we're kind of like at the frontier of like model capabilities and like even then it's not really able to do the models aren't able to do a lot of these tasks that effectively and like very simple mitigations like just checking the model with like a less capable models like well obviously just catch a lot of behavior for some of these more sophisticated things so I think um yeah I think my real sense for like a lot of these threat models has been that um it's like pretty difficult to pick like both an interesting setting that's um but that's also kind of like a real threat model that models can also do and so um at least for a lot of these settings it feels like a little bit early to explore these but the persuasion like business justification settings seems like one of the settings that um we have some signs of life on and like building more data sets and e-bals is like pretty helpful for creating more of these yeah I think that's like interesting the persuasion setting I think we're seeing signs of life on because fundamentally the thing a strategy that works very well with the model is just being like generally a bit biased and it doesn't really have to think about that I mean they're like kind of an interesting point with the effect of control research at the minute where models are just starting to become capable of the longer term like more age-entic planning that might be needed to execute on like more subtle like sabotage policies and so like right now what we're seeing at the minute is like for environments that look like moderately realistic models are like basically scoring zero at being able to like do something subtle and actually like achieve the goal but not get caught but I think the thing that is like stopping them from doing that is a planning capability that I expect models to like get better at pretty rapidly and it wouldn't surprise me if in one year's time like models are suddenly like slightly scarily good at this and right now one of the things that I am thinking like hardest about is well how do we build the evaluations that are going to like really demonstrate that at the end of this year yeah yeah I guess like maybe this leads on to like an interesting question of like why do control now as opposed to like the other alignment stuff I feel like you've you've really gone in hard on the control paradigm yeah yeah I mean I guess maybe it's worth slushing out what the kind of like end game is I think I think I think um almost all people I've talked with even people who are fans of control um think that eventually we need to align models for sufficiently capable models I think the key key reason being that like it's hard to control something that's much more like this is just much smarter than you basically um and you could you could do like fancied fancied stuff and like make it to that um you know you as your human review plus some like augment like some other trusted models can like really be effective at checking the work of like a really really capable like superhuman model but at some level of like capability gap it seems to be like probably it's just like quite hard to like have that much confidence that you're you're doing this effectively and people have been talking for years about the difficulty of keeping a super intelligence like in a box yeah but some way in the limit control like almost certainly breaks yeah um I think um I don't know I've at least talked with like one person who's like well maybe you use the the super intelligence to monitor itself and there's like ways to make it make that go well so I guess there's like still some some at least like non-zero discussion about this but I think generally I think a lot of people are like well okay the general strategy would be um you control your maybe like slightly superhuman system or at least human level system um which is able to do lots of autonomous tasks including importantly like AI safety research and AI alignment research and then you have the model do a lot a large amount of alignment research including like mechus the interpretability you shore up your your science like of um how to evaluate whether models are aligned or not and then use that to develop better techniques for aligning models and also gaining competence in your in your alignment techniques and then um hopefully you like yeah then then you're able to use those on your like more more capable systems like um your superhuman systems and from there yeah I think you now you have like a superhuman aligned system and then um hopefully you just eventually solve the alignment problem just in general with the solution that works for like all intelligence levels but you might need to do this process a few times where you have like your aligned superhuman model which then controls like even more superhuman systems use that to do like better alignment research and like keep doing this until you have a more like long term solution or or yes something something like that would be like kind of the the general like longer term approach I do think the looking towards an end game like hopefully if we end up in that regime we can use models to actually do a lot of basic science maybe a lot of alignment science we need to do to align harder models maybe if we like get into that regime and like models are controlled but still showing signs of misalignment we can at that point like pause and actually start to study some of the like really dangerous systems we're worried about one of the like big problems we have with alignment research of the minute is like models like can do bad stuff but they feel like qualitatively different from the sort of like misaligned transformative AI systems that I think we are like all really worried about so yeah like I think viewing that like let's get to a system where models are like economically transformative but are controlled like as an intermediate goal like opens up a lot of tactics yeah I think that's right I think an important point to make is that like a huge amount of the benefits people are excited about and talk about maybe like yeah almost all the benefits people talk about today are benefits you would get with effectively if you got like an extremely you know like extremely capable like human like replicated in an AI system or like that level of capability like you yeah you were able to get like the world's best expert in every single domain that's still like basically human level or maybe like the combination is like superhuman but at least there's like one person in the world who could like properly like evaluates and like replicate the behavior of the system and like that already would be like completely world changing is like maybe doable to control and like it is possible basically like the like additional returns to intelligence like it's possible that it's just like much more unsafe like if we like no longer able to control those systems the risks become like much larger at that point and it's not actually clear that like additional intelligence beyond that point will have like very large returns probably for some domains like math sure like probably will have like pure math like yeah might might have more returns but it's like very unclear and like efficiency may have returns but like efficiency isn't necessarily going to like make the model or like that's not like a it's not becoming like smarter unless you're doing something non-trivial and so if we just like run those intelligence is like really fast I think we could get like a huge amount of the value and then yeah I think use that to figure out like what a better long term solution is like maybe stay there for a while um yeah yeah I think that part of your question also was like not just why we're controlled but why we're on it now and I think like the elephant of the route there is that like the AI field seems to be moving pretty quickly and I think like a lot of us are feeling that while it would be nice to have a general solution to the alignment problem that scales like to transformative AI and all the way to super intelligence it's like not clear how long we have to get that sort of solution and a control solution that got us like into the transformative AI regime safely but didn't necessarily take us further might still be super valuable especially if we need something in the next I don't know two to three years cool so what are the fret models that you care about because I feel like that's a really good way to like you know frame alignment work one of the really key things about control is that um yeah there's kind of this hypothesis that if you guard against the key fret models then that's sufficient and so you don't actually need to ensure that your model is doing like nothing bad you just need to like list through one of the most major things that the model could do that would be really bad block those reliably and then hopefully from there you're pretty safe to deploy the main fret models that um yeah we've been thinking about an anthropic um it's possible there more but I think these are the most salient ones to us currently are models copying their witz out um uh models um deploying themselves within a company in an unmonetored way um and the third one would be sabotaging AI safety research um the reason the first two are important the first two copying witz out of the data center and launching unmonetored copies of the model internally those are both important because um we can then no longer it becomes like much more challenging or impossible to like uh do yeah monitor what the model is doing block it from doing certain things retrain it um delete it shut it down things like that um if the model escapes and starts like spinning up like yeah versions of itself on like Claude compute or like other compute resources able to find then um yeah then it's sort of like very difficult to shut it down um uh and then yeah I think it launches it's all internally on the cluster uh on some yeah some internal cluster then I think that then it also becomes potentially also like similarly difficult uh or like maybe not yeah I don't know it at least becomes difficult to to catch that there's like something going wrong you maybe need to like check the logs to understand like is this a real job that someone launched uh or is this like something that a rogei is like doing maybe the rogei can then do unmonetored additional security research like research on how to then later escape or make some PRs which then enable it to do something else that's bad or additional things that are bad so yeah I think those those those are like two of the key threats um on the sort of like unmonetored deployment side then there's another yeah I guess sabotaging safety research uh is the other threat and models I mean already models are producing like a large fraction of the code um like yeah surprisingly large fraction of the code the like the world is producing but um yeah also um also including like people are doing like AI research like with with tools like cursor or copilot um using all the lumps to generate the code um and yeah pretty soon it seems pretty plausible that people use like more agentic coding systems which are doing sort of like multi-step coding uh coding tasks like writing code running it checking the results and repeating that loop um and with those systems um we like yeah I think there's like a kind of like strong pull towards just like using those to automate a lot of safety work and like that the upside of that would be really high we could get like way more safety research done but also on the other hand like if we don't trust the models they might actually like sabotage that research persuadise that like of a wrong interpretation of some certain plot persuadise to go down like a different a wrong research direction which is actually like a rabbit hole and like waste a lot of our time persuades us that like a certain model is safe maybe even like runs experiments to justify that the the model is safe and then gets us to deploy and actually on safe model um so yeah I think there's just like a lot of risks related to um yeah sabotaging our like understanding of the safety of models um so yeah I think those are yeah that's that's another another key thread as well yeah there are other sketchy things the models could do for example that point internally like insert general security moment he's that someone they then from the exit from the outside could exploit or maybe like not directly influencing our like AI research but like poisoning data for future models or persuading like internal people or even like external people at very important organizations about certain key decisions uh or like doing kind of research helping people like externally with AI research maybe if we're in a situation where we kind of we wouldn't like much prefer that AI research was happening at a small number of like well regulated or kind of well understood institutions instead like model might kind of help people outside do a lot of AI research as I can have another thing we might want a monitor for but like yeah I think roughly a lot most most all things that we care about boil down to the three things you outlined or close to it. Scary future uh like scary risks that we should definitely mitigate or like work towards uh mitigating I wonder like maybe to like leave on an optimistic note uh like what model developments do you think have happened or what what has given you like positive evidence for like control or work or like other alignment techniques will work in the future yeah well I think that like one thing that's been maybe pretty optimistic recently is this like trend towards models doing more and more externalized reasoning and people being able to get a lot of like additional intelligence juice out of models by getting them to do these kind of long form reasoning kind of traces and this is like kind of great in a sense in that the models are like putting their thoughts out there and you can read them and like kind of assuming they're an accurate representation of what is going on like what the model models actual reasons for doing stuff and that's like a huge surface area to be monitored. I think if you're thinking about this from a control perspective then the kind of questions you might ask are okay well suppose we put a model that uses like a very long chain of thought into an environment where like has the opportunity to sabotage something and we let it try like firstly like can it can you use the chain of thought to do that task and like well maybe it can't in which case probably we're just not worried because the model can't sabotage but if it can then it seems like pretty likely that the models are like much better at doing this task with the long chain of thought than without so like in some sense that chain of thought is load bearing and if the chain of thought is load bearing then it seems like pretty likely that actually there are going to be like explicit references to the sabotage attempt like in that chain of thought you're going to think of the converse like if that weren't the case then the model must be like using the chain of thought for reasoning about how to sabotage but still managing not to say anything about sabotage and its chain of thought and that's like certainly in theory possible but in practice seems like pretty unlikely at least for models that have like only just become capable of this sabotage and then we're not like training by default to be able to like obfuscate their chains of thought. I think a massive winner is like in constitutional classifiers but the classifier can be like so much smaller than the like main model which checking or something like I'm impressed at how small monitors can be now. Yeah that is like a concrete win for monitoring that like I think we're pretty hopeful we'll actually work in the real world yeah and like shows that these things are deployable in practice. Yeah I think it's like yet to be shown like exactly how useful this is for misalignment type risks and like whether we should worry about like the monitoring the policy colluding or whether like AI systems have correlated failures but I do think that it's like de-risked a big chunk of the technical risk for like deploying models in practice. Nice so control super important for the next two years and we're all too messy we can do it sound about right or website okay nice thank you for watching this roundtable we've been I'm fropec peace.

Feedback / ReportSpotted an issue or have an improvement idea?