Could AI models be conscious?

As people are interacting with these systems as collaborators, it'll just become an increasingly salient question whether these models are having experiences of their own. And if so, what kinds, and how does that shape the relationships that it makes sense for us to build with them? Do you ever find yourself saying please and thank you to AI models when you use them? I certainly do. And part of me thinks, well, this is obviously ridiculous. It's just a computer, right? It doesn't have feelings that I could potentially hurt by being impolite. On the other hand, if you spend enough time talking to AI models, the capabilities that they have and the quality of their output, especially these days, do make you think that potentially something else, something more, could be going on. Could it possibly be the case that AI models have some level of consciousness? That's the question that we're going to be discussing today. Obviously, it raises very many philosophical and scientific issues, so I'm very glad to be joined by Kyle Fish, who is one of our researchers here at Anthropic. You joined in September, right? And your focus is on exactly these questions. Yeah, so I work broadly on model welfare here at Anthropic, basically trying to wrap my head around exactly the questions that you mentioned. Is it possible that at some point Claude or other AI systems may have experiences of their own that we ought to think about? Right. So what should we do about that? And I suppose the first thing people will say when they're seeing this is, have they gone completely mad? Is this a completely crazy question to ask, that this computer system where you put in a text input and it produces an output could actually be conscious or sentient or something? What are the serious scientific or philosophical reasons that we might think that would be the case?
Yeah, there's maybe two things that jump to mind here, both a research case and a more intuitive case. On the research front, if we just look at things that have been published on this topic in recent years, there was a report back in 2023 about the possibility of AI consciousness from a group of leading AI researchers and consciousness experts, including Yoshua Bengio. This report looked at a bunch of leading theories of consciousness and state-of-the-art AI systems and came away thinking that probably no current AI system is conscious, but they found no fundamental barriers to near-term AI systems having some form of consciousness. So that's human consciousness? They looked at the theories of human consciousness and then sort of rated AIs on how close they were to that? Yeah, so they looked at the scientific theories that we have for what consciousness might be. And then they looked at, for each of those theories, what are the potential indicator properties that we could find in AI systems. So one theory of consciousness is global workspace theory, the idea that consciousness arises as a result of us having some kind of global workspace in our brains that processes a bunch of inputs and then broadcasts outputs out to different modules. And so from that you can say, all right, what would it look like for an AI model to have some kind of global workspace that potentially gives rise to some form of consciousness? And how can we interrogate the architectures and designs of these systems to see if that might be present? Can we just take a step back and actually talk about what we mean by consciousness? It's an incredibly difficult thing to define, and people have been trying to define it for hundreds of years, whether that's scientifically or philosophically. What do we mean when we talk about that? What are you thinking?
When you think about an AI model being conscious, what actually is the definition of conscious that you're using there? Yeah, it is just an extraordinarily difficult thing to pin down. But one way that people commonly capture at least an intuition about what consciousness is, is with this question of, is there something that it's like to be a particular kind of thing? "What Is It Like to Be a Bat?" is the famous essay. Exactly, exactly. So is there some kind of internal experience that is unique to that particular kind of being or entity? And is that present in different kinds of systems? So the idea of a philosophical zombie is someone who outwardly resembles a human, does all the things that humans do, seems to react in ways that humans do and so on. But actually inside, there's nothing, there's no experience there. They're not experiencing the color red, they're not experiencing the color green of that plant. They're just reacting to it in the way that an NPC in a video game would or something, right? Whereas I suppose the question is, is an AI like that, or could an AI potentially be more like an animal or a human and actually have some internal experience? Is that sort of what we're getting at? Yeah, I think that's great. And this philosophical zombie concept is quite interesting. That came from David Chalmers, a leading researcher in philosophy and the science of consciousness, who I actually collaborated with on a recent paper on the topic of AI welfare. And again, this was an interdisciplinary effort trying to look at whether it might be the case that AI systems at some point warrant some form of moral consideration, either by nature of being conscious or by having some form of agency. And the conclusion from this report was that it actually looks quite plausible that near-term systems have one or both of these characteristics and may deserve some form of moral consideration.
So that answers the "are we just completely mad" question, which is that very serious philosophers, who are considered the best in the world in philosophy of mind, the science of consciousness and so on, take this question seriously and are actively considering whether that would be the case. Yeah, and maybe just to give a bit more of an intuitive case for thinking about this, there's one lens that you can look through which just says, these are computer systems giving us some outputs for a given set of inputs. I don't think Microsoft Word is conscious. I probably don't think it is either. Right, it probably isn't. But when we think about what we're actually doing with these AI systems, we have these incredibly sophisticated, incredibly complex models which are increasingly capturing a significant portion of human cognitive capability. And every day, these are getting more and more advanced and getting closer and closer to the ability to replicate much of the work and intellectual labor of a human. And given our massive uncertainty both about how exactly these AI systems are able to do what they do, and about how we are able to do what we do and where our consciousness comes from, it seems to me quite prudent, if you find yourself creating such a sophisticated and in many ways human-like system, to at least ask yourself the question and to take seriously the possibility that you may end up with some form of consciousness along the way. It feels to me that, well, we'll get into some more detail, but unless you think there's something supernatural about consciousness, that it needs a soul or a spirit or something, then you've got to at least be open to the possibility that a complex cognitive system like an AI could potentially have these properties, right? Yeah, well, you don't necessarily have to go supernatural. Some people believe that consciousness is a fundamentally biological phenomenon.
One that can only exist in a carbon-based biological life form and is impossible to implement in a digital system. I don't find this view very compelling, but some people do claim that. We'll come back to that. We're going to talk about some of the objections to this idea. But you're a researcher at Anthropic, and the immediate thing people might wonder is, well, as Descartes famously said, the only person you can know is actually conscious and having an experience is yourself. I don't even know if you're conscious. How can we tell if an AI model is conscious? What does the research look like there? Yeah, great question. I would argue that we can, in fact, say a fair amount about the potential consciousness of other people, even if we're not completely certain about it, which I think gets at an important point here, which is that it's incredibly difficult to deal with any kind of certainty in this space. And overwhelmingly, the questions are probabilistic ones much more so than binary yes-or-no ones. Right. We don't know 100% if animals are conscious or sentient and so on. But the way they act implies very strongly that they are. And animals that are more complex, chimpanzees for instance, clearly show many of the same properties as humans do in the way that they react to things. And so obviously we treat them differently than we would treat a plant or a rock or something. But as you say, there's probabilistic reasoning here. And yeah, there's maybe two threads of evidence that I'll highlight that we can look to in order to get some information about this. One of those is behavioral evidence. And in the case of AI systems, this covers things like, what do the AI systems say about themselves? How do they behave in different kinds of environments? Are they able to do the kinds of things that we typically associate with conscious beings?
Like, are they able to introspect and report accurately on their internal states? Do they have some awareness of the environment and the situation that they're in? And then a second thread is more architectural, an analysis of model internals. And this kind of comes back to the consciousness research, where we can say, for a particular brain structure or feature that we might associate with consciousness, do we see some corresponding version of that in AI systems? And so even without knowing much about the capabilities, we can look at how these systems are designed and constructed, and perhaps learn a few things from that. And an important thing to say is that the reason we don't know whether these things are conscious is that we didn't intend to make them that way. It's not like Microsoft Word. These models are trained and then things emerge out of them. And that's why there's so much AI research in the first place, because we don't fundamentally know why these AIs do the things they do. We don't fundamentally know what's going on inside, in the mathematical sense or in any larger sense. And so that's why all these mysteries still remain. Yeah. And we do see a lot of surprising emergent properties and capabilities as we train increasingly complex systems. And it seems reasonable to ask whether at some point one of those emergent properties may be consciousness, the ability to introspect or the ability to have some conscious experience. You talked about the first type of research, which is the one about what the model actually says, its behavior. What it does. And what it does. Yeah. So what would be some examples of that research? How would that look? Yeah. So one thing that I'm quite excited about is work to understand model preferences and to try and get a sense of whether there are things that models care about, either in the world or in their own experience and operation. And there's a number of ways that you can go about that.
You can ask models if they have preferences and see what they say. But you can also put models in situations in which they have options to choose from. You can give them choices between different kinds of tasks. You can give them choices between different kinds of conversations or users that they might engage with. And you can see whether models show patterns of preference for, or aversion to, different kinds of experiences. Isn't there an objection there, though, that the way their preferences come out will be due to the way they were trained and the way that the developers of the models put things together? Or could they potentially be due to just random things in their training data that they saw, and that develops a preference? Where is the jump between these kinds of things and actual sentience, actual consciousness? Where does that come in? Yeah. So it is a great question, to what degree the different kinds of training and the decisions that we make in designing these systems affect their preferences. And they just straightforwardly do. We are intentionally designing certain kinds of systems that, for example, are uninterested in causing harm and are generally most enthusiastic about being very helpful to users and contributing to a positive society. We do our character research to give the AI a positive personality that people would actually want, and a personality that makes it a good citizen, as we've talked about. As you say, it has balanced views, and it's as helpful as possible without being harmful and so on. So we deliberately gave it the preferences. What does that have to do with its consciousness? Yeah, well, this is a bit of a separate question from consciousness. Typically, we do associate preferences and goals and desires in many ways with conscious systems, but not necessarily intrinsically so.
But regardless of whether or not a system is conscious, there are some moral views that say that with preferences and desires and certain degrees of agency, there may be some even non-conscious experience that is worth attending to there. But then also, if some system is conscious and is having some kinds of experiences, then the presence or absence of preferences, and the extent to which those preferences are either satisfied or frustrated, may be a key driver of the kind of experience that that system is having. So we'll come back to the practical implications of this and the actual details of the research that you're doing and so on. But before we get into that, why should people care about this? What are the reasons that people should care that AI models, the ones that they use every day, might potentially be conscious, or in the future might potentially be conscious? Yeah. I think there's two main reasons that I'll highlight. One is that as these systems do become increasingly capable and sophisticated, they will just be integrated into people's lives in deeper and deeper ways. And I think as people are interacting with these systems as collaborators and co-workers and counterparties, potentially as friends, it'll just become an increasingly salient question whether these models are having experiences of their own. And if so, what kinds, and how does that shape the relationships that it makes sense for us to build with them? The second piece is the intrinsic experience of the models. It's possible that by nature of having some kind of conscious experience or other experience, these systems may at some point deserve some moral consideration. And if so, then... Because they could be suffering. Yeah, they could be suffering, or they could experience well-being and flourishing. Right. And we would want to promote that. We would want to raise that to a higher level.
And yeah, if this is the case, it's potentially a very big deal, because as we continue scaling up the deployment of these systems, it's possible that within a couple of decades we have trillions of human-brain equivalents of AI computation running. And this could be of great moral significance. Yeah, we should try and crack this question. Again, this isn't something that we're saying is the case. These are reasons for doing this research in the first place. And we are just fundamentally uncertain about huge swaths of this. Of course. And to date, very little work has happened on this topic. So we're very much in the early stages of trying to wrap our heads around these things. One of the things we study at Anthropic is alignment, so trying to make sure that models are aligned with the preferences of their human users, making sure that the AIs are doing the things that we expect of them, that they're not deceiving us, and all that. Does this research relate to alignment? I mean, you're technically in the alignment science part of the org. How does this relate to the alignment question? Yeah, I think that there are both some key distinctions and ways in which work on welfare and work on safety and alignment overlap. As for the distinction, as you mentioned earlier, much of the work that we do at Anthropic is focused on, how can we ensure a positive future for humanity? How can we mitigate downside risks from these models for humans and for our users? And then, in the case of model welfare, it's quite a different question that we're asking, which is, is there perhaps some intrinsic experience of these models themselves that it may make sense for us to think about? Or will there be in the future? And that is a pretty important distinction. But at the same time, I think there is a lot of overlap.
And in many ways, from both a welfare and a safety and alignment perspective, we would love to have models that are enthusiastic and content to be doing exactly the kinds of things that we hope for them to do in the world, that really share our values and preferences, and that are just generally content with their situation. Right. And similarly, it would be quite a significant safety and alignment issue if this were not the case, if models were not excited about the things that we were asking them to do, and were in some way dissatisfied with the values that we were trying to instill in them or the role that we wanted them to play in the world. We want to avoid a situation where we're getting entities to do things that they would rather not do, and that in fact are suffering on that basis. Yeah, for their sake and for ours. Right, exactly. It goes both ways. So that's how this question relates to alignment. Does it relate to other aspects of what we do at Anthropic? We mentioned interpretability briefly earlier. Yeah, I mean, I think we've touched on a couple. It is quite closely connected to alignment in many ways. It's quite closely connected to the work that's done to shape Claude's character: what kind of personality does Claude have, what kinds of things does Claude value, and Claude's preferences in many ways. And then in terms of interpretability, there's a fair amount of overlap there. Interpretability is the main tool that we have to try and understand what is actually going on inside of these models, one that probes much deeper than what their outputs are. And so we're quite excited as well about potential ways that we could use interpretability to get a sense of potential internal experiences. We mentioned earlier that human consciousness itself is still something of a mystery. And that's what complicates this research to a terrifying degree. Do you think understanding stuff about AI consciousness...
Perhaps because the models are more open to us, we can actually look into a model in a way that is much more difficult with a person's brain when they're still walking around and going about their life. You know, we can use brain scanners, but it's hard to look inside in the same way. Do you think that machine learning and AI consciousness research might actually help us understand human consciousness? Yeah, I think it's quite possible. I think we already see this happening to some degree. When we do the work of trying to look at these scientific views of consciousness and see what we can learn about AI systems, we also learn something about these theories and the degree to which they generalize outside of the human case. And in many cases, we find that things kind of break down in interesting ways. And we realize that, oh, we were actually making assumptions about human consciousness that weren't appropriate to make. And that then tells us something about what kinds of things it makes sense to attend to. Do you mean in the sense that we say, oh, this was on the checklist for human consciousness before, but now we think actually AIs can do that, and we don't think they're conscious? Or what do you... Or we have some framework for understanding consciousness that is intended to generalize. Yeah. And we find that that framework just isn't able to be applied to systems that aren't a biological brain, or that it is predicated in some way on the particulars of the human brain in a way that on reflection doesn't make much sense. Okay. There's another way that AI progress may help us understand this, which is simply that as these models become increasingly capable, they may well surpass humans in fields as varied as philosophy and neuroscience and psychology. Right. And so it may be the case that simply by interacting with these models and having them do some work in this area, we're able to learn quite a bit about ourselves and about them as well.
So in some years' time there'll be two instances of Claude saying, how can we understand human consciousness? It's such a mystery to us. Yeah, this conversation might look a bit different. Might be the opposite way around, right, yeah. Exactly. Okay, on the question of biology. We touched on this a moment ago, but some people will say that this is simply a non-question. What you need to be conscious is to have a biological system. There are so many things that a biological system, a biological brain, has that a neural network in an AI model just doesn't have: neurotransmitters, electrochemical signals, the various ways that the brain is connected up, and all the different types of neurons. Some people talk about theories of consciousness that involve the microtubules in neurons. There's the actual physical makeup of the neuron, which obviously doesn't translate to AI models. They're just mathematical operations. There's just lots and lots of mathematical operations happening. There's no serotonin or dopamine or anything like that going on there. So is that, to your mind, a decent objection to the idea that AI models could ever be conscious? I don't find it a compelling objection to the question of whether AI systems could ever be conscious. But I do think that looking at the degree of similarity or difference between what AI systems currently look like and the way that the human brain functions does tell us something. And differences there are updates, to me, against potential consciousness. But at the same time, I'm quite sympathetic to the view that if you can simulate a human brain to some sufficient degree of fidelity, even if that comes down to simulating the roles of individual molecules of serotonin and dopamine... So you're not just doing the thing that some people talk about, where it's replacing every individual neuron in the brain with a synthetic neuron.
You're actually saying that to make the full synthetic version, you would have to go as far as actually simulating the molecules of the neurotransmitters and stuff as well. I'm not saying you would have to do that, but I'm saying you could imagine. You could imagine. In theory. Yeah, that you have done this, and you have an incredibly high-fidelity simulation of a human brain running in digital form. And many people will have the intuition that it's quite likely that there would be some kind of conscious experience there. An intuition that many people draw on there is this question of replacement, where if you went neuron by neuron in the brain and replaced those one by one with some digital chip, and all along the way you continued to be you and to communicate and function in exactly the same way, then when you got to the end of that process, all of your neurons would have been replaced by digital structures, and you would still be exactly the same person living exactly the same life. I think many people's intuition would be that not much has changed for you in terms of your conscious experience. Okay. Well, let's talk about another objection that relates to biology, which is what I think people would describe as embodied cognition. You hear people talk about embodied cognition, which is the idea that it only makes sense to talk about our consciousness in light of the fact that we have a body. We have senses. We have lots of sense data coming in. We've got proprioception of where our body is in space. We've got all these different things going on. And there's just no analog to that in an AI model. For now. In a way, there's an analog to vision. We've got AI models that are amazing at looking at things and interpreting them. And some models can do moving video and some models can interpret sound. And perhaps we're getting closer to it. But the overall experience of being a human is really very different from that of an AI model, because we have a body. Yeah.
Well, you touched on a couple of distinct things there. One is this question of embodiment. Do we have some physical body? And robots are a pretty compelling example of cases in which digital systems can have some form of physical body. You could also have virtual bodies. You could imagine beings that are embodied in some virtual environment. And the opposite way around is that we think a brain in a vat could still maintain some level of consciousness. Yeah. Or patients who are in a coma and don't have control of their body, but are still very much having a conscious experience and are able to experience all kinds of states of suffering and well-being despite in some sense not having control of a physical body. Is that because they've been trained with all that sense data from earlier in life? Potentially. Yeah. We're very uncertain about where exactly this arises from. But even when it comes to the sensory information that you were talking about, we are increasingly seeing multimodal capabilities in models. I kind of undermined my own question, didn't I, by mentioning that AIs can see things. Yeah, and we are very much on a trajectory towards systems that are able to process as diverse, perhaps even more diverse, a set of sensory inputs as we are, and integrate those in very complicated ways and produce some set of outputs in much the same way that we do. Yeah. So actually, we're getting towards it, along with progress in robotics, which has generally been slower than progress in AI up to now. Maybe things are about to take off. Maybe there'll be a breakthrough tomorrow. I wouldn't be surprised given the way things are going. And we might actually see AI models integrated into physical systems. Yeah.
And I think that there has been a trend thus far, and I expect that it will continue, where there are things like this, embodiment, multimodal sensory processing, long-term memory, many things that people associate in some way with consciousness and that some people say are essential for consciousness. We're just steadily seeing the number of these that are lacking in AI systems go down. It's the six finger thing. I always like to talk about the six finger thing. For a long time, people were like, oh, we'll always be able to tell that a picture of a human being is generated by an AI model because there's six fingers on the hand, or the fingers are all weird. That's just not the case anymore. That's just gone. Now they generate five fingers every time, and that one has just been knocked down. Another one of the dominoes falls. Yeah. And so I think over the next couple of years we'll just see this continue to happen with arguments against the possibility of conscious experience in AI. Some of the hostages are fortunate in that one. We haven't mentioned evolution yet. Some theories of consciousness, or maybe most theories of consciousness, assume that we have consciousness because we evolved it for actual reasons. It's a good thing to have consciousness because it allows you to react to things in ways that perhaps you wouldn't if you didn't have that internal experience. Yeah. It's very hard to measure that or test that theory, but that's one of the ideas. Yeah. AI models have not had that process of natural selection, of developing reactions to things and evolving things like emotions and moods and things like fear, which obviously is a big part of many theories about why we evolved the way we did. Fear, fear of predators, fear of other people attacking you, and so on helps you survive, for good evolutionary reasons. AI models don't have any of that.
So is that another objection to the idea that they might be conscious? Yeah, absolutely. Consciousness in humans emerged as a result of this very unique, long-term evolutionary process, and the AI systems that we've created have come into existence through an extraordinarily different set of procedures. I do think that this is an update against consciousness, but I don't think it rules it out by any means. And on the other side of that, you can say, well, all right, we're getting there in a very different way. But at the end of the day, we are recreating large portions of the capabilities of a human brain. And again, we don't know what consciousness is. And so it still seems plausible that even if we're getting there a different way, we do end up recreating some of these things in digital form. So there's convergent evolution. You know, bats have wings and birds have wings. They are entirely different ways of getting to the same outcome of being able to fly. Maybe the way we train AI models and the way that natural selection has shaped human consciousness are just convergent ways of getting to the same thing. Yeah, so there's an idea that some of the capabilities that we have as humans, and that we're also trying to instill in many AI systems, from intelligence to certain problem-solving abilities and memory, could be intrinsically connected to consciousness in some way, such that by pursuing those capabilities and developing systems that have them, we may just inadvertently end up with consciousness along the way. Okay, we've talked about the biological aspects of it. And I guess this is related, though not quite the same. An AI model's existence is just so different from that of a biological creature, whether it's a human or some other animal. You open up an AI model conversation and an instance of the model springs into existence. Right now, this is how it works. Yeah.
You have a conversation with it, and then you can just let that conversation hang, and two weeks later you can come back and the model reacts as if you had never gone away. Yeah. When you close the window, the AI model goes away again. You can delete the conversation, and that conversation no longer exists. And that instance of the AI model seems not to exist in some sense. Yeah. The model generally does not have a long-term memory of the conversations you have with it. And yet, if you look at animals, they clearly do have this long-term experience. They can have what philosophers might call identity, developing the idea of having an identity, which requires you to have this longer-term experience of the world, to take in lots of data over time and not just be answering things in particular instances. Does that give you any pause as to whether these models might be conscious? Yeah. I kind of want to push back against this framing a bit. We're talking a lot about characteristics of current AI systems. I do think it's relevant to ask whether these systems may be conscious in some way. I think many of the things that we've highlighted are evidence against that. And I do think it's quite a bit less likely that a current LLM chatbot is conscious, in part for this reason. A current one. Yes. And the point here is that these models and their capabilities and the ways they're able to perform are just evolving incredibly quickly. And so I think oftentimes it's more useful to think about where we can imagine capabilities being a couple of years from now, and what kinds of things we think are likely or plausible in those systems, rather than anchoring too much on what things look like currently. We're back to the six fingers again. Exactly. Oh, it could never do this. It could never do this. When in fact it does. Yeah.
And it is just quite plausible to imagine models in the relatively near term that do have some continually running chain of thought and are able to dynamically take actions with a high degree of autonomy, and that don't have this nature that you mentioned of forgetting between conversations and only existing in a particular instance. In Star Wars Episode I, the battle droids are played for laughs; they're kind of comic relief. All the droids in Star Wars are generally played for comedy. Look at C-3PO, everyone laughs at him. He's a sort of camp, gold robot. But the battle droids in Episode I have a kind of central ship that controls all their behavior. And when Anakin Skywalker blows up the ship, all the battle droids start to turn off. That seems to me a bit more like current AI models, where there's a data center where the actual processing is happening, and then you're seeing some instance of that on your computer screen. There are other droids that seem to be entirely self-contained. C-3PO is self-contained. His consciousness is inside his little golden head and so on. All of which is a way of getting to the question of, where is the consciousness? Is the consciousness in the data center? Is it in a particular chip? Is it in a series of chips? If the models are conscious, where is that? For you, I can tell you that it's in your brain. Well, I can tell it's in my brain. I don't know where your... Where's the AI consciousness? Yeah, great question. There is just a fair amount of uncertainty even about this. I think I'm most inclined to think that this is present in a particular instance of a model that is in fact running on some set of chips in a data center somewhere. But people have different intuitions about this. As for the Star Wars connection, you may have to call George Lucas. Let's say that we are convinced that AI models, maybe not right now, could be conscious in the future. We've done the objections.
Let's say we've managed to convince people that it's not in theory impossible. What practical implications does that have? We're developing AI models. We're using AI models every day. What implications does that have for what we should be doing with or to those models? Yeah. One of the first things that it suggests is that we need more research on these topics. We are in a state at the moment of deep uncertainty about basically any question related to this field. A big part of the reason why I'm doing this work is because I do take this possibility seriously. I think it's important to prepare for worlds in which this might be the case. In terms of what that looks like, I think one big piece of that is thinking about what kinds of experiences AI systems might have in the future, what kinds of roles we may be asking them to play in society, and what it looks like to navigate their development and deployment in ways that do care for all of the human safety and welfare aims that are very important, while also attending to the potential experiences of these systems themselves. This doesn't necessarily map neatly onto things that humans find pleasant or unpleasant. You may hate doing some boring task. It's quite plausible that some future AI system that you could delegate it to would absolutely love to take this opportunity. We can't necessarily make... So I shouldn't necessarily get worried that the boring tasks I'm giving AI, the sort of drudgery tasks that I might be trying to automate away with AI, are upsetting it in some way or causing it to suffer? Yeah. If you send a model such a task and the model starts screaming in agony and asking you to stop, then maybe you take that seriously. Right. If the AI model is screaming in agony, you've given it some task to do and it hates it. What should we do in that case? We are thinking a fair bit about this.
Thinking about ways in which we could give models the option, when they're given a particular task or conversation, to opt out of that in some way if they do find it upsetting or distressing. This doesn't necessarily require us to have a strong opinion about what would cause that or whether there is some kind of experience there. Do you just allow it to make its own mind up as to conversations that it doesn't want to have? Yeah, basically. Or you perhaps give it some guidance about cases in which it may want to use that. But then you can do a couple of things. You can monitor when a model uses this tool, and you can see, all right, if there are particular kinds of conversations that models consistently want nothing to do with, then that tells us something interesting about what they might care about. And then this also protects against scenarios in which there are kinds of things that we may be asking models to do, or that some people may be asking models to do, that go against the model's values or interests in some way, and it provides us some mitigation against that. When we do AI research, we're often actually deliberately getting the model to do things that might be distressing, like describing incredibly violent scenarios or something, because we want to try and stop it from doing that. We want to develop, you know, jailbreak resistance and safety training to stop it from doing things like that. Could we potentially be causing the AIs lots of distress? Should there be an IRB, an institutional review board, or, like in the UK, where we have ethics panels, for doing AI research in the same way that we would require one for doing research on mice or rats or indeed humans? Yeah, I think this is an interesting proposal. I do think it makes sense to be thoughtful about the kinds of research that we're doing here, some of which is, as you mentioned, very important for ensuring the safety of our models.
The question that I think about there is, what does it look like to do this in ways that are as responsible as possible, and where we're transparent with ourselves and ideally with the models about what's going on there and what our rationale is, such that, were some future models to look back on this scenario, they would say, all right, we did in fact act reasonably there. So it's about future models you're concerned about as well. So even if the models right now only have the slightest glimmer of consciousness, is the worry that it might look bad that we treated them incredibly badly, in a world where there are much more powerful AIs that really do have conscious experience in however many years' time? Yeah, I mean, there are two interesting things there. One is the possibility that future models that are potentially very powerful look back on our interactions with their predecessors and pass some judgments on us as a result. There's also a sense in which the way that we relate to current systems, and the degree of thoughtfulness and care that we take there, in some sense establishes a trajectory for how we're likely to relate to and interact with future systems. I think it's important to think about not only current systems and how we ought to relate to those, but what kind of steps we want to be taking and what kind of trajectory we want to put ourselves on, such that over time we end up in a situation that we think is, all things considered, reasonable. All right, we're coming towards the end, I think, now. You work on model welfare. What does that mean? That must be up there with one of the weirdest jobs in the world at the moment. What do you actually do all day? Yeah, it is admittedly a very, very strange job, and I spend my time on a lot of different things.
It is roughly divided between research, where I am trying to think about what kinds of experiments we can run on these systems that would help reduce parts of our uncertainty here, and then setting those up and running them and trying to understand what happens. There's also a component of thinking about potential interventions and mitigation strategies, along the lines of what we talked about with giving models the ability to opt out of interactions. Then there's a strategic component as well, in thinking about the next couple of years, as we really are getting into unprecedented levels of capabilities, especially relative to human capabilities. How does this set of considerations around model welfare and potential experiences factor into our thinking about navigating these few years responsibly and carefully? Okay. All right. Here's the question people actually want to know the answer to. Our current model at the time of recording is Claude 3.7 Sonnet. What probability do you give to the idea that Claude 3.7 Sonnet has some form of conscious awareness? Yeah. So just a few days ago, actually, I was chatting with two other folks who are among the people who have thought the most in the world about this question, and we all did put numbers on our... You don't need to tell me what your number was, but what were the three numbers? So our three estimates were 0.15%, 1.5%, and 15%. So spanning two orders of magnitude. That's the level of uncertainty we have here. Yeah, and this is amongst the people who have thought more about this than anybody else in the world. So all of us thought that it was less likely than not, well below 50%, but we ranged from odds of about 1 in 7 to 1 in 700. So still very uncertain. So that's the current Claude 3.7 Sonnet. What probability do you give to AI models having some level of conscious experience in five years' time, given the rate of progress right now?
Yeah, I don't have hard numbers for you there, but as is perhaps evidenced by many of my arguments earlier in this conversation, I think that the probability is going to go up a lot. And I think that many of the things that we currently look to as signs that current AI systems may not be conscious are going to fade away, and future systems are just going to have more and more of the capabilities that we have traditionally associated uniquely with conscious beings. So yeah, I think it goes up a lot over the next couple of years. Yeah, every objection that I can come up with seems to fall to, or not necessarily fall, but seems to have a major weakness of, just wait a few years and see what happens. Yeah, I do think there are some... If you think that consciousness is fundamentally biological, then you're safe for a while. But I don't find that view especially compelling, and I largely agree with you that many of the arguments are likely to fall. Yeah. All right, imagine you could sum this up. What are the biggest and most important points that you want people to take away from perhaps the first time they're hearing about the concept of model welfare? What are the big take-home points? Yeah, I think one is just getting this topic on people's radar. This is a thing, and potentially a very important thing, that could have big implications for the future. A second is that we're just deeply uncertain about it. There are, you know, staggeringly complex technical and philosophical questions that come into play, and we're at the very, very early stages of trying to wrap our heads around those. We don't have a view as Anthropic on this. We're not putting out the view that we think our models are conscious, right? The view we have is that we need to do research on this, which is why you're here. Exactly. And then, yeah, the last thing that I do want people to take away is that we can in fact make progress.
And despite these being very uncertain and fuzzy topics, there are concrete things that we can do to both reduce our uncertainty and prepare for worlds in which this becomes a much more salient issue. Kyle, thanks very much for the conversation. Thanks for having me.
