Tracing the thoughts of a large language model

AI models are not programmed but trained, developing their own strategies and often acting like a "black box" whose internal workings are opaque.
New tools allow researchers to observe and interpret an AI's internal thought processes, revealing how concepts connect to form "logical circuits."
This ability to understand and even intervene in an AI's internal planning, such as anticipating a rhyme in a poem, is key to developing safer and more reliable AI systems.

AI models are trained systems that learn their own strategies, unlike traditional programs.
It's crucial to develop tools to interpret what's happening inside an AI model, as merely "opening the black box" isn't sufficient.
Researchers can now observe how concepts are connected within an AI model to form logical circuits during its thought processes.
AI models demonstrate planning ahead, such as anticipating a rhyme before writing a line of poetry.
It's possible to intervene in these internal circuits, for example, by dampening down a specific concept to alter the model's output.
A deeper understanding of AI's internal reasoning will help make models safer and more reliable.

black box: An opaque system where inputs and outputs are known, but the internal processes that transform inputs to outputs are not understood.
trained: Describes AI models that learn patterns and rules from data through a process of optimization, rather than being explicitly programmed with instructions.
logical circuits: The observed internal pathways and connections between concepts within an AI model that represent its reasoning or thought processes.
intervene: To directly influence or modify the internal processing of an AI model, often to change its output or behavior.
dampen down: To reduce the activation, influence, or strength of a specific concept or internal signal within an AI model's processing.
planning ahead: The ability of an AI model to anticipate future steps, outcomes, or requirements and prepare its internal state or next actions accordingly before producing a final output.

You often hear that AI is like a black box. Words go in and words come out, but we don't know why it said what it said. That's because AIs aren't programmed but trained. And during training, they learn their own strategies to solve problems. If we want AIs to be as useful, reliable, and secure as possible, we want to open up the black box and understand why they do things. But even opening the black box isn't very helpful, because we don't know how to interpret what we see.

Think of it like a neuroscientist investigating the brain. We need tools to work out what's going on inside. We want to know how the model connects all the concepts in its mind and uses them to answer our questions. Now we've developed ways to observe some of an AI model's internal thought processes. We can actually see how these concepts are connected to form logical circuits.

Let's take a simple example where we ask Claude to write the second line of a poem. The poem starts, "He saw a carrot and had to grab it." In our study, we found that Claude is planning a rhyme even before writing the beginning of the line. Claude sees "a carrot" and "grab it," and thinks of "rabbit" as a word that would make sense with "carrot" and rhyme with "grab it." Then it writes the rest of the line: "His hunger was like a starving rabbit."

We look at the place where the model was thinking about the word "rabbit," and we see other ideas it had for places to take the poem. We also see the word "habit" is present there. Our new methods allow us to go in and intervene on this circuit. In this case, we dampen down "rabbit" as the model is planning the second line of the poem and then ask Claude to complete the line again: "His hunger was a powerful habit." We see that the model is capable of taking the beginning of a new poem, thinking of different ways it could complete it, and then writing towards those completions.

The fact that we can cause these changes to occur well before the final line is written is strong evidence that the model is planning ahead of time. This poetry planning result, along with the many other examples in our paper, only makes sense in a world where the models are really thinking in their own way about what they say.

Just as neuroscience helps us treat diseases and make people healthier, our longer-term plan is to use this deeper understanding of AI to help make the models safer and more reliable. If we can learn to read the model's mind, we can be much more confident it is doing what we intended. You can find many more examples of Claude's internal thoughts in our new paper at anthropic.com/research.
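
The "dampen down" intervention described in the transcript can be pictured with a toy sketch. This is not the method used in the research: the activation numbers are made up, the functions `dampen` and `planned_word` are hypothetical names chosen for illustration, and a real model's internal features are far more complex than a dictionary of scores. The sketch only shows the basic idea that scaling down one candidate concept can let another candidate win.

```python
from typing import Dict


def dampen(activations: Dict[str, float], concept: str, factor: float) -> Dict[str, float]:
    """Return a copy of the activations with one concept scaled by `factor`.

    `factor` must lie in [0, 1]; 0 removes the concept, 1 leaves it unchanged.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")
    result = dict(activations)  # copy so the original scores stay untouched
    if concept in result:
        result[concept] *= factor
    return result


def planned_word(activations: Dict[str, float]) -> str:
    """Pick the candidate with the highest activation as the planned rhyme word."""
    return max(activations, key=activations.get)


# Made-up activation scores for candidate rhyme words (an assumption, not real data).
candidates = {"rabbit": 0.9, "habit": 0.6}

before = planned_word(candidates)
after = planned_word(dampen(candidates, "rabbit", factor=0.1))

print("planned word before intervention:", before)
print("planned word after dampening 'rabbit':", after)
```

In this toy setup, the strongest candidate is chosen before the intervention, and once its score is scaled down the next candidate takes over, loosely mirroring how the completed line changes in the poem example.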


