📖 Lesson content
Text Your Friend Markov
Send a Text
Long ago, my friends and I would play this game where you'd grab your phone and craft a message to a friend using only the recommended next words. Maybe you did this too.
Here's a simulator that was "trained" on a bit of content. Have a play!
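[Interactive simulator: the incoming message reads "hey does anyone know how to fix the build", the reply draft starts with "i think", and the suggested next words are "we", "the", and "that".]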
We talked about this as if it were a "Me-bot". We knew it was recommending words based on our individual usage patterns, and we could see our voices in our personalized recommendations.
At the time, we were happy to dismiss it as tech magic. I don't think we realized how simple the algorithm might be.
Keep reading if you want to build one with me :)
"Training" Your Model
Let's train on a handful of messages. All we need to do is tally the connections between words. Let's do this one message at a time.
- "i think we should probably ship it"
- "i think that sounds good"
- "i think we should probably wait"
- "we should check with the team"
- "that sounds good to me"
The Transition Matrix
This finished map of connections between words is called a frequency table. Normalize each row (divide every count by that row's total) and you get a probability distribution of what comes next. Those normalized rows are the transition matrix.
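If you'd like to see the tallying in code, here is a minimal sketch using the five messages above. The function names `build_counts` and `normalize` are made up for this example, and it assumes words are split on spaces with no special start or end markers.

```python
from collections import Counter, defaultdict

MESSAGES = [
    "i think we should probably ship it",
    "i think that sounds good",
    "i think we should probably wait",
    "we should check with the team",
    "that sounds good to me",
]

def build_counts(messages):
    """Tally how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for message in messages:
        words = message.split()
        # Pair every word with the word right after it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def normalize(counts):
    """Turn each row of tallies into probabilities that sum to 1."""
    table = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        table[word] = {nxt: n / total for nxt, n in followers.items()}
    return table

counts = build_counts(MESSAGES)
probs = normalize(counts)

print(probs["i"])         # {'think': 1.0}
print(probs["probably"])  # {'ship': 0.5, 'wait': 0.5}
```

Every message here starts "i think" or never uses "i" at all, so "think" is the only word ever seen after "i".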
The act of picking a next word based on what you have so far is called sampling. We use that same term for this process with modern language models like Claude.
Do Some Sampling
Using the same matrix based on 5 texts, go ahead and have a more informed play at our game. We'll show you the probabilities.
[Interactive sampler: after "i", the only available next word is "think" (100%).]
The highlighted row is your current context. Pick a word from the available choices to continue.
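Here is a rough sketch of the same game played by code instead of by you. It rebuilds the tallies from the five messages, then repeatedly picks a next word in proportion to how often it was seen. The helper names `sample_next` and `generate` and the `max_words` cap are assumptions for illustration, not part of the lesson's simulator.

```python
import random
from collections import Counter, defaultdict

MESSAGES = [
    "i think we should probably ship it",
    "i think that sounds good",
    "i think we should probably wait",
    "we should check with the team",
    "that sounds good to me",
]

# Tally word -> next-word counts (the frequency table).
counts = defaultdict(Counter)
for message in MESSAGES:
    words = message.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

def sample_next(word, rng):
    """Pick a follower of `word`, weighted by how often it was seen."""
    followers = counts.get(word)
    if not followers:
        return None  # nothing was ever seen after this word
    options = list(followers)
    weights = [followers[option] for option in options]
    return rng.choices(options, weights=weights, k=1)[0]

def generate(start, rng, max_words=10):
    """Build a message one sampled word at a time."""
    words = [start]
    while len(words) < max_words:
        following = sample_next(words[-1], rng)
        if following is None:
            break
        words.append(following)
    return " ".join(words)

rng = random.Random(0)  # seeded so reruns give the same message
print(generate("i", rng))
```

Passing raw counts as weights is equivalent to normalizing first, because `random.choices` only cares about relative weights.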
Scale it up
Five messages gave us a tiny matrix and a handful of predictions. What happens with more data?
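[Interactive controls: add 1, 10, or 50 messages, or load all 222.]
With the 5 starter messages:
- 5 messages
- 17 unique words
- 13 contexts
- 24 transitions
- 2 max options
Contexts with the most options: "probably", "should", "think".
After "probably": ship 50%, wait 50%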
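The numbers for the five starter messages can be reproduced with a short script. This sketch assumes a "context" is any word that has at least one follower and that "transitions" counts every observed word pair, repeats included. Those definitions are my reading of the figures, not something the lesson spells out.

```python
from collections import Counter, defaultdict

MESSAGES = [
    "i think we should probably ship it",
    "i think that sounds good",
    "i think we should probably wait",
    "we should check with the team",
    "that sounds good to me",
]

counts = defaultdict(Counter)
vocabulary = set()
for message in MESSAGES:
    words = message.split()
    vocabulary.update(words)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1

unique_words = len(vocabulary)                                   # distinct words seen
contexts = len(counts)                                           # words with a follower
transitions = sum(sum(row.values()) for row in counts.values())  # all observed pairs
max_options = max(len(row) for row in counts.values())           # widest row
widest = sorted(w for w, row in counts.items() if len(row) == max_options)

print(unique_words, contexts, transitions, max_options)  # 17 13 24 2
print(widest)  # ['probably', 'should', 'think']
```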
The knobs
When you were sampling, you could see probabilities, but you also used your instincts about which words "felt right."
If you had to write code to make those choices instead of using your gut, what encoded rule do you think would create the best result?
- Always pick the highest-probability word
- Pick semi-randomly, according to probabilities
- Pick according to probabilities, but also boost the most likely choices
- Ignore anything below a certain probability threshold
- Only consider the top N options
- Goblin mode: ignore the probabilities and pick something random
Show me the knobs
Awesome. Play with the knobs below to see how sampling parameters impact a language model's choices. Then check how your intuitions mapped.
Developers use parameters like temperature and tail trimming to improve sampling after the probabilities are generated.
After "think":
Temperature: 1.0 (slider from "focused" to "random")
Sampling faithfully from the distribution.
Tail trimming: remove unlikely words entirely before sampling. Options: None, Top-k, Top-p.
- Top-k (top words: 3): only the 3 most likely words survive.
- Top-p (probability mass: 0.9): keep the smallest set of words whose probabilities sum to at least 90%. Unlike top-k, this adapts to confidence: when the model is sure, fewer words survive; when it is uncertain, more do.
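The knobs above can be sketched in a few small functions. Treat this as one plausible implementation, not the code behind the widget or any particular API: the function names are invented, and the `after_think` distribution is a made-up example, not the lesson's real data.

```python
import random

def apply_temperature(probs, temperature):
    """Below 1 sharpens the distribution, above 1 flattens it, 1 leaves it alone."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Raising p to 1/T is the same as dividing log-probabilities by T.
    scaled = {word: p ** (1.0 / temperature) for word, p in probs.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}

def top_k(probs, k):
    """Keep only the k most likely words, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

def top_p(probs, mass):
    """Keep the smallest set of top words whose probabilities reach `mass`."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        running += p
        if running >= mass:
            break
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}

def sample(probs, rng):
    """Draw one word according to the (possibly trimmed) probabilities."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

# Hypothetical distribution, for illustration only.
after_think = {"we": 0.5, "that": 0.3, "it": 0.15, "so": 0.05}
confident = {"we": 0.95, "that": 0.05}

print(list(top_k(after_think, 3)))    # ['we', 'that', 'it']
print(list(top_p(after_think, 0.9)))  # ['we', 'that', 'it']
print(list(top_p(confident, 0.9)))    # ['we']
print(sample(apply_temperature(after_think, 0.5), random.Random(0)))
```

Top-p keeps three words for the uncertain distribution but only one for the confident one, which is the adaptive behavior described above.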
If you made a choice earlier, you intuitively selected one of these approaches.
How did my instinct map to these parameters?
The bridge
The sampling strategies you just used are more or less the same sampling constraints that developers pass to Claude. For LLMs, a few things differ.
Markov chain vs. LLM
1. Read context
- Markov chain: last word: "think"
- LLM: entire conversation so far
2. Compute distribution
- Markov chain: look up one row in a table
- LLM: forward pass through billions of parameters (attention, embeddings, feedforward layers, residual connections, layer norms...)
3. Sample next token (same process for both)
- Markov chain: we 32%, the 16%, i 11%
- LLM: we 32%, the 16%, i 11%
The Markov table lookup is a really simple and explainable operation. A forward pass through a neural network is quite a bit more complex. But in either case, the output is the same: a probability distribution of likely next words or tokens.
While sampling is the same, training is radically different. The exponential wall from before (vocabulary^N rows) doesn't apply. LLMs trade the explainability of simply tallying words for far more context and far greater capabilities.
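To put rough numbers on that wall: assuming it refers to a table keyed by the last N words, the count of possible rows is the vocabulary size raised to the power N. The sketch below uses the 17-word vocabulary from the five starter messages. `max_rows` is a hypothetical helper name, and this is an upper bound, since most word combinations never actually occur.

```python
def max_rows(vocabulary_size, context_length):
    """Upper bound on table rows when the context is the last N words."""
    return vocabulary_size ** context_length

for n in (1, 2, 3, 4):
    print(n, max_rows(17, n))
# 1 17
# 2 289
# 3 4913
# 4 83521
```

Even with a toy vocabulary, each extra word of context multiplies the possible rows by 17.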
100-year-old Tech
"Is this real tech?" Great question, reader.
Markov published this idea in 1906. About a century later, around 2010, n-gram models like this were powering next-word prediction on your phone (SwiftKey, then Apple's QuickType). Around 2015, neural networks (first RNNs, then transformers in 2017) began to replace the table lookup approach with a learned function, and the rest is... well, it's what we're working on now.
📚 Source & attribution
- Original Anthropic Academy lesson: https://anthropic.skilljar.com/ai-capabilities-and-limitations/456450
- © 2025 Anthropic. Educational fair-use only.