How an LLM actually thinks.
From the outside, LLMs feel oracular. Inside they’re five operations, repeated. Six sections walk the pipeline, then a seventh shows what the pipeline alone misses.
Tokenization
Before anything else, your text gets chopped up. Not by letters, not exactly by words: by sub-word pieces the model was statistically taught to recognize.
"A token is basically a word."
Tokens are sub-word chunks. The word “tokenization” is several tokens. Numbers often split into single digits or short digit groups, depending on the tokenizer (which is part of why models are bad at arithmetic). Adding a leading space turns a word into a different token entirely.
The leading ▁ marks “this token includes a leading space”, a SentencePiece convention. <BOS> and <EOS> are special tokens marking the start and end of the sequence.
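To make the chopping concrete, here is a minimal sketch of a sub-word tokenizer. It is an illustration under loud assumptions: the vocabulary is hand-written and hypothetical, and the matching rule is plain greedy longest-match, which is simpler than the BPE or unigram algorithms that real tokenizers such as SentencePiece learn from data.

```python
# Toy greedy longest-match sub-word tokenizer (NOT real BPE or unigram).
# VOCAB is hypothetical and hand-written; real vocabularies are learned from data.
VOCAB = {"▁token", "ization", "▁is", "▁fun", "▁", "1", "2", "3",
         "t", "o", "k", "e", "n", "i", "z", "a", "s", "f", "u"}

def tokenize(text: str) -> list[str]:
    # SentencePiece-style convention: spaces become "▁", and the text gets a leading one.
    s = "▁" + text.replace(" ", "▁")
    tokens = ["<BOS>"]
    i = 0
    while i < len(s):
        # Try the longest vocabulary piece that starts at position i.
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            tokens.append("<UNK>")  # no vocabulary piece covers this character
            i += 1
    tokens.append("<EOS>")
    return tokens

print(tokenize("tokenization is fun"))
# ['<BOS>', '▁token', 'ization', '▁is', '▁fun', '<EOS>']
print(tokenize("123"))
# ['<BOS>', '▁', '1', '2', '3', '<EOS>']
```

With this toy vocabulary, “tokenization” comes out as two pieces and “123” as one piece per digit. A different vocabulary would split the same text differently.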
Embeddings
Each token is replaced with a long list of numbers, called a vector. Tokens that mean similar things get vectors that land near each other in space.
"Embeddings are the model’s knowledge, basically a vector DB it does similarity search over."
They’re not a database and there is no search. Embeddings just translate tokens into numbers the rest of the network can operate on. Real knowledge (facts, associations, reasoning) lives in the weights of the layers above, not in the embedding table.
Just the lookup
The embedding for “Paris” is a point in space. Its nearest neighbors are words that appeared in similar contexts during training. None of them know that Paris is the capital of France.
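A small sketch of both ideas: the embedding step itself is a plain table lookup, and “nearest neighbors” is something we compute from the outside to inspect the table. The three-dimensional vectors and the word list are hypothetical values chosen for illustration; a real table has one learned row per token and hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embedding table, invented for illustration.
EMBEDDINGS = {
    "Paris":  [0.9, 0.8, 0.1],
    "London": [0.8, 0.9, 0.2],
    "Berlin": [0.85, 0.75, 0.15],
    "banana": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=2):
    query = EMBEDDINGS[word]  # the embedding step is only this lookup
    scored = [(cosine(query, vec), w) for w, vec in EMBEDDINGS.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(nearest("Paris"))  # ['Berlin', 'London']
```

Nothing in the table records that Paris is a capital. The neighbors are close only because their vectors point in similar directions.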
If embeddings held the model’s knowledge, they’d dominate this bar. They don’t. The bulk of the parameters, and the bulk of factual recall, sit in the feed-forward (FFN) blocks above.
The depth slider higher up mimics the logit lens technique from interpretability research: project each layer’s hidden state through the output head to see what the model would say if you stopped there. The claim that facts localize to mid-layer FFN circuits comes from Geva et al. 2021 (“Transformer Feed-Forward Layers Are Key-Value Memories”) and Meng et al. 2022 (the “ROME” paper).
Self-attention
Every token looks at every other token and decides how much each one matters for understanding itself. This single trick is most of what makes a transformer a transformer.
"Attention means the model focuses on the most important words, like a highlighter."
It’s not a highlighter, it’s a blend: a soft weighted average over every token at once. What makes a token “matter” isn’t hand-designed. It’s learned per head, and different heads care about different things. Watch the pronoun “it” below quietly flip its target when you change one word at the end.
Words don’t have fixed meanings. “Bank” near “river” is one thing, “bank” near “money” is another. “It” could refer to anything you’ve already mentioned. A reader figures this out by glancing at the surrounding words.
Self-attention is how a transformer does the same thing. For every word, it lets that word look at every other word and decides how much each one matters. The 2017 paper Attention Is All You Need was built around exactly this idea.
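The “soft weighted average” can be sketched in a few lines. This is a minimal single-head version of scaled dot-product attention with no masking. As a simplifying assumption, it uses the token vectors directly as queries, keys, and values; a trained model would first multiply them by learned projection matrices, one set per head. The example vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this token's query match every token's key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # positive, sums to 1: a blend, not a pick
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-token sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Every row of weights is strictly positive and sums to 1, so each output vector contains a little of every token.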
The classic illustration is the pronoun “it” in the sentence below. Watch where the model looks to figure out what “it” means when you change the ending.
Feed-forward
The second half of every transformer layer. Where attention mixes information across tokens, feed-forward refines each token in isolation. It is also where roughly two-thirds of the model's parameters live, and where most factual recall happens.
"After attention, the feed-forward part is just a small touch-up."
It is the larger half of every layer. About 67% of a typical transformer’s parameters live in feed-forward blocks (vs. ~30% in attention). And per interpretability work, this is where most factual recall actually happens. Attention picks which words matter; feed-forward decides what to do with them.
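The two-thirds figure falls out of simple arithmetic under a common layout. The sketch below assumes four square attention matrices (query, key, value, output) and a feed-forward block whose hidden width is four times the model width. It ignores biases, layer norms, and the embedding table, which is why the real-world split is only approximately this. The function name and the width 4096 are illustrative choices, and gated feed-forward variants count differently.

```python
def layer_param_counts(d_model: int, ffn_multiplier: int = 4) -> dict:
    """Weight counts for one transformer layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model               # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * (ffn_multiplier * d_model)  # up and down projections
    return {"attention": attention, "ffn": ffn}

counts = layer_param_counts(d_model=4096)  # hypothetical model width
ffn_share = counts["ffn"] / (counts["attention"] + counts["ffn"])
print(f"{ffn_share:.0%}")  # 67%
```

The share is independent of the model width: with these assumptions it is always 8 parts feed-forward to 4 parts attention.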
Attention (section 03) figured out which other words each token should listen to. But attention doesn’t really transform anything: it just gathers and weights. Each token still needs a way to think about what it just gathered, to look up relevant facts, to refine its representation.
That is what feed-forward does. After attention has done its mixing, each token’s vector runs through a small private neural network. No looking at other tokens. No cross-talk. The same network runs in parallel on every token, each one in isolation.
In practice this is where most of the model’s parameters and most of its world knowledge live.
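“Same network, each token in isolation” can be shown directly. The sketch below is a two-layer feed-forward block with a ReLU in the middle, applied to each token vector separately. The weights are hypothetical hand-picked numbers, biases and the residual connection are left out, and real models often use GELU or gated variants rather than ReLU.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

# Hypothetical weights: model width 2, hidden width 4.
W_UP = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
W_DOWN = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]

def ffn(token_vec):
    """Expand to the hidden width, apply the nonlinearity, project back down."""
    return matvec(W_DOWN, relu(matvec(W_UP, token_vec)))

def ffn_layer(sequence):
    # The same ffn runs on every token; no token sees any other token.
    return [ffn(vec) for vec in sequence]

seq_a = [[1.0, 2.0], [3.0, -1.0]]
seq_b = [[1.0, 2.0], [0.0, 5.0]]  # same first token, different second token
print(ffn_layer(seq_a)[0], ffn_layer(seq_b)[0])  # [2.0, 1.5] [2.0, 1.5]
```

Changing the second token leaves the first token’s output untouched, which is exactly what attention does not guarantee.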
Feed-forward runs once per token, in isolation. Same network, applied to each token’s vector independently. The word doesn’t change; the vector that represents it does. Click any FFN box to look inside.
The neuron labels are illustrative; real neurons don’t come pre-tagged. But the pattern is real: Geva et al. 2021 showed that FFN neurons act as key-value memories. Each one fires on a specific pattern in its input and writes a specific piece of information to its output. The fact “Paris is the capital of France” lives in which neurons fire for which inputs, distributed across many FFN blocks up the stack.
Stacked layers
Stack the two-step recipe from sections 03 and 04 (attention, then feed-forward) many times over. Each layer reads what the one below produced and writes a slightly richer version on top.
"More layers = more capability. Each layer learns a different skill."
Every layer runs the same recipe: attention, then feed-forward, then pass it up. The familiar story (surface form at the bottom, syntax in the middle, semantics near the top) comes from probing trained models after the fact. Nobody assigned those roles. They emerged.
Why bother stacking 12 (or 96) identical layers? Why not do attention once and call it a day?
Because each layer can only do so much. Every layer is the same two-step recipe: attention (section 03), then feed-forward (section 04), then pass the result up. The first layer can look at every token, but all it has to mix are raw embeddings. The second mixes those mixtures. In a 12-layer model, by around layer 6 the model has typically built up enough context to figure out what “it” refers to. By layer 11 it’s ready to predict the next token.
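The stacking itself is just a loop. The sketch below uses deliberately crude stand-ins: “attention” is a plain average over all tokens and “feed-forward” is a fixed per-token nonlinearity. Both functions are hypothetical, and unlike a real model every layer here shares the same behavior instead of having its own trained weights. Each step adds its result to the running representation, the residual connection that standard transformers use.

```python
def mix_tokens(sequence):
    # Stand-in for attention: every token receives the average of all tokens.
    n, width = len(sequence), len(sequence[0])
    mean = [sum(vec[j] for vec in sequence) / n for j in range(width)]
    return [mean[:] for _ in sequence]

def refine_token(vec):
    # Stand-in for feed-forward: a per-token nonlinearity.
    return [max(0.0, x) * 0.5 for x in vec]

def layer(sequence):
    # Step 1: mix across tokens, added onto the running representation.
    mixed = mix_tokens(sequence)
    sequence = [[a + b for a, b in zip(vec, m)] for vec, m in zip(sequence, mixed)]
    # Step 2: refine each token on its own, again added on top.
    return [[a + b for a, b in zip(vec, refine_token(vec))] for vec in sequence]

def run_stack(sequence, num_layers):
    states = [sequence]  # keep every depth so it can be inspected later
    for _ in range(num_layers):
        sequence = layer(sequence)
        states.append(sequence)
    return states

states = run_stack([[1.0, 0.0], [0.0, 1.0]], num_layers=12)
print(len(states))  # 13: the input plus one state per layer
```

Keeping the intermediate states is what makes it possible to ask what a token “looks like” at a given depth.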
None of the layers was assigned a different job; the specialization emerges from training. Click any layer on the left to watch what a specific word “looks like” to the model at that depth.
Input embeddings
Each token starts as a learned vector that knows the word identity and its position in the sequence. Nothing more.
Next-token prediction
At the top of the stack the model produces a probability for every possible next token. Pick one, append it, run the whole stack again. That loop is everything you have ever seen a chat model do.
"It writes the whole answer in its head first, then types it out."
It commits to one token at a time, then runs the model again on the extended sequence for the next. There is no internal draft buffer. Mostly, that is: section 07 shows where the one-step architecture and the actual behavior come apart.
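The loop can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the whole stack: it takes the tokens so far and returns one probability per vocabulary entry. The four-word vocabulary and the probabilities are invented for illustration.

```python
import random

VOCAB = ["the", "cat", "sat", "<EOS>"]  # hypothetical four-token vocabulary

def toy_model(tokens):
    """Stand-in for the full stack: one probability per vocabulary entry."""
    favored = {"<BOS>": "the", "the": "cat", "cat": "sat", "sat": "<EOS>"}[tokens[-1]]
    return [0.9 if w == favored else 0.1 / 3 for w in VOCAB]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)  # one forward pass per emitted token
        choice = rng.choices(VOCAB, weights=probs, k=1)[0]  # pick one
        tokens.append(choice)      # append it; the choice is now fixed
        if choice == "<EOS>":
            break
    return tokens

print(generate(["<BOS>"]))
```

Once a token is appended, the loop never goes back to revise it. The next pass simply sees a longer sequence.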
The model runs once for every token it emits, never knowing what comes after.
Low temperature ≈ confident and repetitive (near zero, it always picks the top token). High temperature ≈ creative and chaotic (flattens the distribution, lets surprises through).
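Mechanically, temperature divides the model’s raw scores (logits) before they are turned into probabilities. A minimal sketch, with three hypothetical logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]  # temperature must be > 0
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 10.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At temperature 0.1 nearly all the probability lands on the top token. At 10.0 the three candidates end up close to equally likely.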
When the model lies about how it thinks
The six sections above describe the architecture. They are correct. But what the model actually does when it solves a problem can be quite different from what it tells you it did.
"If the model walks me through how it got the answer, that’s what it actually did."
Anthropic’s 2025 circuit-tracing work compared what the model computes internally with the explanation it gives afterward. The two often don’t match. Pick a question below and see both at once.
What Claude says it does
“I add the ones column: 6 + 9 = 15, write 5, carry the 1. Then the tens column: 3 + 5 + 1 = 9. So the answer is 95.”
What the circuits actually show
Findings adapted from Anthropic’s Tracing the thoughts of a large language model (March 2025) and the companion paper On the Biology of a Large Language Model. Circuits illustrated here are simplified; the originals involve thousands of features and attribution-graph edges.