2026-05-17 · llm · ai · transformer · interactive · interpretability

Interactive explainer · ~7 min · click anything

How an LLM
actually thinks.

From the outside they feel oracular. Inside they’re five operations, repeated. Six sections walk the pipeline, then a seventh shows what the pipeline alone misses. Type, click, drag the slider.

Try a prompt →

“The future is already here”

26 characters · 1 string · 0 structure

STAGE 0 / 5 · Raw text. Just characters.

01 / 07

Tokenization

Before anything else, your text gets chopped up. Not by letters, not exactly by words: by sub-word pieces the model was statistically taught to recognize.

Myth

"A token is basically a word."

Reality

Tokens are sub-word chunks. The word “tokenization” is several tokens. Numbers often split into single digits or irregular chunks, depending on the tokenizer (part of why models struggle with arithmetic). Adding a leading space turns a word into a different token entirely.

<BOS>

▁The · 15220
▁future · 8214
▁is · 14171
▁already · 25805
▁here · 27461

146

<EOS>

Characters

Tokens

Chars / token · 4.50

Vocab size · ~50k

The leading ▁ marks “this token includes a leading space”, a SentencePiece convention. <BOS> and <EOS> are special tokens marking the start and end of the sequence.
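To make the chopping concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is a simplification: real tokenizers (BPE, unigram) pick pieces using learned merge rules or probabilities, and the tiny `VOCAB` below is invented for illustration, not taken from any real model.

```python
# Toy greedy longest-match sub-word tokenizer.
# VOCAB is a made-up illustration, not a real model's vocabulary.
VOCAB = {"▁The", "▁future", "▁is", "▁already", "▁here", "▁token", "ization", "▁"}


def tokenize(text: str, vocab: set) -> list:
    # SentencePiece-style marking: every space (plus the start) becomes "▁".
    s = "▁" + text.replace(" ", "▁")
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate piece first, then shrink.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(s[i])
            i += 1
    return pieces


print(tokenize("The future is already here", VOCAB))
print(tokenize("tokenization", VOCAB))
```

With this toy vocabulary, “tokenization” comes out as two pieces (`▁token` and `ization`), and anything the vocabulary has never seen, such as digits, falls apart into single characters.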

02 / 07

Embeddings

Each token is replaced with a long list of numbers, called a vector. Vectors that mean similar things land near each other in space.

Myth

"Embeddings are the model’s knowledge, basically a vector DB it does similarity search over."

Reality

They’re not a database and there is no search. Embeddings just translate tokens into numbers the rest of the network can operate on. Real knowledge (facts, associations, reasoning) lives in the weights of the layers above, not in the embedding table.

Query →

Stop at →

L0 · embedding only

Just the lookup

Nearest neighbors of “Paris” in embedding space:

Lyon · 0.84
Marseille · 0.81
city · 0.76
France · 0.74
capital · 0.71

The embedding for “Paris” is a point in space. Its nearest neighbors are words that appeared in similar contexts during training. None of them know that Paris is the capital of France.
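“Near each other” usually means cosine similarity between vectors. The sketch below ranks neighbors that way over a hypothetical three-dimensional table; the numbers in `EMBEDDINGS` are invented, and real embedding tables have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration.
EMBEDDINGS = {
    "Paris": [0.9, 0.8, 0.1],
    "Lyon": [0.8, 0.9, 0.2],
    "city": [0.6, 0.9, 0.4],
    "banana": [0.1, 0.0, 0.9],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, table, k=3):
    """Return the k entries most similar to `word`, best first."""
    query = table[word]
    scored = [(other, cosine(query, vec)) for other, vec in table.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


for neighbor, score in nearest("Paris", EMBEDDINGS):
    print(f"{neighbor}: {score:.2f}")
```

Note what the code does not contain: any fact about Paris. It only measures angles between points.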

Token identity

100%

Syntactic role

Entity link

20%

Factual recall

Output ready

Where the weights live

Share of total model parameters, for a typical decoder-only transformer. Bar widths are proportional.

Embedding table · ~3%
Attention layers · ~30%
FFN layers: where facts live · ~67%

If embeddings held the model’s knowledge, they’d dominate this bar. They don’t. The bulk of the parameters, and the bulk of factual recall, sit in the feed-forward (FFN) blocks above.
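The split can be checked with back-of-the-envelope arithmetic. This sketch assumes a plain decoder-only layout: four square attention matrices per layer, an FFN that expands to 4× and compresses back, and no biases, layer norms, or separate output head. The configuration (`d_model=4096`, 32 layers, 50,000-token vocabulary) is a hypothetical example, not a specific model.

```python
def parameter_shares(d_model, n_layers, vocab_size, ffn_mult=4):
    """Rough share of parameters per component (biases and norms ignored)."""
    embedding = vocab_size * d_model
    # Query, key, value, and output projections: four d_model x d_model matrices.
    attention = n_layers * 4 * d_model * d_model
    # One matrix to expand to ffn_mult * d_model, one to compress back.
    ffn = n_layers * 2 * d_model * (ffn_mult * d_model)
    total = embedding + attention + ffn
    return {
        "embedding": embedding / total,
        "attention": attention / total,
        "ffn": ffn / total,
    }


# Hypothetical configuration chosen for illustration.
shares = parameter_shares(d_model=4096, n_layers=32, vocab_size=50_000)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these assumptions the shares land near 3%, 32%, and 65%, close to the bars above. The FFN is exactly twice the size of attention whenever the expansion factor is 4.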

The depth slider higher up mimics the logit lens technique from interpretability research: project each layer’s hidden state through the output head to see what the model would say if you stopped there. The claim that facts localize to mid-layer FFN circuits comes from Geva et al. 2021 (“Transformer Feed-Forward Layers Are Key-Value Memories”) and Meng et al. 2022 (the “ROME” paper).
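A minimal sketch of the logit lens idea, under heavy simplification: the three-word vocabulary, the `UNEMBED` matrix, and the per-layer `HIDDEN_STATES` are all invented, and a real logit lens would also apply the model’s final normalization before projecting.

```python
import math

VOCAB = ["Lyon", "France", "capital"]
# Hypothetical output head: one row of weights per vocabulary token (d_model = 2).
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical hidden state of the last position after layers 0, 1, and 2.
HIDDEN_STATES = [[2.0, 0.1], [1.0, 1.2], [0.2, 3.0]]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states):
    """For each layer, report the token the model would pick if it stopped there."""
    readouts = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in UNEMBED]
        probs = softmax(logits)
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        readouts.append((VOCAB[best], probs[best]))
    return readouts


for layer, (token, prob) in enumerate(logit_lens(HIDDEN_STATES)):
    print(f"layer {layer}: {token} ({prob:.0%})")
```

The point of the technique is the trajectory: the early readout is a surface-level neighbor, and the answer only firms up deeper in the stack.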

03 / 07

Self-attention

Every token looks at every other token and decides how much each one matters for understanding itself. This single trick is most of what makes a transformer a transformer.

Myth

"Attention means the model focuses on the most important words, like a highlighter."

Reality

It’s not a spotlight, it’s a blend: a soft weighted average over every token at once. What makes a token “matter” isn’t hand-designed: it’s learned per head, and different heads care about different things. Watch the pronoun “it” below quietly flip its target when you change one word at the end.

Why this exists

Words don’t have fixed meanings. “Bank” near “river” is one thing, “bank” near “money” is another. “It” could refer to anything you’ve already mentioned. A reader figures this out by glancing at the surrounding words.

Self-attention is how a transformer does the same thing. For every word, it lets that word look at every other word and decides how much each one matters. The 2017 paper Attention Is All You Need was built around exactly this idea.

The classic illustration is the pronoun “it” in the sentence below. Watch where the model looks to figure out what “it” means when you change the ending.

1. Pick a sentence ending

2. Click any word to inspect it (starts with “it”)

How “it” looks at the other words:

animal · 56%
tired · 19%

The

didn't

cross

the

street

→

When the model processes “it”, the word it looks at the most is “animal”. Why? Because animals get tired. The model learned that association from training, and that’s enough to resolve the pronoun. No grammar rules involved.
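The “soft weighted average” can be written in a few lines. This is a single-head sketch of scaled dot-product attention from the 2017 paper, with the learned projection matrices left out; the query, key, and value vectors are invented so that “it” ends up leaning on “animal”.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of tokens."""
    d = len(query)
    # One relevance score per token: dot product, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Blend: every value vector contributes in proportion to its weight.
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended


# Hypothetical vectors, invented for illustration.
tokens = ["animal", "street", "tired"]
keys = [[2.0, 1.0], [0.5, -1.0], [1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [1.0, 0.5]

weights, blended = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.0%}")
```

Every token gets a nonzero weight. Nothing is ever fully ignored, which is why “blend” describes it better than “highlighter”.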

04 / 07

Feed-forward

The second half of every transformer layer. Where attention mixes information across tokens, feed-forward refines each token in isolation. It is also where roughly two-thirds of the model's parameters live, and where most factual recall happens.

Myth

"After attention, the feed-forward part is just a small touch-up."

Reality

It is the larger half of every layer. About 67% of a typical transformer’s parameters live in feed-forward blocks (vs. ~30% in attention). And per interpretability work, this is where most factual recall actually happens. Attention picks which words matter; feed-forward decides what to do with them.

Why is there a second half?

Attention (section 03) figured out which other words each token should listen to. But attention doesn’t really transform anything: it just gathers and weights. Each token still needs a way to think about what it just gathered, to look up relevant facts, to refine its representation.

That is what feed-forward does. After attention has done its mixing, each token’s vector runs, on its own, through a two-layer neural network. No looking at other tokens. No cross-talk. The same network runs in parallel on every token, each one in isolation.

In practice this is where most of the model’s parameters and most of its world knowledge live.

Feed-forward runs once per token, in isolation. Same network, applied to each token’s vector independently. The word doesn’t change; the vector that represents it does. Click any FFN box to look inside.

Paris

↓

the

↓

capital

↓

Inside the FFN for “Paris”

input vector (post-attention)

↓ expand to ~4× size, apply non-linearity

neurons that fire on this input

#234 · geographic entity · 82%
#1107 · European city · 75%
#2541 · is a capital city · 58%
#8902 · associated with France · 55%
#221 · proper noun · 48%

(thousands of other neurons stay silent for this input)

↓ compress back to original size

output vector (concepts have been added)

The neuron labels are illustrative; real neurons don’t come pre-tagged. But the pattern is real. Geva et al. 2021 showed that FFN neurons act as key-value memories: each one fires on a specific pattern in its input and writes a specific piece of information to its output. The fact “Paris is the capital of France” lives in which neurons fire for which inputs, distributed across many FFN blocks up the stack.
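The expand, fire, compress sequence is small enough to sketch directly. The weights in `W_IN` and `W_OUT` are invented, ReLU stands in for whatever non-linearity a given model uses, and biases are omitted.

```python
def relu(x):
    return max(0.0, x)


def feed_forward(x, w_in, w_out):
    """Expand to the hidden size, apply the non-linearity, compress back."""
    # Each row of w_in is one neuron's input pattern (its "key").
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    # Each output dimension sums what the firing neurons write (their "values").
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, output


# Hypothetical weights: d_model = 2, hidden size = 8 (a 4x expansion).
W_IN = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1], [-1, -1], [1, -1], [-1, 1]]
W_OUT = [[0.5] * 8, [0.25] * 8]

hidden, output = feed_forward([1.0, 2.0], W_IN, W_OUT)
firing = [i for i, h in enumerate(hidden) if h > 0]
print("neurons that fire:", firing)
print("output vector:", output)

# The same network applied to each token independently, with no cross-talk.
per_token = [feed_forward(v, W_IN, W_OUT)[1] for v in ([1.0, 2.0], [0.0, 1.0])]
```

Half of the eight toy neurons stay silent for this input. In a real model the silent fraction is far larger.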

05 / 07

Stacked layers

Stack the two-step recipe from sections 03 and 04 (attention, then feed-forward) many times over. Each layer reads what the one below produced and writes a slightly richer version on top.

Myth

"More layers = more capability. Each layer learns a different skill."

Reality

Every layer runs the same recipe: attention, then feed-forward, then pass it up. The familiar story (surface form at the bottom, syntax in the middle, semantics near the top) comes from probing trained models after the fact. Nobody assigned those roles. They emerged.

Why stack so many layers?

Why bother stacking 12 (or 96) identically structured layers? Why not do attention once and call it a day?

Because each layer can only do so much. Every layer is the same two-step recipe, with its own weights: attention (section 03), then feed-forward (section 04), then pass the result up. The first layer can only mix raw token embeddings, which carry no context yet. The second mixes those mixtures. In the 12-layer illustration below, by layer 6 the model has built up enough context to figure out what “it” refers to. By layer 11 it’s ready to predict the next token.

None of the layers was assigned a different job; the specialization emerges from training. Click any layer on the left to watch what a specific word “looks like” to the model at that depth.
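Structurally, the stack is just a loop. The sketch below shows that shape with stand-in functions: `toy_attention` and `toy_ffn` are invented placeholders (an averaging step and a ReLU), and real layers also carry residual connections and normalization that are omitted here.

```python
def toy_attention(vectors):
    """Placeholder mixing step: pull every vector halfway toward the mean."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[(v[i] + mean[i]) / 2 for i in range(dim)] for v in vectors]


def toy_ffn(vector):
    """Placeholder per-token step: a bare ReLU."""
    return [max(0.0, x) for x in vector]


def run_stack(vectors, layers):
    """Apply the same two-step recipe once per layer, bottom to top."""
    for attention, feed_forward in layers:
        vectors = attention(vectors)                   # mix across tokens
        vectors = [feed_forward(v) for v in vectors]   # refine each token alone
    return vectors


layers = [(toy_attention, toy_ffn)] * 2
print(run_stack([[1.0, 0.0], [0.0, 1.0]], layers))
```

In a trained model each layer has its own weights, so the loop body is the same code running with different numbers at every depth.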

↑ output (logits over vocabulary)

L00 · Input embeddings
L01 · Layer 1: surface form
L02 · Layer 2: local syntax
L03 · Layer 3: phrases
L04 · Layer 4: dependencies
L05 · Layer 5: early semantics
L06 · Layer 6: coreference
L07 · Layer 7: clause structure
L08 · Layer 8: world knowledge
L09 · Layer 9: discourse
L10 · Layer 10: task framing
L11 · Layer 11: logits

↓ input (token + position embeddings)

00 / 11

INPUT

Input embeddings

Each token starts as a learned vector that knows the word identity and its position in the sequence. Nothing more.

Track: “The animal didn't cross the street because it was too tired.”

What “it” looks like to the model at this layer:

100%

this · 78%
that · 74%
the · 45%
you · 42%

→

Just looks like other pronouns. No surrounding context has been mixed in yet.

06 / 07

Next-token prediction

At the top of the stack the model produces a probability for every possible next token. Pick one, append it, run the whole stack again. That loop is everything you have ever seen a chat model do.

Myth

"It writes the whole answer in its head first, then types it out."

Reality

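It commits to one token at a time, then runs the entire stack again on the extended sequence for the next (production systems cache earlier tokens’ intermediate results, but the loop is the same). No internal draft buffer. Mostly: section 07 shows where the one-step architecture and the actual behavior come apart.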

Context · 1 token at a time

The animal didn't cross the street because it was too

Predicting token 1. The model runs once for every token it emits, never knowing what comes after.
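The loop itself fits in a dozen lines. In this sketch `TOY_MODEL` is an invented lookup table standing in for a full forward pass; unlike a real model, it only looks at the last token rather than the whole context, and it always picks the most likely continuation.

```python
# Hypothetical stand-in for the model: last token -> next-token probabilities.
TOY_MODEL = {
    "too": {"tired": 0.6, "scared": 0.4},
    "tired": {".": 0.9, "and": 0.1},
    ".": {"<EOS>": 1.0},
}


def next_token(context):
    """One 'model run': return the most likely next token (greedy choice)."""
    probs = TOY_MODEL[context[-1]]
    return max(probs, key=probs.get)


def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # one run per emitted token
        if token == "<EOS>":
            break
        tokens.append(token)        # commit, then go around again
    return tokens


print(generate(["it", "was", "too"]))
```

Each pass through the loop sees only what has already been committed. There is no variable holding the rest of the sentence.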

Temperature

0.80

Low temperature ≈ confident and repetitive (almost always the top token). High temperature ≈ creative and chaotic (flattens the distribution, lets surprises through).
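Temperature is a single division applied to the raw scores before they become probabilities. A minimal sketch, with invented logits for three candidate tokens:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.2, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Dividing by a small number stretches the gaps between scores, so the top token takes nearly all the mass. Dividing by a large number shrinks the gaps, so the runners-up get a real chance.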

Next-token distribution

tired · 36.3%
scared · 22.0%
slow · 13.4%
small · 8.1%
weak · 6.3%
afraid · 4.9%
hot · 3.4%
fast · 2.3%

+ 2 more tokens (and ~50,000 with essentially zero mass)

07 / 07

When the model lies about how it thinks

The six sections above describe the architecture. They are correct. But what the model actually does when it solves a problem can be quite different from what it tells you it did.

Myth

"If the model walks me through how it got the answer, that’s what it actually did."

Reality

Anthropic’s 2025 circuit-tracing work compared what the model computes internally with the explanation it gives. The two often don’t match. Pick a question below and see both at once.

Pick a question →

User: “What is 36 + 59?”

Says

What Claude says it does

“I add the ones column: 6 + 9 = 15, write 5, carry the 1. Then the tens column: 3 + 5 + 1 = 9. So the answer is 95.”

Shows

What the circuits actually show

PATH A · Magnitude estimator · ~88 – 97

PATH B · Last-digit lookup (6 + 9 → 5) · ends in 5

combine → 95

Takeaway

Two parallel circuits run at once. Neither resembles the schoolbook algorithm Claude reports. The carry-the-one explanation is post-hoc rationalization, not a trace of what the model did.
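Why two rough signals are enough can be shown with plain arithmetic. This is a caricature of the combination step, not a model of the actual circuits: the function name `combine` and the hard range check are invented for illustration.

```python
def combine(magnitude_range, last_digit):
    """Keep the numbers in the estimated range that end in the looked-up digit."""
    low, high = magnitude_range
    return [n for n in range(low, high + 1) if n % 10 == last_digit]


# Path A says "somewhere around 88 to 97"; path B says "ends in 5".
print(combine((88, 97), (6 + 9) % 10))
```

A ten-wide window contains exactly one number with any given last digit, so the two paths pin down 95 without anything resembling a carry.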

Findings adapted from Anthropic’s Tracing the thoughts of a large language model (March 2025) and the companion paper On the Biology of a Large Language Model. Circuits illustrated here are simplified; the originals involve thousands of features and attribution-graph edges.
